1. 什么是clustering
聚类是一种非常有用的无监督学习技术,它的主要目的是发现数据的内在结构和模式。在许多实际应用中,我们可能没有明确的目标变量或预测目标,但我们仍希望了解数据的组织方式,或者找出数据中的特定模式或组。这就是聚类的价值所在。
尽管聚类是无监督的(即我们不需要任何标签或目标变量来引导学习过程),但这并不意味着聚类是无目的的。事实上,聚类可以有很多具体的目标和应用,包括但不限于以下几个方面:
-
探索性数据分析:聚类可以帮助我们理解数据的结构和模式,例如,哪些数据点是相似的,哪些数据点是不同的,数据中是否存在特定的群体或者异常值等。
-
预处理和特征工程:聚类的结果可以作为其他机器学习任务的输入。例如,我们可以使用聚类标签作为新的特征,或者使用聚类来减少数据的维度。
-
市场细分:在商业和市场研究中,聚类常被用于识别不同的客户群体,以便进行更精细的目标营销。
-
异常检测:聚类也可以用于异常检测,即识别那些与大多数数据点显著不同的数据点。
聚类是一种强大且多用途的工具,可以帮助我们更好地理解和利用数据。
下面看一下,如何实现通过聚类找到异常的数据。
2.例子代码的解析
学习代码来源:
openai-cookbook/examples/Clustering.ipynb
使用的数据时fine_food_reviews_with_embeddings_1k.csv,这是一些对商家评论和打分的数据。有如下数据项。
from sklearn.cluster import KMeans
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42) kmeans.fit(matrix)
labels = kmeans.labels_ df["Cluster"] = labels df.groupby("Cluster").Score.mean().sort_values()
上述代码使用了一种叫做K-means的无监督机器学习算法来对数据进行聚类。K-means算法的主要思想是将n个数据点分配到k个簇中,使得每个数据点到其所属簇的质心(即簇内所有点的均值)的平方距离之和最小。这个过程通常通过迭代的方式进行,直到簇的质心不再变化或者变化非常小。
具体来说,K-means算法首先选择k个初始质心。这些初始质心可以是随机选择的,也可以使用一些特定的方法如"k-means++"来选择。然后,算法通过以下两步交替进行:
-
分配步骤:对于每个数据点,计算其到所有质心的距离,并将其分配给最近的质心所对应的簇。
-
更新步骤:对于每个簇,计算簇内所有数据点的均值,然后将这个均值设置为新的质心。
这两步一直迭代,直到质心不再变化或者达到预设的最大迭代次数。
在上述代码中,通过调用kmeans.fit(matrix)
,实现了上述迭代过程。执行完这个函数后,KMeans对象的labels_
属性就包含了每个数据点的聚类标签。将这些标签添加到原始数据框,便于后续的分析和可视化。
最后一行代码是对每个聚类(簇)的"Score"列取均值并排序。这可以帮助我们理解每个聚类的特性,例如在本例中,可以理解不同聚类的平均评分是否有显著差异。
2. 聚类结果分析
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200)
vis_dims2 = tsne.fit_transform(matrix)
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
for category, color in enumerate(["purple", "green", "red", "blue"]):
xs = np.array(x)[df.Cluster == category]
ys = np.array(y)[df.Cluster == category]
plt.scatter(xs, ys, color=color, alpha=0.3)
avg_x = xs.mean()
avg_y = ys.mean()
plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")
根据t-SNE的结果,数据点被映射到了一个二维空间中,并以此进行可视化。在这个运行中,绿色的簇(即#1簇)看起来与其他簇有很大的不同。接下来,我们可以查看每个簇中的一些样本,以更好地理解这些簇的特性。
聚类中的文本样本 & 命名聚类
我们将展示每个聚类中的随机样本。我们将使用 text-davinci-003 基于每个聚类中的5个评论的随机样本来命名聚类。
import openai
# Reading a review which belong to each group.
rev_per_cluster = 5
for i in range(n_clusters):
print(f"Cluster {i} Theme:", end=" ")
reviews = "\n".join(
df[df.Cluster == i]
.combined.str.replace("Title: ", "")
.str.replace("\n\nContent: ", ": ")
.sample(rev_per_cluster, random_state=42)
.values
)
response = openai.Completion.create(
engine="text-davinci-003",
prompt=f'What do the following customer reviews have in common?\n\nCustomer reviews:\n"""\n{reviews}\n"""\n\nTheme:',
temperature=0,
max_tokens=64,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
)
print(response["choices"][0]["text"].replace("\n", ""))
sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)
for j in range(rev_per_cluster):
print(sample_cluster_rows.Score.values[j], end=", ")
print(sample_cluster_rows.Summary.values[j], end=": ")
print(sample_cluster_rows.Text.str[:70].values[j])
print("-" * 100)
Cluster 0 Theme: All of the reviews are positive and the customers are satisfied with the product they purchased. 5, Loved these gluten free healthy bars, saved $$ ordering on Amazon: These Kind Bars are so good and healthy & gluten free. My daughter ca 1, Should advertise coconut as an ingredient more prominently: First, these should be called Mac - Coconut bars, as Coconut is the #2 5, very good!!: just like the runts<br />great flavor, def worth getting<br />I even o 5, Excellent product: After scouring every store in town for orange peels and not finding an 5, delicious: Gummi Frogs have been my favourite candy that I have ever tried. of co ---------------------------------------------------------------------------------------------------- Cluster 1 Theme: All of the reviews are about pet food. 2, Messy and apparently undelicious: My cat is not a huge fan. Sure, she'll lap up the gravy, but leaves th 4, The cats like it: My 7 cats like this food but it is a little yucky for the human. Piece 5, cant get enough of it!!!: Our lil shih tzu puppy cannot get enough of it. Everytime she sees the 1, Food Caused Illness: I switched my cats over from the Blue Buffalo Wildnerness Food to this 5, My furbabies LOVE these!: Shake the container and they come running. Even my boy cat, who isn't ---------------------------------------------------------------------------------------------------- Cluster 2 Theme: All of the reviews are positive and express satisfaction with the product. 5, Fog Chaser Coffee: This coffee has a full body and a rich taste. The price is far below t 5, Excellent taste: This is to me a great coffee, once you try it you will enjoy it, this 4, Good, but not Wolfgang Puck good: Honestly, I have to admit that I expected a little better. That's not 5, Just My Kind of Coffee: Coffee Masters Hazelnut coffee used to be carried in a local coffee/pa 5, Rodeo Drive is Crazy Good Coffee!: Rodeo Drive is my absolute favorite and I'm ready to order more! That ---------------------------------------------------------------------------------------------------- Cluster 3 Theme: All of the reviews are about food or drink products. 5, Wonderful alternative to soda pop: This is a wonderful alternative to soda pop. It's carbonated for thos 5, So convenient, for so little!: I needed two vanilla beans for the Love Goddess cake that my husbands 2, bot very cheesy: Got this about a month ago.first of all it smells horrible...it tastes 5, Delicious!: I am not a huge beer lover. I do enjoy an occasional Blue Moon (all o 3, Just ok: I bought this brand because it was all they had at Ranch 99 near us. I ----------------------------------------------------------------------------------------------------
可以看到通过openai提供的模型对客户评价进行类总结和说明。