在Python中使用OpenAI进行K-means聚类
我们使用一个简单的k-means算法来演示如何进行聚类。聚类可以帮助发现数据中有价值的隐藏分组。数据集是在Get_embeddings_from_dataset Notebook中创建的。
# 导入
import numpy as np
import pandas as pd
from ast import literal_eval
# 加载数据
datafile_path = "./data/fine_food_reviews_with_embeddings_1k.csv"
df = pd.read_csv(datafile_path)
df["embedding"] = df.embedding.apply(literal_eval).apply(np.array) # 将字符串转换为NumPy数组
matrix = np.vstack(df.embedding.values)
matrix.shape
(1000, 1536)
1. 使用K-means算法找到聚类中心
我们展示了K-means的最简单用法。您可以选择最适合您用例的聚类数。
from sklearn.cluster import KMeans
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels
df.groupby("Cluster").Score.mean().sort_values()
/opt/homebrew/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
Cluster
0 4.105691
1 4.191176
2 4.215613
3 4.306590
Name: Score, dtype: float64
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200)
vis_dims2 = tsne.fit_transform(matrix)
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
for category, color in enumerate(["purple", "green", "red", "blue"]):
xs = np.array(x)[df.Cluster == category]
ys = np.array(y)[df.Cluster == category]
plt.scatter(xs, ys, color=color, alpha=0.3)
avg_x = xs.mean()
avg_y = ys.mean()
plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")
Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')
在二维投影中对聚类进行可视化。在这个运行中,绿色的聚类(#1)看起来与其他聚类非常不同。让我们看一下每个聚类的一些样本。