跳到主要内容

在3D中可视化嵌入向量

nbviewer

该示例使用PCA将嵌入的维度从1536降低到3。然后我们可以在一个三维图中可视化数据点。小数据集dbpedia_samples.jsonl是通过从DBpedia验证数据集中随机抽取200个样本而筛选出的。

1. 加载数据集和查询嵌入向量

import pandas as pd
samples = pd.read_json("data/dbpedia_samples.jsonl", lines=True)
categories = sorted(samples["category"].unique())
print("Categories of DBpedia samples:", samples["category"].value_counts())
samples.head()

Categories of DBpedia samples: Artist                    21
Film 19
Plant 19
OfficeHolder 18
Company 17
NaturalPlace 16
Athlete 16
Village 12
WrittenWork 11
Building 11
Album 11
Animal 11
EducationalInstitution 10
MeanOfTransportation 8
Name: category, dtype: int64
text category
0 Morada Limited is a textile company based in ... Company
1 The Armenian Mirror-Spectator is a newspaper ... WrittenWork
2 Mt. Kinka (金華山 Kinka-zan) also known as Kinka... NaturalPlace
3 Planning the Play of a Bridge Hand is a book ... WrittenWork
4 Wang Yuanping (born 8 December 1976) is a ret... Athlete
from utils.embeddings_utils import get_embeddings
# 注意:以下代码将向 /embeddings 发送一个批次大小为 200 的查询。
matrix = get_embeddings(samples["text"].to_list(), model="text-embedding-3-small")

2. 减少嵌入维度

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
vis_dims = pca.fit_transform(matrix)
samples["embed_vis"] = vis_dims.tolist()

3. 绘制降维后的嵌入向量

%matplotlib widget
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(projection='3d')
cmap = plt.get_cmap("tab20")

# 分别绘制每个样本类别,以便我们可以设置标签名称。
for i, cat in enumerate(categories):
sub_matrix = np.array(samples[samples["category"] == cat]["embed_vis"].to_list())
x=sub_matrix[:, 0]
y=sub_matrix[:, 1]
z=sub_matrix[:, 2]
colors = [cmap(i/len(categories))] * len(sub_matrix)
ax.scatter(x, y, zs=z, zdir='z', c=colors, label=cat)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.legend(bbox_to_anchor=(1.1, 1))