
This notebook demonstrates one way to customize OpenAI embeddings to a particular task.

The input is training data in the form [text_1, text_2, label], where label is +1 if the pairs are similar and -1 if the pairs are dissimilar.

The output is a matrix that you can use to multiply your embeddings. The product of this multiplication is a "custom embedding" that will better emphasize aspects of the text relevant to your use case. In binary classification use cases, we've seen error rates drop by as much as 50%.
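Concretely, the customization is a single matrix multiplication per embedding. A minimal sketch (the arrays here are random stand-ins for a real embedding and a trained matrix, not outputs of this notebook):

import numpy as np

embedding = np.random.randn(1536)  # stand-in for an OpenAI embedding
matrix = np.random.randn(1536, 2048)  # stand-in for a trained customization matrix
custom_embedding = embedding @ matrix  # the "custom embedding" used downstream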

In the following example, I use 1,000 sentence pairs picked from the SNLI corpus. Each pair of sentences is logically entailed (i.e., one implies the other). These pairs are our positives (label = 1). We generate synthetic negatives by combining sentences from different pairs, which are presumed to not be logically entailed (label = -1).

For a clustering use case, you can generate positives from pairs of texts in the same cluster and negatives from pairs of texts in different clusters, as in the sketch below.
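For illustration, a sketch of cluster-based pair generation (cluster_df, with columns "text" and "cluster", is a hypothetical input; the helper name is mine):

import itertools
import pandas as pd

def pairs_from_clusters(cluster_df: pd.DataFrame) -> pd.DataFrame:
    """Label a pair +1 if the two texts share a cluster, -1 otherwise."""
    rows = []
    for (_, a), (_, b) in itertools.combinations(cluster_df.iterrows(), 2):
        label = 1 if a["cluster"] == b["cluster"] else -1
        rows.append({"text_1": a["text"], "text_2": b["text"], "label": label})
    return pd.DataFrame(rows)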

With other datasets, we have seen fairly good improvement with as little as ~100 training examples. Of course, performance will be better with more examples.

0. Imports

# imports
from typing import List, Tuple  # for type hints

import numpy as np  # for manipulating arrays
import pandas as pd  # for manipulating data in dataframes
import pickle  # for saving the embeddings cache
import plotly.express as px  # for plots
import random  # for generating run IDs
from sklearn.model_selection import train_test_split  # for splitting train & test data
import torch  # for matrix optimization

from utils.embeddings_utils import get_embedding, cosine_similarity  # for embeddings


1. Inputs

Most inputs are here. The key things to change are where to load your dataset from, where to save a cache of embeddings to, and which embedding engine you want to use.

Depending on how your data is formatted, you may need to rewrite the process_input_data function.

# input parameters
embedding_cache_path = "data/snli_embedding_cache.pkl"  # embeddings will be saved/loaded here
default_embedding_engine = "text-embedding-3-small"
num_pairs_to_embed = 1000  # 1000 is arbitrary
local_dataset_path = "data/snli_1.0_train_2k.csv"  # download from: https://nlp.stanford.edu/projects/snli/


def process_input_data(df: pd.DataFrame) -> pd.DataFrame:
    # you can customize this to preprocess your own dataset
    # output should be a dataframe with 3 columns: text_1, text_2, label (1 for similar, -1 for dissimilar)
    df["label"] = df["gold_label"]
    df = df[df["label"].isin(["entailment"])]
    df["label"] = df["label"].apply(lambda x: {"entailment": 1, "contradiction": -1}[x])
    df = df.rename(columns={"sentence1": "text_1", "sentence2": "text_2"})
    df = df[["text_1", "text_2", "label"]]
    df = df.head(num_pairs_to_embed)
    return df
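For instance, if your dataset already contains both classes, a variant might look like the following (the column names "query", "candidate", and "is_match", and the helper name, are hypothetical):

def process_labeled_data(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical preprocessor for a CSV that already has 0/1 match labels
    df = df.rename(columns={"query": "text_1", "candidate": "text_2"})
    df["label"] = df["is_match"].map({1: 1, 0: -1})  # remap 0/1 to -1/1
    return df[["text_1", "text_2", "label"]].head(num_pairs_to_embed)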


2. Load and process input data

# load data
df = pd.read_csv(local_dataset_path)

# process input data
df = process_input_data(df)  # this demonstrates training data containing only positives

# view data
df.head()


/var/folders/r4/x3kdvs816995fnnph2gdpwp40000gn/T/ipykernel_17509/1977422881.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["label"] = df["label"].apply(lambda x: {"entailment": 1, "contradiction": -1}[x])
text_1 text_2 label
2 A person on a horse jumps over a broken down a... A person is outdoors, on a horse. 1
4 Children smiling and waving at camera There are children present 1
7 A boy is jumping on skateboard in the middle o... The boy does a skateboarding trick. 1
14 Two blond women are hugging one another. There are women showing affection. 1
17 A few people in a restaurant setting, one of t... The diners are at a restaurant. 1

3. Split data into training and test sets

Note that it's important to split the data into training and test sets before generating synthetic negatives or positives. You don't want any text strings from the training data to show up in the test data. If there's contamination, the test metrics will look better than they'll actually be in production.

# split data into train and test sets
test_fraction = 0.5  # 0.5 is fairly arbitrary
random_seed = 123  # the random seed is arbitrary, but is helpful for reproducibility
train_df, test_df = train_test_split(
    df, test_size=test_fraction, stratify=df["label"], random_state=random_seed
)
train_df.loc[:, "dataset"] = "train"
test_df.loc[:, "dataset"] = "test"
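A quick way to measure contamination after splitting (a sketch; the local names are mine, and with fully deduplicated data this should print 0):

train_texts = set(train_df["text_1"]) | set(train_df["text_2"])
test_texts = set(test_df["text_1"]) | set(test_df["text_2"])
print(f"{len(train_texts & test_texts)} text strings appear in both train and test")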


4. Generate synthetic negatives

This is another piece of the code that you will need to modify to match your use case.

If you have data with both positives and negatives, you can skip this section.

If you have data with only positives, you can mostly keep it as is, and it will generate negatives only.

If you have multiclass data, you will want to generate both positives and negatives. Positives can be pairs of text that share labels, and negatives can be pairs of text that do not share labels.

The final output should be a dataframe with text pairs, where each pair is labeled -1 or 1.

# generate negatives
def dataframe_of_negatives(dataframe_of_positives: pd.DataFrame) -> pd.DataFrame:
    """Return dataframe of negative pairs made by combining elements of positive pairs."""
    texts = set(dataframe_of_positives["text_1"].values) | set(
        dataframe_of_positives["text_2"].values
    )
    all_pairs = {(t1, t2) for t1 in texts for t2 in texts if t1 < t2}
    positive_pairs = set(
        tuple(text_pair)
        for text_pair in dataframe_of_positives[["text_1", "text_2"]].values
    )
    negative_pairs = all_pairs - positive_pairs
    df_of_negatives = pd.DataFrame(list(negative_pairs), columns=["text_1", "text_2"])
    df_of_negatives["label"] = -1
    return df_of_negatives


negatives_per_positive = (
    1  # it will work at higher values too, but more data will be slower
)
# generate negatives for the training dataset
train_df_negatives = dataframe_of_negatives(train_df)
train_df_negatives["dataset"] = "train"
# generate negatives for the test dataset
test_df_negatives = dataframe_of_negatives(test_df)
test_df_negatives["dataset"] = "test"
# sample negatives and combine with positives
train_df = pd.concat(
    [
        train_df,
        train_df_negatives.sample(
            n=len(train_df) * negatives_per_positive, random_state=random_seed
        ),
    ]
)
test_df = pd.concat(
    [
        test_df,
        test_df_negatives.sample(
            n=len(test_df) * negatives_per_positive, random_state=random_seed
        ),
    ]
)

df = pd.concat([train_df, test_df])
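Note that dataframe_of_negatives enumerates every candidate pair, which is quadratic in the number of unique texts. For larger datasets you might sample random pairs instead. A sketch (sampled_negatives is a hypothetical helper; it re-draws any sampled pair that is actually a positive):

def sampled_negatives(
    df_of_positives: pd.DataFrame, n_samples: int, seed: int = 0
) -> pd.DataFrame:
    rng = random.Random(seed)
    texts = list(set(df_of_positives["text_1"]) | set(df_of_positives["text_2"]))
    positive_pairs = set(map(tuple, df_of_positives[["text_1", "text_2"]].values))
    rows = []
    while len(rows) < n_samples:
        t1, t2 = rng.sample(texts, 2)
        if (t1, t2) not in positive_pairs and (t2, t1) not in positive_pairs:
            rows.append({"text_1": t1, "text_2": t2, "label": -1})
    return pd.DataFrame(rows)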


5. Calculate embeddings and cosine similarities

Here, I create a cache to save the embeddings. This is handy so that you don't have to pay again if you want to run the code again.

# establish a cache of embeddings to avoid recomputing
# the cache is a dict of tuples (text, engine) -> embedding
try:
    with open(embedding_cache_path, "rb") as f:
        embedding_cache = pickle.load(f)
except FileNotFoundError:
    precomputed_embedding_cache_path = "https://cdn.openai.com/API/examples/data/snli_embedding_cache.pkl"
    embedding_cache = pd.read_pickle(precomputed_embedding_cache_path)


# this function will get embeddings from the cache and save them there afterward
def get_embedding_with_cache(
    text: str,
    engine: str = default_embedding_engine,
    embedding_cache: dict = embedding_cache,
    embedding_cache_path: str = embedding_cache_path,
) -> list:
    if (text, engine) not in embedding_cache.keys():
        # if not in cache, call the API to get the embedding
        embedding_cache[(text, engine)] = get_embedding(text, engine)
        # save the embeddings cache to disk after each update
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(text, engine)]


# create columns of embeddings
for column in ["text_1", "text_2"]:
    df[f"{column}_embedding"] = df[column].apply(get_embedding_with_cache)

# create a column of cosine similarity between embeddings
df["cosine_similarity"] = df.apply(
    lambda row: cosine_similarity(row["text_1_embedding"], row["text_2_embedding"]),
    axis=1,
)


6. Plot distribution of cosine similarity

Here we measure the similarity of text using cosine similarity. In our experience, most distance functions (L1, L2, cosine similarity) all work about the same. Note that our embeddings are already normalized to length 1, so cosine similarity is equivalent to dot product.
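You can verify the dot-product equivalence directly (a sketch using the first pair in the dataframe):

e1 = np.array(df["text_1_embedding"].values[0])
e2 = np.array(df["text_2_embedding"].values[0])
print(np.linalg.norm(e1))  # ~1.0, since embeddings are unit length
print(e1 @ e2)  # dot product
print(cosine_similarity(e1, e2))  # should match the dot product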

The graphs show how much the distributions of cosine similarity for similar and dissimilar pairs overlap. If there is a high amount of overlap, that means there are some dissimilar pairs with greater cosine similarity than some similar pairs.

The accuracy I compute is the accuracy of a simple rule that predicts "similar (1)" if the cosine similarity is above some threshold X and otherwise predicts "dissimilar (0)".

# calculate accuracy (and its standard error) of predicting label=1 if similarity>x
# x is optimized by sweeping from -1 to 1 in steps of 0.001
def accuracy_and_se(cosine_similarity: float, labeled_similarity: int) -> Tuple[float]:
    accuracies = []
    for threshold_thousandths in range(-1000, 1000, 1):
        threshold = threshold_thousandths / 1000
        total = 0
        correct = 0
        for cs, ls in zip(cosine_similarity, labeled_similarity):
            total += 1
            if cs > threshold:
                prediction = 1
            else:
                prediction = -1
            if prediction == ls:
                correct += 1
        accuracy = correct / total
        accuracies.append(accuracy)
    a = max(accuracies)
    n = len(cosine_similarity)
    standard_error = (a * (1 - a) / n) ** 0.5  # standard error of the binomial distribution
    return a, standard_error
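The nested loop above scans 2,000 thresholds in pure Python; if that becomes slow, a vectorized numpy equivalent is a drop-in replacement (a sketch; same sweep, same binomial standard error, and the function name is mine):

def accuracy_and_se_vectorized(cosine_similarity, labeled_similarity) -> Tuple[float]:
    cs = np.asarray(cosine_similarity, dtype=float)
    ls = np.asarray(labeled_similarity, dtype=float)
    thresholds = np.arange(-1000, 1000) / 1000  # same sweep as above
    predictions = np.where(cs[None, :] > thresholds[:, None], 1.0, -1.0)
    a = (predictions == ls[None, :]).mean(axis=1).max()  # best accuracy over thresholds
    standard_error = (a * (1 - a) / len(cs)) ** 0.5
    return a, standard_error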


# check that training and test sets are balanced
px.histogram(
    df,
    x="cosine_similarity",
    color="label",
    barmode="overlay",
    width=500,
    facet_row="dataset",
).show()

for dataset in ["train", "test"]:
    data = df[df["dataset"] == dataset]
    a, se = accuracy_and_se(data["cosine_similarity"], data["label"])
    print(f"{dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}")


[Plotly histogram: cosine similarity distribution by label, faceted by train/test]
train accuracy: 89.1% ± 2.4%
test accuracy: 88.8% ± 2.4%

7. Optimize the matrix using the training data provided

def embedding_multiplied_by_matrix(
    embedding: List[float], matrix: torch.tensor
) -> np.array:
    embedding_tensor = torch.tensor(embedding).float()
    modified_embedding = embedding_tensor @ matrix
    modified_embedding = modified_embedding.detach().numpy()
    return modified_embedding


# compute custom embeddings and new cosine similarities
def apply_matrix_to_embeddings_dataframe(matrix: torch.tensor, df: pd.DataFrame):
    for column in ["text_1_embedding", "text_2_embedding"]:
        df[f"{column}_custom"] = df[column].apply(
            lambda x: embedding_multiplied_by_matrix(x, matrix)
        )
    df["cosine_similarity_custom"] = df.apply(
        lambda row: cosine_similarity(
            row["text_1_embedding_custom"], row["text_2_embedding_custom"]
        ),
        axis=1,
    )


def optimize_matrix(
    modified_embedding_length: int = 2048,  # in my brief experimentation, bigger was better (2048 is the length of babbage encodings)
    batch_size: int = 100,
    max_epochs: int = 100,
    learning_rate: float = 100.0,  # seemed to work best when similar to batch size - feel free to try a range of values
    dropout_fraction: float = 0.0,  # in my testing, dropout helped by a couple percentage points (definitely not necessary)
    df: pd.DataFrame = df,
    print_progress: bool = True,
    save_results: bool = True,
) -> pd.DataFrame:
    """Return a dataframe of results from training, including the matrix optimized to minimize loss on the training data."""
    run_id = random.randint(0, 2 ** 31 - 1)  # (range is arbitrary)

    # convert from dataframe to torch tensors
    # e is for embedding, s for similarity label
    def tensors_from_dataframe(
        df: pd.DataFrame,
        embedding_column_1: str,
        embedding_column_2: str,
        similarity_label_column: str,
    ) -> Tuple[torch.tensor]:
        e1 = np.stack(np.array(df[embedding_column_1].values))
        e2 = np.stack(np.array(df[embedding_column_2].values))
        s = np.stack(np.array(df[similarity_label_column].astype("float").values))

        e1 = torch.from_numpy(e1).float()
        e2 = torch.from_numpy(e2).float()
        s = torch.from_numpy(s).float()

        return e1, e2, s

    e1_train, e2_train, s_train = tensors_from_dataframe(
        df[df["dataset"] == "train"], "text_1_embedding", "text_2_embedding", "label"
    )
    e1_test, e2_test, s_test = tensors_from_dataframe(
        df[df["dataset"] == "test"], "text_1_embedding", "text_2_embedding", "label"
    )

    # create dataset and loader
    dataset = torch.utils.data.TensorDataset(e1_train, e2_train, s_train)
    train_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True
    )

    # define the model (similarity of projected embeddings)
    def model(embedding_1, embedding_2, matrix, dropout_fraction=dropout_fraction):
        e1 = torch.nn.functional.dropout(embedding_1, p=dropout_fraction)
        e2 = torch.nn.functional.dropout(embedding_2, p=dropout_fraction)
        modified_embedding_1 = e1 @ matrix  # @ is matrix multiplication
        modified_embedding_2 = e2 @ matrix
        similarity = torch.nn.functional.cosine_similarity(
            modified_embedding_1, modified_embedding_2
        )
        return similarity

    # define the loss function to minimize
    def mse_loss(predictions, targets):
        difference = predictions - targets
        return torch.sum(difference * difference) / difference.numel()

    # initialize the projection matrix
    embedding_length = len(df["text_1_embedding"].values[0])
    matrix = torch.randn(
        embedding_length, modified_embedding_length, requires_grad=True
    )

    epochs, types, losses, accuracies, matrices = [], [], [], [], []
    for epoch in range(1, 1 + max_epochs):
        # iterate through the training dataloader
        for a, b, actual_similarity in train_loader:
            # generate prediction
            predicted_similarity = model(a, b, matrix)
            # get the loss and perform backpropagation
            loss = mse_loss(predicted_similarity, actual_similarity)
            loss.backward()
            # update the weights
            with torch.no_grad():
                matrix -= matrix.grad * learning_rate
                # set the gradients to zero
                matrix.grad.zero_()
        # calculate the test loss
        test_predictions = model(e1_test, e2_test, matrix)
        test_loss = mse_loss(test_predictions, s_test)

        # compute custom embeddings and new cosine similarities
        apply_matrix_to_embeddings_dataframe(matrix, df)

        # calculate the test accuracy
        for dataset in ["train", "test"]:
            data = df[df["dataset"] == dataset]
            a, se = accuracy_and_se(data["cosine_similarity_custom"], data["label"])

            # record the results of each epoch
            epochs.append(epoch)
            types.append(dataset)
            losses.append(loss.item() if dataset == "train" else test_loss.item())
            accuracies.append(a)
            matrices.append(matrix.detach().numpy())

            # optionally print accuracies
            if print_progress is True:
                print(
                    f"Epoch {epoch}/{max_epochs}: {dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}"
                )

    data = pd.DataFrame(
        {"epoch": epochs, "type": types, "loss": losses, "accuracy": accuracies}
    )
    data["run_id"] = run_id
    data["modified_embedding_length"] = modified_embedding_length
    data["batch_size"] = batch_size
    data["max_epochs"] = max_epochs
    data["learning_rate"] = learning_rate
    data["dropout_fraction"] = dropout_fraction
    data[
        "matrix"
    ] = matrices  # saving every single matrix can get big; feel free to delete/change
    if save_results is True:
        data.to_csv(f"{run_id}_optimization_results.csv", index=False)

    return data
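As an aside, the manual update in the loop above (matrix -= matrix.grad * learning_rate) is plain gradient descent; an equivalent inner loop using torch.optim.SGD, if you prefer the optimizer API, might look like this (a sketch, reusing model, mse_loss, matrix, and train_loader from the function above):

optimizer = torch.optim.SGD([matrix], lr=learning_rate)
for epoch in range(1, 1 + max_epochs):
    for a, b, actual_similarity in train_loader:
        optimizer.zero_grad()  # replaces matrix.grad.zero_()
        loss = mse_loss(model(a, b, matrix), actual_similarity)
        loss.backward()
        optimizer.step()  # replaces the manual matrix update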


# example hyperparameter search
# I recommend starting with max_epochs=10 while initially exploring
results = []
max_epochs = 30
dropout_fraction = 0.2
for batch_size, learning_rate in [(10, 10), (100, 100), (1000, 1000)]:
    result = optimize_matrix(
        batch_size=batch_size,
        learning_rate=learning_rate,
        max_epochs=max_epochs,
        dropout_fraction=dropout_fraction,
        save_results=False,
    )
    results.append(result)


Epoch 1/30: train accuracy: 89.1% ± 2.4%
Epoch 1/30: test accuracy: 88.4% ± 2.4%
Epoch 2/30: train accuracy: 89.5% ± 2.3%
Epoch 2/30: test accuracy: 88.8% ± 2.4%
Epoch 3/30: train accuracy: 90.6% ± 2.2%
Epoch 3/30: test accuracy: 89.3% ± 2.3%
Epoch 4/30: train accuracy: 91.2% ± 2.2%
Epoch 4/30: test accuracy: 89.7% ± 2.3%
Epoch 5/30: train accuracy: 91.5% ± 2.1%
Epoch 5/30: test accuracy: 90.0% ± 2.3%
Epoch 6/30: train accuracy: 91.9% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.2% ± 2.0%
Epoch 7/30: test accuracy: 90.7% ± 2.2%
Epoch 8/30: train accuracy: 92.7% ± 2.0%
Epoch 8/30: test accuracy: 90.9% ± 2.2%
Epoch 9/30: train accuracy: 92.7% ± 2.0%
Epoch 9/30: test accuracy: 91.0% ± 2.2%
Epoch 10/30: train accuracy: 93.0% ± 1.9%
Epoch 10/30: test accuracy: 91.6% ± 2.1%
Epoch 11/30: train accuracy: 93.1% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.4% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.6% ± 1.9%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.7% ± 2.0%
Epoch 15/30: train accuracy: 93.7% ± 1.8%
Epoch 15/30: test accuracy: 92.7% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.3% ± 1.8%
Epoch 20/30: test accuracy: 93.0% ± 1.9%
Epoch 21/30: train accuracy: 94.5% ± 1.7%
Epoch 21/30: test accuracy: 93.1% ± 1.9%
Epoch 22/30: train accuracy: 94.5% ± 1.7%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.6% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.6% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.3% ± 1.9%
Epoch 26/30: train accuracy: 94.8% ± 1.7%
Epoch 26/30: test accuracy: 93.4% ± 1.9%
Epoch 27/30: train accuracy: 94.8% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.9% ± 1.7%
Epoch 29/30: test accuracy: 93.4% ± 1.9%
Epoch 30/30: train accuracy: 94.9% ± 1.7%
Epoch 30/30: test accuracy: 93.3% ± 1.9%
Epoch 1/30: train accuracy: 89.7% ± 2.3%
Epoch 1/30: test accuracy: 89.1% ± 2.4%
Epoch 2/30: train accuracy: 89.8% ± 2.3%
Epoch 2/30: test accuracy: 89.9% ± 2.3%
Epoch 3/30: train accuracy: 90.3% ± 2.2%
Epoch 3/30: test accuracy: 90.0% ± 2.3%
Epoch 4/30: train accuracy: 91.0% ± 2.2%
Epoch 4/30: test accuracy: 90.3% ± 2.2%
Epoch 5/30: train accuracy: 91.3% ± 2.1%
Epoch 5/30: test accuracy: 90.3% ± 2.2%
Epoch 6/30: train accuracy: 91.8% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.4% ± 2.0%
Epoch 7/30: test accuracy: 91.0% ± 2.2%
Epoch 8/30: train accuracy: 92.8% ± 2.0%
Epoch 8/30: test accuracy: 91.3% ± 2.1%
Epoch 9/30: train accuracy: 93.1% ± 1.9%
Epoch 9/30: test accuracy: 91.6% ± 2.1%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 91.9% ± 2.1%
Epoch 11/30: train accuracy: 93.4% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.6% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.7% ± 1.8%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.5% ± 2.0%
Epoch 15/30: train accuracy: 93.9% ± 1.8%
Epoch 15/30: test accuracy: 92.8% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 92.8% ± 2.0%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 92.8% ± 2.0%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 92.8% ± 2.0%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 92.8% ± 2.0%
Epoch 20/30: train accuracy: 94.2% ± 1.8%
Epoch 20/30: test accuracy: 93.1% ± 1.9%
Epoch 21/30: train accuracy: 94.3% ± 1.8%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.3% ± 1.8%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.5% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.5% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.6% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.6% ± 1.7%
Epoch 26/30: test accuracy: 93.3% ± 1.9%
Epoch 27/30: train accuracy: 94.6% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.8% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.8% ± 1.7%
Epoch 29/30: test accuracy: 93.3% ± 1.9%
Epoch 30/30: train accuracy: 94.8% ± 1.7%
Epoch 30/30: test accuracy: 93.4% ± 1.9%
Epoch 1/30: train accuracy: 90.7% ± 2.2%
Epoch 1/30: test accuracy: 89.9% ± 2.3%
Epoch 2/30: train accuracy: 90.9% ± 2.2%
Epoch 2/30: test accuracy: 90.3% ± 2.2%
Epoch 3/30: train accuracy: 91.6% ± 2.1%
Epoch 3/30: test accuracy: 90.3% ± 2.2%
Epoch 4/30: train accuracy: 92.2% ± 2.0%
Epoch 4/30: test accuracy: 90.7% ± 2.2%
Epoch 5/30: train accuracy: 92.4% ± 2.0%
Epoch 5/30: test accuracy: 91.3% ± 2.1%
Epoch 6/30: train accuracy: 92.5% ± 2.0%
Epoch 6/30: test accuracy: 91.8% ± 2.1%
Epoch 7/30: train accuracy: 93.0% ± 1.9%
Epoch 7/30: test accuracy: 92.2% ± 2.0%
Epoch 8/30: train accuracy: 93.1% ± 1.9%
Epoch 8/30: test accuracy: 92.7% ± 2.0%
Epoch 9/30: train accuracy: 93.3% ± 1.9%
Epoch 9/30: test accuracy: 92.5% ± 2.0%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 92.7% ± 2.0%
Epoch 11/30: train accuracy: 93.6% ± 1.9%
Epoch 11/30: test accuracy: 92.8% ± 2.0%
Epoch 12/30: train accuracy: 93.7% ± 1.8%
Epoch 12/30: test accuracy: 92.8% ± 2.0%
Epoch 13/30: train accuracy: 94.0% ± 1.8%
Epoch 13/30: test accuracy: 93.0% ± 1.9%
Epoch 14/30: train accuracy: 93.9% ± 1.8%
Epoch 14/30: test accuracy: 93.0% ± 1.9%
Epoch 15/30: train accuracy: 94.2% ± 1.8%
Epoch 15/30: test accuracy: 93.0% ± 1.9%
Epoch 16/30: train accuracy: 94.2% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.3% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.5% ± 1.7%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.5% ± 1.7%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.6% ± 1.7%
Epoch 20/30: test accuracy: 93.3% ± 1.9%
Epoch 21/30: train accuracy: 94.8% ± 1.7%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.8% ± 1.7%
Epoch 22/30: test accuracy: 93.4% ± 1.9%
Epoch 23/30: train accuracy: 94.8% ± 1.7%
Epoch 23/30: test accuracy: 93.4% ± 1.9%
Epoch 24/30: train accuracy: 94.8% ± 1.7%
Epoch 24/30: test accuracy: 93.4% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.9% ± 1.7%
Epoch 26/30: test accuracy: 93.6% ± 1.9%
Epoch 27/30: train accuracy: 94.9% ± 1.7%
Epoch 27/30: test accuracy: 93.6% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.6% ± 1.9%
Epoch 29/30: train accuracy: 95.1% ± 1.6%
Epoch 29/30: test accuracy: 93.6% ± 1.9%
Epoch 30/30: train accuracy: 95.1% ± 1.6%
Epoch 30/30: test accuracy: 93.6% ± 1.9%
runs_df = pd.concat(results)

# plot training loss and test loss over time
px.line(
    runs_df,
    line_group="run_id",
    x="epoch",
    y="loss",
    color="type",
    hover_data=["batch_size", "learning_rate", "dropout_fraction"],
    facet_row="learning_rate",
    facet_col="batch_size",
    width=500,
).show()

# plot accuracy over time
px.line(
    runs_df,
    line_group="run_id",
    x="epoch",
    y="accuracy",
    color="type",
    hover_data=["batch_size", "learning_rate", "dropout_fraction"],
    facet_row="learning_rate",
    facet_col="batch_size",
    width=500,
).show()


[Plotly line charts: loss over epochs and accuracy over epochs, faceted by learning_rate and batch_size]

8. Plot the before & after, showing the results of the best matrix found during training

The better the matrix is, the more cleanly it will separate the similar and dissimilar pairs.

# apply the result of the best run to the original data
best_run = runs_df.sort_values(by="accuracy", ascending=False).iloc[0]
best_matrix = best_run["matrix"]
apply_matrix_to_embeddings_dataframe(best_matrix, df)


# plot the similarity distribution BEFORE customization
px.histogram(
    df,
    x="cosine_similarity",
    color="label",
    barmode="overlay",
    width=500,
    facet_row="dataset",
).show()

test_df = df[df["dataset"] == "test"]
a, se = accuracy_and_se(test_df["cosine_similarity"], test_df["label"])
print(f"Test accuracy: {a:0.1%} ± {1.96 * se:0.1%}")

# plot the similarity distribution AFTER customization
px.histogram(
    df,
    x="cosine_similarity_custom",
    color="label",
    barmode="overlay",
    width=500,
    facet_row="dataset",
).show()

a, se = accuracy_and_se(test_df["cosine_similarity_custom"], test_df["label"])
print(f"Test accuracy after customization: {a:0.1%} ± {1.96 * se:0.1%}")


[Plotly histogram: cosine similarity before customization]
Test accuracy: 88.8% ± 2.4%
[Plotly histogram: cosine similarity after customization]
Test accuracy after customization: 93.6% ± 1.9%
best_matrix  # this is what you can multiply your embeddings by


array([[-1.2566795e+00, -1.5297449e+00, -1.3271648e-01, ...,
-1.2859761e+00, -5.3254390e-01, 4.8364732e-01],
[-1.4826347e+00, 9.2656955e-02, -4.2437232e-01, ...,
1.1872858e+00, -1.0831847e+00, -1.0683593e+00],
[-2.2029283e+00, -1.9703420e+00, 3.1125939e-01, ...,
2.2947595e+00, 5.5780332e-03, -6.0171342e-01],
...,
[-1.1019799e-01, 1.3599515e+00, -4.7677776e-01, ...,
6.5626711e-01, 7.2359240e-01, 3.0733588e+00],
[ 1.6624762e-03, 4.2648423e-01, -1.1380885e+00, ...,
8.7202555e-01, 9.3173909e-01, -1.6760436e+00],
[ 7.7449006e-01, 4.9213606e-01, 3.5407653e-01, ...,
1.3460466e+00, -1.9509128e-01, 7.7514690e-01]], dtype=float32)
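To use the customized embeddings on new text, multiply each new embedding by best_matrix before computing similarities. A sketch, reusing get_embedding_with_cache from above (the example sentence is arbitrary):

new_embedding = get_embedding_with_cache("A dog fetches a stick.")
custom_embedding = np.array(new_embedding) @ best_matrix
# compare custom embeddings with cosine_similarity, exactly as before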