Quickstart
Sentence Transformer
Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
Calculate **fixed-size vector representations (embeddings)** given **texts or images**.
Embedding calculation is often **efficient**, and embedding similarity calculation is **very fast**.
Applicable to a **wide range of tasks**, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
Often used as the **first step in a two-step retrieval process**, where a Cross-Encoder (a.k.a. reranker) model re-ranks the top-k results from the bi-encoder (see the combined sketch at the end of this quickstart).
Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:
from sentence_transformers import SentenceTransformer
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
With SentenceTransformer("all-MiniLM-L6-v2") we pick which Sentence Transformer model to load. In this example, we load all-MiniLM-L6-v2, which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. Using SentenceTransformer.similarity(), we compute the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and third sentence (0.1046) or between the second and third sentence (0.1411).
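By default, SentenceTransformer.similarity() computes cosine similarity. As a minimal sketch, assuming a recent sentence-transformers release where the similarity_fn_name argument is available, you could switch to a different similarity function like this:
from sentence_transformers import SentenceTransformer
# Assumption: similarity_fn_name is supported (sentence-transformers v3+).
# Valid values include "cosine" (the default), "dot", "euclidean", and "manhattan".
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name="dot")
print(model.similarity_fn_name)
# => dot
embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
])
# similarity() now returns dot products instead of cosine similarities
print(model.similarity(embeddings, embeddings))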
Finetuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the Training Overview section.
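As an illustration, here is a minimal fine-tuning sketch using the SentenceTransformerTrainer API. The dataset and loss below are assumptions chosen for the example; any dataset of (anchor, positive) pairs with a matching loss would work:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
# 1. Load the model to finetune
model = SentenceTransformer("all-MiniLM-L6-v2")
# 2. Load a dataset of (anchor, positive) pairs (assumed example dataset)
dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")
# 3. Pick a loss that matches the data format
loss = MultipleNegativesRankingLoss(model)
# 4. Train
trainer = SentenceTransformerTrainer(model=model, train_dataset=dataset, loss=loss)
trainer.train()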
Cross Encoder
Characteristics of Cross Encoder (a.k.a. reranker) models:
Calculate a **similarity score** given **pairs of texts**.
Generally provide **superior performance** compared to Sentence Transformer (a.k.a. bi-encoder) models.
Often **slower** than Sentence Transformer models, as they require computation for each pair rather than for each text.
Due to the previous two characteristics, Cross Encoders are often used to **re-rank the top-k results from a Sentence Transformer model** (sketched at the end of this section).
The usage of Cross Encoder (a.k.a. reranker) models is similar to that of Sentence Transformers:
from sentence_transformers.cross_encoder import CrossEncoder
# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
# We want to compute the similarity between the query sentence...
query = "A man is eating pasta."
# ... and all sentences in the corpus
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"A cheetah is running behind its prey.",
]
# 2. We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)
# Print the scores
print("Query: ", query)
for rank in ranks:
print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
"""
Query: A man is eating pasta.
0.67 A man is eating food.
0.34 A man is eating a piece of bread.
0.08 A man is riding a horse.
0.07 A man is riding a white horse on an enclosed ground.
0.01 The girl is carrying a baby.
0.01 Two men pushed carts through the woods.
0.01 A monkey is playing drums.
0.01 A woman is playing violin.
0.01 A cheetah is running behind its prey.
"""
# 3. Alternatively, you can also manually compute the score between two sentences
import numpy as np
sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)
# Sort the scores in decreasing order to get the corpus indices
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
"""
Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717]
Indices: [0 1 3 6 2 5 7 4 8]
"""
With CrossEncoder("cross-encoder/stsb-distilroberta-base") we pick which CrossEncoder model to load. In this example, we load cross-encoder/stsb-distilroberta-base, which is a DistilRoBERTa model finetuned on the STS Benchmark dataset.
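To connect the two model types, here is a minimal sketch of the two-step retrieval process mentioned above: the Sentence Transformer cheaply retrieves the top-k candidates, and the Cross Encoder re-ranks them. The corpus, the top_k value, and the choice of models are illustrative assumptions:
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.cross_encoder import CrossEncoder
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")
query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]
# Step 1: fast bi-encoder retrieval of the top-k candidates
query_embedding = bi_encoder.encode(query)
corpus_embeddings = bi_encoder.encode(corpus)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
# Step 2: slower but more accurate cross-encoder re-ranking of those candidates
pairs = [[query, corpus[hit["corpus_id"]]] for hit in hits]
scores = cross_encoder.predict(pairs)
for hit, score in sorted(zip(hits, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}\t{corpus[hit['corpus_id']]}")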
Next Steps
Consider reading one of the following sections next: