Paraphrase Mining

Paraphrase mining is the task of finding paraphrases (texts with identical or similar meaning) in a large corpus of sentences. In Semantic Textual Similarity we saw a simplified version of this: finding paraphrases within a single list of sentences. The approach presented there scored and ranked all pairs by brute force.
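For reference, a minimal sketch of that brute-force approach (an illustration, not the library's implementation) could look like the following, scoring every pair with cosine similarity:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The cat sits outside", "The cat plays in the garden", "I love pasta"]

# Embed all sentences once, then compare every sentence against every other one
embeddings = model.encode(sentences)
scores = cos_sim(embeddings, embeddings)  # n x n similarity matrix

# Collect all unique pairs and rank them by score - O(n^2) pairs in total
pairs = []
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        pairs.append((scores[i][j].item(), i, j))
pairs.sort(reverse=True)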

However, since this approach has quadratic runtime, it does not scale to large collections of sentences (10,000 and more). For larger collections, the paraphrase_mining() function can be used:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("all-MiniLM-L6-v2")

# A single list of sentences - can contain tens of thousands of sentences
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "I love pasta",
    "The new movie is awesome",
    "The cat plays in the garden",
    "A woman watches TV",
    "The new movie is so great",
    "Do you like pizza?",
]

paraphrases = paraphrase_mining(model, sentences)

for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

paraphrase_mining() accepts the following parameters:

sentence_transformers.util.paraphrase_mining(model, sentences: list[str], show_progress_bar: bool = False, batch_size: int = 32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100, score_function: Callable[[Tensor, Tensor], Tensor] = <function cos_sim>) → list[list[float | int]]

Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

Parameters:
  • model (SentenceTransformer) -- SentenceTransformer model for embedding computation

  • sentences (List[str]) -- A list of strings (texts or sentences)

  • show_progress_bar (bool, optional) -- Whether to show a progress bar. Defaults to False.

  • batch_size (int, optional) -- Number of texts that are encoded simultaneously by the model. Defaults to 32.

  • query_chunk_size (int, optional) -- Search for the most similar pairs for #query_chunk_size sentences at a time. Decrease to lower the memory footprint (increases run time). Defaults to 5000.

  • corpus_chunk_size (int, optional) -- Compare a sentence against #corpus_chunk_size other sentences at a time. Decrease to lower the memory footprint (increases run time). Defaults to 100000.

  • max_pairs (int, optional) -- Maximum number of text pairs returned. Defaults to 500000.

  • top_k (int, optional) -- For each sentence, we retrieve up to top_k other sentences. Defaults to 100.

  • score_function (Callable[[Tensor, Tensor], Tensor], optional) -- Function for computing scores. By default, cosine similarity. Defaults to cos_sim.

Returns:

Returns a list of triplets with the format [score, id1, id2]

Return type:

List[List[Union[float, int]]]
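Most of these parameters can be left at their defaults. As a sketch (not taken from the original documentation), the snippet below passes a few of them explicitly, swapping cosine similarity for dot-product scoring via sentence_transformers.util.dot_score:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining, dot_score

model = SentenceTransformer("all-MiniLM-L6-v2")

# Reuses the `sentences` list from the example above
paraphrases = paraphrase_mining(
    model,
    sentences,
    score_function=dot_score,  # dot-product instead of cosine similarity
    show_progress_bar=True,    # display encoding progress
    batch_size=64,             # encode 64 sentences at a time
)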

To optimize memory and computation time, paraphrase mining is performed in chunks, as specified by query_chunk_size and corpus_chunk_size. To be specific, only query_chunk_size * corpus_chunk_size pairs are compared at a time, rather than len(sentences) * len(sentences). This is more time- and memory-efficient. Additionally, paraphrase_mining() only considers the top_k best-scoring sentences per sentence in each chunk. You can experiment with this value to trade off efficiency against quality.

For example, with the following call you will get only the single most relevant other sentence for each sentence:

paraphrases = paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1)

The final key parameter is max_pairs, which determines the maximum number of paraphrase pairs that the function returns. Usually, fewer pairs are returned, because the list is cleaned of duplicates: if both (A, B) and (B, A) are found, only one of them is returned.
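For instance (a hypothetical call, not taken from the original documentation), limiting the output to at most five pairs:

# At most 5 pairs are returned; after (A, B) / (B, A) duplicates are removed,
# the result may contain fewer than max_pairs entries
paraphrases = paraphrase_mining(model, sentences, max_pairs=5)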

Note

If B is the most similar sentence to A, A is not necessarily the most similar sentence to B. So the returned list may contain entries like (A, B) and (B, C).
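If you need strictly mutual matches, one option (a post-processing sketch, not part of the library API) is to filter the returned pairs yourself. This assumes the returned list is sorted by descending score, as in the examples above:

# Record the best-scoring partner seen so far for each sentence id
best_match = {}
for score, i, j in paraphrases:
    best_match.setdefault(i, j)
    best_match.setdefault(j, i)

# Keep only pairs where both sentences are each other's best match
mutual_pairs = [(score, i, j) for score, i, j in paraphrases
                if best_match[i] == j and best_match[j] == i]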