Dataset Overview¶
Tip

Quickstart: Find `curated datasets <https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552>`_ or `community datasets <https://huggingface.co/datasets?other=sentence-transformers>`_, choose a loss function via this `loss overview <loss_overview.html>`_, and `verify <training_overview.html#dataset-format>`_ that it works with your dataset.

It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). See Training Overview > Dataset Format to learn how to verify whether a dataset format works with a loss function.
In practice, most dataset configurations will take one of four forms:

Positive pair: a pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) and asymmetric tasks (semantic search). Examples include pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, (query, response) pairs, or (source_language, target_language) pairs. Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.

Examples: sentence-transformers/sentence-compression, sentence-transformers/coco-captions, sentence-transformers/codesearchnet, sentence-transformers/natural-questions, sentence-transformers/gooaq, sentence-transformers/squad, sentence-transformers/wikihow, sentence-transformers/eli5

Triplets: (anchor, positive, negative) text triplets. These datasets do not require labels.

Examples: sentence-transformers/quora-duplicates, nirantk/triplets, sentence-transformers/all-nli

Pair with similarity score: a pair of sentences with a score indicating their similarity. Common examples are "Semantic Textual Similarity" datasets.

Examples: sentence-transformers/stsb, PhilipMay/stsb_multi_mt.

Texts with classes: a text with its corresponding class. This data format is easily converted by loss functions into three sentences (triplets): the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class.

Examples: trec, yahoo_answers_topics.

Note that it is often simple to transform a dataset from one format to another, such that it works with your loss function of choice.
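As a concrete sketch of such a conversion, the snippet below turns a handful of hypothetical "text with class" rows (mimicking a dataset like trec; the rows and column names are made up for illustration) into (anchor, positive) pairs by combining texts of the same class:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical labeled rows, mimicking a "text with class" dataset such as trec.
rows = [
    {"text": "How far is the moon?", "label": "NUM"},
    {"text": "What is the distance to the sun?", "label": "NUM"},
    {"text": "Who wrote Hamlet?", "label": "HUM"},
    {"text": "Who painted the Mona Lisa?", "label": "HUM"},
]

# Group texts by class, then emit every same-class combination as an (anchor, positive) pair.
by_label = defaultdict(list)
for row in rows:
    by_label[row["label"]].append(row["text"])

pairs = [
    {"anchor": anchor, "positive": positive}
    for texts in by_label.values()
    for anchor, positive in combinations(texts, 2)
]
print(pairs)
```

The resulting pairs can then be used with a loss function that expects unlabeled positive pairs.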
Tip

You can use :func:`~sentence_transformers.util.mine_hard_negatives` to convert a dataset of positive pairs into a dataset of triplets. It uses a :class:`~sentence_transformers.SentenceTransformer` model to find hard negatives: texts that are similar to the text in the first dataset column, but not quite as similar as the text in the second dataset column. Datasets with hard triplets often outperform datasets with only positive pairs.

For example, we mined hard negatives from `sentence-transformers/gooaq <https://huggingface.co/datasets/sentence-transformers/gooaq>`_ to produce `tomaarsen/gooaq-hard-negatives <https://huggingface.co/datasets/tomaarsen/gooaq-hard-negatives>`_, and trained `tomaarsen/mpnet-base-gooaq <https://huggingface.co/tomaarsen/mpnet-base-gooaq>`_ and `tomaarsen/mpnet-base-gooaq-hard-negatives <https://huggingface.co/tomaarsen/mpnet-base-gooaq-hard-negatives>`_ on the two datasets, respectively. Unfortunately, the two models use different evaluation splits, so their performance cannot be compared directly.
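In code, such a mining run might look roughly like the sketch below. This is not the exact recipe used for the models above: the model choice and the mining parameters (ranges, thresholds, sampling strategy) are illustrative assumptions, and running it downloads both the dataset and the model.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Load a positive-pair dataset: (question, answer) pairs.
dataset = load_dataset("sentence-transformers/gooaq", split="train")

# Use an existing embedding model to score candidate negatives.
model = SentenceTransformer("all-MiniLM-L6-v2")

# For each (anchor, positive) pair, mine texts that are similar to the anchor
# but less similar than the true positive, producing (anchor, positive, negative) triplets.
triplet_dataset = mine_hard_negatives(
    dataset,
    model,
    num_negatives=1,            # one mined negative per positive pair
    range_min=10,               # skip the 10 most similar candidates (likely false negatives)
    range_max=50,               # only consider the 50 most similar candidates
    max_score=0.8,              # discard candidates scoring too close to the positive
    sampling_strategy="random", # sample randomly from the remaining candidates
)
```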
Datasets on the Hugging Face Hub¶

The `Datasets library <https://huggingface.co/docs/datasets/index>`_ (``pip install datasets``) allows you to load datasets from the Hugging Face Hub with the :func:`~datasets.load_dataset` function::
from datasets import load_dataset
# Indicate the dataset id from the Hub
dataset_id = "sentence-transformers/natural-questions"
dataset = load_dataset(dataset_id, split="train")
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
print(dataset[0])
"""
{
'query': 'when did richmond last play in a preliminary final',
'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next."
}
"""
For more information on how to manipulate your dataset, see the Datasets documentation.

Tip

Hugging Face datasets often contain extraneous columns, e.g. sample_id, metadata, source, type, etc. You can use :meth:`Dataset.remove_columns <datasets.Dataset.remove_columns>` to remove these columns, as they will otherwise be used as inputs. You can also use :meth:`Dataset.select_columns <datasets.Dataset.select_columns>` to keep only the desired columns.
Pre-existing Datasets¶

The Hugging Face Hub hosts more than 150,000 datasets, many of which can be converted for training embedding models. We aim to tag all Hugging Face datasets that work out of the box with Sentence Transformers with sentence-transformers, allowing you to easily find them by browsing to https://huggingface.co/datasets?other=sentence-transformers. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks.

These are some of the popular pre-existing datasets tagged as sentence-transformers that can be used to train and fine-tune SentenceTransformer models:
Dataset | Description |
---|---|
GooAQ | (Question, Answer) pairs from Google auto suggest |
Yahoo Answers | (Title+Question, Answer), (Title, Answer), (Title, Question), (Question, Answer) pairs from Yahoo Answers |
MS MARCO Triplets (msmarco-distilbert-base-tas-b) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (msmarco-distilbert-base-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (msmarco-MiniLM-L-6-v3) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-cls-dot-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-cls-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-mean-dot-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (mpnet-margin-mse-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (co-condenser-margin-mse-cls-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v2) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (co-condenser-margin-mse-sym-mnrl-mean-v1) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
MS MARCO Triplets (BM25) | (Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives |
Stack Exchange Duplicates | (Title, Title), (Title+Body, Title+Body), (Body, Body) pairs of duplicate questions from StackExchange |
ELI5 | (Question, Answer) pairs from ELI5 dataset |
SQuAD | (Question, Answer) pairs from SQuAD dataset |
WikiHow | (Summary, Text) pairs from WikiHow |
Amazon Reviews 2018 | (Title, review) pairs from Amazon Reviews |
Natural Questions | (Query, Answer) pairs from the Natural Questions dataset |
Amazon QA | (Question, Answer) pairs from Amazon |
S2ORC | (Title, Abstract), (Abstract, Citation), (Title, Citation) pairs of scientific papers |
Quora Duplicates | Duplicate question pairs from Quora |
WikiAnswers | Duplicate question pairs from WikiAnswers |
AGNews | (Title, Description) pairs of news articles from the AG News dataset |
AllNLI | (Anchor, Entailment, Contradiction) triplets from SNLI + MultiNLI |
NPR | (Title, Body) pairs from the npr.org website |
SPECTER | (Title, Positive Title, Negative Title) triplets of Scientific Publications from Specter |
Simple Wiki | (English, Simple English) pairs from Wikipedia |
PAQ | (Query, Answer) from the Probably-Asked Questions dataset |
altlex | (English, Simple English) pairs from Wikipedia |
CC News | (Title, article) pairs from the CC News dataset |
CodeSearchNet | (Comment, Code) pairs from open source libraries on GitHub |
Sentence Compression | (Long text, Short text) pairs from the Sentence Compression dataset |
Trivia QA | (Query, Answer) pairs from the TriviaQA dataset |
Flickr30k Captions | Duplicate captions from the Flickr30k dataset |
xsum | (News Article, Summary) pairs from XSUM dataset |
Coco Captions | Duplicate captions from the Coco Captions dataset |
Parallel Sentences: Europarl | (English, Non-English) pairs across numerous languages |
Parallel Sentences: Global Voices | (English, Non-English) pairs across numerous languages |
Parallel Sentences: MUSE | (English, Non-English) pairs across numerous languages |
Parallel Sentences: JW300 | (English, Non-English) pairs across numerous languages |
Parallel Sentences: News Commentary | (English, Non-English) pairs across numerous languages |
Parallel Sentences: OpenSubtitles | (English, Non-English) pairs across numerous languages |
Parallel Sentences: Talks | (English, Non-English) pairs across numerous languages |
Parallel Sentences: Tatoeba | (English, Non-English) pairs across numerous languages |
Parallel Sentences: WikiMatrix | (English, Non-English) pairs across numerous languages |
Parallel Sentences: WikiTitles | (English, Non-English) pairs across numerous languages |
Note

We advise users to tag datasets that can be used for training embedding models with ``tags: sentence-transformers``. We also welcome high-quality datasets to be added to the list above for all to see and use.