Cross-Encoder for MS MARCO

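MS MARCO is a large-scale information retrieval corpus built from real user search queries issued to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.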

The training data contains over 500,000 examples, while the complete corpus contains over 8.8 million passages.

Usage with SentenceTransformers

The pretrained models can be used like this:

from sentence_transformers import CrossEncoder

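# Replace "model_name" with one of the model names from the table below.
model = CrossEncoder("model_name", max_length=512)
# Each input is a (query, passage) pair; predict returns one score per pair.
scores = model.predict(
    [("Query", "Paragraph1"), ("Query", "Paragraph2"), ("Query", "Paragraph3")]
)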
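In a retrieve-and-rerank setup, the scores returned by `predict` are typically used to sort the candidate passages. The following is a minimal sketch of that step, assuming a higher score means a more relevant passage. The helper `rerank` and its `predict` parameter are hypothetical names introduced here for illustration; any callable that maps a list of (query, passage) pairs to one score per pair should work, for example `model.predict` from the snippet above.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    predict: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from highest to lowest score.

    `predict` is a hypothetical stand-in for a scoring function such as
    CrossEncoder.predict: it takes (query, passage) pairs and returns
    one score per pair, in the same order.
    """
    # Pair the same query with every candidate passage.
    pairs = [(query, passage) for passage in passages]
    scores = predict(pairs)
    # Sort by score, highest first (assumes higher means more relevant).
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]
```

With a loaded model, a call would look like `rerank("Query", ["Paragraph1", "Paragraph2", "Paragraph3"], model.predict)`.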

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

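# Tokenize the (query, passage) pairs: queries first, passages second.
features = tokenizer(["Query", "Query"], ["Paragraph1", "Paragraph2"], padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    # The logits hold the relevance score for each (query, passage) pair.
    scores = model(**features).logits
    print(scores)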

Models and Performance

The following table lists various pretrained cross-encoders and their performance on the TREC Deep Learning 2019 and MS Marco Passage Reranking datasets.

| Model-Name | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev) | Docs / Sec |
| --- | --- | --- | --- |
| Version 2 models | | | |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 69.84 | 32.56 | 9000 |
| cross-encoder/ms-marco-MiniLM-L-2-v2 | 71.01 | 34.85 | 4100 |
| cross-encoder/ms-marco-MiniLM-L-4-v2 | 73.04 | 37.70 | 2500 |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 74.30 | 39.01 | 1800 |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | 74.31 | 39.02 | 960 |
| Version 1 models | | | |
| cross-encoder/ms-marco-TinyBERT-L-2 | 67.43 | 30.15 | 9000 |
| cross-encoder/ms-marco-TinyBERT-L-4 | 68.09 | 34.50 | 2900 |
| cross-encoder/ms-marco-TinyBERT-L-6 | 69.57 | 36.13 | 680 |
| cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340 |
| Other models | | | |
| nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900 |
| nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340 |
| nboost/pt-bert-large-msmarco | 73.36 | 36.48 | 100 |
| Capreolus/electra-base-msmarco | 71.23 | 36.89 | 340 |
| amberoad/bert-multilingual-passage-reranking-msmarco | 68.40 | 35.54 | 330 |
| sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco | 72.82 | 37.88 | 720 |

Note: Runtime was computed on a V100 GPU with Hugging Face Transformers v4.
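
The Docs / Sec column can be turned into a rough latency estimate for reranking a fixed number of candidates per query. The sketch below is back-of-the-envelope arithmetic only: it assumes throughput on your hardware matches the figures in the table, and it ignores tokenization and batching overhead. The function name `rerank_latency_ms` and the choice of 100 candidates are illustrative assumptions, not part of any library.

```python
def rerank_latency_ms(num_candidates: int, docs_per_sec: float) -> float:
    """Estimate the milliseconds needed to score num_candidates passages."""
    if docs_per_sec <= 0:
        raise ValueError("docs_per_sec must be positive")
    return 1000.0 * num_candidates / docs_per_sec


# Throughput figures copied from the table above (V100 GPU).
throughput = {
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": 9000,
    "cross-encoder/ms-marco-MiniLM-L-6-v2": 1800,
    "cross-encoder/ms-marco-MiniLM-L-12-v2": 960,
}

for name, docs_per_sec in throughput.items():
    estimate = rerank_latency_ms(100, docs_per_sec)
    print(f"{name}: {estimate:.1f} ms per 100 candidates")
```

Under these assumptions, the L-6 and L-12 MiniLM models differ by roughly a factor of two in estimated latency while their NDCG@10 and MRR@10 scores in the table are nearly identical.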