Transformers documentation

DistilBERT

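Overview

The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

This model was contributed by victorsanh. The JAX version of this model was contributed by kamalkraj. The original code can be found here.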

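Usage tips

DistilBERT doesn't have token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
DistilBERT doesn't have an option to select the input positions (position_ids input). This could be added if necessary, though; just let us know if you need this option.
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it has been trained to predict the same probabilities as the larger model. The actual objective is a combination of:
- finding the same probabilities as the teacher model
- predicting the masked tokens correctly (but no next-sentence objective)
- a cosine similarity between the hidden states of the student and the teacher model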
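
The three-part objective listed above can be illustrated on toy numbers. The snippet below is a minimal sketch in plain Python, not the training code used for DistilBERT: the function names, the temperature value, and the equal weighting of the three terms are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)


def masked_lm_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction for one masked token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity, so aligned hidden states give a loss of 0."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are illustrative."""
    w_distil, w_mlm, w_cos = weights
    return (w_distil * distillation_loss(student_logits, teacher_logits)
            + w_mlm * masked_lm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


# Toy example: a 4-word vocabulary and 3-dimensional hidden states.
teacher_logits = [2.0, 0.5, 0.1, -1.0]
student_logits = [1.5, 0.7, 0.0, -0.8]
loss = triple_loss(student_logits, teacher_logits, target_index=0,
                   student_hidden=[0.2, 0.9, -0.4],
                   teacher_hidden=[0.1, 1.0, -0.5])
print(f"combined loss: {loss:.4f}")
```

Each term is zero only when the student matches the teacher (or the correct token) exactly, so the combined value shrinks as the student's probabilities and hidden states move toward the teacher's.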

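Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.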
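
For reference, the quantity such an operator computes is softmax(QK^T / sqrt(d)) V, where d is the head dimension. The sketch below spells this out in plain Python for a single attention head. It is an unoptimized illustration of the math, not PyTorch's implementation, and the function name and list-of-lists layout are choices made for this example.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors (one vector per token)."""
    scale = 1.0 / math.sqrt(len(keys[0]))  # 1 / sqrt(head dimension)
    outputs = []
    for q in queries:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Softmax over the keys (shifted by the max for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs


# Two tokens with 2-dimensional heads.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
for row in scaled_dot_product_attention(q, k, v):
    print([round(x, 3) for x in row])
```

In PyTorch, the fused counterpart is torch.nn.functional.scaled_dot_product_attention, which additionally handles batching, masking, and dropout.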

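SDPA is used by default for torch>=2.1.1 when an implementation is available, but you can also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.

import torch
from transformers import DistilBertModel

# Request SDPA explicitly and load the weights in half precision.
model = DistilBertModel.from_pretrained("distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with float16 and the distilbert-base-uncased model with a MaskedLM head, we saw the following speedups during training and inference.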

Training

Training steps	Batch size	Sequence length	Is CUDA	Time per batch (eager, s)	Time per batch (SDPA, s)	Speedup (%)	Eager peak memory (MB)	SDPA peak memory (MB)	Memory saving (%)
100	1	128	False	0.010	0.008	28.870	397.038	399.629	-0.649
100	1	256	False	0.011	0.009	20.681	412.505	412.606	-0.025
100	2	128	False	0.011	0.009	23.741	412.213	412.606	-0.095
100	2	256	False	0.015	0.013	16.502	427.491	425.787	0.400
100	4	128	False	0.015	0.013	13.828	427.491	425.787	0.400
100	4	256	False	0.025	0.022	12.882	594.156	502.745	18.182
100	8	128	False	0.023	0.022	8.010	545.922	502.745	8.588
100	8	256	False	0.046	0.041	12.763	983.450	798.480	23.165

Inference

num_batches	batch_size	seq_len	Is CUDA	Is half precision	Use mask	Per-token latency (eager, ms)	Per-token latency (SDPA, ms)	Speedup (%)	Memory (eager, MB)	Memory (SDPA, MB)	Memory saved (%)
50	2	64	True	True	True	0.032	0.025	28.192	154.532	155.531	-0.642
50	2	128	True	True	True	0.033	0.025	32.636	157.286	157.482	-0.125
50	4	64	True	True	True	0.032	0.026	24.783	157.023	157.449	-0.271
50	4	128	True	True	True	0.034	0.028	19.299	162.794	162.269	0.323
50	8	64	True	True	True	0.035	0.028	25.105	160.958	162.204	-0.768
50	8	128	True	True	True	0.052	0.046	12.375	173.155	171.844	0.763
50	16	64	True	True	True	0.051	0.045	12.882	172.106	171.713	0.229
50	16	128	True	True	True	0.096	0.081	18.524	191.257	191.517	-0.136
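
The percentage columns in both tables can be reproduced from the other columns. Checking a few rows suggests that the memory saving is computed relative to the SDPA figure, and the speedup as the ratio of eager to SDPA time minus one. That reading is inferred from the numbers rather than stated by the benchmark, and the helper names below are made up for this example. Because the time columns are rounded to three decimals, recomputed speedups only approximate the published values.

```python
def speedup_pct(eager_time, sdpa_time):
    """Speedup of SDPA over eager, in percent (assumed formula)."""
    return (eager_time / sdpa_time - 1.0) * 100.0


def memory_saving_pct(eager_mem, sdpa_mem):
    """Memory saved by SDPA relative to its own peak, in percent (assumed formula)."""
    return (eager_mem - sdpa_mem) / sdpa_mem * 100.0


# Training row: batch size 4, sequence length 256 (table reports 18.182).
print(round(memory_saving_pct(594.156, 502.745), 3))
# Inference row: batch size 2, sequence length 64 (table reports 28.192,
# presumably computed from unrounded timings).
print(round(speedup_pct(0.032, 0.025), 3))
```

A negative memory saving, as in several of the small-batch rows, means SDPA used slightly more peak memory than the eager implementation in that configuration.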

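Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.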

Text Classification

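A blog post on getting started with sentiment analysis using Python with DistilBERT.
A blog post on how to train DistilBERT with Blurr for sequence classification.
A blog post on how to use Ray to tune DistilBERT hyperparameters.
A blog post on how to train DistilBERT with Hugging Face and Amazon SageMaker.
A notebook on how to fine-tune DistilBERT for multi-label classification. 🌎
A notebook on how to fine-tune DistilBERT for multiclass classification with PyTorch. 🌎
A notebook on how to fine-tune DistilBERT for text classification in TensorFlow. 🌎
DistilBertForSequenceClassification is supported by this example script and notebook.
TFDistilBertForSequenceClassification is supported by this example script and notebook.
FlaxDistilBertForSequenceClassification is supported by this example script and notebook.
Text classification task guide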

Token Classification

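DistilBertForTokenClassification is supported by this example script and notebook.
TFDistilBertForTokenClassification is supported by this example script and notebook.
FlaxDistilBertForTokenClassification is supported by this example script.
Token classification chapter of the 🤗 Hugging Face Course.
Token classification task guide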

Fill-Mask

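DistilBertForMaskedLM is supported by this example script and notebook.
TFDistilBertForMaskedLM is supported by this example script and notebook.
FlaxDistilBertForMaskedLM is supported by this example script and notebook.
Masked language modeling chapter of the 🤗 Hugging Face Course.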
Masked language modeling task guide

Question Answering

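DistilBertForQuestionAnswering is supported by this example script and notebook.
TFDistilBertForQuestionAnswering is supported by this example script and notebook.
FlaxDistilBertForQuestionAnswering is supported by this example script.
Question answering chapter of the 🤗 Hugging Face Course.
Question answering task guide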

Multiple choice

DistilBertForMultipleChoice is supported by this example script and notebook.
TFDistilBertForMultipleChoice is supported by this example script and notebook.
Multiple choice task guide

⚗️ Optimization

A blog post on how to quantize DistilBERT with 🤗 Optimum and Intel.
A blog post on optimizing Transformers for GPUs with 🤗 Optimum.
A blog post on optimizing Transformers with Hugging Face Optimum.

⚡️ Inference

A blog post on how to accelerate BERT inference with Hugging Face Transformers and AWS Inferentia, using DistilBERT.
A blog post on serverless inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker.

🚀 Deploy

A blog post on how to deploy DistilBERT on Google Cloud.
A blog post on how to deploy DistilBERT with Amazon SageMaker.
A blog post on how to deploy BERT with Hugging Face Transformers, Amazon SageMaker and a Terraform module.

Combining DistilBERT and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

pip install -U flash-attn --no-build-isolation

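Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half precision (e.g. torch.float16).

To load and run a model using Flash Attention 2, refer to the snippet below: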

>>> import torch
>>> from transformers import AutoTokenizer, AutoModel

>>> device = "cuda" # the device to load the model onto

>>> tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')
>>> model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="flash_attention_2")

>>> text = "Replace me by any text you'd like."

>>> encoded_input = tokenizer(text, return_tensors='pt').to(device)
>>> model.to(device)

>>> output = model(**encoded_input)
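
If the same script has to run on machines with and without flash-attn installed, the attn_implementation value can be chosen at run time. The helper below is a hypothetical sketch (pick_attn_implementation is not part of Transformers). It only checks whether the flash_attn package can be imported, which does not guarantee that the GPU itself is supported, and it assumes a PyTorch version recent enough for SDPA (torch>=2.1.1, as noted earlier) when falling back.

```python
import importlib.util


def pick_attn_implementation(prefer_flash=True):
    """Return a value for `attn_implementation` based on installed packages.

    Hypothetical helper: it checks importability only, not GPU compatibility.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's native scaled dot-product attention.
    return "sdpa"


print(pick_attn_implementation())
```

The returned string can then be passed as attn_implementation=pick_attn_implementation() in the from_pretrained() call shown above.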
