OPT

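Overview

The OPT model was proposed in Open Pre-trained Transformer Language Models by Meta AI. OPT is a series of open-source large causal language models whose performance is similar to that of GPT3.

The abstract from the paper is the following:

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

This model was contributed by Arthur Zucker, Younes Belkada, and Patrick Von Platen. The original code can be found here.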

Tips:

OPT has the same architecture as BartDecoder.
Contrary to GPT2, OPT adds the EOS token </s> to the beginning of every prompt.
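
The EOS-prefix tip can be checked directly on tokenizer output. The following is a minimal sketch, not part of the official documentation: the helper names `starts_with_eos` and `check_opt_tokenizer` are hypothetical, and `check_opt_tokenizer` assumes that `transformers` is installed and that the facebook/opt-350m checkpoint can be downloaded.

```python
from typing import Sequence


def starts_with_eos(input_ids: Sequence[int], eos_token_id: int) -> bool:
    """Return True if the first token id of a prompt is the EOS token id."""
    return len(input_ids) > 0 and input_ids[0] == eos_token_id


def check_opt_tokenizer(checkpoint: str = "facebook/opt-350m") -> bool:
    """Tokenize a short prompt and report whether it begins with EOS.

    Hypothetical helper: it imports transformers lazily and downloads the
    tokenizer files, so it needs network access the first time it runs.
    """
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
    input_ids = tokenizer("Hello world")["input_ids"]
    return starts_with_eos(input_ids, tokenizer.eos_token_id)
```

Calling `check_opt_tokenizer()` should return `True` if the tokenizer behaves as the tip describes. When decoding generated text, passing `skip_special_tokens=True` to `batch_decode` removes the leading `</s>` from the output string.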

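Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OPT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.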

Text Generation

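A notebook on fine-tuning OPT with PEFT, bitsandbytes, and Transformers. 🌎
A blog post on decoding strategies with OPT.
Causal language modeling chapter of the 🤗 Hugging Face Course.
OPTForCausalLM is supported by this causal language modeling example script and notebook.
TFOPTForCausalLM is supported by this causal language modeling example script and notebook.
FlaxOPTForCausalLM is supported by this causal language modeling example script.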

Text Classification

Text classification task guide
OPTForSequenceClassification is supported by this example script and notebook.

Question Answering

OPTForQuestionAnswering is supported by this question answering example script and notebook.
Question answering chapter of the 🤗 Hugging Face Course.

⚡️ Inference

A blog post on How 🤗 Accelerate runs very large models thanks to PyTorch with OPT.

Combining OPT and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

pip install -U flash-attn --no-build-isolation

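Make sure that your hardware is compatible with Flash Attention 2. See the official documentation of the flash-attn repository for more information. Also make sure to load your model in half precision (e.g. `torch.float16`).

To load and run a model using Flash Attention 2, refer to the snippet below: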

>>> import torch
>>> from transformers import OPTForCausalLM, GPT2Tokenizer
>>> device = "cuda" # the device to load the model onto

>>> model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
>>> tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")

>>> prompt = ("A chat between a curious human and the Statue of Liberty.\n\nHuman: What is your name?\nStatue: I am the "
...           "Statue of Liberty.\nHuman: Where do you live?\nStatue: New York City.\nHuman: How long have you lived "
...           "there?")

>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
>>> model.to(device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=30, do_sample=False)
>>> tokenizer.batch_decode(generated_ids)[0]
'</s>A chat between a curious human and the Statue of Liberty.\n\nHuman: What is your name?\nStatue: I am the Statue of Liberty.\nHuman: Where do you live?\nStatue: New York City.\nHuman: How long have you lived there?\nStatue: I have lived here for about a year.\nHuman: What is your favorite place to eat?\nStatue: I love'

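Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the facebook/opt-2.7b checkpoint and the Flash Attention 2 version of the model using two different sequence lengths.

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the facebook/opt-350m checkpoint and the Flash Attention 2 version of the model using two different sequence lengths.

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.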

import torch
from transformers import OPTForCausalLM

model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16, attn_implementation="sdpa")
...

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).
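
To make concrete what the SDPA operator computes, the sketch below compares `torch.nn.functional.scaled_dot_product_attention` with a hand-written causal attention on random tensors. It is an illustration only and not OPT's actual attention module: the helper name `manual_causal_attention` and the tensor shapes are assumptions chosen for the example, and it runs on CPU in float32 so it needs no GPU or checkpoint download.

```python
import math

import torch
import torch.nn.functional as F


def manual_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Reference causal attention: softmax(Q K^T / sqrt(d)) V with a lower-triangular mask.

    All inputs have shape (batch, heads, seq_len, head_dim).
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = (q @ k.transpose(-2, -1)) * scale
    seq_len = q.size(-2)
    # True on and below the diagonal: each position may attend to itself and the past.
    causal_mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# Illustrative shapes: batch=2, heads=4, seq_len=8, head_dim=16.
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

reference = manual_causal_attention(q, k, v)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# The two results should agree up to floating-point error.
max_abs_diff = (reference - fused).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The manual version materializes the full seq_len × seq_len score matrix. The fused backends that SDPA can dispatch to avoid doing so, which is one plausible reason the memory gap in the benchmarks grows with sequence length.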

On a local benchmark (L40S-45GB, PyTorch 2.4.0, OS Debian GNU/Linux 11) using float16 with facebook/opt-350m, we saw the following speedups during training and inference.

Training

batch_size	seq_len	Time per batch (Eager - s)	Time per batch (SDPA - s)	Speedup (%)	Eager peak mem (MB)	SDPA peak mem (MB)	Mem saving (%)
1	128	0.047	0.037	26.360	1474.611	1474.32	0.019
1	256	0.046	0.037	24.335	1498.541	1499.49	-0.063
1	512	0.046	0.037	24.959	1973.544	1551.35	27.215
1	1024	0.062	0.038	65.135	4867.113	1698.35	186.578
1	2048	0.230	0.039	483.933	15662.224	2715.75	476.718
2	128	0.045	0.037	20.455	1498.164	1499.49	-0.089
2	256	0.046	0.037	24.027	1569.367	1551.35	1.161
2	512	0.045	0.037	20.965	3257.074	1698.35	91.778
2	1024	0.122	0.038	225.958	9054.405	2715.75	233.403
2	2048	0.464	0.067	593.646	30572.058	4750.55	543.548
4	128	0.045	0.037	21.918	1549.448	1551.35	-0.123
4	256	0.044	0.038	18.084	2451.768	1698.35	44.361
4	512	0.069	0.037	84.421	5833.180	2715.75	114.791
4	1024	0.262	0.062	319.475	17427.842	4750.55	266.860
4	2048	OOM	0.062	Eager OOM	OOM	4750.55	Eager OOM
8	128	0.044	0.037	18.436	2049.115	1697.78	20.694
8	256	0.048	0.036	32.887	4222.567	2715.75	55.484
8	512	0.153	0.06	154.862	10985.391	4750.55	131.245
8	1024	0.526	0.122	330.697	34175.763	8821.18	287.428
8	2048	OOM	0.122	Eager OOM	OOM	8821.18	Eager OOM
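
Several percentages in the training table exceed 100%, which is easier to read once the formula is explicit. The source does not state how they were computed. Judging from the reported numbers, they appear to be relative to the SDPA value, i.e. (eager / sdpa - 1) × 100. The sketch below checks that assumption against two memory rows of the table; the function name `relative_gain_percent` is hypothetical.

```python
def relative_gain_percent(eager_value: float, sdpa_value: float) -> float:
    """Return (eager / sdpa - 1) * 100, the formula inferred from the table."""
    if sdpa_value <= 0:
        raise ValueError("sdpa_value must be positive")
    return (eager_value / sdpa_value - 1.0) * 100.0


# Peak-memory figures (MB) copied from two rows of the training table.
mem_saving_512 = relative_gain_percent(1973.544, 1551.35)    # batch_size=1, seq_len=512
mem_saving_2048 = relative_gain_percent(15662.224, 2715.75)  # batch_size=1, seq_len=2048

# The same 2048-token row, expressed as the fraction of eager memory that is freed.
fraction_freed_2048 = 1.0 - 2715.75 / 15662.224

print(f"memory saving at seq_len=512:  {mem_saving_512:.3f}%")
print(f"memory saving at seq_len=2048: {mem_saving_2048:.3f}%")
print(f"fraction of eager memory freed at seq_len=2048: {fraction_freed_2048:.1%}")
```

Under this reading, a memory saving of about 477% means that eager mode needs roughly 5.8 times the memory of SDPA, or equivalently that SDPA frees about 83% of the eager footprint. Recomputing the speedup column from the rounded timings only approximates the reported values, presumably because they were derived from unrounded measurements.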

Inference

batch_size	seq_len	Per token latency Eager (ms)	Per token latency SDPA (ms)	Speedup (%)	Mem Eager (MB)	Mem SDPA (MB)	Mem saved (%)
1	128	11.634	8.647	34.546	717.676	717.674	0
1	256	11.593	8.86	30.851	742.852	742.845	0.001
1	512	11.515	8.816	30.614	798.232	799.593	-0.17
1	1024	11.556	8.915	29.628	917.265	895.538	2.426
2	128	12.724	11.002	15.659	762.434	762.431	0
2	256	12.704	11.063	14.83	816.809	816.733	0.009
2	512	12.757	10.947	16.535	917.383	918.339	-0.104
2	1024	13.018	11.018	18.147	1162.65	1114.81	4.291
4	128	12.739	10.959	16.243	856.335	856.483	-0.017
4	256	12.718	10.837	17.355	957.298	957.674	-0.039
4	512	12.813	10.822	18.393	1158.44	1158.45	-0.001
4	1024	13.416	11.06	21.301	1653.42	1557.19	6.18
8	128	12.763	10.891	17.193	1036.13	1036.51	-0.036
8	256	12.89	11.104	16.085	1236.98	1236.87	0.01
8	512	13.327	10.939	21.836	1642.29	1641.78	0.031
8	1024	15.181	11.175	35.848	2634.98	2443.35	7.843
