Transformers documentation

Mistral

Overview

Mistral was introduced in this blog post by Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.

The introduction of the blog post says:

The Mistral AI team is proud to release Mistral 7B, the most powerful language model for its size to date.

Mistral-7B is the first large language model (LLM) released by mistral.ai.

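Architectural details

Mistral-7B is a decoder-only Transformer with the following architectural choices:

Sliding Window Attention - Trained with an 8k context length and a fixed cache size, with a theoretical attention span of 128K tokens.
GQA (Grouped Query Attention) - Allows faster inference and a smaller cache size.
Byte-fallback BPE tokenizer - Ensures that characters are never mapped to out-of-vocabulary tokens.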
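
To make the sliding window idea concrete, the following is a minimal pure-Python sketch, not the Transformers implementation. The helper names are hypothetical. The window of 4096 tokens and the 32 decoder layers are assumed from the published Mistral-7B configuration; they do not appear in the text above.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def theoretical_span(window: int, num_layers: int) -> int:
    """Upper bound on how far information can travel through stacked layers."""
    return window * num_layers


# Each row is a query position; "x" marks the key positions it can see.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Assumed values: a 4096-token window and 32 decoder layers.
print(theoretical_span(window=4096, num_layers=32))  # 131072, i.e. 128K
```

Each layer only looks back one window, but stacking layers lets information hop one further window per layer, which is where the 128K figure comes from.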

For more details, refer to the release blog post.

License

Mistral-7B is released under the Apache 2.0 license.

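Usage tips

The Mistral team has released 3 checkpoints:

A base model, Mistral-7B-v0.1, which has been pre-trained to predict the next token on internet-scale data.
An instruction-tuned model, Mistral-7B-Instruct-v0.1, which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).
An improved instruction-tuned model, Mistral-7B-Instruct-v0.2, which improves upon v0.1.

The base model can be used as follows: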

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

>>> prompt = "My favourite condiment is"

>>> # device_map="auto" has already placed the model, so only the inputs need to be moved
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"My favourite condiment is to ..."

The instruction-tuned model can be used as follows:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

>>> messages = [
...     {"role": "user", "content": "What is your favourite condiment?"},
...     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
...     {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]

>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"Mayonnaise can be made as follows: (...)"

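As can be seen, the instruction-tuned model requires a chat template to be applied so that the inputs are prepared in the right format.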
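
To illustrate what a chat template does, here is a hand-written approximation of the Mistral instruct layout. The function `format_mistral_chat` is hypothetical and not part of Transformers, and the exact whitespace and special tokens are assumptions that can differ between checkpoint versions. The template shipped with the tokenizer remains the authoritative source.

```python
def format_mistral_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate the Mistral instruct prompt layout, for illustration only."""
    parts = [bos_token]
    for index, message in enumerate(messages):
        # Turns are expected to alternate, starting with the user.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Conversation roles must alternate user/assistant/...")
        if message["role"] == "user":
            parts.append("[INST] " + message["content"] + " [/INST]")
        else:
            parts.append(message["content"] + eos_token)
    return "".join(parts)


example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]
print(format_mistral_chat(example))
# <s>[INST] Hi [/INST]Hello</s>[INST] Bye [/INST]
```

To inspect the string the real template produces, call `tokenizer.apply_chat_template(messages, tokenize=False)` instead of relying on this sketch.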

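Speeding up Mistral by using Flash Attention

The code snippets above showcase inference without any optimization tricks. However, the model can be sped up considerably by using Flash Attention, a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2 so that the sliding window attention feature is included.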

pip install -U flash-attn --no-build-isolation

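Also make sure that your hardware is compatible with Flash Attention 2. More information is available in the official documentation of the flash attention repository. In addition, make sure to load your model in half precision (e.g. torch.float16).

To load and run a model using Flash Attention 2, refer to the snippet below: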

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

>>> prompt = "My favourite condiment is"

>>> # device_map="auto" has already placed the model, so only the inputs need to be moved
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"My favourite condiment is to (...)"

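Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers, using the mistralai/Mistral-7B-v0.1 checkpoint, and the Flash Attention 2 version of the model.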

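Sliding window attention

The current implementation supports the sliding window attention mechanism and memory-efficient cache management. To enable sliding window attention, just make sure to have a flash-attn version that is compatible with sliding window attention (>=2.3.0).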
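
The version requirement can be checked before loading the model. The sketch below uses only the standard library. The helper names are hypothetical, and looking the package up under the distribution name `flash-attn` is an assumption based on the pip command used earlier.

```python
from importlib import metadata
from typing import Optional


def parse_version(version: str) -> tuple:
    """Turn '2.5.9.post1' into (2, 5, 9), ignoring non-numeric suffixes."""
    numbers = []
    for piece in version.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        numbers.append(int(digits))
    # Pad so that '2.3' compares equal to '2.3.0'.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def supports_sliding_window(version: str, minimum: str = "2.3.0") -> bool:
    """True when the given flash-attn version meets the documented minimum."""
    return parse_version(version) >= parse_version(minimum)


def installed_flash_attn_version() -> Optional[str]:
    """Return the installed flash-attn version, or None if it is missing."""
    try:
        return metadata.version("flash-attn")
    except metadata.PackageNotFoundError:
        return None


version = installed_flash_attn_version()
# Falling back to "sdpa" is a choice made for this sketch, not a requirement.
attn_implementation = (
    "flash_attention_2" if version and supports_sliding_window(version) else "sdpa"
)
print(attn_implementation)
```

The resulting string can be passed as `attn_implementation` to `from_pretrained`.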

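The Flash Attention 2 model also uses a more memory-efficient cache slicing mechanism. As recommended by the official implementation of the Mistral model, which uses a rolling cache mechanism, we keep the cache size fixed (self.config.sliding_window), support batched generation only for padding_side="left", and use the absolute position of the current token to compute the positional embedding.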
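
A rolling cache can be pictured as a fixed-size buffer in which the entry for absolute position p is written to slot p % window, overwriting the oldest entry. The class below is a toy illustration under that assumption. `RollingCache` is a hypothetical name and does not mirror the actual cache classes in Transformers.

```python
class RollingCache:
    """Fixed-size buffer: the entry for absolute position p lives in slot p % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.next_position = 0

    def append(self, value):
        # Overwrite the oldest slot once the buffer is full.
        self.slots[self.next_position % self.window] = value
        self.next_position += 1

    def contents(self):
        """Return the cached values ordered from oldest to newest."""
        start = max(0, self.next_position - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_position)]


cache = RollingCache(window=4)
for token_position in range(6):
    cache.append(f"kv{token_position}")

print(cache.slots)       # ['kv4', 'kv5', 'kv2', 'kv3']
print(cache.contents())  # ['kv2', 'kv3', 'kv4', 'kv5']
```

The buffer never grows beyond the window, which is why the cache size stays fixed no matter how many tokens are generated.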

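Shrinking down Mistral using quantization

As the Mistral model has 7 billion parameters, it requires about 14GB of GPU memory in half precision (float16), since each parameter is stored in 2 bytes. However, the size of the model can be reduced using quantization. If the model is quantized to 4 bits (half a byte per parameter), it requires only about 3.5GB of memory.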
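
The arithmetic behind these figures is a single multiplication. The helper below is a hypothetical sketch that counts the weights only. It uses the round figure of 7 billion parameters from the text and ignores activations, the key/value cache and any quantization overhead.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


print(weight_memory_gb(7e9, 16))  # 14.0 (float16, 2 bytes per parameter)
print(weight_memory_gb(7e9, 4))   # 3.5  (4-bit, half a byte per parameter)
```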

Quantizing a model is as simple as passing a quantization_config to the model. Below, we use BitsAndBytes quantization (see this page for other quantization methods):

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

>>> # specify how to quantize the model
>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_quant_type="nf4",
...     bnb_4bit_compute_dtype=torch.float16,  # a torch dtype, not the string "torch.float16"
... )

>>> # pass the config object itself rather than a boolean
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=quantization_config, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

>>> messages = [
...     {"role": "user", "content": "What is your favourite condiment?"},
...     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
...     {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]

>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"The expected output"

This model was contributed by Younes Belkada and Arthur Zucker. The original code can be found here.

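Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mistral. If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.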

Text Generation

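A demo notebook to perform supervised fine-tuning (SFT) of Mistral-7B can be found here. 🌎
A blog post on how to fine-tune LLMs in 2024 using Hugging Face tooling. 🌎
The Alignment Handbook by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRa on a single GPU, and multi-GPU fine-tuning.
Causal language modeling task guide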

MistralConfig

class transformers.MistralConfig

MistralModel

class transformers.MistralModel

forward

MistralForCausalLM

class transformers.MistralForCausalLM

forward

MistralForSequenceClassification

class transformers.MistralForSequenceClassification

forward

MistralForTokenClassification

class transformers.MistralForTokenClassification

forward

MistralForQuestionAnswering

class transformers.MistralForQuestionAnswering

forward

FlaxMistralModel

class transformers.FlaxMistralModel

__call__

FlaxMistralForCausalLM

class transformers.FlaxMistralForCausalLM

__call__

TFMistralModel

class transformers.TFMistralModel

call

TFMistralForCausalLM

class transformers.TFMistralForCausalLM

call

TFMistralForSequenceClassification

class transformers.TFMistralForSequenceClassification

call
