Idefics2

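Overview

The Idefics2 model was proposed in What matters when building vision-language models? by Léo Tronchon, Hugo Laurençon, and Victor Sanh. The accompanying blog post can be found here.

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and processes images at their native aspect ratio and resolution, which allows for varying inference efficiency.

The abstract from the paper is the following:

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.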

Idefics2 architecture. Taken from the original paper.

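This model was contributed by amyeroberts. The original code can be found here.

Usage tips

Each sample can contain multiple images, and the number of images can vary between samples. The processor pads the inputs to the maximum number of images in a batch before they are passed to the model.
The processor has a do_image_splitting option. If True, each input image is split into 4 sub-images and concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure processor.image_processor.do_image_splitting is set to False if the model was not trained with this option.
The text passed to the processor should contain <image> tokens where the images should be inserted, and <end_of_utterance> at the end of each utterance if the text is a chat message.
The processor has its own apply_chat_template method to convert chat messages to text, which can then be passed as text to the processor.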
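
To make the do_image_splitting behaviour more concrete, here is a minimal sketch in plain Python. It assumes a simple quadrant split on a toy nested-list "image", and split_into_five is a hypothetical helper name; the real image processor works on arrays and may crop and resize differently.

```python
def split_into_five(image):
    """Return [top-left, top-right, bottom-left, bottom-right, original].

    `image` is a list of rows (a toy stand-in for an H x W array).
    Hypothetical helper for illustration, not the library implementation.
    """
    height, width = len(image), len(image[0])
    mid_h, mid_w = height // 2, width // 2
    quadrants = [
        [row[:mid_w] for row in image[:mid_h]],  # top-left
        [row[mid_w:] for row in image[:mid_h]],  # top-right
        [row[:mid_w] for row in image[mid_h:]],  # bottom-left
        [row[mid_w:] for row in image[mid_h:]],  # bottom-right
    ]
    # The original image is kept alongside the four sub-images.
    return quadrants + [image]


toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
pieces = split_into_five(toy_image)
print(len(pieces))  # 4 sub-images plus the original
print(pieces[0])    # top-left quadrant
```

Because one input image becomes five, enabling this option also multiplies the number of image tokens the model has to process.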

Example of how to use the processor on chat messages:

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)

# at inference time, one needs to pass `add_generation_prompt=True` in order to make sure the model completes the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)
print(text)
# 'User: What’s the difference between these two images?<image><image><end_of_utterance>\nAssistant:'

inputs = processor(images=images, text=text, return_tensors="pt").to(device)

generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_text, skip_special_tokens=True)[0]
print("Generated text:", generated_text)
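
The text layout printed by the example can be approximated without loading the processor. The sketch below is a plain-Python imitation of that layout, written only from the output shown in the example; render_chat is a hypothetical helper, and the processor's actual Jinja chat template may differ in edge cases.

```python
def render_chat(messages, add_generation_prompt=False):
    """Approximate the prompt layout shown in the example output.

    Hypothetical helper for illustration; not the library's chat template.
    """
    text = ""
    for message in messages:
        text += message["role"].capitalize() + ": "
        for part in message["content"]:
            if part["type"] == "text":
                text += part["text"]
            elif part["type"] == "image":
                text += "<image>"
        # Every utterance is closed with the end-of-utterance marker.
        text += "<end_of_utterance>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        text += "Assistant:"
    return text


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in these images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]
prompt = render_chat(messages, add_generation_prompt=True)
print(repr(prompt))
```

The number of <image> placeholders in the rendered text has to match the number of images passed to the processor for that sample.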

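During training, it is important to determine which tokens the model should not learn. For Idefics2, this typically comes down to the image and padding tokens. This means that one can create the labels as follows: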

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
},
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
    ],
}]

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)

text = processor.apply_chat_template(messages, add_generation_prompt=False)
inputs = processor(images=images, text=text, return_tensors="pt").to(device)

labels = inputs.input_ids.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == model.config.image_token_id] = -100

inputs["labels"] = labels

outputs = model(**inputs)
loss = outputs.loss
loss.backward()

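Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to -100.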
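
As a rough illustration of that masking, the sketch below builds labels from a toy token sequence. The token ids, the per-token roles list, and the mask_labels helper are all made up for this example; how you recover which tokens belong to user turns depends on your own tokenization pipeline.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_labels(input_ids, roles, pad_token_id, image_token_id):
    """Copy input_ids into labels, ignoring pad, image and user-turn tokens.

    `roles` holds one entry per token ("user" or "assistant").
    Hypothetical helper for illustration only.
    """
    labels = []
    for token_id, role in zip(input_ids, roles):
        if token_id in (pad_token_id, image_token_id) or role == "user":
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels


# Toy ids: 0 = padding, 9 = image token (both invented for this sketch).
input_ids = [5, 9, 9, 6, 7, 8, 0]
roles = ["user", "user", "user", "user", "assistant", "assistant", "assistant"]
labels = mask_labels(input_ids, roles, pad_token_id=0, image_token_id=9)
print(labels)
```

Only the assistant tokens keep their ids, so the loss is computed on the assistant's reply alone.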

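Model optimizations: Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging Flash Attention, which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.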

pip install -U flash-attn --no-build-isolation

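Make sure also that you have hardware that is compatible with Flash Attention 2. Read more about it in the official documentation of the flash attention repository. Make sure also to load your model in half-precision (e.g. torch.float16).

To load and run a model using Flash Attention 2, simply change the code snippet above with the following change: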

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,    
+    attn_implementation="flash_attention_2",
).to(device)

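Shrinking down Idefics2 using quantization

As the Idefics2 model has 8 billion parameters, it requires about 16GB of GPU memory in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink the model using quantization. If the model is quantized to 4 bits (half a byte per parameter), it requires only about 3.5GB of memory.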
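
The arithmetic behind these figures can be checked with a short back-of-the-envelope calculation. This is a naive estimate that assumes exactly 8 billion parameters and counts only the weights (no activations, KV cache, or quantization metadata), so it lands in the same ballpark as the numbers quoted above rather than matching them exactly.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 8_000_000_000  # assumed round figure for an "8B" model
half_precision_gb = weight_memory_gb(num_params, 16)
four_bit_gb = weight_memory_gb(num_params, 4)
print(half_precision_gb)  # float16: 2 bytes per parameter
print(four_bit_gb)        # 4-bit: half a byte per parameter
```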

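Quantizing a model is as simple as passing a quantization_config to the model. One can change the code snippet above with the changes below. We'll leverage the BitsAndBytes quantization (but refer to this page for other quantization methods):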

+ from transformers import BitsAndBytesConfig

+ quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.float16
+ )
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,    
+    quantization_config=quantization_config,
).to(device)

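Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

A notebook on how to fine-tune Idefics2 on a custom dataset using the Trainer can be found here. It supports both full fine-tuning as well as (quantized) LoRA.
A script on how to fine-tune Idefics2 using the TRL library can be found here.
A demo notebook on fine-tuning Idefics2 for JSON extraction use cases can be found here. 🌎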

Idefics2Config

class transformers.Idefics2Config

( use_cache = True image_token_id = 32001 tie_word_embeddings = False vision_config = None perceiver_config = None text_config = None **kwargs )

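Parameters

use_cache (bool, optional, defaults to True) — Whether or not the model should cache the key/value pairs of the attention mechanism.
image_token_id (int, optional, defaults to 32001) — The id of the "image" token.
tie_word_embeddings (bool, optional, defaults to False) — Whether or not to tie the word embeddings with the token embeddings.
vision_config (IdeficsVisionConfig or dict, optional) — Custom vision config or dict.
perceiver_config (IdeficsPerceiverConfig or dict, optional) — Custom perceiver config or dict.
text_config (MistralConfig or dict, optional) — Custom text config or dict for the text model.

This is the configuration class to store the configuration of an Idefics2Model. It is used to instantiate an Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the HuggingFaceM4/idefics2-8b architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: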

>>> from transformers import Idefics2Model, Idefics2Config
>>> # Initializing configuration
>>> configuration = Idefics2Config()
>>> # Initializing a model from the configuration
>>> model = Idefics2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Idefics2Model

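class transformers.Idefics2Model

( config: Idefics2Config )

Parameters

config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Idefics2 model consisting of a SigLIP vision encoder and a Mistral language decoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward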

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details (LlavaProcessor uses CLIPImageProcessor for processing images).
pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

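The Idefics2Model forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Inputs fed to the model can have an arbitrary number of images. To account for this, pixel_values fed to the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths), where max_num_images is the maximum number of images among the batch_size samples in the batch.

Padding images are not needed beyond padding the pixel_values at the entrance of the model. For efficiency, only the real images are passed through the vision_model's forward, and the padding images are discarded, i.e. pixel_values of size (image_batch_size, 3, height, width), where image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.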
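
The counting in this example can be reproduced in a few lines. The sketch below only does the bookkeeping (no tensors are involved), and count_image_slots is a hypothetical helper name used for illustration.

```python
def count_image_slots(num_images_per_sample):
    """Return (padded_slots, real_images) for a batch.

    padded_slots is batch_size * max_num_images, the number of image slots in
    the padded pixel_values; real_images is what reaches the vision model
    once padding images are discarded.
    """
    batch_size = len(num_images_per_sample)
    max_num_images = max(num_images_per_sample)
    padded_slots = batch_size * max_num_images
    real_images = sum(num_images_per_sample)
    return padded_slots, real_images


padded_slots, image_batch_size = count_image_slots([1, 3, 1, 2])
print(padded_slots)      # slots in the padded batch: 4 samples * 3 images
print(image_batch_size)  # real images forwarded through the vision model
```

In this batch, 5 of the 12 padded slots are padding images that never go through the vision encoder.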

Idefics2ForConditionalGeneration

class transformers.Idefics2ForConditionalGeneration

( config )

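Parameters

config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics2 model with a language modeling head. It is made up of a SigLIP vision encoder, with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward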

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None num_logits_to_keep: int = 0 ) → transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details (LlavaProcessor uses CLIPImageProcessor for processing images).
pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or model.image_token_id (where model is your instance of Idefics2ForConditionalGeneration). Tokens with indices set to model.image_token_id are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
num_logits_to_keep (int, optional) — Calculate logits for the last num_logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary size.

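Returns

transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Idefics2Config) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (keys and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver.

The Idefics2ForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: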

>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModelForVision2Seq
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
>>> model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b-base", device_map="auto")

>>> BAD_WORDS_IDS = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> EOS_WORDS_IDS = [processor.tokenizer.eos_token_id]

>>> # Create inputs
>>> prompts = [
...   "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
...   "In which city is that bridge located?<image>",
... ]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to("cuda")

>>> # Generate
>>> generated_ids = model.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=20)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts)
['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of New York, and more specifically the Statue of Liberty.\n\n', 'In which city is that bridge located?\n\nThe bridge is located in the city of Pittsburgh, Pennsylvania.\n\n\nThe bridge is']

Idefics2ImageProcessor

class transformers.Idefics2ImageProcessor

( do_convert_rgb: bool = True do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = True do_image_splitting: bool = False **kwargs )

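Parameters

do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB. This is useful if the input image is of a different format, e.g. RGBA. Only has an effect if the input image is in the PIL format.
do_resize (bool, optional, defaults to True) — Whether to resize the image. The longest edge of the image is resized to be <= size["longest_edge"], with the shortest edge resized to keep the input aspect ratio, with a minimum size of size["shortest_edge"].
size (Dict, optional) — Controls the size of the output image. This is a dictionary containing the keys "shortest_edge" and "longest_edge".
resample (Resampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use when resizing the image.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image. If set to True, the image is rescaled to have pixel values between 0 and 1.
rescale_factor (float, optional, defaults to 1/255) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. If set to True, the image is normalized to have a mean of image_mean and a standard deviation of image_std.
image_mean (float or List[float], optional, defaults to IDEFICS_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IDEFICS_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether or not to pad the images to the largest height and width in the batch and to the number of images per sample, such that the returned tensor is of shape (batch_size, max_num_images, num_channels, max_height, max_width).
do_image_splitting (bool, optional, defaults to False) — Whether to split the image into 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.

Constructs an Idefics image processor.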
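
The do_rescale and do_normalize steps amount to simple per-pixel arithmetic, sketched below for a single channel value. The mean and standard deviation of 0.5 are placeholder values chosen for this example; they are not claimed to be the values of IDEFICS_STANDARD_MEAN or IDEFICS_STANDARD_STD, and rescale_and_normalize is a hypothetical helper.

```python
def rescale_and_normalize(pixel, rescale_factor=1 / 255, mean=0.5, std=0.5):
    """Apply rescaling, then normalization, to one channel value.

    mean and std are placeholders for illustration only.
    """
    rescaled = pixel * rescale_factor  # maps the 0..255 range onto 0..1
    return (rescaled - mean) / std     # centers and scales the value


darkest = rescale_and_normalize(0)
brightest = rescale_and_normalize(255)
print(darkest, brightest)
```

With these placeholder statistics the 0..255 input range ends up roughly in -1..1; other mean and std values shift and scale that range accordingly.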

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] do_convert_rgb: typing.Optional[bool] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None do_image_splitting: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None input_data_format: typing.Optional[transformers.image_utils.ChannelDimension] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> )

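Parameters

images (ImageInput) — A list of images to preprocess.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
do_image_splitting (bool, optional, defaults to self.do_image_splitting) — Whether to split the image into 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess a batch of images.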
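
To illustrate what padding to the largest height and width in a batch involves, here is a toy sketch on single-channel nested lists. It also builds a mask marking real pixels, in the spirit of pixel_attention_mask. Zero-padding at the bottom and right is an assumption of this sketch, and pad_images is a hypothetical helper, not the library's implementation.

```python
def pad_images(images):
    """Pad 2D toy images (lists of rows) to a common height and width.

    Returns the padded images and, per image, a mask with 1 for real pixels
    and 0 for padding. Bottom/right zero-padding is assumed for illustration.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        padded.append([
            (img[r] + [0] * (max_w - w)) if r < h else [0] * max_w
            for r in range(max_h)
        ])
        masks.append([
            [1 if (r < h and c < w) else 0 for c in range(max_w)]
            for r in range(max_h)
        ])
    return padded, masks


images = [[[7, 7], [7, 7]], [[5, 5, 5]]]  # a 2x2 image and a 1x3 image
padded, masks = pad_images(images)
print(padded[1])
print(masks[1])
```

Every image in the batch ends up 2x3 here, which is what allows the batch to be stacked into a single tensor.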

Idefics2Processor

class transformers.Idefics2Processor

( image_processor tokenizer = None image_seq_len: int = 64 chat_template: str = None **kwargs )

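Parameters

image_processor (Idefics2ImageProcessor) — An instance of Idefics2ImageProcessor. The image processor is a required input.
tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model's text model. The tokenizer is a required input.
image_seq_len (int, optional, defaults to 64) — The length of the image sequence, i.e. the number of <image> tokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the config.perceiver_config.resampler_n_latents value for the model used.
chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

Constructs an IDEFICS2 processor which wraps a LLama tokenizer and an IDEFICS2 image processor into a single processor.

Idefics2Processor offers all the functionalities of Idefics2ImageProcessor and LlamaTokenizerFast. See the docstring of __call__() and decode() for more information.

__call__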

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')], typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]], typing.List[typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]]]] = None text: typing.Union[str, ForwardRef('PreTokenizedInput'), typing.List[str], typing.List[ForwardRef('PreTokenizedInput')]] = None audio = None videos = None **kwargs: typing_extensions.Unpack[transformers.models.idefics2.processing_idefics2.Idefics2ProcessorKwargs] )

Parameters

images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor], optional) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. If it is of type List[ImageInput], it is assumed that this is for a single prompt, i.e. of batch size 1.
text (Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
Wherever an image token <image> is encountered, it is expanded to <fake_token_around_image> + <image> * image_seq_len + <fake_token_around_image>.
return_tensors (Union[str, TensorType], optional) — If set, will return tensors of a particular framework. See PreTrainedTokenizerFast.__call__() for more information.

Processes the input prompts and returns a BatchEncoding.

Example:

>>> import requests
>>> from transformers import Idefics2Processor
>>> from transformers.image_utils import load_image

>>> processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", image_seq_len=2)
>>> processor.image_processor.do_image_splitting = False  # Force as False to simplify the example

>>> url1 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
>>> url2 = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

>>> image1, image2 = load_image(url1), load_image(url2)
>>> images = [[image1], [image2]]

>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(images=images, text=text, return_tensors="pt", padding=True)
>>> input_ids = outputs.input_ids
>>> input_tokens = processor.tokenizer.batch_decode(input_ids)
>>> print(input_tokens)
['<s><fake_token_around_image><image><image><fake_token_around_image> In this image, we see', '<s> bla bla bla<fake_token_around_image><image><image><fake_token_around_image>']
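
The expansion visible in this output can be imitated with plain string handling. The sketch below is an approximation inferred from the printed tokens: expand_image_tokens is a hypothetical helper, and it does not model tokenizer spacing or any special handling the processor may apply to consecutive images.

```python
def expand_image_tokens(text, image_seq_len,
                        image_token="<image>",
                        fake_token="<fake_token_around_image>"):
    """Replace each image placeholder with its wrapped, repeated form.

    Hypothetical helper for illustration; not the processor's own code.
    """
    expanded = fake_token + image_token * image_seq_len + fake_token
    # str.replace does not rescan the inserted text, so the <image> tokens
    # inside `expanded` are not expanded again.
    return text.replace(image_token, expanded)


first = expand_image_tokens("<image>In this image, we see", image_seq_len=2)
second = expand_image_tokens("bla bla bla<image>", image_seq_len=2)
print(first)
print(second)
```

With the default image_seq_len of 64, each placeholder would instead grow to 64 <image> tokens between the two surrounding markers.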
