Transformers documentation

Vision Encoder Decoder Models

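Overview

The VisionEncoderDecoderModel can be used to initialize an image-to-text model with any pretrained Transformer-based vision model as the encoder (e.g. ViT, BEiT, DeiT, Swin) and any pretrained language model as the decoder (e.g. RoBERTa, GPT2, BERT, DistilBERT).

The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for example) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.

After such a VisionEncoderDecoderModel has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples below for more information).

An example application is image captioning, in which the encoder encodes the image and an autoregressive language model then generates the caption. Another example is optical character recognition. Refer to TrOCR, which is an instance of VisionEncoderDecoderModel.

Randomly initializing VisionEncoderDecoderModel from model configurations

VisionEncoderDecoderModel can be randomly initialized from an encoder and a decoder config. In the following example, we show how to do this using the default ViTModel configuration for the encoder and the default BertForCausalLM configuration for the decoder.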

>>> from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

>>> config_encoder = ViTConfig()
>>> config_decoder = BertConfig()

>>> config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> model = VisionEncoderDecoderModel(config=config)

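Initializing VisionEncoderDecoderModel from a pretrained encoder and a pretrained decoder

VisionEncoderDecoderModel can be initialized from a pretrained encoder checkpoint and a pretrained decoder checkpoint. Note that any pretrained Transformer-based vision model, e.g. Swin, can serve as the encoder, and that pretrained auto-encoding models (e.g. BERT), pretrained causal language models (e.g. GPT2), and the pretrained decoder part of sequence-to-sequence models (e.g. the decoder of BART) can all be used as the decoder. Depending on which architecture you choose as the decoder, the cross-attention layers might be randomly initialized. Initializing VisionEncoderDecoderModel from pretrained encoder and decoder checkpoints requires the model to be fine-tuned on a downstream task, as shown in the Warm-starting-encoder-decoder blog post. To do so, the VisionEncoderDecoderModel class provides the VisionEncoderDecoderModel.from_encoder_decoder_pretrained() method.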

>>> from transformers import VisionEncoderDecoderModel

>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "microsoft/swin-base-patch4-window7-224-in22k", "google-bert/bert-base-uncased"
... )

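Loading an existing VisionEncoderDecoderModel checkpoint and performing inference

To load fine-tuned checkpoints of the VisionEncoderDecoderModel class, VisionEncoderDecoderModel provides the from_pretrained(...) method just like any other model architecture in Transformers.

To perform inference, use the generate method, which generates text autoregressively. This method supports various forms of decoding, such as greedy decoding, beam search and multinomial sampling.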

>>> import requests
>>> from PIL import Image

>>> from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel

>>> # load a fine-tuned image captioning model and corresponding tokenizer and image processor
>>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
>>> tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
>>> image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

>>> # let's perform inference on an image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

>>> # autoregressively generate caption (uses greedy decoding by default)
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> print(generated_text)
a cat laying on a blanket next to a cat laying on a bed
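
To make the default greedy strategy concrete, here is a minimal sketch that does not use the library at all. The scoring table, the token ids and the helper names (toy_next_token_scores, greedy_decode) are hypothetical stand-ins for a real decoder step; generate itself also handles batching, caching and stopping criteria that are left out here.

```python
from typing import Callable, Dict, List

BOS, EOS = 0, 3  # hypothetical start and end token ids


def toy_next_token_scores(prefix: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in for one decoder step: score each candidate next token."""
    table = {
        0: {1: 0.9, 2: 0.1},
        1: {2: 0.6, 3: 0.4},
        2: {3: 0.8, 1: 0.2},
    }
    return table[prefix[-1]]


def greedy_decode(
    score_fn: Callable[[List[int]], Dict[int, float]],
    start_id: int,
    eos_id: int,
    max_length: int = 10,
) -> List[int]:
    """Repeatedly append the highest-scoring token until EOS or max_length."""
    sequence = [start_id]
    while len(sequence) < max_length:
        scores = score_fn(sequence)
        next_id = max(scores, key=scores.get)  # greedy choice: argmax over candidates
        sequence.append(next_id)
        if next_id == eos_id:
            break
    return sequence


greedy_ids = greedy_decode(toy_next_token_scores, BOS, EOS)
print(greedy_ids)
```

Beam search and sampling differ only in how the next token is picked at each step: beam search keeps several candidate prefixes, while sampling draws from the score distribution instead of taking the maximum.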

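Loading a PyTorch checkpoint into TFVisionEncoderDecoderModel

TFVisionEncoderDecoderModel.from_pretrained() currently doesn't support initializing the model from a PyTorch checkpoint. Passing from_pt=True to this method will throw an exception. If there are only PyTorch checkpoints for a particular vision encoder-decoder model, a workaround is: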

>>> from transformers import VisionEncoderDecoderModel, TFVisionEncoderDecoderModel

>>> _model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

>>> _model.encoder.save_pretrained("./encoder")
>>> _model.decoder.save_pretrained("./decoder")

>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
... )
>>> # This is only for copying some specific attributes of this particular model.
>>> model.config = _model.config

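Training

Once the model is created, it can be fine-tuned on a dataset of (image, text) pairs, similar to BART, T5 or any other encoder-decoder model. As you can see, only 2 inputs are required for the model to compute a loss: pixel_values (the images) and labels (the input_ids of the encoded target sequence).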

>>> from transformers import ViTImageProcessor, BertTokenizer, VisionEncoderDecoderModel
>>> from datasets import load_dataset

>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "google-bert/bert-base-uncased"
... )

>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

>>> labels = tokenizer(
...     "an image of two cats chilling on a couch",
...     return_tensors="pt",
... ).input_ids

>>> # the forward function automatically creates the correct decoder_input_ids
>>> loss = model(pixel_values=pixel_values, labels=labels).loss

This model was contributed by nielsr. The TensorFlow and Flax versions of this model were contributed by ydshieh.

VisionEncoderDecoderConfig

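class transformers.VisionEncoderDecoderConfig

( **kwargs )

Parameters

kwargs (optional) — Dictionary of keyword arguments. Notably:
- encoder (PretrainedConfig, optional) — An instance of a configuration object that defines the encoder config.
- decoder (PretrainedConfig, optional) — An instance of a configuration object that defines the decoder config.

VisionEncoderDecoderConfig is the configuration class to store the configuration of a VisionEncoderDecoderModel. It is used to instantiate a Vision-Encoder-Text-Decoder model according to the specified arguments, defining the encoder and decoder configs.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Examples: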

>>> from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

>>> # Initializing a ViT & BERT style configuration
>>> config_encoder = ViTConfig()
>>> config_decoder = BertConfig()

>>> config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

>>> # Initializing a ViTBert model (with random weights) from a ViT & google-bert/bert-base-uncased style configurations
>>> model = VisionEncoderDecoderModel(config=config)

>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder = model.config.decoder
>>> # set decoder config to causal lm
>>> config_decoder.is_decoder = True
>>> config_decoder.add_cross_attention = True

>>> # Saving the model, including its configuration
>>> model.save_pretrained("my-model")

>>> # loading model and config from pretrained folder
>>> encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
>>> model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)

from_encoder_decoder_configs

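( encoder_config: PretrainedConfig decoder_config: PretrainedConfig **kwargs ) → VisionEncoderDecoderConfig

Returns

VisionEncoderDecoderConfig

An instance of a configuration object.

Instantiate a VisionEncoderDecoderConfig (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.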

PyTorch

VisionEncoderDecoderModel

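class transformers.VisionEncoderDecoderModel

( config: typing.Optional[transformers.configuration_utils.PretrainedConfig] = None encoder: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None decoder: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None )

Parameters

config (VisionEncoderDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class can be used to initialize an image-to-text-sequence model with any pretrained vision autoencoding model as the encoder and any pretrained text autoregressive model as the decoder. Both the encoder and the decoder are loaded via their from_pretrained() functions. Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning.

The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks was shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu.

Additionally, TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models shows how leveraging large pretrained vision models for optical character recognition (OCR) yields a significant performance improvement.

After such a Vision-Encoder-Text-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

VisionEncoderDecoderModel is a generic model class that will be instantiated as a transformer architecture with one of the base vision model classes of the library as the encoder and another base model class as the decoder, when created with the AutoModel.from_pretrained() class method for the encoder and the AutoModelForCausalLM.from_pretrained() class method for the decoder.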

forward

( pixel_values: typing.Optional[torch.FloatTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.BoolTensor] = None encoder_outputs: typing.Optional[typing.Tuple[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

Parameters

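pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using an image processor (e.g. if you use ViT as the encoder, you should use AutoImageProcessor). See ViTImageProcessor.__call__() for details.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

For training, decoder_input_ids are automatically created by the model by shifting the labels to the right, replacing -100 by the pad_token_id and prepending them with the decoder_start_token_id.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default.
encoder_outputs (tuple(torch.FloatTensor), optional) — This tuple must consist of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) is a tensor of hidden states at the output of the last layer of the encoder. It is used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss for the decoder. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — If set to True, the model will return a Seq2SeqLMOutput instead of a plain tuple.
kwargs (optional) — Remaining dictionary of keyword arguments. Keyword arguments come in two flavors:
- Without a prefix, which will be input as **encoder_kwargs for the encoder forward function.
- With a decoder_ prefix, which will be input as **decoder_kwargs for the decoder forward function.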
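
The label shifting described for decoder_input_ids can be sketched in plain Python. This is an illustrative reimplementation on nested lists, not the library's own tensor code; the helper name shift_tokens_right and the token ids used below are assumptions chosen for the example.

```python
from typing import List


def shift_tokens_right(
    labels: List[List[int]], pad_token_id: int, decoder_start_token_id: int
) -> List[List[int]]:
    """Build decoder inputs from labels: shift right, prepend start token, replace -100."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the last position so the length is unchanged.
        new_row = [decoder_start_token_id] + list(row[:-1])
        # -100 only marks ignored label positions; it is not a valid token id,
        # so it is replaced by the pad token before the embedding lookup.
        new_row = [pad_token_id if tok == -100 else tok for tok in new_row]
        shifted.append(new_row)
    return shifted


# Illustrative ids: 101 as start token, 0 as pad token, -100 as ignored label.
labels = [[101, 7592, 2088, 102, -100, -100]]
decoder_input_ids = shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=101)
print(decoder_input_ids)
```

At position i the decoder therefore sees the target tokens up to i - 1 and is trained to predict labels[i], which is why only pixel_values and labels need to be passed during training.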

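Returns

transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.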
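
Why cached keys and values allow feeding only the last token can be seen in a toy, single-head, scalar "attention". Everything below is a hypothetical illustration in plain Python (the project and attend helpers and the dict-based cache are assumptions, not the library's data structures): the cached path computes the key and value of the newest token only, yet produces the same outputs as recomputing the whole prefix.

```python
import math
from typing import Dict, List, Sequence, Tuple


def project(token_id: int) -> Tuple[float, float]:
    """Hypothetical key/value projection of a single token."""
    return 0.1 * token_id, float(token_id)


def attend(query: float, keys: Sequence[float], values: Sequence[float]) -> float:
    """Softmax-weighted average of the values, weighted by query-key scores."""
    weights = [math.exp(query * k) for k in keys]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def step_without_cache(token_ids: List[int]) -> float:
    # Recompute keys and values for the whole prefix at every step.
    keys, values = zip(*(project(t) for t in token_ids))
    return attend(0.1 * token_ids[-1], keys, values)


def step_with_cache(last_token_id: int, cache: Dict[str, List[float]]) -> float:
    # Only the newest token is projected; earlier keys/values come from the cache.
    key, value = project(last_token_id)
    cache["keys"].append(key)
    cache["values"].append(value)
    return attend(0.1 * last_token_id, cache["keys"], cache["values"])


tokens = [5, 2, 7, 1]
cache = {"keys": [], "values": []}
cached_outputs = [step_with_cache(t, cache) for t in tokens]
full_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
print(cached_outputs)
```

The cached variant does a constant amount of projection work per step, while the uncached variant repeats it for every token of the prefix.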

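The VisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: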

>>> from transformers import AutoProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> import torch

>>> processor = AutoProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> # training
>>> model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
>>> model.config.pad_token_id = processor.tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> text = "hello world"
>>> labels = processor.tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model(pixel_values=pixel_values, labels=labels)
>>> loss = outputs.loss

>>> # inference (generation)
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

from_encoder_decoder_pretrained

( encoder_pretrained_model_name_or_path: str = None decoder_pretrained_model_name_or_path: str = None *model_args **kwargs )

Parameters

encoder_pretrained_model_name_or_path (str, optional) — Information necessary to initiate the image encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g, ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as config argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
decoder_pretrained_model_name_or_path (str, optional, defaults to None) — Information necessary to initiate the text decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g, ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as config argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
model_args (remaining positional arguments, optional) — All remaining positional arguments will be passed to the underlying model's __init__ method.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
Behaves differently depending on whether a config is provided or automatically loaded.
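
The prefix convention for kwargs can be pictured as a simple routing step. The helper below is a hypothetical sketch written for this page, not the library's implementation, and the example parameter names are only illustrative: keys starting with encoder_ or decoder_ are stripped of their prefix and sent to the matching sub-configuration, and everything else goes to the parent configuration.

```python
from typing import Any, Dict, Tuple


def split_prefixed_kwargs(
    kwargs: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Route keyword arguments by prefix: encoder_, decoder_, or no prefix (parent)."""
    encoder_kwargs: Dict[str, Any] = {}
    decoder_kwargs: Dict[str, Any] = {}
    parent_kwargs: Dict[str, Any] = {}
    for key, value in kwargs.items():
        if key.startswith("encoder_"):
            encoder_kwargs[key[len("encoder_"):]] = value
        elif key.startswith("decoder_"):
            decoder_kwargs[key[len("decoder_"):]] = value
        else:
            parent_kwargs[key] = value
    return encoder_kwargs, decoder_kwargs, parent_kwargs


enc, dec, parent = split_prefixed_kwargs(
    {
        "encoder_hidden_dropout_prob": 0.1,  # would update the encoder config
        "decoder_output_attentions": True,   # would update the decoder config
        "tie_word_embeddings": False,        # no prefix: parent model config
    }
)
print(enc, dec, parent)
```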

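Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model checkpoints.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you need to first set it back in training mode with model.train().

Example: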

>>> from transformers import VisionEncoderDecoderModel

>>> # initialize a vit-bert from a pretrained ViT and a pretrained BERT model. Note that the cross-attention layers will be randomly initialized
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "google-bert/bert-base-uncased"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-bert")
>>> # load fine-tuned model
>>> model = VisionEncoderDecoderModel.from_pretrained("./vit-bert")

TensorFlow


TFVisionEncoderDecoderModel

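class transformers.TFVisionEncoderDecoderModel

( config: Optional[PretrainedConfig] = None encoder: Optional[TFPreTrainedModel] = None decoder: Optional[TFPreTrainedModel] = None )

Parameters

config (VisionEncoderDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

After such a Vision-Encoder-Text-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TFVisionEncoderDecoderModel is a generic model class that will be instantiated as a transformer architecture with one of the base vision model classes of the library as the encoder and another base model class as the decoder, when created with the from_pretrained() class method for the encoder and the decoder.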

call

( pixel_values: np.ndarray | tf.Tensor | None = None decoder_input_ids: np.ndarray | tf.Tensor | None = None decoder_attention_mask: np.ndarray | tf.Tensor | None = None encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None labels: np.ndarray | tf.Tensor | None = None use_cache: Optional[bool] = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False **kwargs ) → transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)

Parameters

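pixel_values (np.ndarray, tf.Tensor, List[tf.Tensor], Dict[str, tf.Tensor] or Dict[str, np.ndarray], and each example must have the shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using the vision model's image processor, for example AutoImageProcessor. See ViTImageProcessor.__call__() for details.
decoder_input_ids (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

Provide these for sequence-to-sequence training of the decoder. Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
decoder_attention_mask (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default.
encoder_outputs (tuple(tuple(tf.Tensor)), optional) — This tuple must consist of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) is a tensor of hidden states at the output of the last layer of the encoder. It is used in the cross-attention of the decoder.
past_key_values (tuple(tuple(tf.Tensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
decoder_inputs_embeds (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix.
labels (np.ndarray or tf.Tensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss for the decoder. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — If set to True, the model will return a TFSeq2SeqLMOutput instead of a plain tuple.
training (bool, optional, defaults to False) — Whether or not to use the model in training mode (some modules like dropout modules have different behaviors between training and evaluation).
kwargs (optional) — Remaining dictionary of keyword arguments. Keyword arguments come in two flavors:
- Without a prefix, which will be input as **encoder_kwargs for the encoder forward function.
- With a decoder_ prefix, which will be input as **decoder_kwargs for the decoder forward function.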
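
The effect of the -100 label value on the loss can be sketched in plain Python. This is a toy, framework-free illustration with made-up logits, not the TensorFlow code the model runs; the helper name masked_token_losses is an assumption. It computes one cross-entropy term per non-masked label and simply skips the masked positions.

```python
import math
from typing import List


def masked_token_losses(
    logits: List[List[float]], labels: List[int], ignore_index: int = -100
) -> List[float]:
    """Per-token cross-entropy, skipping positions whose label equals ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary scores.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    return losses


# Three target positions over a toy vocabulary of size 3; the middle one is masked.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, -100, 1]
per_token = masked_token_losses(logits, labels)
mean_loss = sum(per_token) / len(per_token)
print(per_token, mean_loss)
```

Averaging the per-token values, as done for mean_loss here, is one common way to reduce them to a single scalar for optimization.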

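Returns

transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the attention blocks) of the decoder that can be used (see the past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.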

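The TFVisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: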

>>> from transformers import AutoImageProcessor, AutoTokenizer, TFVisionEncoderDecoderModel
>>> from PIL import Image
>>> import requests

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> decoder_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

>>> # initialize a vit-gpt2 from pretrained ViT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "openai-community/gpt2"
... )

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> img = Image.open(requests.get(url, stream=True).raw)

>>> # forward
>>> pixel_values = image_processor(images=img, return_tensors="tf").pixel_values  # Batch size 1
>>> decoder_input_ids = decoder_tokenizer("Linda Davis", return_tensors="tf").input_ids  # Batch size 1
>>> outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

>>> # training
>>> outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids, labels=decoder_input_ids)
>>> loss, logits = outputs.loss, outputs.logits

>>> # save and load from pretrained
>>> model.save_pretrained("vit-gpt2")
>>> model = TFVisionEncoderDecoderModel.from_pretrained("vit-gpt2")

>>> # generation
>>> generated = model.generate(pixel_values, decoder_start_token_id=model.config.decoder.bos_token_id)

from_encoder_decoder_pretrained

( encoder_pretrained_model_name_or_path: str = None decoder_pretrained_model_name_or_path: str = None *model_args **kwargs )

Parameters

encoder_pretrained_model_name_or_path (str, optional) — Information necessary to initiate the encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a pytorch index checkpoint file (e.g, ./pt_model/). In this case, encoder_from_pt should be set to True.
decoder_pretrained_model_name_or_path (str, optional, defaults to None) — Information necessary to initiate the decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a pytorch checkpoint file (e.g, ./pt_model/). In this case, decoder_from_pt should be set to True.
model_args (remaining positional arguments, optional) — All remaining positional arguments will be passed to the underlying model's __init__ method.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
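Behaves differently depending on whether a config is provided or automatically loaded.

Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model checkpoints.

Example: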

>>> from transformers import TFVisionEncoderDecoderModel

>>> # initialize a vit-bert from a pretrained ViT and a pretrained BERT model. Note that the cross-attention layers will be randomly initialized
>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "google-bert/bert-base-uncased"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-bert")
>>> # load fine-tuned model
>>> model = TFVisionEncoderDecoderModel.from_pretrained("./vit-bert")

JAX


FlaxVisionEncoderDecoderModel

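class transformers.FlaxVisionEncoderDecoderModel

( config: VisionEncoderDecoderConfig input_shape: typing.Optional[typing.Tuple] = None seed: int = 0 dtype: dtype = jax.numpy.float32 _do_init: bool = True **kwargs )

Parameters

config (VisionEncoderDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype.

Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

After such a Vision-Encoder-Text-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.

FlaxVisionEncoderDecoderModel is a generic model class that will be instantiated as a transformer architecture with the module (flax.nn.Module) of one of the base vision model classes of the library as the encoder module and the module of another base model class as the decoder module, when created with the FlaxAutoModel.from_pretrained() class method for the encoder and the FlaxAutoModelForCausalLM.from_pretrained() class method for the decoder.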

__call__

( pixel_values: Array decoder_input_ids: typing.Optional[jax.Array] = None decoder_attention_mask: typing.Optional[jax.Array] = None decoder_position_ids: typing.Optional[jax.Array] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None train: bool = False params: dict = None dropout_rng = None ) → transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(torch.FloatTensor)

Parameters

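pixel_values (jnp.ndarray of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using the vision model's image processor, for example AutoImageProcessor. See ViTImageProcessor.__call__() for details.
decoder_input_ids (jnp.ndarray of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are decoder input IDs?
decoder_attention_mask (jnp.ndarray of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default.
decoder_position_ids (jnp.ndarray of shape (batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence token in the position embeddings. Selected in the range [0, config.decoder.max_position_embeddings - 1].
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — If set to True, the model will return a FlaxSeq2SeqLMOutput instead of a plain tuple.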

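Returns

transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.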

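The FlaxVisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: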

>>> from transformers import FlaxVisionEncoderDecoderModel, AutoImageProcessor, AutoTokenizer
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

>>> # load output tokenizer
>>> tokenizer_output = AutoTokenizer.from_pretrained("openai-community/gpt2")

>>> # initialize a vit-gpt2 from pretrained ViT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "openai-community/gpt2"
... )

>>> pixel_values = image_processor(images=image, return_tensors="np").pixel_values

>>> # use GPT2's eos_token as the pad as well as eos token
>>> model.config.eos_token_id = model.config.decoder.eos_token_id
>>> model.config.pad_token_id = model.config.eos_token_id

>>> # generation
>>> sequences = model.generate(pixel_values, num_beams=4, max_length=12).sequences

>>> captions = tokenizer_output.batch_decode(sequences, skip_special_tokens=True)

from_encoder_decoder_pretrained

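( encoder_pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] = None decoder_pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] = None *model_args **kwargs )

Parameters

encoder_pretrained_model_name_or_path (Union[str, os.PathLike], optional) — Information necessary to initiate the encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
decoder_pretrained_model_name_or_path (Union[str, os.PathLike], optional, defaults to None) — Information necessary to initiate the decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
model_args (remaining positional arguments, optional) — All remaining positional arguments will be passed to the underlying model's __init__ method.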
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
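Behaves differently depending on whether a config is provided or automatically loaded.

Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model checkpoints.

Example: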

>>> from transformers import FlaxVisionEncoderDecoderModel

>>> # initialize a vit-gpt2 from a pretrained ViT and a pretrained GPT2 model. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "openai-community/gpt2"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-gpt2")
>>> # load fine-tuned model
>>> model = FlaxVisionEncoderDecoderModel.from_pretrained("./vit-gpt2")
