Qwen2-VL

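Overview

The Qwen2-VL model is a major update to Qwen-VL from the Qwen team at Alibaba Research.

The abstract from the blog is the following:

This blog introduces Qwen2-VL, an advanced version of the Qwen-VL model that has undergone significant enhancements over the past year. Key improvements include enhanced image comprehension, advanced video understanding, integrated visual agent functionality, and expanded multilingual support. The model architecture has been optimized for handling arbitrary image resolutions through Naive Dynamic Resolution support and utilizes Multimodal Rotary Position Embedding (M-ROPE) to effectively process both 1D textual and multi-dimensional visual data. This updated model demonstrates competitive performance against leading AI systems like GPT-4o and Claude 3.5 Sonnet in vision-related tasks and ranks highly among open-source models in text capabilities. These advancements make Qwen2-VL a versatile tool for various applications requiring robust multimodal processing and reasoning abilities.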

Qwen2-VL architecture. Taken from the blog post.

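This model was contributed by simonJJJ.

Usage example

Single Media inference

The model can accept both images and videos as input. Here is example code for inference.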


from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
            },
            {
                "type":"text",
                "text":"Describe this image."
            }
        ]
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)

# Video
def fetch_video(ele: Dict, nframe_factor=2):
    if isinstance(ele['video'], str):
        def round_by_factor(number: int, factor: int) -> int:
            return round(number / factor) * factor

        video = ele["video"]
        if video.startswith("file://"):
            video = video[7:]

        video, _, info = io.read_video(
            video,
            start_pts=ele.get("video_start", 0.0),
            end_pts=ele.get("video_end", None),
            pts_unit="sec",
            output_format="TCHW",
        )
        assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
        if "nframes" in ele:
            nframes = round_by_factor(ele["nframes"], nframe_factor)
        else:
            fps = ele.get("fps", 1.0)
            nframes = round_by_factor(video.size(0) / info["video_fps"] * fps, nframe_factor)
        idx = torch.linspace(0, video.size(0) - 1, nframes, dtype=torch.int64)
        return video[idx]

video_info = {"type": "video", "video": "/path/to/video.mp4", "fps": 1.0}
video = fetch_video(video_info)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
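
The frame-sampling arithmetic inside fetch_video can be checked without decoding a real video. The sketch below reimplements only the index computation in plain Python, assuming a hypothetical clip of 250 frames at 25 fps. `sample_frame_indices` is an illustrative name, and the evenly spaced indices are computed with integer division, which approximates what `torch.linspace` with an integer dtype produces rather than reproducing it exactly.

```python
def round_by_factor(number: float, factor: int) -> int:
    """Round `number` to the nearest multiple of `factor`."""
    return round(number / factor) * factor


def sample_frame_indices(total_frames: int, video_fps: float, fps: float = 1.0, nframe_factor: int = 2):
    """Pick evenly spaced frame indices, mirroring the `fps` branch of fetch_video."""
    # Clip duration (seconds) times the target sampling rate gives the frame
    # budget, which is then rounded to a multiple of nframe_factor.
    nframes = round_by_factor(total_frames / video_fps * fps, nframe_factor)
    if nframes < 2:
        return [0] * nframes
    # Spread nframes indices from the first frame to the last one.
    return [i * (total_frames - 1) // (nframes - 1) for i in range(nframes)]


# Hypothetical clip: 250 frames recorded at 25 fps, sampled at 1 frame per second.
indices = sample_frame_indices(total_frames=250, video_fps=25.0, fps=1.0)
print(len(indices), indices[0], indices[-1])  # 10 0 249
```

Raising `fps` in the video dictionary increases the number of sampled frames, and with it the number of visual tokens the model has to process.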

Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.

image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")
image3 = Image.open("/path/to/image3.jpg")
image4 = Image.open("/path/to/image4.jpg")
image5 = Image.open("/path/to/image5.jpg")
video = fetch_video({
    "type": "video",
    "video": "/path/to/video.mp4",
    "fps": 1.0
})

# Conversation for the first image
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

# Conversation with two images
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "user",
        "content": "who are you?"
    }
]


# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "What are the common elements in these medias?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]
# Preparation for batch inference
texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
inputs = processor(
    text=texts,
    images=[image1, image2, image3, image4, image5],
    videos=[video],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to('cuda')

# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
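
In the batch example, images and videos are passed as flat lists, and the five images line up with the image placeholders across the four conversations (1 + 2 + 0 + 2). A small check like the one below can catch a mismatch before the processor is called. `count_placeholders` is a hypothetical helper written for this illustration and is not part of Transformers.

```python
from collections import Counter


def count_placeholders(conversations):
    """Count image and video placeholders across a batch of conversations."""
    counts = Counter()
    for conversation in conversations:
        for message in conversation:
            content = message["content"]
            if isinstance(content, str):  # pure-text turn, no media
                continue
            for item in content:
                if item["type"] in ("image", "video"):
                    counts[item["type"]] += 1
    return counts


# Same structure as conversation1..conversation4 in the batch example.
batch = [
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}],
    [{"role": "user", "content": [{"type": "image"}, {"type": "image"},
                                  {"type": "text", "text": "What is written in the pictures?"}]}],
    [{"role": "user", "content": "who are you?"}],
    [{"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "video"},
                                  {"type": "text", "text": "What are the common elements in these medias?"}]}],
]
counts = count_placeholders(batch)
print(counts["image"], counts["video"])  # 5 1
```

Comparing these counts with `len(images)` and `len(videos)` is a cheap sanity check for batched inputs.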

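Usage Tips

Image Resolution trade-off

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the configuration that best fits their needs.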

min_pixels = 224*224
max_pixels = 2048*2048
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

In case of limited GPU RAM, one can reduce the resolution as follows:

min_pixels = 256*28*28
max_pixels = 1024*28*28 
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

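This ensures each image gets encoded using between 256 and 1024 tokens. The 28 comes from the model's patch size of 14 combined with a merge size of 2 (14 x 2 = 28), so each visual token covers a 28 x 28 pixel area.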
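
To make the pixel budget concrete, the following sketch estimates how many visual tokens an image of a given size would occupy under the 256 to 1024 token setting. It is an approximation written for illustration, assuming the image is snapped to multiples of 28 pixels and scaled to fit between `min_pixels` and `max_pixels`; `estimate_visual_tokens` is a hypothetical helper, not the library's resizing routine, and extreme aspect ratios are not handled.

```python
import math


def estimate_visual_tokens(height, width, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28, factor=28):
    """Roughly estimate how many visual tokens an image occupies."""
    # Snap both sides to multiples of `factor` (at least one unit each).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink, rounding down so the budget is never exceeded.
        beta = math.sqrt(height * width / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge, rounding up so the minimum is always reached.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    # One token per factor x factor pixel area.
    return (h_bar // factor) * (w_bar // factor)


print(estimate_visual_tokens(448, 448))    # 256, exactly the lower bound
print(estimate_visual_tokens(1080, 1920))  # scaled down to fit under 1024 tokens
print(estimate_visual_tokens(100, 100))    # scaled up to reach at least 256 tokens
```

Lowering `max_pixels` reduces the token count per image, which is the main lever for trading accuracy against GPU memory.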

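Multiple Image Inputs

By default, images and video content are directly included in the conversation. When handling multiple images, it is helpful to add labels to the images and videos so they can be referred to unambiguously. Users can control this behavior with the following settings: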

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"}, 
            {"type": "text", "text": "Hello, how are you?"}
        ]
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"}, 
            {"type": "image"}, 
            {"type": "image"}, 
            {"type": "video"}, 
            {"type": "text", "text": "These are from my vacation."}
        ]
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?"
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?"
    }
]

# default:
prompt_without_id = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'


# add ids
prompt_with_id = processor.apply_chat_template(conversation, add_generation_prompt=True, add_vision_id=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
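
The numbering in the `add_vision_id=True` output follows a simple pattern: pictures and videos are counted separately, and the counters keep running across turns. The sketch below reproduces just that numbering in plain Python. `vision_labels` is an illustrative helper and not the chat template itself.

```python
def vision_labels(conversation):
    """Return the labels assigned to vision items, in order of appearance."""
    image_count, video_count = 0, 0
    labels = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, str):  # text-only turn
            continue
        for item in content:
            if item["type"] == "image":
                image_count += 1
                labels.append(f"Picture {image_count}")
            elif item["type"] == "video":
                video_count += 1
                labels.append(f"Video {video_count}")
    return labels


# Media layout of the conversation above, with the text items shortened.
demo_conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}]},
    {"role": "assistant", "content": "I'm doing well, thank you for asking."},
    {"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "video"}]},
]
print(vision_labels(demo_conversation))
# ['Picture 1', 'Picture 2', 'Picture 3', 'Video 1']
```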

Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

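Also, you should have hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.

To load and run a model using Flash Attention-2, simply add attn_implementation="flash_attention_2" when loading the model as follows: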

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2",
)
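
On machines where the flash-attn package may not be installed, one option is to pick the attention backend at runtime. The sketch below is an assumption-laden example rather than a recommended recipe: `pick_attn_implementation` is a hypothetical helper, and falling back to "sdpa" (the name Transformers uses for PyTorch's scaled dot-product attention backend) is a choice made here for illustration. The check only tells you whether the package can be imported; it says nothing about whether the GPU is supported.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
# The result can be passed as attn_implementation=... to from_pretrained.
```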

Qwen2VLConfig

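class transformers.Qwen2VLConfig

( vocab_size = 152064 hidden_size = 8192 intermediate_size = 29568 num_hidden_layers = 80 num_attention_heads = 64 num_key_value_heads = 8 hidden_act = 'silu' max_position_embeddings = 32768 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True tie_word_embeddings = False rope_theta = 1000000.0 use_sliding_window = False sliding_window = 4096 max_window_layers = 80 attention_dropout = 0.0 vision_config = None rope_scaling = None **kwargs )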

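Parameters

vocab_size (int, optional, defaults to 152064) — Vocabulary size of the Qwen2VL model. Defines the number of different tokens that can be represented by the input_ids passed when calling Qwen2VLModel.
hidden_size (int, optional, defaults to 8192) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 29568) — Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 80) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer encoder.
num_key_value_heads (int, optional, defaults to 8) — The number of key-value heads used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, the default of 8 is used.
hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 32768) — The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
tie_word_embeddings (bool, optional, defaults to False) — Whether the model's input and output word embeddings should be tied.
rope_theta (float, optional, defaults to 1000000.0) — The base period of the RoPE embeddings.
use_sliding_window (bool, optional, defaults to False) — Whether to use sliding window attention.
sliding_window (int, optional, defaults to 4096) — Sliding window attention (SWA) window size. If not specified, will default to 4096.
max_window_layers (int, optional, defaults to 80) — The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top layers use full attention.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
vision_config (Dict, optional) — The config for the visual encoder initialization.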
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and you expect the model to work on longer max_position_embeddings, we recommend you update this value accordingly. Expected contents:
- rope_type (str): The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', 'llama3'], with 'default' being the original RoPE implementation.
- factor (float, optional): Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.
- original_max_position_embeddings (int, optional): Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during pretraining.
- attention_factor (float, optional): Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the factor field to infer the suggested value.
- beta_fast (float, optional): Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
- beta_slow (float, optional): Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
- short_factor (List[float], optional): Only used with 'longrope'. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
- long_factor (List[float], optional): Only used with 'longrope'. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
- low_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE.
- high_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE.
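
The relationship between num_attention_heads and num_key_value_heads described in the parameter list can be expressed in a few lines. The sketch below classifies a head configuration as MHA, MQA, or GQA and maps a query head to the key-value head it shares. Both function names are illustrative, and the contiguous grouping (consecutive query heads share one key-value head) is an assumption about the usual convention, not something stated in this document.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention layout implied by the two head counts."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def kv_head_for_query_head(query_head: int, num_attention_heads: int, num_key_value_heads: int) -> int:
    """Map a query head index to its key-value head, assuming contiguous groups."""
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size


# Defaults from the signature above: 64 query heads, 8 key-value heads.
print(attention_variant(64, 8))           # GQA
print(kv_head_for_query_head(63, 64, 8))  # 7
```

With the defaults, each key-value head serves a group of 8 query heads, which shrinks the key-value cache by the same factor compared with MHA.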

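This is the configuration class to store the configuration of a Qwen2VLModel. It is used to instantiate a Qwen2-VL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of Qwen/Qwen2-VL-7B-Instruct.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.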

>>> from transformers import Qwen2VLForConditionalGeneration, Qwen2VLConfig

>>> # Initializing a Qwen2VL style configuration
>>> configuration = Qwen2VLConfig()

>>> # Initializing a model from the Qwen2-VL-7B style configuration
>>> model = Qwen2VLForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Qwen2VLImageProcessor

class transformers.Qwen2VLImageProcessor

( do_resize: bool = True resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = True min_pixels: int = 3136 max_pixels: int = 1003520 patch_size: int = 14 temporal_patch_size: int = 2 merge_size: int = 2 **kwargs )

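Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use when resizing the image.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to [0.48145466, 0.4578275, 0.40821073]) — Mean to use if normalizing the image. This is a float or a list of floats, one for each channel in the image.
image_std (float or List[float], optional, defaults to [0.26862954, 0.26130258, 0.27577711]) — Standard deviation to use if normalizing the image. This is a float or a list of floats, one for each channel in the image.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
min_pixels (int, optional, defaults to 56 * 56) — The minimum number of pixels an image may have after resizing.
max_pixels (int, optional, defaults to 28 * 28 * 1280) — The maximum number of pixels an image may have after resizing.
patch_size (int, optional, defaults to 14) — The spatial patch size of the vision encoder.
temporal_patch_size (int, optional, defaults to 2) — The temporal patch size of the vision encoder.
merge_size (int, optional, defaults to 2) — The merge size of the vision encoder to LLM encoder.

Constructs a Qwen2-VL image processor that dynamically resizes images based on the original images.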
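
The do_rescale and do_normalize steps amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel using the default rescale_factor, image_mean, and image_std listed above. `rescale_and_normalize` is a hypothetical helper for illustration; the real processor works on whole arrays and also handles resizing and patching.

```python
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]


def rescale_and_normalize(rgb, rescale_factor=1 / 255, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Rescale one 0-255 RGB pixel to 0-1, then normalize each channel."""
    return [(value * rescale_factor - m) / s for value, m, s in zip(rgb, mean, std)]


white = rescale_and_normalize([255, 255, 255])
black = rescale_and_normalize([0, 0, 0])
print([round(v, 4) for v in white])
print([round(v, 4) for v in black])
```

After both steps, pixel values are centered around zero, with white mapping to roughly +1.9 and black to roughly -1.8 in the red channel.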

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] videos: typing.Union[typing.List[ForwardRef('PIL.Image.Image')], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), typing.List[ForwardRef('np.ndarray')], typing.List[ForwardRef('torch.Tensor')], typing.List[typing.List[ForwardRef('PIL.Image.Image')]], typing.List[typing.List[ForwardRef('np.ndarray')]], typing.List[typing.List[ForwardRef('torch.Tensor')]]] = None do_resize: bool = None size: typing.Dict[str, int] = None resample: Resampling = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

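Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
videos (VideoInput) — Video to preprocess. Expects a single video or a batch of videos with pixel values ranging from 0 to 255. If passing in videos with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.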
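
The note about do_rescale for the images and videos parameters is easy to overlook when inputs come from different sources. One possible guard, sketched below under the assumption that pixel data is available as a nested list of numbers, is to inspect the value range before choosing the flag. `should_rescale` is a hypothetical helper, and the heuristic would misjudge a 0-255 image that happens to be almost entirely black.

```python
def should_rescale(pixel_rows) -> bool:
    """Guess whether pixel values are in 0-255 (True) or already in 0-1 (False)."""
    largest = max(value for row in pixel_rows for value in row)
    return largest > 1.0


uint8_like = [[0, 128], [255, 64]]
float_like = [[0.0, 0.5], [1.0, 0.25]]
print(should_rescale(uint8_like))  # True
print(should_rescale(float_like))  # False
```

The result can be passed as do_rescale=... when calling the processor, so that images already scaled to 0-1 are not divided by 255 a second time.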

Qwen2VLProcessor

class transformers.Qwen2VLProcessor

( image_processor = None tokenizer = None chat_template = None **kwargs )

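Parameters

image_processor (Qwen2VLImageProcessor, optional) — The image processor is a required input.
tokenizer (Qwen2TokenizerFast, optional) — The tokenizer is a required input.
chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

Constructs a Qwen2-VL processor which wraps a Qwen2-VL image processor and a Qwen2 tokenizer into a single processor. Qwen2VLProcessor offers all the functionalities of Qwen2VLImageProcessor and Qwen2TokenizerFast. See __call__() and decode() for more information.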

batch_decode

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's batch_decode(). Please refer to the docstring of this method for more information.

decode

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's decode(). Please refer to the docstring of this method for more information.

post_process_image_text_to_text

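( generated_outputs ) → List[str]

Parameters

generated_outputs (torch.Tensor or np.ndarray) — The output of the model's generate function. The output is expected to be a tensor of shape (batch_size, sequence_length) or (sequence_length,).

Returns

List[str]

The decoded text.

Post-processes the output of the model to decode the text.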

Qwen2VLModel

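class transformers.Qwen2VLModel

( config: Qwen2VLConfig )

Parameters

config (Qwen2VLConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Qwen2VL Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward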

Qwen2VLForConditionalGeneration

class transformers.Qwen2VLForConditionalGeneration

( config )

forward

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None pixel_values: typing.Optional[torch.Tensor] = None pixel_values_videos: typing.Optional[torch.FloatTensor] = None image_grid_thw: typing.Optional[torch.LongTensor] = None video_grid_thw: typing.Optional[torch.LongTensor] = None rope_deltas: typing.Optional[torch.LongTensor] = None cache_position: typing.Optional[torch.LongTensor] = None ) → transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLCausalLMOutputWithPast or tuple(torch.FloatTensor)

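Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
pixel_values (torch.FloatTensor of shape (seq_length, num_channels * image_size * image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See Qwen2VLImageProcessor.__call__() for details. Qwen2VLProcessor uses Qwen2VLImageProcessor for processing images.
pixel_values_videos (torch.FloatTensor of shape (seq_length, num_channels * temporal_size * image_size * image_size)) — The tensors corresponding to the input videos. Pixel values can be obtained using AutoImageProcessor. See Qwen2VLImageProcessor.__call__() for details. Qwen2VLProcessor uses Qwen2VLImageProcessor for processing videos.
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of the feature shape of each image in the LLM.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of the feature shape of each video in the LLM.
rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].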
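
The relationship between an image's size, its image_grid_thw entry, and the number of placeholder tokens can be worked out by hand. The sketch below does this for an already-resized image, assuming the default patch_size=14, temporal_patch_size=2, and merge_size=2, that a still image counts as one temporal step, and that each flattened patch holds channels x temporal_patch_size x patch_size x patch_size values. These assumptions are one reading of how the defaults combine and are not taken from the library source; `describe_image_grid` is a hypothetical helper.

```python
def describe_image_grid(height, width, patch_size=14, temporal_patch_size=2, merge_size=2, num_channels=3):
    """Derive grid and token counts for an image whose sides are multiples of 28."""
    unit = patch_size * merge_size
    if height % unit or width % unit:
        raise ValueError("height and width must be multiples of patch_size * merge_size")
    grid_t = 1  # assumption: a still image forms a single temporal step
    grid_h = height // patch_size
    grid_w = width // patch_size
    num_patches = grid_t * grid_h * grid_w
    # Assumed size of one flattened patch.
    patch_dim = num_channels * temporal_patch_size * patch_size * patch_size
    return {
        "image_grid_thw": (grid_t, grid_h, grid_w),
        "pixel_values_shape": (num_patches, patch_dim),
        # merge_size x merge_size neighbouring patches collapse into one token.
        "num_llm_tokens": num_patches // (merge_size ** 2),
    }


info = describe_image_grid(448, 448)
print(info["image_grid_thw"], info["pixel_values_shape"], info["num_llm_tokens"])
# (1, 32, 32) (1024, 1176) 256
```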

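Returns

transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Qwen2VLConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.

The Qwen2VLForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: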
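
The -100 convention for labels is commonly used to restrict the loss to the tokens the model is supposed to produce. The sketch below shows one such pattern on plain Python lists, masking the prompt portion of a sequence. The token IDs are made up, `build_labels` and `count_supervised_tokens` are hypothetical helpers, and this masking strategy is a common practice rather than something this document prescribes.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(input_ids, prompt_length):
    """Copy input_ids into labels, masking the first prompt_length positions."""
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


def count_supervised_tokens(labels):
    """Number of positions that would contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


# Made-up token IDs: three prompt tokens followed by three answer tokens.
input_ids = [101, 102, 103, 7, 8, 9]
labels = build_labels(input_ids, prompt_length=3)
print(labels)                           # [-100, -100, -100, 7, 8, 9]
print(count_supervised_tokens(labels))  # 3
```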

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

>>> model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

>>> messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=[text], images=[image], return_tensors="pt")

>>> # Generate, passing the image tensors along with the token IDs
>>> generate_ids = model.generate(**inputs, max_new_tokens=30)
>>> # Keep only the newly generated tokens, dropping the prompt
>>> generate_ids = generate_ids[:, inputs.input_ids.shape[1]:]
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"The image shows a street scene with a red stop sign in the foreground. In the background, there is a large red gate with Chinese characters ..."
