LLaVa-NeXT-Video

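Overview

The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image datasets, which increases the model's performance on videos.

LLaVA-NeXT performs surprisingly well at understanding video content in a zero-shot fashion thanks to the AnyRes technique it uses. AnyRes naturally represents a high-resolution image as multiple images. Because a video can be viewed as a set of frames (similar to the set of images in LLaVa-NeXT), the technique generalizes naturally to representing videos. The current version of LLaVA-NeXT makes use of AnyRes and applies supervised fine-tuning (SFT) on video data on top of LLaVA-Next to achieve better video understanding. The model is currently SOTA among open-source models on the VideoMME benchmark.

The introduction from the blog is the following:

On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) that has been trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating remarkable performance across a spectrum of image-based multimodal understanding tasks, and even exceeding Gemini-Pro on several image benchmarks, e.g. MMMU and MathVista.

In today's exploration, we delve into the performance of LLaVA-NeXT on video understanding tasks. We find that LLaVA-NeXT performs strongly at understanding video content. The current version of LLaVA-NeXT for videos has several improvements:

Zero-shot video representation capabilities with AnyRes: The AnyRes technique naturally represents a high-resolution image as multiple images that a pre-trained ViT is able to digest, and forms them into a concatenated sequence. This technique is naturally generalizable to represent videos (consisting of multiple frames), allowing the image-only-trained LLaVA-Next model to perform well on video tasks. Notably, this is the first time that LMMs show strong zero-shot modality transfer ability.
Inference with length generalization improves on longer videos. The linear scaling technique enables length generalization, allowing LLaVA-NeXT to effectively handle long videos beyond the "max_token_length" limit of the LLM.
Strong video understanding ability. (1) LLaVA-Next-Image, which combines the above two techniques, yields superior zero-shot performance compared with open-source LMMs tuned on videos. (2) LLaVA-Next-Video, which further applies supervised fine-tuning (SFT) to LLaVA-Next-Image on video data, achieves better video understanding capabilities than LLaVA-Next-Image. (3) LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), shows a significant performance boost.
Efficient deployment and inference with SGLang. It allows 5x faster inference on video tasks, enabling more scalable serving such as million-level video re-captioning. See the instructions in our repo.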
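The AnyRes idea described above, a high-resolution image becoming a short sequence of encoder-sized images, can be pictured with a minimal sketch. It is not the library's implementation: the helper name `tile_boxes` is hypothetical, the 336-pixel tile size is an assumption taken from the default `image_grid_pinpoints` listed further down this page, and the real pipeline also keeps a downscaled copy of the whole image, which is omitted here.

```python
def tile_boxes(height, width, tile=336):
    """Return (top, left, bottom, right) boxes cutting an image into tile x tile crops.

    Hypothetical helper for illustration; assumes the image was already
    resized so that both sides are multiples of the tile size.
    """
    if height % tile or width % tile:
        raise ValueError("image sides must be multiples of the tile size in this sketch")
    return [
        (top, left, top + tile, left + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


# A 672x672 image becomes four 336x336 crops, i.e. a short sequence of images.
image_tiles = tile_boxes(672, 672)

# A video clip is already a sequence of images: one entry per sampled frame.
video_frames = [f"frame_{i}" for i in range(8)]

print(len(image_tiles), len(video_frames))
```

Seen this way, a list of frames and a list of crops have the same shape from the vision encoder's point of view, which is why the technique carries over to video.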

This model was contributed by RaushanTurganbay. The original code can be found here.

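Usage tips

We advise users to use padding_side="left" when computing batched generation, as it leads to more accurate results. Simply make sure to set processor.tokenizer.padding_side = "left" before generating.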
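To see why the padding side matters for generation, here is a small pure-Python sketch of what left and right padding do to a batch of token ids. The helper `pad_batch` and the token ids are made up for illustration; the tokenizer performs the real padding.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad every sequence to the length of the longest one (illustrative helper)."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        # Left padding keeps the real tokens flush with the end of the row.
        padded.append(pad + list(seq) if side == "left" else list(seq) + pad)
    return padded


batch = [[5, 6, 7], [8, 9]]
left = pad_batch(batch, pad_id=0, side="left")    # [[5, 6, 7], [0, 8, 9]]
right = pad_batch(batch, pad_id=0, side="right")  # [[5, 6, 7], [8, 9, 0]]
print(left, right)
```

With left padding, the last position of every row holds a real prompt token, so newly generated tokens continue directly from the prompt. With right padding, shorter rows would continue after pad tokens.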

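Llava-Next uses a different number of patches for each image, so the inputs have to be padded inside the modeling code in addition to the padding done when processing the inputs. The default setting is "left-padding" if the model is in eval() mode, otherwise "right-padding".

[!NOTE] LLaVA models after release v4.46 will raise warnings about adding processor.patch_size = {{patch_size}}, processor.num_additional_image_tokens = {{num_additional_image_tokens}} and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if you do not. Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many placeholders as there will be tokens. Usually this is around 500 tokens per image, so make sure that the text is not truncated, as otherwise merging the embeddings will fail. The attributes can be obtained from the model config, as model.config.vision_config.patch_size or model.config.vision_feature_select_strategy. num_additional_image_tokens should be 1 if the vision backbone adds a CLS token, or 0 if nothing extra is added to the vision patches.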
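The arithmetic behind the three processor attributes can be sketched as follows. This is a simplified estimate for a single encoder-sized image, not the processor's actual code: the helper name is hypothetical, and the values `image_size=336` and `patch_size=14` are assumptions (a CLIP-style backbone) chosen because they reproduce the `image_seq_length = 576` default shown in the config signature on this page. AnyRes tiling of high-resolution images adds further tokens on top of this.

```python
def base_image_tokens(image_size, patch_size, num_additional_image_tokens,
                      vision_feature_select_strategy):
    """Estimate the placeholder tokens for one encoder-sized image (sketch only)."""
    patches_per_side = image_size // patch_size
    tokens = patches_per_side * patches_per_side + num_additional_image_tokens
    # "default" drops the CLS token from the vision features; "full" keeps everything.
    if vision_feature_select_strategy == "default" and num_additional_image_tokens > 0:
        tokens -= 1
    return tokens


# Assumed values: 336-pixel input, 14-pixel patches, one CLS token.
print(base_image_tokens(336, 14, 1, "default"))  # 24 * 24 = 576
print(base_image_tokens(336, 14, 1, "full"))     # 576 + 1 CLS = 577
```

Because each image expands to hundreds of placeholder tokens, truncating the text after expansion can cut placeholders away, which is the failure the note warns about.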

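Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the tokenizer's apply_chat_template to format your prompts correctly. Below is an example of how to do that.

We will use LLaVA-NeXT-Video-7B-hf and a conversation history of videos and images. Each content field has to be a list of dicts, as follows: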

from transformers import LlavaNextVideoProcessor

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."},
            ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s shown in this image?"},
            {"type": "image"},
            ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."},]
    },
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your visuals
print(text_prompt)

Usage example

Single Media Mode

The model can accept both images and videos as input. Here is example code for inference in half-precision (torch.float16):

import av
import torch
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto")
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)

conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Move the processed inputs to the same device as the model before generating
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
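The uniform frame sampling used in the example above can be checked in isolation. The sketch below is a pure-Python variant with a hypothetical helper name; for 8 frames it yields the same indices as the `np.arange(0, total_frames, total_frames / 8).astype(int)` expression in the snippet.

```python
def uniform_frame_indices(total_frames, num_frames=8):
    """Pick num_frames evenly spaced frame indices, starting at frame 0 (sketch)."""
    if num_frames <= 0 or total_frames < num_frames:
        raise ValueError("need at least num_frames frames and a positive num_frames")
    # Integer arithmetic avoids floating-point step accumulation.
    return [i * total_frames // num_frames for i in range(num_frames)]


print(uniform_frame_indices(100))  # [0, 12, 25, 37, 50, 62, 75, 87]
```

Longer videos can be sampled with a larger `num_frames`, at the cost of more video tokens in the prompt.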

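Mixed Media Mode

The model can also generate from interleaved image-video inputs. Note, however, that it was not trained in an interleaved image-video setting, which might affect performance. Below is an example usage for mixed media input; add the following lines to the above code snippet: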

from PIL import Image
import requests

# Generate from image and video mixed inputs
# Load an image and write a new prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "How many cats are there in the image?"},
            {"type": "image"},
            ],
    },
    {

        "role": "assistant",
        "content": [{"type": "text", "text": "There are two cats"}],
    },
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
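# `video` is the frame array loaded in the single-media snippet above
inputs = processor(text=prompt, images=image, videos=video, padding=True, return_tensors="pt").to(model.device)

# Generate (max_new_tokens bounds only the generated text, not the long multimodal prompt)
generate_ids = model.generate(**inputs, max_new_tokens=50)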
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

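Model optimization

Quantization using Bitsandbytes for memory efficiency

The model can be loaded in lower bits, significantly reducing the memory burden while maintaining the performance of the original model. This allows for efficient deployment in resource-constrained settings.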
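A rough back-of-the-envelope estimate shows the scale of the saving. The sketch below counts weight storage only and assumes a nominal 7e9 parameters, inferred from the "7B" in the checkpoint name. It ignores quantization constants, any layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB (weights only, illustrative)."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: nominal parameter count of a "7B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"float16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```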

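First, make sure to install bitsandbytes by running pip install bitsandbytes, and that you have access to a GPU/accelerator supported by the library.

bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, the ROCm (AMD GPU) and Intel CPU implementations are mature, Intel XPU is in progress, and Apple Silicon support is expected by Q4/Q1. For installation instructions and the latest backend updates, visit this link.

We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.

Then simply load the quantized model by adding a BitsAndBytesConfig, as shown below: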

import torch
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", quantization_config=quantization_config, device_map="auto")

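Flash-Attention 2 to speed up generation

Additionally, we can greatly speed up model inference by using Flash Attention, which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2: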

pip install -U flash-attn --no-build-isolation

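Also, you should have hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.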
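The two software-side conditions above (the package is installed, and the model is loaded in half precision) can be folded into a small guard. This is only a sketch with a hypothetical helper name: it does not check GPU compatibility, and the `"sdpa"` fallback is one of the other values accepted by `attn_implementation`, chosen here as an assumption about what you would want instead.

```python
import importlib.util


def pick_attn_implementation(dtype_name):
    """Return an attn_implementation string based on dtype and installed packages (sketch)."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if flash_installed and dtype_name in {"float16", "bfloat16"}:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled-dot-product attention otherwise.
    return "sdpa"


print(pick_attn_implementation("float16"))
print(pick_attn_implementation("float32"))  # never flash_attention_2
```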

To load and run a model using Flash Attention-2, simply add attn_implementation="flash_attention_2" when loading the model, as follows:

import torch
from transformers import LlavaNextVideoForConditionalGeneration

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf", 
    torch_dtype=torch.float16, 
    attn_implementation="flash_attention_2",
).to(0)

LlavaNextVideoConfig

class transformers.LlavaNextVideoConfig

( vision_config = None text_config = None ignore_index = -100 image_token_index = 32001 projector_hidden_act = 'gelu' vision_feature_select_strategy = 'default' vision_feature_layer = -2 image_grid_pinpoints = None tie_word_embeddings = False video_token_index = 32000 spatial_pool_mode = 'average' spatial_pool_stride = 2 image_seq_length = 576 video_seq_length = 288 **kwargs )

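Parameters

vision_config (Union[AutoConfig, dict], optional, defaults to CLIPVisionConfig) — The config object or dictionary of the vision backbone.
text_config (Union[AutoConfig, dict], optional, defaults to LlamaConfig) — The config object or dictionary of the text backbone.
ignore_index (int, optional, defaults to -100) — The ignore index for the loss function.
image_token_index (int, optional, defaults to 32001) — The image token index used to encode the image prompt.
projector_hidden_act (str, optional, defaults to "gelu") — The activation function used by the multimodal projector.
vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full". If "default", the CLS token is removed from the vision features. If "full", the full vision features are used.
vision_feature_layer (int, optional, defaults to -2) — The index of the layer to select the vision feature from.
image_grid_pinpoints (List, optional, defaults to [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]) — A list of possible resolutions to use for processing high-resolution images. Each item in the list should be a tuple or list of the form (height, width).
tie_word_embeddings (bool, optional, defaults to False) — Whether the model's input and output word embeddings should be tied.
video_token_index (int, optional, defaults to 32000) — The video token index used to encode the video prompt.
spatial_pool_mode (str, optional, defaults to "average") — Pooling mode to use for videos. Can be "average", "max" or "conv".
spatial_pool_stride (int, optional, defaults to 2) — Stride used in the pooling layer for videos.
image_seq_length (int, optional, defaults to 576) — Sequence length of one image embedding.
video_seq_length (int, optional, defaults to 288) — Sequence length of one video embedding.

This is the configuration class to store the configuration of a LlavaNextVideoForConditionalGeneration. It is used to instantiate a Llava-NeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the llava-hf/LLaVA-NeXT-Video-7B-hf model. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.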
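To make `spatial_pool_mode="average"` and `spatial_pool_stride=2` concrete, the following pure-Python sketch average-pools a grid of per-patch values. It is an illustration under assumptions, not the model's pooling layer: the 24 x 24 grid is inferred from `image_seq_length = 576`, each cell holds a single number standing in for a feature vector, and the helper name is hypothetical.

```python
def average_pool_grid(grid, stride=2):
    """Average non-overlapping stride x stride windows of a 2D grid (sketch)."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - stride + 1, stride):
        pooled_row = []
        for c in range(0, cols - stride + 1, stride):
            window = [grid[r + dr][c + dc] for dr in range(stride) for dc in range(stride)]
            pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# Assumed 24 x 24 patch grid for one frame (576 patches before pooling).
grid = [[float(r * 24 + c) for c in range(24)] for r in range(24)]
pooled = average_pool_grid(grid, stride=2)
tokens_per_frame = len(pooled) * len(pooled[0])
print(tokens_per_frame)  # 12 * 12 = 144
```

Under these assumptions, a stride of 2 cuts the tokens per frame by a factor of four, which keeps prompts with many frames manageable.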

Example:

>>> from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoConfig, CLIPVisionConfig, LlamaConfig

>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()

>>> # Initializing a Llama config
>>> text_config = LlamaConfig()

>>> configuration = LlavaNextVideoConfig(vision_config, text_config)

>>> model = LlavaNextVideoForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

LlavaNextVideoProcessor

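class transformers.LlavaNextVideoProcessor

( video_processor = None image_processor = None tokenizer = None chat_template = None patch_size = None vision_feature_select_strategy = None video_token = '<video>' image_token = '<image>' num_additional_image_tokens = 0 **kwargs )

Parameters

video_processor (LlavaNextVideoImageProcessor, optional) — The video processor is a required input.
image_processor (LlavaNextImageProcessor, optional) — The image processor is a required input.
tokenizer (LlamaTokenizerFast, optional) — The tokenizer is a required input.
chat_template (str, optional) — Jinja chat template that will be used in the tokenizer's apply_chat_template.
patch_size (int, optional) — Patch size from the vision tower.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Should be the same as in the model's config.
video_token (str, optional, defaults to "<video>") — Special token used to denote video location.
image_token (str, optional, defaults to "<image>") — Special token used to denote image location.
num_additional_image_tokens (int, optional, defaults to 0) — Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other extra tokens appended, there is no need to set this argument.

Constructs a LLaVa-NeXT-Video processor which wraps a LLaVa-NeXT image processor, a LLaVa-NeXT-Video video processor and a LLaMa tokenizer into a single processor.

LlavaNextVideoProcessor offers all the functionalities of LlavaNextImageProcessor, LlavaNextVideoImageProcessor and LlamaTokenizerFast. See __call__() and decode() for more information.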

batch_decode

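( *args **kwargs )

This method forwards all its arguments to LlamaTokenizerFast's batch_decode(). Please refer to the docstring of this method for more information.

decode

( *args **kwargs )

This method forwards all its arguments to LlamaTokenizerFast's decode(). Please refer to the docstring of this method for more information.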

LlavaNextVideoImageProcessor

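class transformers.LlavaNextVideoImageProcessor

( do_resize: bool = True size: typing.Dict[str, int] = None image_grid_pinpoints: typing.List = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True crop_size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = True **kwargs )

Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 224}) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
image_grid_pinpoints (List, optional, defaults to [[672, 336], [336, 672], [672, 672], [336, 1008], [1008, 336]]) — A list of possible resolutions to use for processing high-resolution images. The best resolution is selected based on the original size of the image. Can be overridden by image_grid_pinpoints in the preprocess method. Not used for processing videos.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
do_center_crop (bool, optional, defaults to True) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
crop_size (Dict[str, int], optional, defaults to 224) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
image_mean (float or List[float], optional, defaults to [0.48145466, 0.4578275, 0.40821073]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to [0.26862954, 0.26130258, 0.27577711]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.

Constructs a LLaVa-NeXT-Video video processor. Based on CLIPImageProcessor, with added functionality for processing each frame of a video.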
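The `image_grid_pinpoints` entry above says the best resolution is chosen from the original image size, without spelling out the rule. The sketch below shows one plausible selection rule: keep as many original pixels as possible after an aspect-preserving fit, then prefer the candidate with the least padding. The helper name is hypothetical and the rule is an assumption for illustration; the library's own selection logic may differ in detail.

```python
def select_best_resolution_sketch(original_size, possible_resolutions):
    """Pick a (height, width) candidate for an image of original_size (sketch)."""
    orig_h, orig_w = original_size
    best, best_effective, best_wasted = None, -1, float("inf")
    for height, width in possible_resolutions:
        # Largest aspect-preserving scale that fits inside the candidate.
        scale = min(width / orig_w, height / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = width * height - effective
        if effective > best_effective or (effective == best_effective and wasted < best_wasted):
            best, best_effective, best_wasted = (height, width), effective, wasted
    return best


pinpoints = [[672, 336], [336, 672], [672, 672], [336, 1008], [1008, 336]]
print(select_best_resolution_sketch((480, 640), pinpoints))  # (672, 672)
```

For a 480 x 640 image, only the 672 x 672 candidate can hold the picture without downscaling, so it is selected even though some of the canvas is padding.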

preprocess

( images: typing.Union[typing.List[ForwardRef('PIL.Image.Image')], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), typing.List[ForwardRef('np.ndarray')], typing.List[ForwardRef('torch.Tensor')], typing.List[typing.List[ForwardRef('PIL.Image.Image')]], typing.List[typing.List[ForwardRef('np.ndarray')]], typing.List[typing.List[ForwardRef('torch.Tensor')]]] do_resize: bool = None size: typing.Dict[str, int] = None resample: Resampling = None do_center_crop: bool = None crop_size: int = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

images (VideoInput) — Videos to preprocess. Expects a single video or a batch of videos with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the video.
size (Dict[str, int], optional, defaults to self.size) — Size of the video after resizing. The shortest edge of the video is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the video. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the video.
crop_size (Dict[str, int], optional, defaults to self.crop_size) — Size of the center crop. Only has an effect if do_center_crop is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the video.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the video by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the video.
image_mean (float or List[float], optional, defaults to self.image_mean) — Frame mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Frame standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the video to RGB.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
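Three of the steps listed above (rescale, normalize, and the channels-first output format) are plain array arithmetic, which the NumPy sketch below reproduces for a single frame. It is an illustration, not the processor's code: the function name is hypothetical, and the mean and standard deviation are the defaults quoted in the LlavaNextVideoImageProcessor parameter list on this page.

```python
import numpy as np

# Default normalization statistics quoted in the parameter list above.
IMAGE_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
IMAGE_STD = np.array([0.26862954, 0.26130258, 0.27577711])


def rescale_normalize_channels_first(frame, rescale_factor=1 / 255):
    """Apply rescale -> normalize -> (H, W, C) to (C, H, W) on one frame (sketch)."""
    frame = frame.astype(np.float32) * rescale_factor  # 0..255 -> 0..1
    frame = (frame - IMAGE_MEAN) / IMAGE_STD           # per-channel normalization
    return np.transpose(frame, (2, 0, 1)).astype(np.float32)


# A dummy all-white frame in channels_last layout with 0..255 pixel values.
frame = np.full((224, 224, 3), 255, dtype=np.uint8)
out = rescale_normalize_channels_first(frame)
print(out.shape)  # (3, 224, 224)
```

If the input already holds values between 0 and 1, the rescale step would shrink it a second time, which is why the `images` entry recommends `do_rescale=False` in that case.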

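resize

( image: ndarray size: typing.Dict[str, int] resample: Resampling = <Resampling.BICUBIC: 3> data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )

Parameters

image (np.ndarray) — Image to resize.
size (Dict[str, int]) — Size of the output image.
resample (PILImageResampling, optional, defaults to PILImageResampling.BICUBIC) — Resampling filter to use when resizing the image.
data_format (str or ChannelDimension, optional) — The channel dimension format of the image. If not provided, it will be the same as the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format of the input image. If not provided, it will be inferred.

Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.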
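The output size implied by the shortest-edge rule can be computed by hand. The helper below is a hypothetical sketch of that arithmetic; it truncates the long side to an integer, which is an assumption about rounding rather than a statement of the library's exact behavior.

```python
def shortest_edge_output_size(height, width, shortest_edge):
    """Return (new_height, new_width) with the short side set to shortest_edge (sketch)."""
    short, long = (height, width) if height <= width else (width, height)
    new_short = shortest_edge
    # Scale the long side by the same factor to keep the aspect ratio.
    new_long = int(shortest_edge * long / short)
    return (new_short, new_long) if height <= width else (new_long, new_short)


print(shortest_edge_output_size(480, 640, 224))  # (224, 298)
print(shortest_edge_output_size(640, 480, 224))  # (298, 224)
```

A center crop to 224 x 224 afterwards then trims the long side, so landscape and portrait frames both end up square.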

LlavaNextVideoForConditionalGeneration

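class transformers.LlavaNextVideoForConditionalGeneration

( config: LlavaNextVideoConfig )

Parameters

config (LlavaNextVideoConfig or LlavaNextVideoVisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The LLAVA-NeXT model consists of a vision backbone and a language model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.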

forward

( input_ids: LongTensor = None pixel_values: FloatTensor = None pixel_values_videos: FloatTensor = None image_sizes: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Optional[int] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 ) → transformers.models.llava_next_video.modeling_llava_next_video.LlavaNextVideoCausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See LlavaNextVideoImageProcessor.__call__() for details. LlavaProcessor uses LlavaNextVideoImageProcessor for processing images.
image_sizes (torch.LongTensor of shape (batch_size, 2), optional) — The sizes of the images in the batch, each given as (height, width).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

If past_key_values is used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
vision_feature_layer (int, optional, defaults to -2) — The index of the layer to select the vision feature from.
vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full". If "default", the CLS token is removed from the vision features. If "full", the full vision features are used.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
pixel_values_videos (torch.FloatTensor of shape (batch_size, num_frames, num_channels, image_size, image_size)) — The tensors corresponding to the input videos. Pixel values can be obtained using AutoImageProcessor. See LlavaNextVideoVideoProcessor.__call__() for details. LlavaProcessor uses LlavaNextVideoVideoProcessor for processing videos.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
num_logits_to_keep (int, optional) — Calculate logits for the last num_logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
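The role of the -100 label value can be illustrated with a small pure-Python loss computation. This is a sketch rather than the model's loss code: the helper name is hypothetical, the logits are toy values, and the one-position shift that causal language models apply between logits and labels is deliberately left out to keep the masking logic visible.

```python
import math


def masked_nll(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index (sketch)."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        peak = max(token_logits)
        # Numerically stable log-sum-exp over the vocabulary.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        total += log_norm - token_logits[label]
        count += 1
    if count == 0:
        raise ValueError("all labels are masked")
    return total / count


# Three positions over a toy vocabulary of four tokens; the middle one is masked.
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_labels = [1, -100, 2]
loss = masked_nll(toy_logits, toy_labels)
print(loss)
```

In practice, setting the labels of prompt and padding positions to -100 restricts the loss to the tokens the model is expected to produce.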

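transformers.models.llava_next_video.modeling_llava_next_video.LlavaNextVideoCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.llava_next_video.modeling_llava_next_video.LlavaNextVideoCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LlavaNextVideoConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size * num_patches, num_images, sequence_length, hidden_size). The image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size * num_frames, num_videos, sequence_length, hidden_size). The video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.

The LlavaNextVideoForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: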

>>> from PIL import Image
>>> import requests
>>> import av
>>> import numpy as np
>>> from huggingface_hub import hf_hub_download
>>> from transformers import AutoProcessor, LlavaNextVideoForConditionalGeneration

>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])

>>> model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", device_map="auto")
>>> processor = AutoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

>>> prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
>>> video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
>>> container = av.open(video_path)

>>> # sample uniformly 8 frames from the video (model was trained with 32 frames per video, but this video is short)
>>> total_frames = container.streams.video[0].frames
>>> indices = np.arange(0, total_frames, total_frames / 8).astype(int)
>>> clip = read_video_pyav(container, indices)
>>> inputs_video = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

>>> # load an image to generate from an image
>>> prompt = "USER:<image>\nWhat is shown in this image? ASSISTANT:"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs_image = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

>>> # Generate from video
>>> generate_ids = model.generate(**inputs_video, max_length=50)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"USER:\nWhy is this video funny? ASSISTANT: The humor in this video comes from the unexpected and endearing sight of a baby wearing glasses and (...)"

>>> # Generate from image
>>> generate_ids = model.generate(**inputs_image, max_length=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"USER: \nWhat's the content of the image? ASSISTANT: The image shows a red stop sign on a pole, with a traditional Chinese archway (...)"
