Transformers

Mllama

Overview

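The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a set of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.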

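Model architecture: Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pretrained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.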

Usage Tips

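For image+text and text inputs, use MllamaForConditionalGeneration.
For text-only inputs, use MllamaForCausalLM for generation to avoid loading the vision tower.
Each sample can contain multiple images, and the number of images can vary between samples. The processor pads the inputs to the maximum number of images across samples and to the maximum number of tiles within each image.
The text passed to the processor should contain the "<|image|>" token at each position where an image should be inserted.
The processor has its own apply_chat_template method that converts chat messages to text, which can then be passed to the processor as text.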
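The padding and placeholder rules above can be illustrated without loading a model. The following is a minimal sketch in plain Python, not the processor's actual implementation: `check_placeholders` and `padded_batch_shape` are hypothetical helpers, and the tile counts are made-up values chosen only to show how a batch with uneven image counts lines up.

```python
IMAGE_TOKEN = "<|image|>"


def check_placeholders(prompts, images_per_sample):
    """Raise if a prompt's placeholder count differs from its image count."""
    for index, (prompt, images) in enumerate(zip(prompts, images_per_sample)):
        found = prompt.count(IMAGE_TOKEN)
        if found != len(images):
            raise ValueError(
                f"sample {index}: {found} placeholders but {len(images)} images"
            )


def padded_batch_shape(tiles_per_image):
    """Return (batch, max_images, max_tiles) for nested per-image tile counts."""
    max_images = max(len(sample) for sample in tiles_per_image)
    max_tiles = max(
        (tiles for sample in tiles_per_image for tiles in sample), default=0
    )
    return (len(tiles_per_image), max_images, max_tiles)


prompts = [
    "<|image|><|image|>Compare these two pictures.",
    "<|image|>Describe this picture.",
]
# Hypothetical tile counts: sample 0 has two images (4 and 1 tiles),
# sample 1 has a single image (2 tiles).
tiles_per_image = [[4, 1], [2]]

check_placeholders(prompts, tiles_per_image)
print(padded_batch_shape(tiles_per_image))  # (2, 2, 4)
```

In this sketch the second sample would be padded with one empty image slot, and every image slot would be padded up to four tiles.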

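Mllama has an extra token that serves as a placeholder for image positions in the text. This means the input IDs and the input embedding layer have one extra token. However, because the input and output embedding weights are not tied, the lm_head layer has one token fewer, so computing a loss on image tokens or applying certain logit processors will fail. If you are training, make sure to mask out the special "<|image|>" tokens in the labels, because the model should not be trained to predict them.

Otherwise, if you see CUDA-side index errors when generating, use the following code to expand lm_head by one extra token.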

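# Fetch the current output projection (lm_head).
old_embeddings = model.get_output_embeddings()

# Add one row for the "<|image|>" placeholder token.
num_tokens = model.vocab_size + 1
resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=num_tokens, mean_resizing=True)
# Keep the original trainability setting, then install the resized head.
resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad)
model.set_output_embeddings(resized_embeddings)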
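For the training case, masking the image placeholder in the labels might look like the sketch below. It works on plain Python lists so it can run on its own; `mask_image_tokens` is a hypothetical helper, the token ids are illustrative, and the value 128256 for the "<|image|>" id is an assumption that should be confirmed against the tokenizer of the checkpoint in use. The only fixed fact relied on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
IMAGE_TOKEN_ID = 128256  # assumed id of "<|image|>"; verify with the tokenizer


def mask_image_tokens(input_ids, image_token_id=IMAGE_TOKEN_ID):
    """Copy input_ids into labels, replacing image placeholders with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == image_token_id else token for token in sequence]
        for sequence in input_ids
    ]


# Illustrative ids only: one sequence with a single image placeholder.
input_ids = [[128000, 128256, 3923, 374, 420]]
labels = mask_image_tokens(input_ids)
print(labels)  # [[128000, -100, 3923, 374, 420]]
```

With tensors, the same idea is usually written as cloning `input_ids` and assigning -100 wherever the clone equals the image token id.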

Usage Example

Instruct model

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Load the instruction-tuned checkpoint and its processor.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# A batch with one conversation; the {"type": "image"} entry marks where the image goes.
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
# Convert the chat messages to prompt text.
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Download the example image.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the text, preprocess the image, and generate.
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
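When a question refers to more than one image, the chat message needs one {"type": "image"} entry per image. The helper below is a hypothetical convenience function, not part of the library; it only builds the same message structure used in the example above.

```python
def build_user_message(question, num_images=1):
    """Return one user turn with `num_images` image slots followed by the question."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


# A batch with one conversation that refers to two images.
messages = [[build_user_message("What do these two images have in common?", num_images=2)]]
print(len(messages[0][0]["content"]))  # 3
```

The images themselves would then be passed to the processor in the same order as their placeholders appear in the message.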

Base model

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Load the pretrained (non-instruct) checkpoint and its processor.
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# The base model takes a raw prompt; "<|image|>" marks where the image is inserted.
prompt = "<|image|>If I had to write a haiku for this one"
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

# Greedy decoding (do_sample=False) for a deterministic continuation.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0], skip_special_tokens=True))
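The decoded string in the examples above includes the prompt as well as the continuation. A common way to keep only the newly generated part is to drop the first prompt-length ids before decoding. The sketch below shows the slicing on plain lists; `strip_prompt_tokens` is a hypothetical helper, the ids are made up, and it assumes the generated sequence starts with the prompt ids, as is usual for decoder-only generation.

```python
def strip_prompt_tokens(sequences, prompt_length):
    """Drop the first `prompt_length` ids from every generated sequence."""
    return [list(sequence[prompt_length:]) for sequence in sequences]


# Illustrative ids only: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [128000, 128256, 2746, 358]
generated = [prompt_ids + [1047, 311, 3350]]

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)  # [[1047, 311, 3350]]
```

With tensors, the equivalent slice would use the length of the last dimension of the input ids as the prompt length.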
