Transformers documentation

CLIP

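Overview

The CLIP model was proposed in "Learning Transferable Visual Models From Natural Language Supervision" by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3.

The abstract from the paper is the following:

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

This model was contributed by valhalla. The original code can be found here.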

Usage tips and example

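CLIP is a multimodal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like transformer to get visual features and a causal language model to get text features. Both the text and the visual features are then projected into a latent space with the same dimension. The dot product between the projected image and text features is used as the similarity score.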
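The scoring step described above can be sketched without the model itself. The following is a minimal illustration in plain Python, assuming toy 3-dimensional vectors that stand in for the projected image and text features. The vectors, the helper names, and the `logit_scale` value are made up for illustration, and the L2 normalization before the dot product is an assumption about how the features are prepared rather than something stated in this section.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so the dot product becomes cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_probs(image_embed, text_embeds, logit_scale=100.0):
    """Return one probability per text for a single image embedding.

    logit_scale is an assumed temperature-like multiplier, not a value
    taken from this page.
    """
    image_embed = l2_normalize(image_embed)
    logits = [logit_scale * dot(image_embed, l2_normalize(t)) for t in text_embeds]
    return softmax(logits)


# Toy projected features (hypothetical values, already in the shared space).
image_embed = [0.9, 0.1, 0.2]
text_embeds = [
    [1.0, 0.0, 0.1],  # stands in for "a photo of a cat"
    [0.0, 1.0, 0.3],  # stands in for "a photo of a dog"
]

probs = similarity_probs(image_embed, text_embeds)
print(probs)
```

The text whose embedding points in nearly the same direction as the image embedding receives almost all of the probability mass, which is the behaviour zero-shot classification relies on.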

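To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors to a standard Transformer encoder. CLIPImageProcessor can be used to resize (or rescale) and normalize images for the model.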
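As a rough illustration of the patching step, the sketch below counts how many tokens the vision encoder would see for a square image. The function name is hypothetical, and the 224x224 input size is an assumption; the patch sizes 32 and 14 are taken from the checkpoint names used on this page.

```python
def vision_sequence_length(image_size, patch_size):
    """Number of tokens the vision encoder sees: one per patch plus [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the [CLS] token


# Assumed input resolution of 224x224 pixels.
print(vision_sequence_length(224, 32))  # 7 * 7 patches + 1 = 50
print(vision_sequence_length(224, 14))  # 16 * 16 patches + 1 = 257
```

Smaller patches give a longer sequence for the same image, which is one reason the patch14 checkpoint costs more per image than the patch32 one.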

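CLIPTokenizer is used to encode the text. CLIPProcessor wraps CLIPImageProcessor and CLIPTokenizer into a single instance that both encodes the text and prepares the images. The following example shows how to get image-text similarity scores using CLIPProcessor and CLIPModel.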

>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

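Combining CLIP and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

pip install -U flash-attn --no-build-isolation

Make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half precision (e.g. torch.float16).

For small batch sizes, you might notice a slowdown in your model when using Flash Attention. Refer to the section on expected speedups with Flash Attention and SDPA below and select an appropriate attention implementation.

To load and run a model using Flash Attention 2, refer to the snippet below: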

>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import CLIPProcessor, CLIPModel

>>> device = "cuda"
>>> torch_dtype = torch.float16

>>> model = CLIPModel.from_pretrained(
...     "openai/clip-vit-base-patch32",
...     attn_implementation="flash_attention_2",
...     device_map=device,
...     torch_dtype=torch_dtype,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs = inputs.to(device)  # move the processed tensors to the GPU

>>> with torch.no_grad():
...     with torch.autocast(device):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
>>> print(probs)
tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)

Using Scaled Dot Product Attention (SDPA)

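PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.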
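For readers unfamiliar with the operation, here is a minimal pure-Python reference of what scaled dot-product attention computes, softmax(Q K^T / sqrt(d)) V, on tiny lists of vectors. It is only a sketch for intuition: it omits masking and dropout, it is not how PyTorch implements the operator, and the function below is a hypothetical helper that merely shares its name with the PyTorch one.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Reference attention for lists of vectors: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled similarity of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Softmax over keys (shifted by the max for numerical stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(q, k, v)
print(result)
```

The query matches the first key more closely, so the output leans toward the first value vector. Fused kernels compute the same quantity while avoiding the explicit intermediate score matrix.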

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")

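For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).

Expected speedups with Flash Attention and SDPA

On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with float16, we saw the following speedups during inference for the "openai/clip-vit-large-patch14" checkpoint (code):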

CLIPTextModel

Num text labels	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup
4	0.009	0.012	0.737	0.007	1.269
16	0.009	0.014	0.659	0.008	1.187
32	0.018	0.021	0.862	0.016	1.142
64	0.034	0.034	1.001	0.03	1.163
128	0.063	0.058	1.09	0.054	1.174

CLIPVisionModel

Image batch size	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup
1	0.016	0.013	1.247	0.012	1.318
4	0.025	0.021	1.198	0.021	1.202
16	0.093	0.075	1.234	0.075	1.24
32	0.181	0.147	1.237	0.146	1.241

CLIPModel

Image batch size	Num text labels	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup
1	4	0.025	0.026	0.954	0.02	1.217
1	16	0.026	0.028	0.918	0.02	1.287
1	64	0.042	0.046	0.906	0.036	1.167
4	4	0.028	0.033	0.849	0.024	1.189
4	16	0.034	0.035	0.955	0.029	1.169
4	64	0.059	0.055	1.072	0.05	1.179
16	4	0.096	0.088	1.091	0.078	1.234
16	16	0.102	0.09	1.129	0.083	1.224
16	64	0.127	0.11	1.157	0.105	1.218
32	4	0.185	0.159	1.157	0.149	1.238
32	16	0.19	0.162	1.177	0.154	1.233
32	64	0.216	0.181	1.19	0.176	1.228
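To make the CLIPModel table above easier to act on, the following sketch recomputes the speedups and picks the fastest implementation per configuration. The timings are copied from the table; the helper names are hypothetical. Ratios recomputed from these rounded timings can differ slightly from the speedup columns, presumably because those were derived from unrounded measurements.

```python
# (image batch size, number of text labels, eager, FA2, SDPA) in seconds per iteration,
# copied from the CLIPModel table.
ROWS = [
    (1, 4, 0.025, 0.026, 0.020),
    (1, 16, 0.026, 0.028, 0.020),
    (1, 64, 0.042, 0.046, 0.036),
    (4, 4, 0.028, 0.033, 0.024),
    (4, 16, 0.034, 0.035, 0.029),
    (4, 64, 0.059, 0.055, 0.050),
    (16, 4, 0.096, 0.088, 0.078),
    (16, 16, 0.102, 0.090, 0.083),
    (16, 64, 0.127, 0.110, 0.105),
    (32, 4, 0.185, 0.159, 0.149),
    (32, 16, 0.190, 0.162, 0.154),
    (32, 64, 0.216, 0.181, 0.176),
]


def fastest(row):
    """Return the name of the implementation with the lowest time for a row."""
    _, _, eager, fa2, sdpa = row
    timings = {"eager": eager, "flash_attention_2": fa2, "sdpa": sdpa}
    return min(timings, key=timings.get)


def speedup(baseline, candidate):
    """Speedup of candidate over baseline; values above 1 mean faster."""
    return baseline / candidate


# Configurations where Flash Attention 2 is slower than the eager baseline.
fa2_slower = [(images, labels) for images, labels, eager, fa2, _ in ROWS if fa2 > eager]

for row in ROWS:
    images, labels, eager, fa2, sdpa = row
    print(images, labels, fastest(row), round(speedup(eager, fa2), 3), round(speedup(eager, sdpa), 3))
print("FA2 slower than eager for:", fa2_slower)
```

In these measurements SDPA is the fastest option in every row, while Flash Attention 2 only falls behind eager at image batch sizes of 1 and 4.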

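Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

Fine-tuning CLIP with remote sensing (satellite) images and captions, a blog post about how to fine-tune CLIP on the RSICD dataset and how data augmentation changes performance.
This example script shows how to train a CLIP-like vision-text dual encoder model using pretrained vision and text encoders and the COCO dataset.

Image-to-Text

A notebook on how to use a pretrained CLIP for inference with beam search for image captioning. 🌎

Image retrieval

A notebook on image retrieval using a pretrained CLIP and computing the MRR (Mean Reciprocal Rank) score. 🌎
A notebook on image retrieval that shows the similarity scores. 🌎
A notebook on how to map images and texts to the same vector space using Multilingual CLIP. 🌎
A notebook on how to run CLIP for semantic image search using the Unsplash and TMDB datasets. 🌎

Explainability

A notebook on how to visualize the similarity between input tokens and image segments. 🌎

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.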
