SigLIP

Overview

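The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. SigLIP replaces the loss function used in CLIP with a simple pairwise sigmoid loss, which yields better zero-shot classification accuracy on ImageNet.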

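The abstract from the paper is the following:

We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient.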
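
The pairwise loss described in the abstract can be sketched in a few lines. The following is a minimal pure-Python illustration based on that description, not the reference implementation. The function names are hypothetical, and the starting values `t_prime = log(10)` and `b = -10` are assumptions chosen for illustration (in training they would be learnable parameters). Every image-text pair is treated as an independent binary classification problem: matching pairs get label +1 and all other pairs get label -1, so no normalization across the batch is required.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def pairwise_sigmoid_loss(img_emb, txt_emb, t_prime=math.log(10.0), b=-10.0):
    """Sigmoid loss over all image-text pairs in a batch of size n.

    img_emb[i] and txt_emb[i] are assumed to be a matching pair.
    t_prime (log-temperature) and b (bias) are illustrative values.
    """
    n = len(img_emb)
    t = math.exp(t_prime)
    zimg = [l2_normalize(v) for v in img_emb]
    ztxt = [l2_normalize(v) for v in txt_emb]
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * c for a, c in zip(zimg[i], ztxt[j]))
            logit = t * similarity + b
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total += log_sigmoid(label * logit)
    return -total / n


# Toy 2-dimensional embeddings (hypothetical data).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = pairwise_sigmoid_loss(images, texts_matched)
loss_swapped = pairwise_sigmoid_loss(images, texts_swapped)
print(loss_matched, loss_swapped)
```

With these toy embeddings, the correctly aligned batch gives a much lower loss than the batch whose texts are swapped, which is the behavior the loss is meant to reward.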

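Usage tips

Usage of SigLIP is similar to CLIP. The main difference is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
Training is supported but does not use torch.distributed utilities, which may limit the scalability of the batch size. However, DDP and FSDP work on a single-node multi-GPU setup.
When using the standalone SiglipTokenizer or SiglipProcessor, make sure to pass padding="max_length", as that is how the model was trained.
To get the same results as the pipeline, use the prompt template "This is a photo of {label}.".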
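
To make the sigmoid-versus-softmax tip concrete, here is a small pure-Python sketch. The logit values are hypothetical and only serve to show the difference: with sigmoid each label is scored independently and the scores need not sum to 1, whereas softmax forces the labels to compete.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    shift = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against three candidate texts.
logits = [-1.4, -12.0, -15.0]

sigmoid_probs = [sigmoid(x) for x in logits]  # SigLIP-style: independent per label
softmax_probs = softmax(logits)               # CLIP-style: normalized across labels

print([round(p, 4) for p in sigmoid_probs])
print([round(p, 4) for p in softmax_probs])
```

Under softmax the first label would look almost certain simply because the alternatives are worse. Under sigmoid the same logit yields a modest score, which reflects how well that single text matches the image on its own.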

SigLIP evaluation results compared to CLIP. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

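Usage example

There are two main ways to use SigLIP: either with the pipeline API, which abstracts away all the complexity for you, or by using the SiglipModel class yourself.

Pipeline API

The pipeline lets you use the model in a few lines of code: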

>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> # load pipe
>>> image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224")

>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1979, 'label': '2 cats'}, {'score': 0.0, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
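
Because the scores are independent sigmoid probabilities, they do not sum to 1, and it can be reasonable to keep every label above a cutoff instead of only the top one. The sketch below works on the output shown above; `labels_above` and the cutoff of 0.1 are hypothetical choices for illustration.

```python
# Scores copied from the pipeline output above.
outputs = [
    {"score": 0.1979, "label": "2 cats"},
    {"score": 0.0, "label": "a remote"},
    {"score": 0.0, "label": "a plane"},
]


def labels_above(outputs, threshold):
    """Return labels whose independent sigmoid score meets the threshold."""
    return [o["label"] for o in outputs if o["score"] >= threshold]


best = max(outputs, key=lambda o: o["score"])
print(best["label"])
print(labels_above(outputs, threshold=0.1))
```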

Using the model yourself

If you want to do the pre- and postprocessing yourself, here is how to do that:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
>>> # follows the pipeline prompt template to get same results
>>> texts = [f'This is a photo of {label}.' for label in candidate_labels]
>>> # important: we pass `padding=max_length` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
19.8% that image 0 is '2 cats'
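
The `padding="max_length"` argument used in the call above pads every tokenized text to the same fixed length. The following is only a conceptual sketch of that idea: `pad_to_max_length`, the token ids, the pad id of 1, and the length of 8 are all hypothetical, and the real values come from the tokenizer's own configuration.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad a list of token ids to exactly max_length entries."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    return token_ids + [pad_id] * (max_length - len(token_ids))


# Hypothetical token ids for two prompts of different lengths.
batch = [[101, 7, 42, 9], [101, 7, 42, 13, 88, 5]]
padded = [pad_to_max_length(ids, max_length=8, pad_id=1) for ids in batch]

for row in padded:
    print(row)
```

Every row ends up with the same length regardless of how long the original prompt was, which matches the fixed-length inputs the tips say the model was trained on.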

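Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SigLIP.

Zero-shot image classification task guide
Demo notebooks for SigLIP can be found here. 🌎

If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.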

Combining SigLIP and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

pip install -U flash-attn --no-build-isolation

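Make sure your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half precision (e.g. `torch.float16`).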
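
A quick environment check can save a failed model load. The sketch below is an assumption-laden helper, not part of Transformers: the function names are hypothetical, and the minimum CUDA compute capability of 8.0 (Ampere or newer) reflects the flash-attn README as best understood at the time of writing, so verify it against the repository for your version.

```python
import importlib.util


def flash_attn_installed():
    """True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def capability_supports_fa2(capability, minimum=(8, 0)):
    """Compare a (major, minor) compute capability with an assumed minimum."""
    return tuple(capability) >= tuple(minimum)


def cuda_capability():
    """Return the compute capability of GPU 0, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = cuda_capability()
ready = (
    capability is not None
    and capability_supports_fa2(capability)
    and flash_attn_installed()
)
print(f"capability={capability}, flash_attn={flash_attn_installed()}, ready={ready}")
```

If `ready` is false, loading with the SDPA or default attention implementation is the safer choice.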

To load and run a model using Flash Attention 2, refer to the snippet below:

>>> import torch
>>> import requests
>>> from PIL import Image
>>> from transformers import SiglipProcessor, SiglipModel
>>> device = "cuda" # the device to load the model onto

>>> model = SiglipModel.from_pretrained(
...     "google/siglip-so400m-patch14-384",
...     attn_implementation="flash_attention_2",
...     torch_dtype=torch.float16,
...     device_map=device,
... )
>>> processor = SiglipProcessor.from_pretrained("google/siglip-so400m-patch14-384")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
>>> # follows the pipeline prompt template to get same results
>>> texts = [f'This is a photo of {label}.' for label in candidate_labels]
>>> # important: we pass `padding=max_length` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
>>> inputs = inputs.to(device)

>>> with torch.no_grad():
...     with torch.autocast(device):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
51.3% that image 0 is '2 cats'

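Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

You can set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA. Make sure you have torch>=2.1.1.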

>>> import torch
>>> from transformers import SiglipModel

>>> device = "cuda" # the device to load the model onto

>>> model = SiglipModel.from_pretrained(
...     "google/siglip-so400m-patch14-384",
...     attn_implementation="sdpa",
...     torch_dtype=torch.float16,
...     device_map=device,
... )

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).
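
Putting the version requirement and the Flash Attention 2 option together, a script might pick the `attn_implementation` value at runtime. The helper below is a hypothetical sketch, not a Transformers API: it assumes the three string values `"flash_attention_2"`, `"sdpa"`, and `"eager"` (the plain PyTorch implementation), and it takes the torch version as a string so it can be tried without a GPU.

```python
import re


def parse_version(version):
    """Turn a string such as '2.1.1+cu121' into the tuple (2, 1, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def choose_attn_implementation(torch_version, flash_attn_available=False):
    """Pick a value for `attn_implementation` (hypothetical helper)."""
    if flash_attn_available:
        return "flash_attention_2"
    if parse_version(torch_version) >= (2, 1, 1):
        return "sdpa"
    return "eager"


print(choose_attn_implementation("2.1.1+cu121"))
print(choose_attn_implementation("2.0.1"))
print(choose_attn_implementation("2.3.0", flash_attn_available=True))
```

In a real script, the first argument would typically be `torch.__version__`, and the result would be passed to `from_pretrained()` as `attn_implementation`.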

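Expected speedups

Below is an expected speedup diagram that compares inference time between the native implementation in transformers, using the google/siglip-so400m-patch14-384 checkpoint in float16 precision, and the Flash Attention 2 / SDPA versions of the model at different batch sizes.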
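
To reproduce a comparison of this kind on your own hardware, a small timing harness is enough. The sketch below is a generic, hypothetical helper and not the benchmark used for the diagram: the two `*_forward` functions are stand-ins that just sleep, and would be replaced by real forward passes of the two model variants. On a GPU, passing `sync=torch.cuda.synchronize` is advisable because CUDA kernels run asynchronously.

```python
import statistics
import time


def time_call(fn, warmup=2, repeats=5, sync=None):
    """Median wall-clock seconds of fn(); `sync` is an optional barrier callable."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup_table(baseline_fn, candidate_fn, batch_sizes, sync=None):
    """Map each batch size to baseline_time / candidate_time."""
    table = {}
    for batch_size in batch_sizes:
        baseline = time_call(lambda: baseline_fn(batch_size), sync=sync)
        candidate = time_call(lambda: candidate_fn(batch_size), sync=sync)
        table[batch_size] = baseline / candidate
    return table


# Stand-in workloads; replace with real forward passes of the two variants.
def slow_forward(batch_size):
    time.sleep(0.002 * batch_size)


def fast_forward(batch_size):
    time.sleep(0.001 * batch_size)


print(speedup_table(slow_forward, fast_forward, batch_sizes=[1, 2, 4]))
```

Using the median over several repeats, after a few warm-up calls, reduces the influence of one-off costs such as kernel compilation or cache warm-up.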
