OWL-ViT

Overview

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is an open-vocabulary object detection network trained on a variety of (image, text) pairs. It can be used to query an image with one or more text queries to search for and detect the target objects described in the text.

The abstract from the paper is the following:

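Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.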

OWL-ViT architecture. Taken from the original paper.

This model was contributed by adirik. The original code can be found here.

Usage tips

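OWL-ViT is a zero-shot text-conditioned object detection model. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each Transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch, then fine-tune it end-to-end together with the classification and box heads on standard detection datasets using a bipartite matching loss. One or more text queries per image can be used to perform zero-shot text-conditioned object detection.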
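
The open-vocabulary classification step described above can be illustrated with a minimal, dependency-free sketch. It assumes that each image token and each text query has already been mapped to an embedding of the same dimension, and it scores every (token, query) pair by cosine similarity. The function names and toy embeddings are hypothetical; the actual model also applies learned projections and a learned shift and scale to the logits, which are omitted here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def open_vocab_logits(token_embeds, query_embeds):
    """Return one cosine-similarity score per (image token, text query) pair.

    The text query embeddings play the role of the classification layer
    weights, so changing the queries changes the set of classes.
    """
    tokens = [l2_normalize(t) for t in token_embeds]
    queries = [l2_normalize(q) for q in query_embeds]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Hypothetical toy embeddings: 3 image tokens, 2 text queries, dimension 2.
token_embeds = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
query_embeds = [[3.0, 0.0], [0.0, 1.0]]  # stand-ins for e.g. "cat" and "dog"

logits = open_vocab_logits(token_embeds, query_embeds)
# Index of the best-matching query for each image token.
best_query = [max(range(len(row)), key=row.__getitem__) for row in logits]
```

Because the "classifier weights" are just text embeddings, adding a new class amounts to appending one more query embedding; no retraining of the head is implied by this sketch.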

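OwlViTImageProcessor can be used to resize (or rescale) and normalize images for the model, and CLIPTokenizer is used to encode the text. OwlViTProcessor wraps OwlViTImageProcessor and CLIPTokenizer into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using OwlViTProcessor and OwlViTForObjectDetection.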

>>> import requests
>>> from PIL import Image
>>> import torch

>>> from transformers import OwlViTProcessor, OwlViTForObjectDetection

>>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
>>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = [["a photo of a cat", "a photo of a dog"]]
>>> inputs = processor(text=texts, images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
>>> target_sizes = torch.Tensor([image.size[::-1]])
>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
>>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
>>> text = texts[i]
>>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
>>> for box, score, label in zip(boxes, scores, labels):
...     box = [round(i, 2) for i in box.tolist()]
...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
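
To make the post-processing step in the example above more concrete, here is a rough, dependency-free sketch of what such a step typically involves. It assumes the model emits one box per token as normalized (center_x, center_y, width, height) and one logit per text query, takes the best query per box, applies a sigmoid, filters by threshold, and rescales to pixels. The helper names and toy values are hypothetical, and this is not the library's implementation.

```python
import math


def center_to_corners(box):
    """Convert (cx, cy, w, h) to (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def post_process(logits, boxes, target_size, threshold=0.1):
    """Keep confident boxes and rescale them to pixel coordinates.

    logits: one list of per-query logits for each predicted box.
    boxes: normalized (cx, cy, w, h) boxes, one per prediction.
    target_size: (height, width) of the original image.
    """
    height, width = target_size
    detections = []
    for box_logits, box in zip(logits, boxes):
        # Best-matching text query for this box.
        label = max(range(len(box_logits)), key=box_logits.__getitem__)
        score = sigmoid(box_logits[label])
        if score <= threshold:
            continue
        xmin, ymin, xmax, ymax = center_to_corners(box)
        detections.append({
            "label": label,
            "score": score,
            "box": [xmin * width, ymin * height, xmax * width, ymax * height],
        })
    return detections


# Hypothetical toy predictions for a 640x480 (width x height) image.
logits = [[2.0, -3.0], [-4.0, -5.0]]
boxes = [[0.5, 0.5, 0.5, 0.5], [0.25, 0.25, 0.1, 0.1]]
detections = post_process(logits, boxes, target_size=(480, 640))
```

Note that PIL's `image.size` is (width, height), which is why the example above reverses it with `[::-1]` to obtain the (height, width) order expected for `target_sizes`.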

Resources

A demo notebook on using OWL-ViT for zero-shot and one-shot (image-guided) object detection can be found here.

OwlViTConfig

class transformers.OwlViTConfig

from_text_vision_configs

OwlViTTextConfig

class transformers.OwlViTTextConfig

OwlViTVisionConfig

class transformers.OwlViTVisionConfig

OwlViTImageProcessor

class transformers.OwlViTImageProcessor

preprocess
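
As a rough illustration of the rescale-and-normalize part of preprocessing, the sketch below maps one 8-bit RGB pixel to model input values. It assumes a rescale factor of 1/255 and the per-channel mean and standard deviation published with CLIP; the values a given checkpoint actually uses are defined by its preprocessor configuration, and `normalize_pixel` is a hypothetical helper, not a library function. Resizing is left out.

```python
# Per-channel statistics published with CLIP (RGB order); assumed here,
# the checkpoint's preprocessor configuration is authoritative.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD, rescale_factor=1 / 255):
    """Rescale 0-255 channel values to [0, 1], then standardize each channel."""
    return tuple((c * rescale_factor - m) / s for c, m, s in zip(rgb, mean, std))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

In practice the same arithmetic is applied to whole image arrays at once; the per-pixel form is used here only to keep the example free of dependencies.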

post_process_object_detection

post_process_image_guided_detection
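
Image-guided detection tends to produce many overlapping boxes for the same object, so post-processing usually includes non-maximum suppression (NMS). The following is a minimal greedy NMS sketch in plain Python, assuming boxes in (xmin, ymin, xmax, ymax) format. The function names and the 0.3 default are illustrative assumptions, not the library's implementation.

```python
def box_iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, nms_threshold=0.3):
    """Return indices of boxes kept after greedy non-maximum suppression.

    Boxes are visited from highest to lowest score; a box is kept only if
    its IoU with every already-kept box is at most nms_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical toy boxes: the first two overlap heavily, the third is separate.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
keep = greedy_nms(boxes, scores)  # keep == [0, 2]
```

A lower `nms_threshold` suppresses more aggressively, while a value of 1.0 effectively disables suppression in this sketch.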

OwlViTFeatureExtractor

class transformers.OwlViTFeatureExtractor

__call__

post_process

post_process_image_guided_detection

OwlViTProcessor

class transformers.OwlViTProcessor

batch_decode

decode

post_process

post_process_image_guided_detection

post_process_object_detection

OwlViTModel

class transformers.OwlViTModel

forward

get_text_features

get_image_features

OwlViTTextModel

class transformers.OwlViTTextModel

forward

OwlViTVisionModel

class transformers.OwlViTVisionModel

forward

OwlViTForObjectDetection

class transformers.OwlViTForObjectDetection

forward

image_guided_detection

call