MobileViT

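Overview

The MobileViT model was proposed in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces the local processing of convolutions with global processing using transformers.

The abstract from the paper is the following:

Light-weight convolutional neural networks (CNNs) are the de facto choice for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.

This model was contributed by matthijs. The TensorFlow version of the model was contributed by sayakpaul. The original code and weights can be found here.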

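Usage tips

MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow this tutorial for a lightweight introduction.
You can use MobileViTImageProcessor to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images in BGR pixel order (not RGB).
The available image classification checkpoints are pre-trained on ImageNet-1k (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
The segmentation model uses a DeepLabV3 head. The available semantic segmentation checkpoints are pre-trained on PASCAL VOC.
As the name suggests, MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with TensorFlow Lite.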
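
The BGR note in the tips above is easy to get wrong when preprocessing by hand, so here is a minimal sketch of the channel flip using NumPy. The helper name `rgb_to_bgr_unit_range` is hypothetical, and the rescaling by 1/255 with no mean/std normalization is an assumption about what the checkpoints expect; resizing and cropping are left out. When in doubt, MobileViTImageProcessor remains the safer route.

```python
import numpy as np


def rgb_to_bgr_unit_range(image_rgb: np.ndarray) -> np.ndarray:
    """Hypothetical helper: HxWx3 uint8 RGB image -> float32 BGR image in [0, 1].

    Assumption: pixel values are only rescaled by 1/255 (no mean/std normalization).
    """
    if image_rgb.ndim != 3 or image_rgb.shape[-1] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")
    # Reverse the last axis so (R, G, B) becomes (B, G, R).
    image_bgr = image_rgb[..., ::-1]
    # Convert to float32 and rescale from [0, 255] to [0, 1].
    return image_bgr.astype(np.float32) / 255.0


# Tiny 1x2 image: one pure-red pixel followed by one pure-blue pixel.
pixels = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
converted = rgb_to_bgr_unit_range(pixels)
```

After the flip, the red pixel carries its value in the last channel and the blue pixel in the first, which is the order a BGR-trained checkpoint would read.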

You can use the following code to convert a MobileViT checkpoint (whether image classification or semantic segmentation) into a TensorFlow Lite model:

from transformers import TFMobileViTForImageClassification
import tensorflow as tf

# Checkpoint on the Hugging Face Hub to convert.
model_ckpt = "apple/mobilevit-xx-small"
model = TFMobileViTForImageClassification.from_pretrained(model_ckpt)

# Build a converter from the Keras model and enable the default optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Allow built-in TFLite ops, plus select TensorFlow ops where needed.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

# Write the serialized model to disk, named after the checkpoint.
tflite_filename = model_ckpt.split("/")[-1] + ".tflite"
with open(tflite_filename, "wb") as f:
    f.write(tflite_model)

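The resulting model will be only about 1 MB, making it a good fit for mobile applications where resources and network bandwidth can be constrained.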
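
To check the size for a particular conversion, one option is to measure the file on disk. The sketch below uses only the standard library. `file_size_mb` is a hypothetical helper name, and the path is assumed to be the file written by the conversion code above.

```python
import os


def file_size_mb(path: str) -> float:
    """Hypothetical helper: size of the file at `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1_000_000


# Assumed filename, matching what the conversion code above would write.
tflite_filename = "mobilevit-xx-small.tflite"
if os.path.exists(tflite_filename):
    print(f"{tflite_filename}: {file_size_mb(tflite_filename):.2f} MB")
```

The exact figure will depend on the checkpoint and on the converter settings, so treat the number reported here as a measurement of one run.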

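Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT.

Image Classification

MobileViTForImageClassification is supported by this example script and notebook.
See also: Image classification task guide

Semantic segmentation

Semantic segmentation task guide

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.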

MobileViTConfig

class transformers.MobileViTConfig

MobileViTFeatureExtractor

class transformers.MobileViTFeatureExtractor

__call__

post_process_semantic_segmentation

MobileViTImageProcessor

class transformers.MobileViTImageProcessor

preprocess

post_process_semantic_segmentation

MobileViTModel

class transformers.MobileViTModel

forward

MobileViTForImageClassification

class transformers.MobileViTForImageClassification

forward

MobileViTForSemanticSegmentation

class transformers.MobileViTForSemanticSegmentation

forward

TFMobileViTModel

class transformers.TFMobileViTModel

call

TFMobileViTForImageClassification

class transformers.TFMobileViTForImageClassification

call

TFMobileViTForSemanticSegmentation

class transformers.TFMobileViTForSemanticSegmentation

call