
Optimizing Vision Transformer Model for Deployment

Created On: Mar 15, 2021 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024

Jeff Tang, Geeta Chauhan

Vision Transformer models apply cutting-edge attention-based transformer models, introduced in Natural Language Processing to achieve all kinds of state-of-the-art (SOTA) results, to Computer Vision tasks. Facebook Data-efficient Image Transformers DeiT is a Vision Transformer model trained on ImageNet for image classification.

In this tutorial, we will first cover what DeiT is and how to use it, then go through the complete steps of scripting, quantizing, optimizing, and using the model in iOS and Android apps. We will also compare the performance of the quantized, optimized model with the non-quantized, non-optimized model, and show the benefits of applying quantization and optimization along the way.

What is DeiT

Convolutional Neural Networks (CNNs) have been the main models for image classification since deep learning took off in 2012, but CNNs typically require hundreds of millions of images for training to achieve SOTA results. DeiT is a vision transformer model that requires a lot less data and computing resources for training to compete with the leading CNNs in performing image classification, which is made possible by two key components of DeiT:

  • Data augmentation that simulates training on a much larger dataset;

  • Native distillation that allows the transformer network to learn from a CNN's output.

DeiT shows that Transformers can be successfully applied to computer vision tasks, even with limited access to data and resources. For more details on DeiT, see the repo and paper.

Classifying Images with DeiT

Follow the README.md in the DeiT repository for detailed information on how to classify images using DeiT, or for a quick test, first install the required packages:

pip install torch torchvision timm pandas requests

To run in Google Colab, install the dependencies by running the following command:

!pip install timm pandas requests

then run the script below:

from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

print(torch.__version__)
# should be 1.8.0 or above


model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()

transform = transforms.Compose([
    transforms.Resize(256, interpolation=3),  # interpolation=3 is bicubic
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
])

img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)
img = transform(img)[None,]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
2.5.0+cu124
Downloading: "https://github.com/facebookresearch/deit/zipball/main" to /var/lib/ci-user/.cache/torch/hub/main.zip
/usr/local/lib/python3.10/dist-packages/timm/models/registry.py:4: FutureWarning:

Importing from timm.models.registry is deprecated, please import via timm.models

/usr/local/lib/python3.10/dist-packages/timm/models/layers/__init__.py:48: FutureWarning:

Importing from timm.models.layers is deprecated, please import via timm.layers

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:63: UserWarning:

Overwriting deit_tiny_patch16_224 in registry with models.deit_tiny_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:78: UserWarning:

Overwriting deit_small_patch16_224 in registry with models.deit_small_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:93: UserWarning:

Overwriting deit_base_patch16_224 in registry with models.deit_base_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:108: UserWarning:

Overwriting deit_tiny_distilled_patch16_224 in registry with models.deit_tiny_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:123: UserWarning:

Overwriting deit_small_distilled_patch16_224 in registry with models.deit_small_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:138: UserWarning:

Overwriting deit_base_distilled_patch16_224 in registry with models.deit_base_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:153: UserWarning:

Overwriting deit_base_patch16_384 in registry with models.deit_base_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:168: UserWarning:

Overwriting deit_base_distilled_patch16_384 in registry with models.deit_base_distilled_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

Downloading: "https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/deit_base_patch16_224-b5f2ef4d.pth

  0%|          | 0.00/330M [00:00<?, ?B/s]
  6%|6         | 20.5M/330M [00:00<00:01, 214MB/s]
 13%|#2        | 41.8M/330M [00:00<00:01, 219MB/s]
 19%|#9        | 63.0M/330M [00:00<00:01, 220MB/s]
 26%|##5       | 84.2M/330M [00:00<00:01, 221MB/s]
 32%|###1      | 106M/330M [00:00<00:01, 221MB/s]
 38%|###8      | 127M/330M [00:00<00:00, 221MB/s]
 45%|####4     | 148M/330M [00:00<00:00, 222MB/s]
 51%|#####1    | 169M/330M [00:00<00:00, 221MB/s]
 58%|#####7    | 190M/330M [00:00<00:00, 222MB/s]
 64%|######4   | 212M/330M [00:01<00:00, 222MB/s]
 71%|#######   | 233M/330M [00:01<00:00, 222MB/s]
 77%|#######6  | 254M/330M [00:01<00:00, 222MB/s]
 83%|########3 | 275M/330M [00:01<00:00, 222MB/s]
 90%|########9 | 297M/330M [00:01<00:00, 222MB/s]
 96%|#########6| 318M/330M [00:01<00:00, 222MB/s]
100%|##########| 330M/330M [00:01<00:00, 221MB/s]
269

The output should be 269, which, according to the ImageNet list of class index to labels file, maps to timber wolf, grey wolf, gray wolf, Canis lupus.
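
If you want to look up the label yourself, a minimal sketch like the one below can map the predicted index to a human-readable name. It is not part of the original tutorial, and the labels-file URL (the imagenet_classes.txt file from the pytorch/hub repository) is an assumption; any ImageNet class-index-to-label file will do.

import requests

# Hypothetical helper: fetch a list of the 1000 ImageNet class names (one per line)
# and print the name for the predicted class index from the script above.
labels_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
labels = requests.get(labels_url).text.splitlines()
print(labels[clsidx.item()])  # expected to print a wolf-related label such as 'timber wolf'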

Now that we have verified that we can use the DeiT model to classify images, let's see how to modify the model so it can run on iOS and Android apps.

Scripting DeiT

To use the model on mobile, we first need to script it. See the Script and Optimize recipe for a quick overview. Run the code below to convert the DeiT model used in the previous step to the TorchScript format that can run on mobile.

model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")
Using cache found in /var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main

The scripted model file fbdeit_scripted.pt, of size about 346MB, is generated.
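
As a quick sanity check (not part of the original tutorial), you can reload the saved TorchScript file and confirm it still predicts the same class for the image used above:

# Reload the scripted model from disk and rerun inference on the same image tensor.
reloaded = torch.jit.load("fbdeit_scripted.pt")
reloaded.eval()
print(torch.argmax(reloaded(img)).item())  # should print 269 again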

Quantizing DeiT

To reduce the trained model size significantly while keeping the inference accuracy about the same, quantization can be applied to the model. Thanks to the transformer model used in DeiT, we can easily apply dynamic quantization to the model, because dynamic quantization works best for LSTM and transformer models (see here for more details).

Now run the code below:

# Use 'x86' for server inference (the old 'fbgemm' is still available but 'x86' is the recommended default) and ``qnnpack`` for mobile inference.
backend = "x86" # replaced with ``qnnpack`` causing much worse inference speed for quantized model on this notebook
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend

quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")

This generates the scripted and quantized version of the model, fbdeit_scripted_quantized.pt, with size about 89MB, a 74% reduction from the non-quantized model size of 346MB!
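
If you want to verify the sizes on disk yourself, a small sketch like the one below (not from the original tutorial) compares the two saved files and computes the reduction:

import os

# Compare the on-disk sizes of the scripted model and the scripted + quantized model.
scripted_size = os.path.getsize("fbdeit_scripted.pt")
quantized_size = os.path.getsize("fbdeit_scripted_quantized.pt")
print("scripted: {:.0f}MB, quantized: {:.0f}MB, reduction: {:.0f}%".format(
    scripted_size / 1e6, quantized_size / 1e6,
    (scripted_size - quantized_size) / scripted_size * 100))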

You can use the scripted_quantized_model to generate the same inference result:

out = scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# The same output 269 should be printed
269

Optimizing DeiT

The final step before using the quantized and scripted model on mobile is to optimize it:

from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_scripted_quantized.pt")

The generated fbdeit_optimized_scripted_quantized.pt file has about the same size as the quantized, scripted, but non-optimized model. The inference result remains the same.

out = optimized_scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# Again, the same output 269 should be printed
269

Using the Lite Interpreter

To see how much model size reduction and inference speedup the Lite Interpreter can bring, let's create the lite version of the model.

optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optimized_scripted_quantized_lite.ptl")

Although the lite model size is comparable to the non-lite version, a speedup is expected when running the lite version on mobile.
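
On a desktop you can also load the .ptl file the way a mobile runtime would and confirm it produces the same prediction. The sketch below is not part of the original tutorial and uses torch.jit.mobile._load_for_lite_interpreter, a private helper in recent PyTorch releases, so treat it as an assumption rather than a stable API:

from torch.jit.mobile import _load_for_lite_interpreter

# Load the lite (.ptl) model through the lite-interpreter loader and rerun inference.
lite_model = _load_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
print(torch.argmax(lite_model(img)).item())  # should print 269 again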

Comparing Inference Speed

To see how the inference speed differs for the five models - the original model, the scripted model, the scripted and quantized model, the scripted, quantized and optimized model, and the lite model - run the code below:

with torch.autograd.profiler.profile(use_cuda=False) as prof1:
    out = model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof2:
    out = scripted_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof3:
    out = scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof4:
    out = optimized_scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof5:
    out = ptl(img)

print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prof3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prof4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))
original model: 171.81ms
scripted model: 107.33ms
scripted & quantized model: 128.83ms
scripted & quantized & optimized model: 146.44ms
lite model: 149.71ms

The results running on Google Colab are:

original model: 1236.69ms
scripted model: 1226.72ms
scripted & quantized model: 593.19ms
scripted & quantized & optimized model: 598.01ms
lite model: 600.72ms

The following results summarize the inference time taken by each model and the percentage reduction of each model relative to the original model.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)

print(df)

"""
        Model                             Inference Time    Reduction
0   original model                             1236.69ms           0%
1   scripted model                             1226.72ms        0.81%
2   scripted & quantized model                  593.19ms       52.03%
3   scripted & quantized & optimized model      598.01ms       51.64%
4   lite model                                  600.72ms       51.43%
"""
                                    Model  ... Reduction
0                          original model  ...        0%
1                          scripted model  ...    37.53%
2              scripted & quantized model  ...    25.02%
3  scripted & quantized & optimized model  ...    14.77%
4                              lite model  ...    12.87%

[5 rows x 3 columns]
