Transformers

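Monocular depth estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a single image. In other words, it is the process of estimating the distance of objects in a scene from a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task because the model must understand the complex relationships between objects in the scene and their corresponding depth, which can be affected by factors such as lighting conditions, occlusion, and texture.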

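There are two main categories of depth estimation:

Absolute depth estimation: This task variant aims to provide exact depth measurements from the camera. The term is used interchangeably with metric depth estimation, where depth is given as precise measurements in meters or feet. Absolute depth estimation models output depth maps whose numerical values represent real-world distances.
Relative depth estimation: Relative depth estimation aims to predict the depth order of objects or points in a scene without providing precise measurements. These models output a depth map that indicates which parts of the scene are closer or farther relative to each other, without giving the actual distance to any of them.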
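To make the distinction concrete, here is a minimal sketch using NumPy. The depth values are made up for illustration, and min-max scaling is only one simple way to discard metric scale; actual relative depth models are not assumed to produce exactly this output.

```python
import numpy as np

# Hypothetical metric depth map in meters (values invented for illustration).
metric_depth = np.array([[1.0, 2.5],
                         [4.0, 10.0]])


def to_relative(depth):
    """Min-max scale a depth map to [0, 1], discarding the metric scale."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min)


relative_depth = to_relative(metric_depth)

# The near-to-far ordering of the pixels is unchanged...
same_order = np.array_equal(
    np.argsort(metric_depth, axis=None),
    np.argsort(relative_depth, axis=None),
)
# ...but a scene that is twice as far away yields the same relative map,
# so real-world distances can no longer be recovered from it.
scale_lost = np.allclose(to_relative(2 * metric_depth), relative_depth)
print(same_order, scale_lost)
```

Both checks hold here: the ordering survives, while the information needed to recover distances in meters does not.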

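In this guide, we will see how to run inference with Depth Anything V2, a state-of-the-art zero-shot relative depth estimation model, and ZoeDepth, an absolute depth estimation model.

Check the depth estimation task page to see all compatible architectures and checkpoints.

Before we begin, we need to install the latest version of Transformers: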

pip install -q -U transformers

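Depth estimation pipeline

The simplest way to try out inference with a model that supports depth estimation is to use the corresponding pipeline(). Instantiate a pipeline from a checkpoint on the Hugging Face Hub: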

>>> from transformers import pipeline
>>> import torch
>>> from accelerate.test_utils.testing import get_backend
>>> # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> device, _, _ = get_backend()
>>> checkpoint = "depth-anything/Depth-Anything-V2-base-hf"
>>> pipe = pipeline("depth-estimation", model=checkpoint, device=device)

Next, choose an image to analyze:

>>> from PIL import Image
>>> import requests

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image

Pass the image to the pipeline:

>>> predictions = pipe(image)

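The pipeline returns a dictionary with two entries. The first one, predicted_depth, is a tensor holding a depth value for each pixel. Because Depth Anything V2 is a relative depth model, these values indicate relative depth rather than distances in meters. The second one, depth, is a PIL image that visualizes the depth estimation result.

Let's take a look at the visualized result: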

>>> predictions["depth"]

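Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how to run the same steps by hand.

Start by loading the model and its associated processor from a checkpoint on the Hugging Face Hub. This time we use a ZoeDepth checkpoint, the absolute depth estimation model mentioned earlier: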

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "Intel/zoedepth-nyu-kitti"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint).to(device)

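Prepare the image input for the model using the image_processor, which takes care of the necessary image transformations such as resizing and normalization: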

>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
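As a rough illustration of what the normalization step involves, here is a minimal NumPy sketch. It is not the processor's actual implementation: the `preprocess` helper is hypothetical, resizing and padding are omitted, and the mean and standard deviation of 0.5 are placeholder values (the real ones come from the checkpoint's processor configuration).

```python
import numpy as np


def preprocess(image_hwc, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Rescale an HWC uint8 image to [0, 1], normalize per channel, and
    return a batched CHW float32 array. mean/std are placeholder values."""
    x = image_hwc.astype(np.float32) / 255.0
    x = (x - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
    # HWC -> CHW, then add a leading batch dimension.
    return x.transpose(2, 0, 1)[None]


# Hypothetical 4x6 RGB image: all black except one white pixel.
image_array = np.zeros((4, 6, 3), dtype=np.uint8)
image_array[0, 0] = 255

batch = preprocess(image_array)
print(batch.shape, batch.min(), batch.max())
```

The result has the batch-first, channels-first layout that pixel_values uses, with values centered around zero.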

Pass the prepared inputs through the model:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)

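Let's post-process the results to remove any padding and resize the depth map to match the original image size. post_process_depth_estimation outputs a list of dictionaries, each containing "predicted_depth".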

>>> # ZoeDepth dynamically pads the input image. Thus we pass the original image size as argument
>>> # to `post_process_depth_estimation` to remove the padding and resize to original dimensions.
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
... )

>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
>>> depth = depth.detach().cpu().numpy() * 255
>>> depth = Image.fromarray(depth.astype("uint8"))
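The min-max scaling used above divides by zero if the depth map happens to be constant. A minimal sketch of a more defensive version is shown below, written with NumPy only; the `depth_to_uint8` helper and its `eps` parameter are hypothetical names introduced for illustration and are not part of Transformers.

```python
import numpy as np


def depth_to_uint8(depth, eps=1e-8):
    """Min-max scale a 2D depth array to the 0..255 range as uint8."""
    depth = np.asarray(depth, dtype=np.float64)
    span = depth.max() - depth.min()
    if span < eps:
        # A constant map has no contrast; return zeros instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - depth.min()) / span
    # astype truncates toward zero, matching the snippet above.
    return (scaled * 255).astype(np.uint8)


example = np.array([[0.5, 1.0], [1.5, 2.5]])
print(depth_to_uint8(example))
```

The returned array can be handed to Image.fromarray in the same way as in the snippet above.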

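In the original implementation, the ZoeDepth model runs inference on both the original and the flipped image and averages the results. The post_process_depth_estimation function can handle this if we pass the flipped outputs to the optional outputs_flipped argument: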

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     outputs_flipped = model(pixel_values=torch.flip(pixel_values, dims=[3]))
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
...     outputs_flipped=outputs_flipped,
... )
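Conceptually, flip averaging mirrors the second prediction back before averaging, so that both depth maps line up pixel for pixel. The sketch below illustrates the idea with NumPy and a toy stand-in for the model. `toy_model` and `predict_with_flip` are hypothetical helpers, and this is not a reproduction of what post_process_depth_estimation does internally.

```python
import numpy as np


def toy_model(image):
    """Hypothetical stand-in for a depth model with a left-to-right bias."""
    weights = np.linspace(1.0, 2.0, image.shape[1])
    return image * weights


def predict_with_flip(model, image):
    """Average the prediction on the image with the prediction on its mirror."""
    pred = model(image)
    pred_flipped = model(image[:, ::-1])
    # Flip the second prediction back so both maps are aligned.
    return 0.5 * (pred + pred_flipped[:, ::-1])


image_array = np.arange(6, dtype=np.float64).reshape(2, 3)
averaged = predict_with_flip(toy_model, image_array)

# Mirroring the input now simply mirrors the output: the bias is averaged out.
mirrored = predict_with_flip(toy_model, image_array[:, ::-1])
print(np.allclose(mirrored, averaged[:, ::-1]))
```

In this toy setup the left-to-right bias of the model cancels, which is the motivation for averaging over the flipped input.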
