DePlot

Overview

DePlot was proposed in the paper DePlot: One-shot visual language reasoning by plot-to-table translation by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun.

The abstract of the paper states the following:

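Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.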

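DePlot is a model trained using the Pix2Struct architecture. You can find more information about Pix2Struct in the Pix2Struct documentation. DePlot is a Visual Question Answering subset of the Pix2Struct architecture. It renders the input question on the image and predicts the answer.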
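The two-step decomposition described in the abstract (plot-to-table translation, then reasoning over the table with an LLM) can be illustrated with a small prompt-building sketch. Everything here is an assumption made for illustration: the `TEMPLATE` wording, the `build_one_shot_prompt` helper, and the toy tables are hypothetical and are not the prompt used in the paper.

```python
# Hypothetical prompt template; the wording is an assumption, not the paper's prompt.
TEMPLATE = "Read the table below and answer the question.\n\n{table}\n\nQ: {question}\nA:"


def build_one_shot_prompt(demo_table, demo_question, demo_answer, table, question):
    """Return one solved demonstration followed by the unanswered query."""
    # The demonstration is the template with its answer filled in.
    demo = TEMPLATE.format(table=demo_table, question=demo_question) + f" {demo_answer}"
    # The query uses the same template but leaves the answer for the LLM.
    query = TEMPLATE.format(table=table, question=question)
    return demo + "\n\n" + query


# Toy tables standing in for DePlot output; these are made-up values.
prompt = build_one_shot_prompt(
    demo_table="Year | Sales\n2020 | 10\n2021 | 15",
    demo_question="In which year were sales higher?",
    demo_answer="2021",
    table="City | Population\nA | 3\nB | 5",
    question="Which city has the larger population?",
)
print(prompt)
```

The resulting string would then be sent to whichever LLM you choose; that step is left out because the original text does not name a specific model or API.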

Usage example

Currently, one checkpoint is available for DePlot:

google/deplot: DePlot fine-tuned on the ChartQA dataset

from transformers import AutoProcessor, Pix2StructForConditionalGeneration
import requests
from PIL import Image

# Load the DePlot checkpoint and its matching processor
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
processor = AutoProcessor.from_pretrained("google/deplot")

# Download a sample chart from the ChartQA validation set
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
image = Image.open(requests.get(url, stream=True).raw)

# The text prompt is rendered onto the image before encoding
inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)

# Decode the generated token ids into the linearized table string
print(processor.decode(predictions[0], skip_special_tokens=True))
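The decoded prediction is a linearized table, so a small parser can turn it back into rows before further processing. The sketch below assumes that rows are separated by the literal token `<0x0A>` and cells by `|`, which is how decoded output from this checkpoint commonly appears; check the separators against your own output. The `parse_linearized_table` helper and the sample string are hypothetical, and the sample is not a real model prediction.

```python
def parse_linearized_table(text, row_sep="<0x0A>", col_sep="|"):
    """Split a linearized table string into a list of rows of stripped cells."""
    rows = []
    for line in text.split(row_sep):
        line = line.strip()
        if line:  # skip empty fragments, e.g. from a trailing separator
            rows.append([cell.strip() for cell in line.split(col_sep)])
    return rows


# Made-up string for illustration only (assumed separators, invented values).
sample = "TITLE | Sales by year <0x0A> Year | Sales <0x0A> 2020 | 10 <0x0A> 2021 | 15"
rows = parse_linearized_table(sample)
print(rows)
```

Both separators are parameters so the helper can be adapted if your decoded output uses plain newlines instead.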

Fine-tuning

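To fine-tune DePlot, refer to the pix2struct fine-tuning notebook. For Pix2Struct models, we have found that fine-tuning the model with Adafactor and a cosine learning rate scheduler leads to faster convergence: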

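from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

# `model` is the Pix2Struct model being fine-tuned; inside a training module class, use self.parameters() instead
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
# Linear warmup for 1000 steps, then cosine decay over the remaining steps
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)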
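To make the shape of this schedule concrete, the following sketch computes the learning-rate multiplier at a given step in plain Python. It is meant to mirror the default behavior of `get_cosine_schedule_with_warmup` (linear warmup followed by a half-cosine decay to zero), but it is an illustration rather than the library's code; the function name `cosine_with_warmup_factor` is hypothetical.

```python
import math


def cosine_with_warmup_factor(step, num_warmup_steps=1000, num_training_steps=40000):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half-cosine decay from 1 down to 0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 0.01  # same base learning rate as in the snippet above
for step in (0, 500, 1000, 20500, 40000):
    print(step, base_lr * cosine_with_warmup_factor(step))
```

With the values above, the learning rate rises to 0.01 by step 1000, is about half of that midway through the decay phase, and approaches zero at step 40000.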

DePlot is a model trained using the Pix2Struct architecture. For the API reference, see the Pix2Struct documentation.
