Transformers

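XLA Integration for TensorFlow Models

Accelerated Linear Algebra, dubbed XLA, is a compiler for accelerating the runtime of TensorFlow models. From the official documentation:

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.

Using XLA in TensorFlow is simple: it ships with the tensorflow library, and it can be triggered with the jit_compile argument of any graph-creating function such as tf.function. When using Keras methods like fit() and predict(), you can enable XLA simply by passing the jit_compile argument to model.compile(). XLA is not limited to these methods, however; it can also be used to accelerate any arbitrary tf.function.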
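The Keras path mentioned above can be sketched as follows. This is a minimal illustration, not part of the original guide: the layer sizes, the random training data, and the choice of optimizer and loss are all assumptions made only so that the snippet runs on its own.

```python
import tensorflow as tf

# A small Keras model; the layer sizes are arbitrary (assumption).
model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),
    ]
)

# jit_compile=True asks Keras to compile its training and prediction steps with XLA.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", jit_compile=True)

# Random data, used only to exercise fit() and predict().
features = tf.random.normal((32, 10))
labels = tf.random.uniform((32,), maxval=5, dtype=tf.int32)

model.fit(features, labels, batch_size=16, epochs=1, verbose=0)
predictions = model.predict(features, batch_size=16, verbose=0)
print(predictions.shape)
```

The rest of the Keras workflow is unchanged; only the extra argument to model.compile() differs from a non-XLA setup.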

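Several TensorFlow methods in 🤗 Transformers have been rewritten to be XLA-compatible, including text generation for models such as GPT2, T5 and OPT, as well as speech processing for models such as Whisper.

While the exact amount of speed-up depends heavily on the model, for TensorFlow text generation models inside 🤗 Transformers we noticed a speed-up of ~100x. This document explains how you can use XLA with these models to get the maximum performance. We also provide links to additional resources if you are interested in learning more about the benchmarks and the design philosophy behind our XLA integration.

Running TF functions with XLA

Let us consider the following model in TensorFlow: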

import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")]
)

The above model accepts inputs of dimension (10, ). We can use the model to run a forward pass like so:

# Generate random inputs for the model.
batch_size = 16
input_vector_dim = 10
random_inputs = tf.random.normal((batch_size, input_vector_dim))

# Run a forward pass.
_ = model(random_inputs)

To run the forward pass with an XLA-compiled function, we need to do the following:

xla_fn = tf.function(model, jit_compile=True)
_ = xla_fn(random_inputs)

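The model's default call() function is what gets compiled into the XLA graph. If there is any other model function you want to compile with XLA, that is also possible: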

my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True)
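In the line above, my_xla_fn is a placeholder for whatever method your model defines. A minimal sketch of what that could look like is given below; the TinyClassifier class and its predict_class method are hypothetical names introduced only for illustration.

```python
import tensorflow as tf


class TinyClassifier(tf.keras.Model):
    """Hypothetical model used only for illustration."""

    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(10, activation="relu")
        self.head = tf.keras.layers.Dense(5, activation="softmax")

    def call(self, inputs):
        return self.head(self.hidden(inputs))

    def predict_class(self, inputs):
        # An extra method besides call(): returns the most likely class index.
        return tf.argmax(self(inputs), axis=-1)


classifier = TinyClassifier()
inputs = tf.random.normal((16, 10))

# Run once eagerly so the layer weights exist before the function is traced.
_ = classifier(inputs)

# Compile the extra method, rather than call(), with XLA.
xla_predict_class = tf.function(classifier.predict_class, jit_compile=True)
class_ids = xla_predict_class(inputs)
print(class_ids.shape)
```

Any method built from XLA-compatible TensorFlow ops can be wrapped this way.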

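Running a TF text generation model with XLA from 🤗 Transformers

To enable XLA-accelerated generation within 🤗 Transformers, you need to have a recent version of transformers installed. You can install it by running: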

pip install transformers --upgrade

And then you can run the following code:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Will error if the minimal version of Transformers is not installed.
from transformers.utils import check_min_version

check_min_version("4.21.0")


tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

# One line to create an XLA generation function
xla_generate = tf.function(model.generate, jit_compile=True)

tokenized_input = tokenizer(input_string, return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, num_beams=2)

decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
# Generated -- TensorFlow is an open-source, open-source, distributed-source application
# framework for the

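As you can see, enabling XLA on generate() takes just a single line of code. The rest of the code remains unchanged. However, there are a couple of XLA-specific gotchas in the snippet above. You need to be aware of them to realize the speed-ups that XLA can bring. We discuss these in the following section.

Gotchas to be aware of

When you execute an XLA-enabled function (like xla_generate() above) for the first time, it internally tries to infer the computation graph, which is time-consuming. This process is known as "tracing".

You might notice that the generation time on this first call is not fast. Successive calls of xla_generate() (or any other XLA-enabled function) do not have to infer the computation graph again, provided the inputs to the function have the same shape as the one with which the computation graph was initially built. While this is not a problem for modalities with fixed input shapes (e.g., images), you must pay attention if you are working with modalities that have variable input shapes (e.g., text).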
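The effect of input shapes on tracing can be observed with a small stand-alone sketch. The double function and the trace_log list are hypothetical and not part of 🤗 Transformers; the sketch relies on the fact that Python side effects inside a tf.function run only while the function is being traced.

```python
import tensorflow as tf

trace_log = []


@tf.function(jit_compile=True)
def double(x):
    # This Python side effect runs only while TensorFlow traces the function,
    # not on every call, so it records each (re)trace.
    trace_log.append(tuple(x.shape.as_list()))
    return x * 2.0


double(tf.ones((1, 8)))   # first call: traced and compiled
double(tf.zeros((1, 8)))  # same shape and dtype: reuses the existing graph
double(tf.ones((1, 16)))  # new shape: traced and compiled again

print(trace_log)
```

The log should contain one entry per distinct input shape rather than one per call, which is why keeping input shapes stable matters for text inputs.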

To ensure that xla_generate() always operates on the same input shapes, you can specify the padding arguments when calling the tokenizer.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

xla_generate = tf.function(model.generate, jit_compile=True)

# Here, we call the tokenizer with padding options.
tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf")

generated_tokens = xla_generate(**tokenized_input, num_beams=2)
decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")

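This way, you can ensure that the inputs to xla_generate() always have the shape it was traced with, which speeds up generation. You can verify this with the code below: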

import time
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")

xla_generate = tf.function(model.generate, jit_compile=True)

for input_string in ["TensorFlow is", "TensorFlow is a", "TFLite is a"]:
    tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf")
    start = time.time_ns()
    generated_tokens = xla_generate(**tokenized_input, num_beams=2)
    end = time.time_ns()
    print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")

On a Tesla T4 GPU, you can expect output like this:

Execution time -- 30819.6 ms

Execution time -- 79.0 ms

Execution time -- 78.9 ms

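The first call to xla_generate() is time-consuming because of tracing, but the successive calls are orders of magnitude faster. Keep in mind that any change in the generation options at any point triggers re-tracing and thus slows generation down.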
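As a rough illustration of why pad_to_multiple_of=8 keeps the input shape stable in the timing loop above, the padded sequence length can be computed by hand. The padded_length helper and the token counts below are hypothetical; real token counts depend on the tokenizer.

```python
import math


def padded_length(num_tokens: int, multiple: int = 8) -> int:
    """Length after padding up to the next multiple (hypothetical helper)."""
    return math.ceil(num_tokens / multiple) * multiple


# Hypothetical token counts for a few prompts of different lengths.
for num_tokens in [3, 4, 8, 9]:
    print(num_tokens, "->", padded_length(num_tokens))
```

Prompts of up to 8 tokens all land on a length of 8 and can reuse one traced graph, while a 9-token prompt is padded to 16 and would trigger a new trace. A larger multiple means fewer re-traces at the cost of more padding.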

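We did not cover all the text generation options that 🤗 Transformers provides in this document. We encourage you to read the documentation for advanced use cases.

Additional Resources

Here are some additional resources if you want to delve deeper into XLA in 🤗 Transformers and in general.

This Colab Notebook provides an interactive demonstration if you want to try out the XLA-compatible encoder-decoder (like T5) and decoder-only (like GPT2) text generation models.
This blog post provides an overview of the comparison benchmarks for XLA-compatible models along with a friendly introduction to XLA in TensorFlow.
This blog post discusses our design philosophy behind adding XLA support to the TensorFlow models in 🤗 Transformers.
Recommended posts for learning more about XLA and TensorFlow graphs in general: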
