DeepSpeed Sparse Attention
In this tutorial, we describe how to use DeepSpeed Sparse Attention (SA) and its building-block kernels. The easiest way to use SA is through the DeepSpeed launcher, which we describe with an example in the "How to use sparse attention with the DeepSpeed launcher" section. Before that, we introduce the modules provided by DeepSpeed SA in the next section.
Note: Currently, DeepSpeed Sparse Attention can be used only on NVIDIA V100 or A100 GPUs and requires Torch >= 1.6 and CUDA 10.1, 10.2, 11.0, or 11.1.
Sparse attention modules
- MatMul: This module handles block-sparse matrix-matrix multiplication. Currently it supports SDD, DSD, and DDS, as described in the DeepSpeed Sparse Attention section.
- Softmax: This module applies block-sparse softmax. It handles both forward and backward passes.
- SparseSelfAttention: This module uses the MatMul and Softmax kernels and produces the Context Layer output given Query, Keys, and Values. It is a simplified version of the common operations in any self-attention layer. It can also apply:
  - Relative position embedding
  - Attention mask
  - Key padding mask

  on the intermediate attention scores. For more details about self-attention, please check MultiHeadAttention.
- BertSparseSelfAttention: This module contains a simplified BertSelfAttention layer that can be used instead of the original dense Bert self-attention layer. Our implementation is based on DeepSpeedExample.
- SparseAttentionUtils: This module provides a few utility functions to handle adapting pre-trained models with sparse attention (a usage sketch follows this modules list):
  - replace_model_self_attention_with_sparse_self_attention: If you have already loaded a model and want to replace its self-attention module with sparse self-attention, you can simply use this function to handle it for you. It currently handles BERT- and RoBERTa-based pre-trained models, but you can extend it for other model types if yours differs from these two. You also need to extend the position embeddings to handle the new sequence length; this can be done using the extend_position_embedding function.
  - update_tokenizer_model_max_length: This function simply updates the maximum position embedding in your tokenizer with the new value.
  - extend_position_embedding: This function extends the position embeddings based on the current values. For example, if you have a model with a maximum sequence length of 128 and extend it to a sequence length of 1k, it replicates the current embeddings 8 times to initialize the new embeddings. Experiments have shown that such initialization works much better than initializing from scratch, leading to faster convergence.
  - pad_to_block_size: This function pads the input tokens and attention mask on the sequence-length dimension to be a multiple of the block size; this is a requirement for SA.
  - unpad_sequence_output: This function unpads the sequence output if the inputs of the model were padded.
- SparsityConfig: This is an abstract class for sparsity structures. Any sparsity structure needs to extend this class and write its own sparsity-pattern construction in the make_layout function. DeepSpeed currently provides the following structures, which will be described in the "How to config sparsity structures" section:
  - FixedSparsityConfig
  - BSLongformerSparsityConfig
  - BigBirdSparsityConfig
  - VariableSparsityConfig
  - DenseSparsityConfig
Note: Currently, DeepSpeed Transformer Kernels do not support Sparse Attention. To use Sparse Attention, you need to disable Transformer Kernels!
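For illustration, below is a minimal sketch of adapting a pre-trained model with these utilities. The checkpoint name, target sequence length, and sparsity pattern are placeholders, and the call signatures follow the module docstrings; please verify them against your installed DeepSpeed version:

import torch
from transformers import BertForSequenceClassification, BertTokenizer
from deepspeed.ops.sparse_attention import SparseAttentionUtils, FixedSparsityConfig

# Placeholder checkpoint and target sequence length.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_position = 1024

# Swap dense self-attention layers for sparse self-attention (placeholder pattern).
sparsity_config = FixedSparsityConfig(num_heads=model.config.num_attention_heads)
model = SparseAttentionUtils.replace_model_self_attention_with_sparse_self_attention(
    model, max_position, sparsity_config)

# Extend the position embeddings to the new length and keep the tokenizer in sync.
model = SparseAttentionUtils.extend_position_embedding(model, max_position)
tokenizer = SparseAttentionUtils.update_tokenizer_model_max_length(tokenizer, max_position)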
How to use sparse attention with the DeepSpeed launcher
In this section we describe how to use DeepSpeed Sparse Attention through our bing_bert code.
- Update attention module: First, you need to update your attention module based on sparse computation. Here, we use BertSparseSelfAttention, which is the sparse version of BertSelfAttention from our bing_bert code. It rewrites BertSelfAttention, replacing:
attention_scores = torch.matmul(query_layer, key_layer)
attention_scores = attention_scores / math.sqrt(
    self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the BertModel forward() function)
attention_scores = attention_scores + attention_mask
pdtype = attention_scores.dtype
# Normalize the attention scores to probabilities.
attention_probs = self.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer)
with:
context_layer = self.sparse_self_attention(
    query_layer,
    key_layer,
    value_layer,
    key_padding_mask=attention_mask)
where sparse_self_attention is an instance of SparseSelfAttention. This module computes the attention context through sparse attention, replacing the underlying matrix multiplication and softmax with their equivalent sparse versions. You can update any other attention module similarly.
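For illustration only, such an instance could be constructed roughly as follows; this is a minimal sketch in which the head count and pattern are placeholder choices, whereas bing_bert derives its config from the DeepSpeed JSON file shown later:

from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

# Placeholder pattern and head count; any SparsityConfig subclass can be used here.
sparsity_config = FixedSparsityConfig(num_heads=12, block=16)
sparse_self_attention = SparseSelfAttention(sparsity_config=sparsity_config)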
- Setup sparse attention config in the model: You need to set up the sparse attention config. In our example, this is done in BertModel:
self.pad_token_id = config.pad_token_id if hasattr(
    config, 'pad_token_id') and config.pad_token_id is not None else 0
# set sparse_attention_config if it has been selected
self.sparse_attention_config = get_sparse_attention_config(
    args, config.num_attention_heads)
self.encoder = BertEncoder(
    config, args, sparse_attention_config=self.sparse_attention_config)
- Update encoder model: Further, you need to update your encoder model so that it uses SA for the attention layers when SA is enabled. Please check our bing_bert example, in which we use BertSparseSelfAttention instead of BertSelfAttention when SA is enabled:
if sparse_attention_config is not None:
    from deepspeed.ops.sparse_attention import BertSparseSelfAttention
    layer.attention.self = BertSparseSelfAttention(
        config, sparsity_config=sparse_attention_config)
- Pad and unpad input data: You may also need to pad the sequence dimension of input_ids and attention_mask to be a multiple of the sparse block size. As mentioned in the modules section above, DeepSpeed provides utility functions for padding and unpadding. Please check our bing_bert example to see where and how to pad or unpad the inputs or outputs of the model:
if self.sparse_attention_config is not None:
    pad_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = SparseAttentionUtils.pad_to_block_size(
        block_size=self.sparse_attention_config.block,
        input_ids=input_ids,
        attention_mask=extended_attention_mask,
        token_type_ids=token_type_ids,
        position_ids=None,
        inputs_embeds=None,
        pad_token_id=self.pad_token_id,
        model_embeddings=self.embeddings)
.
.
.
# If BertEncoder uses sparse attention, and input_ids were padded, sequence output needs to be unpadded to original length
if self.sparse_attention_config is not None and pad_len > 0:
    encoded_layers[-1] = SparseAttentionUtils.unpad_sequence_output(
        pad_len, encoded_layers[-1])
- Enable sparse attention: To use DeepSpeed Sparse Attention, you need to enable it in the launcher script through the deepspeed_sparse_attention argument:
--deepspeed_sparse_attention
Please check our bing_bert runner script as an example of how to enable SA with the DeepSpeed launcher.
- Add sparsity config: The sparsity config can be set through the DeepSpeed JSON config file. In this example, we have used the fixed sparsity mode, which will be described in the "How to config sparsity structures" section:
"sparse_attention": {
"mode": "fixed",
"block": 16,
"different_layout_per_head": true,
"num_local_blocks": 4,
"num_global_blocks": 1,
"attention": "bidirectional",
"horizontal_global_attention": false,
"num_different_global_patterns": 4
}
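For reference, the same fixed pattern could also be constructed programmatically. The sketch below assumes FixedSparsityConfig's keyword arguments mirror the JSON keys above; num_heads is a placeholder that would normally come from the model config (e.g. config.num_attention_heads):

from deepspeed.ops.sparse_attention import FixedSparsityConfig

# Programmatic counterpart of the JSON block above (num_heads is a placeholder).
sparsity_config = FixedSparsityConfig(num_heads=16,
                                      block=16,
                                      different_layout_per_head=True,
                                      num_local_blocks=4,
                                      num_global_blocks=1,
                                      attention='bidirectional',
                                      horizontal_global_attention=False,
                                      num_different_global_patterns=4)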
How to use individual kernels
As described above, DeepSpeed Sparse Attention can be used as a feature through DeepSpeed, or simply integrated with any transformer model as a self-attention module on its own. Further, the building-block kernels, matrix multiplication and softmax, can be used separately. To use sparse attention alone, you can simply install DeepSpeed and import any of the modules described in the modules section; for example:
from deepspeed.ops.sparse_attention import SparseSelfAttention
Please refer to the docstrings for details of how to use each module separately.
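For instance, a self-contained sketch of running SparseSelfAttention on random inputs might look as follows. It assumes a CUDA device, half-precision tensors of shape (batch, heads, sequence, head_size), and a sequence length that is a multiple of the block size; all sizes and the sparsity pattern are placeholders:

import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

# Placeholder sizes; the sequence length (256) is a multiple of the block size (16).
batch, heads, seq_len, head_size = 2, 4, 256, 64

sparsity_config = FixedSparsityConfig(num_heads=heads, block=16)
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config)

# The block-sparse kernels operate on half-precision CUDA tensors.
q = torch.randn(batch, heads, seq_len, head_size, dtype=torch.half, device='cuda')
k = torch.randn(batch, heads, seq_len, head_size, dtype=torch.half, device='cuda')
v = torch.randn(batch, heads, seq_len, head_size, dtype=torch.half, device='cuda')

context = sparse_attn(q, k, v)  # same shape as the query tensor
print(context.shape)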
How to config sparsity structures
In the following, we describe the supported sparsity structures, their parameter sets, and the flexibility of adding arbitrary sparsity patterns on the self-attention layer. You can update the DeepSpeed config file using any of the supported sparsity structures and set the parameters accordingly. A short instantiation sketch follows the list of structures below.
- SparsityConfig:
  This module is the parent class for all sparsity structures and contains the features shared by all of them. It takes the following parameters:
  - num_heads: an integer determining the number of attention heads of the layer.
  - block: an integer determining the block size. The current implementation of sparse self-attention is based on blocked sparse matrices. This parameter defines the size of such square blocks: Block x Block.
  - different_layout_per_head: a boolean determining if each head should be assigned a different sparsity layout; the default is false, and this will be satisfied based on availability.
- Fixed (FixedSparsityConfig):
  This structure is based on Generative Modeling with Sparse Transformers from OpenAI, in which local and global attention are fixed by the given parameters:
  - num_local_blocks: an integer determining the number of blocks in a local attention window. As illustrated in the figure below (adapted from the original paper), tokens in a local window attend to all tokens local to them. In the case of an autoregressive model, as in the figure, tokens attend to the tokens appearing before them in the local window. In the case of a masked model such as BERT, attention is bidirectional.
  - num_global_blocks: an integer determining how many consecutive blocks in a local window are used as the representative of the window for global attention; also illustrated in the figure below.
  - attention: a string determining the attention type. Attention can be unidirectional, as in autoregressive models, in which tokens attend only to tokens that appear before them in the context. In that case, the upper triangular part of the attention matrix is empty, as in the figure above. Or it can be bidirectional, as in BERT, in which tokens can attend to any other tokens before or after them. Then the upper triangular part of the attention matrix mirrors the lower triangular part in the figure above.
  - horizontal_global_attention: a boolean determining if the blocks that are the global representatives of a local window also attend to all other blocks. This is valid only if the attention type is bidirectional. Looking at the attention matrix, this means global attention includes not only the vertical blocks but also the horizontal blocks.
  - num_different_global_patterns: an integer determining the number of different global attention layouts. While global attention can be fixed by which block(s) are representative of any local window, since there are multiple heads, each head can use a different global representative. For example, with 4 blocks constructing a local window and a global attention size of a single block, we can have 4 different versions in which the first, second, third, or fourth block of each local window can be the global representative of that window. This parameter determines how many such patterns we want. Of course, there is a limitation based on num_local_blocks and num_global_blocks. Further, if you set this to more than one, you need to set different_layout_per_head to True.

- BSLongformer (BSLongformerSparsityConfig):
  This structure is an edited version of Longformer: The Long-Document Transformer, in which, instead of individual tokens, we offer sparsity based on blocks of tokens. The parameters that define this pattern are:
  - num_sliding_window_blocks: an integer determining the number of blocks in the sliding local attention window.
  - global_block_indices: a list of integers determining which blocks are considered as global attention. Given these indices, the corresponding blocks are attended to by all other token blocks, and they attend to all other token blocks. Note that if the global_block_end_indices parameter is set, this parameter is used as the starting index of each global window.
  - global_block_end_indices: a list of integers determining the end indices of the global window blocks. By default this is not used. But if it is set, it must have the same size as the global_block_indices parameter; combining these two parameters, for each index i, the blocks from global_block_indices[i] to global_block_end_indices[i] (exclusive) are considered as global attention blocks.
- BigBird (BigBirdSparsityConfig):
  This structure is based on Big Bird: Transformers for Longer Sequences. It somewhat combines the ideas of the fixed and longformer patterns along with random attention. The following parameters define this structure:
  - num_random_blocks: an integer determining how many blocks in each row block are attended to randomly.
  - num_sliding_window_blocks: an integer determining the number of blocks in the sliding local attention window.
  - num_global_blocks: an integer determining how many consecutive blocks, starting from index 0, are considered as global attention. Global block tokens are attended to by all other block tokens and also attend to all other block tokens.
- Variable (VariableSparsityConfig):
  This structure also combines the ideas of local, global, and random attention. Further, it has the flexibility of defining local windows of variable sizes. The following is the list of parameters that define this structure:
  - num_random_blocks: an integer determining how many blocks in each row block are attended to randomly.
  - local_window_blocks: a list of integers determining the number of blocks in each local attention window. The first number determines the number of blocks in the first local window, the second number the second window, ..., and the last number determines the number of blocks in the remaining local windows.
  - global_block_indices: a list of integers determining which blocks are considered as global attention. Given these indices, the corresponding blocks are attended to by all other token blocks, and they attend to all other token blocks. Note that if the global_block_end_indices parameter is set, this parameter is used as the starting index of each global window.
  - global_block_end_indices: a list of integers determining the end indices of the global window blocks. By default this is not used. But if it is set, it must have the same size as the global_block_indices parameter; combining these two parameters, for each index i, the blocks from global_block_indices[i] to global_block_end_indices[i] (exclusive) are considered as global attention blocks.
  - attention: a string determining the attention type. Attention can be unidirectional, as in autoregressive models, in which tokens attend only to tokens that appear before them in the context. In that case, the upper triangular part of the attention matrix is empty, as in the figure above. Or it can be bidirectional, as in BERT, in which tokens can attend to any other tokens before or after them. Then the upper triangular part of the attention matrix mirrors the lower triangular part in the figure above.
  - horizontal_global_attention: a boolean determining if the blocks that are the global representatives of a local window also attend to all other blocks. This is valid only if the attention type is bidirectional. Looking at the attention matrix, this means global attention includes not only the vertical blocks but also the horizontal blocks.

  The figure below illustrates an example of variable sparsity, in which blue, orange, and green blocks illustrate local, global, and random attention blocks respectively.

Furthermore, we provide a dense pattern (DenseSparsityConfig) that can be used for testing purposes, since it represents full attention.
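As a quick sketch of how these configs are consumed, each structure builds a block-level layout through its make_layout function. The example below is illustrative only; the parameter values are placeholders and only use the names described above:

from deepspeed.ops.sparse_attention import BSLongformerSparsityConfig

# Block-sparse Longformer-style pattern: a 3-block sliding window plus two global blocks.
config = BSLongformerSparsityConfig(num_heads=8,
                                    block=16,
                                    num_sliding_window_blocks=3,
                                    global_block_indices=[0, 1])

# make_layout returns a (num_heads, seq_len // block, seq_len // block) tensor of 0/1
# entries marking which block pairs are attended.
layout = config.make_layout(512)
print(layout.shape)  # expected: torch.Size([8, 32, 32])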
How to support new user-defined sparsity structures
Our building-block kernels, the block-based MatMul and Softmax, can accept any block-based sparsity. This provides the flexibility to apply any block-based sparsity pattern to the attention scores. To define and apply a new sparsity pattern, you can simply follow any of the sparsity structures above. You need to add a new class that extends SparsityConfig and define the make_layout function based on how your sparsity is structured. You can add any extra parameters you may need, or just use the default parameters of the parent class.
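For example, a hypothetical block-diagonal structure, in which every block attends only to itself, could be sketched as follows; the class name and pattern are illustrative and not part of DeepSpeed:

import torch
from deepspeed.ops.sparse_attention import SparsityConfig

class BlockDiagonalSparsityConfig(SparsityConfig):
    """Hypothetical structure in which every block attends only to itself."""

    def __init__(self, num_heads, block=16, different_layout_per_head=False):
        super().__init__(num_heads, block, different_layout_per_head)

    def make_layout(self, seq_len):
        if seq_len % self.block != 0:
            raise ValueError('Sequence length must be a multiple of the block size.')
        num_blocks = seq_len // self.block
        # layout[h, i, j] == 1 means block-row i attends to block-column j for head h.
        layout = torch.zeros((self.num_heads, num_blocks, num_blocks), dtype=torch.int64)
        for i in range(num_blocks):
            layout[:, i, i] = 1
        return layout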