DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality

What is DeepSpeed Data Efficiency: DeepSpeed Data Efficiency is a library purposely built to make better use of data, increase training efficiency, and improve model quality.

Why use DeepSpeed Data Efficiency: DeepSpeed Data Efficiency offers novel data efficiency techniques to achieve better training efficiency and/or better model quality. DeepSpeed Data Efficiency is designed with extensibility, flexibility, and composability in mind, which makes it easier to customize the techniques, apply them to various training tasks, and compose multiple techniques together. We highly recommend that you also read our blog to learn more about (at a high level) why we built DeepSpeed Data Efficiency and what benefits it provides to users. Additional technical details can be found in our papers: "Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers", which describes the random-LTD technique, and "DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing", which describes the curriculum learning technique and the overall DeepSpeed Data Efficiency framework.

How to use DeepSpeed Data Efficiency: In the tutorial below, the first two sections describe the data efficiency techniques supported by the library. The third section describes how to compose the two techniques to achieve even better training efficiency/model quality.

1. Curriculum Learning

1.1 What is Curriculum Learning

Curriculum learning (proposed by Yoshua Bengio et al.) aims to improve training convergence speed by presenting relatively easier or simpler examples earlier during training. Building a curriculum learning solution usually requires two components: a difficulty metric (i.e., how to quantify the difficulty of each data sample) and a pacing function (i.e., how to decide the curriculum difficulty range when sampling the next training data batch).
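The two components can be illustrated with a short, self-contained sketch. The metric and pacing function below are hypothetical examples (not the library's built-ins): difficulty is measured by sequence length, and the accepted difficulty grows linearly over training.

```python
# Illustrative sketch of the two curriculum learning components
# (hypothetical metric and pacing function, not DeepSpeed's built-ins).

def difficulty_metric(sample):
    """Difficulty metric: quantify how hard a sample is.
    Here we simply use the sequence length (longer = harder)."""
    return len(sample)

def pacing_function(step, total_steps, min_difficulty, max_difficulty):
    """Pacing function: the max difficulty accepted at a given step.
    Here the accepted difficulty grows linearly during training."""
    frac = min(step / total_steps, 1.0)
    return min_difficulty + int(frac * (max_difficulty - min_difficulty))

def sample_batch(dataset, step, total_steps, min_d, max_d):
    """Keep only samples whose difficulty is within the current range."""
    limit = pacing_function(step, total_steps, min_d, max_d)
    return [s for s in dataset if difficulty_metric(s) <= limit]

data = ["ab", "abcd", "abcdefgh", "abcdefghijkl"]
early = sample_batch(data, step=0, total_steps=100, min_d=2, max_d=12)    # easy samples only
late = sample_batch(data, step=100, total_steps=100, min_d=2, max_d=12)  # full dataset
```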

1.2 When to use Curriculum Learning

Curriculum learning has been successfully applied to various training tasks (see, e.g., this survey paper for details), and last year we also released a specific curriculum learning technique (sequence length warmup) for GPT-style model pretraining (see our NeurIPS 2022 paper "The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models" and the tutorial for this legacy curriculum learning feature). This new general curriculum learning library inside DeepSpeed Data Efficiency enables users to employ curriculum learning on their models with maximum extensibility: users can easily analyze, index, and sample their training data based on various customizable strategies. Using this library, we were able to explore different curriculum learning strategies for GPT-3 and BERT pretraining and identify the best solution, which provides up to 1.5x data saving while still maintaining similar model quality.

1.3 How to use Curriculum Learning

1.3.1 GPT-3 and BERT pretraining

The examples_deepspeed/data_efficiency directory in our Megatron-DeepSpeed repo includes our examples of how to apply curriculum learning to GPT-3 and BERT pretraining. There are three steps: data analysis, pretraining, and eval/finetuning.

Data analysis: Curriculum learning requires a data analysis before pretraining, which calculates the difficulty of each data sample (based on the user-provided metric) and builds an index that maps difficulty values to the corresponding data samples. (There are exceptions: e.g., the truncation-based sequence length metric can be achieved by data postprocessing, without data analysis.) We provide a data analyzer to perform the offline CPU-only data analysis.

examples_deepspeed/data_efficiency/gpt/ds_analyze_*.sh and examples_deepspeed/data_efficiency/bert/ds_analyze_*.sh are the example data analysis scripts for GPT-3 and BERT. Our data analyzer employs a simple Map-Reduce scheme. First, at the Map stage, ds_analyze_*_data_map.sh is used to split the dataset and compute the difficulty value for each data sample. Users need to provide a function that computes the metric (we implement ours in examples_deepspeed/data_efficiency/analyze_data.py), the raw training dataset, and other configurations such as the number of CPU nodes and the number of threads per node. The data analyzer then automatically splits the dataset based on the number of workers, computes the difficulty values in a batched fashion, and writes the results to two indexes: one index maps each data sample to its difficulty value, and another maps each distinct difficulty value to the corresponding samples. Second, at the Reduce stage, ds_analyze_*_data_reduce.sh is used to merge the index files produced by all workers. One thing to note is that, in order to enable speedup by distribution while still being able to merge all the output, the Map stage can generate a large number of output files, proportional to the number of CPU nodes, the number of threads per node, and the number of possible metric values. To avoid generating too many output files, we recommend starting with a smaller number of nodes/threads (the output log provides an estimated required time, so users can judge whether to increase the number of workers), and we recommend limiting the number of possible difficulty values when designing your difficulty metric (in our experience, a few thousand distinct values is already sufficient to enjoy the benefit of curriculum learning).
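The Map-Reduce scheme above can be sketched in a few lines of plain Python. The function and variable names here are illustrative, not the data analyzer's actual API; real workers would run on separate nodes and write their partial indexes to files.

```python
# Illustrative sketch of the data analyzer's Map-Reduce indexing
# (names are hypothetical; the real analyzer distributes work across
# CPU nodes/threads and persists indexes to disk).
from collections import defaultdict

def map_stage(samples, metric, num_workers):
    """Map: each worker computes difficulty for its shard and builds
    two partial indexes (sample -> difficulty, difficulty -> samples)."""
    partials = []
    for w in range(num_workers):
        shard = samples[w::num_workers]  # simple round-robin split
        sample_to_metric = {}
        metric_to_samples = defaultdict(list)
        for s in shard:
            d = metric(s)
            sample_to_metric[s] = d
            metric_to_samples[d].append(s)
        partials.append((sample_to_metric, metric_to_samples))
    return partials

def reduce_stage(partials):
    """Reduce: merge the partial indexes produced by all workers."""
    sample_to_metric = {}
    metric_to_samples = defaultdict(list)
    for s2m, m2s in partials:
        sample_to_metric.update(s2m)
        for d, ss in m2s.items():
            metric_to_samples[d].extend(ss)
    return sample_to_metric, metric_to_samples

samples = ["aa", "bbb", "cc", "dddd"]
s2m, m2s = reduce_stage(map_stage(samples, len, num_workers=2))
```

Note how both indexes survive the merge: the sample-to-difficulty index supports lookup during training, while the difficulty-to-samples index supports curriculum-ordered sampling.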

Pretraining: examples_deepspeed/data_efficiency/gpt/pretrain and examples_deepspeed/data_efficiency/bert/pretrain include the example pretraining scripts with the curriculum learning feature. Several changes are needed to enable curriculum learning during pretraining: (1) Users need to provide a DeepSpeed json config file which includes configurations for curriculum learning (see the list of configurations for details). We provide tested example configurations in examples_deepspeed/data_efficiency/gpt/pretrain/ds_pretrain_gpt_1.3B_dense_run.sh and examples_deepspeed/data_efficiency/bert/pretrain/ds_pretrain_bert_336M_run.sh. (2) When initializing the DeepSpeed engine via deepspeed.initialize, users need to provide the train dataset and use the dataloader returned by the initialization (this dataloader includes the curriculum learning capability). We provide an example implementation of this change in the setup_model_and_optimizer function in megatron/training.py. (3) If the curriculum learning metric requires data postprocessing (such as truncation-based sequence length), users need to use the DeepSpeed engine's set_data_post_process_func API to provide the postprocessing function. We provide an example implementation of this change in megatron/training.py, pretrain_bert.py, and pretrain_gpt.py. (4) If the curriculum learning metric requires a custom scheduling strategy (the pacing function), users need to use the DeepSpeed engine's set_custom_curriculum_learning_schedule API to provide the function that updates the max accepted difficulty during training. The DeepSpeed engine will provide a global train step input to this callback function.
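As a rough illustration of step (1), the curriculum learning section of the DeepSpeed json config might look like the sketch below (rendered as a Python dict). The key names and values here are approximate recollections from the DeepSpeed configuration docs and should be treated as assumptions; the tested example scripts named above are the authoritative reference.

```python
# Hypothetical sketch of the curriculum-learning part of a DeepSpeed
# config (key names approximate; see the tested example scripts for
# authoritative configurations). This uses the truncation-based
# sequence length metric, which needs no offline data analysis.
ds_config = {
    "data_efficiency": {
        "enabled": True,
        "seed": 1234,
        "data_sampling": {
            "enabled": True,
            "curriculum_learning": {
                "enabled": True,
                "curriculum_metrics": {
                    "seqlen": {
                        "difficulty_type": "value",
                        "min_difficulty": 80,      # start with short sequences
                        "max_difficulty": 2048,    # end at the full sequence length
                        "schedule_type": "fixed_linear",
                        "schedule_config": {
                            "total_curriculum_step": 110000,
                            "difficulty_step": 8,
                        },
                    }
                },
            },
        },
    },
}
```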

Eval/finetuning: examples_deepspeed/data_efficiency/gpt/eval and examples_deepspeed/data_efficiency/bert/finetune include the example scripts for GPT-3 model zero-/few-shot evaluation and BERT model finetuning. If you follow our example scripts to perform the pretraining/eval/finetuning, our paper includes the reference eval/finetune results.

1.3.2 GPT-2 finetuning

The data_efficiency/gpt_finetuning directory in our DeepSpeedExamples repo includes our examples of how to apply curriculum learning to GPT-2 finetuning. data_efficiency/gpt_finetuning/finetune/ds_finetune_gpt2_run.sh is the example finetuning script. For CL metrics that require data analysis (e.g., the vocabulary rarity metric), you need to first use data_efficiency/gpt_finetuning/finetune/ds_analyze_gpt_data_* to analyze and index the dataset, similar to the GPT-3 pretraining case described in Section 1.3.1 above.

2. Random layerwise token dropping (random-LTD)

2.1 What is random-LTD

Random-LTD is an efficient token dropping method applied to each layer with random assignment. In detail, for each layer, compared to the baseline, random-LTD randomly selects a subset of the tokens and feeds them into the transformer layer. Afterward, we combine the output of the transformer layer with the dropped tokens to recover the full sequence length, so the next layer still receives the full sequence and the process can be repeated. For more technical details, please read our random-LTD paper.
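The per-layer drop-and-recombine step can be illustrated with a toy sketch, where a simple function stands in for a transformer layer. All names here are illustrative, not the library's API.

```python
# Toy sketch of random-LTD's per-layer token dropping and recombination
# (illustrative only; the real implementation operates on tensors).
import random

def layer_with_random_ltd(tokens, layer_fn, keep_ratio, rng):
    """Apply layer_fn to a random subset of token positions, then
    scatter the outputs back so the full sequence length is restored."""
    seq_len = len(tokens)
    num_keep = max(1, int(seq_len * keep_ratio))
    kept = sorted(rng.sample(range(seq_len), num_keep))  # random token subset
    processed = layer_fn([tokens[i] for i in kept])      # the layer sees fewer tokens
    out = list(tokens)                                   # dropped tokens pass through
    for i, v in zip(kept, processed):
        out[i] = v
    return out  # full-length sequence, ready for the next layer

rng = random.Random(0)
seq = [1, 2, 3, 4, 5, 6, 7, 8]
out = layer_with_random_ltd(seq, lambda xs: [x * 10 for x in xs], keep_ratio=0.5, rng=rng)
# out has the same length as seq: processed positions are x*10,
# dropped positions keep their original value
```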

2.2 When to use random-LTD

When you want to pretrain/finetune a transformer-based model, it is always a good idea to try random-LTD, as it can achieve better performance than standard baseline training under the same computational cost. If your resources are limited, random-LTD can achieve similar accuracy as the original baseline method with up to 33.3% theoretical cost saving and up to 25.6% wall-clock time saving. In particular, if you need to train a larger model with >=24 layers and >=2048 sequence length, our method will be much more efficient than the baseline.

2.3 How to use random-LTD

2.3.1 GPT-3 and BERT pretraining

The examples_deepspeed/data_efficiency directory in our Megatron-DeepSpeed repo includes our examples of how to apply random-LTD to GPT-3 and BERT pretraining.

examples_deepspeed/data_efficiency/gpt/pretrain and examples_deepspeed/data_efficiency/bert/pretrain include the example pretraining scripts with the random-LTD feature. Several changes are needed to enable random-LTD during pretraining: (1) Users need to provide a DeepSpeed json config file which includes configurations for random-LTD (see the list of configurations for details). We provide tested example configurations in examples_deepspeed/data_efficiency/gpt/pretrain/ds_pretrain_gpt_1.3B_dense_run.sh and examples_deepspeed/data_efficiency/bert/pretrain/ds_pretrain_bert_336M_run.sh. (2) After initializing the DeepSpeed engine via deepspeed.initialize, users need to use the convert_to_random_ltd API to convert and wrap the model layers in order to enable the random-LTD feature. We provide an example implementation of this change in the setup_model_and_optimizer function in megatron/training.py. (3) In order for random-LTD to understand the input argument mapping of the forward function, users need to change all the input arguments (except the hidden_states input) into keyword/named arguments. For example, in megatron/model/transformer.py we changed the forward function from def forward(self, hidden_states, attention_mask, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False): to def forward(self, hidden_states, attention_mask=None, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False):. (4) When saving model checkpoints (especially if the state dictionary has a non-traditional structure), users need to use the remove_random_ltd_state_dict API to convert the random-LTD-wrapped layers back to the original model layers. We provide an example implementation of this change in megatron/model/language_model.py.
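As a rough illustration of step (1), the random-LTD part of the DeepSpeed config might look like the following sketch (rendered as a Python dict). The key names here are approximate recollections from the DeepSpeed configuration docs and should be treated as assumptions; the tested example scripts above are the authoritative reference.

```python
# Hypothetical sketch of the random-LTD part of a DeepSpeed config
# (key names approximate; consult the tested example scripts for
# authoritative configurations).
ds_config = {
    "data_efficiency": {
        "enabled": True,
        "data_routing": {
            "enabled": True,
            "random_ltd": {
                "enabled": True,
                "total_layer_num": 24,       # layers in the model
                "random_ltd_layer_num": 22,  # layers that drop tokens
                "random_ltd_layer_id": list(range(1, 23)),  # keep first/last layers intact
                "model_mask_name": "attention_mask",
                "model_type": "decoder",
                "random_ltd_schedule": {
                    "min_value": 128,   # initial kept sequence length
                    "max_value": 2048,  # full sequence length
                    "schedule_type": "fixed_linear",
                    "schedule_config": {"require_steps": 200000, "seq_per_step": 16},
                },
            },
        },
    },
}
```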

For eval/finetuning of the pretrained model, see the previous section on how to use our example scripts.

2.3.2 GPT-2 and ViT finetuning

The data_efficiency directory in our DeepSpeedExamples repo includes our examples of how to apply random-LTD to GPT-2 and ViT finetuning.

Just like the pretraining case, similar changes are needed to enable random-LTD for finetuning: (1) the DeepSpeed json config file; (2) using the convert_to_random_ltd API to convert and wrap the model layers; (3) using the remove_random_ltd_state_dict API to convert the random-LTD-wrapped layers back to the original model layers when saving model checkpoints.

Our GPT finetuning example can be run by:

DeepSpeedExamples/data_efficiency/gpt_finetuning$ pip install -r requirement.txt
DeepSpeedExamples/data_efficiency/gpt_finetuning$ bash ./bash_script/run_base_random_ltd.sh
DeepSpeedExamples/data_efficiency/gpt_finetuning$ bash ./bash_script/run_medium_random_ltd.sh

The reference final results are:

For run_base_random_ltd.sh:
End of training epoch 3 step 1344 consumed_token 2148032 best perplexity 22.552324221233757 time 0.17486039188173083 hr

For run_medium_random_ltd.sh:
End of training epoch 3 step 1373 consumed_token 2147024 best perplexity 17.332243199130996 time 0.4661190489927928 hr

Our ViT finetuning example can be run by:

DeepSpeedExamples/data_efficiency/vit_finetuning$ pip install -r requirement.txt
DeepSpeedExamples/data_efficiency/vit_finetuning$ bash ./bash_script/run_cifar.sh
DeepSpeedExamples/data_efficiency/vit_finetuning$ bash ./bash_script/run_imagenet.sh

The reference final results are:

For run_cifar.sh:
13 epoch at time 480.6546013355255s | reserved_length 197
iter 5474 | LR [0.0001]| val_acc 97.97000122070312 | layer_token 305784192

3. Composing curriculum learning and random-LTD to achieve more

3.1 GPT-3 and BERT pretraining

The examples_deepspeed/data_efficiency directory in our Megatron-DeepSpeed repo includes our examples of how to compose curriculum learning and random-LTD and apply them to GPT-3 and BERT pretraining.

The changes needed are the same as those described in the previous two sections, since DeepSpeed Data Efficiency already handles the complexity of composing the two techniques. However, one thing to note is that, since random-LTD and some curriculum learning metrics change the sequence length, some extra code may be needed to calculate the effective sequence length at each step. We provide an example implementation of this change in the train function in megatron/training.py, where we calculate actual_seq_length.
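To see why extra bookkeeping is needed, the tokens actually processed per sample can be estimated as below. This is an illustrative calculation under stated assumptions, not the exact formula used in megatron/training.py: curriculum learning truncates the sequence, while random-LTD keeps only a reserved number of tokens in the LTD layers.

```python
# Illustrative estimate of effective tokens per sample when both
# techniques are active (hypothetical helper, not Megatron-DeepSpeed's
# actual actual_seq_length computation).
def effective_tokens_per_sample(curriculum_seqlen, reserved_seqlen,
                                total_layers, ltd_layers):
    """Average tokens processed per layer for one sample:
    - curriculum learning truncates the sequence to curriculum_seqlen;
    - random-LTD keeps only reserved_seqlen tokens in ltd_layers."""
    reserved = min(reserved_seqlen, curriculum_seqlen)  # LTD cannot keep more tokens than exist
    full_layers = total_layers - ltd_layers
    total = full_layers * curriculum_seqlen + ltd_layers * reserved
    return total / total_layers
```

For example, with a curriculum sequence length of 1024, a reserved length of 512, and 22 of 24 layers dropping tokens, each sample costs (2 * 1024 + 22 * 512) / 24 ≈ 555 tokens per layer instead of the full 1024.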

3.2 GPT-2 finetuning

The data_efficiency/gpt_finetuning directory in our DeepSpeedExamples repo includes our examples of how to compose curriculum learning and random-LTD for GPT-2 finetuning. data_efficiency/gpt_finetuning/finetune/ds_finetune_gpt2_run.sh is the example finetuning script.

Updated: