Using PyTorch Profiler with DeepSpeed for performance debugging

This tutorial describes how to use PyTorch Profiler with DeepSpeed.

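PyTorch Profiler is an open-source tool that enables accurate and efficient performance analysis and troubleshooting for large-scale deep learning models. The profiling results can be written to a .json trace file and viewed in Google's Perfetto trace viewer (https://ui.perfetto.dev). The Python extension for Microsoft Visual Studio Code integrates TensorBoard into the code editor, including support for the PyTorch Profiler.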

For more details, refer to PYTORCH PROFILER.

Profile the model training loop

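The following shows how to profile a training loop by wrapping the code in the profiler context manager. The profiler assumes that the training process is composed of steps, numbered starting from zero. The PyTorch profiler accepts a number of parameters, such as schedule, on_trace_ready, and with_stack.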

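In the example below, the profiler skips the first 5 steps, uses the next 2 steps as warm-up, and actively records the following 6 steps, so one cycle spans 13 steps. Because repeat is set to 2, the profiler stops recording after the first two cycles. For detailed usage of schedule, refer to Using profiler to analyze long-running jobs.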

import torch
from torch.profiler import profile, record_function, ProfilerActivity

with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=5, # During this phase profiler is not active.
        warmup=2, # During this phase profiler starts tracing, but the results are discarded.
        active=6, # During this phase profiler traces and records data.
        repeat=2), # Specifies an upper bound on the number of cycles.
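    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"), # Writes traces for TensorBoard; "./log" is a placeholder output directory.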
    with_stack=True # Enable stack tracing, adds extra profiling overhead.
) as profiler:
    for step, batch in enumerate(data_loader):
        print("step:{}".format(step))

        #forward() method
        loss = model_engine(batch)

        #runs backpropagation
        model_engine.backward(loss)

        #weight update
        model_engine.step()
        profiler.step() # Send the signal to the profiler that the next step has started.
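
The schedule used above can be pictured as a per-step phase lookup. The sketch below is a simplified plain-Python model of that behavior, not the library's implementation; the helper name phase_for_step and the phase label "done" are hypothetical and exist only for illustration.

```python
def phase_for_step(step, wait=5, warmup=2, active=6, repeat=2):
    """Return the phase of a zero-based step under a wait/warmup/active schedule."""
    cycle_len = wait + warmup + active
    # With repeat > 0, nothing is recorded once that many full cycles have passed.
    if repeat > 0 and step >= cycle_len * repeat:
        return "done"
    pos = step % cycle_len
    if pos < wait:
        return "wait"      # profiler is not active
    if pos < wait + warmup:
        return "warmup"    # tracing starts, results are discarded
    return "active"        # tracing and recording


phases = [phase_for_step(s) for s in range(30)]
active_steps = [s for s, p in enumerate(phases) if p == "active"]
print(active_steps)  # [7, 8, 9, 10, 11, 12, 20, 21, 22, 23, 24, 25]
```

Under these settings only steps 7 to 12 and 20 to 25 are recorded, so a training run shorter than 13 steps would never complete a full recording cycle.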

Label arbitrary code ranges

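The record_function context manager can be used to label arbitrary code ranges with user-provided names. For example, the following code marks the forward pass with the label "model_forward":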

with profile(record_shapes=True) as prof: # record_shapes indicates whether to record shapes of the operator inputs.
    with record_function("model_forward"):
        model_engine(inputs)

Profile CPU or GPU activities

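The activities parameter passed to the profiler specifies the list of activities to profile while the code range wrapped by the profiler context manager executes: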

ProfilerActivity.CPU - PyTorch operators, TorchScript functions, and user-defined code labels (record_function).
ProfilerActivity.CUDA - on-device CUDA kernels. Note that CUDA profiling incurs non-negligible overhead.

The example below profiles both the CPU and GPU activities in the model forward pass and prints a summary table sorted by total CUDA time.

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_forward"):
        model_engine(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
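
On a machine without a GPU, one option is to request CUDA tracing only when a device is present. The following is a minimal sketch under that assumption; the small torch.nn.Linear model and random inputs are stand-ins for the tutorial's model_engine and inputs, and the table is sorted by CPU time so that it is meaningful on either kind of machine.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in model and input (assumptions); model_engine would take their place.
model = torch.nn.Linear(16, 4)
inputs = torch.randn(8, 16)

# Only request CUDA tracing when a GPU is actually available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, inputs = model.cuda(), inputs.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_forward"):
        model(inputs)

# Collect the aggregated event names; the user label should be among them.
event_names = [evt.key for evt in prof.key_averages()]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The "model_forward" label appears as its own row in the table, which makes it easy to separate the forward pass from the operators around it.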

Profile memory consumption

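Passing profile_memory=True to the PyTorch profiler enables memory profiling, which records the amount of memory (used by the model's tensors) that was allocated or released during the execution of the model's operators. For example: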

with profile(activities=[ProfilerActivity.CUDA],
        profile_memory=True, record_shapes=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

Self memory corresponds to the memory allocated (released) by the operator itself, excluding child calls to other operators.
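
The relationship between self memory and total memory can be sketched with a small call tree. This is an illustrative model only: the OpRecord class, the operator names, and the byte counts are all hypothetical and are not part of PyTorch or taken from a real trace.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpRecord:
    """Hypothetical record of one operator call (not a PyTorch class)."""
    name: str
    total_bytes: int  # memory allocated by this op, including its child calls
    children: List["OpRecord"] = field(default_factory=list)

    def self_bytes(self) -> int:
        # Self memory = total minus what the child operator calls account for.
        return self.total_bytes - sum(c.total_bytes for c in self.children)


# Invented numbers: a parent op whose two children allocate most of the memory.
child_a = OpRecord("child_a", total_bytes=4096)
child_b = OpRecord("child_b", total_bytes=1024)
parent = OpRecord("parent_op", total_bytes=5632, children=[child_a, child_b])

print(parent.self_bytes())  # 512
```

Sorting by a self metric therefore points at the operators that allocate memory directly, rather than at high-level operators that merely contain them.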