ResNet Model Training with Intel Gaudi#

In this Jupyter notebook, we train a ResNet-50 model to classify images of ants and bees, using HPUs. We use PyTorch for model training and Ray for distributed training. The dataset is downloaded and processed with torchvision's datasets and transforms.

Intel Gaudi AI Processors (HPUs) are AI hardware accelerators designed by Intel Habana Labs. For more information, see Gaudi Architecture and Gaudi Developer Documentation.

Configuration#

Running this example requires a node with Gaudi/Gaudi2 installed. Both Gaudi and Gaudi2 have 8 HPUs. We will use 2 workers to train the model, each worker using 1 HPU.

We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See Install Docker Engine for installation instructions.

Next, follow Run Using Containers to install the Gaudi drivers and container runtime.

Then start the Gaudi container:

docker pull vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

Inside the container, install Ray and Jupyter to run this notebook.

pip install ray[train] notebook

import os
from typing import Dict
from tempfile import TemporaryDirectory

import torch
from filelock import FileLock
from torch import nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
from tqdm import tqdm

import ray
import ray.train as train
from ray.train import ScalingConfig, Checkpoint
from ray.train.torch import TorchTrainer
from ray.train.torch import TorchConfig
from ray.runtime_env import RuntimeEnv

import habana_frameworks.torch.core as htcore

Define Data Transforms#

We set up data transforms to preprocess the images for training and validation. This includes random cropping, flipping, and normalization for the training set, and resizing and normalization for the validation set.

# Data augmentation and normalization for training
# Just normalization for validation
data_transforms = {
    "train": transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
    "val": transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
}

Dataset Download Function#

We define a function to download the Hymenoptera dataset. The dataset contains images of ants and bees for a binary classification problem.

def download_datasets():
    os.system("wget https://download.pytorch.org/tutorial/hymenoptera_data.zip >/dev/null 2>&1")
    os.system("unzip hymenoptera_data.zip >/dev/null 2>&1")

Dataset Preparation Function#

After downloading the dataset, we need to build PyTorch datasets for training and validation. The build_datasets function applies the transforms defined above and creates the datasets.

def build_datasets():
    torch_datasets = {}
    for split in ["train", "val"]:
        torch_datasets[split] = datasets.ImageFolder(
            os.path.join("./hymenoptera_data", split), data_transforms[split]
        )
    return torch_datasets

Model Initialization Functions#

We define two functions to initialize our model. The initialize_model function loads a pre-trained ResNet-50 model and replaces the final classification layer to fit our binary classification task. The initialize_model_from_checkpoint function loads the model from a saved checkpoint when one is available.

def initialize_model():
    # Load the pretrained model weights
    model = models.resnet50(pretrained=True)

    # Replace the original classifier with a new Linear layer
    num_features = model.fc.in_features
    model.fc = nn.Linear(num_features, 2)

    # Ensure all parameters are updated during fine-tuning
    for param in model.parameters():
        param.requires_grad = True
    return model
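
The code cell above only shows initialize_model. Below is a minimal sketch of what initialize_model_from_checkpoint could look like, assuming the checkpoint directory contains a checkpoint.pt file that stores the model's state_dict under a "model" key (both the file name and the key are assumptions for illustration, not part of the original example).

def initialize_model_from_checkpoint(checkpoint: Checkpoint):
    # Restore fine-tuned weights from a Ray Train Checkpoint.
    # "checkpoint.pt" and the "model" key are assumed names.
    with checkpoint.as_directory() as tmpdir:
        state_dict = torch.load(os.path.join(tmpdir, "checkpoint.pt"), map_location="cpu")
    model = initialize_model()
    model.load_state_dict(state_dict["model"])
    return model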

Evaluation Function#

To evaluate the model's performance during training, we define an evaluate function. It computes the number of correct predictions by comparing the predicted labels with the ground-truth labels.

def evaluate(logits, labels):
    _, preds = torch.max(logits, 1)
    corrects = torch.sum(preds == labels).item()
    return corrects
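
As a quick illustration with made-up values: for a batch of two samples whose row-wise argmax predictions are classes 1 and 0, but whose labels are both 1, evaluate returns 1.

logits = torch.tensor([[0.2, 1.5], [2.0, -1.0]])  # row-wise argmax: 1, 0
labels = torch.tensor([1, 1])
evaluate(logits, labels)  # 1 correct prediction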

Training Loop Function#

This function defines the training loop that each worker executes. It includes downloading the dataset, preparing data loaders, initializing the model, and running the training and validation phases. Compared with a training function written for GPUs, no changes are required to migrate to HPU. Internally, Ray Train does the following:

  • Detects HPU and sets the device.

  • Initializes the habana PyTorch backend.

  • Initializes the habana distributed backend.

def train_loop_per_worker(configs):
    import warnings

    warnings.filterwarnings("ignore")

    # Calculate the batch size for each worker
    worker_batch_size = configs["batch_size"] // train.get_context().get_world_size()

    # Download the dataset only once, on the worker with local rank 0
    if train.get_context().get_local_rank() == 0:
        download_datasets()
    torch.distributed.barrier()

    # Build datasets on each worker
    torch_datasets = build_datasets()

    # Prepare dataloaders for each worker
    dataloaders = dict()
    dataloaders["train"] = DataLoader(
        torch_datasets["train"], batch_size=worker_batch_size, shuffle=True
    )
    dataloaders["val"] = DataLoader(
        torch_datasets["val"], batch_size=worker_batch_size, shuffle=False
    )

    # Distribute the dataloaders across workers
    dataloaders["train"] = train.torch.prepare_data_loader(dataloaders["train"])
    dataloaders["val"] = train.torch.prepare_data_loader(dataloaders["val"])

    # Obtain the HPU device automatically
    device = train.torch.get_device()

    # Prepare the DDP model, optimizer, and loss function
    model = initialize_model()
    model = model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=configs["lr"], momentum=configs["momentum"]
    )
    criterion = nn.CrossEntropyLoss()

    # Start the training loop
    for epoch in range(configs["num_epochs"]):
        # Each epoch has a training and a validation phase
        for phase in ["train", "val"]:
            if phase == "train":
                model.train()  # Set the model to training mode
            else:
                model.eval()  # Set the model to evaluation mode

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                with torch.set_grad_enabled(phase == "train"):
                    # Get model outputs and calculate the loss
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)

                    # Backward pass and optimize only in the training phase
                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                # Calculate statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += evaluate(outputs, labels)

            size = len(torch_datasets[phase]) // train.get_context().get_world_size()
            epoch_loss = running_loss / size
            epoch_acc = running_corrects / size

            if train.get_context().get_world_rank() == 0:
                print(
                    "Epoch {}-{} Loss: {:.4f} Acc: {:.4f}".format(
                        epoch, phase, epoch_loss, epoch_acc
                    )
                )

            # Report metrics for each epoch
            if phase == "val":
                train.report(
                    metrics={"loss": epoch_loss, "acc": epoch_acc},
                )
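
The loop above only reports metrics, even though TemporaryDirectory and Checkpoint are imported. If you also want to persist a checkpoint every epoch, one possible sketch (not part of the original loop; the checkpoint.pt file name and the "model" key are assumptions) is to extend the final train.report call like this:

            if phase == "val":
                with TemporaryDirectory() as tmpdir:
                    # Save the model weights and attach them to the reported metrics.
                    torch.save({"model": model.state_dict()}, os.path.join(tmpdir, "checkpoint.pt"))
                    train.report(
                        metrics={"loss": epoch_loss, "acc": epoch_acc},
                        checkpoint=Checkpoint.from_directory(tmpdir),
                    )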

Main Training Function#

The train_resnet function sets up the distributed training environment with Ray and starts the training process. It specifies the batch size, the number of epochs, the learning rate, and the momentum for the SGD optimizer. To train with HPUs, the only changes needed are:

  • Require an HPU for each worker in ScalingConfig

  • Set the backend to "hccl" in TorchConfig

def train_resnet(num_workers=2):
    global_batch_size = 16

    train_loop_config = {
        "input_size": 224,  # 输入图像尺寸(224 x 224)
        "batch_size": 32,  # 训练批次大小
        "num_epochs": 10,  # 训练的轮数
        "lr": 0.001,  # 学习率
        "momentum": 0.9,  # SGD优化器的动量
    }
    # Configure computation resources
    # In ScalingConfig, require an HPU for each worker
    
    scaling_config = ScalingConfig(num_workers=num_workers, resources_per_worker={"CPU": 1, "HPU": 1})
    # Set the backend to hccl in TorchConfig
    torch_config = TorchConfig(backend="hccl")
    
    ray.init()
    
    # Initialize a Ray TorchTrainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config=train_loop_config,
        torch_config=torch_config,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

Start Training#

Finally, we call the train_resnet function to start the training process. You can adjust the number of workers to use. Before running this cell, make sure that Ray is properly set up in your environment to handle distributed training.

Note: the following warning is expected and is resolved in SynapseAI version 1.14.0+:

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
train_resnet(num_workers=2) 

Tune Status

Current time: 2024-02-28 07:31:55
Running for: 00:00:55.04
Memory: 389.2/1007.5 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 3.0/160 CPUs, 0/0 GPUs (2.0/8.0 HPU, 0.0/1.0 TPU)

Trial Status

Trial name                 status       loc                  iter   total time (s)      loss       acc
TorchTrainer_521db_00000   TERMINATED   172.17.0.3:109080      10          49.3096  0.154648  0.986842
(pid=109080) /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
(pid=109080)   warnings.warn(
(RayTrainWorker pid=115673) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=109080) Started distributed worker processes: 
(TorchTrainer pid=109080) - (ip=172.17.0.3, pid=115673) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=109080) - (ip=172.17.0.3, pid=115678) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=115673) /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=115673)   warnings.warn( [repeated 2x across cluster]
(RayTrainWorker pid=115673) ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
(RayTrainWorker pid=115673)  PT_HPU_LAZY_MODE = 1
(RayTrainWorker pid=115673)  PT_RECIPE_CACHE_PATH = 
(RayTrainWorker pid=115673)  PT_CACHE_FOLDER_DELETE = 0
(RayTrainWorker pid=115673)  PT_HPU_RECIPE_CACHE_CONFIG = 
(RayTrainWorker pid=115673)  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
(RayTrainWorker pid=115673)  PT_HPU_LAZY_ACC_PAR_MODE = 1
(RayTrainWorker pid=115673)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
(RayTrainWorker pid=115673) ---------------------------: System Configuration :---------------------------
(RayTrainWorker pid=115673) Num CPU Cores : 160
(RayTrainWorker pid=115673) CPU RAM       : 1056389756 KB
(RayTrainWorker pid=115673) ------------------------------------------------------------------------------
(RayTrainWorker pid=115673) Epoch 0-train Loss: 0.6667 Acc: 0.6148
(RayTrainWorker pid=115673) Epoch 0-val Loss: 0.5717 Acc: 0.6053
(RayTrainWorker pid=115673) Epoch 1-train Loss: 0.5248 Acc: 0.7295
(RayTrainWorker pid=115673) Epoch 1-val Loss: 0.3194 Acc: 0.9605
(RayTrainWorker pid=115673) Epoch 2-train Loss: 0.3100 Acc: 0.9016
(RayTrainWorker pid=115673) Epoch 2-val Loss: 0.2336 Acc: 0.9474
(RayTrainWorker pid=115673) Epoch 3-train Loss: 0.2391 Acc: 0.9180
(RayTrainWorker pid=115673) Epoch 3-val Loss: 0.1789 Acc: 0.9737
(RayTrainWorker pid=115673) Epoch 4-train Loss: 0.1780 Acc: 0.9508
(RayTrainWorker pid=115673) Epoch 4-val Loss: 0.1696 Acc: 0.9605
(RayTrainWorker pid=115673) Epoch 5-train Loss: 0.1447 Acc: 0.9754
(RayTrainWorker pid=115673) Epoch 5-val Loss: 0.1534 Acc: 0.9737
(RayTrainWorker pid=115673) Epoch 6-train Loss: 0.1398 Acc: 0.9426
(RayTrainWorker pid=115673) Epoch 6-val Loss: 0.1606 Acc: 0.9605
(RayTrainWorker pid=115673) Epoch 7-train Loss: 0.1398 Acc: 0.9590
(RayTrainWorker pid=115673) Epoch 7-val Loss: 0.1582 Acc: 0.9605
(RayTrainWorker pid=115673) Epoch 8-train Loss: 0.0856 Acc: 0.9754
(RayTrainWorker pid=115673) Epoch 8-val Loss: 0.1552 Acc: 0.9605
(RayTrainWorker pid=115673) Epoch 9-train Loss: 0.0602 Acc: 0.9836
(RayTrainWorker pid=115673) Epoch 9-val Loss: 0.1546 Acc: 0.9868
2024-02-28 07:31:55,645	INFO tune.py:1042 -- Total run time: 55.08 seconds (55.04 seconds for the tuning loop).
Training result: Result(
  metrics={'loss': 0.15464812321098229, 'acc': 0.9868421052631579},
  path='/root/ray_results/TorchTrainer_2024-02-28_07-31-00/TorchTrainer_521db_00000_0_2024-02-28_07-31-00',
  filesystem='local',
  checkpoint=None
)