Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See Train a PyTorch model on Fashion MNIST for more details.

Step 1: Create a Kubernetes cluster

This step creates a local Kubernetes cluster using Kind. If you already have a Kubernetes cluster, you can skip this step.

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator

Follow the KubeRay operator installation document to install the latest stable KubeRay operator from the Helm repository.
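
For orientation, the installation usually comes down to a few Helm commands. The following is a sketch rather than a substitute for the installation document: the wrapper function name install_kuberay_operator is hypothetical, and because no --version flag is passed, Helm selects the latest stable chart version available in the repository at the time you run it.

```shell
# Hypothetical wrapper around the usual KubeRay Helm installation steps.
install_kuberay_operator() {
  # Register the KubeRay chart repository and refresh the local index.
  helm repo add kuberay https://ray-project.github.io/kuberay-helm/
  helm repo update
  # Install the operator into the current namespace. Without --version,
  # Helm picks the latest stable chart version in the repository.
  helm install kuberay-operator kuberay/kuberay-operator
}
```

After calling install_kuberay_operator, kubectl get pods should list a kuberay-operator Pod in the Running status before you continue.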

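Step 3: Create a RayJob

A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits the job once the cluster is ready. The following is a CPU-only RayJob description YAML file that trains a PyTorch model on Fashion MNIST.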

# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml

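You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:

replicas under workerGroupSpecs in rayClusterSpec: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod requires 3 CPUs and the head Pod requires 1 CPU, as described in the template field. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum replicas value is 2 (1 + 1 + 2 × 3 = 8 CPUs), which allows all Pods to reach the Running status.
NUM_WORKERS under runtimeEnvYAML in spec: This field indicates the number of Ray actors to launch (see ScalingConfig in the Ray Train documentation for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, NUM_WORKERS must be less than or equal to replicas.
CPUS_PER_WORKER: This must be set to less than or equal to (CPU resource request per worker Pod) - 1. For example, in the sample YAML file, the CPU resource request per worker Pod is 3 CPUs, so CPUS_PER_WORKER must be set to 2 or less.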
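
The three constraints above are simple integer arithmetic, so it can help to check them before applying the YAML. The sketch below is illustrative only: the function names max_replicas and check_train_settings are hypothetical helpers, not part of KubeRay, and the default CPU requests (3 per worker Pod, 1 for the head Pod, 1 for the submitter Pod) are the values from the sample YAML file described above.

```shell
# Largest `replicas` value that fits on a machine with the given CPU count.
# Usage: max_replicas TOTAL_CPUS [WORKER_CPUS] [HEAD_CPUS] [SUBMITTER_CPUS]
max_replicas() {
  local total_cpus="$1"
  local worker_cpus="${2:-3}"     # CPU request per worker Pod
  local head_cpus="${3:-1}"       # CPU request of the head Pod
  local submitter_cpus="${4:-1}"  # CPU request of the submitter Pod
  # CPUs left for worker Pods after the head and submitter Pods.
  local free=$(( total_cpus - head_cpus - submitter_cpus ))
  if [ "$free" -lt 0 ]; then
    free=0
  fi
  # Integer division: only whole worker Pods can be scheduled.
  echo $(( free / worker_cpus ))
}

# Returns 0 only if NUM_WORKERS <= replicas and
# CPUS_PER_WORKER <= (worker Pod CPU request) - 1.
# Usage: check_train_settings REPLICAS NUM_WORKERS CPUS_PER_WORKER [WORKER_CPUS]
check_train_settings() {
  local replicas="$1"
  local num_workers="$2"
  local cpus_per_worker="$3"
  local worker_cpus="${4:-3}"
  [ "$num_workers" -le "$replicas" ] &&
    [ "$cpus_per_worker" -le $(( worker_cpus - 1 )) ]
}
```

For the 8-CPU machine in the example, max_replicas 8 prints 2, and check_train_settings 2 2 2 succeeds, whereas check_train_settings 2 3 2 fails because NUM_WORKERS exceeds replicas.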

# `replicas` and `NUM_WORKERS` set to 2.
# Create a RayJob.
kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods.
# Make sure all the Pods are in the `Running` status.
kubectl get pods
# NAME                                                             READY   STATUS    RESTARTS   AGE
# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running   0          6m8s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running   0          5m32s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running   0          5m32s
# rayjob-pytorch-mnist-nxmj2                                       1/1     Running   0          4m17s
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running   0          5m32s

Check that the RayJob is in the RUNNING status:

kubectl get rayjob
# NAME                   JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
# rayjob-pytorch-mnist   RUNNING      Running             2024-06-17T04:08:25Z              11m

Step 4: Wait until the RayJob completes and check the training results

Wait until the RayJob completes. It might take several minutes.

kubectl get rayjob
# NAME                   JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
# rayjob-pytorch-mnist   SUCCEEDED    Complete            2024-06-17T04:08:25Z   2024-06-17T04:22:21Z   16m
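
Instead of rerunning kubectl get rayjob by hand, the wait can be scripted as a polling loop. This is a minimal sketch under two assumptions: the function name wait_for_rayjob and its interval and attempt defaults are hypothetical, and the job state is read from the .status.jobStatus field of the RayJob resource, which is assumed to be the field behind the JOB STATUS column.

```shell
# Poll a RayJob until it reaches a terminal status or the attempts run out.
# Usage: wait_for_rayjob NAME [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Prints the final status; returns 0 on SUCCEEDED, 1 on FAILED or STOPPED,
# and 2 if the job is still unfinished after MAX_ATTEMPTS polls.
wait_for_rayjob() {
  local name="$1"
  local interval="${2:-10}"
  local max_attempts="${3:-90}"
  local attempt=0
  local job_status=""
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Assumed field path for the value shown in the JOB STATUS column.
    job_status="$(kubectl get rayjob "$name" -o jsonpath='{.status.jobStatus}')"
    case "$job_status" in
      SUCCEEDED)
        echo "$job_status"
        return 0
        ;;
      FAILED|STOPPED)
        echo "$job_status"
        return 1
        ;;
    esac
    attempt=$(( attempt + 1 ))
    sleep "$interval"
  done
  echo "TIMEOUT"
  return 2
}
```

With the defaults, wait_for_rayjob rayjob-pytorch-mnist polls every 10 seconds for up to 15 minutes, which leaves some margin over the run shown above.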

After the JOB STATUS column shows SUCCEEDED, you can check the training logs:

# Check the Pod names.
kubectl get pods
# NAME                                                             READY   STATUS      RESTARTS   AGE
# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running     0          113m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running     0          38m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running     0          38m
# rayjob-pytorch-mnist-nxmj2                                       0/1     Completed   0          38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running     0          38m

# Check training logs.
kubectl logs -f rayjob-pytorch-mnist-nxmj2

# 2024-06-16 22:23:01,047 INFO cli.py:36 -- Job submission server address: http://rayjob-pytorch-mnist-raycluster-rkdmq-head-svc.default.svc.cluster.local:8265
# 2024-06-16 22:23:01,844 SUCC cli.py:60 -- -------------------------------------------------------
# 2024-06-16 22:23:01,844 SUCC cli.py:61 -- Job 'rayjob-pytorch-mnist-l6ccc' submitted successfully
# 2024-06-16 22:23:01,844 SUCC cli.py:62 -- -------------------------------------------------------
# ...
# (RayTrainWorker pid=1138, ip=10.244.0.18)
#   0%|          | 0/26421880 [00:00<?, ?it/s]
# (RayTrainWorker pid=1138, ip=10.244.0.18)
#   0%|          | 32768/26421880 [00:00<01:27, 301113.97it/s]
# ...
# Training finished iteration 10 at 2024-06-16 22:33:05. Total running time: 7min 9s
# ╭───────────────────────────────╮
# │ Training result               │
# ├───────────────────────────────┤
# │ checkpoint_dir_name           │
# │ time_this_iter_s      28.2635 │
# │ time_total_s          423.388 │
# │ training_iteration         10 │
# │ accuracy               0.8748 │
# │ loss                  0.35477 │
# ╰───────────────────────────────╯

# Training completed after 10 iterations at 2024-06-16 22:33:06. Total running time: 7min 10s

# Training result: Result(
#   metrics={'loss': 0.35476621258825347, 'accuracy': 0.8748},
#   path='/home/ray/ray_results/TorchTrainer_2024-06-16_22-25-55/TorchTrainer_122aa_00000_0_2024-06-16_22-25-55',
#   filesystem='local',
#   checkpoint=None
# )
# ...
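
The submitter Pod name ends in a random suffix, so it changes on every run. As an alternative to copying it from kubectl get pods, the logs can be selected by label. This sketch assumes that the submitter Pod is created by a Kubernetes Job named after the RayJob, so that it carries the job-name label that Kubernetes adds to Job Pods; the function name submitter_logs is hypothetical.

```shell
# Print the full submitter log for a RayJob without looking up the Pod name.
# Usage: submitter_logs RAYJOB_NAME
submitter_logs() {
  local rayjob_name="$1"
  # --tail=-1 prints every line; when a label selector is used, kubectl
  # otherwise shows only the last 10 lines of each matching Pod.
  kubectl logs -l "job-name=${rayjob_name}" --tail=-1
}
```

For this example the call is submitter_logs rayjob-pytorch-mnist. If it matches no Pods, fall back to the explicit Pod name as shown above.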

Clean up

Delete your RayJob with the following command:

kubectl delete -f ray-job.pytorch-mnist.yaml