Train a PyTorch ResNet model with GPUs on Kubernetes

This guide runs a sample Ray machine learning training workload with GPUs on Kubernetes infrastructure. It runs Ray's PyTorch image training benchmark with a 1 GB training set.

Note

To learn the basics of Ray on Kubernetes, we recommend reading the introductory guide first.

Note that this example requires Kubernetes and kubectl version 1.19 or later.
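You can quickly confirm both versions with kubectl, which prints the client and server versions together:

# Check that both kubectl (client) and Kubernetes (server) are at least 1.19
kubectl version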

End-to-end workflow

The following script summarizes the end-to-end workflow for GPU training. These instructions apply to GCP, but a similar setup works for any major cloud provider. The script includes:

  • Step 1: Set up a Kubernetes cluster on GCP.

  • Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.

  • Step 3: Run the PyTorch image training benchmark.

# Step 1: Set up a Kubernetes cluster on GCP
# Create a node-pool for a CPU-only head node
# e2-standard-8 => 8 vCPU; 32 GB RAM
gcloud container clusters create gpu-cluster-1 \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-central1-c --machine-type e2-standard-8

# Create a node-pool for GPU. The node is for a GPU Ray worker node.
# n1-standard-8 => 8 vCPU; 30 GB RAM
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --zone us-central1-c --cluster gpu-cluster-1 \
  --num-nodes 1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
  --machine-type n1-standard-8

# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.
# Please make sure you are connected to your Kubernetes cluster. For GCP, you can do so by:
#   (Method 1) Copy the connection command from the GKE console
#   (Method 2) "gcloud container clusters get-credentials <your-cluster-name> --region <your-region> --project <your-project>"
#   (Method 3) "kubectl config use-context ..."

# Install both CRDs and KubeRay operator v1.0.0.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0

# Create a Ray cluster
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml

# Set up port-forwarding
kubectl port-forward services/raycluster-head-svc 8265:8265

# Step 3: Run the PyTorch image training benchmark.
# Install Ray if needed
pip3 install -U "ray[default]"

# Download the Python script
curl https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/doc_code/pytorch_training_e2e_submit.py -o pytorch_training_e2e_submit.py

# Submit the training job to your ray cluster
python3 pytorch_training_e2e_submit.py

# Use the following command to follow this Job's logs:
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address http://127.0.0.1:8265 --follow

The rest of this document breaks down the workflow above in more detail.

Step 1: Set up a Kubernetes cluster on GCP

In this section, we set up a Kubernetes cluster with CPU and GPU node pools. These instructions apply to GCP, but a similar setup works for any major cloud provider. If you already have a Kubernetes cluster with GPUs, you can skip this step.

If you're new to Kubernetes and plan to deploy Ray workloads on a managed Kubernetes service, we recommend reading this introductory guide first.

It isn't necessary to use a cluster with this much RAM (more than 30 GB per node in the commands below) to run this example. Feel free to update the machine-type option and the resource requirements in ray-cluster.gpu.yaml.
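One convenient way to customize these settings is to download the config, edit your local copy, and apply the local file in step 2 instead of the remote URL:

# Download the RayCluster config for local editing
curl -O https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
# After editing, apply it with:
#   kubectl apply -f ray-cluster.gpu.yaml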

The first command creates a Kubernetes cluster gpu-cluster-1 with one CPU node (e2-standard-8: 8 vCPUs; 32 GB RAM). The second command adds a node pool with one GPU node (n1-standard-8: 8 vCPUs; 30 GB RAM) carrying an NVIDIA Tesla T4 GPU (nvidia-tesla-t4).

# Step 1: Set up a Kubernetes cluster on GCP.
# e2-standard-8 => 8 vCPU; 32 GB RAM
gcloud container clusters create gpu-cluster-1 \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-central1-c --machine-type e2-standard-8

# Create a node-pool for GPU
# n1-standard-8 => 8 vCPU; 30 GB RAM
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --zone us-central1-c --cluster gpu-cluster-1 \
  --num-nodes 1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
  --machine-type n1-standard-8

# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
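Before moving on, you can verify that the GPU node has registered its GPU with Kubernetes. This check assumes the standard nvidia.com/gpu resource name exposed by the driver installer; the GPU may take a few minutes to appear while the driver DaemonSet runs:

# Each GPU node should report 1 allocatable GPU once the driver is ready
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"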

Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator

To execute the following steps, make sure you're connected to your Kubernetes cluster. For GCP, you can connect in one of the following ways:

  • Copy the connection command from the GKE console

  • gcloud container clusters get-credentials <your-cluster-name> --region <your-region> --project <your-project>

  • kubectl config use-context
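Whichever method you use, you can verify that kubectl points at the right cluster before proceeding:

# Confirm the active context and that the cluster's nodes are reachable
kubectl config current-context
kubectl get nodes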

The first commands deploy the KubeRay operator to your Kubernetes cluster. The next command creates a Ray cluster with the help of KubeRay.

The third command maps port 8265 of the ray-head pod to 127.0.0.1:8265, so you can visit 127.0.0.1:8265 to view the dashboard. The last command tests your Ray cluster by submitting a simple job. It's optional.

# Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.
# Create the KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0

# Create a Ray cluster
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml

# Set up port-forwarding
kubectl port-forward services/raycluster-head-svc 8265:8265

# Test cluster (optional)
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
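Before submitting work, you can confirm that the operator and the Ray cluster's pods are up:

# The KubeRay operator pod and the Ray head and worker pods should all reach
# Running status. The GPU worker may take a few minutes while the GPU node
# pool scales up.
kubectl get pods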

Step 3: Run the PyTorch image training benchmark

We use the Ray Job Python SDK to submit the PyTorch workload.

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

kick_off_pytorch_benchmark = (
    # Clone ray. If ray is already present, don't clone again.
    "git clone -b ray-2.2.0 https://github.com/ray-project/ray || true;"
    # Run the benchmark.
    "python ray/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py"
    " --data-size-gb=1 --num-epochs=2 --num-workers=1"
)


submission_id = client.submit_job(
    entrypoint=kick_off_pytorch_benchmark,
)

print("Use the following command to follow this Job's logs:")
print(f"ray job logs '{submission_id}' --address http://127.0.0.1:8265 --follow")

To submit the workload, run the Python script above. The script is also available in the Ray repository.

# Step 3: Run the PyTorch image training benchmark.
# Install Ray if needed
pip3 install -U "ray[default]"

# Download the Python script
curl https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/doc_code/pytorch_training_e2e_submit.py -o pytorch_training_e2e_submit.py

# Submit the training job to your ray cluster
python3 pytorch_training_e2e_submit.py
# Example STDOUT:
# Use the following command to follow this Job's logs:
# ray job logs 'raysubmit_jNQxy92MJ4zinaDX' --follow

# Track job status
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address http://127.0.0.1:8265 --follow
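Besides following the logs, you can poll the job's state with the Ray Job CLI:

# Check whether the job is pending, running, or has succeeded or failed.
# Substitute the Ray Job's submission id.
ray job status 'raysubmit_xxxxxxxxxxxxxxxx' --address http://127.0.0.1:8265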

Clean-up

Delete your Ray cluster and KubeRay with the following commands:

kubectl delete raycluster raycluster

# Please make sure the Ray cluster has been removed before deleting the operator.
helm uninstall kuberay-operator

If you're on a public cloud, don't forget to clean up the underlying node group and/or Kubernetes cluster.
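For the GCP setup from step 1, this looks like the following (assuming the cluster name and zone used above):

# Delete the GKE cluster created in step 1, including both node pools
gcloud container clusters delete gpu-cluster-1 --zone us-central1-c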