RayCluster Quickstart

This guide shows you how to manage and interact with Ray clusters on Kubernetes.

Preparation

  • Install kubectl (>= 1.23), Helm (>= v3.4), Kind, and Docker.

  • Make sure your Kubernetes cluster has at least 4 CPUs and 4 GB of memory.
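Before moving on, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch and not part of the original guide. The helper name `missing_tools` is hypothetical, and it only checks that each tool is present, not that it meets the minimum versions listed above.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    # `command -v` exits non-zero when the command cannot be found.
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Prints nothing if all four prerequisites are installed.
missing_tools kubectl helm kind docker
```

Any name that is printed still needs to be installed before you continue with Step 1.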

Step 1: Create a Kubernetes cluster

This step creates a local Kubernetes cluster using Kind. If you already have a Kubernetes cluster, you can skip this step.

kind create cluster --image=kindest/node:v1.26.0

Step 2: Deploy the KubeRay operator

Deploy the KubeRay operator from the Helm chart repository.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.1.1.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1

# Confirm that the operator is running in the namespace `default`.
kubectl get pods
# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-7fbdbf8c89-pt8bk   1/1     Running   0          27s

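KubeRay offers several options for installing the operator, such as Helm, Kustomize, and a single-namespace operator. For more information, see the installation instructions in the KubeRay documentation.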
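Instead of polling `kubectl get pods` by hand, you can block until the operator has finished rolling out. This is a minimal sketch under stated assumptions: the helper name `wait_for_operator` is hypothetical, and it assumes the Deployment carries the Helm release name (`kuberay-operator`) and lives in the `default` namespace, as in the commands above. The 120-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: wait for the operator Deployment to finish rolling out.
# Arguments (all optional): deployment name, namespace, timeout.
# Assumption: the Deployment is named after the Helm release, "kuberay-operator".
wait_for_operator() {
  kubectl rollout status "deployment/${1:-kuberay-operator}" \
    --namespace="${2:-default}" \
    --timeout="${3:-120s}"
}

# Usage: wait_for_operator kuberay-operator default 120s
```

The function returns a non-zero exit code if the rollout does not complete within the timeout, which makes it usable as a gate in a setup script.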

Step 3: Deploy a RayCluster custom resource

Once the KubeRay operator is running, we are ready to deploy a RayCluster. To do so, we create a RayCluster custom resource (CR) in the default namespace.

# Deploy a sample RayCluster CR from the KubeRay Helm chart repo.
# Run only ONE of the two commands below, depending on your machine's architecture.
# ARM64 (for example, Apple Silicon):
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set 'image.tag=2.9.0-aarch64'
# x86-64 (Intel/AMD):
helm install raycluster kuberay/ray-cluster --version 1.1.1
# Once the RayCluster CR has been created, you can view it by running:
kubectl get rayclusters

# NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
# raycluster-kuberay   1                 1                   2      3G       0      ready    95s

The KubeRay operator detects the RayCluster object and then starts your Ray cluster by creating the head and worker pods. To view the Ray cluster's pods, run the following command:

# View the pods in the RayCluster named "raycluster-kuberay"
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay

# NAME                                          READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-head-vkj4n                 1/1     Running   0          XXs
# raycluster-kuberay-worker-workergroup-xvfkr   1/1     Running   0          XXs

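Wait for the pods to reach the Running state. This may take a few minutes; most of that time is spent downloading the Ray image. If your pods are stuck in the Pending state, check for errors with kubectl describe pod raycluster-kuberay-xxxx-xxxxx and make sure that your Docker resource limits are set high enough. Note that in production scenarios, you will likely want to use larger Ray pods. In fact, it is advantageous to size each Ray pod to take up an entire Kubernetes node. See the configuration guide for more details.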
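The waiting described above can also be scripted with `kubectl wait`. This is a minimal sketch, not part of the original guide: the helper name `wait_for_ray_pods` is hypothetical, the 300-second default timeout is an arbitrary choice, and the selector reuses the `ray.io/cluster` label from the command above. Note that `kubectl wait` reports an error if no matching pods exist yet, so run it only after the operator has created the pods.

```shell
# Hypothetical helper: wait until every pod of a RayCluster reports Ready.
# Arguments (both optional): RayCluster name, timeout.
wait_for_ray_pods() {
  kubectl wait --for=condition=Ready pod \
    --selector="ray.io/cluster=${1:-raycluster-kuberay}" \
    --timeout="${2:-300s}"
}

# Usage: wait_for_ray_pods raycluster-kuberay 300s
```

If the command times out, `kubectl describe pod` on the stuck pod is the next place to look, as noted above.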

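Step 4: Run an application on the RayCluster

Now, let's interact with the RayCluster we deployed.

Method 1: Execute a Ray job in the head pod

The most straightforward way to experiment with your RayCluster is to exec directly into the head pod. First, identify your RayCluster's head pod: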

export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo $HEAD_POD
# raycluster-kuberay-head-vkj4n

# Print the cluster resources.
kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

# 2023-04-07 10:57:46,472 INFO worker.py:1243 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2023-04-07 10:57:46,472 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
# 2023-04-07 10:57:46,482 INFO worker.py:1550 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
# {'object_store_memory': 802572287.0, 'memory': 3000000000.0, 'node:10.244.0.6': 1.0, 'CPU': 2.0, 'node:10.244.0.7': 1.0}

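Method 2: Submit a Ray job to the RayCluster via the Ray job submission SDK

Unlike Method 1, this method does not require you to execute commands in the Ray head pod. Instead, you can use the Ray job submission SDK to submit Ray jobs to the RayCluster through the Ray Dashboard port (8265 by default), where Ray listens for job requests. The KubeRay operator configures a Kubernetes service that targets the Ray head pod.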

kubectl get service raycluster-kuberay-head-svc

# NAME                          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                         AGE
# raycluster-kuberay-head-svc   ClusterIP   10.96.93.74   <none>        8265/TCP,8080/TCP,8000/TCP,10001/TCP,6379/TCP   15m

Now that we have the name of the service, we can use port-forwarding to access the Ray Dashboard port (8265 by default).

# Execute this in a separate shell.
kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265

Now that we have access to the Dashboard port, we can submit jobs to the RayCluster:

# The following job's logs will show the Ray cluster's total resource capacity, including 2 CPUs.
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
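After a job has been submitted, the same Ray Jobs CLI can be used to check on it from your local machine. The sketch below is an illustration rather than part of the original guide: the helper name `show_job` is hypothetical, and it assumes the port-forward from the previous step is still running so that the dashboard is reachable at http://localhost:8265.

```shell
# Hypothetical helper: print the status, then the logs, of one submitted job.
# $1: the submission ID that `ray job submit` prints when the job is created.
# $2 (optional): dashboard address; defaults to the port-forwarded address.
show_job() {
  ray job status --address "${2:-http://localhost:8265}" "$1"
  ray job logs --address "${2:-http://localhost:8265}" "$1"
}

# Usage: show_job <submission_id>
```

Passing `--address` explicitly mirrors the `ray job submit` command above; setting the RAY_ADDRESS environment variable is an alternative.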

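Step 5: Access the Ray Dashboard

Visit ${YOUR_IP}:8265 in your browser to open the Dashboard, for example 127.0.0.1:8265. The job you submitted in Step 4 appears in the Recent jobs pane, as shown below.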

[Figure: Ray Dashboard]
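If the page does not load, a quick command-line check can tell you whether the forwarded port is answering at all. This is a minimal sketch under stated assumptions: the helper name `dashboard_reachable` is hypothetical, it assumes `curl` is installed, and it only tests that the address returns a successful HTTP response, not that the Dashboard is fully functional.

```shell
# Hypothetical helper: succeed only if the dashboard address answers over HTTP.
# $1 (optional): dashboard URL; defaults to the port-forwarded local address.
dashboard_reachable() {
  curl --silent --fail --output /dev/null --max-time 5 \
    "${1:-http://127.0.0.1:8265}"
}

# Usage: dashboard_reachable && echo "dashboard is up"
```

A failure here usually means the `kubectl port-forward` command from Step 4 is no longer running.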

Step 6: Cleanup

# [Step 6.1]: Delete the RayCluster CR
# Uninstall the RayCluster Helm chart
helm uninstall raycluster
# release "raycluster" uninstalled

# Note that it may take several seconds for the Ray pods to be fully terminated.
# Confirm that the RayCluster's pods are gone by running
kubectl get pods

# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-7fbdbf8c89-pt8bk   1/1     Running   0          XXm

# [Step 6.2]: Delete the KubeRay operator
# Uninstall the KubeRay operator Helm chart
helm uninstall kuberay-operator
# release "kuberay-operator" uninstalled

# Confirm that the KubeRay operator pod is gone by running
kubectl get pods
# No resources found in default namespace.

# [Step 6.3]: Delete the Kubernetes cluster
kind delete cluster