KubeRay 中的 GCS 容错#

全局控制服务 (GCS) 管理集群级别的元数据。默认情况下，GCS 缺乏容错能力，因为它将所有数据存储在内存中，故障可能导致整个 Ray 集群失败。要使 GCS 具有容错能力，您必须拥有一个高可用的 Redis。这样，在 GCS 重启时，它可以从 Redis 实例中检索所有数据并恢复其常规功能。

没有GCS容错的命运共享

在没有 GCS 容错的情况下，Ray 集群、GCS 进程和 Ray 头 Pod 是命运共享的。如果 GCS 进程死亡，Ray 头 Pod 也会在 RAY_gcs_rpc_server_reconnect_timeout_s 秒后死亡。如果根据 Pod 的 restartPolicy 重新启动 Ray 头 Pod，工作 Pod 会尝试重新连接到新的头 Pod。工作 Pod 会被新的头 Pod 终止；在没有启用 GCS 容错的情况下，集群状态会丢失，工作 Pod 会被新的头 Pod 视为“未知工作节点”。这对于大多数 Ray 应用来说是足够的；然而，对于 Ray Serve 来说并不理想，特别是如果你的用例对高可用性至关重要。因此，我们建议在 RayService 自定义资源上启用 GCS 容错，以确保高可用性。更多信息请参见 Ray Serve 端到端容错文档。

用例#

Ray Serve: 推荐的配置是在 RayService 自定义资源上启用 GCS 容错，以确保高可用性。更多信息请参阅 Ray Serve 端到端容错文档。
其他工作负载：不推荐使用GCS容错，且兼容性不保证。

先决条件#

Ray 2.0.0+
KubeRay 0.6.0+
Redis: 单个分片，一个或多个副本

快速入门#

步骤 1：使用 Kind 创建一个 Kubernetes 集群#

kind create cluster --image=kindest/node:v1.26.0

步骤 2：安装 KubeRay 操作员#

按照此文档通过 Helm 仓库安装最新稳定版本的 KubeRay 操作员。

步骤 3：安装启用了 GCS FT 的 RayCluster#

curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray-cluster.external-redis.yaml
kubectl apply -f ray-cluster.external-redis.yaml

步骤 4：验证 Kubernetes 集群状态#

# Step 4.1: List all Pods in the `default` namespace.
# The expected output should be 4 Pods: 1 head, 1 worker, 1 KubeRay, and 1 Redis.
kubectl get pods

# Step 4.2: List all ConfigMaps in the `default` namespace.
kubectl get configmaps

# [Example output]
# NAME                  DATA   AGE
# ray-example           2      5m45s
# redis-config          1      5m45s
# ...

The ray-cluster.external-redis.yaml 文件定义了 RayCluster、Redis 和 ConfigMaps 的 Kubernetes 资源。在这个例子中有两个 ConfigMaps：ray-example 和 redis-config。ray-example ConfigMap 包含两个 Python 脚本：detached_actor.py 和 increment_counter.py。

detached_actor.py 是一个 Python 脚本，它创建了一个名为 counter_actor 的分离角色。

import ray

@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

ray.init(namespace="default_namespace")
Counter.options(name="counter_actor", lifetime="detached").remote()

increment_counter.py 是一个递增计数器的 Python 脚本。

import ray

ray.init(namespace="default_namespace")
counter = ray.get_actor("counter_actor")
print(ray.get(counter.increment.remote()))

步骤 5：创建一个分离的演员#

# Step 4.1: Create a detached actor with the name, `counter_actor`.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py

# Step 4.2: Increment the counter.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py

# 2023-09-07 16:01:41,925 INFO worker.py:1313 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2023-09-07 16:01:41,926 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.244.0.17:6379...
# 2023-09-07 16:01:42,000 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://10.244.0.17:8265
# 1

步骤 6：检查 Redis 中的数据#

# Step 6.1: Check the RayCluster's UID.
kubectl get rayclusters.ray.io raycluster-external-redis -o=jsonpath='{.metadata.uid}'
# [Example output]: 864b004c-6305-42e3-ac46-adfa8eb6f752

# Step 6.2: Check the head Pod's environment variable `RAY_external_storage_namespace`.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
# [Example output]:
# [
#   {
#     "name": "RAY_external_storage_namespace",
#     "value": "864b004c-6305-42e3-ac46-adfa8eb6f752"
#   },
#   ...
# ]

# Step 6.3: Log into the Redis Pod.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
# The password `5241590000000000` is defined in the `redis-config` ConfigMap.
kubectl exec -it $REDIS_POD -- redis-cli -a "5241590000000000"

# Step 6.4: Check the keys in Redis.
KEYS *
# [Example output]:
# 1) "864b004c-6305-42e3-ac46-adfa8eb6f752"

# Step 6.5: Check the value of the key.
HGETALL 864b004c-6305-42e3-ac46-adfa8eb6f752

在 ray-cluster.external-redis.yaml 中，ray.io/external-storage-namespace 注解没有为 RayCluster 设置。因此，KubeRay 会自动将环境变量 RAY_external_storage_namespace 注入到 RayCluster 管理的所有 Ray Pod 中，默认情况下使用 RayCluster 的 UID 作为外部存储命名空间。请参阅本节以了解更多关于该注解的信息。

步骤7：在头部Pod中终止GCS进程#

# Step 7.1: Check the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable in both the head Pod and worker Pod.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
# [Expected result]:
# No `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable is set. Hence, the Ray head uses its default value of `60`.
kubectl get pods $YOUR_WORKER_POD -o=jsonpath='{.spec.containers[0].env}' | jq
# [Expected result]:
# KubeRay injects the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable with the value `600` to the worker Pod.
# [
#   {
#     "name": "RAY_gcs_rpc_server_reconnect_timeout_s",
#     "value": "600"
#   },
#   ...
# ]

# Step 7.2: Kill the GCS process in the head Pod.
kubectl exec -it $HEAD_POD -- pkill gcs_server

# Step 7.3: The head Pod fails and restarts after `RAY_gcs_rpc_server_reconnect_timeout_s` (60) seconds.
# In addition, the worker Pod isn't terminated by the new head after reconnecting because GCS fault
# tolerance is enabled.
kubectl get pods -l=ray.io/is-ray-node=yes
# [Example output]:
# NAME                                                 READY   STATUS    RESTARTS      AGE
# raycluster-external-redis-head-xxxxx                 1/1     Running   1 (64s ago)   xxm
# raycluster-external-redis-worker-small-group-yyyyy   1/1     Running   0             xxm

在 ray-cluster.external-redis.yaml 中，RAY_gcs_rpc_server_reconnect_timeout_s 环境变量在 RayCluster 的头 Pod 或工作 Pod 的规范中都没有设置。因此，KubeRay 会自动将 RAY_gcs_rpc_server_reconnect_timeout_s 环境变量以值 600 注入到工作 Pod 中，并为头 Pod 使用默认值 60。工作 Pod 的超时值必须比头 Pod 的超时值长，以确保工作 Pod 在头 Pod 从故障中重启之前不会终止。

步骤 8：再次访问分离的演员#

kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py
# 2023-09-07 17:31:17,793 INFO worker.py:1313 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2023-09-07 17:31:17,793 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.244.0.21:6379...
# 2023-09-07 17:31:17,875 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://10.244.0.21:8265
# 2

在这个例子中，分离的执行者总是位于工作 Pod 上。

头部 Pod 的 rayStartParams 设置为 num-cpus: "0"。因此，不会在头部 Pod 上调度任务或角色。

启用 GCS 容错后，即使 GCS 进程死亡并重新启动，您仍然可以访问分离的执行者。请注意，容错不会持久化执行者的状态。结果为 2 而不是 1 的原因是分离的执行者位于始终运行的工作 Pod 上。另一方面，如果头 Pod 托管分离的执行者，increment_counter.py 脚本在此步骤中会产生 1 的结果。

步骤 9：删除 RayCluster 时移除存储在 Redis 中的键#

# Step 9.1: Delete the RayCluster custom resource.
kubectl delete raycluster raycluster-external-redis

# Step 9.2: KubeRay operator deletes all Pods in the RayCluster.
# Step 9.3: KubeRay operator creates a Kubernetes Job to delete the Redis key after the head Pod is terminated.

# Step 9.4: Check whether the RayCluster has been deleted.
kubectl get raycluster
# [Expected output]: No resources found in default namespace.

# Step 9.5: Check Redis keys after the Kubernetes Job finishes.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $REDIS_POD -- redis-cli -a "5241590000000000"
KEYS *
# [Expected output]: (empty list or set)

在 KubeRay v1.0.0 中，KubeRay 操作员为启用了 GCS 容错功能的 RayCluster 添加了一个 Kubernetes 终结器，以确保 Redis 清理。KubeRay 仅在 Kubernetes Job 成功清理 Redis 后才移除此终结器。

换句话说，如果 Kubernetes Job 失败，RayCluster 不会被删除。在这种情况下，您应该移除 finalizer 并手动清理 Redis。
```
kubectl patch rayclusters.ray.io raycluster-external-redis --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
```

从 KubeRay v1.1.0 开始，KubeRay 将 Redis 清理行为从强制性改为尽力而为。如果 Kubernetes Job 失败，KubeRay 仍然会从 RayCluster 中移除 Kubernetes 终结器，从而解除 RayCluster 的删除阻塞。

用户可以通过设置特性门值 ENABLE_GCS_FT_REDIS_CLEANUP 来关闭此功能。更多详情请参阅 KubeRay GCS 故障容错配置部分。

步骤 10：删除 Kubernetes 集群#

kind delete cluster

KubeRay GCS 故障容错配置#

在快速入门示例中使用的 ray-cluster.external-redis.yaml 包含了关于配置选项的详细注释。请结合 YAML 文件阅读本节。

1. Enable GCS fault tolerance#

ray.io/ft-enabled: 在 RayCluster 自定义资源中添加 ray.io/ft-enabled: "true" 注解以启用 GCS 容错。

kind: RayCluster
metadata:
annotations:
    ray.io/ft-enabled: "true" # <- Add this annotation to enable GCS fault tolerance

2. Connect to an external Redis#

redis-password 在头部的 rayStartParams 中：使用此选项来指定 Redis 服务的密码，从而允许 Ray 头节点连接到它。在 ray-cluster.external-redis.yaml 中，RayCluster 自定义资源使用环境变量 REDIS_PASSWORD 从 Kubernetes 密钥中存储密码。

rayStartParams:
  redis-password: $REDIS_PASSWORD
template:
  spec:
    containers:
      - name: ray-head
        env:
          # This environment variable is used in the `rayStartParams` above.
          - name: REDIS_PASSWORD
            valueFrom:
              secretKeyRef:
                name: redis-password-secret
                key: password

RAY_REDIS_ADDRESS 环境变量在头部的 Pod 中：Ray 读取 RAY_REDIS_ADDRESS 环境变量以建立与 Redis 服务器的连接。在 ray-cluster.external-redis.yaml 中，RayCluster 自定义资源使用 redis Kubernetes ClusterIP 服务名称作为连接到 Redis 服务器的连接点。ClusterIP 服务也由 YAML 文件创建。
```
template:
  spec:
    containers:
      - name: ray-head
        env:
          - name: RAY_REDIS_ADDRESS
            value: redis:6379
```

3. Use an external storage namespace#

ray.io/external-storage-namespace 注解（可选）：KubeRay 使用此注解的值来设置所有由 RayCluster 管理的 Ray Pod 的环境变量 RAY_external_storage_namespace。在大多数情况下，您不需要设置 ray.io/external-storage-namespace，因为 KubeRay 会自动将其设置为 RayCluster 的 UID。只有在完全理解 GCS 容错和 RayService 的行为时才修改此注解，以避免此问题。更多详情请参阅此部分在之前的快速入门示例中。
```
kind: RayCluster
metadata:
annotations:
    ray.io/external-storage-namespace: "my-raycluster-storage" # <- Add this annotation to specify a storage namespace
```

4. Turn off Redis cleanup#

ENABLE_GCS_FT_REDIS_CLEANUP: 默认为 True。你可以通过在 KubeRay 操作员的 Helm 图表中设置环境变量来关闭此功能。

Redis 上的键驱逐设置

如果你禁用了 ENABLE_GCS_FT_REDIS_CLEANUP 但希望 Redis 自动移除 GCS 元数据，请在你的 redis.conf 文件或 redis-server 命令的命令行选项中设置这两个选项 (示例)：

maxmemory=<你的内存限制>
maxmemory-policy=allkeys-lru

这两个选项指示 Redis 在达到 maxmemory 限制时删除最近最少使用的键。更多信息请参见 Redis 的键驱逐。

请注意，Redis 执行此驱逐操作，并且不能保证 Ray 不会使用已删除的键。

下一步#

更多信息请参见 Ray Serve 端到端容错文档。
更多信息请参见 Ray Core GCS 容错文档。