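RayService troubleshooting

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService first creates a RayCluster and then creates the Ray Serve applications once the RayCluster is ready. Problems in the data plane, especially in your Ray Serve scripts or Ray Serve configuration (serveConfigV2), can be hard to troubleshoot. This section provides tips to help you debug them.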

Observability

Method 1: Check the KubeRay operator's logs for errors

kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

The command above prints the operator's logs to the terminal and also saves them to a file named operator-log. You can then search that file for errors.
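As a minimal sketch of that search, the helper below prints the lines of a saved log that look like errors, together with their line numbers. The function name find_operator_errors and the search pattern are illustrative assumptions, not part of KubeRay; adjust the pattern to the messages you care about.

```shell
# Hypothetical helper: print lines of a saved log file that look like errors.
# The pattern is an assumption; extend or narrow it as needed.
find_operator_errors() {
  grep -inE 'error|failed|refused|timeout' "$1"
}

# Usage, after saving the log with the command above:
#   find_operator_errors operator-log
```

A match is only a starting point: read the surrounding lines in operator-log to see which RayService and RayCluster the message refers to.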

Method 2: Check the RayService CR status

kubectl describe rayservice $RAYSERVICE_NAME -n $YOUR_NAMESPACE

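Check the status and events of the RayService CR for errors.

Method 3: Check the logs of the Ray Pods

You can also read the Ray Serve logs directly from the log files on the Pods. These files contain system-level logs from the Serve controller and HTTP proxy, as well as access logs and user-level logs. See Ray Serve Logging and Ray Logging for more details.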

kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash
# Check the logs under /tmp/ray/session_latest/logs/serve/

Method 4: Check the Dashboard

kubectl port-forward $RAY_POD -n $YOUR_NAMESPACE 8265:8265
# Check $YOUR_IP:8265 in your browser

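For more details about Ray Serve observability on the dashboard, refer to the documentation and the YouTube video.

Method 5: Ray State CLI

You can use the Ray State CLI on the head Pod to check the status of Ray Serve applications.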

# Log into the head Pod
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- ray summary actors

# [Example output]:
# ======== Actors Summary: 2023-07-11 17:58:24.625032 ========
# Stats:
# ------------------------------------
# total_actors: 14


# Table (group by class):
# ------------------------------------
#     CLASS_NAME                          STATE_COUNTS
# 0   ServeController                     ALIVE: 1
# 1   ServeReplica:fruit_app_OrangeStand  ALIVE: 1
# 2   ProxyActor                          ALIVE: 3
# 4   ServeReplica:math_app_Multiplier    ALIVE: 1
# 5   ServeReplica:math_app_create_order  ALIVE: 1
# 7   ServeReplica:fruit_app_FruitMarket  ALIVE: 1
# 8   ServeReplica:math_app_Adder         ALIVE: 1
# 9   ServeReplica:math_app_Router        ALIVE: 1
# 10  ServeReplica:fruit_app_MangoStand   ALIVE: 1
# 11  ServeReplica:fruit_app_PearStand    ALIVE: 1

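Common issues

Issue 1: Ray Serve script is incorrect.
Issue 2: serveConfigV2 is incorrect.
Issue 3-1: The Ray image doesn't include the required dependencies.
Issue 3-2: Examples for troubleshooting dependency issues.
Issue 4: Incorrect import_path.
Issue 5: Failed to create / update Serve applications.
Issue 6: runtime_env
Issue 7: Failed to get Serve application statuses.
Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources. (KubeRay v0.6.1 or earlier)
Issue 9: Upgrade from Ray Serve's single-application API to its multi-application API without downtime
Issue 10: Upgrade RayService with GCS fault tolerance enabled without downtime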

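Issue 1: Ray Serve script is incorrect.

We strongly recommend that you test your Ray Serve script locally or in a RayCluster before deploying it to a RayService. See rayserve-dev-doc.md for more details.

Issue 2: `serveConfigV2` is incorrect.

For flexibility, serveConfigV2 is a YAML multi-line string in the RayService CR. This means that the Ray Serve configuration in the serveConfigV2 field gets no strict type checking. Some tips to help you debug the serveConfigV2 field:

- Check the documentation for the schema of the Ray Serve multi-application API, PUT "/api/serve/applications/".
- Unlike serveConfig, serveConfigV2 follows the snake case naming convention. For example, serveConfig uses numReplicas, whereas serveConfigV2 uses num_replicas.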
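Because the field is not type-checked, a leftover camelCase key is easy to miss. The sketch below flags YAML keys that contain an uppercase letter. It assumes you have saved the body of serveConfigV2 to its own file (the name serve-config.yaml in the usage line is hypothetical), since the surrounding RayService manifest legitimately uses camelCase keys. The function name is also hypothetical.

```shell
# Hypothetical check: list keys such as numReplicas that contain an
# uppercase letter and are therefore probably camelCase leftovers.
# LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
find_camel_case_keys() {
  LC_ALL=C grep -nE '^[[:space:]-]*[a-z_]+[A-Z][A-Za-z_]*:' "$1"
}

# Usage:
#   find_camel_case_keys serve-config.yaml
```

Treat a hit as a hint rather than proof: the check only looks at key spelling and does not validate the configuration against the Ray Serve schema.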

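Issue 3-1: The Ray image doesn't include the required dependencies.

You have two options to resolve this issue:

- Build your own Ray image with the required dependencies.
- Specify the required dependencies through runtime_env in the serveConfigV2 field.
  - For example, the MobileNet example requires python-multipart, which is not included in the Ray image rayproject/ray-ml:2.5.0. Therefore, the YAML file includes python-multipart in the runtime environment. See the MobileNet example for more details.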

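Issue 3-2: Examples for troubleshooting dependency issues.

Note: We strongly recommend that you test your Ray Serve script locally or in a RayCluster before deploying it to a RayService. This helps identify dependency issues at an early stage. See rayserve-dev-doc.md for more details.

In the MobileNet example, mobilenet.py contains two functions: __init__() and __call__(). The function __call__() is only called when the Serve application receives a request.

Example 1: Remove python-multipart from the runtime environment in the MobileNet YAML.

The python-multipart library is only required by the __call__ method. Therefore, the dependency issue only shows up when a request is sent to the application.

Example error message: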

Unexpected error, traceback: ray::ServeReplica:mobilenet_ImageClassifier.handle_request() (pid=226, ip=10.244.0.9)
  .
  .
  .
  File "...", line 24, in __call__
    request = await http_request.form()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/starlette/requests.py", line 256, in _get_form
    ), "The `python-multipart` library must be installed to use form parsing."
AssertionError: The `python-multipart` library must be installed to use form parsing..

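Example 2: Update the image in the MobileNet YAML from rayproject/ray-ml:2.5.0 to rayproject/ray:2.5.0. The latter image doesn't include tensorflow.

The tensorflow library is imported at the top of mobilenet.py, so the failure appears as soon as the application is deployed, before any request is sent.

Check the RayService status to see the error: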

kubectl describe rayservices.ray.io rayservice-mobilenet

# Example error message:
Pending Service Status:
  Application Statuses:
    Mobilenet:
      ...
      Message:                  Deploying app 'mobilenet' failed:
        ray::deploy_serve_application() (pid=279, ip=10.244.0.12)
            ...
          File ".../mobilenet/mobilenet.py", line 4, in <module>
            from tensorflow.keras.preprocessing import image
        ModuleNotFoundError: No module named 'tensorflow'

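Issue 4: Incorrect `import_path`.

Refer to the documentation for more details about the format of import_path. Taking the MobileNet YAML file as an example, the import_path is mobilenet.mobilenet:app. The first mobilenet is the name of the directory in working_dir, the second mobilenet is the name of the Python file in the directory mobilenet/, and app is the name of the variable in that Python file that represents the Ray Serve application.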

  serveConfigV2: |
    applications:
      - name: mobilenet
        import_path: mobilenet.mobilenet:app
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip"
          pip: ["python-multipart==0.0.6"]
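To make the mapping concrete, the sketch below splits an import_path into the file it should resolve to (relative to working_dir) and the variable name. The helper name describe_import_path is hypothetical, and the sketch assumes the simple `<module path>:<variable>` form described above.

```shell
# Hypothetical helper: split an import_path of the form <module path>:<variable>.
describe_import_path() {
  module=${1%%:*}      # text before the first ':', e.g. mobilenet.mobilenet
  variable=${1#*:}     # text after the first ':', e.g. app
  file=$(printf '%s' "$module" | tr '.' '/').py
  printf 'file=%s variable=%s\n' "$file" "$variable"
}

describe_import_path "mobilenet.mobilenet:app"
# prints: file=mobilenet/mobilenet.py variable=app
```

If the printed file does not exist inside the unpacked working_dir archive, or the file does not define the printed variable, the import_path is a likely cause of the deployment failure.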

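Issue 5: Failed to create / update Serve applications.

You may encounter the following error messages when KubeRay tries to create or update Serve applications:

Error message 1: `connect: connection refused`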

Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused

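For RayService, the KubeRay operator submits a request to the RayCluster to create Serve applications once the head Pod is ready. Note that the dashboard, the dashboard agent, and GCS may take a few seconds to start after the head Pod becomes ready. As a result, the request may fail a few times before the necessary components are fully operational.

If you still encounter this issue after waiting for 1 minute, the dashboard or the dashboard agent may have failed to start. For more information, check the dashboard.log and dashboard_agent.log files under /tmp/ray/session_latest/logs/ on the head Pod.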
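The "retry for about a minute, then investigate" rule can be sketched as a small retry loop. The helper name retry_until_ok is hypothetical, and the curl probe in the usage comment is an assumption: it only works from a place where the head service name resolves (for example, a Pod inside the cluster) and where curl is installed.

```shell
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# usage: retry_until_ok <attempts> <delay_seconds> <command...>
# Note: attempts, delay and i are plain global shell variables.
retry_until_ok() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi        # success: stop retrying
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1                            # every attempt failed
}

# Example: probe the dashboard agent 12 times, 5 seconds apart (about 1 minute):
#   retry_until_ok 12 5 curl -sf "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/"
```

If the probe still fails after the last attempt, move on to the dashboard.log and dashboard_agent.log files mentioned above.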

Error message 2: `i/o timeout`

Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: i/o timeout

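One possible cause of this issue is a Kubernetes NetworkPolicy that blocks traffic between the Ray Pods and the dashboard agent's port (i.e., 52365).

Issue 6: `runtime_env`

In serveConfigV2, you can specify the runtime environment for the Ray Serve applications through runtime_env. Some common issues related to runtime_env:

- The working_dir points to a private AWS S3 bucket, but the Ray Pods don't have the permissions required to access the bucket.
- A NetworkPolicy blocks traffic between the Ray Pods and the external URLs specified in runtime_env.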

Issue 7: Failed to get Serve application statuses.

You may encounter the following error message when KubeRay tries to get Serve application statuses:

Get "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused

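As mentioned in Issue 5, the KubeRay operator submits a Put request to the RayCluster to create Serve applications once the head Pod is ready. After the Put request to the dashboard agent succeeds, the operator sends a Get request to the dashboard agent port (i.e., 52365). The successful Put indicates that all the necessary components, including the dashboard agent, are fully operational. Therefore, unlike in Issue 5, a failing Get request is not expected.

If you consistently encounter this issue, there are several possible causes:

- The dashboard agent process on the head Pod is not running. Check the dashboard_agent.log file under /tmp/ray/session_latest/logs/ on the head Pod for more information. You can also reproduce this issue by manually killing the dashboard agent process on the head Pod.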

# Step 1: Log in to the head Pod
kubectl exec -it $HEAD_POD -n $YOUR_NAMESPACE -- bash

# Step 2: Check the PID of the dashboard agent process
ps aux
# [Example output]
# ray          156 ... 0:03 /.../python -u /.../ray/dashboard/agent.py --

# Step 3: Kill the dashboard agent process
kill 156

# Step 4: Check the logs
cat /tmp/ray/session_latest/logs/dashboard_agent.log

# [Example output]
# 2023-07-10 11:24:31,962 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:31 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:34,001 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:33 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:36,043 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:36 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:38,082 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:38 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:38,590 WARNING agent.py:531 -- Exiting with SIGTERM immediately...

# Step 5: Open a new terminal and check the logs of the KubeRay operator
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

# [Example output]
# Get \"http://rayservice-sample-raycluster-rqlsl-head-svc.default.svc.cluster.local:52365/api/serve/applications/\": dial tcp 10.96.7.154:52365: connect: connection refused

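Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources. (KubeRay v0.6.1 or earlier)

Note: Currently, the KubeRay operator does not have a clear plan to handle situations where the Kubernetes cluster runs out of resources. Therefore, we recommend ensuring that the Kubernetes cluster has sufficient resources to accommodate the serve application.

If the status of a serve application stays non-RUNNING for more than serviceUnhealthySecondThreshold seconds, the KubeRay operator considers the RayCluster unhealthy and starts preparing a new RayCluster. A common cause of this issue is that the Kubernetes cluster does not have enough resources to accommodate the serve application. In that case, the KubeRay operator may keep restarting the RayCluster, leading to a restart loop.

You can reproduce this situation with the following experiment:

- A Kubernetes cluster with one 8-CPU node
- ray-service.insufficient-resources.yaml
  - RayCluster:
    - The cluster has 1 head Pod with 4 physical CPUs, but num-cpus is set to 0 in rayStartParams to prevent any serve replicas from being scheduled on the head Pod.
    - The cluster also has 1 worker Pod with 1 CPU by default.
  - serveConfigV2 specifies 5 serve deployments, each with 1 replica and a requirement of 1 CPU.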
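A back-of-the-envelope calculation shows why this setup cannot fit. The sketch assumes that the head Pod requests its 4 CPUs from the node, that every replica needs its own 1-CPU worker Pod, and that CPU requested by system Pods is ignored; the variable names are illustrative.

```shell
# Rough capacity check, using the numbers from the experiment above.
node_cpus=8          # allocatable CPUs on the single node
head_cpus=4          # CPUs taken by the head Pod (it runs no replicas: num-cpus=0)
replicas=5           # 5 deployments x 1 replica each
cpus_per_replica=1   # each replica requires 1 CPU

needed=$((head_cpus + replicas * cpus_per_replica))
echo "needed=$needed available=$node_cpus shortfall=$((needed - node_cpus))"
# prints: needed=9 available=8 shortfall=1
```

Under these assumptions at least one replica can never be scheduled, so the application cannot reach RUNNING and the operator eventually prepares a new RayCluster.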

# Step 1: Get the number of CPUs available on the node
kubectl get nodes -o custom-columns=NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu

# [Example output]
# NODE                 ALLOCATABLE_CPU
# kind-control-plane   8

# Step 2: Install a KubeRay operator.

# Step 3: Create a RayService with autoscaling enabled.
kubectl apply -f ray-service.insufficient-resources.yaml

# Step 4: The Kubernetes cluster will not have enough resources to accommodate the serve application.
kubectl describe rayservices.ray.io rayservice-sample -n $YOUR_NAMESPACE

# [Example output]
# fruit_app_FruitMarket:
#   Health Last Update Time:  2023-07-11T02:10:02Z
#   Last Update Time:         2023-07-11T02:10:35Z
#   Message:                  Deployment "fruit_app_FruitMarket" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 1.0}, resources available: {}.
#   Status:                   UPDATING

# Step 5: A new RayCluster will be created after `serviceUnhealthySecondThreshold` (300s here) seconds.
# Check the logs of the KubeRay operator to find the reason for restarting the RayCluster.
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

# [Example output]
# 2023-07-11T02:14:58.109Z	INFO	controllers.RayService	Restart RayCluster	{"appName": "fruit_app", "restart reason": "The status of the serve application fruit_app has not been RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster."}
# 2023-07-11T02:14:58.109Z	INFO	controllers.RayService	Restart RayCluster	{"deploymentName": "fruit_app_FruitMarket", "appName": "fruit_app", "restart reason": "The status of the serve deployment fruit_app_FruitMarket or the serve application fruit_app has not been HEALTHY/RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster. The message of the serve deployment is: Deployment \"fruit_app_FruitMarket\" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {\"CPU\": 1.0}, resources available: {}."}
# .
# .
# .
# 2023-07-11T02:14:58.122Z	INFO	controllers.RayService	Restart RayCluster	{"ServiceName": "default/rayservice-sample", "AvailableWorkerReplicas": 1, "DesiredWorkerReplicas": 5, "restart reason": "The serve application is unhealthy, restarting the cluster. If the AvailableWorkerReplicas is not equal to DesiredWorkerReplicas, this may imply that the Autoscaler does not have enough resources to scale up the cluster. Hence, the serve application does not have enough resources to run. Please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details.", "RayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "rayservice-sample-raycluster-hvd9f"}}

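Issue 9: Upgrade from Ray Serve's single-application API to its multi-application API without downtime

KubeRay v0.6.0 started supporting Ray Serve API V2 (multi-application) by exposing serveConfigV2 in the RayService CRD. However, Ray Serve does not support deploying both API V1 and API V2 in the same cluster at the same time. Hence, if you try to perform an in-place upgrade by replacing serveConfig with serveConfigV2, you may encounter the following error message: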

ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
the multi-app API endpoint `/api/serve/applications/`.

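To resolve this issue, replace serveConfig with serveConfigV2 and change rayVersion to 2.100.0. The rayVersion field has no effect when the Ray version is 2.0.0 or later, so the only consequence of changing it is that KubeRay prepares a new RayCluster instead of performing an in-place update.

If you still see the error message after following these steps and GCS fault tolerance is enabled, the cause may be that the old and new RayClusters share the same ray.io/external-storage-namespace annotation. You can remove the annotation, and KubeRay automatically generates a unique key for each RayCluster custom resource. See kuberay#1297 for more details.

Issue 10: Upgrade RayService with GCS fault tolerance enabled without downtime

KubeRay uses the value of the annotation ray.io/external-storage-namespace to set the environment variable RAY_external_storage_namespace on all Ray Pods managed by the RayCluster. This value is the storage namespace in Redis where the Ray cluster metadata resides. During head Pod recovery, the head Pod uses the RAY_external_storage_namespace value to reconnect to the Redis server and restore the cluster data.

However, specifying the RAY_external_storage_namespace value in a RayService can cause downtime during a zero-downtime upgrade. Specifically, the new RayCluster reads cluster metadata from the same Redis storage namespace as the old one. Because that metadata already exists in Redis, the KubeRay operator may assume that the Ray Serve applications are operational. The operator may then consider it safe to retire the old RayCluster and redirect traffic to the new one, even though the new cluster may still need time to initialize the Ray Serve applications.

The recommended solution is to remove the ray.io/external-storage-namespace annotation from the RayService custom resource. If the annotation isn't set, KubeRay automatically uses each RayCluster custom resource's UID as the RAY_external_storage_namespace value. The old and new RayClusters then have different RAY_external_storage_namespace values, and the new RayCluster cannot access the old cluster's metadata. Another solution is to manually set the RAY_external_storage_namespace value to a unique value for each RayCluster custom resource. See kuberay#1296 for more details.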
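Before upgrading, a plain text search is usually enough to see whether a manifest still pins the annotation. The function name below is hypothetical, and rayservice.yaml in the usage line stands in for your own manifest file.

```shell
# Hypothetical check: show where a manifest sets the storage namespace annotation.
# Prints matching lines with line numbers; prints nothing if the annotation is absent.
find_storage_namespace_annotation() {
  grep -nF 'ray.io/external-storage-namespace' "$1"
}

# Usage:
#   find_storage_namespace_annotation rayservice.yaml
```

No output means the manifest leaves the value to KubeRay, which is the recommended setup described above.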