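Serve a Large Language Model with vLLM on Kubernetes#

This guide demonstrates how to serve a large language model with vLLM on Kubernetes using KubeRay. The example in this guide deploys the meta-llama/Meta-Llama-3-8B-Instruct model from Hugging Face on Google Kubernetes Engine (GKE).

Prerequisites#

This example downloads model weights from Hugging Face. To complete this guide, you need the following:

A Hugging Face account.
A Hugging Face access token with read access to gated repos.
Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face. See the Llama 3 model page for more details.

Create a Kubernetes cluster on GKE#

Create a GKE cluster with a GPU node pool: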

gcloud container clusters create kuberay-gpu-cluster \
    --machine-type=g2-standard-24 \
    --location=us-east4-c \
    --num-nodes=2 \
    --accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest

This example uses L4 GPUs. Each model replica uses 2 L4 GPUs through vLLM's tensor parallelism.
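
The GPU budget implied by the cluster command above can be checked with simple arithmetic. The following is a minimal sketch, not part of the KubeRay sample: the helper name `max_replicas` is hypothetical, and it assumes that all GPUs of one replica sit on a single node, which matches this example's setup but is not a general requirement.

```python
def max_replicas(num_nodes: int, gpus_per_node: int, tensor_parallelism: int) -> int:
    """Upper bound on model replicas, assuming a replica never spans nodes."""
    if tensor_parallelism < 1:
        raise ValueError("tensor_parallelism must be at least 1")
    # Each replica needs `tensor_parallelism` GPUs on the same node.
    replicas_per_node = gpus_per_node // tensor_parallelism
    return num_nodes * replicas_per_node


# Values taken from the gcloud command: 2 nodes, 2 L4 GPUs each, 2 GPUs per replica.
print(max_replicas(num_nodes=2, gpus_per_node=2, tensor_parallelism=2))  # prints 2
```

Under these assumptions the cluster has room for at most two replicas, while the Serve config in this guide starts one.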

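Install the KubeRay operator#

Follow the Deploy a KubeRay operator instructions to install the latest stable KubeRay operator from the Helm repository. If you set up the taints for the GPU node pool correctly, the KubeRay operator Pod must run on a CPU node.

Create a Kubernetes Secret containing your Hugging Face access token#

Create a Kubernetes Secret containing your Hugging Face access token: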

export HF_TOKEN=<Hugging Face access token>
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

This guide references this Secret as an environment variable in the RayCluster used in the next steps.

Deploy a RayService#

Create a RayService custom resource:

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml

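This step configures a RayService to deploy a Ray Serve application that runs vLLM as the serving engine for the Llama 3 8B Instruct model. You can find the code for this example on GitHub. Inspect the Serve config for more details about the Serve deployment: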

  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ray-operator.config.samples.vllm.serve:model
      deployments:
      - name: VLLMDeployment
        num_replicas: 1
        ray_actor_options:
          num_cpus: 8
          # NOTE: num_gpus is set automatically based on TENSOR_PARALLELISM
      runtime_env:
        working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
        pip: ["vllm==0.5.4"]
        env_vars:
          MODEL_ID: "meta-llama/Meta-Llama-3-8B-Instruct"
          TENSOR_PARALLELISM: "2"
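
The `env_vars` in the Serve config reach the application as ordinary environment variables. The following is a minimal sketch of how application code might read them; it is not the actual `serve.py` from the KubeRay repository. The helper name `read_vllm_settings`, the default of `1` for `TENSOR_PARALLELISM`, and the returned dictionary layout are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def read_vllm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read MODEL_ID and TENSOR_PARALLELISM from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    model_id = env["MODEL_ID"]  # required; raises KeyError when missing
    # Assumption: fall back to a single GPU when the variable is not set.
    tensor_parallelism = int(env.get("TENSOR_PARALLELISM", "1"))
    if tensor_parallelism < 1:
        raise ValueError("TENSOR_PARALLELISM must be at least 1")
    return {
        "model_id": model_id,
        "tensor_parallelism": tensor_parallelism,
        # One GPU per tensor-parallel shard.
        "num_gpus": tensor_parallelism,
    }


settings = read_vllm_settings(
    {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct", "TENSOR_PARALLELISM": "2"}
)
print(settings["num_gpus"])  # prints 2
```

Passing the mapping explicitly keeps the helper easy to test; calling it without arguments reads the process environment instead.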

Wait for the RayService resource to be ready. You can check its status by running the following command:

$ kubectl get rayservice llama-3-8b -o yaml

The output should contain the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-08-08T22:56:50Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-08-08T22:56:50Z"
            status: HEALTHY
        status: RUNNING
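
The status shown above can also be checked programmatically, for example after loading the output of `kubectl get rayservice llama-3-8b -o json` with `json.loads`. The following is a minimal sketch that assumes the status layout shown in this guide; the function name `is_rayservice_healthy` is hypothetical, and other KubeRay versions may report status under different keys.

```python
def is_rayservice_healthy(rayservice: dict) -> bool:
    """Return True when every Serve application is RUNNING and every deployment is HEALTHY."""
    apps = (
        rayservice.get("status", {})
        .get("activeServiceStatus", {})
        .get("applicationStatuses", {})
    )
    if not apps:
        # No application reported yet, so the service is not ready.
        return False
    for app in apps.values():
        if app.get("status") != "RUNNING":
            return False
        deployments = app.get("serveDeploymentStatuses", {})
        if any(d.get("status") != "HEALTHY" for d in deployments.values()):
            return False
    return True


# Same structure as the sample output above, with timestamps omitted.
example = {
    "status": {
        "activeServiceStatus": {
            "applicationStatuses": {
                "llm": {
                    "serveDeploymentStatuses": {"VLLMDeployment": {"status": "HEALTHY"}},
                    "status": "RUNNING",
                }
            }
        }
    }
}
print(is_rayservice_healthy(example))  # prints True
```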

Send a prompt#

Confirm that the Ray Serve deployment is healthy, then establish a port-forwarding session for the Serve application:

$ kubectl port-forward svc/llama-3-8b-serve-svc 8000

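Note that KubeRay creates this Kubernetes Service after the Serve application is ready and running. This process may take several minutes after all Pods in the RayCluster are running.

Now you can send a prompt to the model: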

$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Provide a brief sentence describing the Ray open-source project."}
      ],
      "temperature": 0.7
    }'

The output should be similar to the following and contain the response generated by the model:

{"id":"cmpl-ce6585cd69ed47638b36ddc87930fded","object":"chat.completion","created":1723161873,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The Ray open-source project is a high-performance distributed computing framework that allows users to scale Python applications and machine learning models to thousands of nodes, supporting distributed data processing, distributed machine learning, and distributed analytics."},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":74,"completion_tokens":42}}
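
The same request can be sent from Python using only the standard library. The following is a minimal sketch that mirrors the curl command and the response shape shown above; the function names `build_chat_request`, `send_chat_request`, and `extract_reply` are hypothetical, and the URL assumes the port-forwarding session from this guide is still running.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"  # assumes the port-forward above
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def build_chat_request(prompt: str, temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_reply(response: dict) -> str:
    """Return the assistant text from the first choice of a chat completion."""
    return response["choices"][0]["message"]["content"]
```

With the port-forward active, `extract_reply(send_chat_request(build_chat_request("Provide a brief sentence describing the Ray open-source project.")))` would return the generated text. Keeping request construction and response parsing in separate functions lets both be exercised without a running cluster.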