Cluster YAML Configuration Options#

The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the Ray CLI to perform any operations such as starting and stopping the cluster.

Syntax#

cluster_name: str
max_workers: int
upscaling_speed: float
idle_timeout_minutes: int
docker:
    docker
provider:
    provider
auth:
    auth
available_node_types:
    node_types
head_node_type: str
file_mounts:
    file_mounts
cluster_synced_files:
    - str
rsync_exclude:
    - str
rsync_filter:
    - str
initialization_commands:
    - str
setup_commands:
    - str
head_setup_commands:
    - str
worker_setup_commands:
    - str
head_start_ray_commands:
    - str
worker_start_ray_commands:
    - str

Custom types#

Docker#

image: str
head_image: str
worker_image: str
container_name: str
pull_before_run: bool
run_options:
    - str
head_run_options:
    - str
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool

Auth#

ssh_user: str

Provider#

Security Group#

vSphere Config#

vSphere Credentials#

user: str
password: str
server: str

vSphere Frozen VM Configs#

name: str
library_item: str
resource_pool: str
cluster: str
datastore: str

vSphere GPU Configs#

Node types#

The keys of the available_node_types object represent the names of the different node types.

Deleting a node type from available_node_types and updating with ray up will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of the nodes corresponding to the old key; nodes with the new key name will then be created according to the cluster configuration and Ray resource demands.

<node_type_1_name>:
    node_config:
        Node config
    resources:
        Resources
    min_workers: int
    max_workers: int
    worker_setup_commands:
        - str
    docker:
        Node Docker
<node_type_2_name>:
    ...
...

Node config#

Cloud-specific configuration for nodes of a given node type.

Modifying node_config and updating with ray up will cause the autoscaler to scale down all existing nodes of the node type; nodes with the newly applied node_config will then be created according to the cluster configuration and Ray resource demands.

A YAML object which conforms to the EC2 create_instances API described in the AWS docs.

A YAML object as defined in the deployment template whose resources are defined in the Azure docs.

A YAML object as defined in the GCP docs.

# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
resource_pool: str
# The datastore to store the vmdk of the head node vm, if unset, will be
# the frozen VM's datastore.
datastore: str

Node Docker#

worker_image: str
pull_before_run: bool
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool

Resources#

CPU: int
GPU: int
object_store_memory: int
memory: int
<custom_resource1>: int
<custom_resource2>: int
...

File mounts#

<path1_on_remote_machine>: str # Path 1 on local machine
<path2_on_remote_machine>: str # Path 2 on local machine
...

Properties and Definitions#

cluster_name#

The name of the cluster. This is the namespace of the cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: "default"

  • Pattern: [a-zA-Z0-9_]+

max_workers#

The maximum number of workers the cluster will have at any given time.

  • Required:

  • Importance:

  • Type: Integer

  • Default: 2

  • Minimum: 0

  • Maximum: Unbounded

upscaling_speed#

The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. Note that although the autoscaler will scale down to min_workers (which could be 0), it will always scale up to at least 5 nodes when scaling up.

  • Required:

  • Importance: Medium

  • Type: Float

  • Default: 1.0

  • Minimum: 0.0

  • Maximum: Unbounded

idle_timeout_minutes#

The number of minutes that need to pass before an idle worker node is removed by the Autoscaler.

  • Required:

  • Importance: Medium

  • Type: Integer

  • Default: 5

  • Minimum: 0

  • Maximum: Unbounded

docker#

Configure Ray to run in Docker containers.

  • Required:

  • Importance:

  • Type: Docker

  • Default: {}

In rare cases when Docker is not available on the system by default (e.g., a badly configured AMI), add the following commands to the initialization commands to install it.

initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f

provider#

Cloud provider-specific configuration properties.

  • Required:

  • Importance:

  • Type: Provider

auth#

Authentication credentials that Ray will use to launch nodes.

  • Required:

  • Importance:

  • Type: Auth

available_node_types#

Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key.

  • Required:

  • Importance:

  • Type: Node types

  • Default:

available_node_types:
  ray.head.default:
      node_config:
        InstanceType: m5.large
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 140
      resources: {"CPU": 2}
  ray.worker.default:
      node_config:
        InstanceType: m5.large
        InstanceMarketOptions:
            MarketType: spot
      resources: {"CPU": 2}
      min_workers: 0

head_node_type#

The key of one of the node types in available_node_types. This node type will be used to launch the head node.

If the head_node_type field is changed and an update is executed with ray up, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scaling down the outdated head node, and the cluster will restart with a new head node. Changing the node_config of the node_type with the key head_node_type will also result in a cluster restart after a user prompt.

  • Required:

  • Importance:

  • Type: String

  • Pattern: [a-zA-Z0-9_]+
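
For example, the head node type is selected by referencing one of the keys defined under available_node_types. A minimal sketch (the key name and instance type are illustrative, matching the examples later on this page):

head_node_type: ray.head.default
available_node_types:
    ray.head.default:
        resources: {"CPU": 2}
        node_config:
            InstanceType: m5.large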

file_mounts#

The files or directories to copy to the head and worker nodes.

  • Required:

  • Importance:

  • Type: File mounts

  • Default: []
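
For example, the following sketch copies a local directory and a local file to the same paths on each node; the paths are placeholders:

file_mounts:
    "/path1/on/remote/machine": "/path1/on/local/machine"
    "~/project/config.yaml": "~/project/config.yaml"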

cluster_synced_files#

A list of paths to files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker nodes. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases you should just use file_mounts directly.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []
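
As a sketch, a directory that exists only on the head node (for example, files generated during setup) can be synced to the workers like this; the path is a placeholder:

cluster_synced_files:
    - "/tmp/generated_assets"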

rsync_exclude#

A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied only to the source directory.

Example of a pattern in the list: **/.git/**

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []

rsync_filter#

A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied to the source directory and recursively through all subdirectories.

Example of a pattern in the list: .gitignore

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []
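
The two options are commonly combined, as in the full configuration examples later on this page:

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"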

initialization_commands#

A list of commands that will be run before the setup commands. If Docker is enabled, these commands will run outside the container and before Docker is set up.

  • Required:

  • Importance: Medium

  • Type: List of strings

  • Default: []

setup_commands#

A list of commands to run to set up nodes. These commands will always run on the head and worker nodes, and will be merged with the head setup commands for the head node and with the worker setup commands for worker nodes.

  • Required:

  • Importance: Medium

  • Type: List of strings

  • Default:

# Default setup_commands:
setup_commands:
  - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
  - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
  • Setup commands should ideally be idempotent (i.e., they can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. Commands can usually be made idempotent with small modifications, e.g. git clone foo can be rewritten as test -e foo || git clone foo, which first checks whether the repository has already been cloned.

  • Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run conda activate env && pip install -U ray, because splitting the command into two setup commands will not work.

  • Ideally, you should avoid using setup_commands by creating a Docker image with all the dependencies preinstalled, to minimize startup time.

  • Tip: if you also want to run apt-get commands during setup, add the following list of commands:

    setup_commands:
      - sudo pkill -9 apt-get || true
      - sudo pkill -9 dpkg || true
      - sudo dpkg --configure -a
    

head_setup_commands#

A list of commands to set up the head node. These commands will be merged with the general setup commands.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []

worker_setup_commands#

A list of commands to set up worker nodes. These commands will be merged with the general setup commands.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []

head_start_ray_commands#

Commands to start Ray on the head node. You don't need to change this.

  • Required:

  • Importance:

  • Type: List of strings

  • Default:

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands#

Commands to start Ray on worker nodes. You don't need to change this.

  • Required:

  • Importance:

  • Type: List of strings

  • Default:

worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

docker.image#

The default Docker image to pull for the head and worker nodes. This can be overridden by the head_image and worker_image fields. If neither image nor (head_image and worker_image) are specified, Ray will not use Docker.

  • Required: Yes (if Docker is in use.)

  • Importance:

  • Type: String

The Ray project provides Docker images on DockerHub. The repository includes the following images:

  • rayproject/ray-ml:latest-gpu: CUDA support, includes ML dependencies.

  • rayproject/ray:latest-gpu: CUDA support, no ML dependencies.

  • rayproject/ray-ml:latest: No CUDA support, includes ML dependencies.

  • rayproject/ray:latest: No CUDA support, no ML dependencies.
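
A minimal docker section that uses one of the images listed above (mirroring the full configuration examples later on this page):

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True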

docker.head_image#

Docker image for the head node to override the default docker image.

  • Required:

  • Importance:

  • Type: String

docker.worker_image#

Docker image for worker nodes to override the default docker image.

  • Required:

  • Importance:

  • Type: String
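
For example, a GPU image on the head node can be combined with a CPU image on the workers; this sketch mirrors the commented-out example in the full configurations below:

docker:
    head_image: "rayproject/ray-ml:latest-gpu"
    worker_image: "rayproject/ray-ml:latest-cpu"
    container_name: "ray_container"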

docker.container_name#

The name to use when starting the Docker container.

  • Required: Yes (if Docker is in use.)

  • Importance:

  • Type: String

  • Default: ray_container

docker.pull_before_run#

If enabled, the latest version of the image will be pulled when starting Docker. If disabled, docker run will only pull the image if no cached version is present.

  • Required:

  • Importance: Medium

  • Type: Boolean

  • Default: True

docker.run_options#

The extra options to pass to docker run.

  • Required:

  • Importance: Medium

  • Type: List of strings

  • Default: []

docker.head_run_options#

The extra options to pass to docker run for the head node only.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []

docker.worker_run_options#

The extra options to pass to docker run for worker nodes only.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []
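
For example, extra docker run flags can be set globally and per node role. The --ulimit flag below appears in the full configuration examples on this page; the --cpus values are only illustrative:

docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"
    run_options:
        - --ulimit nofile=65536:65536
    head_run_options:
        - --cpus=4
    worker_run_options:
        - --cpus=2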

docker.disable_automatic_runtime_detection#

If enabled, Ray will not try to use the NVIDIA container runtime when GPUs are present.

  • Required:

  • Importance:

  • Type: Boolean

  • Default: False

docker.disable_shm_size_detection#

If enabled, Ray will not automatically specify the size of /dev/shm for the started container, and the runtime's default value (64MiB for Docker) will be used. If --shm-size=<> is manually added to run_options, this is automatically set to True, meaning that Ray will defer to the user-provided value.

  • Required:

  • Importance:

  • Type: Boolean

  • Default: False
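
For example, manually passing --shm-size in run_options implicitly enables this option and Ray uses the value as provided; a sketch:

docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"
    run_options:
        - --shm-size=8g  # Ray defers to this value instead of auto-sizing /dev/shm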

auth.ssh_user#

The user that Ray will authenticate as when launching new nodes.

  • Required:

  • Importance:

  • Type: String

auth.ssh_private_key#

The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration.

  • Required:

  • Importance:

  • Type: String

The path to an existing private key for Ray to use.

  • Required:

  • Importance:

  • Type: String

You can use ssh-keygen -t rsa -b 4096 to generate a new SSH key pair.

The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration.

  • Required:

  • Importance:

  • Type: String

Not available. The vSphere provider expects the key to be located at the fixed path ~/ray-bootstrap-key.pem.
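
Where a custom key is supported (see the provider-specific notes above), the auth section typically looks like the following sketch; the user name and key path are placeholders:

auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/your/key.pem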

auth.ssh_public_key#

Not available.

The path to an existing public key for Ray to use.

  • Required:

  • Importance:

  • Type: String

Not available.

Not available.

provider.type#

The cloud provider. For AWS, this must be set to aws.

  • Required:

  • Importance:

  • Type: String

The cloud provider. For Azure, this must be set to azure.

  • Required:

  • Importance:

  • Type: String

The cloud provider. For GCP, this must be set to gcp.

  • Required:

  • Importance:

  • Type: String

The cloud provider. For vSphere and VCF, this must be set to vsphere.

  • Required:

  • Importance:

  • Type: String

provider.region#

The region to use for deployment of the Ray cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: us-west-2

Not available.

The region to use for deployment of the Ray cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: us-west1

Not available.

provider.availability_zone#

A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. Nodes will be launched in the first listed availability zone; if launching fails, the subsequent availability zones will be tried.

  • Required:

  • Importance:

  • Type: String

  • Default: us-west-2a,us-west-2b

Not available.

A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.

  • Required:

  • Importance:

  • Type: String

  • Default: us-west1-a

Not available.
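
Putting the region and availability zones together, a provider sketch for AWS (the values mirror the defaults listed above):

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b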

provider.location#

Not available.

The location to use for deployment of the Ray cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: westus2

Not available.

Not available.

provider.resource_group#

Not available.

The resource group to use for deployment of the Ray cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: ray-cluster

Not available.

Not available.

provider.subscription_id#

Not available.

The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI.

  • Required:

  • Importance:

  • Type: String

  • Default: ""

Not available.

Not available.

provider.msi_name#

Not available.

The name of the managed identity to use for deployment of the Ray cluster. If not specified, Ray will create a default user-assigned managed identity.

  • Required:

  • Importance:

  • Type: String

  • Default: ray-default-msi

Not available.

Not available.

provider.msi_resource_group#

Not available.

The name of the resource group of the managed identity to use for deployment of the Ray cluster, used together with msi_name. If not specified, Ray will create a default user-assigned managed identity in the resource group specified in the provider config.

  • Required:

  • Importance:

  • Type: String

  • Default: ray-cluster

Not available.

Not available.

provider.project_id#

Not available.

Not available.

The globally unique project ID to use for deployment of the Ray cluster.

  • Required:

  • Importance:

  • Type: String

  • Default: null

Not available.

provider.cache_stopped_nodes#

If enabled, nodes will be stopped when the cluster scales down. If disabled, nodes will be terminated instead. Stopped nodes launch faster than terminated nodes.

  • Required:

  • Importance:

  • Type: Boolean

  • Default: True

provider.use_internal_ips#

If enabled, Ray will use private IP addresses for communication between nodes. This should be omitted if your network interfaces use public IP addresses.

If enabled, Ray CLI commands (e.g. ray up) must be run from a machine that is in the same VPC as the cluster.

This option does not affect whether the nodes have public IP addresses; it only affects which IP addresses Ray uses. The presence of public IP addresses is controlled by your cloud provider's configuration.

  • Required:

  • Importance:

  • Type: Boolean

  • Default: False

provider.use_external_head_ip#

Not available.

If enabled, Ray will provision and use a public IP address for communication with the head node, regardless of the value of use_internal_ips. This option can be combined with use_internal_ips to avoid provisioning excess public IPs for worker nodes (i.e., communicate among nodes using private IPs, but provision a public IP only for head node communication). If use_internal_ips is False, this option has no effect.

  • Required:

  • Importance:

  • Type: Boolean

  • Default: False

Not available.

Not available.
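
As a sketch, the two options can be combined so that nodes communicate over private IPs while only the head node is provisioned with a public IP (the location and resource group values are only illustrative):

provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
    use_internal_ips: True
    use_external_head_ip: True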

provider.security_group#

A security group that can be used to specify custom inbound rules.

  • Required:

  • Importance: Medium

  • Type: Security Group

Not available.

Not available.

Not available.

provider.vsphere_config#

Not available.

Not available.

Not available.

The vSphere configurations used to connect to the vCenter Server. If not configured, the VSPHERE_* environment variables will be used.

security_group.GroupName#

The name of the security group. This name must be unique within the VPC.

  • Required:

  • Importance:

  • Type: String

  • Default: "ray-autoscaler-{cluster-name}"

security_group.IpPermissions#

The inbound rules associated with the security group.
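
A hedged sketch of a custom security group that additionally opens one TCP port; the rule follows the EC2 IpPermissions structure, and the port and CIDR range shown are only examples:

provider:
    type: aws
    region: us-west-2
    security_group:
        GroupName: ray-custom-security-group
        IpPermissions:
            - FromPort: 8265
              ToPort: 8265
              IpProtocol: TCP
              IpRanges:
                  - CidrIp: 0.0.0.0/0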

vsphere_config.credentials#

The credentials to connect to the vSphere vCenter Server.

vsphere_config.credentials.user#

The username to use to connect to the vCenter Server.

  • Required:

  • Importance:

  • Type: String

vsphere_config.credentials.password#

The password of the user to connect to the vCenter Server.

  • Required:

  • Importance:

  • Type: String

vsphere_config.credentials.server#

The address of the vSphere vCenter Server.

  • Required:

  • Importance:

  • Type: String

vsphere_config.frozen_vm#

The frozen VM related configurations.

If the frozen VM(s) already exist, library_item should be unset. Either an existing frozen VM can be specified by name, or a resource pool of frozen VMs on every ESXi (https://docs.vmware.com/en/VMware-vSphere/index.html) host can be specified by resource_pool.

If the frozen VM(s) are to be deployed from an OVF template, library_item must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In this case, name must be set to indicate the name or the name prefix of the frozen VM(s). Then either resource_pool should be set to indicate that a set of frozen VMs will be created on each ESXi host of the resource pool, or cluster should be set to indicate that a single frozen VM will be created on the vSphere cluster. Configuring datastore (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case.

Valid examples:

  1. ray up on a frozen VM to be deployed from an OVF template:

    frozen_vm:
        name: single-frozen-vm
        library_item: frozen-vm-template
        cluster: vsanCluster
        datastore: vsanDatastore
    
  2. ray up on an existing frozen VM:

    frozen_vm:
        name: existing-single-frozen-vm
    
  3. ray up on a resource pool of frozen VMs to be deployed from an OVF template:

    frozen_vm:
        name: frozen-vm-prefix
        library_item: frozen-vm-template
        resource_pool: frozen-vm-resource-pool
        datastore: vsanDatastore
    
  4. ray up on an existing resource pool of frozen VMs:

    frozen_vm:
        resource_pool: frozen-vm-resource-pool
    

Other cases not covered by the above examples are invalid.

vsphere_config.frozen_vm.name#

The name or the name prefix of the frozen VM(s).

Can only be unset when resource_pool is set and points to an existing resource pool of frozen VMs.

  • Required:

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.library_item#

The library item (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-D3DD122F-16A5-4F36-8467-97994A854B16.html#GUID-D3DD122F-16A5-4F36-8467-97994A854B16) of the OVF template of the frozen VM. If set, the frozen VM or a set of frozen VMs will be deployed from the OVF template specified by library_item. Otherwise, the frozen VM(s) should already exist.

Visit the Ray project's VM Packer (vmware-ai-labs/vm-packer-for-ray) to learn how to create an OVF template for frozen VMs.

  • Required:

  • Importance:

  • Type: String

vsphere_config.frozen_vm.resource_pool#

The resource pool name of the frozen VMs; it can point to an existing resource pool of frozen VMs. Otherwise, library_item must be specified and a set of frozen VMs will be deployed on each ESXi host.

The frozen VMs will be named "{frozen_vm.name}-{the vm's ip address}".

  • Required:

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.cluster#

The vSphere cluster name; it only takes effect when library_item is set and resource_pool is unset. Indicates that a single frozen VM will be deployed on the vSphere cluster from the OVF template.

  • Required:

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.datastore#

The target vSphere datastore name for storing the VMDK of the frozen VM deployed from the OVF template. It only takes effect when library_item is set. If resource_pool is also set, this datastore must be a datastore shared among the ESXi hosts.

  • Required:

  • Importance:

  • Type: String

vsphere_config.gpu_config#

vsphere_config.gpu_config.dynamic_pci_passthrough#

The switch controlling how GPUs on the ESXi host are bound to the Ray node VM. The default value is False, which means regular PCI passthrough. If set to True, dynamic PCI passthrough (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-esxi-host-client/GUID-2B6D43A6-9598-47C4-A2E7-5924E3367BB6.html) will be enabled for the GPU. A VM with dynamic PCI passthrough enabled still supports vSphere DRS (https://www.vmware.com/products/vsphere/drs-dpm.html).

  • Required:

  • Importance:

  • Type: Boolean

available_node_types.<node_type_name>.node_type.node_config#

The configuration used to launch nodes on the cloud service provider. Among other things, this will specify the instance type to launch.

available_node_types.<node_type_name>.node_type.resources#

The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ray start command of the node via an environment variable. If not provided, the Autoscaler can automatically detect them only for the AWS/Kubernetes cloud providers. For more information, see also the resource demand scheduler.

  • Required: Yes (except for AWS/K8s)

  • Importance:

  • Type: Resources

  • Default: {}

In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the ray-cluster-name tag should be set and the ray-node-type tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to {} and the maximum workers to 0. The Autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
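
An unmanaged node type can therefore be sketched as follows (the key name is illustrative; the node itself is added and cleaned up manually with the tags described above):

available_node_types:
    ray.unmanaged.driver:
        resources: {}
        max_workers: 0
        node_config: {}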

available_node_types.<node_type_name>.node_type.min_workers#

The minimum number of workers to maintain for this node type, regardless of utilization.

  • Required:

  • Importance:

  • Type: Integer

  • Default: 0

  • Minimum: 0

  • Maximum: Unbounded

available_node_types.<node_type_name>.node_type.max_workers#

The maximum number of workers allowed in the cluster for this node type, regardless of utilization. This takes precedence over the minimum workers. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide max_workers. (Prior to Ray 1.3.0, the default value for this field was 0.)

Note that for the node type with key head_node_type, the default maximum number of workers is 0.

available_node_types.<node_type_name>.node_type.worker_setup_commands#

A list of commands to set up worker nodes of this type. These commands will replace the general worker setup commands for the node.

  • Required:

  • Importance:

  • Type: List of strings

  • Default: []
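
For example, a dedicated worker type can replace the general worker setup commands with its own list; a sketch where the node type key, instance type, and command are only illustrative:

available_node_types:
    ray.worker.gpu:
        min_workers: 0
        max_workers: 2
        resources: {"CPU": 4, "GPU": 1}
        node_config:
            InstanceType: p2.xlarge
        worker_setup_commands:
            - echo "commands that run only on workers of this type"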

available_node_types.<node_type_name>.node_type.resources.CPU#

The number of CPUs made available by this node. If not configured, the Autoscaler can automatically detect them only for the AWS/Kubernetes cloud providers.

  • Required: Yes (except for AWS/K8s)

  • Importance:

  • Type: Integer

The number of CPUs made available by this node.

  • Required:

  • Importance:

  • Type: Integer

The number of CPUs made available by this node.

  • Required:

  • Importance:

  • Type: Integer

The number of CPUs made available by this node. If not configured, the node will use the same settings as the frozen VM.

  • Required:

  • Importance:

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.GPU#

The number of GPUs made available by this node. If not configured, the Autoscaler can automatically detect them only for the AWS/Kubernetes cloud providers.

  • Required:

  • Importance:

  • Type: Integer

The number of GPUs made available by this node.

  • Required:

  • Importance:

  • Type: Integer

The number of GPUs made available by this node.

  • Required:

  • Importance:

  • Type: Integer

The number of GPUs made available by this node.

  • Required:

  • Importance:

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.memory#

The memory in bytes allocated for Python worker heap memory on the node. If not configured, the Autoscaler will automatically detect the amount of RAM of the AWS/Kubernetes node and allocate 70% of it to the heap memory.

  • Required:

  • Importance:

  • Type: Integer

The memory in bytes allocated for Python worker heap memory on the node.

  • Required:

  • Importance:

  • Type: Integer

The memory in bytes allocated for Python worker heap memory on the node.

  • Required:

  • Importance:

  • Type: Integer

The memory in megabytes allocated for Python worker heap memory on the node. If not configured, the node will use the same memory settings as the frozen VM.

  • Required:

  • Importance:

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.object-store-memory#

The memory in bytes allocated for the object store on the node. If not configured, the Autoscaler will automatically detect the amount of RAM of the AWS/Kubernetes node and allocate 30% of it to the object store.

  • Required:

  • Importance:

  • Type: Integer

The memory in bytes allocated for the object store on the node.

  • Required:

  • Importance:

  • Type: Integer

The memory in bytes allocated for the object store on the node.

  • Required:

  • Importance:

  • Type: Integer

The memory in bytes allocated for the object store on the node.

  • Required:

  • Importance:

  • Type: Integer
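
Putting the resource fields together, a sketch of overriding the resources advertised for a node type; the numbers are placeholders and the memory values are in bytes:

available_node_types:
    ray.worker.default:
        resources:
            CPU: 2
            GPU: 0
            memory: 5368709120              # 5 GiB for Python worker heap memory
            object_store_memory: 1073741824 # 1 GiB for the object store
        node_config:
            InstanceType: m5.large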

available_node_types.<node_type_name>.docker#

A set of overrides to the top-level Docker configuration.

  • Required:

  • Importance:

  • Type: docker

  • Default: {}

Examples#

Minimal configuration#

# An unique identifier for the head node and workers of this cluster.
cluster_name: aws-example-minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 3

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 3
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 3
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
# An unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of workers nodes to launch in addition to the head
# node. min_workers default to 0.
max_workers: 1

# Cloud-provider specific configuration.
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub
auth:
  ssh_user: ubuntu
cluster_name: minimal
provider:
  availability_zone: us-west1-a
  project_id: null # TODO: set your GCP project ID here
  region: us-west1
  type: gcp
# An unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: vsphere

Full configuration#

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-cpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes will be launched in the first listed availability zone and will
    # be tried in the subsequent availability zones if launching fails.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            ImageId: ami-0387d929287ab193e
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            ImageId: ami-0387d929287ab193e
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westus2
    resource_group: ray-cluster
    # set subscription id otherwise the default from az cli will be used
    # subscription_id: 00000000-0000-0000-0000-000000000000
    # set unique subnet mask or a random mask will be used
    # subnet_mask: 10.0.0.0/16
    # set unique id for resources in this cluster
    # if not set a default id will be generated based on the resource group and cluster name
    # unique_id: RAY1
    # set managed identity name and resource group
    # if not set, a default user-assigned identity will be generated in the resource group specified above
    # msi_name: ray-cluster-msi
    # msi_resource_group: other-rg
    # Set provisioning and use of public/private IPs for head and worker nodes. If both options below are true,
    # only the head node will have a public IP address provisioned.
    # use_internal_ips: True
    # use_external_head_ip: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: []
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
  image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
  container_name: "ray_container"
  # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
  # if no cached version is present.
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

  # Example of running a GPU head with CPU workers
  # head_image: "rayproject/ray-ml:latest-gpu"
  # Allow Ray to automatically detect GPUs

  # worker_image: "rayproject/ray-ml:latest-cpu"
  # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: null # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used.  Otherwise an existing subnet is
            # used.  To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
            # Un-Comment this to launch workers with the Service Account of the Head Node
            # serviceAccounts:
            # - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
            #   scopes:
            #   - https://www.googleapis.com/auth/cloud-platform

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"


# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest"
    # image: rayproject/ray:latest   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

# Credentials configured here will take precedence over credentials set in the
# environment variables.
    vsphere_config:
#       credentials:
#           user: vc_username
#           password: vc_password
#           server: vc_address
        # The frozen VM related configurations. If "library_item" is unset, then either an existing frozen VM should be
        # specified by "name", or a resource pool name of Frozen VMs on every ESXi host should be specified by
        # "resource_pool". If "library_item" is set, then "name" must be set to indicate the name or the name prefix of
        # the frozen VM, and "resource_pool" can be set to indicate that a set of frozen VMs should be created on each
        # ESXi host.
        frozen_vm:
            # The name of the frozen VM, or the prefix for a set of frozen VMs. Can only be unset when
            # "frozen_vm.resource_pool" is set and pointing to an existing resource pool of Frozen VMs.
            name: frozen-vm
            # The library item of the OVF template of the frozen VM. If set, the frozen VM or a set of frozen VMs will
            # be deployed from an OVF template specified by library item.
            library_item:
            # The resource pool name of the frozen VMs, can point to an existing resource pool of frozen VMs.
            # Otherwise, "frozen_vm.library_item" must be specified and a set of frozen VMs will be deployed
            # on each ESXi host. The frozen VMs will be named as "{frozen_vm.name}-{the vm's ip address}"
            resource_pool:
            # The vSphere cluster name, only makes sense when "frozen_vm.library_item" is set and
            # "frozen_vm.resource_pool" is unset. Indicates to deploy a single frozen VM on the vSphere cluster
            # from OVF template.
            cluster:
            # The target vSphere datastore name for storing the vmdk of the frozen VM to be deployed from OVF template.
            # Will take effect only when "frozen_vm.library_item" is set. If "frozen_vm.resource_pool" is also set,
            # this datastore must be a shared datastore among the ESXi hosts.
            datastore:
        # The GPU related configurations
        gpu_config:
            # If using dynamic PCI passthrough to bind the physical GPU on an ESXi host to a Ray node VM.
            # Dynamic PCI passthrough can support vSphere DRS, otherwise using regular PCI passthrough will not support
            # vSphere DRS.
            dynamic_pci_passthrough: False

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and Memory resources are by default the same as the frozen VM.
        # You can override the resources here. Adding GPU to the head node is not recommended.
        # resources: { "CPU": 2, "Memory": 4096}
        resources: {}
        node_config:
            # The resource pool where the head node should live, if unset, will be
            # the frozen VM's resource pool.
            resource_pool:
            # The datastore to store the vmdk of the head node vm, if unset, will be
            # the frozen VM's datastore.
            datastore:
    worker:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The node type's CPU and Memory resources are by default the same as the frozen VM.
        # You can override the resources here. For GPU, currently only Nvidia GPU is supported. If no ESXi host can
        # fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
        # minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
        # mount the first k random available Nvidia GPU to the VM, if the user set {"GPU": k}.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}
        node_config:
            # The resource pool where the worker node should live, if unset, will be
            # the frozen VM's resource pool.
            resource_pool:
            # The datastore to store the vmdk(s) of the worker node vm(s), if unset, will be
            # the frozen VM's datastore.
            datastore:

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

TPU Configuration#

It is possible to use TPU VMs on GCP. Currently, TPU pods (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#pods), i.e. TPUs other than v2-8, v3-8, and v4-8, are not supported.

Before using a config with TPUs, ensure that the TPU API is enabled for your GCP project.

# A unique identifier for the head node and workers of this cluster.
cluster_name: tputest

# The maximum number of worker nodes to launch in addition to the head node.
max_workers: 7

available_node_types:
    ray_head_default:
        resources: {"TPU": 1}  # use TPU custom resource in your code
        node_config:
            # Only v2-8, v3-8 and v4-8 accelerator types are currently supported.
            # Support for TPU pods will be added in the future.
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                # Set to false to use non-preemptible TPUs
                preemptible: false
    ray_tpu:
        min_workers: 1
        resources: {"TPU": 1}  # use TPU custom resource in your code
        node_config:
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                preemptible: true

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-b
    project_id: null # Replace this with your GCP project ID.

setup_commands:
  - sudo apt install python-is-python3 -y
  - pip3 install --upgrade pip
  - pip3 install -U "ray[default]"

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default