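Deploying on YARN#

Warning

Running Ray on YARN is still a work in progress. If you have a suggestion for improving this documentation or want to request a missing feature, feel free to create a pull request or get in touch through one of the channels listed in the Questions or Issues? section below.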

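This document assumes that you have access to a YARN cluster. It walks you through using Skein to deploy a YARN job that starts a Ray cluster and runs an example script on it.

Skein uses a declarative specification (written either as a yaml file or with the Python API) and lets users launch jobs and scale applications without writing Java code.

You will first need to install Skein: pip install skein.

The Skein yaml file and example Ray program used here are provided in the Ray repository to get you started. Refer to the provided yaml file to make sure that you keep the configuration options Ray needs to function properly.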

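Skein Configuration#

A Ray job is configured to run as two Skein services:

The ray-head service starts the Ray head node and then runs the application.
The ray-worker service starts worker nodes that join the Ray cluster. You can change the number of instances in this configuration, or at runtime with skein container scale, to scale the cluster up or down.

The specification for each service consists of the files and commands needed to start the service.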

services:
    ray-head:
        # There should only be one instance of the head node per cluster.
        instances: 1
        resources:
            # The resources for the head node.
            vcores: 1
            memory: 2048
        files:
            ...
        script:
            ...
    ray-worker:
        # Number of ray worker nodes to start initially.
        # This can be scaled using 'skein container scale'.
        instances: 3
        resources:
            # The resources for the worker node.
            vcores: 1
            memory: 2048
        files:
            ...
        script:
            ...
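
As a rough aid for sizing, the total resources this specification requests from YARN can be estimated from the instance counts and per-container resources above. The sketch below is illustrative only: total_request is a hypothetical helper, it assumes the memory values are in MiB, and it ignores the container used by the Skein application master.

```python
# Hypothetical helper: estimate what the spec above asks YARN for.
# Assumes `memory` is in MiB and ignores the application master container.


def total_request(services):
    """Sum vcores and memory over all instances of all services."""
    vcores = sum(s["instances"] * s["vcores"] for s in services.values())
    memory = sum(s["instances"] * s["memory"] for s in services.values())
    return vcores, memory


services = {
    "ray-head": {"instances": 1, "vcores": 1, "memory": 2048},
    "ray-worker": {"instances": 3, "vcores": 1, "memory": 2048},
}

vcores, memory_mib = total_request(services)
print(f"{vcores} vcores, {memory_mib} MiB")  # 4 vcores, 8192 MiB
```

Changing the ray-worker instance count in the dictionary shows how scaling the service changes the request.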

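Packaging Dependencies#

Use the files option to specify files that will be copied into the YARN container for the application to use. See the Skein file distribution page for more information.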

services:
    ray-head:
        # There should only be one instance of the head node per cluster.
        instances: 1
        resources:
            # The resources for the head node.
            vcores: 1
            memory: 2048
        files:
            # ray/doc/source/cluster/doc_code/yarn/example.py
            example.py: example.py
        #     # A packaged python environment using `conda-pack`. Note that Skein
        #     # doesn't require any specific way of distributing files, but this
        #     # is a good one for python projects. This is optional.
        #     # See https://jcrist.github.io/skein/distributing-files.html
        #     environment: environment.tar.gz

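Ray Setup in YARN#

Below is a walkthrough of the bash commands used to start the ray-head and ray-worker services. Note that this configuration launches a new Ray cluster for each application rather than reusing the same cluster.

Head node commands#

Start by activating a pre-existing environment for dependency management.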

source environment/bin/activate

Register the Ray head address, which the workers need, in the Skein key-value store.

skein kv put --key=RAY_HEAD_ADDRESS --value=$(hostname -i) current

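Start all the processes needed on the ray head node. By default, we set object store memory and heap memory to roughly 200 MB each. This is conservative and should be set according to application needs.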

ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1
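
One way to keep these flags consistent with the container size is to derive them from a memory budget. The following is a minimal sketch and not part of Ray or Skein: head_start_args is a hypothetical helper, it uses only the flags shown above, and it assumes the container's memory setting is in MiB while the ray start values are in bytes.

```python
MIB = 1024 * 1024


def head_start_args(object_store_bytes, heap_bytes, num_cpus, container_mib, port=6379):
    """Build the `ray start --head` argument list.

    Raises ValueError if the requested memory exceeds the container limit.
    """
    requested = object_store_bytes + heap_bytes
    limit = container_mib * MIB
    if requested > limit:
        raise ValueError(
            f"requested {requested} bytes exceeds container limit of {limit} bytes"
        )
    return [
        "ray", "start", "--head",
        f"--port={port}",
        f"--object-store-memory={object_store_bytes}",
        f"--memory={heap_bytes}",
        f"--num-cpus={num_cpus}",
    ]


# Values from the command above, checked against a 2048 MiB container.
args = head_start_args(200_000_000, 200_000_000, num_cpus=1, container_mib=2048)
print(" ".join(args))
```

With the defaults above, the two settings add up to about 400 MB, well under the 2048 requested for the container, which leaves room to raise them for larger workloads.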

Execute the user script containing the Ray program.

python example.py

Clean up all started processes even if the application fails or is killed.

ray stop
skein application shutdown current
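
The head script relies on the shell continuing to the cleanup commands after the user script exits, whether or not it succeeded. The same "always clean up" pattern can be sketched in Python with try/finally. This is an illustration under assumptions, not how Skein runs the script: run_then_cleanup is a hypothetical helper, and the demo substitutes harmless Python one-liners for the real commands so that it runs without a cluster.

```python
import subprocess
import sys


def run_then_cleanup(main_cmd, cleanup_cmds):
    """Run main_cmd, then run every cleanup command even if main_cmd fails.

    Returns the exit code of main_cmd.
    """
    try:
        return subprocess.run(main_cmd).returncode
    finally:
        for cmd in cleanup_cmds:
            # check=False: a failing cleanup step should not skip the rest.
            subprocess.run(cmd, check=False)


# Stand-ins for `python example.py`, `ray stop`, and
# `skein application shutdown current`.
failing_main = [sys.executable, "-c", "import sys; sys.exit(3)"]
cleanup = [
    [sys.executable, "-c", "print('ray stop')"],
    [sys.executable, "-c", "print('skein application shutdown current')"],
]

exit_code = run_then_cleanup(failing_main, cleanup)
print("user script exit code:", exit_code)
```

In a real head script, the stand-ins would be replaced by the commands shown above.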

Putting things together, we have:

    ray-head:
        # There should only be one instance of the head node per cluster.
        instances: 1
        resources:
            # The resources for the head node.
            vcores: 1
            memory: 2048
        files:
            # ray/doc/source/cluster/doc_code/yarn/example.py
            example.py: example.py
        #     # A packaged python environment using `conda-pack`. Note that Skein
        #     # doesn't require any specific way of distributing files, but this
        #     # is a good one for python projects. This is optional.
        #     # See https://jcrist.github.io/skein/distributing-files.html
        #     environment: environment.tar.gz
        script: |
            # Activate the packaged conda environment
            #  - source environment/bin/activate

            # This stores the Ray head address in the Skein key-value store so that the workers can retrieve it later.
            skein kv put current --key=RAY_HEAD_ADDRESS --value=$(hostname -i)

            # This command starts all the processes needed on the ray head node.
            # By default, we set object store memory and heap memory to roughly 200 MB. This is conservative
            # and should be set according to application needs.
            #
            ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1

            # This executes the user script.
            python example.py

            # After the user script has executed, all started processes should also die.
            ray stop
            skein application shutdown current

Worker node commands#

Fetch the address of the head node from the Skein key-value store.

RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS)

Start all the processes needed on a ray worker node, blocking until killed by Skein/YARN via SIGTERM. After SIGTERM is received, all started processes should also die (ray stop).

ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop

Putting things together, we have:

    ray-worker:
        # The number of instances to start initially. This can be scaled
        # dynamically later.
        instances: 4
        resources:
            # The resources for the worker node
            vcores: 1
            memory: 2048
        # files:
        #     environment: environment.tar.gz
        depends:
            # Don't start any worker nodes until the head node is started
            - ray-head
        script: |
            # Activate the packaged conda environment
            #  - source environment/bin/activate

            # This command gets any addresses it needs (e.g. the head node) from
            # the skein key-value store.
            RAY_HEAD_ADDRESS=$(skein kv get --key=RAY_HEAD_ADDRESS current)

            # The below command starts all the processes needed on a ray worker node, blocking until killed with sigterm.
            # After sigterm, all started processes should also die (ray stop).
            ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop

Running a Job#

Within your Ray script, use the following to connect to the started Ray cluster:

    import ray

    # Connect to the Ray head process started on this node.
    ray.init(address="localhost:6379")
    main()

You can use the following command to launch the application as specified by the Skein YAML file.

skein application submit [TEST.YAML]

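Once it has been submitted, you can see the job running on the YARN dashboard.

Cleaning Up#

To clean up a running job, use the following command (with the application ID):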

skein application shutdown $appid

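Questions or Issues?#

You can post questions, issues, or feedback through the following channels:

Discussion Board: For questions about Ray usage or feature requests.
GitHub Issues: For bug reports.
Ray Slack: For getting in touch with Ray maintainers.
StackOverflow: Use the [ray] tag for questions about Ray.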