RLlib's new API stack#

Overview#

Starting with Ray 2.10, you can opt into the alpha version of the "new API stack", a fundamental overhaul of the architecture, design principles, code base, and user-facing APIs. The following table shows which algorithms and setups are available so far.

| Feature/Algorithm (on the new API stack) | PPO | SAC |
|---|---|---|
| Single-agent | Yes | Yes |
| Multi-agent | Yes | No |
| Fully-connected (MLP) | Yes | Yes |
| Image inputs (CNN) | Yes | No |
| RNN support (LSTM) | Yes | No |
| Complex inputs (flattened) | Yes | Yes |

Over the next few months, the Ray team will continue to test, benchmark, bug-fix, and further polish these new APIs, and will roll out more and more algorithms that you can run in either stack. The goal is to reach a state where the new stack can completely replace the old one.

Keep in mind that, because of its alpha stage, you may run into issues and instabilities when using the new stack. Also, rest assured that you can continue using your custom classes and setups on the old API stack for the foreseeable future (beyond Ray 3.0).

What is the new API stack?#

The new API stack is the result of rewriting RLlib's core APIs from scratch and reducing the user-facing classes from more than a dozen critical ones to only a handful. While redesigning these new interfaces from the ground up, the Ray team strictly followed these principles:

  • Suppose a simple mental model underlying the new APIs

  • Classes must be usable outside of RLlib

  • Separate concerns as much as possible. Try to answer: "**what** should be done **when** and by **whom**?"

  • Offer fine-grained modularity, full interoperability, and frictionless pluggability of classes

Applying the above principles, the Ray team reduced the important must-know classes for the average RLlib user from eight on the old stack to only five on the new stack: AlgorithmConfig, RLModule (replaces ModelV2), Learner (replaces Policy), ConnectorV2 (replaces Connector), and Episode.

The AlgorithmConfig and Algorithm APIs remain unchanged. These are already established APIs from the old stack.
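For example, in line with the "classes must be usable outside of RLlib" principle, you can build an RLModule and query it without any Algorithm or EnvRunner around it. The following is only a minimal sketch, assuming the Ray 2.10-era SingleAgentRLModuleSpec, PPOTorchRLModule, and PPOCatalog classes; import paths and argument names may differ in other Ray versions.

import gymnasium as gym
import torch

# Assumed Ray 2.10-era import paths; check your installed version.
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.algorithms.ppo.torch.ppo_torch_rl_module import PPOTorchRLModule
from ray.rllib.core.rl_module.rl_module import SingleAgentRLModuleSpec

env = gym.make("CartPole-v1")

# Describe and build the module standalone, with no Algorithm or EnvRunner involved.
spec = SingleAgentRLModuleSpec(
    module_class=PPOTorchRLModule,
    observation_space=env.observation_space,
    action_space=env.action_space,
    model_config_dict={"fcnet_hiddens": [64, 64]},
    catalog_class=PPOCatalog,
)
rl_module = spec.build()

# Run a forward pass on a single observation. The output dict contains the
# action-distribution inputs (logits for CartPole's discrete action space).
obs, _ = env.reset()
fwd_out = rl_module.forward_inference({"obs": torch.from_numpy(obs).unsqueeze(0)})
print(fwd_out["action_dist_inputs"])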

Who should use the new API stack?#

Eventually, all RLlib users should switch to the new API stack for running their experiments and developing their custom classes.

For now, it's only available for a few algorithms and setups (see the table above). However, if you're using PPO (single- or multi-agent) or SAC (single-agent), you should give it a try.

The following section lists some compelling reasons to migrate to the new stack.

Note these indicators against using it in its early stages:

1) You're using a custom ModelV2 class and aren't interested right now in moving it into the new RLModule API.
2) You're using a custom Policy class (for example, with a custom loss function) and aren't interested right now in moving it into the new Learner API.
3) You're using custom Connector classes and aren't interested right now in moving them into the new ConnectorV2 API.

If any of the above applies to you, don't migrate for now and keep running on the old API stack. Migrate to the new stack once you're ready to rewrite those small parts of your code.

Comparison with the old API stack#

This table compares features and design choices between the new and old API stack:

|   | New API stack | Old API stack |
|---|---|---|
| Reduced code complexity (for beginners and advanced users) | 5 user-facing classes (AlgorithmConfig, RLModule, Learner, ConnectorV2, Episode) | 8 user-facing classes (AlgorithmConfig, ModelV2, Policy, build_policy, Connector, RolloutWorker, BaseEnv, ViewRequirement) |
| Classes usable outside of RLlib | Yes | Partly |
| Separation-of-concerns design (for example, during sampling, only compute actions) | Yes | No |
| Distributed/scalable sample collection | Yes | Yes |
| Full 360° read/write access to (multi-)agent trajectories | Yes | No |
| Multi-GPU and multi-node/multi-GPU | Yes | Yes & No |
| Support for shared (multi-agent) model components (for example, communication channels, shared value functions, etc.) | Yes | No |
| Env vectorization via gym.vector.Env | Yes | No (RLlib's own solution) |
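To make the last row concrete: the new stack relies on gymnasium's built-in vectorization instead of an RLlib-specific solution. Here is a small, RLlib-independent gymnasium sketch of what gym.vector-style vectorization looks like:

import gymnasium as gym
import numpy as np

# Run four CartPole copies inside one vectorized env; observations, rewards, and
# termination flags all come back batched along the first axis.
vec_env = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)]
)
obs, infos = vec_env.reset(seed=42)
actions = np.array([vec_env.single_action_space.sample() for _ in range(4)])
obs, rewards, terminations, truncations, infos = vec_env.step(actions)
print(obs.shape)  # (4, 4) -> 4 envs x 4 observation dimensions
vec_env.close()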

How do I use the new API stack?#

The new API stack is disabled by default for all algorithms. To activate it for PPO (single- and multi-agent) or SAC (single-agent only), change the following settings in your AlgorithmConfig object:


from ray.rllib.algorithms.ppo import PPOConfig


config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, you should
    # set this flag in your model config. Having to set this will no longer be required
    # in the near future. It does yield a small performance advantage as value function
    # predictions for PPO are no longer required to happen on the sampler side (but are
    # now fully located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
)
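After configuring, you build and train the algorithm the same way as on the old stack. A minimal sketch (the exact keys inside the returned result dict depend on your Ray version):

algo = config.build()
for i in range(3):
    result = algo.train()  # runs one training iteration and returns a result dict
    print(f"Finished training iteration {i}")
algo.stop()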


from ray.rllib.algorithms.ppo import PPOConfig  # noqa
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole  # noqa


# A typical multi-agent setup (otherwise using the exact same parameters as before)
# looks like this.
config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 2})
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, you should
    # set this flag in your model config. Having to set this will no longer be required
    # in the near future. It does yield a small performance advantage as value function
    # predictions for PPO are no longer required to happen on the sampler side (but are
    # now fully located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
    # Because you are in a multi-agent env, you have to set up the usual multi-agent
    # parameters:
    .multi_agent(
        policies={"p0", "p1"},
        # Map agent 0 to p0 and agent 1 to p1.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
)
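Instead of stepping the Algorithm manually, you can also hand the config to Ray Tune. A hedged sketch, deliberately using a version-agnostic stopping criterion ("training_iteration"), because result-metric names differ between the old and new stack:

from ray import train, tune

# Tune builds the PPO Algorithm from `config` and runs it for five iterations.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(stop={"training_iteration": 5}),
)
results = tuner.fit()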


from ray.rllib.algorithms.sac import SACConfig  # noqa


config = (
    SACConfig()
    .environment("Pendulum-v1")
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, you should
    # set this flag in your model config. Having to set this will no longer be required
    # in the near future. It does yield a small performance advantage as value function
    # predictions for PPO are no longer required to happen on the sampler side (but are
    # now fully located on the learner side, which might have GPUs available).
    .training(
        model={"uses_new_env_runners": True},
        replay_buffer_config={"type": "EpisodeReplayBuffer"},
        # Note, new API stack SAC uses its own learning rates specific to actor,
        # critic, and alpha. `lr` therefore needs to be set to `None`. See `actor_lr`,
        # `critic_lr`, and `alpha_lr` for the specific learning rates, respectively.
        lr=None,
    )
)
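Following the note in the code above, SAC on the new stack uses separate learning rates for the actor, the critic, and alpha instead of the single `lr` setting. A small sketch of setting them explicitly; the values are placeholders, not tuned recommendations:

# `actor_lr`, `critic_lr`, and `alpha_lr` are the SAC-specific settings referenced in
# the comment above; the values below are illustrative only.
config = config.training(
    actor_lr=3e-4,
    critic_lr=3e-4,
    alpha_lr=3e-4,
)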