Note

Ray 2.10.0 introduces the alpha stage of RLlib's "new API stack". The Ray team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the "old API stack" (for example, ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the "new API stack", and they continue to run by default with the old APIs. You can continue to use your existing custom (old-stack) classes.

See here for more details on how to use the new API stack.

RL Modules (Alpha)#

Note

This is an experimental module that serves as a general replacement for ModelV2 and is subject to change. It will eventually match the functionality of the previous stack. If you only use high-level RLlib APIs such as Algorithm, you shouldn't experience significant changes, except for a few new parameters on the configuration object. If you've used custom models or policies before, you'll need to migrate them to the new modules. Check the Migration guide for more information.

The table below shows the list of migrated algorithms and their currently supported features, which will be updated as the migration progresses.

| Algorithm | Independent MARL | Fully-connected | Image inputs (CNN) | RNN support (LSTM) | Complex observations (ComplexNet) |
|-----------|------------------|-----------------|--------------------|--------------------|-----------------------------------|
| PPO       | PyTorch, TensorFlow | PyTorch, TensorFlow | PyTorch, TensorFlow |  | PyTorch |
| IMPALA    | PyTorch, TensorFlow | PyTorch, TensorFlow | PyTorch, TensorFlow |  | PyTorch |
| APPO      | PyTorch, TensorFlow | PyTorch, TensorFlow | PyTorch, TensorFlow |  |  |

An RL Module is a neural-network container that implements three public methods: forward_train(), forward_exploration(), and forward_inference(). Each method corresponds to a distinct phase of reinforcement learning.

forward_exploration() handles acting and data collection, balancing exploration and exploitation. forward_inference(), on the other hand, serves the learned model during evaluation and is typically less stochastic.

forward_train() manages the training phase, handling calculations exclusive to computing losses, such as learning the Q values in a DQN model.
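
As a rough sketch of how these three entry points are typically exercised (an illustration only; the sample-collection and training wiring is handled by RLlib itself, and `module` / `batch` are assumed to be an already built RLModule and a batch dict with an "obs" key):

train_out = module.forward_train(batch)          # training phase: outputs feed the loss computation
explore_out = module.forward_exploration(batch)  # data collection: stochastic behavior for rollouts
infer_out = module.forward_inference(batch)      # evaluation/serving: typically less stochastic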

Enabling RL Modules in the Configuration#

Enable RL Modules via our configuration object: AlgorithmConfig.api_stack(enable_rl_module_and_learner=True).

import torch
from pprint import pprint

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .api_stack(enable_rl_module_and_learner=True)
    .framework("torch")
    .environment("CartPole-v1")
)

algorithm = config.build()

# run for 2 training steps
for _ in range(2):
    result = algorithm.train()
    pprint(result)

Constructing RL Modules#

The RLModule API provides a unified way to define custom reinforcement learning models in RLlib. This API enables you to design and implement your own models to suit specific needs.

To maintain consistency and usability, RLlib offers a standardized approach for defining module objects for both single-agent and multi-agent reinforcement learning environments. This is achieved through the RLModuleSpec and MultiRLModuleSpec classes. The built-in RLModules in RLlib follow this consistent design pattern, making it easier for you to understand and use these modules.

import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.testing.torch.bc_module import DiscreteBCTorchModule

env = gym.make("CartPole-v1")

spec = RLModuleSpec(
    module_class=DiscreteBCTorchModule,
    observation_space=env.observation_space,
    action_space=env.action_space,
    model_config_dict={"fcnet_hiddens": [64]},
)

module = spec.build()
import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
from ray.rllib.core.testing.torch.bc_module import DiscreteBCTorchModule

spec = MultiRLModuleSpec(
    module_specs={
        "module_1": RLModuleSpec(
            module_class=DiscreteBCTorchModule,
            observation_space=gym.spaces.Box(low=-1, high=1, shape=(10,)),
            action_space=gym.spaces.Discrete(2),
            model_config_dict={"fcnet_hiddens": [32]},
        ),
        "module_2": RLModuleSpec(
            module_class=DiscreteBCTorchModule,
            observation_space=gym.spaces.Box(low=-1, high=1, shape=(5,)),
            action_space=gym.spaces.Discrete(2),
            model_config_dict={"fcnet_hiddens": [16]},
        ),
    },
)

multi_rl_module = spec.build()
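
To make this concrete, here's a minimal usage sketch that runs one observation through the single-agent `module` built in the first example above (assumptions: `env` and `module` from that example are still in scope, and DiscreteBCTorchModule returns an "action_dist" key, as its implementation further below shows):

import torch

obs, _ = env.reset()
batch = {"obs": torch.from_numpy(obs).unsqueeze(0)}  # add a batch dimension

# Stochastic behavior for data collection.
exploration_out = module.forward_exploration(batch)
action = int(exploration_out["action_dist"].sample()[0])

# Behavior used at evaluation time.
inference_out = module.forward_inference(batch)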

You can pass RL Module specs to the algorithm configuration so that the algorithm uses them.

import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.testing.torch.bc_module import DiscreteBCTorchModule
from ray.rllib.core.testing.bc_algorithm import BCConfigTest


config = (
    BCConfigTest()
    .api_stack(enable_rl_module_and_learner=True)
    .environment("CartPole-v1")
    .rl_module(
        model_config_dict={"fcnet_hiddens": [32, 32]},
        rl_module_spec=RLModuleSpec(module_class=DiscreteBCTorchModule),
    )
)

algo = config.build()

Note

When passing RL Module specs, you don't have to fill in all of the fields; they get filled in automatically based on the described environment or other algorithm configuration parameters (that is, observation_space, action_space, and model_config_dict are not required fields when passing a custom RL Module spec to the algorithm configuration).

import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
from ray.rllib.core.testing.torch.bc_module import DiscreteBCTorchModule
from ray.rllib.core.testing.bc_algorithm import BCConfigTest
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole


config = (
    BCConfigTest()
    .api_stack(enable_rl_module_and_learner=True)
    .environment(MultiAgentCartPole, env_config={"num_agents": 2})
    .rl_module(
        model_config_dict={"fcnet_hiddens": [32, 32]},
        rl_module_spec=MultiRLModuleSpec(
            module_specs=RLModuleSpec(module_class=DiscreteBCTorchModule)
        ),
    )
)

Writing Custom Single-Agent RL Modules#

For single-agent algorithms (for example, PPO, DQN) or independent multi-agent learning (for example, PPO-MultiAgent), use RLModule. For more advanced multi-agent use cases with shared communication between agents, extend the MultiRLModule class instead.

RLlib treats single-agent modules as a special case of MultiRLModule with only one module. Create the multi-agent representation of any RLModule by calling as_multi_rl_module(). For example:

import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.testing.torch.bc_module import DiscreteBCTorchModule

env = gym.make("CartPole-v1")
spec = RLModuleSpec(
    module_class=DiscreteBCTorchModule,
    observation_space=env.observation_space,
    action_space=env.action_space,
    model_config_dict={"fcnet_hiddens": [64]},
)

module = spec.build()
multi_rl_module = module.as_multi_rl_module()

RLlib implements the following framework-specific abstract base classes:

  • TorchRLModule: For PyTorch-based RL Modules.

  • TfRLModule: For TensorFlow-based RL Modules.

At a minimum, a subclass of RLModule needs to implement the forward methods shown in the examples below: _forward_inference(), _forward_exploration(), and _forward_train().

For your custom forward_exploration() and forward_inference() methods, you must return a dictionary that contains the key "actions" and/or the key "action_dist_inputs".

If you return the "actions" key:

  • RLlib uses the provided actions as-is.

  • If you also return the "action_dist_inputs" key: RLlib additionally creates a Distribution object from the distribution parameters under that key and, in the case of forward_exploration(), automatically computes the action probabilities and logp values for the given actions.

If you don't return the "actions" key, RLlib requires the "action_dist_inputs" key instead: it creates a Distribution object from these parameters, samples actions from that distribution, and, in the case of forward_exploration(), also computes the action probabilities and logp values of the sampled actions.

Note

In the case of forward_inference(), the generated distribution (from the returned key "action_dist_inputs") is always first made deterministic via the to_deterministic() utility before any possible action sampling step. Thus, for example, sampling from a Categorical distribution reduces to simply picking the argmax action from the distribution's logits/probs.

Commonly used distribution implementations can be found under ray.rllib.models.tf.tf_distributions for tensorflow and under ray.rllib.models.torch.torch_distributions for torch. You can choose to return deterministic actions by creating a deterministic distribution instance.
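
For example, a minimal sketch (assuming the TorchCategorical class from the torch distributions module just mentioned and the to_deterministic() utility from the note above):

import torch
from ray.rllib.models.torch.torch_distributions import TorchCategorical

logits = torch.tensor([[0.1, 2.0, -0.5]])
dist = TorchCategorical(logits=logits)  # stochastic: sample() draws from softmax(logits)
greedy_dist = dist.to_deterministic()   # deterministic counterpart: "sampling" picks the argmax
action = greedy_dist.sample()           # always index 1 for these logits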

"""
An RLModule whose forward_exploration/inference methods return the
"actions" key.
"""

class MyRLModule(TorchRLModule):
    ...

    def _forward_inference(self, batch):
        ...
        return {
            "actions": ...  # actions will be used as-is
        }

    def _forward_exploration(self, batch):
        ...
        return {
            "actions": ...  # actions will be used as-is (no sampling step!)
            "action_dist_inputs": ...  # optional: If provided, will be used to compute action probs and logp.
        }
"""
An RLModule whose forward_exploration/inference methods don't return the
"actions" key.
"""

class MyRLModule(TorchRLModule):
    ...

    def _forward_inference(self, batch):
        ...
        return {
            # RLlib will:
            # - Generate distribution from these parameters.
            # - Convert distribution to a deterministic equivalent.
            # - "sample" from the deterministic distribution.
            "action_dist_inputs": ...
        }

    def _forward_exploration(self, batch):
        ...
        return {
            # RLlib will:
            # - Generate distribution from these parameters.
            # - "sample" from the (stochastic) distribution.
            # - Compute action probs/logs automatically using the sampled
            #   actions and the generated distribution object.
            "action_dist_inputs": ...
        }

In addition, the constructor of the RLModule class requires a dataclass configuration object called RLModuleConfig, which carries fields such as observation_space, action_space, and model_config_dict (accessed as self.config.* in the examples below).

When writing RL Modules, you need to use these fields to construct your model.

from typing import Any, Dict
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule
from ray.rllib.core.rl_module.rl_module import RLModuleConfig

import torch
import torch.nn as nn


class DiscreteBCTorchModule(TorchRLModule):
    def __init__(self, config: RLModuleConfig) -> None:
        super().__init__(config)

    def setup(self):
        input_dim = self.config.observation_space.shape[0]
        hidden_dim = self.config.model_config_dict["fcnet_hiddens"][0]
        output_dim = self.config.action_space.n

        self.policy = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

        self.input_dim = input_dim

    def _forward_inference(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        with torch.no_grad():
            return self._forward_train(batch)

    def _forward_exploration(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        with torch.no_grad():
            return self._forward_train(batch)

    def _forward_train(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        action_logits = self.policy(batch["obs"])
        return {"action_dist": torch.distributions.Categorical(logits=action_logits)}


from typing import Any, Dict
from ray.rllib.core.rl_module.tf.tf_rl_module import TfRLModule
from ray.rllib.core.rl_module.rl_module import RLModuleConfig

import tensorflow as tf
import tensorflow_probability as tfp


class DiscreteBCTfModule(TfRLModule):
    def __init__(self, config: RLModuleConfig) -> None:
        super().__init__(config)

    def setup(self):
        input_dim = self.config.observation_space.shape[0]
        hidden_dim = self.config.model_config_dict["fcnet_hiddens"][0]
        output_dim = self.config.action_space.n

        self.policy = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(hidden_dim, activation="relu"),
                tf.keras.layers.Dense(output_dim),
            ]
        )

        self.input_dim = input_dim

    def _forward_inference(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        return self._forward_train(batch)

    def _forward_exploration(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        return self._forward_train(batch)

    def _forward_train(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        action_logits = self.policy(batch["obs"])
        return {"action_dist": tf.distributions.Categorical(logits=action_logits)}


In RLModule you can enforce the checking of certain input or output keys in the data that flows into and out of RL Modules. This serves multiple purposes:

  • It makes the I/O requirements of each method self-documenting.

  • It lets failures surface quickly. If a user extends the module and implements something that doesn't match the I/O specifications, the check reports the missing keys and their expected format. For example, an RLModule should always include an obs key in its input batch and an action_dist key in its output.

class DiscreteBCTorchModule(TorchRLModule):
    ...

    @override(TorchRLModule)
    def input_specs_exploration(self) -> SpecType:
        # Enforce that input nested dict to exploration method has a key "obs"
        return ["obs"]

    @override(TorchRLModule)
    def output_specs_exploration(self) -> SpecType:
        # Enforce that output nested dict from exploration method has a key
        # "action_dist"
        return ["action_dist"]


class DiscreteBCTorchModule(TorchRLModule):
    ...

    @override(TorchRLModule)
    def input_specs_exploration(self) -> SpecType:
        # Enforce that input nested dict to exploration method has a key "obs"
        # and within that key, it has a key "global" and "local". There should
        # also be a key "action_mask"
        return [("obs", "global"), ("obs", "local"), "action_mask"]


class DiscreteBCTorchModule(TorchRLModule):
    ...

    @override(TorchRLModule)
    def input_specs_exploration(self) -> SpecType:
        # Enforce that input nested dict to exploration method has a key "obs"
        # and its value is a torch.Tensor with shape (b, h) where b is the
        # batch size (determined at run-time) and h is the hidden size
        # (fixed at 10).
        return {"obs": TensorSpec("b, h", h=10, framework="torch")}


class DiscreteBCTorchModule(TorchRLModule):
    ...

    @override(TorchRLModule)
    def output_specs_exploration(self) -> SpecType:
        # Enforce that output nested dict from exploration method has a key
        # "action_dist" and its value is a torch.distributions.Categorical
        return {"action_dist": torch.distributions.Categorical}


For each forward method, RLModule provides two such methods (one for the input spec, one for the output spec), for a total of six methods that can be overridden to describe the expected inputs and outputs of each method.

For more information, see the SpecType documentation.
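
For reference, a sketch of the six overridable spec methods, following the pattern of the examples above (the bodies here are placeholders only):

class MySpeccedRLModule(TorchRLModule):
    ...

    def input_specs_inference(self) -> SpecType: ...
    def output_specs_inference(self) -> SpecType: ...

    def input_specs_exploration(self) -> SpecType: ...
    def output_specs_exploration(self) -> SpecType: ...

    def input_specs_train(self) -> SpecType: ...
    def output_specs_train(self) -> SpecType: ...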

Writing Custom Multi-Agent RL Modules (Advanced)#

For multi-agent modules, RLlib implements MultiRLModule, which is a dictionary of RLModule objects, one for each policy, plus possibly some shared modules. The base-class implementation works for most use cases that need to define independent neural networks for sub-groups of agents. For more complex multi-agent use cases, where the agents share some part of their neural networks, you should inherit from this class and override the default implementation.

The MultiRLModule offers an API for constructing custom models tailored to specific needs. The key method for this customization is the setup() method, which the example below overrides.

The following example creates a custom multi-agent RL Module whose underlying modules share an encoder that gets applied to the global part of the observation space. The local part of each observation is then processed, together with the shared global encoding, by a policy head that is specific to each policy.

from typing import Any, Dict

from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule
from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleConfig, MultiRLModule

import torch
import torch.nn as nn


class BCTorchRLModuleWithSharedGlobalEncoder(TorchRLModule):
    """An RLModule with a shared encoder between agents for global observation."""

    def __init__(
        self,
        encoder: nn.Module,
        local_dim: int,
        hidden_dim: int,
        action_dim: int,
        config=None,
    ) -> None:
        super().__init__(config=config)

        self.encoder = encoder
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim + local_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def _forward_inference(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        with torch.no_grad():
            return self._common_forward(batch)

    def _forward_exploration(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        with torch.no_grad():
            return self._common_forward(batch)

    def _forward_train(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        return self._common_forward(batch)

    def _common_forward(self, batch):
        obs = batch["obs"]
        global_enc = self.encoder(obs["global"])
        policy_in = torch.cat([global_enc, obs["local"]], dim=-1)
        action_logits = self.policy_head(policy_in)

        return {"action_dist": torch.distributions.Categorical(logits=action_logits)}


class BCTorchMultiAgentModuleWithSharedEncoder(MultiRLModule):
    def __init__(self, config: MultiRLModuleConfig) -> None:
        super().__init__(config)

    def setup(self):

        module_specs = self.config.modules
        module_spec = next(iter(module_specs.values()))
        global_dim = module_spec.observation_space["global"].shape[0]
        hidden_dim = module_spec.model_config_dict["fcnet_hiddens"][0]
        shared_encoder = nn.Sequential(
            nn.Linear(global_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

        rl_modules = {}
        for module_id, module_spec in module_specs.items():
            rl_modules[module_id] = BCTorchRLModuleWithSharedGlobalEncoder(
                config=module_specs[module_id].get_rl_module_config(),
                encoder=shared_encoder,
                local_dim=module_spec.observation_space["local"].shape[0],
                hidden_dim=hidden_dim,
                action_dim=module_spec.action_space.n,
            )

        self._rl_modules = rl_modules


To construct this custom multi-agent RL Module, pass the class to the MultiRLModuleSpec constructor. Also pass an RLModuleSpec for each agent, because RLlib requires each agent's observation space, action space, and model hyper-parameters.

import gymnasium as gym
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec

spec = MultiRLModuleSpec(
    multi_rl_module_class=BCTorchMultiAgentModuleWithSharedEncoder,
    module_specs={
        "local_2d": RLModuleSpec(
            observation_space=gym.spaces.Dict(
                {
                    "global": gym.spaces.Box(low=-1, high=1, shape=(2,)),
                    "local": gym.spaces.Box(low=-1, high=1, shape=(2,)),
                }
            ),
            action_space=gym.spaces.Discrete(2),
            model_config_dict={"fcnet_hiddens": [64]},
        ),
        "local_5d": RLModuleSpec(
            observation_space=gym.spaces.Dict(
                {
                    "global": gym.spaces.Box(low=-1, high=1, shape=(2,)),
                    "local": gym.spaces.Box(low=-1, high=1, shape=(5,)),
                }
            ),
            action_space=gym.spaces.Discrete(5),
            model_config_dict={"fcnet_hiddens": [64]},
        ),
    },
)

module = spec.build()
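
As a brief sketch of what the built container gives you (an assumption-level illustration: MultiRLModule behaves as a mapping from module IDs to RLModules, and the setup() above passes the same encoder object to both submodules):

# Hypothetical usage: access the per-agent submodules by their module IDs.
local_2d_module = module["local_2d"]
local_5d_module = module["local_5d"]

# Both submodules were constructed with the same shared encoder instance.
assert local_2d_module.encoder is local_5d_module.encoder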

Extending Existing RLlib RL Modules#

RLlib offers a number of RL Modules for different frameworks (for example, PyTorch, TensorFlow). To customize an existing RL Module, modify it directly by inheriting from its class and changing setup() or other methods. For example, extend PPOTorchRLModule and add your own customizations to it, then pass the new customized class into the appropriate AlgorithmConfig.

There are two possible ways to extend existing RL Modules:

The default way to extend existing RL Modules is to inherit from them and override the methods you need to customize, then pass the new customized class into the AlgorithmConfig so that the algorithm trains your custom RL Module. This is the preferred approach: with it, you explicitly define your own models inside the given RL Module and don't need to interact with, or learn anything about, the Catalog.

class MyPPORLModule(PPORLModule):

    def __init__(self, config: RLModuleConfig):
        super().__init__(config)
        ...

# Pass in the custom RL Module class to the spec
algo_config = algo_config.rl_module(
    rl_module_spec=RLModuleSpec(module_class=MyPPORLModule)
)

A concrete example: If you want to replace the default encoder that RLlib builds for torch, PPO, and a given observation space, you can override the setup() method of the PPOTorchRLModule class to create your custom encoder instead of the default one. The following example does exactly that.

import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.algorithms.ppo.torch.ppo_torch_rl_module import PPOTorchRLModule
from ray.rllib.core.models.configs import MLPHeadConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples.envs.classes.random_env import RandomEnv
from ray.rllib.models.torch.torch_distributions import TorchCategorical
from ray.rllib.examples._old_api_stack.models.mobilenet_v2_encoder import (
    MobileNetV2EncoderConfig,
    MOBILENET_INPUT_SHAPE,
)
from ray.rllib.core.models.configs import ActorCriticEncoderConfig


class MobileNetTorchPPORLModule(PPOTorchRLModule):
    """A PPORLModules with mobilenet v2 as an encoder.

    The idea behind this model is to demonstrate how we can bypass catalog to
    take full control over what models and action distribution are being built.
    In this example, we do this to modify an existing RLModule with a custom encoder.
    """

    def setup(self):
        mobilenet_v2_config = MobileNetV2EncoderConfig()
        # Since we want to use PPO, which is an actor-critic algorithm, we need to
        # use an ActorCriticEncoderConfig to wrap the base encoder config.
        actor_critic_encoder_config = ActorCriticEncoderConfig(
            base_encoder_config=mobilenet_v2_config
        )

        self.encoder = actor_critic_encoder_config.build(framework="torch")
        mobilenet_v2_output_dims = mobilenet_v2_config.output_dims

        pi_config = MLPHeadConfig(
            input_dims=mobilenet_v2_output_dims,
            output_layer_dim=2,
        )

        vf_config = MLPHeadConfig(
            input_dims=mobilenet_v2_output_dims, output_layer_dim=1
        )

        self.pi = pi_config.build(framework="torch")
        self.vf = vf_config.build(framework="torch")

        self.action_dist_cls = TorchCategorical


config = (
    PPOConfig()
    .api_stack(enable_rl_module_and_learner=True)
    .rl_module(rl_module_spec=RLModuleSpec(module_class=MobileNetTorchPPORLModule))
    .environment(
        RandomEnv,
        env_config={
            "action_space": gym.spaces.Discrete(2),
            # Test a simple Image observation space.
            "observation_space": gym.spaces.Box(
                0.0,
                1.0,
                shape=MOBILENET_INPUT_SHAPE,
                dtype=np.float32,
            ),
        },
    )
    .env_runners(num_env_runners=0)
    # The following training settings make it so that a training iteration is very
    # quick. This is just for the sake of this example. PPO will not learn properly
    # with these settings!
    .training(train_batch_size=32, sgd_minibatch_size=16, num_sgd_iter=1)
)

config.build().train()

The advanced way to customize your module is to extend its Catalog. The Catalog is a component that defines the default models and other sub-components of an RL Module, based on factors such as observation_space and action_space. For more information on the Catalog class, refer to the Catalog user guide. By modifying the Catalog, you change which sub-components get built for an existing RL Module. This approach is mainly useful if you want your custom component to integrate with the decision tree that Catalogs represent. The following use cases are examples of when you need to extend a Catalog:

  • Selecting a custom model only for a specific observation space.

  • Using a custom action distribution across multiple different algorithms.

  • Reusing your custom components across multiple different RL Modules.

For example, to adapt an existing PPORLModule to a custom graph observation space that RLlib doesn't support out of the box, extend the Catalog class used to create the PPORLModule and override the method responsible for returning the encoder component, so that your custom encoder replaces the default one RLlib would otherwise provide.

class MyAwesomeCatalog(PPOCatalog):

    def build_actor_critic_encoder(self):
        # create your awesome graph encoder here and return it
        pass


# Pass in the custom catalog class to the spec
algo_config = algo_config.rl_module(
    rl_module_spec=RLModuleSpec(catalog_class=MyAwesomeCatalog)
)

Checkpointing RL Modules#

RL Modules can be checkpointed with their two methods save_to_path() and from_checkpoint(). The following example shows how you can use these methods outside of, or in conjunction with, an RLlib Algorithm.

import gymnasium as gym
import shutil
import tempfile
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.algorithms.ppo.torch.ppo_torch_rl_module import PPOTorchRLModule
from ray.rllib.core.rl_module.rl_module import RLModule, RLModuleSpec

config = (
    PPOConfig()
    # Enable the new API stack (RLModule and Learner APIs).
    .api_stack(enable_rl_module_and_learner=True).environment("CartPole-v1")
)
env = gym.make("CartPole-v1")
# Create an RL Module that we would like to checkpoint
module_spec = RLModuleSpec(
    module_class=PPOTorchRLModule,
    observation_space=env.observation_space,
    action_space=env.action_space,
    # If we want to use this externally created module in the algorithm,
    # we need to provide the same config as the algorithm. Any changes to
    # the defaults can be given via the right side of the `|` operator.
    model_config_dict=config.model_config | {"fcnet_hiddens": [32]},
    catalog_class=PPOCatalog,
)
module = module_spec.build()

# Create the checkpoint.
module_ckpt_path = tempfile.mkdtemp()
module.save_to_path(module_ckpt_path)

# Create a new RLModule from the checkpoint.
loaded_module = RLModule.from_checkpoint(module_ckpt_path)

# Create a new Algorithm (with the changed module config: 32 units instead of the
# default 256; otherwise loading the state of `module` will fail due to a shape
# mismatch).
config.rl_module(model_config_dict=config.model_config | {"fcnet_hiddens": [32]})
algo = config.build()
# Now load the saved RLModule state (from the above `module.save_to_path()`) into the
# Algorithm's RLModule(s). Note that all RLModules within the algo get updated, the ones
# in the Learner workers and the ones in the EnvRunners.
algo.restore_from_path(
    module_ckpt_path,  # <- NOT an Algorithm checkpoint, but single-agent RLModule one.
    # We have to provide the exact component-path to the (single) RLModule
    # within the algorithm, which is:
    component="learner_group/learner/rl_module/default_policy",
)

Migrating from Custom Policies and Models to RL Modules#

This section is for users who have implemented custom policies and models in RLlib and want to migrate to the new RLModule API. If you have implemented custom policies that extended the EagerTFPolicyV2 or TorchPolicyV2 classes, you likely did so to modify the behavior of constructing models and distributions (by overriding make_model or make_model_and_action_dist), to control the action-sampling logic (by overriding action_distribution_fn or action_sampler_fn), or to control the inference logic (by overriding compute_actions_from_input_dict, compute_actions, or compute_log_likelihoods). These APIs were built around ray.rllib.models.modelv2.ModelV2 models to enable you to customize the behavior of those functions. However, RLModule is a more general abstraction that reduces the number of functions you need to override.

In the new RLModule API, constructing the model and defining the action distribution class to use are best done in the constructor. That RL Module is constructed automatically if you follow the instructions outlined in the sections Enabling RL Modules in the Configuration and Constructing RL Modules. Policy.compute_actions and Policy.compute_actions_from_input_dict can still be used for sampling actions for inference or exploration via the explore=True|False parameter. When called with explore=True, these functions invoke RLModule.forward_exploration; with explore=False, they call RLModule.forward_inference.

Here is what your customization may have looked like before:

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.torch_policy_v2 import TorchPolicyV2


class MyCustomModel(TorchModelV2):
    """Code for your previous custom model"""
    ...


class CustomPolicy(TorchPolicyV2):

    @DeveloperAPI
    @OverrideToImplementCustomLogic
    def make_model(self) -> ModelV2:
        """Create model.

        Note: only one of make_model or make_model_and_action_dist
        can be overridden.

        Returns:
        ModelV2 model.
        """
        return MyCustomModel(...)
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.torch_policy_v2 import TorchPolicyV2


class MyCustomModel(TorchModelV2):
    """Code for your previous custom model"""
    ...


class CustomPolicy(TorchPolicyV2):

    @DeveloperAPI
    @OverrideToImplementCustomLogic
    def make_model_and_action_dist(self):
        """Create model and action distribution function.

        Returns:
            ModelV2 model.
            ActionDistribution class.
        """
        my_model = MyCustomModel(...) # construct some ModelV2 instance here
        dist_class = ... # Action distribution cls

        return my_model, dist_class
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.torch_policy_v2 import TorchPolicyV2

class CustomPolicy(TorchPolicyV2):

    @DeveloperAPI
    @OverrideToImplementCustomLogic
    def action_sampler_fn(
        self,
        model: ModelV2,
        *,
        obs_batch: TensorType,
        state_batches: TensorType,
        **kwargs,
    ) -> Tuple[TensorType, TensorType, TensorType, List[TensorType]]:
        """Custom function for sampling new actions given policy.

        Args:
            model: Underlying model.
            obs_batch: Observation tensor batch.
            state_batches: Action sampling state batch.

        Returns:
            Sampled action
            Log-likelihood
            Action distribution inputs
            Updated state
        """
        return None, None, None, None


    @DeveloperAPI
    @OverrideToImplementCustomLogic
    def action_distribution_fn(
        self,
        model: ModelV2,
        *,
        obs_batch: TensorType,
        state_batches: TensorType,
        **kwargs,
    ) -> Tuple[TensorType, type, List[TensorType]]:
        """Action distribution function for this Policy.

        Args:
            model: Underlying model.
            obs_batch: Observation tensor batch.
            state_batches: Action sampling state batch.

        Returns:
            Distribution input.
            ActionDistribution class.
            State outs.
        """
        return None, None, None

All Policy.compute_*** functions expect forward_exploration() and forward_inference() to return a dictionary that contains the key "actions" and/or the key "action_dist_inputs".

See Writing Custom Single-Agent RL Modules for more details on how to implement your own custom RL Modules.

"""
No need to override any policy functions. Simply instead implement any custom logic in your custom RL Module
"""
from ray.rllib.models.torch.torch_distributions import YOUR_DIST_CLASS


class MyRLModule(TorchRLModule):

    def __init__(self, config: RLConfig):
        # construct any custom networks here using config
        # specify an action distribution class here
        ...

    def _forward_inference(self, batch):
        ...

    def _forward_exploration(self, batch):
        ...