Note
Ray 2.10.0 introduces the alpha stage of RLlib's "new API stack". The Ray team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the "old API stack" (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.
Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the "new API stack", and both continue to run by default with the old APIs. You can continue to use your existing custom (old stack) classes.
See here for more details on how to use the new API stack.
Catalog (Alpha)#
Catalog is a utility abstraction that modularizes the construction of components for RLModules. It includes information such as how to encode the input observation space, what action distribution to use, and so on. Each RLModule has its own Catalog. For example, PPOTorchRLModule has the PPOCatalog. To customize an existing RLModule, you can either change the RLModule directly by inheriting from its class and changing the setup() method, or, alternatively, extend the Catalog class assigned to that RLModule. Use Catalogs only if your customizations fit the abstractions that Catalog provides.
Note
Modifying Catalogs signifies an advanced use case, so you should only consider it if modifying an RLModule, or writing one from scratch, does not cover your use case. We recommend modifying Catalogs only when making deeper customizations to the decision trees that determine which Model and Distribution RLlib creates by default.
Note
If you simply want to modify a Model by changing its default values, have a look at the model config dict.
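For example, a minimal sketch, assuming your Ray version accepts model_config_dict on AlgorithmConfig.rl_module():

from ray.rllib.algorithms.ppo import PPOConfig

# Hedged sketch: tweak default model settings through the model config dict
# instead of touching a Catalog. `fcnet_hiddens` and `fcnet_activation` are
# standard MODEL_DEFAULTS keys; how the dict is passed may differ between
# Ray versions.
config = (
    PPOConfig()
    .api_stack(enable_rl_module_and_learner=True)
    .environment("CartPole-v1")
    .rl_module(model_config_dict={"fcnet_hiddens": [64, 64], "fcnet_activation": "relu"})
)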
While Catalogs have a base class Catalog, you mostly interact with algorithm-specific Catalogs. Therefore, this doc also includes examples around PPO, from which you can extrapolate to other algorithms. A prerequisite for this user guide is a rough understanding of RLModules. This user guide covers the following topics:
What are Catalogs
Catalog design and ideas
Catalog and AlgorithmConfig
Basic usage
Inject your custom models into RLModules
Inject your custom action distributions into RLModules
Write a Catalog from scratch
What are Catalogs#
Catalogs have two primary roles: choosing the right Model and choosing the right Distribution. By default, all Catalogs implement decision trees that decide on a model architecture based on a combination of input configurations. These mainly include the RLModule's observation space and action space, the model config dict, and the deep learning framework backend.
The following diagram shows the breakdown of the information flow towards the models and distributions within an RLModule. An RLModule creates an instance of the Catalog class it receives in its constructor. It then creates its internal models and distributions with the help of this Catalog.
Note
You can also modify models or distributions directly by overriding the RLModule's constructor!
The following diagram shows a concrete case in more detail.
Catalog design and ideas#
Since the main use cases for this component involve deep modifications of it, this section explains the design and ideas behind Catalogs.
What problems do Catalogs solve?#
RL algorithms need neural network models and distributions. Within an algorithm, many different architectures for such sub-components are valid. Moreover, models and distributions vary with the environment. However, most algorithms require models that have similarities. The problem is to find sensible sub-components for a wide range of use cases while sharing this functionality across algorithms.
How do Catalogs solve this?#
As stated above, Catalogs implement decision trees for the sub-components of RLModules. Models and distributions from a Catalog object are meant to fit together. Since we mostly build RLModules out of Encoders, Heads, and Distributions, Catalogs generally reflect this as well. For example, the PPOCatalog outputs an Encoder that produces a latent vector and two Heads that take this latent vector as input (this is why Catalogs have a latent_dims attribute). Heads and distributions behave accordingly. Whenever you create a Catalog, the decision tree executes to find suitable configs for the models and suitable classes for the distributions. By default, this happens in _get_encoder_config() and _get_dist_cls_from_action_space(). Whenever you build a model, the config is turned into a model. Distributions are instantiated on every forward pass of an RLModule and are therefore not built.
API philosophy#
Catalogs attempt to encapsulate most of the complexity around models inside the Encoder. This means that recurrency, attention, and other special cases are handled fully inside the Encoder and are transparent to other components. The Encoder is the only component that the Catalog base class builds. This is because many algorithms require custom Heads and Distributions, but most of them can use the same Encoders. The Catalog API is designed so that interaction usually happens in two stages:
Instantiate a Catalog. This executes the decision tree.
Generate an arbitrary number of the decided components through Catalog methods.
The two default methods to access such components on the base class are …
You can override these methods to quickly change which models RLModules build. The other methods are private and should only be overridden to make deep changes to the decision tree that enhance the capabilities of Catalogs. Additionally, get_tokenizer_config() is a method you can use when tokenization is required. Tokenization means single-step embedding. Encoding also means embedding, but it can span multiple timesteps. In fact, the tokenizers that RLlib uses in its recurrent Encoders (for example, TorchLSTMEncoder) are instances of non-recurrent Encoder classes.
Catalog and AlgorithmConfig#
Since Catalogs effectively control which models and distributions RLlib uses under the hood, they are also part of RLlib's configuration. As the main entry point for configuring RLlib, AlgorithmConfig is the place where you configure the Catalogs of the RLModules it creates. You set the catalog class through an RLModuleSpec or MultiRLModuleSpec. For example, in heterogeneous multi-agent cases, you modify the MultiRLModuleSpec.
The following example shows how to configure the Catalog of the RLModule created by PPO.
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec


class MyPPOCatalog(PPOCatalog):
    def __init__(self, *args, **kwargs):
        print("Hi from within PPORLModule!")
        super().__init__(*args, **kwargs)


config = (
    PPOConfig()
    .api_stack(enable_rl_module_and_learner=True)
    .environment("CartPole-v1")
    .framework("torch")
)

# Specify the catalog to use for the PPORLModule.
config = config.rl_module(rl_module_spec=RLModuleSpec(catalog_class=MyPPOCatalog))
# This is how RLlib constructs a PPORLModule.
# It will say "Hi from within PPORLModule!".
ppo = config.build()
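For the heterogeneous multi-agent case mentioned above, a sketch could look as follows, reusing config and MyPPOCatalog from the previous snippet. The import path and the rl_module_specs keyword of MultiRLModuleSpec are assumptions and may differ between Ray versions; check the API of your installation.

from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec  # path may vary by version
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

# Hedged sketch: give each module its own catalog class.
multi_spec = MultiRLModuleSpec(
    rl_module_specs={  # keyword name is an assumption; check your Ray version
        "policy_1": RLModuleSpec(catalog_class=MyPPOCatalog),
        # Modules without an explicit catalog_class fall back to the algorithm's default.
        "policy_2": RLModuleSpec(),
    }
)
config = config.rl_module(rl_module_spec=multi_spec)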
Basic usage#
In the following three examples, we play with Catalogs to illustrate their API.
High-level API#
The first example showcases the general API for interacting with Catalogs.
import gymnasium as gym
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
env = gym.make("CartPole-v1")
catalog = PPOCatalog(env.observation_space, env.action_space, model_config_dict={})
# Build an encoder that fits CartPole's observation space.
encoder = catalog.build_actor_critic_encoder(framework="torch")
policy_head = catalog.build_pi_head(framework="torch")
# We expect a categorical distribution for CartPole.
action_dist_class = catalog.get_action_dist_cls(framework="torch")
Creating models and distributions#
The second example showcases how to use the base Catalog to create a model and an action distribution. Besides these, we create a head network by hand that fits these two components together.
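A minimal sketch of this idea, assuming a plain torch.nn.Linear is acceptable as the hand-written head:

import gymnasium as gym
import torch

from ray.rllib.core.models.catalog import Catalog

env = gym.make("CartPole-v1")

catalog = Catalog(env.observation_space, env.action_space, model_config_dict={})
# The base Catalog only knows how to build an encoder ...
encoder = catalog.build_encoder(framework="torch")
# ... so we add a policy head by hand that maps the encoder's latent output
# to CartPole's two action logits ...
policy_head = torch.nn.Linear(catalog.latent_dims[0], int(env.action_space.n))
# ... and let the Catalog pick a fitting action distribution class.
action_dist_cls = catalog.get_action_dist_cls(framework="torch")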
Creating models and distributions for PPO#
The third example showcases how to use the PPOCatalog to create an encoder and an action distribution. This is more similar to what RLlib does internally.
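A compact sketch of what this can look like; the dummy logits below stand in for the output of a real encoder and pi head forward pass:

import gymnasium as gym
import torch

from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog

env = gym.make("CartPole-v1")

catalog = PPOCatalog(env.observation_space, env.action_space, model_config_dict={})
# PPO builds one actor-critic encoder plus a policy head and a value head.
actor_critic_encoder = catalog.build_actor_critic_encoder(framework="torch")
pi_head = catalog.build_pi_head(framework="torch")
vf_head = catalog.build_vf_head(framework="torch")
# The action distribution class fits the pi head's outputs: feed it logits and sample.
action_dist_cls = catalog.get_action_dist_cls(framework="torch")
dummy_logits = torch.randn(1, int(env.action_space.n))
action = action_dist_cls.from_logits(dummy_logits).sample()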
Note that the two examples above illustrate in principle what it takes to implement a Catalog. In this case, we see the difference between Catalog and PPOCatalog. In most cases, we can reuse the capabilities of the base Catalog class and only need to add methods to build head networks that we can then use in the appropriate RLModule.
Inject your custom models or action distributions into Catalogs#
You can make RLlib use custom models by overriding the methods of a Catalog that RLModules use to build models. Have a look at these lines from the constructor of PPOTorchRLModule to see how an RLModule uses its Catalog.
catalog = self.config.get_catalog()

# If we have a stateful model, states for the critic need to be collected
# during sampling and `inference-only` needs to be `False`. Note, at this
# point the encoder is not built, yet and therefore `is_stateful()` does
# not work.
is_stateful = isinstance(
    catalog.actor_critic_encoder_config.base_encoder_config,
    RecurrentEncoderConfig,
)
if is_stateful:
    self.config.inference_only = False

# If this is an `inference_only` Module, we'll have to pass this information
# to the encoder config as well.
if self.config.inference_only and self.framework == "torch":
    catalog.actor_critic_encoder_config.inference_only = True

# Build models from catalog.
self.encoder = catalog.build_actor_critic_encoder(framework=self.framework)
self.pi = catalog.build_pi_head(framework=self.framework)
self.vf = catalog.build_vf_head(framework=self.framework)

self.action_dist_cls = catalog.get_action_dist_cls(framework=self.framework)
Note that what happens inside the constructor of PPOTorchRLModule is similar to the earlier example Creating models and distributions for PPO.
Consequently, in order to build a custom Model compatible with a PPORLModule, you can either override methods by inheriting from PPOCatalog, or write a Catalog from scratch that implements them. The following examples showcase such modifications:
This example shows two modifications:
How to write a custom Encoder
How to inject the custom Encoder into a Catalog
Note that if you only want to inject your Encoder into a single RLModule, the recommended workflow is to inherit from an existing RLModule and place the Encoder there.
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples._old_api_stack.models.mobilenet_v2_encoder import (
    MobileNetV2EncoderConfig,
    MOBILENET_INPUT_SHAPE,
)
from ray.rllib.examples.envs.classes.random_env import RandomEnv


# Define a PPO Catalog that we can use to inject our MobileNetV2 Encoder into RLlib's
# decision tree of what model to choose
class MobileNetEnhancedPPOCatalog(PPOCatalog):
    @classmethod
    def _get_encoder_config(
        cls,
        observation_space: gym.Space,
        **kwargs,
    ):
        if (
            isinstance(observation_space, gym.spaces.Box)
            and observation_space.shape == MOBILENET_INPUT_SHAPE
        ):
            # Inject our custom encoder here, only if the observation space fits it
            return MobileNetV2EncoderConfig()
        else:
            return super()._get_encoder_config(observation_space, **kwargs)


# Create a generic config with our enhanced Catalog
ppo_config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .rl_module(rl_module_spec=RLModuleSpec(catalog_class=MobileNetEnhancedPPOCatalog))
    .env_runners(num_env_runners=0)
    # The following training settings make it so that a training iteration is very
    # quick. This is just for the sake of this example. PPO will not learn properly
    # with these settings!
    .training(train_batch_size=32, sgd_minibatch_size=16, num_sgd_iter=1)
)

# CartPole's observation space is not compatible with our MobileNetV2 Encoder, so
# this will use the default behaviour of Catalogs
ppo_config.environment("CartPole-v1")
results = ppo_config.build().train()
print(results)

# For this training, we use a RandomEnv with observations of shape
# MOBILENET_INPUT_SHAPE. This will use our custom Encoder.
ppo_config.environment(
    RandomEnv,
    env_config={
        "action_space": gym.spaces.Discrete(2),
        # Test a simple Image observation space.
        "observation_space": gym.spaces.Box(
            0.0,
            1.0,
            shape=MOBILENET_INPUT_SHAPE,
            dtype=np.float32,
        ),
    },
)
results = ppo_config.build().train()
print(results)
This example shows two modifications:
How to write a custom Distribution
How to inject the custom action distribution into a Catalog
import torch
import gymnasium as gym

from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.models.distributions import Distribution
from ray.rllib.models.torch.torch_distributions import TorchDeterministic


# Define a simple categorical distribution that can be used for PPO
class CustomTorchCategorical(Distribution):
    def __init__(self, logits):
        self.torch_dist = torch.distributions.categorical.Categorical(logits=logits)

    def sample(self, sample_shape=torch.Size(), **kwargs):
        return self.torch_dist.sample(sample_shape)

    def rsample(self, sample_shape=torch.Size(), **kwargs):
        # Note: torch's Categorical has no reparameterized sampling, so calling
        # this raises NotImplementedError.
        return self.torch_dist.rsample(sample_shape)

    def logp(self, value, **kwargs):
        return self.torch_dist.log_prob(value)

    def entropy(self):
        return self.torch_dist.entropy()

    def kl(self, other, **kwargs):
        return torch.distributions.kl.kl_divergence(self.torch_dist, other.torch_dist)

    @staticmethod
    def required_input_dim(space, **kwargs):
        return int(space.n)

    # This method is used to create distributions from logits inside RLModules.
    # You can use this to inject arguments into the constructor of this distribution
    # that are not the logits themselves.
    @classmethod
    def from_logits(cls, logits):
        return CustomTorchCategorical(logits=logits)

    # This method is used to create a deterministic distribution for the
    # PPORLModule.forward_inference.
    def to_deterministic(self):
        return TorchDeterministic(loc=torch.argmax(self.torch_dist.logits, dim=-1))


# See if we can create this distribution and sample from it to interact with our
# target environment
env = gym.make("CartPole-v1")
dummy_logits = torch.randn([env.action_space.n])
dummy_dist = CustomTorchCategorical.from_logits(dummy_logits)
action = dummy_dist.sample()
env = gym.make("CartPole-v1")
env.reset()
env.step(action.numpy())


# Define a simple catalog that returns our custom distribution when
# get_action_dist_cls is called
class CustomPPOCatalog(PPOCatalog):
    def get_action_dist_cls(self, framework):
        # The distribution we wrote will only work with torch
        assert framework == "torch"
        return CustomTorchCategorical


# Train with our custom action distribution
algo = (
    PPOConfig()
    .environment("CartPole-v1")
    .rl_module(rl_module_spec=RLModuleSpec(catalog_class=CustomPPOCatalog))
    .build()
)
results = algo.train()
print(results)
These examples target PPO, but the workflows apply to all RLlib algorithms. Note that PPO adds the ActorCriticEncoder and two heads (one for the policy, one for the value function) to the base class. You can override these in a similar fashion. Other algorithms may add different sub-components or override default ones.
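For example, the following sketch overrides the value-function head that PPO adds while keeping the default encoder and pi head; the layer sizes are illustrative, not RLlib defaults:

from ray.rllib.algorithms.ppo.ppo_catalog import PPOCatalog
from ray.rllib.core.models.configs import MLPHeadConfig


# Sketch: keep PPO's default encoder and pi head, but swap in a wider value head.
class WideVfPPOCatalog(PPOCatalog):
    def build_vf_head(self, framework: str):
        vf_head_config = MLPHeadConfig(
            input_dims=self.latent_dims,
            hidden_layer_dims=[512, 512],  # illustrative widths
            hidden_layer_activation="relu",
            output_layer_activation="linear",
            output_layer_dim=1,
        )
        return vf_head_config.build(framework=framework)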
Write a Catalog from scratch#
You only need this when you want to write a new algorithm for RLlib. Note that writing an algorithm does not strictly require writing a new Catalog, but you can use Catalogs as a tool to create fitting default sub-components, such as models or distributions. The following are typical requirements and steps for writing a new Catalog:
Does the algorithm need a special Encoder? Override _get_encoder_config().
Does the algorithm need an additional network? Write a method to build it. You can use RLlib's model configs to build models from dimensions (see the sketch after this list).
Does the algorithm need a custom distribution? Override _get_dist_cls_from_action_space().
Does the algorithm need a special tokenizer? Override get_tokenizer_config().
Does the algorithm not need an encoder at all? Override _determine_components_hook().
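As an illustration of the "additional network" step above, the following sketch adds a method that builds an extra head from the latent dims via RLlib's model configs; build_aux_value_head is a hypothetical name that your RLModule would have to call:

from ray.rllib.core.models.catalog import Catalog
from ray.rllib.core.models.configs import MLPHeadConfig


class MyAlgoCatalog(Catalog):
    # Hypothetical extra head, built from the latent dims via RLlib's model configs.
    def build_aux_value_head(self, framework: str):
        aux_head_config = MLPHeadConfig(
            input_dims=self.latent_dims,
            hidden_layer_dims=[256],
            hidden_layer_activation="relu",
            output_layer_activation="linear",
            output_layer_dim=1,
        )
        return aux_head_config.build(framework=framework)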
The following example shows the Catalog for PPO, implemented following the preceding steps:
Catalog for PPORLModules
import gymnasium as gym

from ray.rllib.core.models.catalog import Catalog
from ray.rllib.core.models.configs import (
    ActorCriticEncoderConfig,
    MLPHeadConfig,
    FreeLogStdMLPHeadConfig,
)
from ray.rllib.core.models.base import Encoder, ActorCriticEncoder, Model
from ray.rllib.utils import override
from ray.rllib.utils.annotations import OverrideToImplementCustomLogic


def _check_if_diag_gaussian(action_distribution_cls, framework):
    if framework == "torch":
        from ray.rllib.models.torch.torch_distributions import TorchDiagGaussian

        assert issubclass(action_distribution_cls, TorchDiagGaussian), (
            f"free_log_std is only supported for DiagGaussian action distributions. "
            f"Found action distribution: {action_distribution_cls}."
        )
    elif framework == "tf2":
        from ray.rllib.models.tf.tf_distributions import TfDiagGaussian

        assert issubclass(action_distribution_cls, TfDiagGaussian), (
            "free_log_std is only supported for DiagGaussian action distributions. "
            "Found action distribution: {}.".format(action_distribution_cls)
        )
    else:
        raise ValueError(f"Framework {framework} not supported for free_log_std.")


class PPOCatalog(Catalog):
    """The Catalog class used to build models for PPO.

    PPOCatalog provides the following models:
        - ActorCriticEncoder: The encoder used to encode the observations.
        - Pi Head: The head used to compute the policy logits.
        - Value Function Head: The head used to compute the value function.

    The ActorCriticEncoder is a wrapper around Encoders to produce separate outputs
    for the policy and value function. See implementations of PPORLModule for
    more details.

    Any custom ActorCriticEncoder can be built by overriding the
    build_actor_critic_encoder() method. Alternatively, the ActorCriticEncoderConfig
    at PPOCatalog.actor_critic_encoder_config can be overridden to build a custom
    ActorCriticEncoder during RLModule runtime.

    Any custom head can be built by overriding the build_pi_head() and build_vf_head()
    methods. Alternatively, the PiHeadConfig and VfHeadConfig can be overridden to
    build custom heads during RLModule runtime.

    Any module built for exploration or inference is built with the flag
    `inference_only=True` and does not contain a value network. This flag can be set
    in the `SingleAgentModuleSpec` through the `inference_only` boolean flag.
    In case that the actor-critic-encoder is not shared between the policy and value
    function, the inference-only module will contain only the actor encoder network.
    """

    def __init__(
        self,
        observation_space: gym.Space,
        action_space: gym.Space,
        model_config_dict: dict,
    ):
        """Initializes the PPOCatalog.

        Args:
            observation_space: The observation space of the Encoder.
            action_space: The action space for the Pi Head.
            model_config_dict: The model config to use.
        """
        super().__init__(
            observation_space=observation_space,
            action_space=action_space,
            model_config_dict=model_config_dict,
        )

        # Replace EncoderConfig by ActorCriticEncoderConfig
        self.actor_critic_encoder_config = ActorCriticEncoderConfig(
            base_encoder_config=self._encoder_config,
            shared=self._model_config_dict["vf_share_layers"],
        )

        self.pi_and_vf_head_hiddens = self._model_config_dict["post_fcnet_hiddens"]
        self.pi_and_vf_head_activation = self._model_config_dict[
            "post_fcnet_activation"
        ]

        # We don't have the exact (framework specific) action dist class yet and thus
        # cannot determine the exact number of output nodes (action space) required.
        # -> Build pi config only in the `self.build_pi_head` method.
        self.pi_head_config = None

        self.vf_head_config = MLPHeadConfig(
            input_dims=self.latent_dims,
            hidden_layer_dims=self.pi_and_vf_head_hiddens,
            hidden_layer_activation=self.pi_and_vf_head_activation,
            output_layer_activation="linear",
            output_layer_dim=1,
        )

    @OverrideToImplementCustomLogic
    def build_actor_critic_encoder(self, framework: str) -> ActorCriticEncoder:
        """Builds the ActorCriticEncoder.

        The default behavior is to build the encoder from the encoder_config.
        This can be overridden to build a custom ActorCriticEncoder as a means of
        configuring the behavior of a PPORLModule implementation.

        Args:
            framework: The framework to use. Either "torch" or "tf2".

        Returns:
            The ActorCriticEncoder.
        """
        return self.actor_critic_encoder_config.build(framework=framework)

    @override(Catalog)
    def build_encoder(self, framework: str) -> Encoder:
        """Builds the encoder.

        Since PPO uses an ActorCriticEncoder, this method should not be implemented.
        """
        raise NotImplementedError(
            "Use PPOCatalog.build_actor_critic_encoder() instead for PPO."
        )

    @OverrideToImplementCustomLogic
    def build_pi_head(self, framework: str) -> Model:
        """Builds the policy head.

        The default behavior is to build the head from the pi_head_config.
        This can be overridden to build a custom policy head as a means of configuring
        the behavior of a PPORLModule implementation.

        Args:
            framework: The framework to use. Either "torch" or "tf2".

        Returns:
            The policy head.
        """
        # Get action_distribution_cls to find out about the output dimension for pi_head
        action_distribution_cls = self.get_action_dist_cls(framework=framework)
        if self._model_config_dict["free_log_std"]:
            _check_if_diag_gaussian(
                action_distribution_cls=action_distribution_cls, framework=framework
            )
        required_output_dim = action_distribution_cls.required_input_dim(
            space=self.action_space, model_config=self._model_config_dict
        )
        # Now that we have the action dist class and number of outputs, we can define
        # our pi-config and build the pi head.
        pi_head_config_class = (
            FreeLogStdMLPHeadConfig
            if self._model_config_dict["free_log_std"]
            else MLPHeadConfig
        )
        self.pi_head_config = pi_head_config_class(
            input_dims=self.latent_dims,
            hidden_layer_dims=self.pi_and_vf_head_hiddens,
            hidden_layer_activation=self.pi_and_vf_head_activation,
            output_layer_dim=required_output_dim,
            output_layer_activation="linear",
        )

        return self.pi_head_config.build(framework=framework)

    @OverrideToImplementCustomLogic
    def build_vf_head(self, framework: str) -> Model:
        """Builds the value function head.

        The default behavior is to build the head from the vf_head_config.
        This can be overridden to build a custom value function head as a means of
        configuring the behavior of a PPORLModule implementation.

        Args:
            framework: The framework to use. Either "torch" or "tf2".

        Returns:
            The value function head.
        """
        return self.vf_head_config.build(framework=framework)