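Note

Ray 2.10.0 introduces the alpha stage of RLlib's "new API stack". The Ray team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the "old API stack" (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the "new API stack", and that both continue to run with the old APIs by default. You can continue to use your existing custom (old stack) classes.

See the new API stack documentation for more details on how to use it.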

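Saving and Loading your RL Algorithms and Policies

You can use Checkpoint objects to store and load the current state of your Algorithm or Policy and the neural networks (weights) within these structures. The following sections describe how to create these checkpoints (and thereby save your Algorithms and Policies) to disk, where to find them, and how to recover (load) your Algorithm or Policy from a given checkpoint.

What's a checkpoint?

A checkpoint is a set of information, located inside a directory (which may contain further subdirectories), that is used to restore either an Algorithm or a single Policy instance. The Algorithm or Policy instance that was originally used to create the checkpoint may or may not have been trained prior to this.

RLlib uses the Checkpoint class to create checkpoints and restore objects from them.

The main file in a checkpoint directory, which contains the state information, is currently generated using Ray's cloudpickle package. Because cloudpickle output isn't stable across Python versions, we are currently experimenting with msgpack (and msgpack_numpy) as an alternative checkpoint format. If you are interested in generating Python-version-independent checkpoints, see the details below.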

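Algorithm checkpoints

An Algorithm checkpoint contains all of the Algorithm's state, including its configuration, its actual Algorithm subclass, all of its Policies' weights, its current counters, etc.

Restoring a new Algorithm from such a checkpoint leaves you in a state where you can continue working with that new Algorithm exactly as you would have continued with the old Algorithm (from which the checkpoint was taken).

How do I create an Algorithm checkpoint?

The Algorithm save() method creates a new checkpoint (a directory with files in it).

Let's take a look at a simple example of how to create such an Algorithm checkpoint: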

# Create a PPO algorithm object using a config object ..
from ray.rllib.algorithms.ppo import PPOConfig

my_ppo_config = PPOConfig().environment("CartPole-v1")
my_ppo = my_ppo_config.build()

# .. train one iteration ..
my_ppo.train()
# .. and call `save()` to create a checkpoint.
save_result = my_ppo.save()
path_to_checkpoint = save_result.checkpoint.path
print(
    "An Algorithm checkpoint has been created inside directory: "
    f"'{path_to_checkpoint}'."
)

# Let's terminate the algo for demonstration purposes.
my_ppo.stop()
# Doing this will lead to an error.
# my_ppo.train()

If you take a look at the directory returned by the save() call, you should see something like this:

$ ls -la
  .
  ..
  policies/
  algorithm_state.pkl
  rllib_checkpoint.json

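As you can see, a policies subdirectory has been created for us (more on that later), along with an algorithm_state.pkl file and an rllib_checkpoint.json file. The algorithm_state.pkl file contains all of the Algorithm's state information that isn't specific to a Policy, such as the Algorithm's counters and other important variables it needs to keep track of. The rllib_checkpoint.json file contains the checkpoint version, provided for the user's convenience. From Ray RLlib 2.0 and up, all checkpoint versions will be backward compatible, meaning an RLlib version V will be able to handle any checkpoint created with Ray 2.0 or any version up to V.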

$ more rllib_checkpoint.json
{"type": "Algorithm", "checkpoint_version": "1.0"}
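The info file shown above is plain JSON, so it can be read with the standard library alone. The following is a minimal sketch, not an RLlib utility: the helper name `read_checkpoint_info` is hypothetical, and the demo writes a stand-in file with the same contents as the listing above instead of using a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def read_checkpoint_info(checkpoint_dir):
    """Return the parsed contents of `rllib_checkpoint.json` in `checkpoint_dir`."""
    info_file = Path(checkpoint_dir) / "rllib_checkpoint.json"
    with info_file.open() as f:
        return json.load(f)


# Demo on a stand-in directory that mimics the file shown above.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "rllib_checkpoint.json").write_text(
        '{"type": "Algorithm", "checkpoint_version": "1.0"}'
    )
    info = read_checkpoint_info(demo_dir)

print(info["type"], info["checkpoint_version"])
```

With a real checkpoint, you would pass the directory returned by save() instead of the temporary directory.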

Now, let's take a look at the policies/ subdirectory:

$ cd policies
$ ls -la
  .
  ..
  default_policy/

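We can see yet another subdirectory, called default_policy. RLlib creates one subdirectory inside the policies/ directory for each Policy instance that the Algorithm uses. In the standard single-agent case, this is "default_policy". Note that "default_policy" is the so-called PolicyID. In the multi-agent case, depending on your particular setup and environment, you might see several subdirectories here with different names (the PolicyIDs of the different policies). For example, if you are training two Policies with the IDs "policy_1" and "policy_2", you should see the subdirectories: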

$ ls -la
  .
  ..
  policy_1/
  policy_2/

Finally, let's quickly take a look at our default_policy subdirectory:

$ cd default_policy
$ ls -la
  .
  ..
  rllib_checkpoint.json
  policy_state.pkl

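Similar to the Algorithm's state (saved in algorithm_state.pkl), a Policy's state is stored in the policy_state.pkl file. We'll cover the contents of this file in more detail when discussing Policy checkpoints below. Note that the Policy checkpoint also has an info file (rllib_checkpoint.json), whose version always matches that of the enclosing Algorithm checkpoint.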
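Because every Policy gets its own subdirectory under policies/, the PolicyIDs stored in a checkpoint can be discovered by listing that directory. The sketch below assumes only the directory layout described above; the helper name `list_policy_ids` is hypothetical, and the demo builds an empty stand-in layout rather than a real checkpoint.

```python
import tempfile
from pathlib import Path


def list_policy_ids(checkpoint_dir):
    """Return the sorted names of all subdirectories of `policies/`."""
    policies_dir = Path(checkpoint_dir) / "policies"
    return sorted(p.name for p in policies_dir.iterdir() if p.is_dir())


# Demo: recreate the two-policy layout from the multi-agent listing above.
with tempfile.TemporaryDirectory() as demo_dir:
    for policy_id in ("policy_1", "policy_2"):
        (Path(demo_dir) / "policies" / policy_id).mkdir(parents=True)
    policy_ids = list_policy_ids(demo_dir)

print(policy_ids)
```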

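Checkpoints are Python-version specific, but can be converted to be version independent

Algorithm checkpoints created via the save() method are always cloudpickle-based and thus depend on the Python version used. This means there is no guarantee that you can use a checkpoint created with Python 3.8 to restore an Algorithm in a new environment that runs Python 3.9.

However, we now provide a utility for converting a checkpoint (generated with Algorithm.save()) into a Python-version-independent checkpoint (based on msgpack). You can then use the newly converted msgpack checkpoint to restore another Algorithm instance from it. The following short example shows how to do this: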

import tempfile

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.utils.checkpoints import convert_to_msgpack_checkpoint


# Base config used for both pickle-based checkpoint and msgpack-based one.
config = DQNConfig().environment("CartPole-v1")
# Build algorithm object.
algo1 = config.build()

# Create standard (pickle-based) checkpoint.
with tempfile.TemporaryDirectory() as pickle_cp_dir:
    # Note: `save()` always creates a pickle based checkpoint.
    algo1.save(checkpoint_dir=pickle_cp_dir)

    # But we can convert this pickle checkpoint to a msgpack one using an RLlib utility
    # function.
    with tempfile.TemporaryDirectory() as msgpack_cp_dir:
        convert_to_msgpack_checkpoint(pickle_cp_dir, msgpack_cp_dir)

        # Try recreating a new algorithm object from the msgpack checkpoint.
        # Note: `Algorithm.from_checkpoint` now works with both pickle AND msgpack
        # type checkpoints.
        algo2 = Algorithm.from_checkpoint(msgpack_cp_dir)

# algo1 and algo2 are now identical.

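This way, you can continue to run your Algorithms and save() them occasionally or, if you are running trials with Ray Tune, use Tune's integrated checkpointing settings. As before, this produces cloudpickle-based checkpoints. Once you need to migrate to a higher (or lower) Python version, use the convert_to_msgpack_checkpoint() utility, create a msgpack-based checkpoint, and either pass it to Algorithm.from_checkpoint() or supply it to your Tune config. RLlib is now able to recreate Algorithms from both of these formats.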
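Since pickle-based checkpoints depend on the Python version, it can help to know which interpreter wrote a checkpoint before trying to restore it. The following is one possible approach, not something RLlib does for you: the sidecar file name `python_version.json` and both helper functions are hypothetical, and the demo uses an empty temporary directory in place of a real checkpoint.

```python
import json
import sys
import tempfile
from pathlib import Path

# Hypothetical sidecar file; RLlib doesn't create or read it.
VERSION_FILE = "python_version.json"


def record_python_version(checkpoint_dir):
    """Store the running interpreter's major/minor version next to the checkpoint."""
    major, minor = sys.version_info[:2]
    payload = json.dumps({"major": major, "minor": minor})
    (Path(checkpoint_dir) / VERSION_FILE).write_text(payload)


def same_python_version(checkpoint_dir):
    """Return True if the recorded version matches the running interpreter."""
    recorded = json.loads((Path(checkpoint_dir) / VERSION_FILE).read_text())
    return (recorded["major"], recorded["minor"]) == tuple(sys.version_info[:2])


with tempfile.TemporaryDirectory() as demo_dir:
    record_python_version(demo_dir)
    matches = same_python_version(demo_dir)

    # Simulate a checkpoint that was written by a different interpreter.
    (Path(demo_dir) / VERSION_FILE).write_text(json.dumps({"major": 2, "minor": 7}))
    mismatch_detected = not same_python_version(demo_dir)

print(matches, mismatch_detected)
```

If such a check reports a mismatch, the msgpack conversion described above would be the way to move the checkpoint across Python versions.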

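How do I restore an Algorithm from a checkpoint?

Given our checkpoint path (returned by Algorithm.save()), we can now create a completely new Algorithm instance and make it exactly the same as the one we stopped (and can thus no longer use) in the example above: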

from ray.rllib.algorithms.algorithm import Algorithm

# Use the Algorithm's `from_checkpoint` utility to get a new algo instance
# that has the exact same state as the old one, from which the checkpoint was
# created in the first place:
my_new_ppo = Algorithm.from_checkpoint(path_to_checkpoint)

# Continue training.
my_new_ppo.train()

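Alternatively, you could also first create a new Algorithm instance using the same config that you used for the original Algorithm, and only then call the new Algorithm's restore() method, passing it the checkpoint: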

# Re-build a fresh algorithm.
my_new_ppo = my_ppo_config.build()

# Restore the old (checkpointed) state.
my_new_ppo.restore(save_result)

# Continue training.
my_new_ppo.train()

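The above procedure used to be the only way of restoring an Algorithm. However, it is more tedious than using the from_checkpoint() utility, because it requires an extra step and you have to keep your original config stored somewhere.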

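Which Algorithm checkpoint versions can I use?

RLlib uses simple checkpoint versions (for example v0.1 or v1.0) to figure out how to restore an Algorithm (or a Policy; see below) from a given checkpoint directory.

Starting with Ray 2.1, you can find the checkpoint version written in the rllib_checkpoint.json file at the top level of your checkpoint directory. RLlib doesn't use this file or the information in it; it exists solely for the user's convenience.

From Ray RLlib 2.0 and up, all checkpoint versions will be backward compatible, meaning some RLlib version 2.x will be able to handle any checkpoint created by RLlib 2.0 or any version up to 2.x.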
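If you want to compare checkpoint versions in your own tooling, comparing them as strings can go wrong ("1.10" sorts before "1.9"), so it is safer to compare them as tuples of integers. This is a small sketch under the assumption that versions look like "v0.1" or "1.0", as in the examples above; both function names are hypothetical and not part of RLlib.

```python
def parse_checkpoint_version(version):
    """Turn a string such as "v1.0" or "1.0" into a tuple of ints, e.g. (1, 0)."""
    return tuple(int(part) for part in version.lstrip("vV").split("."))


def is_at_least(version, minimum):
    """Return True if `version` is the same as or newer than `minimum`."""
    return parse_checkpoint_version(version) >= parse_checkpoint_version(minimum)


print(parse_checkpoint_version("v0.1"))
print(is_at_least("1.0", "0.1"))
print(is_at_least("1.10", "1.9"))
```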

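Multi-agent Algorithm checkpoints

In case you are working with a multi-agent setup and have more than one Policy to train inside your Algorithm, you can create an Algorithm checkpoint in the exact same way as described above and will find your individual Policy checkpoints inside the policies/ subdirectory.

For example: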

import os

# Use our example multi-agent CartPole environment to train in.
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole

# Set up a multi-agent Algorithm, training two policies independently.
my_ma_config = PPOConfig().multi_agent(
    # Which policies should RLlib create and train?
    policies={"pol1", "pol2"},
    # Let RLlib know, which agents in the environment (we'll have "agent1"
    # and "agent2") map to which policies.
    policy_mapping_fn=(
        lambda agent_id, episode, worker, **kw: (
            "pol1" if agent_id == "agent1" else "pol2"
        )
    ),
    # Setting these isn't necessary. All policies will always be trained by default.
    # However, since we do provide a list of IDs here, we need to remain in charge of
    # changing this `policies_to_train` list, should we ever alter the Algorithm
    # (e.g. remove one of the policies or add a new one).
    policies_to_train=["pol1", "pol2"],  # Again, `None` would be totally fine here.
)

# Add the MultiAgentCartPole env to our config and build our Algorithm.
my_ma_config.environment(
    MultiAgentCartPole,
    env_config={
        "num_agents": 2,
    },
)

my_ma_algo = my_ma_config.build()
my_ma_algo.train()

ma_checkpoint_dir = my_ma_algo.save().checkpoint.path

print(
    "An Algorithm checkpoint has been created inside directory: "
    f"'{ma_checkpoint_dir}'.\n"
    "Individual Policy checkpoints can be found in "
    f"'{os.path.join(ma_checkpoint_dir, 'policies')}'."
)

# Create a new Algorithm instance from the above checkpoint, just as you would for
# a single-agent setup:
my_ma_algo_clone = Algorithm.from_checkpoint(ma_checkpoint_dir)

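Assuming you would like to restore all policies within the checkpoint, you would do so just as described above for the single-agent case (via algo = Algorithm.from_checkpoint([path to your multi-agent checkpoint])).

However, there may be a situation where you have so many policies in your Algorithm (for example, you are doing league-based training) that you would like to restore a new Algorithm instance from your checkpoint, but only include some of the original policies in this new Algorithm object. In this case, you can also do: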

# Here, we use the same (multi-agent Algorithm) checkpoint as above, but only restore
# it with the first Policy ("pol1").

my_ma_algo_only_pol1 = Algorithm.from_checkpoint(
    ma_checkpoint_dir,
    # Tell the `from_checkpoint` util to create a new Algo, but only with "pol1" in it.
    policy_ids=["pol1"],
    # Make sure to update the mapping function (we must not map to "pol2" anymore
    # to avoid a runtime error). Now both agents ("agent0" and "agent1") map to
    # the same policy.
    policy_mapping_fn=lambda agent_id, episode, worker, **kw: "pol1",
    # Since we defined this above, we have to re-define it here with the updated
    # PolicyIDs, otherwise, RLlib will throw an error (it will think that there is an
    # unknown PolicyID in this list ("pol2")).
    policies_to_train=["pol1"],
)

# Make sure, pol2 isn't in this Algorithm anymore.
assert my_ma_algo_only_pol1.get_policy("pol2") is None

# Continue training (only with pol1).
my_ma_algo_only_pol1.train()

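Policy checkpoints

We have already looked at the policies/ subdirectory inside the Algorithm checkpoint directory and learned that the individual policies inside the Algorithm store all of their state information under their PolicyID inside that subdirectory. Thus, we now have the entire picture of a checkpoint: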

.
..
.is_checkpoint
.tune_metadata

algorithm_state.pkl        # <- state of the Algorithm (excluding Policy states)
rllib_checkpoint.json      # <- checkpoint info, such as checkpoint version, e.g. "1.0"

policies/
  policy_A/
    policy_state.pkl       # <- state of policy_A
    rllib_checkpoint.json  # <- checkpoint info, such as checkpoint version, e.g. "1.0"

  policy_B/
    policy_state.pkl       # <- state of policy_B
    rllib_checkpoint.json  # <- checkpoint info, such as checkpoint version, e.g. "1.0"
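A quick structural check against this layout can catch incomplete copies of a checkpoint before you try to restore from it. The sketch below only checks for the file names shown in the listing above (the pickle-based layout); a msgpack-based checkpoint may use different file names, and the helper `missing_checkpoint_files` is hypothetical, not an RLlib function.

```python
import tempfile
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir):
    """Return the relative paths of expected files that don't exist."""
    root = Path(checkpoint_dir)
    expected = [root / "algorithm_state.pkl", root / "rllib_checkpoint.json"]
    policies_dir = root / "policies"
    if policies_dir.is_dir():
        # Each Policy subdirectory should hold a state file and an info file.
        for policy_dir in sorted(policies_dir.iterdir()):
            if policy_dir.is_dir():
                expected.append(policy_dir / "policy_state.pkl")
                expected.append(policy_dir / "rllib_checkpoint.json")
    else:
        expected.append(policies_dir)
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


# Demo: build a stand-in layout in which policy_B lacks its state file.
with tempfile.TemporaryDirectory() as demo_dir:
    root = Path(demo_dir)
    for name in ("algorithm_state.pkl", "rllib_checkpoint.json"):
        (root / name).touch()
    for policy_id in ("policy_A", "policy_B"):
        policy_dir = root / "policies" / policy_id
        policy_dir.mkdir(parents=True)
        (policy_dir / "rllib_checkpoint.json").touch()
    (root / "policies" / "policy_A" / "policy_state.pkl").touch()
    missing = missing_checkpoint_files(demo_dir)

print(missing)
```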

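How do I create a Policy checkpoint?

You can create a Policy checkpoint by either calling save() on your Algorithm, which saves each individual Policy's checkpoint under the policies/ subdirectory as described above, or, if you need more fine-grained control, by doing the following: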

# Retrieve the Policy object from an Algorithm.
# Note that for normal, single-agent Algorithms, the Policy ID is "default_policy".
policy1 = my_ma_algo.get_policy(policy_id="pol1")

# Tell RLlib to store an individual policy checkpoint (only for "pol1") inside
# /tmp/my_policy_checkpoint
policy1.export_checkpoint("/tmp/my_policy_checkpoint")

If you now check out the provided directory (/tmp/my_policy_checkpoint/), you should see the following files in there:

.
..
rllib_checkpoint.json   # <- checkpoint info, such as checkpoint version, e.g. "1.0"
policy_state.pkl        # <- state of "pol1"

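How do I restore from a Policy checkpoint?

Assume you would like to serve your trained policies in production and would therefore like to use only the RLlib Policy instance, without all the other functionality that normally comes with the Algorithm object, such as the different RolloutWorkers for collecting training samples or for evaluation (both of which include copies of the RL environment), etc.

In this case, it would be quite useful if you could restore just the Policy, either from a Policy checkpoint or from an Algorithm checkpoint, which, as we learned above, contains all of its Policies' checkpoints.

Here is how you can do this: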

import numpy as np

from ray.rllib.policy.policy import Policy

# Use the `from_checkpoint` utility of the Policy class:
my_restored_policy = Policy.from_checkpoint("/tmp/my_policy_checkpoint")

# Use the restored policy for serving actions.
obs = np.array([0.0, 0.1, 0.2, 0.3])  # individual CartPole observation
# `compute_single_action` returns (action, state_outs, extra_info); keep the action.
action, _, _ = my_restored_policy.compute_single_action(obs)

print(f"Computed action {action} from given CartPole observation.")
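When serving, you typically call the restored Policy once per incoming observation. The sketch below shows such a loop without requiring Ray: `ConstantPolicy` is a hypothetical stand-in for the restored Policy, and the code assumes that compute_single_action returns an (action, state_outs, extra_info) tuple, as the old-API-stack Policy class does.

```python
class ConstantPolicy:
    """Hypothetical stand-in for a restored Policy; it always picks action 0."""

    def compute_single_action(self, obs):
        # Mimics the assumed (action, state_outs, extra_info) return value.
        return 0, [], {}


def serve_actions(policy, observations):
    """Return one action per observation, dropping state outputs and extras."""
    actions = []
    for obs in observations:
        action, _state_outs, _extra_info = policy.compute_single_action(obs)
        actions.append(action)
    return actions


# Two made-up CartPole-like observations.
observations = [[0.0, 0.1, 0.2, 0.3], [0.1, 0.0, -0.2, 0.3]]
actions = serve_actions(ConstantPolicy(), observations)
print(actions)
```

With a real checkpoint, you would pass the object returned by Policy.from_checkpoint() in place of the stand-in.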

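How do I restore a multi-agent Algorithm with a subset of the original policies?

Imagine you have trained a multi-agent Algorithm with, say, 100 different Policies and created a checkpoint from this Algorithm. The checkpoint now includes 100 subdirectories in the policies/ dir, named after the different PolicyIDs.

After careful evaluation of the different policies, you would like to restore the Algorithm and continue training it, but only with a subset of the original 100 policies, for example only with the policies whose PolicyIDs are "polA" and "polB".

You can use the original checkpoint (with the 100 policies in it) and the Algorithm.from_checkpoint() utility to achieve this in an efficient way.

This example shows how to shrink your policy set from the original five policies to only two: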

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole

# Set up an Algorithm with 5 Policies.
algo_w_5_policies = (
    PPOConfig()
    .environment(
        env=MultiAgentCartPole,
        env_config={
            "num_agents": 5,
        },
    )
    .multi_agent(
        policies={"pol0", "pol1", "pol2", "pol3", "pol4"},
        # Map "agent0" -> "pol0", etc...
        policy_mapping_fn=(
            lambda agent_id, episode, worker, **kwargs: f"pol{agent_id}"
        ),
    )
    .build()
)

# .. train one iteration ..
algo_w_5_policies.train()
# .. and call `save()` to create a checkpoint.
path_to_checkpoint = algo_w_5_policies.save().checkpoint.path
print(
    "An Algorithm checkpoint has been created inside directory: "
    f"'{path_to_checkpoint}'. It should contain 5 policies in the 'policies/' sub dir."
)
# Let's terminate the algo for demonstration purposes.
algo_w_5_policies.stop()

# We will now recreate a new algo from this checkpoint, but only with 2 of the
# original policies ("pol0" and "pol1"). Note that this will require us to change the
# `policy_mapping_fn` (instead of mapping 5 agents to 5 policies, we now have
# to map 5 agents to only 2 policies).


def new_policy_mapping_fn(agent_id, episode, worker, **kwargs):
    return "pol0" if agent_id in ["agent0", "agent1"] else "pol1"


algo_w_2_policies = Algorithm.from_checkpoint(
    checkpoint=path_to_checkpoint,
    policy_ids={"pol0", "pol1"},  # <- restore only those policy IDs here.
    policy_mapping_fn=new_policy_mapping_fn,  # <- use this new mapping fn.
)

# Test, whether we can train with this new setup.
algo_w_2_policies.train()
# Terminate the new algo.
algo_w_2_policies.stop()

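Note that we had to change the original policy_mapping_fn, which mapped "agent0" to "pol0", "agent1" to "pol1", and so on, to a new function that maps our five agents to only the two remaining policies: "agent0" and "agent1" map to "pol0", and all other agents map to "pol1".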
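Before restoring with a reduced policy set, you can check in plain Python that a mapping function never returns a PolicyID that is no longer available. This sketch assumes agent IDs of the form "agent0" to "agent4", as in the description above (your environment's actual agent IDs may differ); `old_policy_mapping_fn` and `agents_with_unknown_policy` are hypothetical names used only for illustration.

```python
def old_policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # "agent3" -> "pol3", mirroring the original one-policy-per-agent mapping.
    return "pol" + agent_id[len("agent"):]


def new_policy_mapping_fn(agent_id, episode, worker, **kwargs):
    return "pol0" if agent_id in ["agent0", "agent1"] else "pol1"


def agents_with_unknown_policy(mapping_fn, agent_ids, policy_ids):
    """Return {agent_id: policy_id} for agents mapped outside `policy_ids`."""
    result = {}
    for agent_id in agent_ids:
        # `episode` and `worker` aren't needed by these mapping functions.
        policy_id = mapping_fn(agent_id, None, None)
        if policy_id not in policy_ids:
            result[agent_id] = policy_id
    return result


agent_ids = [f"agent{i}" for i in range(5)]
remaining_policies = {"pol0", "pol1"}

bad_old = agents_with_unknown_policy(
    old_policy_mapping_fn, agent_ids, remaining_policies
)
bad_new = agents_with_unknown_policy(
    new_policy_mapping_fn, agent_ids, remaining_policies
)
print(bad_old)
print(bad_new)
```

An empty result for the new function indicates that every agent maps to one of the restored policies.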

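Model exports

Apart from creating checkpoints for your RLlib objects (such as an RLlib Algorithm or an individual RLlib Policy), it may also be very useful to export only your NN models in their native (non-RLlib-dependent) format, for example as a Keras or PyTorch model. You could then use the trained NN models outside of RLlib, for example for serving purposes in your production environments.

How do I export my NN model?

There are several ways of creating Keras or PyTorch native model "exports".

Here is the example code that illustrates these: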

from ray.rllib.algorithms.ppo import PPOConfig

# Create a new Algorithm (which contains a Policy, which contains a NN Model).
# Switch on for native models to be included in the Policy checkpoints.
ppo_config = (
    PPOConfig().environment("Pendulum-v1").checkpointing(export_native_model_files=True)
)

# The default framework is TensorFlow, but if you would like to do this example with
# PyTorch, uncomment the following line of code:
# ppo_config.framework("torch")

# Create the Algorithm and train one iteration.
ppo = ppo_config.build()
ppo.train()

# Get the underlying PPOTF1Policy (or PPOTorchPolicy) object.
ppo_policy = ppo.get_policy()

We can now export the Keras NN model (the one used by the PPOTF1Policy in the PPO Algorithm) to disk ...

Using the Policy object:

ppo_policy.export_model("/tmp/my_nn_model")
# .. check /tmp/my_nn_model/ for the model files.

# For Keras, you should be able to recover the model via:
# keras_model = tf.saved_model.load("/tmp/my_nn_model/")
# And pass in a Pendulum-v1 observation:
# results = keras_model(tf.convert_to_tensor(
#     np.array([[0.0, 0.1, 0.2]]), dtype=np.float32)
# )

# For PyTorch, do:
# pytorch_model = torch.load("/tmp/my_nn_model/model.pt")
# results = pytorch_model(
#     input_dict={
#         "obs": torch.from_numpy(np.array([[0.0, 0.1, 0.2]], dtype=np.float32)),
#     },
#     state=[torch.tensor(0)],  # dummy value
#     seq_lens=torch.tensor(0),  # dummy value
# )

Via the Policy's checkpointing method:

checkpoint_dir = ppo_policy.export_checkpoint("/tmp/ppo_policy")
# .. check /tmp/ppo_policy/model/ for the model files.
# You should be able to recover the keras model via:
# keras_model = tf.saved_model.load("/tmp/ppo_policy/model")
# And pass in a Pendulum-v1 observation:
# results = keras_model(tf.convert_to_tensor(
#     np.array([[0.0, 0.1, 0.2]]), dtype=np.float32)
# )

Via the Algorithm (Policy) checkpoint:

checkpoint_dir = ppo.save().checkpoint.path
# .. check `checkpoint_dir` for the Algorithm checkpoint files.
# For keras you should be able to recover the model via:
# keras_model = tf.saved_model.load(checkpoint_dir + "/policies/default_policy/model/")
# And pass in a Pendulum-v1 observation
# results = keras_model(tf.convert_to_tensor(
#     np.array([[0.0, 0.1, 0.2]]), dtype=np.float32)
# )

And what about exporting my NN models in ONNX format?

RLlib also supports exporting your NN models in the ONNX format. For that, use the Policy export_model method, but provide the extra onnx arg as follows:

# Using the same Policy object, we can also export our NN Model in the ONNX format.
# The `onnx` argument takes the ONNX opset version to export with (11 is an
# example value).
ppo_policy.export_model("/tmp/my_nn_model", onnx=11)