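Create a Custom Environment¶

This page gives a short outline of how to create custom environments with Gymnasium. For a more complete tutorial that includes rendering, see the complete environment creation tutorial. Please read basic usage before reading this page.

We will implement a very simple game called GridWorldEnv, consisting of a two-dimensional square grid of fixed size. At each timestep, the agent can move vertically or horizontally between grid cells. Its goal is to navigate to a target that is placed randomly on the grid at the beginning of the episode.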

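Basic information about the game

Observations provide the locations of the target and the agent.
There are 4 discrete actions in our environment, corresponding to the movements "right", "up", "left", and "down".
The environment ends (terminates) when the agent has navigated to the grid cell where the target is located.
The agent is only rewarded when it reaches the target: the reward is 1 when the agent reaches the target and 0 otherwise.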

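Environment `__init__`¶

Like all environments, our custom environment will inherit from :class:`gymnasium.Env`, which defines the structure of an environment. One requirement for an environment is to define the observation and action spaces, which declare the general set of possible inputs (actions) and outputs (observations) of the environment. As outlined in the basic information about the game, our agent has four discrete actions, so we will use the Discrete(4) space with four options.

For the observation, there are a couple of options. In this tutorial, we will assume that an observation looks like {"agent": array([1, 0]), "target": array([0, 3])}, where the array elements are the x and y positions of the agent or the target. Alternative representations are a 2D grid whose values mark the agent and the target, or a 3D grid in which each "layer" holds only the agent or only the target. Given the representation chosen here, we declare the observation space as a :class:`Dict` whose agent and target entries are each a :class:`Box` that allows an integer-typed array.

For a full list of the spaces that can be used with an environment, see spaces.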

from typing import Optional
import numpy as np
import gymnasium as gym


class GridWorldEnv(gym.Env):

    def __init__(self, size: int = 5):
        # The size of the square grid
        self.size = size

        # Define the agent and target location; randomly chosen in `reset` and updated in `step`
        self._agent_location = np.array([-1, -1], dtype=np.int32)
        self._target_location = np.array([-1, -1], dtype=np.int32)

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is encoded as an element of {0, ..., `size`-1}^2
        self.observation_space = gym.spaces.Dict(
            {
                "agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = gym.spaces.Discrete(4)
        # Dictionary maps the abstract actions to the directions on the grid
        self._action_to_direction = {
            0: np.array([1, 0]),  # right
            1: np.array([0, 1]),  # up
            2: np.array([-1, 0]),  # left
            3: np.array([0, -1]),  # down
        }

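Constructing Observations¶

Since we need to compute observations in both :meth:`Env.reset` and :meth:`Env.step`, it is often convenient to have a method _get_obs that translates the environment's state into an observation. This is not mandatory, however: you can also compute the observations in :meth:`Env.reset` and :meth:`Env.step` separately.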

    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}

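We can implement a similar method for the auxiliary information returned by :meth:`Env.reset` and :meth:`Env.step`. In our case, we would like to provide the Manhattan distance between the agent and the target: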

    def _get_info(self):
        return {
            "distance": np.linalg.norm(
                self._agent_location - self._target_location, ord=1
            )
        }
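
As a worked example of the distance computation, the sketch below evaluates the same `np.linalg.norm(..., ord=1)` call on two fixed positions and compares it with the sum of absolute coordinate differences. The positions are chosen purely for illustration.

```python
import numpy as np

agent_location = np.array([4, 1])
target_location = np.array([2, 4])

# L1 norm of the difference, the same call used in `_get_info`
distance = np.linalg.norm(agent_location - target_location, ord=1)

# Equivalent by-hand computation: |4 - 2| + |1 - 4| = 2 + 3
by_hand = np.abs(agent_location - target_location).sum()

print(distance)  # 5.0
print(by_hand)   # 5
```

Note that `np.linalg.norm` returns a float even for integer inputs, which is why the distance is reported as `5.0`.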

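Often, info will also contain data that is only available inside the :meth:`Env.step` method (e.g., individual reward terms). In that case, we would have to update the dictionary returned by _get_info inside :meth:`Env.step`.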
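
One possible shape for such an update is sketched below in plain Python. Everything here is hypothetical: `get_info` stands in for `_get_info`, and `build_step_info`, the `"reward_terms"` key, and the `step_penalty` term are illustrative names that the GridWorld environment on this page does not define.

```python
def get_info(agent, target):
    """Stand-in for `_get_info`: Manhattan distance between two (x, y) pairs."""
    return {"distance": abs(agent[0] - target[0]) + abs(agent[1] - target[1])}


def build_step_info(agent, target, reward_terms):
    """Hypothetical helper: extend the base info with data known only in `step`."""
    info = get_info(agent, target)
    # "reward_terms" is an illustrative key, not something Gymnasium requires
    info["reward_terms"] = reward_terms
    return info


info = build_step_info((4, 1), (2, 4), {"goal": 0.0, "step_penalty": -0.01})
print(info["distance"])      # 5
print(info["reward_terms"])  # {'goal': 0.0, 'step_penalty': -0.01}
```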

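Reset function¶

The purpose of :meth:`reset` is to start a new episode for an environment. It takes two parameters: seed and options. The seed can be used to initialize the random number generator to a deterministic state, and options can be used to specify values used within reset. On the first line of reset, you need to call super().reset(seed=seed), which initializes the random number generator (:attr:`np_random`) used throughout the rest of :meth:`reset`.

In our custom environment, :meth:`reset` needs to randomly choose the agent's and the target's positions (resampling the target if the two coincide). :meth:`reset` returns a tuple of the initial observation and any auxiliary information, so we can use the methods _get_obs and _get_info that we implemented earlier: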

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # We will sample the target's location randomly until it does not coincide with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(
                0, self.size, size=2, dtype=int
            )

        observation = self._get_obs()
        info = self._get_info()

        return observation, info
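
To illustrate why seeding matters, the sketch below repeats the sampling logic of `reset` outside the environment. It assumes a plain `np.random.default_rng` generator as a stand-in for `self.np_random`, so the concrete positions will not match those produced by the environment; `sample_locations` is an illustrative helper name.

```python
import numpy as np


def sample_locations(rng: np.random.Generator, size: int = 5):
    """Mirror the sampling logic of `reset` outside of the environment."""
    agent = rng.integers(0, size, size=2, dtype=int)
    target = agent
    # Resample the target until it differs from the agent's location
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


# Two generators created with the same seed produce the same episode start
agent_a, target_a = sample_locations(np.random.default_rng(123))
agent_b, target_b = sample_locations(np.random.default_rng(123))

same_start = np.array_equal(agent_a, agent_b) and np.array_equal(target_a, target_b)
print(same_start)                          # True
print(np.array_equal(agent_a, target_a))   # False
```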

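Step function¶

The :meth:`step` method usually contains most of the logic of your environment. It accepts an action, computes the state of the environment after applying that action, and returns a tuple of the next observation, the resulting reward, whether the environment has terminated, whether the environment has been truncated, and auxiliary information.

For our environment, several things need to happen during the step function:

We use self._action_to_direction to convert the discrete action (e.g., 2) into a grid direction and apply it to the agent's location. To prevent the agent from leaving the grid, we clip the agent's location so that it stays within bounds.

We compute the agent's reward by checking whether the agent's current position equals the target's location.

Since the environment doesn't truncate internally (we can apply a time limit wrapper to the environment during :meth:`make`), we permanently set truncated to False.

We once again use _get_obs and _get_info to obtain the agent's observation and auxiliary information.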

    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid bounds
        self._agent_location = np.clip(
            self._agent_location + direction, 0, self.size - 1
        )

        # An environment is completed if and only if the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        truncated = False
        reward = 1 if terminated else 0  # the agent is only rewarded when it reaches the target
        observation = self._get_obs()
        info = self._get_info()

        return observation, reward, terminated, truncated, info
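
The effect of the clipping in `step` is easiest to see on a few concrete moves. The sketch below reuses the action-to-direction mapping on a grid of size 5; `move` is an illustrative helper, not a method of the environment.

```python
import numpy as np

size = 5
action_to_direction = {
    0: np.array([1, 0]),   # right
    1: np.array([0, 1]),   # up
    2: np.array([-1, 0]),  # left
    3: np.array([0, -1]),  # down
}


def move(agent_location, action):
    """Apply one action and clip the result to the grid, as `step` does."""
    return np.clip(agent_location + action_to_direction[action], 0, size - 1)


print(move(np.array([1, 2]), 1))  # [1 3]  ordinary move up
print(move(np.array([4, 2]), 0))  # [4 2]  already at the right edge, stays put
print(move(np.array([0, 0]), 3))  # [0 0]  already at the bottom edge, stays put
```

A consequence of this design is that an action taken against a wall is not an error: it simply leaves the agent where it was.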

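Registering and making the environment¶

While it is possible to use your new custom environment immediately, it is more common for environments to be initialized using :meth:`gymnasium.make`. In this section, we explain how to register a custom environment and then initialize it.

The environment ID consists of three components, two of which are optional: an optional namespace (here: gymnasium_env), a mandatory name (here: GridWorld), and an optional but recommended version (here: v0). It could also have been registered as GridWorld-v0 (the recommended approach), GridWorld, or gymnasium_env/GridWorld. The matching ID must then be used when creating the environment.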
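
The three-part structure of an ID can be made concrete with a small parser. This is only an illustrative sketch: `split_env_id` and its regular expression are written for this page and are not Gymnasium's own parsing code, which may accept or reject IDs differently.

```python
import re

# Optional "namespace/", mandatory name, optional "-v<number>" suffix
ENV_ID_PATTERN = re.compile(
    r"^(?:(?P<namespace>[\w:-]+)/)?(?P<name>[\w:.-]+?)(?:-v(?P<version>\d+))?$"
)


def split_env_id(env_id: str):
    """Split an ID into (namespace, name, version); missing parts are None."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        raise ValueError(f"Malformed environment ID: {env_id!r}")
    version = match.group("version")
    return (
        match.group("namespace"),
        match.group("name"),
        int(version) if version is not None else None,
    )


print(split_env_id("gymnasium_env/GridWorld-v0"))  # ('gymnasium_env', 'GridWorld', 0)
print(split_env_id("GridWorld-v0"))                # (None, 'GridWorld', 0)
print(split_env_id("gymnasium_env/GridWorld"))     # ('gymnasium_env', 'GridWorld', None)
print(split_env_id("GridWorld"))                   # (None, 'GridWorld', None)
```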

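The entry point can be a string or a function. As this tutorial isn't part of a Python project, we cannot use a string here, but for most environments a string is the normal way of specifying the entry point.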

gym.register(
    id="gymnasium_env/GridWorld-v0",
    entry_point=GridWorldEnv,
)

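For a more complete guide on registering custom environments (including with a string entry point), please read the complete environment creation tutorial.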
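
A string entry point typically has the form "module.path:ClassName". The sketch below shows, in plain Python, how such a string can be resolved to the object it names. `load_entry_point` is an illustrative helper and not Gymnasium's loader, and `my_package.envs:GridWorldEnv` in the comment is a hypothetical path.

```python
import importlib


def load_entry_point(entry_point: str):
    """Resolve a "module.path:AttributeName" string to the object it names."""
    module_name, attr_name = entry_point.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# In a real project the string might look like "my_package.envs:GridWorldEnv"
# (hypothetical). A standard-library target is used here so the snippet runs anywhere.
ordered_dict_cls = load_entry_point("collections:OrderedDict")
print(ordered_dict_cls.__name__)  # OrderedDict
```

The advantage of the string form is that the module containing the environment is only imported when the environment is actually created.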

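Once the environment is registered, you can check it via :meth:`gymnasium.pprint_registry`, which outputs all registered environments, and the environment can then be initialized using :meth:`gymnasium.make`. A vectorized version of the environment, with multiple instances of the same environment running in parallel, can be instantiated with :meth:`gymnasium.make_vec`.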

>>> import gymnasium as gym
>>> gym.make("gymnasium_env/GridWorld-v0")
<OrderEnforcing<PassiveEnvChecker<GridWorldEnv<gymnasium_env/GridWorld-v0>>>>
>>> gym.make("gymnasium_env/GridWorld-v0", max_episode_steps=100)
<TimeLimit<OrderEnforcing<PassiveEnvChecker<GridWorldEnv<gymnasium_env/GridWorld-v0>>>>>
>>> env = gym.make("gymnasium_env/GridWorld-v0", size=10)
>>> env.unwrapped.size
10
>>> gym.make_vec("gymnasium_env/GridWorld-v0", num_envs=3)
SyncVectorEnv(gymnasium_env/GridWorld-v0, num_envs=3)

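Using Wrappers¶

Oftentimes, we want to use different variants of a custom environment, or we want to modify the behavior of an environment that is provided by Gymnasium or some other party. Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. For details on how to use wrappers and how to implement your own, check out the wrappers documentation.

In our example, observations cannot be used directly in learning code because they are dictionaries. However, we don't need to touch our environment implementation to fix this. We can simply add a wrapper on top of the environment instance that flattens observations into a single array: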

>>> from gymnasium.wrappers import FlattenObservation

>>> env = gym.make('gymnasium_env/GridWorld-v0')
>>> env.observation_space
Dict('agent': Box(0, 4, (2,), int64), 'target': Box(0, 4, (2,), int64))
>>> env.reset()
({'agent': array([4, 1]), 'target': array([2, 4])}, {'distance': 5.0})
>>> wrapped_env = FlattenObservation(env)
>>> wrapped_env.observation_space
Box(0, 4, (4,), int64)
>>> wrapped_env.reset()
(array([3, 0, 2, 1]), {'distance': 2.0})