Wrappers¶
- class gymnasium.vector.VectorWrapper(env: VectorEnv)[source]¶
Wraps the vectorized environment to allow a modular transformation.
This class is the base class for all wrappers of vectorized environments. Subclasses can override some methods to change the behavior of the original vectorized environment without touching the original code.
Note
If the subclass overrides :meth:`__init__`, don't forget to call super().__init__(env).
- Parameters:
env – The environment to wrap
- step(actions: ActType) → tuple[ObsType, ArrayType, ArrayType, ArrayType, dict[str, Any]] [source]¶
Steps through all the environments using the actions, returning the batched data.
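A rough sketch of a custom subclass that post-processes the batched step results; this is only an illustration (ClipRewardVectorWrapper is a hypothetical name, and the five-tuple return of step is assumed to match the signature above):
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.vector import VectorWrapper
>>> class ClipRewardVectorWrapper(VectorWrapper):
...     def __init__(self, env):
...         super().__init__(env)  # remember to call the parent constructor
...     def step(self, actions):
...         obs, rewards, terminations, truncations, infos = self.env.step(actions)
...         return obs, np.clip(rewards, -1.0, 1.0), terminations, truncations, infos
...
>>> envs = ClipRewardVectorWrapper(gym.make_vec("CartPole-v1", num_envs=3))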
- class gymnasium.vector.VectorObservationWrapper(env: VectorEnv)[source]¶
Wraps the vectorized environment to allow a modular transformation of the observation.
Equivalent of :class:`gymnasium.ObservationWrapper` for vectorized environments.
- Parameters:
env – The environment to wrap
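A minimal sketch of a subclass, assuming the observations hook receives and returns the whole batched observation array (ScaleObservations is a hypothetical name):
>>> import gymnasium as gym
>>> from gymnasium.vector import VectorObservationWrapper
>>> class ScaleObservations(VectorObservationWrapper):
...     def observations(self, observations):
...         return observations * 0.1  # applied to the whole batch at once
...
>>> envs = ScaleObservations(gym.make_vec("CartPole-v1", num_envs=3))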
- class gymnasium.vector.VectorActionWrapper(env: VectorEnv)[source]¶
Wraps the vectorized environment to allow a modular transformation of the actions.
Equivalent of :class:`gymnasium.ActionWrapper` for vectorized environments.
- Parameters:
env – The environment to wrap
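A minimal sketch of a subclass, assuming the actions hook transforms the batched actions before they reach the wrapped environments (ClipActions is a hypothetical name):
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.vector import VectorActionWrapper
>>> class ClipActions(VectorActionWrapper):
...     def actions(self, actions):
...         return np.clip(actions, self.action_space.low, self.action_space.high)
...
>>> envs = ClipActions(gym.make_vec("MountainCarContinuous-v0", num_envs=3))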
- class gymnasium.vector.VectorRewardWrapper(env: VectorEnv)[source]¶
Wraps the vectorized environment to allow a modular transformation of the reward.
Equivalent of :class:`gymnasium.RewardWrapper` for vectorized environments.
- Parameters:
env – The environment to wrap
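A minimal sketch of a subclass, assuming the rewards hook receives and returns the batched reward array (SignReward is a hypothetical name):
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.vector import VectorRewardWrapper
>>> class SignReward(VectorRewardWrapper):
...     def rewards(self, rewards):
...         return np.sign(rewards)  # keep only the sign of each sub-environment's reward
...
>>> envs = SignReward(gym.make_vec("CartPole-v1", num_envs=3))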
Vector Only Wrappers¶
- class gymnasium.wrappers.vector.DictInfoToList(env: VectorEnv)[source]¶
Converts the infos of vectorized environments from dict to List[dict].
This wrapper converts the info format of a vector environment from a dictionary to a list of dictionaries. It is intended to be used around vectorized environments. If using other wrappers that perform operations on info, such as RecordEpisodeStatistics, this needs to be the outermost wrapper, i.e. DictInfoToList(RecordEpisodeStatistics(vector_env)).
Example
>>> import numpy as np
>>> dict_info = {
...     "k": np.array([0., 0., 0.5, 0.3]),
...     "_k": np.array([False, False, True, True])
... }
...
>>> list_info = [{}, {}, {"k": 0.5}, {"k": 0.3}]
- Example for vector environments:
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.spaces import Dict, Box
>>> envs = gym.make_vec("CartPole-v1", num_envs=3)
>>> obs, info = envs.reset(seed=123)
>>> info
{}
>>> envs = DictInfoToList(envs)
>>> obs, info = envs.reset(seed=123)
>>> info
[{}, {}, {}]
- Another example for vector environments:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("HalfCheetah-v4", num_envs=3)
>>> _ = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> _, _, _, _, infos = envs.step(envs.action_space.sample())
>>> infos
{'x_position': array([0.03332211, 0.10172355, 0.08920531]), '_x_position': array([ True, True, True]), 'x_velocity': array([-0.06296527, 0.89345848, 0.37710836]), '_x_velocity': array([ True, True, True]), 'reward_run': array([-0.06296527, 0.89345848, 0.37710836]), '_reward_run': array([ True, True, True]), 'reward_ctrl': array([-0.24503503, -0.21944423, -0.20672209]), '_reward_ctrl': array([ True, True, True])}
>>> envs = DictInfoToList(envs)
>>> _ = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> _, _, _, _, infos = envs.step(envs.action_space.sample())
>>> infos
[{'x_position': 0.03332210900362942, 'x_velocity': -0.06296527291998533, 'reward_run': -0.06296527291998533, 'reward_ctrl': -0.2450350284576416}, {'x_position': 0.10172354684460168, 'x_velocity': 0.8934584807363618, 'reward_run': 0.8934584807363618, 'reward_ctrl': -0.21944422721862794}, {'x_position': 0.08920531470057845, 'x_velocity': 0.3771083596080768, 'reward_run': 0.3771083596080768, 'reward_ctrl': -0.20672209262847902}]
- Change logs:
v0.24.0 - Initially added as VectorListInfo
v1.0.0 - Renamed to DictInfoToList
- Parameters:
env (Env) – The environment to apply the wrapper
- class gymnasium.wrappers.vector.VectorizeTransformObservation(env: VectorEnv, wrapper: type[TransformObservation], **kwargs: Any)[source]¶
Vectorizes a single-agent transform observation wrapper for vector environments.
Most of the lambda observation wrappers for single-agent environments have vectorized implementations; it is advised that users simply use those instead by importing from gymnasium.wrappers.vector... The following example illustrates a use-case where a custom lambda observation wrapper is required.
- Example - Normal observation:
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> envs.close()
>>> obs
array([[ 0.01823519, -0.0446179 , -0.02796401, -0.03156282],
       [ 0.02852531,  0.02858594,  0.0469136 ,  0.02480598],
       [ 0.03517495, -0.000635  , -0.01098382, -0.03203924]], dtype=float32)
- Example - Applying a custom lambda observation wrapper that duplicates the observation from the environment:
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.spaces import Box
>>> from gymnasium.wrappers import TransformObservation
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> old_space = envs.single_observation_space
>>> new_space = Box(low=np.array([old_space.low, old_space.low]), high=np.array([old_space.high, old_space.high]))
>>> envs = VectorizeTransformObservation(envs, wrapper=TransformObservation, func=lambda x: np.array([x, x]), observation_space=new_space)
>>> obs, info = envs.reset(seed=123)
>>> envs.close()
>>> obs
array([[[ 0.01823519, -0.0446179 , -0.02796401, -0.03156282],
        [ 0.01823519, -0.0446179 , -0.02796401, -0.03156282]],
       [[ 0.02852531,  0.02858594,  0.0469136 ,  0.02480598],
        [ 0.02852531,  0.02858594,  0.0469136 ,  0.02480598]],
       [[ 0.03517495, -0.000635  , -0.01098382, -0.03203924],
        [ 0.03517495, -0.000635  , -0.01098382, -0.03203924]]], dtype=float32)
- Parameters:
env – The vector environment to wrap.
wrapper – The wrapper to vectorize
**kwargs – Keyword arguments for the wrapper
- class gymnasium.wrappers.vector.VectorizeTransformAction(env: VectorEnv, wrapper: type[TransformAction], **kwargs: Any)[source]¶
Vectorizes a single-agent transform action wrapper for vector environments.
- Example - Without action transformation:
>>> import gymnasium as gym >>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3) >>> _ = envs.action_space.seed(123) >>> obs, info = envs.reset(seed=123) >>> obs, rew, term, trunc, info = envs.step(envs.action_space.sample()) >>> envs.close() >>> obs array([[-4.6343064e-01, 9.8971417e-05], [-4.4488689e-01, -1.9375233e-03], [-4.3118435e-01, -1.5342437e-03]], dtype=float32)
- Example - Adding a transform that applies a ReLU to the action:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import TransformAction
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = VectorizeTransformAction(envs, wrapper=TransformAction, func=lambda x: (x > 0.0) * x, action_space=envs.single_action_space)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
>>> envs.close()
>>> obs
array([[-4.6343064e-01,  9.8971417e-05],
       [-4.4354835e-01, -5.9898634e-04],
       [-4.3034542e-01, -6.9532328e-04]], dtype=float32)
- Parameters:
env – The vector environment to wrap
wrapper – The wrapper to vectorize
**kwargs – Arguments for the LambdaAction wrapper
- class gymnasium.wrappers.vector.VectorizeTransformReward(env: VectorEnv, wrapper: type[TransformReward], **kwargs: Any)[source]¶
Vectorizes a single-agent transform reward wrapper for vector environments.
- An example that applies a ReLU to the reward:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import TransformReward
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = VectorizeTransformReward(envs, wrapper=TransformReward, func=lambda x: (x > 0.0) * x)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
>>> envs.close()
>>> rew
array([-0., -0., -0.])
- Parameters:
env – The vector environment to wrap.
wrapper – The wrapper to vectorize
**kwargs – Keyword arguments for the wrapper
Vectorized Common Wrappers¶
- class gymnasium.wrappers.vector.RecordEpisodeStatistics(env: VectorEnv, buffer_length: int = 100, stats_key: str = 'episode')[source]¶
This wrapper will keep track of cumulative rewards and episode lengths.
At the end of any episode within the vectorized env, the statistics of the episode will be added to info using the key episode, and the _episode key is used to indicate the environment index with a terminated or truncated episode.
>>> infos = {
...     ...
...     "episode": {
...         "r": "<array of cumulative reward for each done sub-environment>",
...         "l": "<array of episode length for each done sub-environment>",
...         "t": "<array of elapsed time since beginning of episode for each done sub-environment>"
...     },
...     "_episode": "<boolean array of length num-envs>"
... }
Moreover, the most recent rewards and episode lengths are stored in buffers that can be accessed via :attr:`wrapped_env.return_queue` and :attr:`wrapped_env.length_queue` respectively.
- Variables:
return_queue – The cumulative rewards of the last deque_size-many episodes
length_queue – The lengths of the last deque_size-many episodes
Example
>>> from pprint import pprint
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3)
>>> envs = RecordEpisodeStatistics(envs)
>>> obs, info = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> end = False
>>> while not end:
...     obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
...     end = term.any() or trunc.any()
...
>>> envs.close()
>>> pprint(info)
{'_episode': array([ True, False, False]),
 '_final_info': array([ True, False, False]),
 '_final_observation': array([ True, False, False]),
 'episode': {'l': array([11, 0, 0], dtype=int32),
             'r': array([11., 0., 0.], dtype=float32),
             't': array([0.007812, 0. , 0. ], dtype=float32)},
 'final_info': array([{}, None, None], dtype=object),
 'final_observation': array([array([ 0.11448676, 0.9416149 , -0.20946532, -1.7619033 ], dtype=float32),
                             None, None], dtype=object)}
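The buffers mentioned above can be read back after (or during) training; a short sketch, assuming return_queue and length_queue hold one scalar per finished episode:
>>> import numpy as np
>>> recent_returns = np.array(envs.return_queue)  # cumulative reward of each finished episode
>>> recent_lengths = np.array(envs.length_queue)  # length of each finished episode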
- Parameters:
env (Env) – The environment to apply the wrapper
buffer_length – The size of the buffers :attr:`return_queue`, :attr:`length_queue` and :attr:`time_queue`
stats_key – The info key to save the data
Implemented Observation Wrappers¶
- class gymnasium.wrappers.vector.TransformObservation(env: VectorEnv, func: Callable[[ObsType], Any], observation_space: Space | None = None)[source]¶
Transforms an observation via a function provided to the wrapper.
This class allows the manual specification of the vector observation function as well as the single observation function. This is desirable when, for example, it is possible to process vector observations in parallel or via other more optimized methods. Otherwise, VectorizeTransformObservation should be used instead, where only single_func needs to be defined.
- Example - Without observation transformation:
>>> import gymnasium as gym >>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync") >>> obs, info = envs.reset(seed=123) >>> obs array([[ 0.01823519, -0.0446179 , -0.02796401, -0.03156282], [ 0.02852531, 0.02858594, 0.0469136 , 0.02480598], [ 0.03517495, -0.000635 , -0.01098382, -0.03203924]], dtype=float32) >>> envs.close()
- Example - With observation transformation:
>>> import gymnasium as gym
>>> from gymnasium.spaces import Box
>>> def scale_and_shift(obs):
...     return (obs - 1.0) * 2.0
...
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> new_obs_space = Box(low=envs.observation_space.low, high=envs.observation_space.high)
>>> envs = TransformObservation(envs, func=scale_and_shift, observation_space=new_obs_space)
>>> obs, info = envs.reset(seed=123)
>>> obs
array([[-1.9635296, -2.0892358, -2.055928 , -2.0631256],
       [-1.9429494, -1.9428282, -1.9061728, -1.9503881],
       [-1.9296501, -2.00127  , -2.0219676, -2.0640786]], dtype=float32)
>>> envs.close()
- Parameters:
env – The vector environment to wrap
func – A function that will transform the vector observation. If this transformed observation is outside the observation space of env.observation_space, then provide an observation_space.
observation_space – The observation space of the wrapper; if None, it is assumed to be the same as env.observation_space.
- class gymnasium.wrappers.vector.FilterObservation(env: VectorEnv, filter_keys: Sequence[str | int])[source]¶
Vector wrapper for filtering dict or tuple observation spaces.
- Example - Creating a vectorized environment with a Dict space to demonstrate how to filter keys:
>>> import numpy as np
>>> import gymnasium as gym
>>> from gymnasium.spaces import Dict, Box
>>> from gymnasium.wrappers import TransformObservation
>>> from gymnasium.wrappers.vector import VectorizeTransformObservation, FilterObservation
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> make_dict = lambda x: {"obs": x, "junk": np.array([0.0])}
>>> new_space = Dict({"obs": envs.single_observation_space, "junk": Box(low=-1.0, high=1.0)})
>>> envs = VectorizeTransformObservation(env=envs, wrapper=TransformObservation, func=make_dict, observation_space=new_space)
>>> envs = FilterObservation(envs, ["obs"])
>>> obs, info = envs.reset(seed=123)
>>> envs.close()
>>> obs
{'obs': array([[ 0.01823519, -0.0446179 , -0.02796401, -0.03156282],
       [ 0.02852531,  0.02858594,  0.0469136 ,  0.02480598],
       [ 0.03517495, -0.000635  , -0.01098382, -0.03203924]], dtype=float32)}
- Parameters:
env – The vector environment to wrap
filter_keys – The subspaces to include, use a list of strings or integers for Dict and Tuple spaces respectively.
- class gymnasium.wrappers.vector.FlattenObservation(env: VectorEnv)[source]¶
Observation wrapper that flattens the observation.
Example
>>> import gymnasium as gym >>> envs = gym.make_vec("CarRacing-v2", num_envs=3, vectorization_mode="sync") >>> obs, info = envs.reset(seed=123) >>> obs.shape (3, 96, 96, 3) >>> envs = FlattenObservation(envs) >>> obs, info = envs.reset(seed=123) >>> obs.shape (3, 27648) >>> envs.close()
- Parameters:
env – The vector environment to wrap
- class gymnasium.wrappers.vector.GrayscaleObservation(env: VectorEnv, keep_dim: bool = False)[source]¶
Observation wrapper that converts an RGB image to grayscale.
Example
>>> import gymnasium as gym
>>> envs = gym.make_vec("CarRacing-v2", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 96, 96, 3)
>>> envs = GrayscaleObservation(envs)
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 96, 96)
>>> envs.close()
- Parameters:
env – The vector environment to wrap
keep_dim – Whether to keep the channel dimension in the observation; if True, obs.shape == 3, else obs.shape == 2
- class gymnasium.wrappers.vector.ResizeObservation(env: VectorEnv, shape: tuple[int, ...])[source]¶
Resizes image observations to a specified shape using OpenCV.
Example
>>> import gymnasium as gym
>>> envs = gym.make_vec("CarRacing-v2", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 96, 96, 3)
>>> envs = ResizeObservation(envs, shape=(28, 28))
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 28, 28, 3)
>>> envs.close()
- Parameters:
env – The vector environment to wrap
shape – The resized observation shape
- class gymnasium.wrappers.vector.ReshapeObservation(env: VectorEnv, shape: int | tuple[int, ...])[source]¶
Reshapes array-based observations to a specified shape.
Example
>>> import gymnasium as gym
>>> envs = gym.make_vec("CarRacing-v2", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 96, 96, 3)
>>> envs = ReshapeObservation(envs, shape=(9216, 3))
>>> obs, info = envs.reset(seed=123)
>>> obs.shape
(3, 9216, 3)
>>> envs.close()
- Parameters:
env – The vector environment to wrap
shape – The reshaped observation space
- class gymnasium.wrappers.vector.RescaleObservation(env: VectorEnv, min_obs: floating | integer | ndarray, max_obs: floating | integer | ndarray)[source]¶
Linearly rescales observations to be between a minimum and maximum value.
Example
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> obs.min()
-0.0446179
>>> obs.max()
0.0469136
>>> envs = RescaleObservation(envs, min_obs=-5.0, max_obs=5.0)
>>> obs, info = envs.reset(seed=123)
>>> obs.min()
-0.33379582
>>> obs.max()
0.55998987
>>> envs.close()
- Parameters:
env – The vector environment to wrap
min_obs – The new minimum observation bound
max_obs – The new maximum observation bound
- class gymnasium.wrappers.vector.DtypeObservation(env: VectorEnv, dtype: Any)[source]¶
Observation wrapper for transforming the dtype of an observation.
Example
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> obs.dtype
dtype('float32')
>>> envs = DtypeObservation(envs, dtype=np.float64)
>>> obs, info = envs.reset(seed=123)
>>> obs.dtype
dtype('float64')
>>> envs.close()
- Parameters:
env – The vector environment to wrap
dtype – The new dtype of the observation
- class gymnasium.wrappers.vector.NormalizeObservation(env: VectorEnv, epsilon: float = 1e-8)[source]¶
This wrapper will normalize observations such that each coordinate is centered with unit variance.
The property _update_running_mean allows freezing/continuing the running mean calculation of the observation statistics. If True (default), the RunningMeanStd will get updated on every step and reset call. If False, the calculated statistics are used but not updated anymore; this may be used during evaluation.
Note
The normalization depends on past trajectories, and observations will not be normalized correctly if the wrapper was newly instantiated or the policy was changed recently.
- Example without the normalize observation wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> obs, info = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> for _ in range(100):
...     obs, *_ = envs.step(envs.action_space.sample())
>>> np.mean(obs)
0.024251968
>>> np.std(obs)
0.62259156
>>> envs.close()
- Example with the normalize observation wrapper:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> envs = NormalizeObservation(envs)
>>> obs, info = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> for _ in range(100):
...     obs, *_ = envs.step(envs.action_space.sample())
>>> np.mean(obs)
-0.2359734
>>> np.std(obs)
1.1938739
>>> envs.close()
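The running statistics can be frozen for evaluation via the property described above; a minimal sketch (the exact spelling of the property, assumed here to be update_running_mean, may differ between Gymnasium versions):
>>> import gymnasium as gym
>>> envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
>>> envs = NormalizeObservation(envs)
>>> envs.update_running_mean = False  # assumed property; freeze statistics during evaluation
>>> obs, info = envs.reset(seed=123)
>>> envs.close()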
- Parameters:
env (Env) – The environment to apply the wrapper
epsilon – A stability parameter that is used when scaling the observations.
Implemented Action Wrappers¶
- class gymnasium.wrappers.vector.TransformAction(env: VectorEnv, func: Callable[[ActType], Any], action_space: Space | None = None)[source]¶
Transforms an action via a function provided to the wrapper.
The function :attr:`func` will be applied to all vector actions. If the actions from :attr:`func` are outside the bounds of the env's action space, provide an :attr:`action_space` which specifies the action space for the vectorized environment.
- Example - Without action transformation:
>>> import gymnasium as gym
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> for _ in range(10):
...     obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
...
>>> envs.close()
>>> obs
array([[-0.46553135, -0.00142543],
       [-0.498371  , -0.00715587],
       [-0.4651575 , -0.00624371]], dtype=float32)
- Example - With action transformation:
>>> import gymnasium as gym
>>> from gymnasium.spaces import Box
>>> def shrink_action(act):
...     return act * 0.3
...
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> new_action_space = Box(low=shrink_action(envs.action_space.low), high=shrink_action(envs.action_space.high))
>>> envs = TransformAction(env=envs, func=shrink_action, action_space=new_action_space)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> for _ in range(10):
...     obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
...
>>> envs.close()
>>> obs
array([[-0.48468155, -0.00372536],
       [-0.47599354, -0.00545912],
       [-0.46543318, -0.00615723]], dtype=float32)
- Parameters:
env – The vector environment to wrap
func – A function that will transform an action. If this transformed action is outside the action space of env.action_space, then provide an action_space.
action_space – The action space of the wrapper; if None, it is assumed to be the same as env.action_space.
- class gymnasium.wrappers.vector.ClipAction(env: VectorEnv)[source]¶
Clips continuous actions to be within the valid :class:`Box` action space bounds.
- Example - Passing an out-of-bounds action to the environment to be clipped.
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = ClipAction(envs)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> obs, rew, term, trunc, info = envs.step(np.array([5.0, -5.0, 2.0]))
>>> envs.close()
>>> obs
array([[-0.4624777 ,  0.00105192],
       [-0.44504836, -0.00209899],
       [-0.42884544,  0.00080468]], dtype=float32)
- Parameters:
env – The vector environment to wrap
- class gymnasium.wrappers.vector.RescaleAction(env: VectorEnv, min_action: float | int | ndarray, max_action: float | int | ndarray)[source]¶
Affinely rescales the continuous action space of the environment to the range [min_action, max_action].
- Example - Without action scaling:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> for _ in range(10):
...     obs, rew, term, trunc, info = envs.step(0.5 * np.ones((3, 1)))
...
>>> envs.close()
>>> obs
array([[-0.44799727,  0.00266526],
       [-0.4351738 ,  0.00133522],
       [-0.42683297,  0.00048403]], dtype=float32)
- Example - With action scaling:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = RescaleAction(envs, 0.0, 1.0)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> for _ in range(10):
...     obs, rew, term, trunc, info = envs.step(0.5 * np.ones((3, 1)))
...
>>> envs.close()
>>> obs
array([[-0.48657528, -0.00395268],
       [-0.47377947, -0.00529102],
       [-0.46546045, -0.00614867]], dtype=float32)
- Parameters:
env (Env) – The vector environment to wrap
min_action (float, int or np.ndarray) – The minimum value for each action. This may be a numpy array or a scalar.
max_action (float, int or np.ndarray) – The maximum value for each action. This may be a numpy array or a scalar.
Implemented Reward Wrappers¶
- class gymnasium.wrappers.vector.TransformReward(env: VectorEnv, func: Callable[[ArrayType], ArrayType])[source]¶
A reward wrapper that allows a custom function to modify the step reward.
- Example with reward transformation:
>>> import gymnasium as gym
>>> from gymnasium.spaces import Box
>>> def scale_and_shift(rew):
...     return (rew - 1.0) * 2.0
...
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = TransformReward(env=envs, func=scale_and_shift)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> obs, rew, term, trunc, info = envs.step(envs.action_space.sample())
>>> envs.close()
>>> obs
array([[-4.6343064e-01,  9.8971417e-05],
       [-4.4488689e-01, -1.9375233e-03],
       [-4.3118435e-01, -1.5342437e-03]], dtype=float32)
- Parameters:
env (Env) – The vector environment to wrap
func – (Callable): The function to apply to the reward
- class gymnasium.wrappers.vector.ClipReward(env: VectorEnv, min_reward: float | ndarray | None = None, max_reward: float | ndarray | None = None)[source]¶
A wrapper that clips the rewards of an environment between an upper and a lower bound.
- Example with clipped rewards:
>>> import numpy as np
>>> import gymnasium as gym
>>> envs = gym.make_vec("MountainCarContinuous-v0", num_envs=3)
>>> envs = ClipReward(envs, 0.0, 2.0)
>>> _ = envs.action_space.seed(123)
>>> obs, info = envs.reset(seed=123)
>>> for _ in range(10):
...     obs, rew, term, trunc, info = envs.step(0.5 * np.ones((3, 1)))
...
>>> envs.close()
>>> rew
array([0., 0., 0.])
- Parameters:
env – The vector environment to wrap
min_reward – The minimum reward for each step
max_reward – The maximum reward for each step
- class gymnasium.wrappers.vector.NormalizeReward(env: VectorEnv, gamma: float = 0.99, epsilon: float = 1e-8)[source]¶
This wrapper will normalize immediate rewards such that their exponential moving average has a fixed variance.
The exponential moving average will have variance :math:`(1 - \gamma)^2`.
The property _update_running_mean allows freezing/continuing the running mean calculation of the reward statistics. If True (default), the RunningMeanStd will get updated every time self.normalize() is called. If False, the calculated statistics are used but not updated anymore; this may be used during evaluation.
Note
The scaling depends on past trajectories, and rewards will not be scaled correctly if the wrapper was newly instantiated or the policy was changed recently.
- Example without the normalize reward wrapper:
>>> import gymnasium as gym
>>> import numpy as np
>>> envs = gym.make_vec("MountainCarContinuous-v0", 3)
>>> _ = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> episode_rewards = []
>>> for _ in range(100):
...     observation, reward, *_ = envs.step(envs.action_space.sample())
...     episode_rewards.append(reward)
...
>>> envs.close()
>>> np.mean(episode_rewards)
-0.03359492141887935
>>> np.std(episode_rewards)
0.029028230434438706
- Example with the normalize reward wrapper:
>>> import gymnasium as gym
>>> import numpy as np
>>> envs = gym.make_vec("MountainCarContinuous-v0", 3)
>>> envs = NormalizeReward(envs)
>>> _ = envs.reset(seed=123)
>>> _ = envs.action_space.seed(123)
>>> episode_rewards = []
>>> for _ in range(100):
...     observation, reward, *_ = envs.step(envs.action_space.sample())
...     episode_rewards.append(reward)
...
>>> envs.close()
>>> np.mean(episode_rewards)
-0.1598639586606745
>>> np.std(episode_rewards)
0.27800309628058434
- Parameters:
env (env) – The environment to apply the wrapper
epsilon (float) – A stability parameter
gamma (float) – The discount factor that is used in the exponential moving average.