Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and they still run with the old APIs by default. You can continue to use your existing custom (old stack) classes.

See here for more details on how to use the new API stack.

BaseEnv API#

rllib.env.base_env.BaseEnv#

class ray.rllib.env.base_env.BaseEnv[source]#

The lowest-level env interface used by RLlib for sampling.

BaseEnv models multiple agents executing asynchronously in multiple vectorized sub-environments. A call to poll() returns observations from ready agents keyed by their sub-environment ID and agent IDs, and actions for those agents can be sent back via send_actions().

All other RLlib supported env types can be converted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:

gym.Env => rllib.VectorEnv => rllib.BaseEnv
rllib.MultiAgentEnv (is-a gym.Env) => rllib.VectorEnv => rllib.BaseEnv
rllib.ExternalEnv => rllib.BaseEnv

MyBaseEnv = ...
env = MyBaseEnv()
obs, rewards, terminateds, truncateds, infos, off_policy_actions = (
    env.poll()
)
print(obs)
# Expected output of the first print(obs):
# {
#     "env_0": {
#         "car_0": [2.4, 1.6],
#         "car_1": [3.4, -3.2],
#     },
#     "env_1": {
#         "car_0": [8.0, 4.1],
#     },
#     "env_2": {
#         "car_0": [2.3, 3.3],
#         "car_1": [1.4, -0.2],
#         "car_3": [1.2, 0.1],
#     },
# }

env.send_actions({
    "env_0": {
        "car_0": 0,
        "car_1": 1,
    }, ...
})
obs, rewards, terminateds, truncateds, infos, off_policy_actions = (
    env.poll()
)
print(obs)
# Expected output of the second print(obs):
# {
#     "env_0": {
#         "car_0": [4.1, 1.7],
#         "car_1": [3.2, -4.2],
#     }, ...
# }

print(terminateds)
# Expected output:
# {
#     "env_0": {
#         "__all__": False,
#         "car_0": False,
#         "car_1": True,
#     }, ...
# }
to_base_env(make_env: Callable[[int], Any | gymnasium.Env] | None = None, num_envs: int = 1, remote_envs: bool = False, remote_env_batch_wait_ms: int = 0, restart_failed_sub_environments: bool = False) BaseEnv[source]#

Converts an RLlib-supported env into a BaseEnv object.

Supported types for the env arg are gym.Env, BaseEnv, VectorEnv, MultiAgentEnv, ExternalEnv, or ExternalMultiAgentEnv.

The resulting BaseEnv is always vectorized (contains n sub-environments) to support batched forward passes, where n may also be 1. BaseEnv also supports async execution via the poll and send_actions methods and thus supports external simulators.

TODO: Support gym3 environments, which are already vectorized.

Parameters:
  • env – An already existing environment of any supported env type to convert/wrap into a BaseEnv. Supported types are gym.Env, BaseEnv, VectorEnv, MultiAgentEnv, ExternalEnv, and ExternalMultiAgentEnv.

  • make_env – A callable taking an int as input (which indicates the number of individual sub-environments within the final vectorized BaseEnv) and returning one individual sub-environment.

  • num_envs – The number of sub-environments to create in the resulting (vectorized) BaseEnv. The already existing env will be one of the num_envs.

  • remote_envs – Whether each sub-env should be a @ray.remote actor. You can set this behavior in your config via the remote_worker_envs=True option.

  • remote_env_batch_wait_ms – The wait time (in ms) to poll remote sub-environments for, if applicable. Only used if remote_envs is True.

  • policy_config – Optional policy config dict.

Returns:

The resulting BaseEnv object.
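
For illustration, a minimal usage sketch. MyMultiAgentEnv is a hypothetical MultiAgentEnv subclass defined elsewhere; RLlib’s env base classes (e.g., MultiAgentEnv) expose the same to_base_env signature documented above.

env = MyMultiAgentEnv()  # hypothetical MultiAgentEnv subclass

# Wrap into a vectorized BaseEnv with 4 sub-environments; make_env builds
# one additional sub-environment per vector index.
base_env = env.to_base_env(
    make_env=lambda vector_index: MyMultiAgentEnv(),
    num_envs=4,
)
obs, rewards, terminateds, truncateds, infos, off_policy_actions = base_env.poll()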

poll() Tuple[Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]]][source]#

Returns observations from ready agents.

All return values are two-level dicts mapping from EnvID to dicts mapping from AgentIDs to (observation/reward/etc.) values. The number of agents and sub-environments may vary over time.

Returns:

Tuple consisting of:

  1. New observations for each ready agent.

  2. Reward values for each ready agent. If the episode has just started, the value will be None.

  3. Terminated values for each ready agent. The special key “__all__” is used to indicate episode termination.

  4. Truncated values for each ready agent. The special key “__all__” is used to indicate episode truncation.

  5. Info values for each ready agent.

  6. Off-policy actions: agents may take actions on their own, in which case there is an entry in this dict containing the taken action. There is no need to call send_actions() for agents that have already chosen off-policy actions.
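
For illustration, a minimal sketch of a poll/send_actions loop over these return values. base_env is an already constructed BaseEnv, and compute_action is a hypothetical helper mapping an agent’s observation to an action.

obs, rewards, terminateds, truncateds, infos, off_policy_actions = base_env.poll()

action_dict = {}
for env_id, agent_obs in obs.items():
    # "__all__" marks the end of the whole episode in this sub-environment.
    if terminateds.get(env_id, {}).get("__all__") or truncateds.get(env_id, {}).get("__all__"):
        base_env.try_reset(env_id)
        continue
    action_dict[env_id] = {
        agent_id: compute_action(agent_id, agent_ob)  # hypothetical policy lookup
        for agent_id, agent_ob in agent_obs.items()
        # No action needs to be sent for agents that already acted off-policy.
        if agent_id not in off_policy_actions.get(env_id, {})
    }

base_env.send_actions(action_dict)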

send_actions(action_dict: Dict[int | str, Dict[Any, Any]]) None[source]#

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters:

action_dict – Action values keyed by env_id and agent_id.

try_reset(env_id: int | str | None = None, *, seed: int | None = None, options: dict | None = None) Tuple[Dict[int | str, Dict[Any, Any]] | None, Dict[int | str, Dict[Any, Any]] | None][source]#

Attempt to reset the sub-env with the given id or all sub-envs.

If the environment does not support synchronous reset, a tuple of (ASYNC_RESET_REQUEST, ASYNC_RESET_REQUEST) can be returned here.

Note: A MultiAgentDict is returned when using the deprecated wrapper classes such as ray.rllib.env.base_env._MultiAgentEnvToBaseEnv, however for consistency with the poll() method, a MultiEnvDict is returned from the new wrapper classes, such as ray.rllib.env.multi_agent_env.MultiAgentEnvWrapper.

Parameters:
  • env_id – The sub-environment’s ID if applicable. If None, reset the entire Env (i.e. all sub-environments).

  • seed – The seed to be passed to the sub-environment(s) when resetting it. If None, will not reset any existing PRNG. If you pass an integer, the PRNG will be reset even if it already exists.

  • options – An options dict to be passed to the sub-environment(s) when resetting it.

Returns:

A tuple consisting of a) the reset (multi-env/multi-agent) observation dict and b) the reset (multi-env/multi-agent) infos dict. Returns the (ASYNC_RESET_REQUEST, ASYNC_RESET_REQUEST) tuple, if not supported.
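
For illustration, a minimal sketch of resetting a single sub-environment. base_env is an existing BaseEnv; ASYNC_RESET_REQUEST is assumed to be importable from ray.rllib.env.base_env, as referenced above.

from ray.rllib.env.base_env import ASYNC_RESET_REQUEST  # assumed import location

obs, infos = base_env.try_reset(env_id="env_0", seed=42)
if obs == ASYNC_RESET_REQUEST:
    # The sub-env resets asynchronously; the post-reset observations arrive
    # through a later poll() call instead.
    obs, rewards, terminateds, truncateds, infos, _ = base_env.poll()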

try_restart(env_id: int | str | None = None) None[source]#

Attempt to restart the sub-env with the given id or all sub-envs.

This could result in the sub-env being completely removed (gc’d) and recreated.

Parameters:

env_id – The sub-environment’s ID, if applicable. If None, restart the entire Env (i.e. all sub-environments).

get_sub_environments(as_dict: bool = False) List[Any | gymnasium.Env] | dict[source]#

Return a reference to the underlying sub environments, if any.

Parameters:

as_dict – If True, return a dict mapping from env_id to env.

Returns:

List or dictionary of the underlying sub environments or [] / {}.

get_agent_ids() Set[Any][source]#

Return the agent IDs for the sub-environments.

Returns:

All agent IDs for the environment.
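
For illustration, a minimal sketch assuming base_env is an existing BaseEnv:

sub_envs = base_env.get_sub_environments(as_dict=True)  # {env_id: sub_env}, or {} if none
agent_ids = base_env.get_agent_ids()                    # set of all agent IDs
print(len(sub_envs), agent_ids)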

try_render(env_id: int | str | None = None) None[source]#

Tries to render the sub-environment with the given id or all.

Parameters:

env_id – The sub-environment’s ID, if applicable. If None, renders the entire Env (i.e. all sub-environments).

stop() None[source]#

Releases all resources used.

property observation_space: gymnasium.Space#

Returns the observation space for each agent.

Note: samples from the observation space need to be preprocessed into a MultiEnvDict before being used by a policy.

Returns:

The observation space for each environment.

property action_space: gymnasium.Space#

Returns the action space for each agent.

Note: samples from the action space need to be preprocessed into a MultiEnvDict before being passed to send_actions.

Returns:

The action space for each environment.

action_space_sample(agent_id: list = None) Dict[int | str, Dict[Any, Any]][source]#

Returns a random action for each environment, and potentially each agent in that environment.

Parameters:

agent_id – List of agent ids to sample actions for. If None or empty list, sample actions for all agents in the environment.

Returns:

A random action for each environment.

observation_space_sample(agent_id: list = None) Dict[int | str, Dict[Any, Any]][source]#

Returns a random observation for each environment, and potentially each agent in that environment.

Parameters:

agent_id – List of agent ids to sample observations for. If None or empty list, sample observations for all agents in the environment.

Returns:

A random observation for each environment.
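
For illustration, a minimal sketch combining both sampling helpers, assuming base_env is an existing BaseEnv with agents named as in the example at the top of this page:

# Sample random actions only for the listed agents; pass None (or an empty
# list) to sample for every agent in every sub-environment.
random_actions = base_env.action_space_sample(agent_id=["car_0", "car_1"])
random_obs = base_env.observation_space_sample()

# Both results are MultiEnvDicts: {env_id: {agent_id: sample}}.
base_env.send_actions(random_actions)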

last() Tuple[Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]], Dict[int | str, Dict[Any, Any]]][source]#

Returns the last observations, rewards, done- and truncated flags, and infos that were returned by the environment.

Returns:

The last observations, rewards, done- and truncated flags, and infos for each sub-environment.
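
For illustration, a short sketch assuming base_env is an existing BaseEnv:

# Peek at the most recent poll() results without stepping the environment.
obs, rewards, terminateds, truncateds, infos = base_env.last()
print(obs.get("env_0", {}))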

observation_space_contains(x: Dict[int | str, Dict[Any, Any]]) bool[source]#

Checks if the given observation is valid for each environment.

Parameters:

x – Observations to check.

Returns:

True if the observations are contained within their respective spaces. False otherwise.

action_space_contains(x: Dict[int | str, Dict[Any, Any]]) bool[source]#

Checks if the given actions are valid for each environment.

Parameters:

x – Actions to check.

Returns:

True if the actions are contained within their respective spaces. False otherwise.
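
For illustration, a minimal round-trip validity check, assuming base_env is an existing BaseEnv:

# Samples come back as MultiEnvDicts, which is the layout the *_contains()
# checks expect.
assert base_env.observation_space_contains(base_env.observation_space_sample())
assert base_env.action_space_contains(base_env.action_space_sample())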