Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) over the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and both still run with the old APIs by default. You can continue to use your existing custom (old stack) classes.

See the RLlib documentation on the new API stack for more details on how to use it.

Environments

RLlib internally converts any environment type you provide (e.g., a user-defined gym.Env class) into the BaseEnv API, whose main methods are poll() and send_actions().
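The poll()/send_actions() cycle can be sketched as follows. This is a dependency-free toy and not RLlib's implementation: `CountingEnv`, `ToyBaseEnv`, `run`, and the agent ID `"agent0"` are hypothetical names, and the toy `poll()` returns a reduced tuple (observations, rewards, done flags) rather than everything the real BaseEnv reports. The nested "env ID -> agent ID -> value" layout is assumed here for illustration.

```python
class CountingEnv:
    """Hypothetical sub-environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = float(action)
        done = self.t >= self.horizon
        return self.t, reward, done


class ToyBaseEnv:
    """Toy stand-in for the poll()/send_actions() protocol (not RLlib code)."""

    def __init__(self, sub_envs):
        self.sub_envs = sub_envs
        n = len(sub_envs)
        # Data that the next poll() call hands out, keyed by env ID, then agent ID.
        self._obs = {i: {"agent0": env.reset()} for i, env in enumerate(sub_envs)}
        self._rewards = {i: {"agent0": 0.0} for i in range(n)}
        self._dones = {i: {"agent0": False} for i in range(n)}

    def poll(self):
        """Return whatever is ready and clear the internal buffers."""
        out = (self._obs, self._rewards, self._dones)
        self._obs, self._rewards, self._dones = {}, {}, {}
        return out

    def send_actions(self, action_dict):
        """Step only those sub-envs for which actions were sent."""
        for env_id, agent_actions in action_dict.items():
            obs, reward, done = self.sub_envs[env_id].step(agent_actions["agent0"])
            self._obs[env_id] = {"agent0": obs}
            self._rewards[env_id] = {"agent0": reward}
            self._dones[env_id] = {"agent0": done}


def run(num_iterations):
    """Drive two sub-envs with a constant action and sum up all rewards."""
    env = ToyBaseEnv([CountingEnv(), CountingEnv()])
    total_reward = 0.0
    for _ in range(num_iterations):
        obs, rewards, dones = env.poll()
        total_reward += sum(r["agent0"] for r in rewards.values())
        # Only act in sub-envs that returned an observation and are not done.
        actions = {
            env_id: {"agent0": 1}
            for env_id in obs
            if not dones[env_id]["agent0"]
        }
        env.send_actions(actions)
    return total_reward
```

Because the caller decides which sub-environments receive actions on each iteration, the same loop shape also covers environments that produce observations asynchronously: a sub-environment that is not ready simply does not appear in the polled dicts.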

The BaseEnv API allows RLlib to support:

Vectorization of sub-environments (i.e., individual gym.Env instances, stacked to form a vector of envs) in order to batch the model forward passes that compute actions.
External simulators requiring async execution (e.g. envs that run on separate machines and independently request actions from a policy server).
Stepping through the individual sub-environments in parallel via pre-converting them into separate @ray.remote actors.
Multi-agent RL via dicts mapping agent IDs to observations, rewards, etc.
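The benefit of vectorization named in the first item can be illustrated with a small sketch: observations from all sub-environments are collected into one batch, so the policy runs once per step instead of once per sub-environment. `Walker`, `CountingPolicy`, and `vector_step` are hypothetical names used only for this illustration; none of them are RLlib classes.

```python
class Walker:
    """Hypothetical sub-environment: action 1 moves +1, action 0 moves -1."""

    def __init__(self, start):
        self.pos = start

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        return self.pos


class CountingPolicy:
    """Hypothetical policy that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, obs_batch):
        # One call ("forward pass") computes actions for the whole batch.
        self.calls += 1
        return [1 if obs < 3 else 0 for obs in obs_batch]


def vector_step(sub_envs, policy):
    """Step all sub-envs using a single batched policy call."""
    obs_batch = [env.pos for env in sub_envs]  # stack the observations
    actions = policy(obs_batch)                # single batched call
    return [env.step(a) for env, a in zip(sub_envs, actions)]
```

With three sub-environments, one call to `vector_step` triggers exactly one policy call; stepping them one by one would trigger three.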

For example, if you provide a custom gym.Env class to RLlib, auto-conversion to BaseEnv goes as follows:

User provides a gym.Env -> _VectorizedGymEnv (is-a VectorEnv) -> BaseEnv

Here is a simple example:

import gymnasium as gym

import ray
from ray.rllib.algorithms.ppo import PPOConfig


class SimpleCorridor(gym.Env):
    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = gym.spaces.Discrete(self.end_pos)

    def reset(self, *, seed=None, options=None):
        self.cur_pos = 0
        return self.cur_pos, {}

    def step(self, action):
        if action == 0 and self.cur_pos > 0:  # move left (towards start)
            self.cur_pos -= 1
        elif action == 1:  # move right (towards goal)
            self.cur_pos += 1
        if self.cur_pos >= self.end_pos:
            # Goal reached: the episode terminates (it is not truncated).
            return 0, 1.0, True, False, {}
        else:
            return self.cur_pos, -0.1, False, False, {}


ray.init()

config = PPOConfig().environment(SimpleCorridor, env_config={"corridor_length": 5})
algo = config.build()

for _ in range(3):
    print(algo.train())

algo.stop()

However, you may also subclass any of the other supported RLlib-specific environment types. The automated paths from those env types (or callables returning instances of those types) to an RLlib BaseEnv are as follows:

User provides a custom MultiAgentEnv (is-a gym.Env) -> VectorEnv -> BaseEnv
User uses a policy client (via an external simulator) -> ExternalEnv | ExternalMultiAgentEnv -> BaseEnv
User provides a custom VectorEnv -> BaseEnv
User provides a custom BaseEnv -> do nothing
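To make the multi-agent path more concrete, the sketch below shows the dict-based conventions such an environment follows: observations, rewards, and termination flags are keyed by agent ID, and the special "__all__" key signals whether the whole episode has ended. To stay dependency-free, the hypothetical `TwoAgentCorridor` class does not subclass RLlib's MultiAgentEnv, which a real custom environment would do; the agent IDs and reward values are assumptions for illustration.

```python
class TwoAgentCorridor:
    """Dependency-free sketch of a two-agent corridor environment."""

    def __init__(self, length=3):
        self.length = length
        self.agents = ["agent_0", "agent_1"]
        self.pos = {}

    def reset(self, *, seed=None, options=None):
        self.pos = {agent_id: 0 for agent_id in self.agents}
        # Observations and infos are dicts keyed by agent ID.
        return dict(self.pos), {}

    def step(self, action_dict):
        obs, rewards, terminateds = {}, {}, {}
        for agent_id, action in action_dict.items():
            if action == 1:  # move right (towards goal)
                self.pos[agent_id] += 1
            elif self.pos[agent_id] > 0:  # move left (towards start)
                self.pos[agent_id] -= 1
            reached = self.pos[agent_id] >= self.length
            obs[agent_id] = self.pos[agent_id]
            rewards[agent_id] = 1.0 if reached else -0.1
            terminateds[agent_id] = reached
        # "__all__" is True only once every agent has reached the goal.
        terminateds["__all__"] = all(
            self.pos[agent_id] >= self.length for agent_id in self.agents
        )
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}
```

A short usage example: after `env = TwoAgentCorridor(length=2)` and `env.reset()`, calling `env.step({"agent_0": 1, "agent_1": 0})` moves only the first agent forward, and both agents receive the step penalty.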

Environment API Reference