Note
Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.
Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and both still run with the old APIs by default. You can continue to use your existing custom (old-stack) classes.
See here for more details on how to use the new API stack.
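At the time of Ray 2.10, opting an algorithm into the new API stack is done through an experimental config setting. Below is a minimal, hedged sketch, assuming the _enable_new_api_stack flag of AlgorithmConfig.experimental(); the flag is experimental and may be renamed in later releases:

from ray.rllib.algorithms.ppo import PPOConfig

# Opt a PPO config into the alpha "new API stack" (Ray 2.10).
# NOTE: `_enable_new_api_stack` is experimental and may change in later releases.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .experimental(_enable_new_api_stack=True)
)
algo = config.build()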
Environments
Any environment type you provide to RLlib (e.g., a user-defined gym.Env class) is converted internally into the BaseEnv API, whose main methods are poll() and send_actions().
The BaseEnv API allows RLlib to support:
Vectorization of sub-environments (i.e., individual gym.Env instances, stacked to form a vector of envs) in order to batch the model forward passes that compute actions (see the config sketch after this list).
External simulators requiring async execution (e.g., envs that run on separate machines and independently request actions from a policy server).
Stepping through the individual sub-environments in parallel by pre-converting them into separate @ray.remote actors.
Multi-agent RL via dicts mapping agent IDs to observations, rewards, etc.
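From the configuration side, here is a hedged sketch of how vectorization and remote sub-environments are typically enabled, assuming the num_envs_per_worker and remote_worker_envs settings of AlgorithmConfig.rollouts() (names as of Ray 2.10; they may differ in other releases):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(
        # Stack 4 sub-environments per rollout worker to batch the
        # action-computing forward passes.
        num_envs_per_worker=4,
        # Step the sub-environments in parallel as separate @ray.remote actors.
        remote_worker_envs=True,
    )
)
algo = config.build()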
For example, if you provide a custom gym.Env class to RLlib, auto-conversion to BaseEnv goes as follows:
User provides a gym.Env -> _VectorizedGymEnv (is-a VectorEnv) -> BaseEnv
Here is a simple example:
# __rllib-custom-gym-env-begin__
import gymnasium as gym

import ray
from ray.rllib.algorithms.ppo import PPOConfig


class SimpleCorridor(gym.Env):
    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # 0: move left, 1: move right
        self.observation_space = gym.spaces.Discrete(self.end_pos)

    def reset(self, *, seed=None, options=None):
        self.cur_pos = 0
        return self.cur_pos, {}

    def step(self, action):
        if action == 0 and self.cur_pos > 0:  # move left (towards start)
            self.cur_pos -= 1
        elif action == 1:  # move right (towards goal)
            self.cur_pos += 1
        # Reaching the end of the corridor terminates the episode with reward 1.0.
        if self.cur_pos >= self.end_pos:
            return 0, 1.0, True, False, {}
        # Every other step yields a small negative reward.
        else:
            return self.cur_pos, -0.1, False, False, {}


ray.init()
config = PPOConfig().environment(SimpleCorridor, env_config={"corridor_length": 5})
algo = config.build()
for _ in range(3):
    print(algo.train())
algo.stop()
# __rllib-custom-gym-env-end__
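You can also register the environment under a string name and refer to it by that name in the config, for example when running through Ray Tune. A short sketch, assuming the SimpleCorridor class from the example above (the name "corridor-env" is arbitrary):

from ray.tune.registry import register_env

# The creator function receives the env_config dict passed via .environment().
register_env("corridor-env", lambda env_config: SimpleCorridor(env_config))

config = PPOConfig().environment("corridor-env", env_config={"corridor_length": 10})
algo = config.build()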
However, you may also conveniently sub-class any of the other supported RLlib-specific environment types. The automated paths from those env types (or callables returning instances of those types) to an RLlib BaseEnv are as follows:
User provides a custom MultiAgentEnv (is-a gym.Env; see the minimal sketch after this list) -> VectorEnv -> BaseEnv
User uses a policy client (via an external simulator) -> ExternalEnv | ExternalMultiAgentEnv -> BaseEnv
User provides a custom BaseEnv -> do nothing
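For reference, here is a minimal, hedged sketch of the dict-based MultiAgentEnv interface that appears in the first path above; the two agent IDs and the corridor logic are made up for illustration. reset() and step() return dicts mapping agent IDs to values, and the special "__all__" key in the terminateds/truncateds dicts ends the episode for all agents:

import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoAgentCorridor(MultiAgentEnv):
    def __init__(self, config=None):
        super().__init__()
        self._agent_ids = {"agent_0", "agent_1"}
        self.end_pos = 5
        self.pos = {}
        # Per-agent spaces (both agents share the same spaces in this sketch).
        self.observation_space = gym.spaces.Discrete(self.end_pos + 1)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        self.pos = {aid: 0 for aid in self._agent_ids}
        # Observations (and infos) are dicts mapping agent IDs to values.
        return {aid: 0 for aid in self._agent_ids}, {}

    def step(self, action_dict):
        obs, rewards, terminateds, truncateds = {}, {}, {}, {}
        for aid, action in action_dict.items():
            # Action 1 moves the agent one step towards the goal.
            self.pos[aid] = min(self.pos[aid] + int(action == 1), self.end_pos)
            obs[aid] = self.pos[aid]
            rewards[aid] = 1.0 if self.pos[aid] == self.end_pos else -0.1
            terminateds[aid] = self.pos[aid] == self.end_pos
            truncateds[aid] = False
        # The special "__all__" key ends the episode for every agent.
        terminateds["__all__"] = all(terminateds.values())
        truncateds["__all__"] = False
        return obs, rewards, terminateds, truncateds, {}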