I am _so_ excited to mind control myself to recruit an infinite number of subhive workers for the Borg Queen! I am _totally_ going to get a bigger infinite number of workers than my peers.
Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.
Github repository: https://github.com/DLR-RM/stable-baselines3
Paper: https://jmlr.org/papers/volume22/20-1364/20-1364.pdf
RL Baselines3 Zoo (training framework for SB3): https://github.com/DLR-RM/rl-baselines3-zoo

RL Baselines3 Zoo provides a collection of pre-trained agents, scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
Hyperparameters are the mystic mumbo-jumbo the user configures by hand to determine whether the model actually works, rather than the model learning those settings itself. Like the "gamma" parameter we learned for discounting future rewards.
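If I ever want to turn one of those knobs myself instead of taking the default, I believe it's just a constructor argument on the model. A minimal sketch (0.99 is the documented default for gamma, not something I tuned):

import gym
from stable_baselines3 import A2C

env = gym.make('CartPole-v1')

# gamma is the discount factor applied to future rewards; passing it
# explicitly just makes the hyperparameter visible instead of implicit.
model = A2C('MlpPolicy', env, gamma=0.99, verbose=1)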
SB3 Contrib (experimental RL code, latest algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Main Features
- Unified structure for all algorithms
- PEP8 compliant (unified code style)
- Documented functions and classes
- Tests, high code coverage and type hints
- Clean code
- Tensorboard support
- The performance of each algorithm was tested (see Results section in their respective page)
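The Tensorboard support bullet caught my eye. As far as I know it's just one extra constructor argument plus pointing the tensorboard CLI at the log directory; a sketch (the directory name below is mine, not from the docs):

import gym
from stable_baselines3 import A2C

env = gym.make('CartPole-v1')

# tensorboard_log tells SB3 where to write event files;
# afterwards `tensorboard --logdir ./a2c_cartpole_tb/` shows the curves.
model = A2C('MlpPolicy', env, verbose=1, tensorboard_log='./a2c_cartpole_tb/')
model.learn(total_timesteps=10000)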
[snip]
Getting Started Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.
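That sklearn comparison clicks for me if I line the two patterns up: construct the model, call one method to fit/learn, then call predict. A tiny sketch of the sklearn half (my own toy example, not from the SB3 docs), with the SB3 equivalents noted in comments:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()      # construct  (SB3: A2C('MlpPolicy', env))
clf.fit(X, y)                   # fit        (SB3: model.learn(total_timesteps=...))
print(clf.predict([[1.5]]))     # predict    (SB3: model.predict(obs))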
Here is a quick example of how to train and run A2C on a CartPole environment:
_whew_ the example uses cartpole instead of lunarlander. There's some training code in the example, like the lab asks for. Maybe I'll type that all into something by hand and see if it works.
import gym
from stable_baselines3 import A2C
$ pip3 install stable_baselines3 gym[box2d]

It takes some time to load the A2C import.
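Before typing the rest in, a quick sanity check that the install actually worked (the package exposes a __version__ attribute, at least in my install):

import stable_baselines3
print(stable_baselines3.__version__)   # confirms the import works and shows which release I got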
env = gym.make('CartPole-v1')
model = A2C('MlpPolicy', env, verbose=1)
I thought I recalled the course saying 'MlpPolicy' means the _action space_ is _discrete_, but that's not it: it means the policy network is a multi-layer perceptron, the choice when observations are flat feature vectors (image observations would use 'CnnPolicy'). And A2C isn't a policy at all; it's the _algorithm_, used here instead of PPO.
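One way to settle what 'MlpPolicy' actually is: the constructed policy hangs off the model from the line above as a regular PyTorch module, so printing it shows the layers:

# model.policy is a torch.nn.Module; printing it shows small fully-connected
# (MLP) layers for the actor and the value function -- the name is about the
# network architecture, not the action space.
print(model.policy)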
model.learn(total_timesteps=10000)
Okay, this appears to be where it trains. It spends some time training (not too long on this local system) and outputs data regularly. I wonder if it is data every epoch. Here's the first output:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 15.6     |
|    ep_rew_mean        | 15.6     |
| time/                 |          |
|    fps                | 466      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.678   |
|    explained_variance | 0.0911   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.8      |
|    value_loss         | 8.88     |
------------------------------------

It lines up right in a fixed-width font. I'm guessing ep_rew_mean is the mean reward over the episode. It looks like it's doing 466 steps per second. It says 100 iterations and 500 total timesteps, so something is getting multiplied by 5 somewhere. Could it be running 5 environments in parallel? It seems I didn't tell it to do that, though; more likely each iteration gathers several steps before an update (see the sketch further down). The train/ section looks like the learning-side numbers: the losses, the learning rate, and a count of gradient updates. Here's the final update:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 53.7     |
|    ep_rew_mean        | 53.7     |
| time/                 |          |
|    fps                | 676      |
|    iterations         | 2000     |
|    time_elapsed       | 14       |
|    total_timesteps    | 10000    |
| train/                |          |
|    entropy_loss       | -0.535   |
|    explained_variance | 2.62e-05 |
|    learning_rate      | 0.0007   |
|    n_updates          | 1999     |
|    policy_loss        | 0.157    |
|    value_loss         | 0.336    |
------------------------------------

It got the mean reward to rise from 15.6 to 53.7, so it doesn't appear to be learning to fail harder and harder each episode, at least. I often have that problem, myself. The call returned an object of type A2C (the model itself). Lemme paste that call again, the one that made it learn everything after construction:
model.learn(total_timesteps=10000)
And I'd better pair it with the ones that create an environment and model:
env = gym.make('CartPole-v1')
model = A2C('MlpPolicy', env, verbose=1)
These are what the lab is testing me on!

gym.make(environment_name)                        # returns a prefab'd environment
AlgorithmClass('MlpPolicy', env, verbose=1, ...)  # makes the agent model (an algorithm class like A2C or PPO)
model.learn(total_timesteps=timestep_count, ...)  # trains the model on the environment it was made with

The first Stable Baselines3 example trains for 10k timesteps with A2C.
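Also, back to that ×5 question in the training output: I'm fairly sure it's A2C gathering a handful of steps per update rather than parallel environments. A sketch with the documented defaults written out explicitly (these are default values, not tuning on my part):

import gym
from stable_baselines3 import A2C

env = gym.make('CartPole-v1')

# n_steps=5 means each "iteration" in the log gathers 5 environment steps
# before one update, which is why total_timesteps was 5 * iterations.
# learning_rate=7e-4 matches the 0.0007 shown under train/.
model = A2C('MlpPolicy', env, verbose=1, n_steps=5, learning_rate=7e-4)
model.learn(total_timesteps=10000)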
obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
This looks like boilerplate for displaying what the policy learned to do in the environment, after training.
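If I want a number instead of a render window, I believe SB3 also ships an evaluation helper; a sketch assuming the `model` and `env` from the example above, and that evaluate_policy behaves the way I remember from the docs:

from stable_baselines3.common.evaluation import evaluate_policy

# Roll the trained policy out for 10 episodes and report mean/std episode reward.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")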