[ot][spam][crazy] lab1 docs was: lab1 was: draft: learning RL

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Mon May 9 15:01:12 PDT 2022


I am _so_ excited to mind control myself to recruit an infinite number
of subhive workers for the Borg Queen! I am _totally_ going to get a
bigger infinite number of workers than my peers.

> Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning
> algorithms in PyTorch. It is the next major version of Stable Baselines.

> Github repository: https://github.com/DLR-RM/stable-baselines3
> Paper: https://jmlr.org/papers/volume22/20-1364/20-1364.pdf
> RL Baselines3 Zoo (training framework for SB3):
>  https://github.com/DLR-RM/rl-baselines3-zoo
> RL Baselines3 Zoo provides a collection of pre-trained agents, scripts for training,
> evaluating agents, tuning hyperparameters, plotting results and recording videos.

Hyperparameters are the mystic mumbo-jumbo where users manually
configure whether or not the model actually works, rather than having
it learn to configure itself. Like the "gamma" parameter we learned
about for reward discounting.
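
If I wanted to poke at that mumbo-jumbo myself, gamma is just a
constructor keyword in SB3 (a small sketch; 0.99 is the library
default, so passing it here changes nothing):

import gym
from stable_baselines3 import A2C

env = gym.make('CartPole-v1')
# gamma is the discount factor on future rewards; 0.99 is SB3's default
model = A2C('MlpPolicy', env, gamma=0.99, verbose=1)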

> SB3 Contrib (experimental RL code, latest algorithms):
>  https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

> Main Features
> - Unified structure for all algorithms
> - PEP8 compliant (unified code style)
> - Documented functions and classes
> - Tests, high code coverage and type hints
> - Clean code
> - Tensorboard support
> - The performance of each algorithm was tested
>   (see Results section in their respective page)

[snip]

> Getting Started
> Most of the library tries to follow a sklearn-like syntax for the Reinforcement
> Learning algorithms.
>
> Here is a quick example of how to train and run A2C on a CartPole environment:
>

_whew_ the example uses cartpole instead of lunarlander.

There's some training code in the example, like the lab asks for.
Maybe I'll type that all into something by hand and see if it works.

> > import gym
> >
> > from stable_baselines3 import A2C
> >

$ pip3 install stable-baselines3 'gym[box2d]'

It takes some time to load the A2C import.

> > env = gym.make('CartPole-v1')
> >
> > model = A2C('MlpPolicy', env, verbose=1)

I thought I recalled the course saying 'MlpPolicy' means the _action
space_ is _discrete_, but that's not it: 'MlpPolicy' means the policy
network is a plain multi-layer perceptron, which suits CartPole's small
vector observations (image observations would use 'CnnPolicy' instead).
Also, A2C isn't a policy; it's the algorithm, used here instead of PPO.
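
If the lab wants PPO instead, it looks like a drop-in swap of the class
(a sketch, assuming the same CartPole environment):

import gym
from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
# same 'MlpPolicy' string, different algorithm class
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)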

> > model.learn(total_timesteps=10000)

Okay, this appears to be where it trains. It spends some time training
(not too long on this local system), and outputs data regularly. I
wonder if it is data every epoch. Here's the first output:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 15.6     |
|    ep_rew_mean        | 15.6     |
| time/                 |          |
|    fps                | 466      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.678   |
|    explained_variance | 0.0911   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.8      |
|    value_loss         | 8.88     |
------------------------------------

It lines up correctly in a fixed-width font.

I'm guessing ep_rew_mean is the mean reward per episode (and
ep_len_mean the mean episode length). It looks like it's doing 466
steps per second.

It says 100 iterations and 500 total timesteps, so there's something
getting multiplied by 5 somewhere. Could it be running 5 environments
in parallel? I didn't tell it to do that, though. The factor of 5 is
A2C's default n_steps: each iteration collects a 5-step rollout from
the single environment before doing one gradient update.
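
A sketch of where that factor of 5 lives, and what asking for real
parallel environments would look like (n_envs=4 is just an example
value; make_vec_env is SB3's helper for this):

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# the 5x factor: A2C collects n_steps transitions per iteration (default 5)
model = A2C('MlpPolicy', 'CartPole-v1', n_steps=5, verbose=1)

# actually running environments in parallel would look like this instead
vec_env = make_vec_env('CartPole-v1', n_envs=4)
model = A2C('MlpPolicy', vec_env, verbose=1)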

The train/ block is the optimization side: the policy, value, and
entropy losses, the learning rate, and how many gradient updates have
run so far.

Here's the final update:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 53.7     |
|    ep_rew_mean        | 53.7     |
| time/                 |          |
|    fps                | 676      |
|    iterations         | 2000     |
|    time_elapsed       | 14       |
|    total_timesteps    | 10000    |
| train/                |          |
|    entropy_loss       | -0.535   |
|    explained_variance | 2.62e-05 |
|    learning_rate      | 0.0007   |
|    n_updates          | 1999     |
|    policy_loss        | 0.157    |
|    value_loss         | 0.336    |
------------------------------------

It got the mean reward to rise from 15.6 to 53.7. So it doesn't
appear to be learning to fail harder and harder each episode, at
least. I often have that problem, myself.

The learn() call returned the model itself, an object of type A2C.
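
Since learn() returns the model, construction and training can
apparently be chained into one line (a sketch):

from stable_baselines3 import A2C

model = A2C('MlpPolicy', 'CartPole-v1', verbose=1).learn(total_timesteps=10000)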

Lemme paste that call again, the one that made it learn everything
after construction:

> model.learn(total_timesteps=10000)

And I'd better pair it with the ones that create an environment and model:

> env = gym.make('CartPole-v1')
> model = A2C('MlpPolicy', env, verbose=1)

These are what the lab is testing me on!
gym.make(environment_name) # returns a prefab'd environment
AlgorithmClass('MlpPolicy', env, verbose=1, ...) # makes an agent model (A2C, PPO, ...)
model.learn(total_timesteps=timestep_count, ...) # trains the model on the environment it was made with

10k timesteps in Stable Baselines' first example, which used A2C.
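
Putting those three calls together with what I'm guessing the lab
actually wants, LunarLander and PPO (my assumptions, hence the
gym[box2d] install earlier; the timestep count is a placeholder):

import gym
from stable_baselines3 import PPO

env = gym.make('LunarLander-v2')              # returns a prefab'd environment
model = PPO('MlpPolicy', env, verbose=1)      # makes an agent model
model.learn(total_timesteps=10000)            # trains it on that environment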

> >
> > obs = env.reset()
> > for i in range(1000):
> >     action, _state = model.predict(obs, deterministic=True)
> >     obs, reward, done, info = env.step(action)
> >     env.render()
> >     if done:
> >       obs = env.reset()

This looks like boilerplate for displaying what the policy learned to
do in the environment, after training.
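
For a number instead of a render window, SB3 also ships an evaluation
helper (a sketch; n_eval_episodes=10 is just an example value):

from stable_baselines3.common.evaluation import evaluate_policy

# mean episode reward over a few fresh episodes, no rendering
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(mean_reward, std_reward)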

