[ot][spam][crazy] lab1 was: draft: learning RL

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Mon May 9 07:13:06 PDT 2022


I misread step 4: it's just the environment, not the training!

It ends with a description of vectorised environments. The lab says
stacking multiple independent environments gives more diverse experiences
during training. One might prefer to have the agent decide this, but then
it would get too useful, yes? And it would need information about the local
training system, and would have to decide when to train and when not ... Uhhh ...

from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env('LunarLander-v2', n_envs=16)

This makes a wrapping environment that processes 16 times as much at once.
I think it basically adds a dimension to the observation and reward
tensors.
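
A quick way to see that extra dimension (my own check, not from the lab),
reusing the vectorised env created above:

obs = env.reset()                    # SB3's VecEnv.reset() returns only the observations
print(obs.shape)                     # (16, 8): 16 environments x 8-dim LunarLander observation
actions = [env.action_space.sample() for _ in range(16)]
obs, rewards, dones, infos = env.step(actions)
print(rewards.shape, dones.shape)    # both (16,): one entry per stacked environment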

--

Step 5 is creating a model.

Problem statement: land correctly on the landing pad by controlling the
left and right orientation engines and the main engine.

Daydream side thread: what they did here was give the problem a certain
kind of difficulty and ease by presetting a reward schedule. Maybe ideally
a trained model would decide the reward schedule, which maybe ideally would
be part of this model I suppose, or maybe that would get unclear.
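
As a rough sketch of what "presetting a reward schedule" means in code (my
own illustration, nothing from the lab): gym lets you wrap an environment
and reshape its reward before the agent ever sees it, so the schedule is
fixed by whoever writes the wrapper rather than decided by the agent.

import gym

class ScaledReward(gym.RewardWrapper):
    """Hypothetical wrapper: rescale the environment's preset reward."""
    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # the schedule the agent trains against is decided here, by us
        return reward * self.scale

shaped_env = ScaledReward(gym.make('LunarLander-v2'), scale=0.1)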

SB3, the first deep RL library being introduced, is used. It contains
reliable implementations of reinforcement learning algorithms in PyTorch.

They give the great advice of visiting documentation and trying tutorials
before using a new library:
https://stable-baselines3.readthedocs.io/en/master .
A large picture is shown of a robot with a machine learning library logo
planning how to sink a basketball through a hoop. Looking at it, I imagine
the importance of preparation and practice.

The algorithm that will be used is PPO, considered a state-of-the-art
algorithm, and it will be studied during the course.

Side note: PPO has been used to train a two-legged robot to walk from
scratch within a single hour, using massively parallel simulations,
hyperparameter tuning, and some NVIDIA stuff. It's not a brand-new
algorithm as things go nowadays; OpenAI introduced it in 2017. The letters
stand for "proximal policy optimization".

Setting up Stable-Baselines3:
- create the environment
- define and instantiate the model to use, model = PPO('MlpPolicy')
- train the agent with model.learn and define how much training to do

import gym
from stable_baselines3 import PPO

# create env
env = gym.make('LunarLander-v2')

# instantiate agent (models aren't agents imo but the lab implies they are)
model = PPO('MlpPolicy', env, verbose=1)

# train agent
model.learn(total_timesteps=int(2e5))
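
Once learn() returns, the trained model can already be queried for actions.
A quick check of my own (not part of the lab) using SB3's predict():

obs = env.reset()
action, _states = model.predict(obs, deterministic=True)
print(action)   # one of LunarLander's 4 discrete actions: noop, fire left, fire main, fire right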

The lab provides a space where the user has to type in the PPO('MlpPolicy',
env) line for the code to continue successfully, so that I learn to type
it. A comment explains that MlpPolicy is for vector inputs, whereas
CnnPolicy is for image inputs.

Under the hood, the policies are mostly input encoders feeding the same
kind of actor-critic network: MlpPolicy flattens vector observations into
fully-connected layers, while CnnPolicy runs image observations through a
small CNN first. If you wanted different data, you'd write or copy a custom
features extractor, or pass it in raw.
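
If you do want to poke at the architecture, SB3 exposes it through
policy_kwargs; the layer sizes below are just an illustrative choice of
mine, not the lab's setting:

from stable_baselines3 import PPO

# net_arch=[64, 64] is an arbitrary example, not what the lab uses
model_custom = PPO('MlpPolicy', env, policy_kwargs=dict(net_arch=[64, 64]), verbose=1)

For image observations you'd pick 'CnnPolicy' instead, or plug in a custom
features extractor the same way.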

So here it is:

# How do you define a PPO MlpPolicy architecture? Use MultiLayerPerceptron
# (MLPPolicy) because the input is a vector.

model = [fill this in]

NOTE: the lab expects you to have read the associated documentation, but
the answer they give goes a little above and beyond what a new person would
likely figure out.

SPOILER: solution immediately below this line

# We added some parameters to speed up the training
model = PPO(
  policy = 'MlpPolicy',
  env = env,
  n_steps = 1024,
  batch_size = 64,
  n_epochs = 4,
  gamma = 0.999,
  gae_lambda = 0.98,
  ent_coef = 0.01,
  verbose=1)

My guesses:
- I don't remember which step count n_steps is
- batch_size is usually the number of samples processed together in one
gradient step when the training loss is backpropagated
- n_epochs is usually the number of times the data is run over; this likely
means something specific in the context of RL
- gamma was mentioned earlier as the discounting factor for the reward, I
think
- not familiar with gae_lambda or ent_coef at this time

I have very little experience with these things, and should have read the
documentation!
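
For the record, after skimming the SB3 docs: the same constructor again,
with my paraphrase of what each parameter means there.

model = PPO(
  policy='MlpPolicy',   # MLP feature extractor + actor-critic heads, for vector observations
  env=env,
  n_steps=1024,         # rollout length: steps collected per environment before each update
  batch_size=64,        # minibatch size used for each gradient step over the collected rollout
  n_epochs=4,           # passes over the rollout buffer per update
  gamma=0.999,          # discount factor on future reward
  gae_lambda=0.98,      # lambda for Generalized Advantage Estimation (bias/variance trade-off)
  ent_coef=0.01,        # weight of the entropy bonus that encourages exploration
  verbose=1)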

Step 6:

Train the agent for 500k timesteps, as appropriate for the available
hardware acceleration. The lab says this takes approximately 10 minutes on
a Colab GPU, which in my session is a Tesla K80 (I checked by executing
"!nvidia-smi").

The lab says you can use fewer timesteps if you just want to try it out,
and recommends taking a break during training.

Here's the template:

# TODO: Train it for enough timesteps

# TODO: Specify file name for model and save the model to file
model_name = ""
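
My guess at filling in the template (not the lab's official answer): call
model.learn() for the requested timesteps, then model.save() under whatever
name you pick; the name below is just one I made up.

# my attempt, not the lab's solution
model.learn(total_timesteps=500_000)

model_name = "my-ppo-LunarLander-v2"   # arbitrary name of my own choosing
model.save(model_name)                 # SB3 writes <model_name>.zip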