I misread step 4, it's just the environment, not the training!
It ends with a description of vectorised environments. The lab says stacking multiple independent environments gives more diverse experiences during training. One might prefer to have the agent decide this, but then it would get too useful, yes? And need information on the local training system, and decide when to train and when not ... Uhhh ...
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env('LunarLander-v2', n_envs=16)
This makes a wrapping environment that steps 16 independent copies of the environment at once. I think it basically adds a batch dimension to the observation and reward arrays.
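A tiny check that the wrapper really batches things (this assumes the SB3 VecEnv API, where reset returns only the stacked observations):

import numpy as np

obs = env.reset()
print(obs.shape)  # (16, 8): one 8-dimensional LunarLander observation per parallel env
actions = np.array([env.action_space.sample() for _ in range(16)])  # one random action per env
obs, rewards, dones, infos = env.step(actions)
print(rewards.shape)  # (16,): one reward per parallel env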
--
Step 5 is creating a model.
Problem statement: land correctly on the landing pad by controlling the left orientation engine, the right orientation engine, and the main engine.
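A quick way to see what "controlling the engines" means concretely, using the same Gym API the lab uses elsewhere:

import gym

env = gym.make('LunarLander-v2')
print(env.observation_space)  # Box of 8 floats: position, velocity, angle, angular velocity, leg contacts
print(env.action_space)       # Discrete(4): do nothing, fire left engine, fire main engine, fire right engine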
Daydream side thread: what they did here was give the problem a certain kind of difficulty and ease by presetting a reward schedule. Maybe ideally a trained model would decide the reward schedule, which would ideally be part of this model I suppose, or maybe that would be unclear.
SB3 (Stable-Baselines3), the first deep RL library introduced here, is used. It provides reliable implementations of reinforcement learning algorithms in PyTorch.
A large picture is shown of a robot with a machine learning library logo planning how to sink a basketball through a hoop. I imagine the importance of preparation and practice, looking at it.
The algorithm that will be used is PPO, which is considered state of the art and will be studied during the course.
Side note: PPO has been used to train a two-legged robot to walk from scratch within a single hour, using massively parallel simulations, hyperparameter tuning, and NVIDIA hardware. It's an older algorithm as things go nowadays; OpenAI introduced it back in 2017. The letters stand for "Proximal Policy Optimization".
Setting up Stable-Baselines3:
- create the environment
- define and instantiate the model to use, model = PPO('MlpPolicy')
- train the agent with model.learn and define how much training to do
import gym
from stable_baselines3 import PPO

# create env
env = gym.make('LunarLander-v2')
# instantiate agent (models aren't agents imo but the lab implies they are)
model = PPO('MlpPolicy', env, verbose=1)
# train agent
model.learn(total_timesteps=int(2e5))
The lab provides a space where the user has to type the PPO('MlpPolicy', env) line themselves for the code to continue successfully, so that I learn to type it. A comment explains that MlpPolicy is for vector inputs, while CnnPolicy is for image inputs.
Under the hood, the policies aren't exotic: MlpPolicy is a small multi-layer perceptron for vector inputs, and CnnPolicy swaps in a convolutional feature extractor for image inputs, with both feeding the same kind of policy and value heads. If you wanted different data, you could plug in a custom feature extractor, or flatten it and pass it in raw.
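A minimal sketch of that idea, following SB3's custom-policy documentation (the class name and layer sizes below are illustrative, not something the lab asks for):

import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class TinyExtractor(BaseFeaturesExtractor):
    """Illustrative custom input encoder for flat vector observations."""
    def __init__(self, observation_space, features_dim=64):
        super().__init__(observation_space, features_dim)
        n_input = observation_space.shape[0]  # 8 for LunarLander-v2
        self.net = nn.Sequential(nn.Linear(n_input, features_dim), nn.ReLU())

    def forward(self, observations):
        return self.net(observations)

model = PPO(
    'MlpPolicy',
    'LunarLander-v2',
    policy_kwargs=dict(
        features_extractor_class=TinyExtractor,
        features_extractor_kwargs=dict(features_dim=64),
    ),
    verbose=1,
)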
So here is the lab's fill-in cell:
# How do you define a PPO MlpPolicy architecture? Use MultiLayerPerceptron (MLPPolicy) because the input is a vector.
model = [fill this in]
NOTE: the lab expects you to have read the associated documentation, but the answer they give goes a little above and beyond what a new person would likely figure out.
SPOILER: solution immediately below this line
# We added some parameters to speed up the training
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
My guesses:
- I don't remember which step count n_steps is
- batch_size is usually the number of samples processed together in one gradient step when the loss is backpropagated
- n_epochs is usually the number of passes over the data; in RL it presumably means passes over each collected rollout
- gamma was mentioned earlier as the discounting factor for the reward, I think (tiny sketch after this list)
- not familiar with gae_lambda or ent_coef at this time
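A tiny plain-Python check of what that discount factor does (the reward values are made up):

gamma = 0.999
rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)  # about 2.997, slightly below the undiscounted sum of 3.0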
I have very little experience with these things, and should have read the documentation!
Step 6:
Train the agent for 500k timesteps, sized appropriately for the available hardware acceleration. The lab says this takes approximately 10 minutes on a Colab GPU, which in my session is a Tesla K80 (I checked by running "!nvidia-smi").
The lab says you can use fewer timesteps if you just want to try it out, and recommends taking a break during training.
Here's the template:
# TODO: Train it for enough timesteps
# TODO: Specify file name for model and save the model to file
model_name = ""
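A sketch of how I'd expect to fill the template in (the 500k figure comes from step 6; the filename is an illustrative placeholder, not necessarily the one the lab expects):

# Train it for enough timesteps
model.learn(total_timesteps=500_000)
# Specify file name for model and save the model to file
model_name = "ppo-LunarLander-v2"  # placeholder name
model.save(model_name)  # SB3 writes model_name.zip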