I misread step 4: it's just the environment, not the training! It ends with a description of vectorised environments. The lab says stacking multiple independent environments gives more diverse experiences during training. One might prefer to have the agent decide this, but then it would get too useful, yes? It would also need information on the local training system, and would have to decide when to train and when not to ... Uhhh ...

    from stable_baselines3.common.env_util import make_vec_env

    env = make_vec_env('LunarLander-v2', n_envs=16)

This makes a wrapping environment that runs 16 independent copies at once. I think it basically adds a leading dimension to the observation and reward arrays (a quick check of this guess is sketched just before the spoiler, below).

--

Step 5 is creating a model. Problem statement: land correctly on the landing pad by controlling the left, right, and main orientation engines.

Daydream side thread: what they did here was give the problem a certain kind of difficulty and ease by presetting a reward schedule. Maybe ideally a trained model would decide the reward schedule, which maybe ideally would be part of this model I suppose, or maybe that would be unclear.

SB3 (Stable-Baselines3), the first deep RL library being introduced, is used. It contains reliable implementations of reinforcement learning algorithms in PyTorch. They give the great advice of visiting the documentation and trying tutorials before using a new library: [1]https://stable-baselines3.readthedocs.io/en/master . A large picture is shown of a robot with a machine learning library logo planning how to sink a basketball through a hoop. I imagine the importance of preparation and practice, looking at it.

The algorithm that will be used is PPO. It is considered state of the art and will be studied during the course. Side note: PPO has been used to train a two-legged robot to walk from scratch within a single hour, using massively parallel simulation, hyperparameter tuning, and some NVIDIA GPU-simulation stuff. It's an older algorithm as things go nowadays; OpenAI introduced it back in 2017. The letters stand for "proximal policy optimization".

Setting up Stable-Baselines3:
- create the environment
- define and instantiate the model to use, model = PPO('MlpPolicy')
- train the agent with model.learn and define how much training to do

    import gym
    from stable_baselines3 import PPO

    # create env
    env = gym.make('LunarLander-v2')
    # instantiate agent (models aren't agents imo, but the lab implies they are)
    model = PPO('MlpPolicy', env, verbose=1)
    # train agent
    model.learn(total_timesteps=int(2e5))

The lab provides a space where the user has to type in the PPO('MlpPolicy', env) line for the code to continue successfully, so that I learn to type it. A comment explains that MlpPolicy is for vector inputs, whereas CnnPolicy is for image inputs. Under the hood, the two policies are just different input feature extractors (a multilayer perceptron vs. a convolutional network) feeding the same actor-critic machinery, nothing as exotic as a transformer. If you wanted a different kind of input, you'd borrow a feature extractor from some other model, write your own, or pass the data in raw. So here it is:

    # How do you define a PPO MlpPolicy architecture?
    # Use MultiLayerPerceptron (MLPPolicy) because the input is a vector.
    model = [fill this in]

NOTE: the lab expects you to have read the associated documentation, but the answer it gives goes a little above and beyond what a new person would likely figure out.
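Before the spoiler, a quick sanity check of my earlier guess that the vectorised environment adds a batch dimension. This is my own sketch, not part of the lab; it assumes SB3's VecEnv interface, where reset() returns a stacked observation array and step() takes one action per sub-environment.

    import numpy as np
    from stable_baselines3.common.env_util import make_vec_env

    vec_env = make_vec_env('LunarLander-v2', n_envs=16)

    obs = vec_env.reset()
    print(obs.shape)  # expect (16, 8): one 8-dimensional observation per sub-environment

    # step every sub-environment with action 0 ("do nothing" in LunarLander)
    actions = np.zeros(16, dtype=np.int64)
    obs, rewards, dones, infos = vec_env.step(actions)
    print(rewards.shape, dones.shape)  # expect (16,) and (16,)

    vec_env.close()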
SPOILER: solution immediately below this line

    # We added some parameters to speed up the training
    model = PPO(
        policy = 'MlpPolicy',
        env = env,
        n_steps = 1024,
        batch_size = 64,
        n_epochs = 4,
        gamma = 0.999,
        gae_lambda = 0.98,
        ent_coef = 0.01,
        verbose = 1)

My guesses:
- I don't remember which step count n_steps is
- batch_size is usually the number of samples processed together in one backpropagation pass when computing the training loss
- n_epochs is usually the number of passes over the data; it likely means something specific in the context of RL
- gamma was mentioned earlier as the discounting factor for the reward, I think
- I'm not familiar with gae_lambda or ent_coef at this time

I have very little experience with these things, and should have read the documentation!

--

Step 6: Train the agent for 500k timesteps, as appropriate for the available hardware acceleration. The lab says this takes approximately 10 minutes on a Colab GPU, which in my session is a Tesla K80 (I checked by running "!nvidia-smi"). It also says you can use fewer timesteps if you just want to try it out, and recommends taking a break during training. Here's the template (I sketch my own guess at completing it in a P.S. after the references):

    # TODO: Train it for enough timesteps

    # TODO: Specify file name for model and save the model to file
    model_name = ""

References

1. https://stable-baselines3.readthedocs.io/en/master
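P.S. here is how I'd fill in that step-6 template. This is my own sketch, not the lab's official answer; the file name is just a name I picked, and I'm relying on SB3's model.save(), which writes the model to the given path with a ".zip" extension.

    # my own completion of the TODOs, not the lab's answer
    # train for the recommended 500k timesteps (~10 minutes on the Colab GPU)
    model.learn(total_timesteps=500_000)

    # file name is my own choice; model.save() writes model_name + ".zip"
    model_name = "ppo-LunarLander-v2"
    model.save(model_name)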