Like an RL model, I have minimal working memory nowadays.
So I'll need the docs to get through this modeling exercise. The lab says to read them anyway.
The challenge is to correctly instantiate a PPO model with an MlpPolicy, and then train it on a Gym environment for 500k timesteps.
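For my own reference, here is a rough sketch of what that probably looks like. It assumes the lab means Stable-Baselines3 (the text above never names the library) and it uses CartPole-v1 as a stand-in environment, since no environment is specified. The helper name `train_ppo` is made up for this note.

```python
TOTAL_TIMESTEPS = 500_000  # from the challenge statement
ENV_ID = "CartPole-v1"     # assumption: the challenge does not name an environment


def train_ppo(env_id: str = ENV_ID, total_timesteps: int = TOTAL_TIMESTEPS):
    """Build a PPO agent with the MlpPolicy and train it (hypothetical helper)."""
    # Imported inside the function so this file loads even when the
    # library is not installed. Assumption: Stable-Baselines3 is the library.
    from stable_baselines3 import PPO

    # Passing the environment ID as a string lets the library create the env.
    model = PPO("MlpPolicy", env_id, verbose=1)
    model.learn(total_timesteps=total_timesteps)
    return model
```

Calling `train_ppo()` would kick off the full 500k-step run; passing a smaller `total_timesteps` first is a cheap way to check that the setup works before committing to the long one.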