[ot][spam] Behavior Log For Control Data: HFRL Unit 1 Lab
My timezone is USA/Eastern: UTC-5. Today is 2022-06-24. 0839 I have reached a laptop with mouse and keyboard, and I am holding the intention of finding and opening the Hugging Face Unit 1 Lab Colab Notebook, to learn Deep Reinforcement Learning and start a Mind Control Business of my very own.
0841 I have switched email accounts to the one with the requested name, and am continuing to hold the intention of opening and doing the Hugging Face Lab.
0841 I am surprised to have been sending with an unexpected name. This has cost me a minute or two of delay. My algorithm patterns will learn my behaviors wrong, which will take more training time. I will move quickly to the lab.
0843 I have reached the lab at https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main... . I have fixed the email name. The unit has an additional special section covering one of the many small new technologies that are emerging. I am holding the goal of completing the bare essentials of the lab.
0845 I have clicked 'run all' in the lab to quickly initialise it. I will verify it is using a GPU, and then scroll to find where I can provide code to meet a task.
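A minimal sketch of how I plan to verify the GPU from inside the notebook (assuming torch is already installed in the Colab runtime, which the lab's setup cell should have ensured):

import torch

# True when Colab has a CUDA GPU attached (Runtime > Change runtime type > GPU)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))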
0847 I have found the first task. I am experiencing confusion due to a habit of pressing shift-enter to insert a carriage return in web dialog boxes, which in Colab exits editing and executes the cell. I am excited to produce some horrific abuse of my very own.
0849 In Colab, I have typed this:

import stable_baselines3
model = stable_baselines3.PPO(

Colab responds by popping up an autocompletion dialog that shows the parameters to PPO. This autocompletion dialog makes my behavior more efficient.
0851 This is the solution I filled in:

# TODO: Define a PPO MlpPolicy architecture
# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,
# if we had frames as input we would use CnnPolicy
import stable_baselines3
model = stable_baselines3.PPO('MlpPolicy', env, verbose=1)

This is the solution they provide:

# SOLUTION
# We added some parameters to fasten the training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)

I will copy their parameters over to my code, thinking briefly about each one. I recognise 3 of them. I recall that some of them were mentioned in the learning material, and I do not remember what they are.
0853 Here is what I have. I did not look up the terms I was unsure of; I will instead move on with the lab.

# TODO: Define a PPO MlpPolicy architecture
# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,
# if we had frames as input we would use CnnPolicy
import stable_baselines3
model = stable_baselines3.PPO(
    'MlpPolicy',      # vector input, CnnPolicy is for images
    env,              # environment objects to feed back with
    verbose=1,        # output information
    n_steps = 1024,   # number of steps the policy takes in each parallel environment before updating
    batch_size = 64,  # number of data items sent together to the gpu when updating. faster, smoother & better results when this is higher.
    n_epochs = 4,     # not sure, usually this means how many times to run over the data
    gamma = 0.999     # not sure, relates to PPO I think
)
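For later reference, a hedged annotation of the solution's extra parameters, from my general understanding of PPO rather than from checking the stable_baselines3 docs:

model = stable_baselines3.PPO(
    'MlpPolicy',
    env,
    n_steps = 1024,     # timesteps collected in each parallel environment before every update
    batch_size = 64,    # minibatch size used inside each update
    n_epochs = 4,       # passes over the collected rollout data per update
    gamma = 0.999,      # discount factor on future rewards
    gae_lambda = 0.98,  # smoothing factor for Generalized Advantage Estimation
    ent_coef = 0.01,    # weight of the entropy bonus that encourages exploration
    verbose = 1,
)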
0859 The next task is this:

Step 6: Train the PPO agent 🏃
Let's train our agent for 500,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~10min, but you can use less timesteps if you just want to try it out.

I will plan to try it out with a small number of timesteps. My first approach for finding how to do this will be scrolling up in the lab.
0900 I found something scrolling up, and pasted it in. The box now looks like this:

# TODO: Train it for 500,000 timesteps
model.learn(total_timesteps=int(2e5))

# TODO: Specify file name for model and save the model to file
model_name = ""

I will first delete the model.learn line, which I pasted in, and retype it on my own, to produce a behavioral experience of doing so.
0903 I have written this and am playing with it:

# TODO: Train it for 500,000 timesteps
model.learn(total_timesteps = 48000)

# TODO: Specify file name for model and save the model to file
model_name = "test.model"
model.save(model_name)

I used autocomplete to learn about the functions. I chose 48000 timesteps because n_steps = 1024, n_env = 16, and the default log frequency is 1. I was hoping to see 3 logs, but I mis-chose the math, and I see 2. I'm not certain of what the log frequency means, but I have a somewhat better idea. I will next look through both autocompletions to see if I can find more parameters that could be of interest.
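Writing out the arithmetic (hedged, since I am inferring how stable_baselines3 counts iterations rather than reading its source):

n_steps = 1024
n_envs = 16
steps_per_update = n_steps * n_envs   # 16384 timesteps gathered per update
print(48000 / steps_per_update)       # ~2.93, i.e. not quite 3 updates' worth of data
print(3 * steps_per_update)           # 49152 timesteps would cleanly cover 3 updates

With one log per update at the default log_interval of 1, not-quite-3 updates' worth of data is consistent with seeing fewer than the 3 logs I hoped for.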
0910 I have a strong guess as to how the logging intervals work, and am waiting for a final test run to finish before running the 500k steps.
0911 I have run this:

# TODO: Train it for 500,000 timesteps
model.learn(total_timesteps = 500000,
            reset_num_timesteps = True,
            log_interval = 500000/(1024*16)/4)

# TODO: Specify file name for model and save the model to file
model_name = "test.model"
with open(model_name, 'wb') as file:
    model.save(file)

The log interval is selected to output only 4 logs for the entire training. I'll now look at the solution.
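For the record, that expression is not an integer; I am treating it as "roughly every 8th update", and I have not checked whether stable_baselines3 accepts a float log_interval:

total_updates = 500000 / (1024 * 16)      # ~30.5 updates over the whole run
log_interval = 500000 / (1024 * 16) / 4   # = 7.62939453125
# an integer such as log_interval = 8 would log about 4 times across ~31 updates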
0912 This is the solution:

# SOLUTION
# Train it for 500,000 timesteps
model.learn(total_timesteps=500000)

# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

They did not add extra parameters like they did for model construction; there is indeed much less extra information available here. While waiting for the training to complete, I will first go to the next section and look at it. I am also considering looking up the parameters I did not know from the model construction. I am most familiar with learning by looking at the direct source code of the API function being called.
0914 My body was fiddling with a USB adapter, and I dropped its cap on the floor. I do not see where it went. The next section relates to evaluating the performance of the model using an evaluation environment. The environment is to be newly constructed. I'll start making code for that. I infer that it is the same environment, simply with a different seed or such.
Here is my code:

# TODO: Evaluate the agent
# Create a new environment for evaluation
import stable_baselines3.common.env_util
eval_env = stable_baselines3.common.env_util.make_vec_env('LunarLander-v2', n_envs=4)

# Evaluate the model with 10 evaluation episodes and deterministic=True
import stable_baselines3.common.evaluation
mean_reward, std_reward = stable_baselines3.common.evaluation.evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

# Print the results
print(f'Rewards: mean={mean_reward} std={std_reward}')

The model finished training, and I ran it. I think it does a total of 40 episodes because I passed a vectorised environment. It displayed a mean reward of around 251 and an std of around 20.5.
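If I am remembering the stable_baselines3 docs right (worth re-checking), n_eval_episodes is the total number of episodes, split across the sub-environments of a vectorized env, so this may have been 10 episodes rather than 40. A small hypothetical check against a single Monitor-wrapped environment:

import gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.evaluation import evaluate_policy

# one plain environment, wrapped in Monitor so episode rewards are recorded
single_env = Monitor(gym.make('LunarLander-v2'))
mean_reward, std_reward = evaluate_policy(model, single_env, n_eval_episodes=10, deterministic=True)
print(f'single-env check: mean={mean_reward} std={std_reward}')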
0928 I'm at the step where you package the model to the hub. To let the code find the model, I've changed my saving code to add '.zip' to the end of the name.
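My understanding (unverified) is that stable_baselines3's save() appends '.zip' itself when given a path with no extension, so saving by path rather than through an open file handle is another way to keep the names in sync:

model_name = "ppo-LunarLander-v2"
model.save(model_name)   # should write ppo-LunarLander-v2.zip, if my reading of save() is right

# reloading later by the same extensionless name
from stable_baselines3 import PPO
model = PPO.load(model_name)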
0931 The leaderboard does not display anything for me, even with advertising javascript enabled, or when trying in a different browser.
0934 The packaging code is still running. It appears to be training the model, judging by the live stack trace in the status bar. I'm on to https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1_optuna_gu... which is about using something called Optuna to do hyperparameter tuning. Hyperparameters are, here, the numbers that parameterize the shape of the model and its training. What they are has a huge impact on the model's possible performance.
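A minimal sketch of how the Optuna search loop fits together, as I understand it so far; the ranges, timestep budget, and trial count here are placeholders I made up, not the notebook's values:

import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Optuna proposes values from the ranges we declare for each trial
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    env = make_vec_env("LunarLander-v2", n_envs=16)
    model = PPO("MlpPolicy", env, gamma=gamma, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=50_000)   # short budget per trial, just for the search

    eval_env = make_vec_env("LunarLander-v2", n_envs=4)
    mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
    return mean_reward                    # Optuna maximizes this value

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)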
0938 My model did not save and package correctly due to the naming issue. I'm worried this may desynchronise the behaviors. I'm holding the intention of exploring the Optuna hyperparameter notebook.
0940 The Optuna guide on GitHub is at https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1_optuna_gu... and the notebook on Colab is at https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main...
0947 I'm reading down through the Optuna notebook. I'm actually reading it! I got as far as setting the ranges of the hyperparameters. I feel able to do the cold water and eat breakfast, so I am holding that intention now.
1008 Optuna is running; I can infer it will take more than half an hour. I have interest in using deep learning to do the hyperparameter search itself. I guess I should focus on learning things like PPO, which is a similar situation.