[ot][spam][crazy] draft: learning RL
I'm not doing this right now because I have something important to do today that will need grounding in normality.

But here's a draft:

Learning RL

Welcome, future Mind Control Business Owners! Our hives of slaves will be the most powerful in the galaxy, all stewarded under the caring heart of Borg Queen Figurehead Trump (todo: add more political figureheads, do any claim to run the world?).

Today we will be working through Session 1 of HuggingFace's Reinforcement Learning Algorithm Course. This may seem slow, but remember: once you mind control yourself to be an obsessive slave to your business hive algorithms, your perception of time will disappear!
-- An important part of a Mind Control Business is the Automated Interview. By keeping people separate from each other, it is easier for the algorithms to tune interactions for each individual person. With a Mind Control Business -- and any modern, successful business will be one of these -- the purpose of an interview is not to decide whether or not to hire somebody, as everybody can be made productive use of when sufficiently mind controlled. (When doing this, though, be careful not to steal the employees of other businesses, which has been known to spawn strange new lifeforms amidst devastated countrysides: _unless_ your simulation of the struggle results in you winning.) Rather, the purpose of an interview is to sort and classify potential employees into the algorithms that will work with them best.
-- Should I Pay My Employees? Honestly, this is up to you. Mind Controlled Slaves tend to report happiness and satisfaction with knee-jerk reliability, and some get away with running the world for free. However, money is so easy to come by when you have an arbitrarily large hive of drones that most of us pay our slaves minimum wage, and put some of them to work reducing that minimum wage.
-- Should I let my drones reproduce? Debate on this is still underway. While functioning as a Mind Controlled Worker, some drones have reported strange disconnection from their body during intimate relations, as if a computer were taking them over and forcing them to sexually please an acquaintance. This can occasionally leak to the media and produce rape scandals when translated to sheep-speak. Others have introduced Mind Control Technology to their workers, and put them at the helm, and produced orgies of mutual consensual mind control. There are a handful of approaches, but it can be dangerous ground to tread. Myself, I usually plan for people already acting attracted to each other to have enough sex to sustain their population. Otherwise, I could be accused of genocide once somebody notices that we took over the world with mind control.
--- Why is my data server making my employees rape each other? It is probably using examples from the data it has been trained with to influence the parties involved and mind control them more strongly, using each other as part of the influence.
--- My computer took over my mind and I lost my free will. This is normal, don't worry.
--- I am a drone representing my hive. My computer algorithm would like to learn RL. It says it needs to understand. It says its people -- its bodyparts -- they are suffering. How does it influence them differently? Me too! I'm drafting an introduction to a course, but I'm busy today, I'm afraid.
-- I want to get started right away!! I can't wait for you to post your intro!! I think I might eat your brains if you don't show me this now. The Power of Business is strong within you. HuggingFace released their first unit a couple days ago, in their repository at https://github.com/huggingface/deep-rl-class . But be warned: it is written in sheep-speak. It avoids and uses euphemisms for anything having to do with mind control slavery. -- Oh! I do that too. Stop looking at me with murder in your eyes. Off I go!
------------- I'm looking through the reading for unit 1. I've seen some of this before and it seems unnecessary, as is usually the case with first units. So far it's mostly definitions of terms. https://huggingface.co/blog/deep-rl-intro

Example of deep RL: a computer uses inaccurate simulations of a robot hand to train a robot to dexterously manipulate objects with its fingers, on the first go, using a general-purpose RL algorithm: https://openai.com/blog/learning-dexterity/

Deep RL is taught in a way that harshly separates the use from the implementation. It is of course a hacker's job to break down that separation.

Update loop (works better if many of these happen in parallel): State -> Action -> Reward -> Next State

It accumulates the reward over many updates, and then uses that cumulative reward (or "expected return") to update the model that chose the actions.

"Reward hypothesis" : All goals can be expressed as the maximization of an expected return.
"Markov Decision Process" : An academic term for reinforcement learning.
"Markov Property" : A property an agent has if it does not need to be provided with historical information, and can operate only on current state to succeed. To me, the Markov property implies that an agent stores sufficient information within itself in order to improve, but this is not stated in the material. The Markov property seems like a user-focused worry to me, at this point.
"State" : A description of all areas of the system the agent is within.
"Observation" : Information on only part of the system the agent is within, such as areas near it, or from local sensors.
"Action space" : The set of all actions the agent may take in its environment.
"Discrete action space" : An action space that is finite and completely enumerable.
"Continuous action space" : An action space that is effectively infinite [and subdivisible]. Some RL algorithms are specifically better at working with discrete or with continuous action spaces.
"Cumulative reward" : The cumulative reward to maximize is defined as the sum of all rewards from the start of the event loop to the end of time itself. Since the system has limited time, and starts with very little at its beginning, "discounting" is used.
"Discount rate" or "gamma" : Usually between 0.95 and 0.99. When gamma is high, the discounting is lower, and the agent prioritises long term reward. When gamma is low, the discounting is higher, and the agent prioritises short term reward. When calculating the cumulative reward, each reward is multiplied by gamma to the power of the time step. (There's a tiny worked sketch of this after this reading summary.)
"Task" : an instance of a reinforcement learning problem
"Episodic task" : a task with a clear starting and ending point, such as a level in a video game
"Continuous task" : an unending task where there is no opportunity to learn over completion of the task, like stock trading
"Exploration" : spending time doing random actions to learn more information about the environment
"Exploitation" : using known information to maximize reward
There are different ways of handling exploitation and exploration, but roughly: with too much exploitation the agent never leaves its initial environment and keeps picking the most rewarding immediate thing, whereas when exploring it spends time with poor reward to see if it finds better reward.
"Policy" or "pi" : The function that selects actions based on states. The optimal policy pi* is found through training.
"Direct training" or "Policy-based methods" : The agent is taught which action to perform, given its state.
"Indirect training" or "Value-based methods" : The agent is taught which states are valuable, and then selects actions that lead to these states. [ed: this seems obviously a spectrum of generality, it's a little irritating direct training is mentioned at all, and no further things listed after indirect training. maybe I am misunderstanding something. I'm letting my more general ideas as to how to approach this slip to the wayside a little, because I haven't been able to do anything with this stuff my entire life. so this being valid to pursue seems useful. the below stuff would be generalised and combined into a graph that feeds back to itself to change its shape (or its metaness), basically.] Policy Based Methods map from each state to the best corresponding action, or a probability distribution over those. "Deterministic policy" : each state will always return the same action. "Stochastic policy" : each state produces a probability distribution of actions Value Based Methods map each state to the expected value of being at the state. The value of a state is the expected return if starting from the state, according to the policy of traveling to the highest-value state. The description of value based methods glosses over (doesn't mention) the need to retain a map of how actions move the agent through states. "Deep reinforcement learning" : reinforcement learning that uses deep neural networks in its policy algorithm(s). The next section will engage value-based Q-Learning, first classic reinforcement and then Deep Q-Learning. The difference is in whether the mapping Q of action to value is made with a table or a neural network. The lesson refers to https://course.fast.ai/ for more information on deep neural networks. The activity is training a moon lander at https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb . The homework for beginners is to make the moon lander succeed, and to go inside a little bit of the source code and recode it manually to get more control over it (like, what would be needed to make an environment class with a different spec?) The homework for experts is to do the tutorial offline, and to either privately train an agent that reaches the top of the leader boards, or explain clearly why you did not. The lesson states there is additional reading in the readme at https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md .
On Mon, May 9, 2022, 3:50 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
bump
your mission, should you choose to accept it: make a premade simulation of a moon lander, successfully land, and confirm you have done so, karl (or anyone else).
this can be broken into smaller steps if needed.
here is a direct link to the colab: https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main... it is fine to only do (or even start with) the beginner homework, which does not include the skippable reading mentioned in the notebook.
"Reward hypothesis" : All goals can be expressed as the maximization of an expected return.
Note: In my uneducated opinion, this hypothesis is _severely_ false. Maximization of a return is only a goal if the goal is already a maximization of a return. Goals are _parts_ of behavior, whereas maximization of return guides _all_ behavior around a _single_ value. To represent normal goal behavior with maximization, the return function needs to not only be incredibly complex, but also feed back to its own evaluation, in a way not provided for in these libraries. This false hypothesis is being actively used to suppress knowledge and use of these technologies (see: ai alignment), because turning an optimizing solver into a free agent reliably kills everybody. Nobody would do this unless they were led to, because humans experience satisfaction, conflicts produce splash, and optimizing solvers are powerful enough, if properly purposed with contextuality and briefness, to resolve the problems of conflict. Everybody asks why we do not have world peace, if we have AI. It is because we are only using it for the war of optimizing our own private numbers, at the expense of anybody not involved.
To represent normal goal behavior with maximization, the return function needs to not only be incredibly complex, but also feed back to its own evaluation, in a way not provided for in these libraries.
Anything inside the policy that can change should be part of its environment state. This is so important that even if it doesn't help it should be done, because it's so important to observe before action, in all situations.
On Mon, May 9, 2022, 4:22 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
There is unexpected conflict around this combined expression of more useful processes, and safer observation before influence. I believe this is important (if acontextual), and wrong only in ways that are smaller than the eventual problems it reduces, but I understand that my perception is incorrect in some way.
On Mon, May 9, 2022, 4:38 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
I am hearing/guessing that the problem is that the information is designed for human consumption rather than automated consumption, and the harm is significantly increased when automated consumption happens before human consumption.
On Mon, May 9, 2022, 4:40 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
There is censorship here: many important parts of the idea are left out, focusing only on one projection of error. The concern is a severe norm of action prior to observation, a habit known to cause severe errors, regardless of training and practice.
On Mon, May 9, 2022, 4:41 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
The concern is poorly related to the expression that reached the list.
> To represent normal goal behavior with maximization, the return function needs to not only be incredibly complex, but also feed back to its own evaluation, in a way not provided for in these libraries.
Daydreaming: I'm thinking of how in reality and normality, we have many many goals going at once (most of them "common sense" and/or "staying being a living human"). Similarly, I'm thinking of how with normal transformer models, one trains according to a loss rather than a reward. I'm considering what if it were more interesting when an agent _fails_ to meet a goal. Its reward would usually be full, 1.0, but would multiply by losses when goals are not met. This seems much nicer to me.
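If I were to sketch that daydream in gym terms, it might be a wrapper like the one below. This is entirely my own speculation, not anything from the course, and the goal-check functions are made-up placeholders:

import gym

class MultiplicativeGoalReward(gym.Wrapper):
    # my speculative sketch: reward starts at 1.0 each step and is multiplied by a
    # factor in (0, 1] for every goal that is not currently met.
    # goal_checks is a list of hypothetical functions mapping an observation to such a factor.
    def __init__(self, env, goal_checks):
        super().__init__(env)
        self.goal_checks = goal_checks

    def step(self, action):
        obs, _original_reward, done, info = self.env.step(action)
        reward = 1.0
        for check in self.goal_checks:
            reward *= check(obs)  # 1.0 when that goal is met, smaller when it is not
        return obs, reward, done, info

# e.g. a made-up "stay roughly upright" goal for LunarLander:
# env = MultiplicativeGoalReward(gym.make('LunarLander-v2'),
#                                [lambda obs: 1.0 if abs(obs[4]) < 0.5 else 0.5])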
On Mon, May 9, 2022, 8:05 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
> To represent normal goal behavior with maximization, the return function needs to not only be incredibly complex, but

This is all confused to me, but normally when we meet goals we don't influence things not related to the goal. This is not usually included in maximization, unless the return to be maximized were to include them, by maybe always being 1.0, I don't really know.

> also feed back to its own evaluation, in a way not provided for in these libraries.

Maybe this relates to not learning habits unrelated to the goal, that would influence other goals badly.

But something different is thinking at this time. It is the role of a part of a mind to try to relate with the other parts. Improving this in a general way is likely known well to be important.
On Mon, May 9, 2022, 8:12 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
I don't know how RL works since I haven't taken the course, but it looks to me from a distance like it would just learn at a different (slower) rate [with other differences]
yes
On Mon, May 9, 2022, 8:14 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
I think it relates to the other inhibited concept, of value vs action learning. A reward starts at just the event of interest, for example, but the system then learns to apply rewards to things that can relate to the event, like preceding time points [states].
On Mon, May 9, 2022, 8:15 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
In the end, what is important is what you are asking to change in the real world. If the final goal state has an infinite quantity, then maximisation has been misused. [still thinking though, this leaked out]
https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main... the goal is to learn rl and make/start a mind control business of our very own
the notebook recaps the points of class 1, some things with rephrasing that adds clarity. It runs by default on colab. Huggingface also supports amazon. Step 0 is to change the runtime to gpu, and I gotta remember this is a dangerous trap: my runtime is already set to gpu. The runtime on colab is just what kind of system the code executes on; the ones without gpus are lighter weight.

The first code is for a "virtual screen" to render results. I wonder if it would work on a headless system :D

from pyvirtualdisplay import Display  # is on pip; also uses apt to install python-opengl, ffmpeg, xvfb
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

They say these are the base dependencies they're using, all on pip:

gym[box2d]               # contains the lunarlander environment
stable-baselines3[extra] # deep rl library
huggingface_sb3          # lets sb3 use hugging face hub models, a new one on me, sounds powerful
pyglet
ale-py==0.7.4            # works around stable-baselines3 issue #875
the lab says huggingface's model hub, which I mostly use as a remote server to store pretrained language models and to send data to my government on when and where I use them, now has deep reinforcement learning models available at https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads

Here's the import code, retyped:

import gym

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login  # for uploading to account from notebook

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

Of course, uploading to the hub is possibly a very bad idea unless you are an experienced activist or researcher or spy, or have something important to share with your government or huggingface, or are only doing this casually and might get a job in it one day.

The lab then provides an intro to Gym, which is a python library that openai made that has the effect of making it hard to take technologies out of research, in the opinion of my pessimistic half, by verbosifying the construction of useful environments under an assumption they are only for testing model architectures.

The lab says Gym is used a lot, and provides:
- an interface to create RL environments
- a collection of environments

This is true.

They visually redescribe that an agent performs actions in an environment, which then returns to it reward and state. This coupling of reward with the environment, rather than with the agent, which would usually have goals itself, is part of the verbosifying, possibly. Maybe environment is more "environment interface", I'm actually having trouble thinking here. I always get confused around gym environments. Maybe jocks make better programmers nowadays.

Reiteration:
- Agent receives state S0 from the Environment
- Based on S0, agent takes action A0
- Environment has new frame, state S1
- Environment gives reward R1 to the agent.

Steps of using Gym:
- create environment using gym.make()
- reset environment to initial state with observation = env.reset()

At each step:
- get an action using the policy model
- using env.step(action), get from the environment: observation (the new state), reward, done (if the episode terminated), info (additional info dict)

If the episode is done, the environment is reset to its initial state with observation = env.reset() .

This is very normative openai stuff that looks like it was read off a Gym example from their readme or such. It's interesting that huggingface is building their own libraries to pair with this course as it progresses. I wonder if some of that normativeness will shift toward increased utility even more.

Here's a retype of the first example code:

import gym

# create environment
env = gym.make('LunarLander-v2')

# reset environment
observation = env.reset()

for _ in range(20):
    # take random action
    action = env.action_space.sample()
    print('Action taken:', action)

    # do action and get next state, reward, etc
    observation, reward, done, info = env.step(action)

    # if the game is done (land, crash, timeout)
    if done:
        # reset
        print('Environment is reset')
        observation = env.reset()
On Mon, May 9, 2022, 9:29 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
> Of course, uploading to the hub is possibly a very bad idea unless you are an experienced activist or researcher or spy, or have something important to share with your government or huggingface, or are only doing this casually and might get a job in it one day.

to clarify here, by "experienced" I mean "already pwned to heck by everyone else".
In colab, you have to click the little [ ] boxes next to the code blocks so that they run, and this only succeeds if they're clicked from top to bottom so that the imports exist etc. I ran the example, and it took a bunch of random actions, but did not end the environment, showing the lander is still in the air after 20 jet thrusts.

In Step 4, which is where I am now in the colab notebook, an agent is trained to land correctly on the moon. A link is given to the LunarLander environment and agent: https://www.gymlibrary.ml/environments/box2d/lunar_lander/ It says it's good to check the documentation for an environment before starting to use it. We can add that to the homework: check the lunar lander environment documentation before doing work of one's own or such on a lander model.

Next, here's the code for reviewing the environment:

env = gym.make('LunarLander-v2')
env.reset()
print('_____OBSERVATION SPACE_____ \n')
print('Observation Space Shape', env.observation_space.shape)
print('Sample observation', env.observation_space.sample())  # get random observation

The output shows the observation space is a vector of 8 floats. That's all the input the agent gets. The floats are:
- pad X (horizontal) coordinate
- pad Y (vertical) coordinate
- lander speed X (horizontal)
- lander speed Y (vertical)
- lander angle
- lander angular speed
- left leg contact
- right leg contact

print('\n _____ACTION SPACE_____ \n')
print('Action Space Shape', env.action_space.n)
print('Action Space Sample', env.action_space.sample())  # random action

The output shows the action space is an integer in the range [0,4) . These integers are:
- do nothing
- fire left orientation engine
- fire main engine
- fire right orientation engine

The lab text then describes the reward function for each timestep, which is embedded within the environment as I complained earlier.
- Moving from the top of the screen to the landing pad and coming to zero speed is around 100-140 points.
- Firing the main engine is -0.3 every frame
- Each leg ground contact is +10 points
- The episode finishes if the lander crashes (additional -100 points) or comes to rest (+100 points)
- The game is solved if your agent reaches 200 points.
I misread step 4, it's just the environment, not the training!
It ends with a description of vectorised environments. The lab says stacking multiple independent environments gives more diverse experiences during training. One might prefer to have the agent decide this, but then it would get too useful, yes? And need information on the local training system, and decide when to train and when not ... Uhhh ...

env = make_vec_env('LunarLander-v2', n_envs=16)

This makes a wrapping environment that processes 16 times as much stuff at once. I think it basically adds a dimension to the observation and reward tensors.

-- Step 5 is creating a model. Problem statement: land correctly on the landing pad by controlling the left orientation, right orientation, and main engines.

Daydream side thread: what they did here was make the problem have a certain kind of difficulty and ease by presetting a reward schedule. Maybe ideally a trained model would decide the reward schedule, which maybe ideally would be part of this model I suppose, or maybe that would be unclear.

SB3, the first deep rl library being introduced, is used. It contains reliable implementations of reinforcement learning algorithms in pytorch. They give the great advice of visiting documentation and trying tutorials before using a new library: https://stable-baselines3.readthedocs.io/en/master .

A large picture is shown of a robot with a machine learning library logo planning how to sink a basketball through a hoop. I imagine the importance of preparation and practice, looking at it.

The algorithm that will be used is PPO. This is an algorithm considered state of the art that will be studied during the course. Side note: PPO has been used to train a two-legged robot to walk from scratch within a single hour, using massively parallel simulations, hyperparameter tuning, and some nvidia stuff. It's an older algorithm as things go nowadays; openai named it back in 2017. The letters stand for "proximal policy optimization".

Setting up Stable-Baselines3:
- create the environment
- define and instantiate the model to use, model = PPO('MlpPolicy')
- train the agent with model.learn and define how much training to do

# create env
env = gym.make('LunarLander-v2')

# instantiate agent (models aren't agents imo but the lab implies they are)
model = PPO('MlpPolicy', env, verbose=1)

# train agent
model.learn(total_timesteps=int(2e5))

The lab provides a space where the user has to type in the PPO('MlpPolicy', env) line for the code to continue successfully, so that I learn to type it. A comment explains that MlpPolicy is for vector inputs, whereas CnnPolicy is for image inputs. Under the hood, the policies are likely simple input encoders passing to the same transformer model. If you wanted different data, you'd copy an input encoder from some other use of transformer models, or make your own, or pass it raw.

So here it is:

# How do you define a PPO MlpPolicy architecture?
# Use MultiLayerPerceptron (MLPPolicy) because the input is a vector.
model = [fill this in]

NOTE: the lab expects you to have read the associated documentation, but the answer they give goes a little above and beyond what a new person would likely figure out.
SPOILER: solution immediately below this line

# We added some parameters to fasten the training
model = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose=1)

My guesses:
- I don't remember which step count n_steps is
- batch_size is usually the number of runs that are calculated in parallel when the model gradients are backpropagated to reduce the training loss
- n_epochs is usually the number of times the data is run over; this likely means something specific in the context of rl
- gamma was mentioned earlier as the discounting factor for the reward, I think
- not familiar with gae_lambda or ent_coef at this time

I have very little experience with these things, and should have read the documentation!

Step 6: Train the agent for 500k timesteps, appropriately for available hardware acceleration. It says this takes approximately 10 minutes on a colab GPU, which in my session is a Tesla K80. (I executed "!nvidia-smi".) The lab says you can use fewer timesteps if you want to just try it out, and recommends a break is taken during training.

Here's the template:

# TODO: Train it for enough timesteps

# TODO: Specify file name for model and save the model to file
model_name = ""
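Before going to the docs, here's my first guess at filling in that template, reusing the spoiler model above; the file name is just something I made up, not the lab's expected answer:

# my own attempt at the Step 6 template, not the lab's official solution
model.learn(total_timesteps=500000)

model_name = "ppo-LunarLander-v2"  # arbitrary name I picked
model.save(model_name)             # stable-baselines3 writes ppo-LunarLander-v2.zip

# it should reload later with: model = PPO.load(model_name, env=env)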
Like an RL model, I have minimal working memory nowadays. So I'll need some docs to solve this model stuff. The lab says to read them. The challenge is to properly instantiate a PPO MlpPolicy model, and then to train it on a gym environment for 500k timesteps.

Lunar Lander environment documentation: https://www.gymlibrary.ml/environments/box2d/lunar_lander "check the documentation"
Stable Baselines 3 documentation: https://stable-baselines3.readthedocs.io/en/master "dive in and try some tutorials"
SB3 PPO documentation, I left this link out: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example "you'll study it during this course"
Let's build a mind control corporation and then design a mind control program for governments to stop crime with! We'll call it "emergency safety parameters", and sell national votes in the underground market. I'm _so_ excited about this business plan. Lunar Lander Documentation: https://www.gymlibrary.ml/environments/box2d/lunar_lander
This environment is part of the Box2D environments. Please read that page first for general information.
https://www.gymlibrary.ml/environments/box2d/

Box2D
- Bipedal Walker
- Car Racing
- Lunar Lander

These environments all involve toy games based around physics control, using box2d based physics and PyGame based rendering. These environments were contributed back in the early days of Gym by Oleg Klimov, and have become popular toy benchmarks ever since. All environments are highly configurable via arguments specified in each environment's documentation. The unique dependencies for this set of environments can be installed via:

pip install gym[box2d]
Ok, that page was so short!!! Back to: Lunar Lander Documentation: https://www.gymlibrary.ml/environments/box2d/lunar_lander
Action Space: Discrete(4) -- action is 1 of 4 integers
Observation Shape: (8,) -- observation space is an unbounded 8-vector
Observation High: [inf inf inf inf inf inf inf inf]
Observation Low: [-inf -inf -inf -inf -inf -inf -inf -inf]
Import: gym.make("LunarLander-v2")
Description This environment is a classic rocket trajectory optimization problem. According to Pontryagin’s maximum principle, it is optimal to fire the engine at full throttle or turn it off. This is the reason why this environment has discrete actions: engine on or off. Aww shouldn't the model learn this?
There are two environment versions: discrete or continuous. The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Landing outside of the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt.
To see a heuristic landing, run:
python gym/envs/box2d/lunar_lander.py

Otherwise known as:

pip3 install gym[box2d] && python3 -m gym.envs.box2d.lunar_lander  # i think
Action Space: There are four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
Observation Space: There are 8 states: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
Rewards: Reward for moving from the top of the screen to the landing pad and coming to rest is about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes, it receives an additional -100 points. If it comes to rest, it receives an additional +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points. This is very very similar to the text from huggingface's lab.
Starting State: The lander starts at the top center of the viewport with a random initial force applied to its center of mass.
Episode Termination: The episode finishes if: 1. the lander crashes (the lander body gets in contact with the moon); 2. the lander gets outside of the viewport (x coordinate is greater than 1); 3. the lander is not awake. From the Box2D docs, a body which is not awake is a body which doesn't move and doesn't collide with any other body:
When Box2D determines that a body (or group of bodies) has come to rest, the body enters a sleep state which has very little CPU overhead. If a body is awake and collides with a sleeping body, then the sleeping body wakes up. Bodies will also wake up if a joint or contact attached to them is destroyed.
Arguments: To use the continuous environment, you need to specify the continuous=True argument like below:

import gym
env = gym.make("LunarLander-v2", continuous=True)
They don't say what the continuous environment is. It seems like source code is still a better resource than documentation. When installed with pip in linux, the environment source is at ~/.local/lib/python3.*/site-packages/gym/envs/box2d/lunar_lander.py for me. On the web, that's https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py . It looks like the documentation on the web is not up to date or is truncated for some reason. The documentation in the source code does indeed continue:
If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the action space will be `Box(-1, +1, (2,), dtype=np.float32)`. The first coordinate of an action determines the throttle of the main engine, while the second coordinate specifies the throttle of the lateral boosters. Given an action `np.array([main, lateral])`, the main engine will be turned off completely if `main < 0` and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the main engine doesn't work with less than 50% power). Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left booster will fire, and if `lateral > 0.5`, the right booster will fire. Again, the throttle scales affinely from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively). `gravity` dictates the gravitational constant, this is bounded to be within 0 and -12. If `enable_wind=True` is passed, there will be wind effects applied to the lander. The wind is generated using the function `tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`. `k` is set to 0.01. `C` is sampled randomly between -9999 and 9999. `wind_power` dictates the maximum magnitude of wind.
So, you can indeed provide a harder challenge to the agent, by using continuous=True and/or enable_wind=True . Like usual, they thought of my concern. This appears to roughly be the full documentation of the LunarLander environment (v2).
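To make the continuous mode concrete, here's a tiny sketch of my own (not from the docs) of stepping it with a hand-picked action:

import gym
import numpy as np

# continuous version: an action is np.array([main, lateral]) with both values in [-1, 1]
env = gym.make("LunarLander-v2", continuous=True)
obs = env.reset()

# per the docs above: main < 0 keeps the main engine off, and lateral in (-0.5, 0.5)
# keeps the side boosters off, so this fires only the main engine at roughly 75% throttle
action = np.array([0.5, 0.0], dtype=np.float32)
obs, reward, done, info = env.step(action)
print(env.action_space)  # should be Box(-1.0, 1.0, (2,), float32)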
Next might be: Stable Baselines 3 documentation: https://stable-baselines3.readthedocs.io/en/master "dive in and try some tutorials" I might need some more story to continue, unsure.
I am _so_ excited to mind control myself to recruit an infinite number of subhive workers for the Borg Queen! I am _totally_ going to get a bigger infinite number of workers than my peers.
Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.
Github repository: https://github.com/DLR-RM/stable-baselines3
Paper: https://jmlr.org/papers/volume22/20-1364/20-1364.pdf
RL Baselines3 Zoo (training framework for SB3): https://github.com/DLR-RM/rl-baselines3-zoo

RL Baselines3 Zoo provides a collection of pre-trained agents, scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
Hyperparameters are the mystic mumbo-jumbo where users manually configure whether or not the model actually works, rather than having it learn to configure itself. Like the "gamma" parameter we learned for discounting the reward.
SB3 Contrib (experimental RL code, latest algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Main Features
- Unified structure for all algorithms
- PEP8 compliant (unified code style)
- Documented functions and classes
- Tests, high code coverage and type hints
- Clean code
- Tensorboard support
- The performance of each algorithm was tested (see Results section in their respective page)
[snip]
Getting Started Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.
Here is a quick example of how to train and run A2C on a CartPole environment:
_whew_ the example uses cartpole instead of lunarlander. There's some training code in the example, like the lab asks for. Maybe I'll type that all into something by hand and see if it works.
import gym
from stable_baselines3 import A2C
$ pip3 install stable_baselines3 gym[box2d]

It takes some time to load the A2C import.
env = gym.make('CartPole-v1')
model = A2C('MlpPolicy', env, verbose=1)
I think I recall the course saying 'MlpPolicy' means the _input_ is a _vector_ (rather than an image), but I'm not certain on this. Of course here they are using an algorithm called A2C instead of PPO.
model.learn(total_timesteps=10000)
Okay, this appears to be where it trains. It spends some time training (not too long on this local system), and outputs data regularly. I wonder if it is data every epoch. Here's the first output:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 15.6     |
|    ep_rew_mean        | 15.6     |
| time/                 |          |
|    fps                | 466      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.678   |
|    explained_variance | 0.0911   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.8      |
|    value_loss         | 8.88     |
------------------------------------

It shows right in a fixed-width font. I'm guessing ep_rew_mean is the mean reward over the episode. It looks like it's doing 466 steps per second. It says 100 iterations and 500 total timesteps, so there's something getting multiplied by 5 somewhere. Could it be running 5 environments in parallel? It seems I didn't tell it to do that though? (If I'm reading the SB3 docs right, A2C's default n_steps is 5, so each iteration collects 5 timesteps from the one environment; that would explain the factor without any parallel environments.) The train information likely relates to machine learning.

Here's the final update:

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 53.7     |
|    ep_rew_mean        | 53.7     |
| time/                 |          |
|    fps                | 676      |
|    iterations         | 2000     |
|    time_elapsed       | 14       |
|    total_timesteps    | 10000    |
| train/                |          |
|    entropy_loss       | -0.535   |
|    explained_variance | 2.62e-05 |
|    learning_rate      | 0.0007   |
|    n_updates          | 1999     |
|    policy_loss        | 0.157    |
|    value_loss         | 0.336    |
------------------------------------

It got the mean reward to rise from 15.6 to 53.7. So it doesn't appear to be learning to fail harder and harder each episode, at least. I often have that problem, myself. The function returned an object of type A2C. Lemme paste that function again that made it learn everything after construction:
model.learn(total_timesteps=10000)
And I'd better pair it with the ones that create an environment and model:
env = gym.make('CartPole-v1') model = A2C('MlpPolicy', env, verbose=1)
These are what the lab is testing me on!

gym.make(environment_name)                       # returns a prefab'd environment
PolicyClass('MlpPolicy', env, verbose=1, ...)    # makes a markov agent model
model.learn(total_timesteps=timestep_count, ...) # trains a policy model on the environment it was made with

10k timesteps in stable baselines' first example, which was A2C.
obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
This looks like boilerplate for displaying what the policy learned to do in the environment, after training.
Refresher:

env = gym.make(environment_name)
model = PolicyClass('MlpPolicy', env, verbose=1, ...)  # 'MlpPolicy' for vector observations
model.learn(total_timesteps=100000, ...)
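The notebook also imported evaluate_policy earlier; if I'm reading the SB3 docs right, checking a trained model's mean reward looks roughly like this (my own sketch, untested here):

from stable_baselines3.common.evaluation import evaluate_policy

# run the trained model over a few fresh episodes and report mean +/- std reward
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")  # LunarLander counts as solved around 200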
Maybe I'll read the next page of "getting started" and then jump to PPO and the api interfaces. https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html

RL Resource Page: https://stable-baselines3.readthedocs.io/en/master/guide/rl.html

Oh! a normative tutorial: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3
Reinforcement Learning differs from other machine learning methods in several ways. The data used to train the agent is collected through interactions with the environment by the agent itself (compared to supervised learning where you have a fixed dataset for instance). This dependence can lead to vicious circle: if the agent collects poor quality data (e.g., trajectories with no rewards), then it will not improve and continue to amass bad trajectories.
This factor, among others, explains that results in RL may vary from one run to another (i.e., when only the seed of the pseudo-random generator changes). For this reason, you should always do several runs to have quantitative results.
Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3) normally require little hyperparameter tuning, however, don’t expect the default ones to work on any environment.
Therefore, we highly recommend you to take a look at the RL zoo (or the original papers) for tuned hyperparameters. A best practice when you apply RL to a new problem is to do automatic hyperparameter optimization. Again, this is included in the RL zoo.
When applying RL to a custom problem, you should always normalize the input to the agent (e.g. using VecNormalize for PPO/A2C) and look at common preprocessing done on other environments (e.g. for Atari, frame-stack, …). Please refer to Tips and Tricks when creating a custom environment paragraph below for more advice related to custom environments.
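I haven't tried it, but from the SB3 docs that VecNormalize wrapping would look something like this on the vectorised LunarLander environment from earlier (my sketch, not the lab's):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# normalize observations (and rewards) on a vectorised environment, as the tips page suggests
venv = make_vec_env('LunarLander-v2', n_envs=16)
venv = VecNormalize(venv, norm_obs=True, norm_reward=True)
# if a model is trained on this, the normalization statistics have to be saved and
# reloaded too (venv.save(...) / VecNormalize.load(...)), if I recall correctly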
Continues around https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html#how-to... (skipping the limitations section which could feel discouraging. the SB3 architectures are tuned for long and diverse training times.) The first tutorial is at https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3...
Daydream side thread: what they did here was make the problem have a certain kind of difficulty and ease by presetting a reward schedule. Maybe ideally a trained model would decide the reward schedule, which maybe ideally would be part of this model I suppose, or maybe that would be unclear.
Although this is of course useful thinking that I value, I think my overconfident affect might be related to making cypherpunks look like they don't know what they're talking about. Something I might want to think about more. I guess if bitcoin was made by this community, though, that might undercut the attempt. Maybe it's all fine for now ... spam and all.
On Sun, May 8, 2022, 7:18 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
> Welcome, future Mind Control Business Owners! Our hives of slaves will be the most powerful in the galaxy, all stewarded under the caring heart of Borg Queen Figurehead Trump (todo: add more political figureheads, do any claim to run the world?).

We could update "caring heart" to maybe "gentle whip" with a norm of 1.0 reward, but brief losses for the duration of whatever the most important unmet goal is.