[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2
1017 The Optuna tuning from Unit 1 is still running. I have opened up Unit 2 at https://github.com/huggingface/deep-rl-class/tree/main/unit2 . The first reading is https://huggingface.co/blog/deep-rl-q-part1 . I'm holding the intention of reading this first reading. I am planning to write parts here to help me move forward on it.
1020 I have read that this exercise will be about studying value-based methods, and a specific algorithm called Q-learning.
1021 I have read that the unit is divided into 2 parts. The first part covers value-based methods, and the second part Q-learning. I have also read that two environments will be solved, and that both involve navigating a small grid with an agent.
1023 I have moved through the introduction. It also listed some of the subparts of the unit. It described Q-learning as the first algorithm able to beat humans at some video games, and it roughly said that this unit is important if you want to be able to work with Q-learning algorithms. My perception is that Q-learning is less useful than PPO; I could be wrong. This perception creates difficulty for me.
1025 I have begun reading the first section of the intro, called "What is RL?" at https://huggingface.co/blog/deep-rl-q-part1#what-is-rl-a-short-recap
1028 I have mostly read that section. Most of it was a recap that reinforcement learning uses a policy to prioritise actions based on observations of an environment. Policy-based methods are described as training a policy directly. Value-based methods are described as breaking the environment into distinguishable states, learning the value of the states, and selecting actions that lead toward the highest-valued states. The next section is "The two types of value-based methods" at https://huggingface.co/blog/deep-rl-q-part1#the-two-types-of-value-based-met... .
1031 I'm taking notes here as I read the section. Rewards may be discounted so that rewards further in the future contribute less to a state's value [note: one of many heavily improvable heuristics]. A link is given to https://huggingface.co/blog/deep-rl-intro#rewards-and-the-discounting to review that. The value of each state is defined as the expected return if the agent starts at that state and acts according to the policy. This sounds like a definition to me that is roughly on track. [1034] Value-based methods still have policies for selecting actions given states. A "Greedy" policy is one that selects the action with the highest estimated value. Usually in value-based methods, what's called an Epsilon-Greedy Policy is used to manage the exploration/exploitation tradeoff. The value of the states is called a "value function"; the "policy" is the selection of actions from the states. The value function and policy work together to refine the accuracy of state values and action selection.
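Below is a minimal Python sketch of how I picture the Epsilon-Greedy selection, assuming a value table keyed by (state, action) pairs; the names and the table format are my own assumptions, not from the reading.

import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """With probability epsilon, explore by taking a random action;
    otherwise exploit by taking the action with the highest estimated value.
    q_table is assumed to map (state, action) pairs to value estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                                  # explore
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))    # exploit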
1038 I am now on the state-value function section at https://huggingface.co/blog/deep-rl-q-part1#the-state-value-function . The bit of information I missed writing in the last section was that in value-based methods, the policy is defined by hand, whereas the value function is the trained neural network; in policy-based methods, the policy itself is the neural network. [limiting hardcoded heuristics]
1041 State-value function: V_Pi(s) = E_Pi[G_t | S_t = s]. The value of state s under policy Pi is the expected return if the agent starts at state s and follows the policy. The equation doesn't seem very helpful. Second description of equation: for each state, the state-value function outputs the expected return if the agent starts in that state and then follows the policy forever after. A graphic is shown of a mouse in a tiny maze finding cheese. Each square has a number representing the negative of the number of steps needed to reach the cheese. That step count is the value. At this point, it is pretty easy to imagine a function recursively updating the values of every square in such a maze until they stabilise, to arrive at the picture.
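As a note to myself, here is a rough Python sketch of that recursive stabilisation, assuming a reward of -1 per step, no discounting, and a greedy move toward the cheese; the grid layout and names are my own illustration, not from the reading.

CHEESE = (2, 2)
SQUARES = [(r, c) for r in range(3) for c in range(3)]

def neighbors(sq):
    r, c = sq
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [n for n in candidates if n in SQUARES]

values = {sq: 0.0 for sq in SQUARES}   # every square starts at value 0
for _ in range(100):                   # sweep until the numbers stop changing
    new = {}
    for sq in SQUARES:
        if sq == CHEESE:
            new[sq] = 0.0              # the cheese square stays at 0
        else:
            # best neighbour: pay -1 for the step, then take its current value
            new[sq] = max(-1.0 + values[n] for n in neighbors(sq))
    if new == values:
        break
    values = new
# each values[sq] ends up as minus the number of steps from sq to the cheese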
1044 The Action-Value function https://huggingface.co/blog/deep-rl-q-part1#the-action-value-function
1045 For each state and action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after. So the action-value function is doing the same prediction task as the state-value function. They give another equation that I find confusing, Q_Pi(s,a) = E_Pi[G_t | S_t = s, A_t = a], which again says that the value is the expected cumulative return after starting there. The chart of the mouse with the cheese is present again. The numbers are the same, but each number is associated with a direction of travel out of a square to another. The difference between the state-value function and the action-value function is that the state-value function outputs the value of a state, whereas the action-value function outputs the value of a state-action pair: the value of taking that action at that state. The section ends saying that the Bellman equation can help resolve the problem of summing all the rewards an agent can get if it starts at a state, to calculate the action-value or state-value.
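A tiny sketch of the distinction as I understand it, keyed the way I would store the two functions; the states, actions, and the greedy assumption below are mine.

# State-value: one number per state.
V = {"s0": 0.0, "s1": 0.0}

# Action-value: one number per (state, action) pair.
Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def v_from_q(q_table, state, actions):
    # assumption: under a greedy policy, a state's value is its best action's value
    return max(q_table[(state, a)] for a in actions)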
1050 The Bellman Equation https://huggingface.co/blog/deep-rl-q-part1#the-bellman-equation-simplify-ou...
1052 I reviewed the help desk to get their aid staying on task. They might need to add something to their FAQ, not sure, or maybe reorder it. The Bellman Equation simplifies the calculation of the state-value and the state-action value. The examples in this section are simplified, removing discounting of the reward. Note: it is not too hard to calculate the reward for a state in order to sum them; the environment provides this information. I may have confused the terms "reward" and "return" in earlier notes. The return is the sum of the rewards collected while following the policy. Bellman Equation: V(S_t) = R_t+1 + gamma * V(S_t+1). The value of a state is the immediate reward plus the discounted value of the state that follows.
1101 uh anyway the Bellman equation is just a recursive statement of the definition of value. Rather than summing all the following rewards, it is most helpful to consider the value as this step's reward plus the return that follows. The next section is Monte Carlo vs Temporal Difference Learning: https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-vs-temporal-differen...
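A minimal sketch of that one-step calculation, with made-up numbers (the reward, gamma, and next-state value below are my own placeholders, not from the reading):

gamma = 0.9          # discount rate (the reading's examples drop this by using 1)
reward_next = 1.0    # R_t+1: the reward collected on the next step
value_next = 5.0     # V(S_t+1): current estimate of the next state's value
value_here = reward_next + gamma * value_next   # V(S_t) = 1 + 0.9 * 5 = 5.5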
1104
- this is the last section of part 1
- there are two ways of learning: Monte Carlo and Temporal Difference Learning are two different training strategies based on the experiences of the agent. Monte Carlo uses an entire episode of experience. Temporal Difference uses a single step (a quadruple of state, action, reward, next state). One of the sentences could imply that these might also apply to policy-based approaches.
1106 Monte Carlo https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-learning-at-the-end-...
The Monte Carlo approach waits until the episode is over, calculates the entire return, and updates the value function based on this entire return. This means the value function is only updated after an entire episode. An example mentions that the Epsilon Greedy Strategy alternates between exploration and exploitation, where exploration is random actions. The example uses episodes limited to enough steps to visit all the areas with reward. The Monte Carlo approach accumulates the states and rewards over the episode, sums the rewards into the return, updates the value function, then repeats. Each repetition the estimates improve.
- at the start the value of every state is 0, since the value model is untrained
- it provides parameters: learning rate = 0.1 (how aggressively the values are updated), and discount rate (gamma) = 1 (no discount, since it's a multiplier)
- this results in random exploration at first
In the example, the sum of rewards is 3. The state-value function is depicted as updated like this: V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)], where V(S_t) is the state-value function, alpha is the learning rate, G_t is the return, and t is the timestep. So with a return of 3 and a learning rate of 0.1, the new value for the starting state becomes 0.3.
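A quick sketch reproducing that arithmetic; the variable names are mine, the numbers are the ones from the example.

alpha = 0.1        # learning rate
G = 3.0            # return: sum of the rewards over the whole episode
V_start = 0.0      # value of the starting state before the update
V_start = V_start + alpha * (G - V_start)
print(V_start)     # 0.3, matching the example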
1117 Temporal Difference Learning https://huggingface.co/blog/deep-rl-q-part1#temporal-difference-learning-lea...
1118 Temporal difference updates the value function after every step: V(S_t) <- V(S_t) + alpha * [R_t+1 + gamma * V(S_t+1) - V(S_t)], where R is the reward. Basically, the old value function is applied to the next state to estimate the value update for the preceding state, and the Bellman equation simplifies the calculation. [seems it would be helpful to derive a relation between these two expressions] They list it out a little like pseudocode; it's not super clear. I'm a little confused about whether they're updating the current state or the following state, and which state the reward is associated with ... I infer the reward is associated with the next state. So each step, the value function for the state being left can be updated with the reward from the state being entered, I'm guessing. They're considering the value function to be a map from states to real numbers, and simply adjusting an entry by the learning rate times the change in estimated return. They go through the same example with this approach. I'll probably skip to the section summary after it, which doesn't have its own heading, but continues through to the end of the blog.
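A minimal sketch of one such step, keeping the reading's notation; the states, reward, and numbers below are mine. It also shows the relation I wanted: the TD target R_t+1 + gamma * V(S_t+1) stands in for the full return G_t that Monte Carlo would use.

alpha, gamma = 0.1, 1.0
V = {"s0": 0.0, "s1": 0.0}   # value function as a map from states to numbers

reward = 1.0                 # R_t+1: reward received on entering s1 from s0
# TD target: estimate of the return from s0, using the old estimate of V(s1)
td_target = reward + gamma * V["s1"]
# update the state being left, just like Monte Carlo but with td_target in place of G_t
V["s0"] = V["s0"] + alpha * (td_target - V["s0"])
print(V["s0"])               # 0.1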
1129 Summary
There are two types of value-based functions:
- the State-Value function gives the value of every state
- the Action-Value function gives the value of specific actions leaving specific states.
There are two methods used to learn a value function:
- assuming the return does not rely on the timestep or the path taken, the Monte Carlo approach uses the complete, accurate return, but it only updates after a complete episode
- with TD learning, the value function is updated every step, but the return is estimated as next_reward + discount * old_next_state_value (discount is gamma).
The reading states that it is normal if the parts are still all confusing, and that this is fine. It does say to take time to grasp it before moving on. I did not include all the terms and equations in my notes. There is a link for feedback at https://forms.gle/3HgA7bEHwAmmLfwh9 . There is a quiz. I remember now there is also a quiz for unit 1. I am holding the intention of going back and doing quiz 1, which I don't remember well.
1138 reviewing quiz 1, back from unit 1 https://github.com/huggingface/deep-rl-class/blob/main/unit1/quiz.md
1139 Q1: What is Reinforcement Learning?
My guess: a strategy for automatically accomplishing tasks by training policies to select actions from observations of an environment so as to maximize their reward.
Q2: Define the RL Loop
- Our Agent receives ____ from the environment. guess: an observation
- Based on that ____ the Agent takes an _____ guess: observation, action
- Our Agent will move to the right
- The Environment goes to a _____ guess: new state
- The Environment gives ____ to the Agent guess: reward
Solution:
- Our Agent receives state s0 from the environment
- Based on that state s0 the Agent takes an action a0
- Our Agent will move to the right
- The Environment goes to a new state s1
- The Environment gives a reward r1 to the Agent
Noting: the reward is associated with the new state the environment moved to. There is no r0.
Q3: What's the difference between a state and an observation?
guess: The state is the entire situation of the environment. The observation is what the Agent receives.
guess 2: No difference
Q4: A task is an instance of a Reinforcement Learning problem. What are the two types of tasks?
I don't know this one. I guess looking it up would be a helpful behavior. I looked it up. They can be episodic or continuous. An episodic task has a start state and a terminal state. A continuous task is ongoing without bounds.
Q5: What is the exploration/exploitation tradeoff?
guess: the dilemma an agent faces when deciding whether to engage in purportedly random actions so as to gather more data, or select those it knows of with the highest return
Q6: What is a policy?
guess: A function or trained model which selects actions based on state.
Q7: What are value-based methods?
guess: approaches to RL where each state is associated with a value, and actions are selected to move toward the highest-valued states.
Q8: What are policy-based methods?
guess: approaches to RL where actions are selected directly, rather than making a formal association with state and value
1148 solutions, excluding Q2 where I looked
Q1: What is Reinforcement Learning?
My guess: a strategy for automatically accomplishing tasks by training policies to select actions from observations of an environment so as to maximize their reward.
solution: a framework for solving control tasks or decision problems, by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback. https://huggingface.co/blog/deep-rl-intro#a-formal-definition
Q2: in last message. information at https://huggingface.co/blog/deep-rl-intro#the-rl-process
Q3: What's the difference between a state and an observation?
guess: The state is the entire situation of the environment. The observation is what the Agent receives.
guess 2: No difference
solution: The state is a complete description of the state of the world, without hidden information, in a fully observed environment. The observation is a partial description of the state, in a partially observed environment. https://huggingface.co/blog/deep-rl-intro#observationsstates-space
[X] Q4: A task is an instance of a Reinforcement Learning problem. What are the two types of tasks?
episodic or continuous
solution: Episodic task: we have a starting point and an ending point. Continuous task: these are tasks that continue forever. https://huggingface.co/blog/deep-rl-intro#type-of-tasks
Q5: What is the exploration/exploitation tradeoff?
guess: the dilemma an agent faces when deciding whether to engage in purportedly random actions so as to gather more data, or select those it knows of with the highest return
solution: The need to balance how much we explore the environment and how much we exploit what we know. Exploration is trying random actions to find more information. Exploitation is exploiting known information to maximize reward. https://huggingface.co/blog/deep-rl-intro#exploration-exploitation-tradeoff
[X] Q6: What is a policy?
guess: A function or trained model which selects actions based on state.
solution: Policy Pi is the brain of the Agent. A function that tells what action to take given the state. It defines the agent's behavior at a given time. https://huggingface.co/blog/deep-rl-intro#the-policy-%CF%80-the-agents-brain
Q7: What are value-based methods?
guess: approaches to RL where each state is associated with a value, and actions are selected to move toward the highest-valued states.
solution: Value-based methods are one of the main approaches. A value function is trained instead of a policy function; it maps a state to the expected value of being there. https://huggingface.co/blog/deep-rl-intro#value-based-methods
Q8: What are policy-based methods?
guess: approaches to RL where actions are selected directly, rather than making a formal association with state and value
solution: A policy function is learned directly, to map from each state to the best corresponding action, or a probability distribution over the set of possible actions at that state. https://huggingface.co/blog/deep-rl-intro#value-based-methods
1203 Unit 1 Quiz 2 https://huggingface.co/blog/deep-rl-intro#value-based-methods
Q1: What are the two main approaches to find optimal policy?
guess: [i'm spending some time thinking. monte carlo and TD are nearest in my mind, and this was before that.] value-based, where a policy is trained to update values for each state, and policy-based, where a policy is trained to directly select actions.
Q2: What is the Bellman Equation?
guess: a way to calculate the return for a value-based state given [stepped away to do something, now 1211] its reward and the return for the next state, rather than exhaustively summing all of them
Q3: Define each part of the Bellman Equation [stepped away again. it's now 1217]
V_Pi(s) = E_Pi[R_t+1 + gamma * V_Pi(S_t+1) | S_t = s]
V_Pi(s): ____ guess: The value policy function for a state
E_Pi[R_t+1: ____ guess: The estimated policy return
gamma * V_Pi(S_t+1): ______ guess: The discounted return for the next state
S_t = s: ____ guess: for a complete episode that started at this state
Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
guess: Temporal Difference methods use entire accumulated episodes with accurate returns calculated, to update the value policy function. Monte Carlo methods use estimated returns based on past experience, to update the value policy every step.
Q5: Define each part of the Temporal Difference learning formula
V(S_t) <- V(S_t) + alpha * [R_t+1 + gamma * V(S_t+1) - V(S_t)]
V(S_t): _____
V(S_t): _____
alpha: _____
R_t+1: _____
gamma * V(S_t+1): _____
V(S_t): _____
[R_t+1 + gamma * V(S_t+1)]: _____
[something unexpected is happening. i'm stepping away from this. 1222]
Date: 2022-06-24 Time zone: America/Eastern UTC-5 1238 I have returned to a state similar to the one I was in yesterday, before I began this.