-------------
I'm looking through the reading for unit 1. I've seen some of this before and it seems unnecessary, as is usually the case with first units.
So far it's mostly definitions of terms.
Example of deep RL: a general-purpose RL algorithm trains a robot hand in inaccurate simulations, and the result transfers to the physical robot on the first go, letting it dexterously manipulate objects with its fingers:
https://openai.com/blog/learning-dexterity/
Deep RL is taught in a way that harshly separates the use from the implementation. It is of course a hacker's job to break down that separation.
Update loop (works better if many of these happen in parallel):
State -> Action -> Reward -> Next State
The agent accumulates reward over many updates, and then uses that cumulative reward (the return, whose expectation is the "expected return") to update the model that chose the actions.
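For concreteness, a minimal sketch of that loop, assuming the gymnasium API (CartPole-v1 is just a stand-in environment; any env follows the same loop):

```python
import gymnasium as gym

# State -> Action -> Reward -> Next State, repeated until the episode ends.
env = gym.make("CartPole-v1")
state, info = env.reset()

cumulative_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # placeholder for the policy's choice
    next_state, reward, terminated, truncated, info = env.step(action)
    cumulative_reward += reward          # accumulate reward across steps
    state = next_state
    done = terminated or truncated

env.close()
print("return for this episode:", cumulative_reward)
```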
"Reward hypothesis" : All goals can be expressed as the maximization of an expected return.
"Markov Decision Process" : An academic term for reinforcement learning.
"Markov Property" : A property an agent has if it does not need to be provided with historical information, and can operate only on current state to succeed.
To me, the Markov property implies that an agent stores sufficient information within itself in order to improve, but this is not stated in the material. The Markov property seems like a user-focused worry to me, at this point.
"State" : A description of all areas of the system the agent is within.
"Observation" : Information on only part of the system the agent is within, such as areas near it, or from local sensors.
"Action space" : The set of all actions the agent may take in its environment.
"Discrete action space" : An action space that is finite and completely enumerable.
"Continuous action space" : An action space that is effectively infinite [and subdivisible].
Some RL algorithms are specifically better suited to discrete action spaces, others to continuous ones.
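A quick sketch of what the two kinds of action space look like, using gymnasium's space objects (the sizes and bounds here are arbitrary):

```python
from gymnasium.spaces import Discrete, Box

# Discrete action space: a finite, fully enumerable set of actions (here, 4 of them).
discrete_space = Discrete(4)
print(discrete_space.sample())      # one of 0, 1, 2, 3

# Continuous action space: real-valued actions inside a box, effectively infinite.
continuous_space = Box(low=-1.0, high=1.0, shape=(2,))
print(continuous_space.sample())    # e.g. [ 0.37 -0.81]
```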
"Cumulative Reward"
The cumulative reward to maximize is defined as the sum of all rewards from the start of the event loop to the end of time itself.
Since the system has limited time, and rewards far in the future are less certain than rewards arriving soon, "discounting" is used.
"Discount rate" or "gamma" : Usually between 0.99 and 0.95.
When gamma is high, the discounting is lower, and the agent prioritises long term reward. When gamma is low, the discounting is higher, and the agent prioritises short term reward.
When calculating the cumulative reward, each reward is multiplied by gamma to the power of the time step.
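A small sketch of that calculation (the reward list and gamma values are only illustrative):

```python
# Discounted return: each reward is multiplied by gamma raised to its time step,
# i.e. G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.95))  # 1 + 0.95 + 0.9025 = 2.8525
```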
"Task" : an instance of a reinforcement learning problem
"Episodic task" : a task with a clear starting and ending point, such as a level in a video game
"Continuous task" : an unending task where there is no opportunity to learn over completion of the task, like stock trading
"Exploration" : spending time doing random actions to learn more information about the environment
"Exploitation" : using known information to maximize reward
There are different ways of handling the trade-off between exploitation and exploration (one common rule is sketched below). Roughly: with too much exploitation the agent never leaves its familiar corner of the environment and keeps picking the most immediately rewarding thing, whereas while exploring it accepts poor reward for a while to see if it can find better reward.
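One common way of making that trade-off, not claimed by the reading itself, is epsilon-greedy: explore with a small probability, exploit otherwise. The q_values dictionary below is hypothetical:

```python
import random

# Epsilon-greedy action selection: with probability epsilon, explore by picking a
# random action; otherwise exploit by picking the action with the highest estimated
# value. q_values is a hypothetical mapping from actions to current value estimates.
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore
    return max(q_values, key=q_values.get)      # exploit

print(epsilon_greedy({"left": 0.2, "right": 0.7}, epsilon=0.1))  # usually "right"
```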
"Policy" or "pi" : The function that selects actions based on states. The optimal policy pi* is found based on training.
"Direct training" or "Policy-based methods" : The agent is taught which action to perform, given its state.
"Indirect training" or "Value-based methods" : The agent is taught which states are valuable, and then selects actions that lead to these states.
[ed: this seems obviously a spectrum of generality; it's a little irritating that direct training is mentioned at all, and that nothing further is listed after indirect training. Maybe I am misunderstanding something. I'm letting my more general ideas about how to approach this fall by the wayside a little, because I haven't been able to do anything with this stuff my entire life, so this being valid to pursue seems useful. The stuff below would be generalised and combined into a graph that feeds back to itself to change its shape (or its metaness), basically.]
Policy Based Methods map each state to the best corresponding action, or to a probability distribution over actions.
"Deterministic policy" : each state will always return the same action.
"Stochastic policy" : each state produces a probability distribution of actions
Value Based Methods map each state to the expected value of being in that state. The value of a state is the expected return when starting from that state and then following the policy of moving toward the highest-value states.
The description of value-based methods glosses over (doesn't mention) the need to retain a map of how actions move the agent through states; see the sketch below.
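To illustrate that point: acting greedily on state values needs some notion of which state each action leads to. A toy sketch, with both dictionaries hypothetical:

```python
# state_values estimates how good it is to be in each state; transitions records
# how actions move the agent between states (the part the description glosses over).
state_values = {"A": 0.0, "B": 0.5, "C": 1.0}
transitions = {
    "A": {"go_b": "B", "go_c": "C"},
    "B": {"go_a": "A", "go_c": "C"},
}

def greedy_action(state):
    # Choose the action whose resulting state has the highest estimated value.
    return max(transitions[state], key=lambda a: state_values[transitions[state][a]])

print(greedy_action("A"))  # "go_c", since C has the highest value
```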
"Deep reinforcement learning" : reinforcement learning that uses deep neural networks in its policy algorithm(s).
The next section will cover value-based Q-Learning, first classic Q-Learning and then Deep Q-Learning. The difference is whether the mapping Q from state-action pairs to values is stored as a table or approximated with a neural network.
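A tiny sketch of the tabular side of that difference (the states and actions are made up):

```python
from collections import defaultdict

# In classic Q-Learning, Q is literally a table mapping (state, action) pairs to
# values; Deep Q-Learning replaces this table with a neural network that
# approximates the same mapping.
Q_table = defaultdict(float)           # unseen (state, action) pairs default to 0.0
Q_table[("state_3", "jump")] = 0.7     # lookups and updates are plain dictionary accesses

print(Q_table[("state_3", "jump")])    # 0.7
print(Q_table[("state_9", "duck")])    # 0.0, never visited
```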
The homework for beginners is to make the moon lander succeed (a training sketch follows the expert homework below), and to dig into a little of the source code and recode it by hand to get more control over it (like, what would be needed to make an environment class with a different spec?)
The homework for experts is to do the tutorial offline, and to either privately train an agent that reaches the top of the leaderboards, or explain clearly why you did not.
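For the beginner homework, a minimal training sketch, assuming stable-baselines3 and gymnasium[box2d] are installed (the PPO choice, hyperparameters, and timestep count are illustrative, not a leaderboard recipe):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Train an agent on the moon lander. In newer gymnasium releases the env id may be
# "LunarLander-v3" instead of "LunarLander-v2".
env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("ppo-lunarlander")

# Run one episode with the trained policy.
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```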