-------------

I'm looking through the reading for unit 1. I've seen some of this before and it seems unnecessary, as is usually the case with first units. So far it's mostly definitions of terms.
[1]https://huggingface.co/blog/deep-rl-intro

Example of deep RL: a computer uses inaccurate simulations of a robot hand to train a robot to dexterously manipulate objects with its fingers, on the first go, using a general-purpose RL algorithm:
[2]https://openai.com/blog/learning-dexterity/

Deep RL is taught in a way that harshly separates the use from the implementation. It is of course a hacker's job to break down that separation.

Update loop (works better if many of these happen in parallel):

State -> Action -> Reward -> Next State

The agent accumulates the reward over many updates, and then uses that cumulative reward (or "expected return") to update the model that chose the actions. (A sketch of this loop is at the end of these definitions.)

"Reward hypothesis" : All goals can be expressed as the maximization of an expected return.

"Markov Decision Process" : The academic formalisation of the reinforcement learning problem.

"Markov property" : A property an agent has if it does not need to be provided with historical information, and can operate on only the current state to succeed. To me, the Markov property implies that an agent stores sufficient information within itself in order to improve, but this is not stated in the material. The Markov property seems like a user-focused worry to me, at this point.

"State" : A description of all areas of the system the agent is within.

"Observation" : Information on only part of the system the agent is within, such as areas near it, or from local sensors.

"Action space" : The set of all actions the agent may take in its environment.

"Discrete action space" : An action space that is finite and completely enumerable.

"Continuous action space" : An action space that is effectively infinite [and subdivisible].

Some RL algorithms are specifically better at working with discrete or with continuous action spaces.

"Cumulative reward" : The cumulative reward to maximize is defined as the sum of all rewards from the start of the event loop to the end of time itself. Since the system has limited time, and rewards far in the future are less certain, "discounting" is used.

"Discount rate" or "gamma" : Usually between 0.95 and 0.99. When gamma is high, the discounting is lower, and the agent prioritises long-term reward. When gamma is low, the discounting is higher, and the agent prioritises short-term reward. When calculating the cumulative reward, each reward is multiplied by gamma to the power of its time step (sketched in code below).

"Task" : An instance of a reinforcement learning problem.

"Episodic task" : A task with a clear starting and ending point, such as a level in a video game.

"Continuing task" : An unending task with no completion point to structure learning around, like stock trading.

"Exploration" : Spending time doing random actions to learn more information about the environment.

"Exploitation" : Using known information to maximize reward.

There are different ways of handling exploitation and exploration, but roughly: with too much exploitation the agent never leaves its initial environment and keeps picking the most rewarding immediate thing, whereas when exploring it spends time with poor reward to see if it finds better reward.

"Policy" or "pi" : The function that selects actions based on states. The optimal policy pi* is found through training.

"Direct training" or "Policy-based methods" : The agent is taught which action to perform, given its state.
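A minimal sketch of the update loop from above, written by me rather than taken from the course: it assumes the Gymnasium API (the course notebook uses gym; reset/step signatures differ slightly between the two) and the LunarLander-v2 environment from the unit's activity, with a random policy standing in for a learned one.

    import gymnasium as gym

    env = gym.make("LunarLander-v2")
    state, info = env.reset()

    rewards = []                            # kept so the return can be computed afterwards
    done = False
    while not done:
        action = env.action_space.sample()  # random policy standing in for pi
        state, reward, terminated, truncated, info = env.step(action)
        rewards.append(reward)
        done = terminated or truncated
    env.close()

And a sketch of discounting as I understand it from the gamma definition above, each reward weighted by gamma to the power of its time step (the function name is mine):

    def discounted_return(rewards, gamma=0.99):
        # gamma near 1 keeps distant rewards relevant;
        # a lower gamma makes the agent short-sighted
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    print(discounted_return([1.0, 1.0, 1.0], gamma=0.95))  # 1 + 0.95 + 0.9025 = 2.8525
    print(discounted_return(rewards))                      # return for the episode above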
"Indirect training" or "Value-based methods" : The agent is taught which states are valuable, and then selects actions that lead to these states. [ed: this seems obviously a spectrum of generality, it's a little irritating direct training is mentioned at all, and no further things listed after indirect training. maybe I am misunderstanding something. I'm letting my more general ideas as to how to approach this slip to the wayside a little, because I haven't been able to do anything with this stuff my entire life. so this being valid to pursue seems useful. the below stuff would be generalised and combined into a graph that feeds back to itself to change its shape (or its metaness), basically.] Policy Based Methods map from each state to the best corresponding action, or a probability distribution over those. "Deterministic policy" : each state will always return the same action. "Stochastic policy" : each state produces a probability distribution of actions Value Based Methods map each state to the expected value of being at the state. The value of a state is the expected return if starting from the state, according to the policy of traveling to the highest-value state. The description of value based methods glosses over (doesn't mention) the need to retain a map of how actions move the agent through states. "Deep reinforcement learning" : reinforcement learning that uses deep neural networks in its policy algorithm(s). The next section will engage value-based Q-Learning, first classic reinforcement and then Deep Q-Learning. The difference is in whether the mapping Q of action to value is made with a table or a neural network. The lesson refers to [3]https://course.fast.ai/ for more information on deep neural networks. The activity is training a moon lander at [4]https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit 1.ipynb . The homework for beginners is to make the moon lander succeed, and to go inside a little bit of the source code and recode it manually to get more control over it (like, what would be needed to make an environment class with a different spec?) The homework for experts is to do the tutorial offline, and to either privately train an agent that reaches the top of the leader boards, or explain clearly why you did not. The lesson states there is additional reading in the readme at [5]https://github.com/huggingface/deep-rl-class/blob/main/unit1/READ ME.md . References 1. https://huggingface.co/blog/deep-rl-intro 2. https://openai.com/blog/learning-dexterity/ 3. https://course.fast.ai/ 4. https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb 5. https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md