[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2

Fri Jun 24 07:37:44 PDT 2022

1031 I'm taking notes here as I read the section.

The rewards the agent collects while following the policy may be
discounted, so that rewards further in the future count for less than
immediate ones [note: one of many heavily improvable heuristics]. A
link is given to
https://huggingface.co/blog/deep-rl-intro#rewards-and-the-discounting
to review that.
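
A minimal sketch of discounting, to make the idea concrete. The
function name, the reward list, and the discount factor gamma=0.5 are
made up for illustration; none of them come from the course section.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical rewards for three consecutive steps.
rewards = [1.0, 1.0, 1.0]

# With gamma=0.5 the weights are 1, 0.5, 0.25.
print(discounted_return(rewards, gamma=0.5))  # 1.75
```

With gamma=1.0 nothing is discounted and the result is the plain sum;
smaller gamma makes the agent care less about later rewards.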

The value of each state is defined as being the expected return if the
agent starts at that state and acts according to the policy. This
sounds to me like a definition that is roughly on track. [1034]
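
Written as a formula, that definition looks roughly like this. The
symbols are conventional textbook notation and are assumed here: G_t
is the discounted return from time t, gamma is the discount factor,
R is the reward, S_t is the state at time t, and pi is the policy.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\qquad
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
```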

Value-based methods have policies for selecting actions given states.
A "Greedy" policy is one that always selects the action with the
highest estimated value.

Usually in value-based methods, what's called an Epsilon-Greedy Policy
is used to manage the exploration/exploitation tradeoff: with a small
probability epsilon it picks a random action, and otherwise it acts
greedily.
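
A rough sketch of greedy versus epsilon-greedy selection, assuming the
value estimates for one state are held in a plain list indexed by
action. The function names and the numbers are hypothetical, not from
the course.

```python
import random


def greedy_action(q_values):
    """Index of the action with the highest estimated value (first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon (uniform random action), else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


# Made-up value estimates for three actions in a single state.
q_values = [0.1, 0.7, 0.3]

print(greedy_action(q_values))                       # 1
print(epsilon_greedy_action(q_values, epsilon=0.0))  # 1 (never explores)
```

Setting epsilon=0.0 reduces this to the greedy policy, and epsilon=1.0
makes every choice random; values in between trade exploitation for
exploration.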

The mapping from states to their values is called a "value function";
the "policy" is the selection of actions from the states. The value
function and policy work together: the value estimates drive action
selection, and the outcomes of those actions are used to refine the
value estimates.