[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2

Fri Jun 24 08:14:11 PDT 2022

1106 Monte Carlo
https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-learning-at-the-end-of-the-episode

The Monte Carlo approach waits until the episode is over, calculates the
entire return, and updates the value function using that return as the
target. This means the value function is only updated after an entire
episode.

An example mentions that the Epsilon Greedy Strategy alternates
between exploration and exploitation, where exploration is random
actions. The example uses episodes limited to enough steps to visit
all the areas with reward.
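
A minimal sketch of epsilon-greedy action selection, to make the
exploration/exploitation split concrete. The function name, the
q_values list, and the rng parameter are my own assumptions for
illustration and are not taken from the linked post.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    q_values, epsilon and rng are hypothetical names for this sketch.
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimate
    # (the first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 1 every action is random, and with epsilon = 0 the
choice is always greedy.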

The Monte Carlo approach records each state and reward during the
episode, sums the rewards into the return, uses that return to update
the value function, then starts a new episode. With each repetition
the value estimates improve.
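
The summing step could be sketched roughly as below. The reward list
is made up for illustration, and the convention that rewards[t] is the
reward received after the step at timestep t is an assumption of this
sketch.

```python
def episode_returns(rewards, gamma=1.0):
    """Compute the return G_t for every timestep of one finished episode.

    Assumes rewards[t] is the reward received after acting at timestep t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical episode; with gamma = 1 the first return is the plain sum.
print(episode_returns([1, 0, 1, 0, 1]))  # [3.0, 2.0, 2.0, 1.0, 1.0]
```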

- at the start the value of every state is 0, since the value model is untrained
- the example sets the parameters: learning rate = 0.1 (how
aggressively the model is trained) and discount rate (gamma) = 1 (no
discount, since gamma is a multiplier)
- with nothing learned yet, this results in random exploration

In the example, the sum of rewards is 3.
The update to the state-value function is depicted like this:
V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]

V(S_t) is the state value function
alpha is the learning rate
G_t is the return
t is the timestep

So with a return of 3, a learning rate of 0.1, and an initial value of
0, the new value of the starting state becomes
0 + 0.1 * (3 - 0) = 0.3.
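
The same arithmetic as a small sketch. The function and variable names
are my own. The result is rounded for printing because 0.1 * 3 is not
exactly 0.3 in floating point.

```python
def mc_update(value, episode_return, alpha):
    """One Monte Carlo update: V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]."""
    return value + alpha * (episode_return - value)

# Values from the example: V starts at 0, G_t = 3, alpha = 0.1.
v_start = mc_update(0.0, 3.0, 0.1)
print(round(v_start, 6))  # 0.3
```

Applying the update again with the same return would move the value
further toward 3, by one tenth of the remaining gap each time.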