[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2

Fri Jun 24 07:44:33 PDT 2022

1041

State value function

V_Pi(s) = E_Pi[G_t|S_t = s]

The value of state s under policy Pi is the expected return when the
agent starts at state s.

The equation doesn't seem very helpful.

Second description of equation:
For each state, the state-value function outputs the expected return,
if the agent starts in that state, and then follows the policy forever
after.
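
To make that description concrete, here is a minimal sketch. None of
it comes from the course material: the five-state corridor, the
"always step right" policy, the reward of -1 per step and the names
GOAL, policy and rollout_return are all assumptions for illustration.
Because this toy policy and world are deterministic, the expectation
collapses to the return of a single rollout.

```python
# Hypothetical corridor: states 0..4, with the cheese at state 4.
GOAL = 4


def policy(state):
    # Assumed fixed policy: always take one step to the right.
    return +1


def rollout_return(start):
    """Follow the policy from `start` until the cheese; sum the rewards."""
    state, total = start, 0
    while state != GOAL:
        state += policy(state)
        total += -1  # assumed reward of -1 for every step taken
    return total


# State value of every corridor state under the assumed policy.
values = {s: rollout_return(s) for s in range(GOAL + 1)}
```

With a stochastic policy or environment, one would average many such
rollouts per start state instead of taking a single one.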

A graphic is shown of a mouse in a tiny maze finding cheese. Each
square has a number representing the negative number of steps needed
to reach the cheese. This number is the square's value.

At this point, it is pretty easy to imagine a function recursively
updating the values of every square in such a maze until they
stabilise, to arrive at the picture.
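
A rough sketch of what I imagine, written as repeated sweeps rather
than literal recursion. The 3x3 maze layout, the reward of -1 per step
and the name sweep_until_stable are my own assumptions, not the
course's picture or code. It also assumes every open square can reach
the cheese, otherwise the loop would never settle.

```python
# Hypothetical maze: '.' is open floor, '#' is a wall, 'C' is the cheese.
MAZE = [
    "..C",
    ".#.",
    "...",
]

MOVES = ((-1, 0), (1, 0), (0, -1), (0, 1))  # up, down, left, right


def sweep_until_stable(maze):
    """Update every square from its neighbours until no value changes."""
    open_cells = [
        (r, c)
        for r, row in enumerate(maze)
        for c, ch in enumerate(row)
        if ch != "#"
    ]
    # Every open square starts at 0; the cheese square stays at 0.
    values = {cell: 0 for cell in open_cells}
    while True:
        changed = False
        for r, c in open_cells:
            if maze[r][c] == "C":
                continue
            neighbours = [(r + dr, c + dc) for dr, dc in MOVES]
            reachable = [n for n in neighbours if n in values]
            if not reachable:
                continue  # an isolated square keeps its starting value
            # One step costs -1, then take the best neighbouring value.
            new_value = -1 + max(values[n] for n in reachable)
            if new_value != values[(r, c)]:
                values[(r, c)] = new_value
                changed = True
        if not changed:
            return values


values = sweep_until_stable(MAZE)
```

For this layout the bottom-left square ends up at -4 and the two
squares next to the cheese at -1, matching the "negative number of
steps" reading of the picture.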