[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Fri Jun 24 07:49:52 PDT 2022


1045

For each state and action pair, the action-value function outputs the
expected return if the agent starts in that state, takes that action,
and then follows the policy forever after.

So, the action-value function is doing the same prediction task as the
state-value function.

They give another confusing equation,
Q_pi(s,a) = E_pi[G_t | S_t = s, A_t = a],
saying the same self-referential thing: the value is the expected
cumulative return after starting there with that action.
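
As a concrete reading of that expectation (a minimal sketch, assuming
a toy environment with a reset-to-state and step interface and a
policy function, none of which come from the unit itself): Q_pi(s,a)
can be estimated by averaging the discounted return over many
episodes that start in s, take a, and then follow pi.

    def estimate_q(env, policy, state, action, gamma=0.99, episodes=1000):
        """Monte Carlo estimate of Q_pi(s, a): average discounted return
        after starting in `state`, taking `action`, then following `policy`.
        The env interface here is hypothetical, not Gym's."""
        total = 0.0
        for _ in range(episodes):
            env.reset(start_state=state)      # hypothetical: force the start state
            s, r, done = env.step(action)     # take the chosen first action
            g, discount = r, gamma
            while not done:                   # follow the policy forever after
                s, r, done = env.step(policy(s))
                g += discount * r
                discount *= gamma
            total += g
        return total / episodes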

The chart of the mouse with the cheese is present again. The numbers
are the same, but each number is now associated with a direction of
travel out of one square into another.

The difference between the state-value function and the action-value
function is that the state-value function outputs the value of a
state, whereas the action-value function outputs the value of a
state-action pair: the value of taking that action at that state.
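
A sketch of that difference as plain data structures (the states,
actions, and numbers below are invented, not the mouse example's): a
state-value table has one entry per state, an action-value table has
one entry per (state, action) pair, and under a policy pi the first
is a probability-weighted average of the second.

    # action-value function: one number per (state, action) pair
    Q = {("s0", "up"): 0.4, ("s0", "right"): 0.7,
         ("s1", "up"): 1.0, ("s1", "right"): 1.5}

    # a policy giving the probability of each action in each state
    pi = {"s0": {"up": 0.5, "right": 0.5},
          "s1": {"up": 0.2, "right": 0.8}}

    # state-value function: one number per state, recoverable from Q and pi
    # via V_pi(s) = sum over a of pi(a|s) * Q_pi(s, a)
    V = {s: sum(p * Q[(s, a)] for a, p in actions.items())
         for s, actions in pi.items()}
    # V == {"s0": 0.55, "s1": 1.4}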

The section ends by saying that the Bellman equation can help with the
problem of summing all the rewards an agent can get after starting at
a state, when calculating the action-value or state-value.
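
A minimal sketch of what that means (the transition table and rewards
are invented for illustration): rather than summing every future
reward along every trajectory, the Bellman equation writes the value
of a state as the immediate reward plus the discounted value of the
next state, V_pi(s) = E[R_{t+1} + gamma * V_pi(S_{t+1})], and
iterating that backup until the numbers stop changing recovers the
state-values.

    gamma = 0.95

    # hypothetical deterministic transitions under the policy:
    # state -> (reward received, next state); "goal" is terminal
    transitions = {"s0": (0, "s1"), "s1": (0, "s2"), "s2": (1, "goal")}

    V = {s: 0.0 for s in transitions}
    V["goal"] = 0.0

    for _ in range(100):                  # sweep until the values converge
        for s, (r, s_next) in transitions.items():
            # Bellman backup: immediate reward plus discounted value
            # of the successor state
            V[s] = r + gamma * V[s_next]

    print(V)   # V["s0"] ends up at gamma**2 * 1 = 0.9025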

