[ot][spam] Behavior Log For Compliance Examples: HFRL Unit 2

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Fri Jun 24 08:26:29 PDT 2022


1118

Temporal difference learning updates the value estimate after every step.

V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]

R_{t+1} is the reward received on entering S_{t+1}; alpha is the
learning rate and gamma is the discount factor.

Basically, the old value estimate of the next state is used to estimate
the value update for the preceding state, and the Bellman equation
simplifies the calculation.
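A minimal sketch of that update in Python, treating the value function
as a dict of states to floats; the function name and the alpha/gamma
defaults are mine, not from the blog:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # Nudge V[s] toward the bootstrapped target r + gamma * V[s_next]
    # by a fraction alpha of the TD error. Unseen states default to 0.
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error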

[seems it would be helpful to derive a relation between these two expressions]
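If the other expression is the Monte Carlo update from earlier in the
unit,

V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]

then the relation is that TD replaces the full return
G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..., which is only
known once the episode ends, with the one-step bootstrap estimate
R_{t+1} + gamma * V(S_{t+1}), so the update can happen after every step.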

They list it out a little like pseudocode, but it's not super clear.
I'm a little confused about whether they're updating the next state or
the following state, and what the reward is associated with ... I infer
the reward is associated with the next state.

So at each step, the value function for the state being left can be
updated using the reward from the state being entered, I'm guessing.

They're considering the value function to be a map of states to real
numbers, and simply adjusting each entry by the learning rate times the
change in estimated return (the TD error).
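A self-contained sketch of that reading, on a toy problem of my own
(the classic 5-state random walk, not from the blog), with the value
function as a plain dict:

import random

def episode():
    # Random walk over states 0..4, starting in the middle. Stepping off
    # the right end pays reward 1; stepping off the left end pays 0.
    s = 2
    while 0 <= s <= 4:
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next > 4 else 0.0
        done = not (0 <= s_next <= 4)
        yield s, r, s_next, done
        s = s_next

V = {}                    # the value function: a map of states to reals
alpha, gamma = 0.1, 1.0   # learning rate and discount (my choices)

for _ in range(1000):
    for s, r, s_next, done in episode():
        # Update the state being left, using the reward received on
        # entering the next state; bootstrap from V[s_next] unless done.
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

print(sorted((s, round(v, 2)) for s, v in V.items()))
# converges toward the true values 1/6 ... 5/6 for states 0..4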

They go through the same example with this approach. I'll probably
skip to the section summary after it, which doesn't have its own
heading, but continues through to the end of the blog.

