Temporal difference updates the value function after every step:

V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]

R_{t+1} is the reward. Basically, the old value estimate of the next state is used to estimate the value update for the state just left, and Bellman's equation simplifies the calculation. [seems it would be helpful to derive a relation between these two expressions]

They list it out a bit like pseudocode, but it's not super clear. I was a little confused about whether they're updating the current state or the following state, and which transition the reward belongs to; I infer the reward R_{t+1} is associated with entering the next state. So at each step, the value function for the state being left can be updated with the reward from the state being entered, I'm guessing. They're treating the value function as a map from states to real numbers, and simply nudging each entry by the learning rate times the change in estimated return.

They go through the same example with this approach. I'll probably skip to the section summary after it, which doesn't have its own heading, but continues through to the end of the blog.
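Here's a minimal sketch of how I read the update, as tabular TD(0) in Python. The environment interface (env.reset(), env.step() returning next_state, reward, done) and the policy function are my own assumptions for illustration, not something the blog specifies.

```python
from collections import defaultdict

def td0_value_estimation(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every step, nudge V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)  # value function as a map from state -> real number, initialized to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)  # reward comes from entering next_state
            # Bootstrap from the current estimate of the next state's value;
            # at a terminal state there is no future return, so bootstrap with 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # learning rate times the TD error
            state = next_state
    return V
```

The one detail that clears up my confusion above: the update touches V(state) (the state being left), and the reward used is the one received on the transition into next_state.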