1106 Monte Carlo https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-learning-at-the-end-... The Monte Carlo approach waits until the episode is over, calculates the entire return, and updates the value function based on that entire return. This means the value function is only updated after a complete episode. The example notes that the Epsilon Greedy Strategy alternates between exploration and exploitation, where exploration means taking random actions. The example limits each episode to just enough steps to visit all the areas with reward. The Monte Carlo approach records each state and reward, sums the rewards into a return (or equivalently plugs them into the Bellman equation), then repeats with another episode. With each repetition the estimates improve.
- at the start the value of every state is 0, since the value model is untrained
- the example sets the learning rate to 0.1 (how aggressively the value estimates are updated) and the discount rate (gamma) to 1 (no discounting, since it is a multiplier on future rewards)
- because all values start equal, the agent initially explores at random
In the example, the sum of rewards is 3. The state-value function is updated like this:
V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]
- V(S_t) is the state-value function
- alpha is the learning rate
- G_t is the return
- t is the timestep
So with a return of 3, a learning rate of 0.1, and a starting value of 0, the new value of the starting state becomes 0 + 0.1 * (3 - 0) = 0.3.
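A minimal Python sketch of the epsilon-greedy choice mentioned above. The epsilon value of 0.3 and the action-value dictionary are assumptions made for illustration, not values from the course.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.3):
    """Pick an action epsilon-greedily.

    q_values: dict mapping action -> estimated value (assumed shape for this sketch)
    epsilon:  probability of exploring; 0.3 is an arbitrary illustrative value
    """
    if random.random() < epsilon:
        return random.choice(actions)                         # exploration: random action
    return max(actions, key=lambda a: q_values.get(a, 0.0))   # exploitation: best known action

# With untrained (all-zero) values, every action looks equally good, so even the
# "greedy" branch is effectively arbitrary -- matching the note that the agent
# starts out exploring at random.
print(epsilon_greedy({"up": 0.0, "down": 0.0}, ["up", "down"]))
```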
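And a sketch of the Monte Carlo value update itself, reproducing the 0.3 result from the worked example. The episode representation (a list of (state, reward) pairs) is an assumed convention for this sketch; the update rule is the one written above.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate from the example
GAMMA = 1.0  # discount rate from the example (no discounting)

def monte_carlo_update(V, episode, alpha=ALPHA, gamma=GAMMA):
    """Every-visit Monte Carlo prediction.

    episode: list of (state, reward) pairs, where reward is the reward received
             after leaving that state (an assumed convention for this sketch).
    After the episode ends, compute the return G_t for each visited state and
    apply V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)].
    """
    G = 0.0
    # Walk the episode backwards so the return can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

# Worked example from the notes: all values start at 0 and the episode's rewards
# sum to 3, so the start state moves to 0 + 0.1 * (3 - 0) = 0.3.
V = defaultdict(float)
episode = [("start", 1.0), ("A", 1.0), ("B", 1.0)]  # three rewards of 1 -> return of 3
monte_carlo_update(V, episode)
print(round(V["start"], 2))  # 0.3
```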