For each state-action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after. So the action-value function is doing the same prediction task as the state-value function. They give another confusing equation, $Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, making the same self-referential point: the value is the expected cumulative return after starting there. The chart of the mouse with the cheese appears again with the same numbers, but now each number is attached to a direction of travel out of one square into another. The difference between the state-value function and the action-value function is that the state-value function outputs the value of a state, whereas the action-value function outputs the value of a state-action pair: the value of taking that action in that state. The section ends by saying that the Bellman equation can help with the problem of summing all the rewards an agent can collect from a starting state when calculating the state-value or action-value.
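To make the definition concrete for myself, here is a minimal Python sketch (not the course's code) that estimates $Q_\pi(s, a)$ by Monte Carlo: take the given action once, follow the policy afterwards, and average the discounted returns. The 3x2 grid, the goal cell standing in for the cheese, the random policy, and the discount factor are all my own assumptions for illustration.

```python
import random

# Toy grid: reaching the goal cell (the "cheese") gives reward +1 and ends the episode.
GRID_W, GRID_H = 3, 2
GOAL = (2, 0)                     # hypothetical cheese location
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
GAMMA = 0.9                       # discount factor

def step(state, action):
    """Move in the grid; walls keep the agent in place. Reward 1 for reaching the goal."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), GRID_W - 1)
    y = min(max(state[1] + dy, 0), GRID_H - 1)
    next_state = (x, y)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def policy(state):
    """A stand-in for pi: pick an action uniformly at random."""
    return random.choice(list(ACTIONS))

def q_value(state, action, episodes=5000, max_steps=50):
    """Monte Carlo estimate of Q_pi(s, a): take `action` in `state` first,
    then follow the policy, and average the discounted returns."""
    total = 0.0
    for _ in range(episodes):
        s, a = state, action
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            s, r, done = step(s, a)
            ret += discount * r
            discount *= GAMMA
            if done:
                break
            a = policy(s)         # follow pi for every step after the first
        total += ret
    return total / episodes

# The value of a state-action pair: standing one cell left of the cheese,
# moving right should be worth more than moving left.
print(q_value((1, 0), "right"))
print(q_value((1, 0), "left"))
```

Running this, the "right" estimate comes out higher than the "left" one, which matches the idea that the action-value grades each direction of travel out of a square rather than the square itself.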