I'm taking notes here as I read the section. Rewards may be discounted so that rewards arriving further in the future contribute less to the return than immediate ones [note: one of many heavily improvable heuristics]. There's a link to https://huggingface.co/blog/deep-rl-intro#rewards-and-the-discounting to review that; I sketched the calculation below to check my understanding. The value of a state is defined as the expected return if the agent starts in that state and then acts according to the policy. That matches the standard definition as far as I can tell.

Value-based methods derive their policy for selecting actions from value estimates. A "greedy" policy selects the action with the highest estimated value. In practice, value-based methods usually use an epsilon-greedy policy to manage the exploration/exploitation tradeoff (sketched after the return example below). The mapping from states to values is the "value function"; the "policy" is the rule for selecting actions given states. The two work together: training refines the value estimates, and the policy uses those estimates to select actions.
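
A minimal sketch of the discounted return as I understand it. The function name and arguments are mine, not from the blog post; I'm only assuming a list of rewards collected from some time step onward and a discount factor gamma in (0, 1]:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards further in the future are weighted less:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```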
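
And a quick sketch of epsilon-greedy action selection, assuming `q_values` holds the current value estimate for each action (the names and the epsilon default are illustrative, not from the source):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        # Explore: pick a uniformly random action.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value (greedy).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With estimates [0.2, 0.8, 0.5] the greedy choice is action 1,
# but ~10% of the time a random action is taken instead.
print(epsilon_greedy([0.2, 0.8, 0.5]))
```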