24 Jun 2022
2:44 p.m.
State value function: V_π(s) = E_π[G_t | S_t = s]. The value of state s under policy π is the expected return if the agent starts at state s. On its own, the equation doesn't seem very helpful. A second description: for each state, the state-value function outputs the expected return if the agent starts in that state and then follows the policy forever after.

A graphic shows a mouse in a tiny maze finding cheese. Each square is labelled with the negative of the number of steps needed to reach the cheese; this number is the square's value. At this point it is pretty easy to imagine a function recursively updating the values of every square in such a maze until they stabilise, arriving at the picture.
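That recursive updating can be sketched directly. This is a minimal, hypothetical version (the maze layout and wall position are made up, not from the graphic): reward is -1 per step, the cheese square is terminal with value 0, and values are swept repeatedly until they stop changing, so each square ends up holding minus its step distance to the cheese.

```python
GRID_W, GRID_H = 4, 3
CHEESE = (3, 0)   # hypothetical cheese location
WALLS = {(1, 1)}  # hypothetical wall square

def neighbours(x, y):
    """Yield the reachable squares adjacent to (x, y)."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_W and 0 <= ny < GRID_H and (nx, ny) not in WALLS:
            yield nx, ny

def value_iteration():
    # Start every square at 0 and sweep until values stabilise.
    V = {(x, y): 0.0 for x in range(GRID_W) for y in range(GRID_H)
         if (x, y) not in WALLS}
    while True:
        delta = 0.0
        for s in V:
            if s == CHEESE:
                continue  # terminal square keeps value 0
            # One step costs -1, then act greedily from the best neighbour.
            new_v = max(-1.0 + V[n] for n in neighbours(*s))
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < 1e-9:
            return V

V = value_iteration()
print(V[(2, 0)])  # square one step from the cheese: -1.0
```

Because the update uses the best neighbour (a greedy policy), the stabilised values equal the negative shortest-path step counts shown on the squares in the graphic.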