24 Jun 2022, 2:44 p.m.
1041 State-value function: V_π(s) = E_π[G_t | S_t = s]. The value of state s under policy π is the expected return if the agent starts at state s. The equation alone doesn't seem very helpful. Second description of the equation: for each state, the state-value function outputs the expected return if the agent starts in that state and then follows the policy forever after. A graphic is shown of a mouse in a tiny maze finding cheese. Each square is labelled with minus the number of steps needed to reach the cheese; this number is the square's value. At this point, it is pretty easy to imagine a function recursively updating the values of every square in such a maze until they stabilise, arriving at the picture.
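That imagined stabilising procedure can be sketched directly. The following is my own minimal illustration, not code from the course: a tiny 2x3 maze with a hypothetical cheese square, a reward of -1 per step, and repeated sweeps that update each square from its best neighbour (i.e. the values an optimal shortest-path policy would produce, matching the "-number of steps" picture).

```python
# Minimal sketch (own illustration): sweep over every square, updating its
# value from its best neighbour, until no value changes. With -1 per step
# and the cheese worth 0, each square converges to minus its step distance
# to the cheese.

ROWS, COLS = 2, 3          # assumed tiny maze for illustration
CHEESE = (0, 2)            # hypothetical cheese location
WALLS = set()              # no walls in this toy example

def neighbours(r, c):
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) not in WALLS:
            yield nr, nc

def value_sweep():
    # Start every non-cheese square at -infinity, then sweep until stable.
    V = {(r, c): float("-inf") for r in range(ROWS) for c in range(COLS)}
    V[CHEESE] = 0.0
    changed = True
    while changed:
        changed = False
        for s in V:
            if s == CHEESE:
                continue
            best = max(V[n] for n in neighbours(*s))
            new = -1.0 + best      # -1 step cost plus best neighbour's value
            if new != V[s]:
                V[s], changed = new, True
    return V

V = value_sweep()
for r in range(ROWS):
    print([V[(r, c)] for c in range(COLS)])
# → [-2.0, -1.0, 0.0]
# → [-3.0, -2.0, -1.0]
```

Strictly speaking this is the update for the optimal policy (taking the max over neighbours); evaluating a fixed, possibly suboptimal policy would instead average over the actions that policy takes.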