1203 Unit 1 Quiz 2
https://huggingface.co/blog/deep-rl-intro#value-based-methods

Q1: What are the two main approaches to find an optimal policy?
guess: [i'm spending some time thinking. monte carlo and TD are nearest in my mind, and this was before that.] value-based, where a policy is trained to update values for each state, and policy-based, where a policy is trained to directly select actions.

Q2: What is the Bellman Equation?
guess: a way to calculate the return for a value-based state given [stepped away to do something, now 1211] its reward and the return for the next state, rather than exhaustively summing all of them

Q3: Define each part of the Bellman Equation [stepped away again. it's now 1217]
V_Pi(s) = E_Pi[R_t+1 + gamma * V_Pi(S_t+1) | S_t = s]
V_Pi(s): ____ guess: The value policy function for a state
E_Pi[R_t+1: ____ guess: The estimated policy return
gamma * V_Pi(S_t+1): ____ guess: The discounted return for the next state
S_t = s: ____ guess: for a complete episode that started at this state

Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
guess: Temporal Difference methods use entire accumulated episodes with accurate returns calculated, to update the value policy function. Monte Carlo methods use estimated returns based on past experience, to update the value policy every step.

Q5: Define each part of the Temporal Difference learning formula
V(S_t) <- V(S_t) + alpha[R_t+1 + gamma * V(S_t+1) - V(S_t)]
V(S_t): _____
V(S_t): _____
alpha: _____
R_t+1: _____
gamma * V(S_t+1): _____
V(S_t): _____
[R_t+1 + gamma * V(S_t+1)]: _____

[something unexpected is happening. i'm stepping away from this. 1222]
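
A minimal Python sketch of the Bellman equation from Q3, just to make the recursion concrete. The tiny chain MDP, rewards, and names here are my own, not from the course, and the policy is fixed and deterministic so the expectation drops out:

    # Sketch (not from the course): the Bellman equation as a one-step backup
    # on a made-up deterministic chain MDP, under a fixed policy.
    # V_Pi(s) = E_Pi[R_t+1 + gamma * V_Pi(S_t+1) | S_t = s]

    gamma = 0.9

    # states 0 -> 1 -> 2 (terminal); each transition gives reward +1
    next_state = {0: 1, 1: 2}
    reward = {0: 1.0, 1: 1.0}

    V = {0: 0.0, 1: 0.0, 2: 0.0}  # value estimates; terminal state stays 0

    # Sweep until values settle: each state's value is its immediate reward
    # plus the discounted value of the next state, instead of summing the
    # whole return out to the end of the episode by hand.
    for _ in range(100):
        for s in (0, 1):
            V[s] = reward[s] + gamma * V[next_state[s]]

    print(V)  # V[1] = 1.0, V[0] = 1 + 0.9 * 1.0 = 1.9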
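
And a sketch contrasting the two update rules from Q4/Q5. The toy episode generator, constants, and function names are my own assumptions, not from the course; it only shows TD(0) updating one transition at a time versus Monte Carlo waiting for the full return:

    # Sketch (not from the course): TD(0) update vs Monte Carlo update for V.
    import random

    GAMMA = 0.99
    ALPHA = 0.1

    def td0_update(V, s, r, s_next):
        """One TD(0) step: move V(S_t) toward the TD target R_t+1 + gamma * V(S_t+1)."""
        td_target = r + GAMMA * V[s_next]
        V[s] = V[s] + ALPHA * (td_target - V[s])

    def mc_update(V, episode):
        """Monte Carlo: wait for the whole episode, then move each visited
        state toward its actual return G_t."""
        G = 0.0
        for s, r in reversed(episode):  # episode is a list of (state, reward) pairs
            G = r + GAMMA * G
            V[s] = V[s] + ALPHA * (G - V[s])

    # toy usage: 3 non-terminal states, random rewards, terminal state "T"
    V_td = {0: 0.0, 1: 0.0, 2: 0.0, "T": 0.0}
    V_mc = {0: 0.0, 1: 0.0, 2: 0.0}

    for _ in range(500):
        episode = [(s, random.random()) for s in (0, 1, 2)]
        # TD learns during the episode, one transition at a time
        for (s, r), s_next in zip(episode, [1, 2, "T"]):
            td0_update(V_td, s, r, s_next)
        # MC learns only once the episode is over
        mc_update(V_mc, episode)

    print(V_td, V_mc)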