1203 Unit 1 Quiz 2
https://huggingface.co/blog/deep-rl-intro#value-based-methods

Q1: What are the two main approaches to find an optimal policy?
guess: [i'm spending some time thinking. monte carlo and TD are nearest in my mind, and this was before that.] value-based, where a policy is trained to update values for each state, and policy-based, where a policy is trained to directly select actions.

Q2: What is the Bellman Equation?
guess: a way to calculate the return for a value-based state given [stepped away to do something, now 1211] its reward and the return for the next state, rather than exhaustively summing all of them

Q3: Define each part of the Bellman Equation [stepped away again. it's now 1217]
V_Pi(s) = E_Pi[R_t+1 + gamma * V_Pi(S_t+1) | S_t = s]
V_Pi(s): ____ guess: The value policy function for a state
E_Pi[R_t+1: ____ guess: The estimated policy return
gamma * V_Pi(S_t+1): ____ guess: The discounted return for the next state
S_t = s: ____ guess: for a complete episode that started at this state

Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
guess: Temporal Difference methods use entire accumulated episodes with accurate returns calculated, to update the value policy function. Monte Carlo methods use estimated returns based on past experience, to update the value policy every step.

Q5: Define each part of the Temporal Difference learning formula
V(S_t) <- V(S_t) + alpha[R_t+1 + gamma * V(S_t+1) - V(S_t)]
V(S_t): _____
V(S_t): _____
alpha: _____
R_t+1: _____
gamma * V(S_t+1): _____
V(S_t): _____
[R_t+1 + gamma * V(S_t+1)]: _____

[something unexpected is happening. i'm stepping away from this. 1222]
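
A minimal Python sketch of the Bellman equation from Q3, just to make the recursion concrete. The tiny chain MDP, rewards, and names here are my own, not from the course, and the policy is fixed and deterministic so the expectation drops out:

    # Sketch (not from the course): the Bellman equation as a one-step backup
    # on a made-up deterministic chain MDP, under a fixed policy.
    # V_Pi(s) = E_Pi[R_t+1 + gamma * V_Pi(S_t+1) | S_t = s]

    gamma = 0.9

    # states 0 -> 1 -> 2 (terminal); each transition gives reward +1
    next_state = {0: 1, 1: 2}
    reward = {0: 1.0, 1: 1.0}

    V = {0: 0.0, 1: 0.0, 2: 0.0}  # value estimates; terminal state stays 0

    # Sweep until values settle: each state's value is its immediate reward
    # plus the discounted value of the next state, instead of summing the
    # whole return out to the end of the episode by hand.
    for _ in range(100):
        for s in (0, 1):
            V[s] = reward[s] + gamma * V[next_state[s]]

    print(V)  # V[1] = 1.0, V[0] = 1 + 0.9 * 1.0 = 1.9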
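
And a sketch contrasting the two update rules from Q4/Q5. The toy episode generator, constants, and function names are my own assumptions, not from the course; it only shows TD(0) updating one transition at a time versus Monte Carlo waiting for the full return:

    # Sketch (not from the course): TD(0) update vs Monte Carlo update for V.
    import random

    GAMMA = 0.99
    ALPHA = 0.1

    def td0_update(V, s, r, s_next):
        """One TD(0) step: move V(S_t) toward the TD target R_t+1 + gamma * V(S_t+1)."""
        td_target = r + GAMMA * V[s_next]
        V[s] = V[s] + ALPHA * (td_target - V[s])

    def mc_update(V, episode):
        """Monte Carlo: wait for the whole episode, then move each visited
        state toward its actual return G_t."""
        G = 0.0
        for s, r in reversed(episode):  # episode is a list of (state, reward) pairs
            G = r + GAMMA * G
            V[s] = V[s] + ALPHA * (G - V[s])

    # toy usage: 3 non-terminal states, random rewards, terminal state "T"
    V_td = {0: 0.0, 1: 0.0, 2: 0.0, "T": 0.0}
    V_mc = {0: 0.0, 1: 0.0, 2: 0.0}

    for _ in range(500):
        episode = [(s, random.random()) for s in (0, 1, 2)]
        # TD learns during the episode, one transition at a time
        for (s, r), s_next in zip(episode, [1, 2, "T"]):
            td0_update(V_td, s, r, s_next)
        # MC learns only once the episode is over
        mc_update(V_mc, episode)

    print(V_td, V_mc)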