# What is RL? A short recap

• Policy-based methods: Train the policy directly to learn which action to take, given a state.
• Value-based methods: Train a value function to learn which states are more valuable, and use this value function to take the actions that lead to them.

# The two types of value-based methods

But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods: we train a value function, not a policy.

• Policy-based methods: Directly train the policy to select what action to take given a state (or a probability distribution over actions at that state). In this case, we don't have a value function. The policy takes a state as input and outputs what action to take at that state (deterministic policy).
• Value-based methods: Indirectly, by training a value function that outputs the value of a state or a state-action pair. Given this value function, our policy takes an action: given a state, our action-value function (which we train) outputs the value of each action at that state, and our greedy policy (which we define) selects the action with the biggest state-action value.
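The split above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical Q-table with made-up values; only the greedy-selection logic matters here.

```python
# Hypothetical action-value table: Q[state][action] -> value (made-up numbers).
Q = {
    "s0": {"left": 0.1, "right": 0.7, "down": 0.2},
}

def greedy_policy(q_table, state):
    """The policy we define by hand: pick the action with the
    biggest state-action value at the given state."""
    actions = q_table[state]
    return max(actions, key=actions.get)

print(greedy_policy(Q, "s0"))  # -> "right"
```

Note that `greedy_policy` is not trained: it is fixed, and only the values inside `Q` would change during learning.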

## The State-Value function

Take, for instance, the state with value -7: it's the expected return starting at that state and then taking actions according to our policy (here, the greedy policy), so right, right, right, down, down, right, right.

## The Action-Value function

• With the state-value function, we calculate the value of a state (St).
• With the action-value function, we calculate the value of a state-action pair (St, At), that is, the value of taking that action at that state. Note: we didn't fill in all the state-action pairs in the action-value function example.
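The difference between the two can be shown concretely. A small sketch, again assuming a hypothetical Q-table with made-up values, and a greedy policy so that the state value is simply the best action value at that state:

```python
# Hypothetical action-value table: Q[(state, action)] -> value.
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0}

def q_value(state, action):
    """Action-value: the value of taking `action` in `state`."""
    return Q[(state, action)]

def v_value(state):
    """State-value under a greedy policy: the best action value
    available at `state`."""
    return max(v for (s, a), v in Q.items() if s == state)

print(q_value("s0", "left"))   # value of one specific state-action pair
print(v_value("s0"))           # value of the state itself
```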

## The Bellman Equation: simplify our value estimation

To calculate the value of State 1, we would sum the rewards the agent gets if it starts in that state and then follows the greedy policy (taking the actions that lead to the best state values) for all subsequent time steps. To calculate the value of State 2, we would likewise sum the rewards it gets starting from that state and then following the policy for all the time steps. For simplicity, we don't discount here, so gamma = 1.

Instead of repeating this whole computation for every state, the Bellman equation lets us define each value recursively:

• V(St) = Immediate reward (Rt+1) + Discounted value of the next state (gamma * V(St+1)).
• Likewise, V(St+1) = Rt+2 + gamma * V(St+2).
• And so on.
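The recursion above can be run backwards over a trajectory. A minimal sketch on a linear chain of states with hypothetical rewards, using gamma = 1 as in the text:

```python
# Hypothetical rewards Rt+1 received when leaving each state (made up).
rewards = [1, 0, 0, 2]
gamma = 1.0  # no discounting, as in the text

# One extra slot for the terminal state, whose value is 0.
values = [0.0] * (len(rewards) + 1)

# Work backwards: V(St) = Rt+1 + gamma * V(St+1).
for t in reversed(range(len(rewards))):
    values[t] = rewards[t] + gamma * values[t + 1]

print(values)  # -> [3.0, 2.0, 2.0, 2.0, 0.0]
```

Each value is just one reward plus an already-computed value, instead of a full sum over all remaining time steps.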

# Monte Carlo vs Temporal Difference Learning

## Monte Carlo: learning at the end of the episode

• We always start the episode at the same starting point.
• We try actions using our policy (for instance an Epsilon-Greedy Strategy, a policy that alternates between exploration (taking random actions) and exploitation).
• We get the Reward and the Next State.
• We terminate the episode if the cat eats us or if we move > 10 steps.
• At the end of the episode, we have a list of States, Actions, Rewards, and Next States.
• The agent sums the rewards into the return Gt (to see how well it did).
• It then updates V(St) based on the formula.
• Then it starts a new game with this new knowledge.
• We've just started to train our value function, so it returns 0 for each state.
• Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
• Our mouse explores the environment, taking random actions, as shown here:
• The mouse made more than 10 steps, so the episode ends.
• We have a list of state, action, rewards, next_state, we need to calculate the return Gt.
• Gt = Rt+1 + Rt+2 + Rt+3… (for simplicity we don’t discount the rewards).
• Gt = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0
• Gt = 3
• We can now update V(S0):
• New V(S0) = V(S0) + lr * [Gt - V(S0)]
• New V(S0) = 0 + 0.1 * [3 - 0]
• New V(S0) = 0.3
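The walkthrough above fits in a few lines of code. A sketch using the same numbers (rewards, lr = 0.1, gamma = 1):

```python
# Rewards collected during the episode in the walkthrough above.
rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
lr = 0.1     # learning rate from the text
gamma = 1.0  # no discounting

# Return Gt: with gamma = 1 it is just the sum of the rewards.
G = sum(rewards)  # -> 3

# Monte Carlo update: New V(S0) = V(S0) + lr * [Gt - V(S0)].
V_s0 = 0.0  # value function initialized to 0
V_s0 = V_s0 + lr * (G - V_s0)
print(round(V_s0, 2))  # -> 0.3
```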

## Temporal Difference Learning: learning at each step

• Temporal Difference, on the other hand, waits for only one interaction (one step), i.e., until it reaches St+1, to form a TD target, and updates V(St) using Rt+1 and gamma * V(St+1).
• We've just started to train our value function, so it returns 0 for each state.
• Our learning rate (lr) is 0.1 and our discount rate is 1 (no discount).
• Our mouse explores the environment and takes a random action: going to the left.
• It gets a reward Rt+1 = 1 since it eats a piece of cheese.
• With Monte Carlo, we update the value function from a complete episode, and so we use the actual, accurate discounted return of this episode.
• With TD learning, we update the value function from a single step, so we replace Gt, which we don't have, with an estimated return called the TD target.
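The one-step update above can also be sketched with the walkthrough's numbers. The TD target is Rt+1 + gamma * V(St+1), and the state names are hypothetical:

```python
lr = 0.1     # learning rate from the text
gamma = 1.0  # no discounting

# Value function initialized to 0 for every state (hypothetical names).
V = {"s0": 0.0, "s1": 0.0}

reward = 1  # Rt+1: the mouse ate a piece of cheese

# TD target: estimated return formed after a single step.
td_target = reward + gamma * V["s1"]

# TD update: New V(St) = V(St) + lr * [TD target - V(St)].
V["s0"] = V["s0"] + lr * (td_target - V["s0"])
print(V["s0"])  # -> 0.1
```

Unlike the Monte Carlo version, this update happens after a single step, without waiting for the episode to end.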

# Summary

• State-value function: outputs the expected return if I start at that state and then act according to the policy forever after.
• Action-value function: outputs the expected return if I start in that state, take that action, and then act according to the policy forever after.
• In value-based methods, we define the policy by hand because we don't train it; we train a value function. The idea is that if we have an optimal value function, we will have an optimal policy.
• With Monte Carlo, we update the value function from a complete episode, and so we use the actual, accurate discounted return of this episode.
• With TD learning, we update the value function from a single step, so we replace Gt, which we don't have, with an estimated return called the TD target.


## More from Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗| Founder Deep Reinforcement Learning class 📚 https://bit.ly/3QADz2Q |