Deep Reinforcement Learning Course v2.0

Q-Learning, let’s create an autonomous Taxi 🚖 (Part 1/2)

Chapter 2 of the Deep Reinforcement Learning Course v2.0

Thomas Simonini

--

We launched a new free, updated, Deep Reinforcement Learning Course from beginner to expert, with Hugging Face 🤗

👉 The new version of the course: https://huggingface.co/deep-rl-course/unit0/introduction

The chapter below is the former version, the new version is here 👉 https://huggingface.co/deep-rl-course/unit2/introduction

We launched a new free, updated, Deep Reinforcement Learning Course from beginner to expert, with Hugging Face 🤗

👉 The new version of the course: https://huggingface.co/deep-rl-course/unit0/introduction

The chapter below is the former version, the new version is here 👉 https://huggingface.co/deep-rl-course/unit2/introduction

This article is part of the Deep Reinforcement Learning Course v2.0🕹️. A free course from beginner to expert with Tensorflow and PyTorch. Check the syllabus here.

Before studying this new chapter, you should master the elements that we spoke about in the first chapter. If it’s not the case, you can check the video version here or the article version there.

In the first chapter of this course, we learned about what is Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem.

So today, we’re going to dive deeper into one of these methods: value-based-methods and learn about our first RL algorithm: Q-Learning.

We’ll also implement our first RL agent: a Q-Learning taxi that will need to learn to navigate in a city to transport its passengers from point A to point B.

This chapter will be divided into 2 parts:

In the first part, we’ll learn about the value-based methods and the difference between Monte Carlo and Temporal Difference Learning.

And in the second part, we’ll study our first RL algorithm: Q-Learning, and implement our first RL Agent.

This chapter is fundamental if you want to be able to work on Deep Q-Learning (chapter 3): the first Deep RL algorithm that was able to play Atari games and beat the human level on some of them (breakout, space invaders…).

If you prefer, you can watch the 📹 video version of this chapter here:

So let’s get started!

What is RL? A short recap

In RL, we build an agent that can make smart decisions. For instance, an agent that learns to play a video game. Or a trading agent that learns to maximize its benefits by making smart decisions on what stocks to buy and when to sell.

The RL Process

But, to make smart decisions, our agent will learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

Its goal is to maximize its expected cumulative reward (because of the reward hypothesis).

The agent’s brain is called the policy π. It’s where the agent makes its decision-making process: given a state, our policy will output an action or a probability distribution over actions.

Our goal is to find an optimal policy π*, aka, a policy that leads to the best expected cumulative reward.

And to find this optimal policy (hence solving the RL problem) there are two main types of RL methods:

  • Policy-based-methods: Train our policy directly to learn which action to take, given a state.
  • Value-based methods: Train a value function to learn which state is more valuable and using this value function to take the action that leads to it.

And in this chapter, we’ll dive deeper into the Value-based methods.

The two types of value-based methods

In value-based methods, we learn a value function, that maps a state to the expected value of being at that state.

The value of a state is the expected discounted return the agent can get if it starts at that state, and then acts according to our policy.

But what means acting according to our policy? We don’t have a policy in value-based methods since we train a value-function and not a policy?

Remember that the goal of an RL agent is to have an optimal policy π*.

To find it, we learned that there are two different methods:

  • Policy-based methods: Directly train the policy to select what action to take given a state (or a probability distribution over actions at that state). In this case, we don’t have a value-function.
The policy takes a state as input and outputs what action to take at that state (deterministic policy).

And consequently, we don’t define by hand the behavior of our policy, it’s the training that will define it.

  • Value-based methods: Indirectly, by training a value-function that outputs the value of a state, or a state-action pair. Given this value function, our policy will take action.

But, because we didn’t train our policy, we need to specify its behavior. For instance, if we want a policy that given the value function will take actions that always lead to the biggest value, we’ll create a Greedy Policy.

Given a state, our action-value function (that we train) outputs the value of each action at that state, then our greedy policy (that we defined) selects the action with biggest state-action pair value.

Consequently, whatever method you use to solve your problem, you will have a policy, but in the case of value-based methods you don’t train it, your policy is just a simple function that you specify (for instance greedy policy) and this policy uses the values given by the value-function to select its actions.

So the difference is, in policy-based the optimal policy is found by training the policy directly.

In value-based, finding an optimal value-function leads to having an optimal policy.

In fact, in value-based methods, most of the time you’ll use an Epsilon-Greedy Policy that handles the exploration/exploitation trade-off, we’ll talk about it when we’ll talk about Q-Learning in the second part of this chapter.

So, we have two types of value-based functions:

The State-Value function

We write the state value function under a policy π like this:

For each state, the state-value function outputs the expected return if the agent starts at that state, and then follow the policy forever after (for all future timesteps if you prefer).

If we take the state with value -7: it’s the sum of Expected return starting at that state and taking actions according to our policy (greedy policy) so right, right, right, down, down, right, right.

The Action-Value function

In the Action-value function, for each state and action pair, the action-value function outputs the expected return, if the agent starts in that state and takes the action, and then follows the policy forever after.

The value of taking action a in state s under a policy π is:

We see that the difference is:

  • In state-value function, we calculate the value of a state (St).
  • In action-value function, we calculate the value of state-action pair (St, At) hence the value of taking that action at that state.
Note: We didn’t fill all the state-actions pair for the example of Action-value function

Whatever value function we choose (state-value or action-value function), the value is the expected return.

However, the problem is that it implies that to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.

This can be a dull process and that’s where the Bellman equation comes to help us.

The Bellman Equation: simplify our value estimation

The Bellman equation simplifies our state value or state-action value calculation.

With what we learned from now, we know that if we calculate the V(St) (value of a state), we need to calculate the return starting at that state then follow the policy forever after. (Our policy that we defined in the next example is a Greedy Policy and for simplification, we don’t discount the reward).

So to calculate V(St) we need to make the sum of the expected rewards. Hence:

To calculate the value of State 1: the sum of rewards if the agent started in that state, and then followed the greedy policy (taking actions that leads to the best states values) for all the time steps.

Then, to calculate the V(St+1), we need to calculate the return starting at that state St+1.

To calculate the value of State 2: the sum of rewards if the agent started in that state, and then followed the policy for all the time steps.

So you see that’s a quite dull process if you need to do it for each state value or state-action value.

Instead of calculating for each state or each state-action pair the expected return, we can use the Bellman equation.

The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

The immediate reward (Rt+1) + the discounted value of the state that follows (gamma * V(St+1)).

If we go back to our example, the value of State 1= expected cumulative return if we start at that state.

To calculate the value of State 1: the sum of rewards if the agent started in that state 1, and then followed the policy for all the time steps.

Which is equivalent to V(St) = Immediate reward (Rt+1) + Discounted value of the next state (Gamma * V(St+1)).

For simplification here we don’t discount so gamma = 1.
  • The value of V(St+1) = Immediate reward (Rt+2) + Discounted value of the St+2 (Gamma * V(St+2)).
  • And so on.

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process. This is equivalent to the sum of immediate reward + the discounted value of the state that follows.

This equation will be useful when we’ll learn about Q-Learning in the second part of this chapter.

Monte Carlo vs Temporal Difference Learning

The last thing we need to talk about today is the two ways of learning whatever the RL method we use.

Remember that an RL agent learns by interacting with its environment. The idea is that using the experience taken, given the reward he gets, it will update its value or its policy.

Monte Carlo and Temporal Difference Learning are two different strategies on how to train our value function or our policy function. Both of them use experience to solve the RL problem.

But on the one hand, Monte Carlo uses an entire episode of experience before learning. On the other hand, Temporal Difference uses only a step (St, At, Rt+1, St+1) to learn.

We’ll explain both of them using a value-based method example.

Monte Carlo: learning at the end of the episode

Monte Carlo waits until the end of the episode, then calculates Gt (return) and uses it as a target for updating V(St).

So it requires a complete entire episode of interaction before updating our value function.

If we take an example:

  • We always start the episode at the same starting point.
  • We try actions using our policy (for instance using Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation).
  • We get the Reward and the Next State.
  • We terminate the episode if the cat eats us or if we move > 10 steps.
  • At the end of the episode, we have a list of State, Actions, Rewards, and Next States.
  • The agent will sum the total rewards Gt (to see how well it did).
  • It will then update V(st) based on the formula.
  • Then start a new game with this new knowledge

By running more and more episodes, the agent will learn to play better and better.

For instance, if we train a state-value function using Monte Carlo:

  • We just started to train our Value function so it returns 0 value for each state.
  • Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
  • Our mouse, explore the environment and take random actions, we see what it did here:
  • The mouse made more than 10 steps, so the episode ends.
  • We have a list of state, action, rewards, next_state, we need to calculate the return Gt.
  • Gt = Rt+1 + Rt+2 + Rt+3… (for simplicity we don’t discount the rewards).
  • Gt = 1 + 0 + 0 + 0+ 0 + 0 + 1 + 1+ 0 + 0
  • Gt= 3
  • We can now update V(S0):
  • New V(S0) = V(S0) + lr * [Gt — V(S0)]
  • New V(S0) = 0 + 0.1 * [3 –0]
  • The new V(S0) = 0.3

Temporal Difference Learning: learning at each step

  • Temporal Difference, on the other hand, waits for only one interaction (one step) St+1 to form a TD target and update V(St) using Rt+1 and gamma * V(St+1).

The idea is that with TD we update the V(St) at each step.

But because we didn’t play during an entire episode, we don’t have Gt (expected return), instead, we estimate Gt by adding Rt+1 and the discounted value of next state.

We speak about bootstrap because TD bases its update part on an existing estimate V(St+1) and not a full sample Gt.

This method is called TD(0) or one step TD (update the value function after any individual step).

If we take the same example,

  • We just started to train our Value function so it returns 0 value for each state.
  • Our learning rate (lr) is 0.1 and our discount rate is 1 (no discount).
  • Our mouse, explore the environment and take a random action: the action going to the left.
  • It gets a reward Rt+1 = 1 since it eat a piece of cheese.

We can now update V(S0):

New V(S0) = V(S0) + lr * [R1 + gamma * V(S1) — V(S0)]

New V(S0) = 0 + 0.1 * [1 + 0.99 * 0–0]

The new V(S0) = 0.1

So we just updated our value function for State 0.

Now we continue to interact with this environment with our updated value function.

If we summarize:

  • With Monte Carlo, we update the value function from a complete episode and so we use the actual accurate discounted return of this episode.
  • With TD learning, we update the value function from a step, so we replace Gt that we don’t have with an estimated return called TD target.

Summary

We have two types of value-based functions:

  • State-value function: outputs the expected return if I start at that state and then act accordingly to the policy forever after.
  • Action-Value function: outputs the expected return if I start in that state and I take that action at that state and then I act accordingly to the policy forever after.
  • In value-based methods, we define the policy by hand because we don’t train it, we train a value function. The idea is that if we have an optimal value function, we will have an optimal policy.

There are two types of methods to learn a policy or a value-function:

  • With Monte Carlo, we update the value function from a complete episode and so we use the actual accurate discounted return of this episode.
  • With TD learning, we update the value function from a step, so we replace Gt that we don’t have with an estimated return called TD target.

So that’s all for today. Congrats on finishing this first part of the chapter! There was a lot of information.

That’s normal if you still feel confused with all these elements. This was the same for me and for all people who studied RL.

Take time to really grasp the material before continuing. In the second part, we’ll study our first RL algorithm: Q-Learning, and implement our first RL Agent: a taxi that will need to learn to navigate in a city to transport its passengers from point A to point B. This will be fun.

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me on Medium, on Twitter, and on Youtube.

See you next time,

Keep learning, stay awesome,

--

--

Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗| Founder Deep Reinforcement Learning class 📚 https://bit.ly/3QADz2Q |