Deep Reinforcement Learning Course v2.0

Q-Learning, let’s create an autonomous Taxi 🚖 (Part 2/2)

Introducing Q-Learning

What is Q-Learning?

  • “Off-policy”: we’ll talk about that at the end of this chapter.
  • “Value-based method”: it finds its optimal policy indirectly, by training a value function or an action-value function that tells us the value of each state or of each state-action pair.
  • “Uses a TD approach”: it updates its action-value function at each step, instead of waiting for the end of the episode.
Given a state and action pair, our Q-function searches inside its Q-table and outputs the corresponding state-action value (also called the Q-value).
  • Q-Learning is the RL algorithm that trains a Q-function, an action-value function whose internal memory is a Q-table containing all the state-action pair values.
  • Given a state and an action, our Q-function searches its Q-table for the corresponding value.
  • When training is done, we have an optimal Q-function, and therefore an optimal Q-table.
  • And if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.
We can see that, with training, our Q-table gets better and better, since it lets us look up the value of each state-action pair.
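To make this more concrete, here is a minimal sketch of a Q-table as a 2D array, with one row per state and one column per action (the sizes `n_states` and `n_actions` are hypothetical, chosen for illustration):

```python
import numpy as np

# Hypothetical sizes for a small discrete environment (illustration only)
n_states, n_actions = 16, 4

# The Q-table: one row per state, one column per action, initialized to zero
q_table = np.zeros((n_states, n_actions))

def q_value(state, action):
    """Given a state and an action, look up the state-action value (Q-value)."""
    return q_table[state, action]

def best_action(state):
    """The greedy policy: pick the action with the highest Q-value in this state."""
    return int(np.argmax(q_table[state]))
```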

The Q-Learning algorithm

The Q-Learning algorithm pseudocode (Source: Udacity)
  • With probability 1 − ɛ: we do exploitation (i.e., our agent selects the action with the highest state-action pair value).
  • With probability ɛ: we do exploration (we try a random action).
  • To perform the update, we need the current state St, the action At, the reward Rt+1, and the next state St+1.
  • To update our Q-value for this state-action pair, we form the TD target and move our current estimate toward it:
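Written out, the standard Q-Learning update, with learning rate α and discount rate γ, looks like this:

```latex
\text{TD target:}\quad R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
```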

Off-policy vs On-policy

  • Off-policy: using a different policy for acting and updating. Q-Learning is off-policy: it acts with an epsilon-greedy policy, but updates its Q-values using the greedy (max) action in the next state.
  • On-policy: using the same policy for acting and updating. Sarsa is on-policy: it updates its Q-values using the action its epsilon-greedy acting policy actually takes in the next state.
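To see the difference concretely, here is a minimal sketch of the two update rules side by side (the `q_table`, `alpha`, and `gamma` values are illustrative):

```python
import numpy as np

alpha, gamma = 0.1, 0.99          # learning rate and discount rate (example values)
q_table = np.zeros((16, 4))       # hypothetical Q-table for a small environment

def q_learning_update(s, a, r, s_next):
    """Off-policy: the update uses the greedy (max) action in the next state,
    whatever action the acting policy will actually take there."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: the update uses the action the acting (epsilon-greedy) policy
    actually chose in the next state."""
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```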

An example

  • You’re a mouse in this very small maze. You always start at the same starting point.
  • The goal is to eat the big pile of cheese at the bottom right-hand corner, and avoid the poison.
  • The episode ends if we eat the poison, eat the big pile of cheese, or take more than five steps.
  • The learning rate is 0.1.
  • The gamma (discount rate) is 0.99.

The reward scheme is:
  • +0: Going to a state with no cheese in it.
  • +1: Going to a state with a small cheese in it.
  • +10: Going to the state with the big pile of cheese.
  • -10: Going to the state with the poison and thus dying.
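With these numbers, we can trace one update by hand. As a sketch, assuming the Q-table starts at zero and the mouse's first move (say, going right) lands on a small cheese (+1 reward):

```python
alpha, gamma = 0.1, 0.99   # the learning rate and discount rate given above

q_old = 0.0                # Q(start, right) before the update: the Q-table starts at zero
reward = 1.0               # we moved to a state with a small cheese
max_q_next = 0.0           # every Q-value in the next state is still zero

td_target = reward + gamma * max_q_next          # 1.0
q_new = q_old + alpha * (td_target - q_old)      # 0 + 0.1 * (1.0 - 0) = 0.1
print(q_new)                                     # 0.1
```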

Let’s train our Q-Learning Taxi agent 🚕

Why do we set a -1 reward for each action? Because with a reward of -1 at every step, the agent is pushed to deliver the passenger in as few steps as possible.
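Below is a minimal training-loop sketch using the Gymnasium `Taxi-v3` environment; the hyperparameter values are illustrative, not necessarily the ones used in the course notebook:

```python
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")

# Illustrative hyperparameters
n_episodes = 25_000
alpha, gamma = 0.1, 0.99
epsilon = 0.1

# One row per state (500 for Taxi-v3), one column per action (6)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy acting policy
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)

        # Off-policy (greedy) Q-Learning update
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])

        state = next_state
        done = terminated or truncated
```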

Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗| Founder Deep Reinforcement Learning class 📚 https://bit.ly/3QADz2Q |