# Introducing Q-Learning

## What is Q-Learning?

• “Off-policy”: we’ll talk about that at the end of this chapter.
• “Value-based method”: it finds its optimal policy indirectly by training a value function or an action-value function that tells us the value of each state or each state-action pair.
• “Uses a TD approach”: it updates its action-value function at each step, instead of waiting for the end of the episode.
• Q-Learning is the RL algorithm that trains a Q-function, an action-value function that contains, as internal memory, a Q-table storing all the state-action pair values.
• Given a state and an action, our Q-function searches its Q-table for the corresponding state-action pair value (also called the Q-value).
• When the training is done, we have an optimal Q-function, and therefore an optimal Q-table.
• And if we have an optimal Q-function, we have an optimal policy, since for each state we know the best action to take. Thanks to training, our Q-table gets better and better at telling us the value of each state-action pair.
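The Q-table lookup described above can be sketched in plain Python. This is a minimal illustration, not the chapter’s implementation; the states, actions, and values are made up:

```python
# A tiny hypothetical Q-table: keys are (state, action) pairs,
# values are the learned state-action pair values (Q-values).
q_table = {
    ("s0", "left"): 0.0,
    ("s0", "right"): 0.5,
    ("s1", "left"): 0.2,
    ("s1", "right"): 1.0,
}

def q_value(state, action):
    """Given a state and an action, search the Q-table for the value."""
    return q_table[(state, action)]

def greedy_action(state, actions=("left", "right")):
    """The policy derived from the Q-table: in each state,
    pick the action with the highest Q-value."""
    return max(actions, key=lambda a: q_table[(state, a)])

print(q_value("s0", "right"))  # 0.5
print(greedy_action("s1"))     # right
```

Once the Q-table is optimal, `greedy_action` *is* the optimal policy: acting greedily with respect to an optimal Q-function is what makes the policy optimal.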

## The Q-Learning algorithm

• With probability 1 − ε: we do exploitation (our agent selects the action with the highest state-action pair value).
• With probability ε: we do exploration (trying a random action).
• To perform the update, we need St, At, Rt+1 and St+1.
• To update our Q-value at this state-action pair, we form the TD target Rt+1 + γ max_a Q(St+1, a) and apply the update: Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)].
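The steps above can be sketched together as one ε-greedy action selection followed by one TD update. Function and variable names here are mine, chosen for illustration:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit
    (pick the action with the highest state-action pair value)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Q(St,At) <- Q(St,At) + alpha * [Rt+1 + gamma * max_a Q(St+1,a) - Q(St,At)]"""
    td_target = reward + gamma * max(q_table[(next_state, a)] for a in actions)
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error

# One illustrative update: all Q-values start at 0 except Q(s1, up) = 1.
q_table = {("s0", "up"): 0.0, ("s0", "down"): 0.0,
           ("s1", "up"): 1.0, ("s1", "down"): 0.0}
q_learning_update(q_table, "s0", "up", reward=0.0, next_state="s1",
                  actions=("up", "down"))
# new Q(s0, up) = 0 + 0.1 * (0 + 0.99 * 1.0 - 0) = 0.099
```

Note that the update always uses the *max* over next actions, whatever action the agent actually takes next; this detail is what the off-policy discussion below is about.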

## Off-policy vs On-policy

• Off-policy: using a different policy for acting and updating. Q-Learning is off-policy: it acts with an ε-greedy policy, but updates its Q-values using the greedy (max) action.
• On-policy: using the same policy for acting and updating. SARSA, for instance, updates with the action its ε-greedy policy actually takes next.
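The difference shows up directly in the TD target. A minimal comparison, with illustrative numbers (SARSA is the classic on-policy counterpart, mentioned here as an example; it is not part of this chapter’s algorithm):

```python
def q_learning_target(reward, q_table, next_state, actions, gamma=0.99):
    # Off-policy: the target uses the greedy (max) action, regardless of
    # which action the acting policy will actually take next.
    return reward + gamma * max(q_table[(next_state, a)] for a in actions)

def sarsa_target(reward, q_table, next_state, next_action, gamma=0.99):
    # On-policy: the target uses the action actually chosen by the
    # same (e.g. epsilon-greedy) policy used for acting.
    return reward + gamma * q_table[(next_state, next_action)]

q = {("s1", "up"): 1.0, ("s1", "down"): 0.0}
print(q_learning_target(0.0, q, "s1", ("up", "down")))  # 0.99
print(sarsa_target(0.0, q, "s1", "down"))               # 0.0
```

If exploration happens to pick the worse action `down` next, SARSA’s target reflects that choice, while Q-Learning’s target ignores it and still backs up the best action.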

## An example

• You’re a mouse in this very small maze. You always start at the same starting point.
• The goal is to eat the big pile of cheese at the bottom right-hand corner, and avoid the poison.
• The episode ends if we eat the poison, eat the big pile of cheese, or if we spend more than 5 steps.
• The learning rate is 0.1
• The gamma (discount rate) is 0.99
• +0: Going to a state with no cheese in it.
• +1: Going to a state with a small cheese in it.
• +10: Going to the state with the big pile of cheese.
• -10: Going to the state with the poison and thus dying.
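With these numbers, one concrete update looks like this. Suppose the mouse moves into a small-cheese state (reward +1) while the whole Q-table is still at zero (a worked instance of the update rule, with the example’s hyperparameters):

```python
alpha, gamma = 0.1, 0.99   # learning rate and discount rate from above
q_current = 0.0            # Q(St, At) before the update
reward = 1.0               # small cheese
max_next_q = 0.0           # max_a Q(St+1, a): the table is still all zeros

td_target = reward + gamma * max_next_q           # 1 + 0.99 * 0 = 1.0
q_new = q_current + alpha * (td_target - q_current)
print(q_new)  # 0.1
```

So after one step, Q of that state-action pair moves from 0 toward the TD target by a tenth of the gap; over many episodes the +10 of the big cheese propagates backward through the table the same way.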

# Let’s train our Q-Learning Taxi agent 🚕

Why do we set a reward of -1 for each action? Because we want to encourage our agent to reach the goal in as few steps as possible: every extra step costs it -1.
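To see the effect of that per-step -1, here is a self-contained training sketch. A toy corridor environment stands in for Taxi (the real exercise uses the Gym `Taxi-v3` environment; this stub only mimics a reset/step interface, and all names are mine):

```python
import random

class Corridor:
    """Toy stand-in for Taxi-v3: 5 cells in a row, goal in the last cell.
    Every step costs -1; reaching the goal gives +20 (like Taxi's dropoff)."""
    def __init__(self, n=5):
        self.n = n
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):             # 0 = left, 1 = right
        self.pos = max(0, min(self.n - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.n - 1
        reward = 20 if done else -1     # the -1 per step keeps episodes short
        return self.pos, reward, done

env = Corridor()
q = [[0.0, 0.0] for _ in range(env.n)]  # Q-table: one row per state, one column per action
alpha, gamma, epsilon = 0.1, 0.99, 0.1

random.seed(0)
for _ in range(500):                    # training loop
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:   # explore
            action = random.randrange(2)
        else:                           # exploit (ties broken toward "right")
            action = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = env.step(action)
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

# Greedy action per non-terminal state after training:
print([0 if a > b else 1 for a, b in q[:-1]])
```

Because every wasted step subtracts 1 from the return, the learned greedy policy heads straight for the goal; the same pressure is what makes the Taxi agent learn the shortest pickup-and-dropoff routes.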


## More from Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗 | Founder of the Deep Reinforcement Learning class 📚 https://bit.ly/3QADz2Q