Reinforcement learning, line by line: Q-learning

This is the third post of the blog series Reinforcement learning: line by line. The interactive
sketch shows an implementation of the tabular *Q-learning* algorithm (Watkins, 1989)
applied to a simple game, called the *Pancakes Gridworld*. See this post for more information about the Pancakes
Gridworld as well as the notation and foundational concepts required to
understand the algorithm. In case you are completely new to reinforcement
learning (RL), see here for an informal
introduction.

The algorithm: tabular Q-learning

The agent's goal is to learn an optimal action value function through interaction with the environment. The Pancakes Gridworld can be modeled as a Markov decision process (MDP) with finite state and action sets and we can thus represent the value function as a set of $Q(s, a)$-values, one for each state-action pair $(s, a)$. The agents starts with a random estimate of each $Q$-value (in the example shown above, we initialize all $Q(s, a)$-values to $0$).In the sketch shown above, the Q-values are represented by four arrows in each cell. In the beginning these arrows are gray but they become blue (for negative Q-values) or red (for positive Q-values) during the learning process. During the learning process, the agent uses the current value estimates to make decisions and uses the reward signals provided by the environment to continuously improve its value estimates.

$0: \text{Loop for each episode: }$- This is the “outer loop”. This simply indicates that the value function is usually learnt across different episodes.

- Set the agent to a starting state (in our case, this is always the same state $s_0$ in the left-bottom corner of the grid).

- This is the “inner loop”; it represents one episode (that is, until one of the terminal states 🍄 or 🥞 is reached).

The so-called

*greedy policy*in a state $s$ is to select any value-maximizing action in that state, that is, $\pi(s) = \argmax_{a \in \mathcal{A}(s)} \ Q(s, a)$. In other words, the agent selects an action that is assumed to be best, according to the current beliefs ($Q$-values).The

*$\epsilon$-greedy policy*is a stochastic policy based on the greedy policy. With probability of $(1-\epsilon)$ it selects a value-maximizing, greedy action. Yet with probability of $\epsilon$, the $\epsilon$-greedy policy selects a uniformly randomly chosen action!Note that the $\epsilon$-greedy policy contains the greedy policy (for $\epsilon = 0$) and the uniformly random policy (for $\epsilon = 1$) as special cases. Occasionally selecting non-value-maximizing actions is called*exploration*and is required for the algorithm to converge to an optimal policy, see also the FAQs further below.

- The response of the environment to the action executed by the agent.

*The learning update. This is where all the magic happens!*The agent updates the value estimate for $Q(S, A)$, where $S$ is the current state and $A$ is the action that was selected by the $\epsilon$-greedy policy. Let's start to unpack the update formula a bit:$\underbrace{Q(S, A)}_{\text{new estimate}} \leftarrow \textcolor{red}{(1 - \alpha)} \ \underbrace{Q(S, A)}_{\text{old estimate}} + \textcolor{red}{\alpha} \ \underbrace{[R + \gamma \max_{a} Q(S', a)]}_{\text{Bellman estimate}}.$The new estimate of $Q(S, A)$ is a weighted average of the agent's previous estimate of $Q(S, A)$ and the “Bellman estimate” of $Q(S, A)$ (see below). The learning rate $\textcolor{red}{\alpha}$ determines how much weight we give to either of these estimates. For example, say our learning rate is $\alpha = 0.1$, then our new value estimate consists of $1 - 0.1 = 90\%$ of the previous estimate of $Q(S, A)$ and $10\%$ of the Bellman estimate. If the learning rate is too low (in the extreme case, $\alpha = 0$), we never actually learn anything new because the new estimate always equal the old estimate. If, on the other hand, the learning rate is too high (say, $\alpha = 1$), we “throw away” everything we've learned about $Q(S, A)$ during the update.

The Bellman estimate $\hat{Q}(S, A) = [R + \gamma \max_{a} Q(S', a)]$ is based on the Bellman equations of the optimal action-value function. In fact, the

*deterministic*Bellman equations for the optimal value function, given by $q_{\ast}(s, a) = \gamma q_{\ast}(s', a_\ast)$, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, look almost the same as the Bellman estimator! Intuitively speaking, the Bellman estimator decomposes the expected return $\hat{Q}(S, A)$ into a) the reward that is obtained in the following time step, given by $R$, and b) the return that is expected*after*that time step (estimated by the currently best $Q$-value in the next state, $\max_{a} Q(S', a)$). One important difference between these two components is that $R$ is an actual reward that was just observed, whereas the value of the next state is an estimate itself (one, that at the beginning of the learning process is initialized randomly and thus usually not very accurate).Especially in the beginning, most of the learning progress is thus made for state-action pairs that lead to an important reward $R$ (for example, an action that leads the agent onto the 🥞-cell). Over time this knowledge is then propagated via the “bootstrapping” mechanism that, simply put, relates the value of a state $S$ to the value of the following state $S'$. You can observe this behavior in the sketch above. The first red arrow (positive value) occurs when the agent hits the 🥞-cell for the first time. After that, the red arrows (that is, the knowledge that 🥞 are close) slowly but surely propagate to the starting state.

- The “next state” becomes the “current state”. If the current state is now a terminal state, then the episode is over and the algorithm jumps back to line 1. If the episode is not over yet, the algorithm jumps back to line 3 to select the next action.

FAQ

**(Why) does the agent need to explore?** An agent that never explores might get stuck in
a sub-optimal policy. Consider the illustrative example shown in the sketch
below. You can see the agent's current Q-value estimates as shown by the red
arrows. The greedy policy with respect to this value function actually leads the
agent to the 🥞-cell, just not in an optimal way (the expected episode return of
this policy is $0$). The exploration parameter in this sketch is set to
$\epsilon = 0$ by default, so if you press the play_arrow button, the agent will always
follow the same, inefficient path.

Now increase the exploration parameter a little bit to, say, $\epsilon = 0.3$ (using the slider next to the bottom-right corner of the gridworld). You will see that, after some time, the agent finds a shorter route to the pancakes and updates its action-value estimates accordingly. Once an optimal policy is found (you can check this using the “Greedy policy” button below the sketch), you could dial down again the agent's exploration behavior to see that the agent now yields the optimal episode return of $4$.

**Why should I care? Isn't the Pancakes Gridworld way too easy?** Your friends
probably wouldn't be particularly impressed if you showed them that *you* could
“solve” the Pancakes Gridworld. So why should we be impressed by the
Q-learning agent? The difference is that the agent initially knows absolutely
nothing about its environment. Yes, you've read that correctly; the agent finds
itself in the unimaginable situation of not even knowing about the existence of
pancakes, let alone their deliciousness. Almost worse, the agent initially
doesn't even know the concept of a direction, or how the different cells of the
gridworld are connected to each other. All that the agent ever gets is the info
that there are four actions in every state and a reward signal after very step.
We, on the other hand, know that pancakes are good and your eyes help you to
navigate safely to them. The RL agent learns all this from scratch, which, in my
opinion, is a quite impressive achievement.

**What is the “Bellman error”?**

You might have noticed that many papers and books write the Q-learning update rule as

$\underbrace{Q(S, A)}_{\text{new estimate}} \leftarrow
\underbrace{Q(S, A)}_{\text{old estimate}} + \alpha \
\underbrace{[\textcolor{#7FB069}{R + \gamma \max_{a} Q(S', a)} - \textcolor{#FA7921}{Q(S, A)}]}_{\text{Bellman
error}}.$

This is of course just a re-arrangement of the formula shown and discussed in
the pseudo code above, but it offers another nice interpretation. The $\underbrace{\textcolor{#FA7921}{Q(S, A)}}_{\text{LHS}} =
\underbrace{\textcolor{#7FB069}{R + \gamma \max_{a} Q(S', a)}}_{\text{RHS}}.$

The Q-learning update is thus simply a way of reducing the Bellman error by adding a
tiny bit of it to the current estimate of $Q(S, A)$ (the amount that is added is
determined by the learning rate $\alpha$). The smaller the Bellman
error across all state-action pairs, the closer are our value-function estimates
to the optimal action-value function.If you liked this post, please consider following me on Twitter for updates on new blog posts. In the next post we compare Q-learning to the SARSA algorithm.