Reinforcement learning, line by line: Q-learning
This is the third post of the blog series Reinforcement learning: line by line. The interactive sketch shows an implementation of the tabular Q-learning algorithm (Watkins, 1989) applied to a simple game, called the Pancakes Gridworld. See this post for more information about the Pancakes Gridworld as well as the notation and foundational concepts required to understand the algorithm. In case you are completely new to reinforcement learning (RL), see here for an informal introduction.
The algorithm: tabular Q-learning
The agent's goal is to learn an optimal action value function through interaction with the environment. The Pancakes Gridworld can be modeled as a Markov decision process (MDP) with finite state and action sets and we can thus represent the value function as a set of -values, one for each state-action pair . The agents starts with a random estimate of each -value (in the example shown above, we initialize all -values to ).In the sketch shown above, the Q-values are represented by four arrows in each cell. In the beginning these arrows are gray but they become blue (for negative Q-values) or red (for positive Q-values) during the learning process. During the learning process, the agent uses the current value estimates to make decisions and uses the reward signals provided by the environment to continuously improve its value estimates.
The so-called greedy policy in a state is to select any value-maximizing action in that state, that is, . In other words, the agent selects an action that is assumed to be best, according to the current beliefs (-values).
The -greedy policy is a stochastic policy based on the greedy policy. With probability of it selects a value-maximizing, greedy action. Yet with probability of , the -greedy policy selects a uniformly randomly chosen action!Note that the -greedy policy contains the greedy policy (for ) and the uniformly random policy (for ) as special cases. Occasionally selecting non-value-maximizing actions is called exploration and is required for the algorithm to converge to an optimal policy, see also the FAQs further below.
The learning update. This is where all the magic happens! The agent updates the value estimate for , where is the current state and is the action that was selected by the -greedy policy. Let's start to unpack the update formula a bit:
The new estimate of is a weighted average of the agent's previous estimate of and the “Bellman estimate” of (see below). The learning rate determines how much weight we give to either of these estimates. For example, say our learning rate is , then our new value estimate consists of of the previous estimate of and of the Bellman estimate. If the learning rate is too low (in the extreme case, ), we never actually learn anything new because the new estimate always equal the old estimate. If, on the other hand, the learning rate is too high (say, ), we “throw away” everything we've learned about during the update.
The Bellman estimate is based on the Bellman equations of the optimal action-value function. In fact, the deterministic Bellman equations for the optimal value function, given by , for all and , look almost the same as the Bellman estimator! Intuitively speaking, the Bellman estimator decomposes the expected return into a) the reward that is obtained in the following time step, given by , and b) the return that is expected after that time step (estimated by the currently best -value in the next state, ). One important difference between these two components is that is an actual reward that was just observed, whereas the value of the next state is an estimate itself (one, that at the beginning of the learning process is initialized randomly and thus usually not very accurate).
Especially in the beginning, most of the learning progress is thus made for state-action pairs that lead to an important reward (for example, an action that leads the agent onto the 🥞-cell). Over time this knowledge is then propagated via the “bootstrapping” mechanism that, simply put, relates the value of a state to the value of the following state . You can observe this behavior in the sketch above. The first red arrow (positive value) occurs when the agent hits the 🥞-cell for the first time. After that, the red arrows (that is, the knowledge that 🥞 are close) slowly but surely propagate to the starting state.
(Why) does the agent need to explore? An agent that never explores might get stuck in a sub-optimal policy. Consider the illustrative example shown in the sketch below. You can see the agent's current Q-value estimates as shown by the red arrows. The greedy policy with respect to this value function actually leads the agent to the 🥞-cell, just not in an optimal way (the expected episode return of this policy is ). The exploration parameter in this sketch is set to by default, so if you press the play_arrow button, the agent will always follow the same, inefficient path.
Now increase the exploration parameter a little bit to, say, (using the slider next to the bottom-right corner of the gridworld). You will see that, after some time, the agent finds a shorter route to the pancakes and updates its action-value estimates accordingly. Once an optimal policy is found (you can check this using the “Greedy policy” button below the sketch), you could dial down again the agent's exploration behavior to see that the agent now yields the optimal episode return of .
Why should I care? Isn't the Pancakes Gridworld way too easy? Your friends probably wouldn't be particularly impressed if you showed them that you could “solve” the Pancakes Gridworld. So why should we be impressed by the Q-learning agent? The difference is that the agent initially knows absolutely nothing about its environment. Yes, you've read that correctly; the agent finds itself in the unimaginable situation of not even knowing about the existence of pancakes, let alone their deliciousness. Almost worse, the agent initially doesn't even know the concept of a direction, or how the different cells of the gridworld are connected to each other. All that the agent ever gets is the info that there are four actions in every state and a reward signal after very step. We, on the other hand, know that pancakes are good and your eyes help you to navigate safely to them. The RL agent learns all this from scratch, which, in my opinion, is a quite impressive achievement.
What is the “Bellman error”?
You might have noticed that many papers and books write the Q-learning update rule as