Q-learning: Line by Line

Pancakes, anyone?

The above sketch shows an implementation of the tabular Q-learning algorithm, applied to a simple game, called the Pancakes Gridworld.See this post to learn more about the Pancakes Gridworld environment. We would like the agent to find the delicious pancakes as quickly as possible, and to avoid the toxic mushrooms along the way. The big problem is that the agents starts off knowing absolutely nothing about its environment. Yes, you've read that correctly; the agent finds itself in the unimaginable situation of not even knowing about the existance of pancakes, let alone their deliciousness. Almost worse, the agent initially doesn't even know the concept of a direction, or how the different cells of the gridworld are connected to each other. It is thus difficult to build a rule-based system along the lines of if you see a mushroom, don't go there; if you see pancakes, go thereif the agent doesn't even know what pancakes are!

This leads to the following question: how can we design a system that allows the agent to learn autonomously to do the right thing (that is, eating pancakes, as often as possible)?

Running into danger that my beloved pancakes soon will be snatched by pancakes-loving robots, this post aims to explain a popular reinforcement learning algorithm (RL), called Q-learning. Before doing that, I'll provide a very brief introduction to the general ideas behind RL itself, in the context of our example.

Reinforcement learning.

Reinforcement learning (RL) is learning from experience that is gathered through repeated interaction with the environment. Further above, we said that the agent doesn't really have any knowledge about its environment whatsoever, yet here we say the agent can interact with its environmentso what does the agent actually know, and what is the agent able to do with that information?

On a very high level, one step of this continuous interaction looks as follows.

  1. The agent observes the current state of the environment, which in our case, is the cell that the agent currently occupies.

  2. The agent selects an action among all available actions, based only on the information gathered so far.

  3. Actions. In every step, the agent chooses one of four actions available in our gridworld (let's call them north, east, south, and west) to move to an adjacent cell in the grid.

  4. After every such move the agent receives a numerical feedback signal, called reward, which depends on the desirability of the cell that is entered.

In our example, we desire pancakes and we dislike toxic mushrooms, so we give a high positive reward to the former and a high negative reward to the latter.In the sketch shown above, rewards are indicated by the colored numbers in the bottom-right corner of each cell. For example, every mushroom cell yields a negative reward of 10-10 while the delicious pancakes yield a reward of 1010. Moving around the grid requires a small amount of energy and thus we also give a small negative reward for every other cell that is entered.

Given such an interactive feedback-system, we are now in a place to proivde the agent with a rather simple instruction:

Learn how to choose actions so as to maximize long-term reward.

If this learning process is successful, the agent will always use the shortest path to the pancakes without touching the mushroomsbecause this is how we designed the algorithm in the first place.

Of course, just because the instruction is simple to understand, it doesn't mean that the learning process itself is trivial. Note that we

Up until now we have basically redefined the simple concept of "pancakes \rightarrow yummy" into a complex system of states, actions, and rewards. What's the point? you might ask. Imagine for a brief moment that our goal is not a robot that learns to love pancakes but a self-driving car that should learn to drive (safely). We could define a similar system that rewards the agent (which is now a car) for every, say, second thay it drives safely, while giving negative rewards whenever the car crashes into a wall. Note how much easier it is to define this reward function than to specify, for each possible situation the car could encounter, what the correct action would be.

long-term reward such as, for example, the sum of all rewards obtained during one episode. An episode ends when a so-called terminal state is entered. In our example, all mushroom cells and the pancake cells are terminal states. The how

We say that the agent is trying to learn an optimal policy. A (behavioral) policy is simply a mapping from so-called statesThe current state of our gridworld is simply given by the grid cell that is currently occupied by the agent. to actions. In other words, the policy tells the agent what to do in which cell of the grid.

Q-learning is one among many RL algorithms to learn an optimal policy

More formally, an agent tries to learn an optimal (behavioral) policy \pi^\{ast}, that

Reinforcement learning (RL) algorithm try to learn an optimal (behavioral) policy. A policy in RL is a mapping from states to actions. In our gridworld, the current state is simply given by the cell the agent currently occupies. The policy therefore tells the agent in which direction to go next.

Q-learning is a popular reinforcement learning algorithm that can be used to learn an optimal policy in such an environment. A policy is a mapping from states to actions


Instead The In each step, the agent can move from grid cell to grid cell and can learn from a feedback signal, called reward, that is obtained after each move in the Gridworld.

The sketch allows you to advance through the pseudo codeline by lineand observe how that line affects the agent's learning process. In what follows, I will provide a very brief introduction to what Q-learning is aboutFor more background knowledge and a more thorough introduction to Q-learning, please see Section 6.5 of Sutton & Barto's book Reinforcement Learning: an introduction. The pseudo code used in the sketch is taken from the same section. and then provide more explanation about every single line of the algorithm's pseudo code.

This post is the first of a series of posts called Reinforcement learning: Line by Line.

© 2021