Jan Malte Lichtenberg

JAN MALTE LICHTENBERG · SEPTEMBER 11, 2021

Reinforcement learning, line by line: An introduction

The goal of this upcoming series of blog posts is to facilitate the understanding of reinforcement learning algorithms by visualizing how pseudo code translates to learning. For each algorithm presented, you will be able to advance through the pseudo code line by line, at your own speed, and observe how each line affects the agent's beliefs and behavior.

This post provides a brief and informal introduction to reinforcement learning (RL) itself. The next post introduces a toy environment in which it is easy to visualize the learning progress of an RL algorithm, and which we therefore will use as a running example throughout this series. It also introduces the mathematical notation and some basic RL concepts that are needed to understand the pseudo code of RL algorithms. If you are already familiar with the general idea behind RL, you might just skip ahead and read about the scope of this blog series or directly dive into the post about Q-learning ( $\leftarrow$ click here if you just want to see RL in action 🤖).

What is RL? And why could it be useful?

Let's start with an example. Imagine you want to create an artificially intelligent system for a self-driving car. We call this system the agent. The goal is to design an agent that can drive the car from its current position to any given goal destination, without endangering anyone in the streets. How can we apply RL to this problem, and how does it differ from other approaches?

A first idea for designing a self-driving agent could be to come up once with a plan that covers the entire car trip and has instructions on what to do at any point of time,Think, for example, of the output given by any route planner, but in much more detail. and to follow through on that plan. While plans can be very useful for high-level tasks such as route planning, they cannot adapt to any unpredictable situations that might arise during the trip such as red traffic lights, ghost drivers, or ball-playing kids on the street. Remember, we don't want to cause any accidents.

sequential decision making. To allow the agent to be more reactive, we could instead model the agent's car trip as a sequence of individual decisions. In every time step (for example, every second), the agent has to decide whether it wants to change the car's speed by a certain amount, or turn the steering wheel by a certain angle, or just leave things as they are. We say that the agent has to decide between different actions. We allow the agent to base each decision on a set of sensory inputs that are observed just before that decision is made. Such sensory inputs could include, for example, the car's current speed, the current angle of the steering wheel, or an image of the current view through the windshield. This input to the decision-making process is often called the state of the environment (or an observation thereof).

⊕

Sequential decision making: The agent (🤖) observes the state of the environment (🌍) and selects an action. The action usually changes the state of the environment. The agent then observes this new state and selects a new action.

The agent's decision-making system is called policy. In the simplest case, a policy is simply a mapping that associates each states to one action that is available in that state. That is, the policy takes a description of state as input and outputs the action, which is then executed by the agent.

Note that here we only really care about which action the agent selects, not how that action is then actually implemented. A helpful (albeit not perfect) analogy is to think of this as modelling a robot's mind (or for the sake of our example, a car's mind) rather than its mechanics.Reinforcement learning applications differ a lot in the level of abstraction of the actions given to the agent. In our example, we're interested in knowing by how much we should accelerate or break in a specific situation, not how much pressure a robot arm has to apply on which pedal in the car in order to execute that action. This, however, is a design choice rather than a hard constraint on the problem formulation.

Compared to a static planning approach, the sequential decision-making approach allows the agent to react to changing circumstances. For example, if a kid suddenly runs onto the street, this will change the state of the environment, and a good policy would react to this new state by stopping the car.

Within this framework, our goal is now to design a good policy—or even better, a system that allows the agent to learn a good policy on its own! In what follows, I'll present three approaches to designing such a system.

expert systems. A first, naïve approach could be to come up with a large set of hand-crafted if/then-rules that associate each state with a good action. For example, assume that $(x_1, y_1)$ and $(x_2, y_2)$ are two sets of longitude/latitude coordinates that represent two very close-by locations on a street. Then, one such hand-crafted if-/then rule could be

“If the car's current longitude/latitude coordinates are $(x_1, y_1)$ , and there's a kid playing with a ball at long/lat coordinates $(x_2, y_2)$ , then slow down immediately to 10 km/h”.

This is an obviously ridiculous and virtually impossible approach to designing a self-driving car because the car could face an infinite number of different situations (and therefore, states of the environment). For the self-driving car to be driving safely, someone would have to come up with a specific rule for every imaginable state.

learning through generalization. To cope with the large number of states the car has to deal with, it would be helpful to have a mechanism that learns new rules automatically. There is certainly hope that this is indeed possible because similar situations (states) often require similar actions. Consider the following simplified example. If you had access to, say, 20 if/then-rules similar to the one shown above (that is, they would each involve a kid playing close to the car but each rules is for different sets of long/lat coordinates) and you would show them to a friend of yours, they would probably come up with a new rule similar to, say,

“If there's a ball-playing kid within eyesight of the driver, then slow down immediately to 10 km/h”.

Compared to the hand-crafted rule from above, this new rule has at least two advantages: a) it reduces the total number of rules we have to keep in memory from 20 to 1 without losing important information, and b) it is also applicable in locations that weren't even considered in the original set of rules—we say the rule generalizes to previously unseen examples. We can make these kinds of generalizations from existing rules, because the world we live in shows certain regularities, and our minds know how to exploit them.

In machine learning, generalization from examples is called supervised learning. Supervised learning algorithms are conceptually quite similar to the rule-generalization example we just went through.

Applied to our car example, a supervised learning algorithm would first define a parametrized mathematical function (also simply called the model) that maps states to actions. The algorithm is then given a set of correctWhat constitutes a correct action for a given state is a difficult question on its own. For the time being, and in the context of our example, you can think of a correct action as one that, at the very least, doesn't immediately lead to an accident. state-action pairs (often called the training data), which is used as follows. The algorithm tweaks the model parameters such that if the model is applied to the states in the training data, the model outputs (called predictions) are as close as possible to the respective correct actions that are given in the training data.

A more detailed introduction to supervised learning is beyond the scope of this post but there are many great introductions available online.One of my favorites for a general intro is R2D3. For neural networks in particular, I recommend to start playing around with the neural network playground. The important thing to know here is that a) supervised learning is great because it allows us to generalize beyond our existing hand-crafted rules, which in turn increases the number of situations the agent can deal with effectively, but b) the supervised learning model will probably only make good inferences in situations that are somewhat closeDon't expect the system to learn how to drive during a rainy night on a motorway, if you only feed it with data about kids playing ball on a sunny day. to situations that the model was trained on.

So whilst a system based on supervised learning is certainly more likely to succeed than a completely hand-crafted system, we still require a representative data set of hand-crafted rules for the training procedure. In practice, this would require that some poor soul would have to generate (probably millions of) representative driving situations and would have to decide, in every single case, what the “correct action” would be.

How can we design an agent that can learn a good policy without requiring explicit supervision from an expert?

learning from experience. As is done so often in AI, we can look for inspiration in how people and animals are learning. And while people certainly learn a lot of things from experts (e.g., parents and teachers) through demonstration and subsequent generalization (which corresponds to the two learning approaches discussed so far), psychologists and biologists have put forward various theories about how we and other animals also learn from our own experience, which we get from our everyday interaction with our environment.

Usually learning from experience involves some form of trial-and-error learning or learning through reinforcement (also called operant conditioning in animal learning). Simply put, we repeatedly try out different actions in similar states and we learn to select those actions that bring us feelings of joy, success, recognition and other positively associated feelings—while we learn to avoid actions that lead to sadness, failure, errors, and depreciation.

However, if we try to apply these rather simple ideas from animal learning to our self-driving car example, we immediately run into a problem: as far as I am aware, cars do not have any built-in mechanisms to experience joy or sadness—so how could the self-driving car agent learn which actions to repeat, and which ones to avoid in the future? In reinforcement learning (RL) the required feedback is usually provided externally, in the form of a numerical signal called reward. In every time step the agent receives one such reward, as shown in the following figure.

⊕

Agent-environment interaction in RL: The agent (🤖) observes the state of the environment (🌍) and selects an action. The action potentially changes the state of the environment. The agent obtains a reward, observes the new state of the environment, and selects a new action based on this newly observed state, and so on.

Given such a system,This system can be formalized as a so-called Markov Decision Process, as shown further below. the reinforcement learning objective (that is, the agent's learning goal) is to learn how to select actions so as to maximize the cumulative reward received over the agent's lifetime.There are many different ways of defining an RL objective in a much more precise manner. The important bit here is that there is always some notion of maximizing long-term cumulative reward rather than just the immediate reward obtained in, say, the current or next time step. Accordingly, we design the reward function such that high rewards are given for behavior that leads to “good” states and lower (or negative) rewards are given for states that lead to “bad” states. Applied to our car example, we could define the following simple reward function:

Give a small positive reward of $1$ in every time step in which the car gets closer to the goal destination, a large negative reward of $-1000$ for every action that leads to a car crash, and a neutral reward of $0$ for whenever the car doesn't come closer to the goal destination and isn't involved in a an accident.

If the learning process is successful (that is, the agents learns how to act so as to maximize cumulative reward) then we know that our initial goal of driving safely from A to B will indeed be met, because this is how we designed the rewards in the first place.

But hey, you might ask, did we not move on from expert systems and supervised learning because we were looking for a way of learning that does not require us to provide supervision from an expert? Isn't defining a reward function the same thing as labeling a data set for supervised learning?

When defining a reward function, we do indeed provide information to the agent, and thus the agents is clearly not learning in a fully autonomous way. But note how much easier it was to define this one reward function (which covers all possible situations the car could face), compared to designing individual rules for all the different types of situation the car might face. It is, in fact, quite often the case that it is easier to quantify the usefulness of an agent's behavior than the correctness of its actions.Another example is the problem of designing a chess-playing agent. It is much easier to define a reward function that gives a positive reward for a winning move, a negative reward for a losing move, and zero reward for any other move—compared to providing the optimal move for millions of possible states (that is, board configurations).

Reinforcement learning, of course, comes with its own set of difficulties. One of these is particularly relevant for our self-driving car example. The agent initially exhibits essentially random behavior. In order to learn how to distinguish good from bad behavior, the agent first needs to experience the consequences of some bad decisions. This learning process could get quite expensive, even dangerous, if the self-driving car has to make accidents before learning how to avoid them (remember that the reward is observed only after an action has been executed).There are many approaches trying to tackle this problem including the pre-training of an agent in a simulated environment or learning from demonstration, see also offline reinforcement learning.

learning from experience & generalization. Reinforcement learning and supervised learning are not mutually exclusive approaches. In fact, they are often combined! One way of thinking about such a combined approach is to see the reinforcement learning part as a way for the agent to autonomously create a training data set, which is then used by the supervised learning part of the algorithm to generalize the learned behavior from states that were visited by the agent to previously unseen states.

To summarize, a reinforcement learning agent learns how to act from the experience gathered while interacting with its environment. This general idea is rather easy to understand, yet RL remains a challenging problem and the specifics are subject to a lot of past and present research. Apart from more practical issues such as how to obtain enough real-world experience in a safe way, there are some more basic algorithmic questions, including How can the agent maximize future cumulative reward if it only observes the reward given in the current time step? And how should the agent estimate the importance of previous decisions made in the process of getting to the current state (and receiving the current reward)?

There are some major algorithmic ideas that have found at least partial answers to these questions, and the goal of this blog series is to provide an intuitive introduction to some of these basic RL algorithms. Yet because this would be somewhat difficult to do in the context of our self-driving car example, the next post introduces a very simple sequential decision-making environment that will allow us to follow the agent's learning process step by step. Pancakes, anyone?

Who is this blog series for?

This blog series is not a standalone tutorial on RL. I started developing the material presented in this blog when I was tutoring for Özgür Şimşek's RL unit at the University of Bath. That unit was partially based on Sutton & Barto's excellent book RL: an introduction. The interactive examples shown in this blog are based on pseudo code presented in Sutton & Barto's book. That book might therefore be the best reference for you in case you want to get more background knowledge about the various RL algorithms.

If you liked this post, please consider following me on Twitter for updates on new blog posts.