JAN MALTE LICHTENBERG   ·   FEBRUARY 1, 2021

Reinforcement Learning: Line by Line

In this upcoming series of blog posts, I aim to facilitate the understanding of various reinforcement learning (RL) algorithms by visualizing how their pseudo code translates to learning. For each algorithm considered, you will be able to advance through the pseudo code line by line, at your own speed, and observe how each line affects learning.

This post defines the scope of this series and provides a very brief introduction to RL itself. It also introduces a toy environment in which it is easy to visualize the learning progress of an RL algorithm. If you are already a bit familiar with RL, you might skip ahead and directly dive into the first post about Q-learning.

#### What is RL? And why could it be useful?

Let's start with an example. Imagine you want to create an artificially intelligent system for a self-driving car. In other words, the goal is to create a system that can drive the car from one place to another without endangering anyone in the streets. We call this system the agent.

sequential decision making. We can model the agent's car ride as a sequence of individual decisions. That is, every second or so, the agent has to decide whether it wants to change the car's speed by a certain amount, or turn the steering wheel by a certain angle, or any combination of the two. We say that the agent has to decide between different actions. We allow the agent to base its decision on a set of sensory inputs that are observed just before every decision. Such sensory inputs could include, for example, the car's current speed, the current angle of the steering wheel, or an image of the current view through the windshield. This input to the decision-making process is often called the state of the environment (or an observation thereof). To summarize: in every time step the agent (🤖) makes a decision of the following form: state → 🤖 → action.
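If it helps to see this as code, the decision can be sketched as a function from a state to an action. This is only an illustrative Python sketch: the field names, the `obstacle_ahead` flag (a stand-in for what a windshield image might tell us), and the placeholder rule inside `agent` are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    speed_kmh: float           # current speed of the car
    steering_angle_deg: float  # current angle of the steering wheel
    obstacle_ahead: bool       # hypothetical summary of the windshield image


@dataclass(frozen=True)
class Action:
    speed_change_kmh: float     # how much to change the speed
    steering_change_deg: float  # how much to turn the steering wheel


def agent(state: State) -> Action:
    """Placeholder decision rule: slow down if something is ahead, else do nothing."""
    if state.obstacle_ahead:
        return Action(speed_change_kmh=-10.0, steering_change_deg=0.0)
    return Action(speed_change_kmh=0.0, steering_change_deg=0.0)


# The car ride is a sequence of such decisions, one per time step.
states = [State(50.0, 0.0, False), State(50.0, 0.0, True)]
actions = [agent(s) for s in states]
```

The rest of this post is essentially about what should go inside `agent`.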

Note that here we are only concerned with the software side of things. In other words, we only care about which action to choose, not how that action is actually implemented. For example, we're interested in whether we should accelerate or brake in a specific situation, but not how much pressure a robot arm has to apply on which pedal in the car in order to execute that action. You can think of it as modelling a robot's mind (or for the sake of our example, a car's mind, if that's a thing) rather than its mechanics.

We've now broken down the overarching problem (drive a car from A to B without endangering anyone) into many small but slightly easier problems. The question now becomes: How can we design an agent that in every time step selects an action such that the overarching goal remains attainable?

In what follows, I'll present three approaches to designing such an agent.

expert systems. A first, naïve approach could be to come up with a large set of hand-crafted if/then rules that associate each state with a good action. For example, assume that $(x_1, y_1)$ and $(x_2, y_2)$ are two sets of longitude/latitude coordinates that represent two very close-by locations on a specific street. Then one such hand-crafted if/then rule could be

if the car currently has longitude/latitude coordinates $(x_1, y_1)$, and there's a kid playing with a ball at long/lat coordinates $(x_2, y_2)$, then slow down immediately to 10 km/h.

This is an obviously ridiculous and virtually impossible approach to designing a self-driving car because the car could face an infinite number of different situations (and therefore, states). For the self-driving car to drive safely, someone would have to come up with a specific rule for every imaginable situation.

learning through generalization. To cope with the large number of states the car has to deal with, it would be helpful to have a mechanism that learns new rules automatically. There is certainly hope that this is indeed possible because similar situations (states) often require similar actions. Consider the following simplified example. If you were to present 20 if/then rules similar to the one shown above (that is, each would involve a kid playing close to the car, but each rule would be for a different set of long/lat coordinates) to a friend of yours who has never driven a car, then they would probably come up with a new rule similar to, say,

if there's a ball-playing kid within eyesight of the car, then slow down immediately to 10 km/h.

Compared to the hand-crafted rule from above, this new rule has at least two advantages: (a) it reduces the total number of rules we have to keep in memory from 20 to 1, and (b) it is also applicable in locations that weren't even considered in the original set of rules; we say the rule generalizes to previously unseen examples. We can make these kinds of generalizations from existing rules because the world we live in shows certain regularities, and our minds know how to exploit them.
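To make the difference between the two kinds of rules concrete, here is a small Python sketch. It is an illustration under simplifying assumptions, not a realistic driving rule: positions are treated as points on a flat plane measured in metres (rather than real longitude/latitude), and the specific coordinates and the 50-metre "eyesight" radius are made up.

```python
import math

SLOW_DOWN = "slow down to 10 km/h"


def handcrafted_rule(car_pos, kid_pos):
    """Fires only for one exact, hand-picked pair of positions."""
    if car_pos == (10.0, 20.0) and kid_pos == (12.0, 21.0):
        return SLOW_DOWN
    return None  # no rule exists for this state


def generalized_rule(car_pos, kid_pos, eyesight_m=50.0):
    """Fires whenever the kid is within eyesight, wherever the car is."""
    if math.dist(car_pos, kid_pos) <= eyesight_m:
        return SLOW_DOWN
    return None


# A location that was never part of the original set of rules:
unseen_car, unseen_kid = (300.0, 400.0), (302.0, 401.0)
handcrafted_result = handcrafted_rule(unseen_car, unseen_kid)
generalized_result = generalized_rule(unseen_car, unseen_kid)
```

The hand-crafted rule has nothing to say about the unseen location, whereas the generalized rule still applies there.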

In machine learning, generalization from examples is called supervised learning. On a very high level, supervised learning algorithms are very similar to the example we just went through. Applied to our car example, a supervised learning algorithm would first define a parametrized mathematical function (also simply called the model) that maps states to actions. It would then take a set of correct state-action pairs (often called training data) as input and choose the model parameters such that if the model is applied to the states in the training data, the outputs (called predictions) would be close to the actions in the training data. (What constitutes a correct action for a given state is a difficult question on its own. For the time being, and in the context of our example, you can think of a correct action as one that, at the very least, doesn't immediately lead to an accident.) A full introduction to supervised learning is beyond the scope of this post, but there are many great introductions available online; one of my favourites is R2D3.
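As a rough illustration of "choosing the model parameters", here is a minimal Python sketch that fits a straight line with ordinary least squares. Everything specific in it is an assumption for the sake of the example: the state is reduced to a single number (distance to the kid in metres), the action is a target speed in km/h, and the four training pairs are invented.

```python
def fit_linear(states, actions):
    """Choose parameters (w, b) of the model w * state + b by least squares."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


def predict(params, state):
    """Apply the model to a state to obtain a predicted action."""
    w, b = params
    return w * state + b


# Invented training data: (distance to kid in m, target speed in km/h)
train_states = [5.0, 10.0, 20.0, 40.0]
train_actions = [7.5, 10.0, 15.0, 25.0]

params = fit_linear(train_states, train_actions)

# Prediction for a state that does not appear in the training data
predicted_speed = predict(params, 30.0)
```

Real supervised learning models are of course far more flexible than a straight line, but the recipe is the same: define a parametrized function, then pick the parameters that bring the predictions close to the actions in the training data.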

The important thing to know here is that

1. supervised learning is great because it allows us to generalize beyond our existing hand-crafted rules, which in turn increases the number of situations the agent can deal with effectively, but
2. the supervised learning model will probably only make good inferences in situations that are somewhat close to situations that the model was trained on. (Don't expect the system to learn how to drive on a motorway during the night while it's raining if you only feed it with data about kids playing ball on a sunny day.)

So whilst a system based on supervised learning is certainly more likely to succeed than a completely hand-crafted system, we still require a representative data set of hand-crafted rules for the training procedure. In practice, this would require that some poor soul would have to generate (probably) millions of representative driving situations and would have to decide, for every single case, what the correct action would be!

learning from experience. How can we design an agent that can learn useful behavior without requiring explicit supervision from an expert? As is done so often in AI, we can look for inspiration in how people and animals are learning. And while people certainly learn a lot of things from experts through demonstration and subsequent generalization (which corresponds to the two learning approaches discussed so far), we certainly also learn many things from our own experience, by trial and error, and thus without the help of anybody else.

For learning from experience to be possible, we usually require two things: (a) the learner can repeatedly try out different actions in the same (or at least similar) state, and (b) the learner occasionally receives some form of feedback on the goodness of the actions. This feedback could be external

Learning from experience can sometimes be much easier than other learning approaches. From a young age on, we learn by example from all sorts of friends, family, teachers, and colleagues, who are experts in all sorts of domains. Yet there are many things that we also learn by ourselves, through the experiences that we've made. Consider, for example, that you want to teach a toddler that lemons are sour. It is much easier to let the toddler take a simple bite into a lemon slice.

Reinforcement learning is a way of learning useful behavior that does not require explicit supervision in the form of labeled instances. Instead, a reinforcement learning agent learns from the experience that it obtains while interacting with the environment. More precisely, we assume that the

Simply put, we allow the agent to learn through trial and error. The feedback it receives is given in the form of a numerical signal, called the reward. The reward is feedback on the usefulness of an action rather than on its correctness.
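To give a feeling for what learning from a reward signal can look like, here is a deliberately tiny Python sketch. It is not one of the algorithms covered in this series; it is a simple sample-average value estimate with epsilon-greedy action selection for one fixed situation. The two actions, the reward values, and the parameters `n_trials`, `epsilon`, and `seed` are all assumptions invented for this example.

```python
import random

ACTIONS = ["brake", "accelerate"]


def reward(action):
    """Hypothetical feedback for one fixed situation: a kid is close to the road."""
    return 1.0 if action == "brake" else -1.0


def learn_by_trial_and_error(n_trials=100, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # estimated usefulness of each action
    count = {a: 0 for a in ACTIONS}    # how often each action has been tried
    for t in range(n_trials):
        if t < len(ACTIONS):
            action = ACTIONS[t]                   # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # occasionally explore
        else:
            action = max(ACTIONS, key=value.get)  # otherwise pick the best so far
        r = reward(action)
        count[action] += 1
        # Incremental average of the rewards observed for this action
        value[action] += (r - value[action]) / count[action]
    return value


values = learn_by_trial_and_error()
best_action = max(values, key=values.get)
```

Note that nobody ever tells the agent which action is correct. It only observes how useful each action turned out to be, and it gradually prefers the one with the higher estimated reward.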

Note that reinforcement learning and supervised learning are not mutually exclusive approaches. In fact, they are often combined! One way of thinking of such a combined approach is to see the reinforcement learning part as a way to autonomously create a training data set, which then is used by the supervised learning part to learn a policy that generalizes to previously unseen situations. That is, instead of being provided by some external labeler, the training data is created by the agent itself, building on the information contained in the reward signal.

#### Who is this for?

This blog series is not a standalone tutorial on RL. In particular, I use as little maths as possible and I will (at least, for the time being) not release any actual code examples in an actual programming language. I do this not because I think that maths and code are useless when learning RL (quite the opposite), but because there is already a lot of mathsy material out there, and even more so ready-to-use code repositories that provide Python or JavaScript implementations of all the basic algorithms.

I hope that this blog series can be complementary to all the other existing materials and thus help you to progress faster with your coding efforts as well as with understanding the maths behind RL.

The pseudo code of many of the algorithms considered here is taken from Sutton & Barto's Reinforcement Learning: An Introduction. I hope that my blog can complement this great book by giving the reader a tool to directly interact with the pseudo code of various algorithms. As I will also use that book's mathematical notation throughout this blog series, the book might be the best reference for you in case you want to get more background knowledge about the various algorithms.

So if you just started learning RL, or if you know RL but want to deepen your understanding, or even if you know everything about RL but you just like to #feelthelearn, I very much hope that you enjoy reading my blog. Please feel free to write me an email if you have any comments or questions!