JAN MALTE LICHTENBERG   ·   SEPTEMBER 12, 2021

On pancakes, Markov decision processes (MDPs), and value functions

This is the second post of the blog series Reinforcement learning: line by line.

The previous post introduced the reinforcement learning (RL) problem in a non-mathematical way, using self-driving cars as an example application. This post introduces a much simpler sequential decision-making problem, called the Pancakes Gridworld, which allows us to focus on algorithmic details rather than on the many technical details present in real-world applications.

Ideally, once we've found and understood an RL algorithm that solves problems in small toy domains, we would like to go back to the larger, more interesting domain and apply the same learning algorithm, generalizing our newly found problem-solving capabilities. To facilitate this process, it is useful if all domains and algorithms speak the same language. In particular, we would like the notions of state, action, and rewardGo back to this section for an intuitive explanation of these terms. to represent somewhat similar concepts across the various domains we consider.

In reinforcement learning, Markov decision processes, or MDPs, are by far the most commonly used mathematical framework to model sequential decision-making environments. Many RL algorithms are defined for arbitrary MDPs. Therefore, if we find an algorithm that can solve the simple MDP corresponding to the Pancakes Gridworld, there is hope that the same algorithm (or a scaled-up version thereof) can also solve more complex MDPs that correspond to more interesting problems.

In the following sections, we will introduce the Pancakes Gridworld and formulate it as an MDP.I closely follow Sutton & Barto's treatment and notation of MDPs (Chapter 3 in RL: an introduction) Using this MDP formulation we can then formulate the RL objective introduced in the previous post as a mathematically precise optimization problem. Finally, this post briefly introduces the concept of a value function, which is one of the main building blocks for the Q-learning algorithm discussed in the next post.

#### The Pancakes Gridworld

This is the Pancakes Gridworld, an extremely boring game. In case you actually played it, you probably immediately knew what to do in order to win: move the agent towards the pancakes as quickly as possible and avoid the fly agaric mushrooms along the waythey are toxic!

Despite its dullness, the Pancakes Gridworld satisfies two requirements that make it a great example environment for the purposes of this blog series.Gridworlds are used in many introductory materials on RL, including Sutton & Barto's book Reinforcement learning: an introduction and Andrej Karpathy's reinforce.js, which was a major inspiration for this blog.

• First, it can be easily formulated as an MDP (see next section).

• Second, the Pancakes Gridworld is small and simple enough so that we can visualize the agent's current policy on all states at the same time. Such a global perspective on the agent's currently learned behavior can sometimes reveal patterns that a local perspective (which we are usually forced to take when dealing with larger MDPsGoing back to our previous example of self-driving cars, it is not practicable to inspect the driving agent's behavior for every possible state at any given time step. There are simply way too many states.) could not.

One note: please do not get lost in any mathematical details if you read about MDPs for the first time. Instead, try to get a high-level idea of how state, action, and reward interact with each other.

#### MDP formulation of the Pancakes Gridworld

The following sketch shows again the Pancakes Gridworld, but this time with additional information that corresponds to an episodic MDP formulation that is further explained below. You can again play the game, but this time, observe how the values of various MDP variables change as you move around the agent.

notation. An MDP is usually denoted by a tuple

$\left\langle \ \mathcal{S}, \mathcal{A}, s'(s, a), r(s, a), \gamma, \rho_0 \ \right\rangle,$

where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $s'(s, a)$ is the state-transition function that specifies the new state $s'$ when taking action $a$ in state $s$, and $r(s, a)$ is the reward function, which provides the reward $r$ the agent receives when taking action $a$ in state $s$.To make things easier in the beginning, we assume that the state-transition function and the reward function are deterministic. In many real-world examples both these functions are stochastic, that is, they are defined as probability distributions over states or rewards. The parameter $\gamma$ is a temporal discount factor and $\rho_0$ is a starting state distribution.

a pancakes mdp. Next I explain how each MDP component is defined in the Pancakes Gridworld sketch shown above.

• States. The current state of the environment is simply the cell currently inhibited by the agent. We name the states (cells) from $s_0$ to $s_{24}$ going from the bottom-left to the upper-right cell.In principle, we could use any other naming convention but this row-wise naming makes orientation easier, when we later talk about specific states. Under this specification, states $s_8$, $s_{10}$ (🍄) and $s_{23}$ (🥞) are so-called terminal states (because the current episode terminates when the agent hits any of these states). All other states are states are non-terminal states.

• Actions. The set of actions $\mathcal{A}$ is identical in each state and consists of the four actions $(a_n, a_e, a_s, a_w)$ = $(\text{north, east, south, west})$. Again, the naming chosen here is arbitrary and any other naming would do the job as well as long as we consistently use the same names throughout the learning process.

• Transition function. The gridworld described here is deterministic in the sense that the agent always moves in the intended direction. For example, if the agent is in state $s_2$ and takes action $\text{north}$ ($a_n$), the agent will transition to state $s_7$ with absolute certainty. We can write $s'(s_2, a_n)=s_7$.

• Reward. The reward function takes a particularly simple form in the Pancakes world because it doesn't matter from which side a cell is entered. We can thus simply define the reward function as a function of next state, that is, $r(s, a) = r(s'(s, a))$. For example, in the sketch above, $r(s_{10}) = r(🍄) = -10$, $r(s_{23}) = r(🥞) = 10$, and $r(s_{5}) = -1$.

• Temporal discounting. The temporal discount factor $0 \leq \gamma \leq 1$ is a parameter that specifies the relative value of receiving a certain amount of reward in the future vs. receiving the same amount now. Temporal discounting is a well-studied concept in economics and you can read about it, for example, in this wiki article. In the case of the Pancakes Gridworld temporal discounting is actually not very important and so we could simply set $\gamma = 1$ (that is, no temporal discounting). However, in domains with a long (or infinite) time horizon, temporal discounting (that is, $\gamma < 1$) is often indispensable.

• Starting state. Here the agent simply always starts in state $s_0$ but in principle $\rho_0$ could be a distribution of states (for example, the starting state is sampled uniformly randomly from the bottom row of states at the beginning of each episode).

#### Reinforcement learning in an MDP

Using the MDP formulation just introduced, we can describe more formally several concepts introduced in the previous post: the agent's policy, the agent-environment interaction, and the RL objective.

policies. The agent selects actions according to a (behavioral) policy $\pi: \ \mathcal{S} \rightarrow \mathcal{A}$, which is a mapping from states to actions. That is, for every state $s \in \mathcal{S}$, the policy produces an action $a = \pi(s)$, which is then executed by the agent.

In the Pancakes Gridworld, a deterministic policy can be represented by a dictionary that simply assigns to each of the 25 states any of the four available actions. A stochastic policy instead would assign a probability distribution over the four available actions to each state. We will see stochastic policies in future posts but for now we only consider deterministic policies.

agent-environment interaction. Once the agent executes the action that was determined by the policy, the environment responds with a next state and a reward signal (as determined by the transition function $s'(s, a)$ and the reward function $r(s, a)$, respectively). The agent then observes the new state and selects the next action, and so on. In the previous post we used a very similar sequential decision-making protocol in the context of a self-driving car example. We can now update the agent-environment interaction graph using the MDP framework.

More specifically, we formalize the agent-environment interaction as a sequence of random variables

$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots, R_t, S_t, A_t, \dots,$

where $S_t$, $A_t$, and $R_t$ denote, respectively, state, action, and reward at decision stage (or time step) $t$. The MDP variables of time step $t+1$ depend on variables $S_t$, $A_t$, and $R_t$, as shown in the following graph.

Agent-environment interaction in an MDP: The agent (🤖) observes the current state $S_t$ of the environment (🌍) and selects an action $A_t$. according to the current policy $\pi$. The action is executed, and the environment responds with a new state $S_{t+1}$. The agent also observes a reward $R_t$, which can be used to improve the current policy.

For example, in the Pancakes Gridworld, a (very short) trajectory is produced by the policy that selects the action $\text{north}$ for every state, that is, $\pi(s) = a_n$ for every state $s$. An agent following this (quite naïve) policy creates the following trajectory:

• $S_0 = s_0$ (the starting state),
• $A_0=a_n$ (go north),
• $R_1 = -1$ (reward for going onto an empty cell),
• $S_1=s_5$ (the new state),
• $A_1=a_n$ (go north again),
• $R_2 = -10$ (toxic 🍄!), and
• $S_2=s_{10}$.

This terminates the episode because $s_{10}$ is a terminal state. You can imitate this trajectory very easily in the interactive sketch shown above by pressing, you guessed it, the up-button twice.

the rl objective (revisited). In the previous post we stated the RL objective as learn how to select actions so as to maximize long-term reward. Given an MDP, we can formulate this objective more precisely. For example, by interpreting long-term reward as the expected cumulative, discounted sum across all future rewards.There are other notions of long-term reward and thus other RL objectives (see, e.g., Ch. 10.3 in S&B 2018). In this blog series we only use the discounted return formulation described here. More formally, the RL objective is to find a policy $\pi_\ast$ that satisfies

$\tag{1} \pi_{\ast} \in \textcolor{#0892A5}{\argmax_{\pi \in \Pi}} \ \textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0} [}\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t \ R_t}\textcolor{#7FB069}{]}.$
Any such policy $\pi_\ast$ is called an optimal policy. Let's have a look at the individual bits of Equation 1:

• $\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t \ R_t}$ is the cumulative (hence the summation term), discounted (hence the $\gamma^t$ term) sum of rewards obtained during an episode and is often simply called return.

• $\textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0} [ \cdot]}$, means that we are interested in the expected return, where the expectation is taken over trajectories produced by an agent following the policy $\pi$.The $\textcolor{#7FB069}{\sim}$ in $\textcolor{#7FB069}{A_t \sim \pi(S_t)}$ means that the action $A_t$ is sampled from the policy $\pi(S_t)$. This notation includes the case of a stochastic policy $\pi$. In the case of a deterministic policy and a deterministic environment (this is the case in our always-go-north example discussed above), the agent always follows the same trajectory and the expected return thus simply equals the return of this one trajectory. In the more realistic case that the policy, the environment, or both are stochastic, the expectation is usually much more difficult to compute. Note that the expected return $\textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0} [}\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t \ R_t}\textcolor{#7FB069}{]}$ is a function of the policy $\pi$...

• ...and $\textcolor{#0892A5}{\argmax_{\pi \in \Pi}}$ thus simply means that we aim to find the policy $\pi$ that maximizes the expected return, among a candidate set of policies $\Pi$. In our example here, $\Pi$ is simply the set of all mappings from the set of states $\mathcal{S}$ to the set of actions $\mathcal{A}$.

#### (Optimal) value functions

One way of structuring the search for an optimal policy is to learn a so-called optimal action-value function, denoted by $q_\ast(s, a)$. For a given state-action pair $(s, a),$ the optimal value function tells us the expected return obtained, if the agent started in state $s$, took action $a$, and followed the optimal policy thereafter. More formally, let $G_t = \sum_{k=0}^\infty \gamma^{k} R_{t+k+1}$ denote the return of an episode (or trajectory) after time step $t$. Then the optimal value function is given by

$q_{\ast}(s, a) = \mathbb{E}_{\pi_\ast}[G_t | S_t = s, A_t = a].$

If we have access to the optimal value function, we can easily obtain an optimal policy by comparing, for each state $s$, the action-value of each action in that state and letting the policy select a value-maximizing action, that is,

$\tag{2} \pi_{\ast}(s) = \argmax_{a \in \mathcal{A}} \ q_{\ast}(s, a).$

If both the state set and the action set are finite, the value function consists of a finite number of action values (also called Q-values). The following sketch shows a visual representation of the optimal value function for our Pancakes Gridworld for a discount factor of $\gamma = 1$.

Hover over a cell to see the actual numerical action-values (also called Q-values) for the corresponding state. Some (not very surprising) observations:

• State-action pairs that move the agent onto a 🍄-cell have a value of $-10$ (because 🍄 is a terminal state and the value thus equals the reward that is obtained upon entering the terminal state, given by $q_{\ast}(s, a) = r(\text{🍄}) = -10$).

• Accordingly, those state-action pairs that move the agent onto the 🥞-cell have a value of $10$.

• State-action values are higher, the closer the corresponding cell is to the 🥞-cell. Within one cell, the values corresponding to actions that lead the agent toward the yummy pancakes are higher than those that leave the agent hungry a little bit longer.

Click on the Optimal policy button below the sketch to reveal the maximum-value action(s) in each state. Many states have multiple optimal actions meaning that there are multiple optimal policies (but only one optimal value function)!

Learning an optimal value function. Value-based RL algorithms learn an optimal value function and then use Equation (2) to compute an optimal policy. Many value-based RL algorithms are based on the Bellman equations of the optimal value function. For a deterministic environment and policy, the Bellman equations are given by

\begin{aligned} q_{\ast}(s, a) &= \mathbb{E}_{\pi_\ast}[G_t | S_t = s, A_t = a] \\ &= \mathbb{E}_{\pi_\ast}[R_{t+1}(S_t, A_t) + \sum_{k=1}^\infty \gamma^{k} R_{t+k+1} | S_t = s, A_t = a] \\ &= r(s, a) + \mathbb{E}_{\pi_\ast}[\sum_{k=1}^\infty \gamma^{k} R_{t+k+1}] \\ &= r(s, a) + \mathbb{E}_{\pi_\ast}[\sum_{k=0}^\infty \gamma^{k + 1} R_{t+k+2}] \\ &= r(s, a) + \gamma \mathbb{E}_{\pi_\ast}[\sum_{k=0}^\infty \gamma^{k} R_{t+1+k+1}] \\ &= r(s, a) + \gamma q_{\ast}(s', a_\ast), \quad \text{ for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}, \end{aligned}

where $s' = s'(s, a)$ is the next state and $a_\ast = \pi_\ast(s')$ is any optimal action in state $s'$. In particular, we can represent the value $q_{\ast}(s, a)$ as a function of the optimal value of the next state. This is great because if we learn something about the value of any particular state, we can also update our beliefs about the previous state's value. In other words, the Bellman equations allow us to generalize knowledge across trajectories.

How exactly the Bellman equation can be transformed into learning algorithms is the topic of many upcoming posts. Before we can get started, however, we need one more piece of notation: analogously to the optimal value function, the action-value function for policy $\pi$ is given by

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a].$

The only difference with respect to the optimal value function is that the expectation is taken over trajectories produced by $\pi$ (which is not necessarily optimal) rather than $\pi_\ast$.

In the next post, we explore how the Q-learning algorithm can approximate the optimal value function in the Pancakes Gridworld, using only the reward received after every step.

If you liked this post, please consider following me on Twitter for updates on new blog posts.