On pancakes, Markov decision processes (MDPs), and value functions

This is the second post of the blog series Reinforcement learning: line by line.

The previous post introduced the reinforcement
learning (RL) problem in a non-mathematical way, using self-driving cars as an
example application. This post introduces a much simpler sequential
decision-making problem, called the *Pancakes Gridworld*, which allows us to
focus on algorithmic details rather than on the many technical details present
in real-world applications.

Ideally, once we've found and understood an RL algorithm that solves problems in
small toy domains, we would like to go back to the larger, more interesting
domain and apply the same learning algorithm, generalizing our newly found
problem-solving capabilities. To facilitate this process, it is useful if all
domains and algorithms “speak the same language”. In particular, we
would like the notions of *state*, *action*, and *reward* (go back to this section for an
intuitive explanation of these terms) to represent somewhat
similar concepts across the various domains we consider.

In reinforcement learning, *Markov decision processes*, or MDPs, are by far the
most commonly used mathematical framework to model sequential decision-making
environments. Many RL algorithms are defined for arbitrary MDPs. Therefore, if
we find an algorithm that can solve the simple MDP corresponding to
the Pancakes Gridworld, there is hope that the same algorithm (or a scaled-up
version thereof) can also solve more complex MDPs that correspond to
more interesting problems.

In the following sections, we will introduce the Pancakes Gridworld and
formulate it as an MDP (I closely follow Sutton & Barto's treatment and notation of MDPs,
Chapter 3 in RL: an introduction). Using this MDP formulation, we can then restate the RL objective introduced
in the previous post as a mathematically precise optimization problem. Finally, this post briefly
introduces the concept of a *value function*, which is one of the main building
blocks for the Q-learning algorithm discussed in the next post.

This is the *Pancakes Gridworld*, an extremely boring game. In case you actually
“played” it, you probably immediately knew what to do in order to win: move the
agent towards the pancakes as quickly as possible and avoid the fly agaric
mushrooms along the way—they are toxic!

Despite its dullness, the Pancakes Gridworld satisfies two requirements that make it a great example environment for the purposes of this blog series. (Gridworlds are used in many introductory materials on RL, including Sutton & Barto's book Reinforcement learning: an introduction and Andrej Karpathy's reinforce.js, which was a major inspiration for this blog.)

First, it can be easily formulated as an MDP (see next section).

Second, the Pancakes Gridworld is small and simple enough that we can visualize the agent's current policy for all states at the same time. Such a global perspective on the agent's currently learned behavior can sometimes reveal patterns that a local perspective could not. (We are usually forced to take a local perspective when dealing with larger MDPs. Going back to our previous example of self-driving cars, it is not practical to inspect the driving agent's behavior for every possible state at any given time step; there are simply way too many states.)

One note: please do not get lost in the mathematical details if you are reading about
MDPs for the first time. Instead, try to get a high-level idea of how *state*,
*action*, and *reward* interact with each other.

MDP formulation of the Pancakes Gridworld

The following sketch shows again the Pancakes Gridworld, but this time with additional information that corresponds to an episodic MDP formulation that is further explained below. You can again “play” the game, but this time, observe how the values of various MDP variables change as you move around the agent.

**Notation.** An MDP is usually denoted by a tuple

$\left\langle \ \mathcal{S}, \mathcal{A}, s'(s, a), r(s, a), \gamma, \rho_0 \ \right\rangle,$

where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $s'(s, a)$ is the state-transition function that specifies the new state $s'$ when taking action $a$ in state $s$, and $r(s, a)$ is the reward function, which provides the reward $r$ the agent receives when taking action $a$ in state $s$. (To make things easier in the beginning, we assume that the state-transition function and the reward function are deterministic. In many real-world examples both these functions are stochastic, that is, they are defined as probability distributions over states or rewards.) The parameter $\gamma$ is a temporal discount factor and $\rho_0$ is a starting-state distribution.
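
To make this notation a bit more tangible, here is a minimal sketch of how such a (deterministic) MDP could be bundled into a single object in Python. This is purely illustrative; the class and field names are my own and not taken from any RL library, and the sketch replaces the starting-state distribution $\rho_0$ with a single fixed start state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class MDP:
    states: Set[State]                            # S: the set of states
    actions: Set[Action]                          # A: the set of actions
    transition: Callable[[State, Action], State]  # s'(s, a): deterministic transition function
    reward: Callable[[State, Action], float]      # r(s, a): deterministic reward function
    gamma: float                                  # temporal discount factor
    start_state: State                            # a single start state instead of a distribution rho_0
```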

**A Pancakes MDP.** Next, I explain how each MDP component is defined in the Pancakes Gridworld sketch shown above.

**States.** The current state of the environment is simply the cell currently inhabited by the agent. We name the states (cells) from $s_0$ to $s_{24}$, going from the bottom-left to the upper-right cell. (In principle, we could use any other naming convention, but this row-wise naming makes orientation easier when we later talk about specific states.) Under this specification, states $s_8$, $s_{10}$ (🍄), and $s_{23}$ (🥞) are so-called *terminal states* (because the current episode terminates when the agent hits any of these states). All other states are *non-terminal* states.

**Actions.** The set of actions $\mathcal{A}$ is identical in each state and consists of the four actions $(a_n, a_e, a_s, a_w) = (\text{north, east, south, west})$. Again, the naming chosen here is arbitrary, and any other naming would do the job just as well, as long as we consistently use the same names throughout the learning process.

**Transition function.** The gridworld described here is *deterministic* in the sense that the agent always moves in the intended direction. For example, if the agent is in state $s_2$ and takes action $\text{north}$ ($a_n$), the agent will transition to state $s_7$ with absolute certainty. We can write $s'(s_2, a_n) = s_7$.

**Reward.** The reward function takes a particularly simple form in the Pancakes world because it doesn't matter from which side a cell is entered. We can thus simply define the reward function as a function of the “next state”, that is, $r(s, a) = r(s'(s, a))$. For example, in the sketch above, $r(s_{10}) = r(🍄) = -10$, $r(s_{23}) = r(🥞) = 10$, and $r(s_{5}) = -1$.

**Temporal discounting.** The temporal discount factor $0 \leq \gamma \leq 1$ is a parameter that specifies the relative value of receiving a certain amount of reward in the future vs. receiving the same amount now. Temporal discounting is a well-studied concept in economics and you can read about it, for example, in this wiki article. In the case of the Pancakes Gridworld, temporal discounting is actually not very important, and so we could simply set $\gamma = 1$ (that is, no temporal discounting). However, in domains with a long (or infinite) time horizon, temporal discounting (that is, $\gamma < 1$) is often indispensable.

**Starting state.** Here the agent simply always starts in state $s_0$, but in principle $\rho_0$ could be a distribution over states (for example, the starting state could be sampled uniformly at random from the bottom row of states at the beginning of each episode).
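
To make this concrete, here is a rough sketch of the Pancakes Gridworld dynamics in Python. The cell indices, terminal cells, and rewards follow the description above; everything else (in particular, the assumption that the agent stays put when it walks into the grid border) is an illustrative guess rather than a definitive implementation of the interactive sketch.

```python
# A rough sketch of the Pancakes Gridworld as a deterministic MDP.
N_ROWS, N_COLS = 5, 5
STATES = list(range(N_ROWS * N_COLS))   # s0 (bottom-left) ... s24 (upper-right), named row by row
ACTIONS = ["north", "east", "south", "west"]
MUSHROOMS = {8, 10}                     # toxic terminal states (🍄)
PANCAKES = 23                           # rewarding terminal state (🥞)
TERMINAL = MUSHROOMS | {PANCAKES}

def transition(s, a):
    """s'(s, a): deterministic move by one cell; assumed to stay put at the grid border."""
    row, col = divmod(s, N_COLS)
    if a == "north":
        row = min(row + 1, N_ROWS - 1)
    elif a == "south":
        row = max(row - 1, 0)
    elif a == "east":
        col = min(col + 1, N_COLS - 1)
    elif a == "west":
        col = max(col - 1, 0)
    return row * N_COLS + col

def reward(s, a):
    """r(s, a) = r(s'(s, a)): -10 for a mushroom cell, +10 for the pancakes cell, -1 otherwise."""
    s_next = transition(s, a)
    if s_next in MUSHROOMS:
        return -10.0
    if s_next == PANCAKES:
        return 10.0
    return -1.0

assert transition(2, "north") == 7      # s'(s2, a_n) = s7, as in the example above
assert reward(5, "east") == -1.0        # entering the empty cell s6 costs -1
assert reward(5, "north") == -10.0      # entering the mushroom cell s10 costs -10
```

These pieces could also be bundled into the `MDP` container from the earlier sketch, with $\gamma = 1$ and start state $s_0$.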

Using the MDP formulation just introduced, we can describe more formally several concepts introduced in the previous post: the agent's policy, the agent-environment interaction, and the RL objective.

**Policies.** The agent selects actions according to a
(behavioral) *policy* $\pi: \ \mathcal{S} \rightarrow \mathcal{A}$, which is a
mapping from states to actions. That is, for every state $s \in \mathcal{S}$,
the policy produces an action $a = \pi(s)$, which is then executed by the agent.

In the Pancakes Gridworld, a *deterministic* policy can be represented by a
dictionary that simply assigns to each of the 25 states any of the four
available actions. A *stochastic* policy instead would assign a probability
distribution over the four available actions to each state. We will see
stochastic policies in future posts but for now we only consider deterministic
policies.
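
As a small illustration (the names below are mine, not part of any library), a deterministic policy for the 25 Pancakes states is then nothing more than a dictionary, and a stochastic policy would map each state to action probabilities instead:

```python
ACTIONS = ["north", "east", "south", "west"]

# Deterministic policy: one action per state, e.g. the naive "always go north" policy.
always_north = {s: "north" for s in range(25)}

# Stochastic policy (shown only for contrast, not used in this post):
# each state maps to a probability distribution over the four actions.
uniform_random = {s: {a: 1 / len(ACTIONS) for a in ACTIONS} for s in range(25)}

# Acting with a deterministic policy is a simple dictionary lookup: a = pi(s).
print(always_north[0])   # -> "north"
```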

**Agent-environment interaction.** Once the agent executes the action that was determined by the policy, the environment “responds” with a next state and a reward signal (as determined by the transition function $s'(s, a)$ and the reward function $r(s, a)$, respectively). The agent then observes the new state and selects the next action, and so on. In the previous post we used a very similar sequential decision-making protocol in the context of a self-driving car example. We can now “update” the agent-environment interaction graph using the MDP framework.

More specifically, we formalize the agent-environment interaction as a sequence of random variables

$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots, R_t, S_t, A_t, \dots,$

where $S_t$, $A_t$, and $R_t$ denote, respectively, the state, action, and reward at decision stage (or time step) $t$. The MDP variables at time step $t+1$ depend on the variables $S_t$, $A_t$, and $R_t$, as shown in the following graph.

Agent-environment interaction in an MDP: The agent (🤖) observes the current state $S_t$ of the environment (🌍) and selects an action $A_t$ according to the current policy $\pi$. The action is executed, and the environment responds with a new state $S_{t+1}$. The agent also observes a reward $R_{t+1}$, which can be used to improve the current policy.

**For example**, in the Pancakes Gridworld, a (very short) trajectory is produced by
the policy that selects the action $\text{north}$ for every state, that is,
$\pi(s) = a_n$ for every state $s$. An agent following this (quite naïve) policy
creates the following trajectory:

- $S_0 = s_0$ (the starting state),
- $A_0=a_n$ (go north),
- $R_1 = -1$ (reward for going onto an empty cell),
- $S_1=s_5$ (the new state),
- $A_1=a_n$ (go north again),
- $R_2 = -10$ (toxic 🍄!), and
- $S_2=s_{10}$.

This terminates the episode because $s_{10}$ is a terminal state. You can imitate this trajectory very easily in the interactive sketch shown above by pressing, you guessed it, the “up”-button twice.
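
Here is a small sketch of the agent-environment loop that reproduces this trajectory, reusing the `transition`, `reward`, and `TERMINAL` definitions from the gridworld sketch above (so it is only a guess at what the interactive sketch computes internally):

```python
def rollout(policy, start_state, max_steps=100):
    """Return the trajectory S0, A0, R1, S1, A1, R2, ... produced by a deterministic policy."""
    trajectory = [start_state]
    state = start_state
    for _ in range(max_steps):
        if state in TERMINAL:
            break                                   # the episode ends in a terminal state
        action = policy[state]                      # A_t = pi(S_t)
        next_state = transition(state, action)      # S_{t+1} = s'(S_t, A_t)
        trajectory += [action, reward(state, action), next_state]
        state = next_state
    return trajectory

always_north = {s: "north" for s in range(25)}
print(rollout(always_north, 0))
# -> [0, 'north', -1.0, 5, 'north', -10.0, 10], i.e. S0=s0, A0=north, R1=-1, S1=s5, A1=north, R2=-10, S2=s10
```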

**The RL objective (revisited).** In the previous post we
stated the RL objective as *learn how to select actions so as to maximize
long-term reward*. Given an MDP, we can formulate this objective more precisely,
for example by interpreting long-term reward as the expected cumulative, discounted sum
of all future rewards. (There are other notions of long-term reward and thus other RL
objectives; see, e.g., Ch. 10.3 in S&B 2018. In this blog series we only use
the discounted return formulation described here.) More formally, the RL objective
is to find a policy $\pi_\ast$ that satisfies

$\tag{1}
\pi_{\ast} \in
\textcolor{#0892A5}{\argmax_{\pi \in \Pi}} \
\textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0}
[}\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t
\ R_t}\textcolor{#7FB069}{]}.$

Any such policy $\pi_\ast$ is called an *optimal policy*.

The term $\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t \ R_t}$ is the cumulative (hence the summation term), discounted (hence the $\gamma^t$ term) sum of rewards obtained during an episode and is often simply called the *return*.

The expectation $\textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0} [ \cdot]}$ means that we are interested in the *expected* return, where the expectation is taken over trajectories produced by an agent following the policy $\pi$. The “$\textcolor{#7FB069}{\sim}$” in $\textcolor{#7FB069}{A_t \sim \pi(S_t)}$ means that the action $A_t$ is *sampled* from the policy $\pi(S_t)$; this notation includes the case of a stochastic policy $\pi$. In the case of a deterministic policy and a deterministic environment (as in our “always-go-north” example discussed above), the agent always follows the same trajectory, and the expected return thus simply equals the return of this one trajectory. In the more realistic case that the policy, the environment, or both are stochastic, the expectation is usually much more difficult to compute.

Finally, note that the expected return $\textcolor{#7FB069}{\mathbb{E}_{A_t \sim \pi(S_t) \text{ for } t \geq 0} [}\textcolor{#FA7921}{\sum_{t=0}^\infty \gamma^t \ R_t}\textcolor{#7FB069}{]}$ is a function of the policy $\pi$, and $\textcolor{#0892A5}{\argmax_{\pi \in \Pi}}$ thus simply means that we aim to find a policy that *maximizes* the expected return among a candidate set of policies $\Pi$. In our example here, $\Pi$ is simply the set of all mappings from the set of states $\mathcal{S}$ to the set of actions $\mathcal{A}$.
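
To make the orange return term concrete, here is a tiny helper (the function name is my own) that computes the discounted return of a reward sequence, applied to the rewards of the “always-go-north” trajectory from above:

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative discounted sum: rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards of the "always go north" trajectory: R1 = -1 (empty cell), R2 = -10 (toxic mushroom).
print(discounted_return([-1.0, -10.0], gamma=1.0))   # -> -11.0 (no discounting)
print(discounted_return([-1.0, -10.0], gamma=0.9))   # -> -10.0 (the later reward is discounted)
```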

(Optimal) value functions

One way of structuring the search for an optimal policy is to learn a so-called
*optimal action-value function*, denoted by $q_\ast(s, a)$. For a given
state-action pair $(s, a),$ the optimal value function tells us the expected
return obtained, if the agent started in state $s$, took action $a$, and
followed the optimal policy thereafter. More formally, let $G_t =
\sum_{k=0}^\infty \gamma^{k} R_{t+k+1}$ denote the return of an episode (or
trajectory) after time step $t$. Then the optimal value function is given by

$q_{\ast}(s, a) = \mathbb{E}_{\pi_\ast}[G_t | S_t =
s, A_t = a].$

If we have access to the optimal value function, we can easily obtain an optimal policy by comparing, for each state $s$, the action-value of each action in that state and letting the policy select a value-maximizing action, that is,

$\tag{2}
\pi_{\ast}(s) = \argmax_{a \in \mathcal{A}} \ q_{\ast}(s, a).$

If both the state set and the action set are finite, the value function consists
of a finite number of action values (also called *Q-values*). The following
sketch shows a visual representation of the optimal value function for our
Pancakes Gridworld for a discount factor of $\gamma = 1$.

State-action pairs that move the agent onto a 🍄-cell have a value of $-10$ (because 🍄 is a terminal state and the value thus equals the reward that is obtained upon entering the terminal state, given by $q_{\ast}(s, a) = r(\text{🍄}) = -10$).

Accordingly, those state-action pairs that move the agent onto the 🥞-cell have a value of $10$.

The closer a cell is to the 🥞-cell, the higher its state-action values. Within one cell, the values corresponding to actions that lead the agent toward the yummy pancakes are higher than those that leave the agent hungry a little bit longer.

Click on the “Optimal policy” button below the sketch to reveal the maximum-value action(s) in each state. Many states have multiple optimal actions, meaning that there are multiple optimal policies (but only one optimal value function)!
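
Once the Q-values are stored, say, in a dictionary keyed by state-action pairs, Equation (2) is a one-line lookup per state. The sketch below uses made-up Q-values for a single state; the true optimal values are the ones shown in the sketch above.

```python
ACTIONS = ["north", "east", "south", "west"]

def greedy_policy(q_values, states, actions=ACTIONS):
    """Equation (2): for each state, pick an action with maximal Q-value."""
    return {s: max(actions, key=lambda a: q_values[(s, a)]) for s in states}

# Made-up Q-values for state s0, for illustration only.
q_example = {(0, "north"): 2.0, (0, "east"): 3.0, (0, "south"): 1.0, (0, "west"): 1.0}
print(greedy_policy(q_example, states=[0]))   # -> {0: 'east'}
```

Note that `max` silently breaks ties by keeping the first maximizing action it encounters, so for states with several optimal actions this returns just one of the (possibly many) optimal policies.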

**Learning an optimal value function.** *Value-based RL
algorithms* learn an optimal value function and then use Equation (2) to compute an optimal policy. Many value-based RL
algorithms are based on the *Bellman equations* of
the optimal value function. For a deterministic environment and policy, the
Bellman equations are given by

$\begin{aligned}
q_{\ast}(s, a) &= \mathbb{E}_{\pi_\ast}[G_t | S_t =
s, A_t = a] \\
&= \mathbb{E}_{\pi_\ast}[R_{t+1}(S_t, A_t) + \sum_{k=1}^\infty \gamma^{k} R_{t+k+1} | S_t =
s, A_t = a] \\
&= r(s, a) + \mathbb{E}_{\pi_\ast}[\sum_{k=1}^\infty \gamma^{k}
R_{t+k+1}] \\
&= r(s, a) + \mathbb{E}_{\pi_\ast}[\sum_{k=0}^\infty \gamma^{k +
1} R_{t+k+2}] \\
&= r(s, a) + \gamma \mathbb{E}_{\pi_\ast}[\sum_{k=0}^\infty
\gamma^{k} R_{t+1+k+1}] \\
&= r(s, a) + \gamma q_{\ast}(s', a_\ast), \quad \text{ for
all } s \in \mathcal{S} \text{ and } a \in \mathcal{A},
\end{aligned}$

where $s' = s'(s, a)$ is the next state and $a_\ast = \pi_\ast(s')$ is any
optimal action in state $s'$. In particular, we can represent the value
$q_{\ast}(s, a)$ as a function of the optimal value
of the *next state*. This is great because if we learn something about the value
of any particular state, we can also update our beliefs about the
“previous” state's value. In other words, the Bellman equations
allow us to generalize knowledge across trajectories.
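
Read as code, the last line of the derivation is a single “backup” operation. The sketch below (reusing `transition`, `reward`, `TERMINAL`, and `ACTIONS` from the gridworld sketch above) shows the deterministic, tabular case; it is meant as a preview, not as the algorithm developed in the upcoming posts.

```python
def bellman_backup(q_values, s, a, gamma=1.0):
    """q*(s, a) = r(s, a) + gamma * max_a' q*(s'(s, a), a'), for the deterministic case."""
    s_next = transition(s, a)
    if s_next in TERMINAL:
        return reward(s, a)   # no future rewards once a terminal state is reached
    best_next_value = max(q_values[(s_next, a_next)] for a_next in ACTIONS)
    return reward(s, a) + gamma * best_next_value
```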

How exactly the Bellman equation can be transformed into learning algorithms is
the topic of many upcoming posts. Before we can get started, however, we need
one more piece of notation: analogously to the optimal value function, the
*action-value function for policy* $\pi$ is given by

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t =
s, A_t = a].$

The only difference with respect to the optimal value function is that the expectation is taken over trajectories produced by $\pi$ (which is not necessarily optimal) rather than $\pi_\ast$.
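
Because both the policy and the environment considered here are deterministic, $q_\pi(s, a)$ can be computed with a single rollout: take action $a$ once, then follow $\pi$. Here is a small sketch, reusing the gridworld functions and the `discounted_return` helper from above:

```python
def q_pi(policy, s, a, gamma=1.0, max_steps=100):
    """q_pi(s, a): return obtained by taking action a in state s and following pi thereafter."""
    rewards = [reward(s, a)]
    state = transition(s, a)
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        action = policy[state]
        rewards.append(reward(state, action))
        state = transition(state, action)
    return discounted_return(rewards, gamma)

always_north = {s: "north" for s in range(25)}
print(q_pi(always_north, 0, "north"))   # -> -11.0, the return of the example trajectory above
```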

In the next post, we explore how the Q-learning algorithm can approximate the optimal value function in the Pancakes Gridworld, using only the reward received after every step.

If you liked this post, please consider following me on Twitter for updates on new blog posts.