FEBRUARY 9, 2020 · Jan Malte Lichtenberg

Markov decision processes

Many reinforcement learning (RL) algorithms are defined on Markov decision processes (or MDPs). More precisely, the problem that the RL agent is trying to solve is often formulated as an MDP. In this mini-post, I try to answer the following questions.

  • What are MDPs? Or: who is Markov, and what exactly is a decision process?
  • How are MDPs defined, and do I really have to learn all of this if I am just interested in the basics of RL?
  • When is an MDP formulation insufficient? What are the alternatives?

What are these MDPs?

Gridworlds are very simple toy domains that are used in many introductory materials on dynamic programming and reinforcement learning (see, for example, Sutton & Barto's Intro to RL book or Andrej Karpathy's excellent reinforce.js).

Why gridworlds? Gridworlds are incredibly boring games, but their simplicity allows us to visualize, understand, and ultimately compare the learning behaviours of various algorithms.

The following sketch shows a 5x5 gridworld. You can play it using the arrow keys on your keyboard or the four arrowheads.

Game objective. The agent (red circle) tries to collect the gold bar while avoiding the bomb. Whenever the gold or the bomb is reached, the current episode ends and the agent is reset to one of the bottom cells.

The reinforcement learning algorithms that we are going to encounter will try to learn an optimal policy $\pi$ without any human input. In the most general case, the agent has no knowledge about the structure of the environment. That is, the agent neither knows that the states form a grid nor that, for example, the action "up" will take it to the cell above its current one. The agent also does not know how many gold or bomb cells there are, let alone their positions.
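To make this "black box" view concrete, here is a minimal sketch of the interaction loop from the agent's point of view. The `DummyEnv` class and its `reset`/`step` interface are my own illustrative stand-ins (with random dynamics so the snippet runs on its own), not the actual gridworld: the agent only ever sees opaque state indices, rewards, and an end-of-episode flag.

```python
import random

class DummyEnv:
    """Illustrative stand-in environment: 25 opaque states, 4 actions, random dynamics."""

    def reset(self) -> int:
        # start a new episode in a random state
        return random.randrange(25)

    def step(self, action: int):
        # random transitions and rewards -- just enough structure to run the loop
        next_state = random.randrange(25)
        reward = random.choice([-1.0, 0.0, 1.0])
        done = random.random() < 0.05        # episodes end occasionally
        return next_state, reward, done

env = DummyEnv()
state = env.reset()
for t in range(200):
    action = random.randrange(4)             # placeholder for a learned policy
    next_state, reward, done = env.step(action)
    # a learning algorithm would update its policy here, using only the observed
    # (state, action, reward, next_state) -- no grid, no cell positions
    state = env.reset() if done else next_state
```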

MDP formulation of a gridworld. The gridworld environment can be formulated as an episodic Markov Decision Process (MDP), denoted by the tuple $\langle \mathcal{S}, \mathcal{A}, P, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s', r \mid s, a)$ is the transition function that specifies the probability of transitioning to state $s'$ and receiving reward $r$ when taking action $a$ in state $s$, and $0 < \gamma < 1$ is the discount factor. For our 5x5 gridworld, we define the corresponding MDP as follows.

  • $\mathcal{S}$ is the set of all 25 cells $s_i$ for $i = 0, 1, \ldots, 24$.
  • $\mathcal{A}$ consists of the four actions $a_0, a_1, a_2, a_3$ that are available in each state.
  • $P(s', r \mid s, a)$ describes the (here deterministic) dynamics: each action moves the agent to the adjacent cell in the corresponding direction, and the reward depends on the cell that is reached; reaching the gold or the bomb cell ends the episode (a code sketch of these dynamics follows below).
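To make the list above concrete, here is a minimal sketch of the gridworld dynamics in Python. The cell indexing (index = row * 5 + column, with row 0 at the bottom), the gold and bomb positions, the reward values, and the discount factor are all illustrative assumptions of mine; the interactive grid above may use different ones.

```python
N_ROWS, N_COLS = 5, 5
STATES = list(range(N_ROWS * N_COLS))   # s_0, ..., s_24; index = row * N_COLS + col
ACTIONS = [0, 1, 2, 3]                  # a_0..a_3: up, down, left, right
GAMMA = 0.9                             # illustrative discount factor
GOLD, BOMB = 23, 18                     # assumed positions of the two terminal cells
START_STATES = [0, 1, 2, 3, 4]          # bottom row, where episodes (re)start

def transition(s: int, a: int):
    """Deterministic dynamics: returns (next_state, reward, done) for (s, a).

    A move that would leave the grid keeps the agent in place (my assumption).
    """
    row, col = divmod(s, N_COLS)
    if a == 0:                              # up
        row = min(row + 1, N_ROWS - 1)
    elif a == 1:                            # down
        row = max(row - 1, 0)
    elif a == 2:                            # left
        col = max(col - 1, 0)
    else:                                   # right
        col = min(col + 1, N_COLS - 1)
    s_next = row * N_COLS + col
    if s_next == GOLD:
        return s_next, 1.0, True            # assumed reward of +1; episode ends
    if s_next == BOMB:
        return s_next, -1.0, True           # assumed reward of -1; episode ends
    return s_next, 0.0, False               # all other moves yield reward 0 here
```

Because the dynamics are deterministic, $P(s', r \mid s, a)$ places probability 1 on the single pair $(s', r)$ returned by `transition`; a stochastic gridworld (e.g., with "slippery" cells) would return a distribution over next states instead.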


Gridworlds are incredibly boring. "I saw RL algorithms play Go and ATARI games, why are you showing me this boring stuff?" Well... gridworlds are incredibly useful for understanding how reinforcement learning algorithms work. Their simplicity allows us to track, visualize, and compare the learning behaviour of various algorithms.

Don't worry, we'll get to more complex domains.

Gridworlds are small toy domains in which an agent navigates between cells on a 2D grid. The agent receives a reward $r$ for each move. Some of the grid cells are preferable to others (they yield higher reward). The goal is to find a policy (a mapping from cell to move direction) such that the cumulative reward is maximized.
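In the notation of the MDP above, "cumulative reward" is the (discounted) return, and an optimal policy is one that maximizes its expected value from every state:

$$
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1},
\qquad
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s \right] \quad \text{for all } s \in \mathcal{S}.
$$

For an episodic task like this gridworld, the sum effectively runs only until the episode terminates at the gold or the bomb cell.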

Try it using the arrow keys on your keyboard. Just by looking at the game, you probably have figured out an optimal policy, that is, for every state (cell) the agent is in, you know which action to take such that the agent takes the shortest path to the gold bar while avoiding the bomb.
