The Gridworld environment

I saw how RL algorithms learn to play ATARI games from pixel input. Why are you showing me this boring stuff?

**Why gridworlds?** Gridworlds are very simple toy domains that are used in many introductory materialsSee, for example, Sutton & Barto's Intro to RL book or Andreij Karpathy's excellent reinforce.js example page.
on dynamic programming and reinforcement learning.
Gridworlds are incredibly boring, but their simplicity allows us to **visualize**, **understand**, and ultimately **compare** the **learning behaviours**
of various algorithms.

The following sketch shows a 5x5 gridworld. You can play it with the arrow keys on your keyboard (sorry, dear mobile user).

**Game rules**. The agent (red cell) tries to collect the gold bar
while trying to avoid the bomb. Whenever the gold or the bomb is reached, the current
episode ends and the agent is reset to one of the bottom cells.

The reinforcement learning algorithms that we are going to encounter will try to learn an optimal policySee Tufte’s comment in the Tufte book fonts thread. $\pi$without any further human input. In the most general case, the agent has no knowledge about the structure of the environment. That is, the agent neither knows that the states form a grid nor that, for example, the action "up" will take it to the cell above it's current one. The agent also does not know how many gold or bomb cells there are, let alone their position.

**MDP formulation of a gridworld.** The gridworld environment can be formulated as an episodic Markov Decision Process (MDP),
denoted by the tuple $<\mathcal{S}, \mathcal{A}, P, \gamma>$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions,
$P(s', r | s, a)$ is the transition function that specifies the probability of transitioning to state $s'$ and receiving reward $r$,
when taking action $a$ in state $s$,
and $0 < \gamma < 1$ is the discount factor. For our 5x5 gridworld we define the corresponding MDP as follows.

- $\mathcal{S}$ is the set of all 25 cells $s_i$ for $i = 0, 1, ... , 24$.
- $\mathcal{A}$ consists of the four actions $a_0, a_1, a_2, a_3$ that are available in each state.
- $P()

Gridworlds are incredibly boring. "I saw RL algorithms play Go and ATARI games, why are you showing me this boring stuff??"" Well... gridworlds are incredibly useful to understand how reinforcement algorithms work. Their simplicity allows us to track, visualize, and compare the learning behaviour of various algorithms.

Don't worry, we'll get to more complex domains.

Gridworlds are small toy domains in which an agent navigates between cells on a 2D grid. The agent receives a *reward* $r$ for each move.
Some of the grid cells are preferable to others (they yield higher reward). The goal is to find a *policy* (a mapping from cell to move direction)
such that the cumulative reward is maximized.

Try it, using the arrow keys on your keyboards.
Just by looking at the game, you probably have figured out an optimal **policy**,
that is, for every state (cell) the agent is in, you know which action to take such
that the agent takes the shortest path to the gold bar, while avoiding the bomb.