The (Pancakes) Gridworld Environment

The sketch below shows an example 5x5 gridworld. You can play it using the arrow keys on your keyboard or, on mobile, using the four arrowheads.

The agent (red sphere) can navigate the gridworld environment using
one of four *actions* at each time step. The four actions correspond to the
cardinal directions "north", "east", "south", and "west". The agent receives a
*reward* whenever it enters a cell.

The reward is +10 for a gold cell, -10 for a bomb cell, and -1 for an empty cell (which can be interpreted as the "effort" required for moving). When an action would move the agent into a wall, the agent instead remains in its current cell and nevertheless receives a reward of -1 (banging your head against a wall requires effort, too).

The gold and bomb cells are *terminal states*. Whenever the agent enters a
terminal state, the current episode ends and the agent restarts at one of the
bottom cells, chosen uniformly at random.
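
To make these rules concrete, the sketch below implements the environment dynamics in Python. The cell positions of the gold and bomb, as well as all names such as `step` and `reset`, are illustrative assumptions, not the code behind the interactive sketch above.

```python
import random

N = 5                                          # 5x5 grid
GOLD = (0, 2)                                  # assumed gold position (row, col)
BOMB = (1, 2)                                  # assumed bomb position (row, col)
ACTIONS = {"north": (-1, 0), "east": (0, 1),
           "south": (1, 0), "west": (0, -1)}   # row 0 is the top row

def step(state, action):
    """Apply one action; return (next_state, reward, episode_done)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    n_row, n_col = row + d_row, col + d_col
    if not (0 <= n_row < N and 0 <= n_col < N):
        n_row, n_col = row, col                # bumped into a wall: stay put
    if (n_row, n_col) == GOLD:
        return (n_row, n_col), +10, True       # terminal state: gold
    if (n_row, n_col) == BOMB:
        return (n_row, n_col), -10, True       # terminal state: bomb
    return (n_row, n_col), -1, False           # empty cell: movement "effort"

def reset():
    """Restart in one of the five bottom cells, chosen uniformly at random."""
    return (N - 1, random.randrange(N))
```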

In reinforcement learning, the agent's goal is usually to
maximize the cumulative reward (also called *return*). In the gridworld
environment this objective is achieved (that is, the return is maximized) if the
agent finds the shortest path to the gold cell from each possible starting
position, while avoiding the bomb cell.
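
For example, assuming the agent restarts in a bottom cell from which the gold cell is four steps away, the best achievable return of that episode is

$\underbrace{(-1) + (-1) + (-1)}_{\text{three empty cells}} + \underbrace{10}_{\text{gold cell}} = 7,$

and every detour or wall bump only lowers this sum.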

I saw how RL algorithms learn to play ATARI games from pixel input. Why are you showing me this boring stuff?

The gridworld environment can be formulated as an episodic Markov Decision Process (MDP), denoted by the tuple

$\langle \ \mathcal{S}, \mathcal{A}, P(s' | s, a), P(r | s, a) \ \rangle,$

where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s'| s, a)$ is the state-transition function that specifies the probability of transitioning to state $s'$ when taking action $a$ in state $s$, and $P(r| s, a)$ is the probability distribution of reward $r$ that the agent receives when taking action $a$ in state $s$.

For our 5x5 gridworld we can define a corresponding MDP as follows:

- $\mathcal{S}$ is the set of all 25 cells $s_i$ for $i = 0, 1, \ldots, 24$.
- $\mathcal{A}$ is the set of the four actions "north", "east", "south", and "west".
- $P(s' | s, a)$ is deterministic: each action moves the agent to the adjacent cell in the corresponding direction, and leaves the state unchanged when that move would hit a wall.
- $P(r | s, a)$ is deterministic as well: the reward is +10 for entering the gold cell, -10 for entering the bomb cell, and -1 in all other cases.
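
With this state numbering (assuming row-major order, so that $s_0$ is the top-left cell), the transition function can be written out as an explicit table. The sketch below builds $P(s' | s, a)$ as a NumPy array; the indexing convention and variable names are assumptions made for illustration.

```python
import numpy as np

N, NUM_ACTIONS = 5, 4
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]     # north, east, south, west

# P[s, a, s2] = probability of landing in state s2 after taking action a in s.
# For simplicity, the terminal cells (gold and bomb) are treated like any other
# state here; in the episodic formulation the episode ends on entering them.
P = np.zeros((N * N, NUM_ACTIONS, N * N))
for s in range(N * N):
    row, col = divmod(s, N)                    # assumed row-major indexing
    for a, (d_row, d_col) in enumerate(MOVES):
        n_row, n_col = row + d_row, col + d_col
        if not (0 <= n_row < N and 0 <= n_col < N):
            n_row, n_col = row, col            # wall: stay in place
        P[s, a, n_row * N + n_col] = 1.0       # deterministic dynamics

assert np.allclose(P.sum(axis=-1), 1.0)        # each (s, a) row is a distribution
```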

Common changes to the basic gridworld described above are often made along the following dimensions:

- *Stochasticity.* The agent moves in the intended direction (e.g., up for action "north") with probability $1-\epsilon$ and moves in another direction with probability $\epsilon$. In the context of a gridworld like the one described here, this effect is sometimes referred to as "wind" or "slipperiness" (of the surface); a small sketch of this follows after the list.
- Availability of a *model of the environment*. In the most basic version of the gridworld, the transition dynamics are generally not known. That is, the agent does *not* know beforehand that selecting the action "north" leads to an "upward" movement. When the transition dynamics are given, the agent can use this information to plan ahead and thus learn more efficiently.
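
The "wind" variant from the first bullet can be sketched in a few lines. Here $\epsilon$ is split uniformly among the three unintended directions, which is one common convention; the function name and the default value of $\epsilon$ are illustrative assumptions.

```python
import random

ACTIONS = ["north", "east", "south", "west"]

def slippery(action, epsilon=0.1):
    """Return the direction actually taken under "wind"/"slipperiness"."""
    if random.random() < 1 - epsilon:
        return action                          # intended direction
    return random.choice([a for a in ACTIONS if a != action])  # slip sideways
```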

ATARI games, Starcraft, and DotA are certainly much more fun to look at than the
gridworld. However, it is difficult to understand how reinforcement learning
algorithms actually learn just by watching videos of well-performing agents.
The simplicity of a gridworld allows us to
**visualize**, **understand**, and ultimately **compare** the **learning
behaviours** of various algorithms.

Is your entire PhD about Gridworlds?