The (Pancakes) Gridworld Environment

The sketch below shows an example 5x5 gridworld. You can play it using the arrow keys on your keyboard or, on mobile, using the four arrowheads.


The agent (red sphere) can navigate the gridworld environment using one of four actions at each time step. The four actions correspond to the cardinal directions "north", "east", "south", and "west". The agent receives a reward that depends on the type of cell it enters.

The reward is +10 for a gold cell, -10 for a bomb cell, and -1 for an empty cell (which can be interpreted as the "effort" required for moving). When an action would move the agent into a wall, the agent instead remains in its current cell and nevertheless receives a reward of -1 (banging your head against a wall requires effort, too).

The gold and bomb cells are terminal states. Whenever the agent enters a terminal state, the current episode ends and the agent restarts at one of the bottom cells, chosen uniformly at random.
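The rules above can be collected into a minimal environment sketch. The cell ids follow the state numbering introduced below (bottom-left cell is 0, ids increase left to right and bottom to top, so the bomb is cell 18 and the gold is cell 23); the function and variable names are my own illustration, not taken from the article's implementation.

```python
# Minimal sketch of the 5x5 gridworld dynamics described above.
import random

N = 5                            # grid is N x N
GOLD, BOMB = 23, 18              # terminal cells (ids as in the figure below)
START_CELLS = [0, 1, 2, 3, 4]    # bottom row; episodes restart here

# action -> (row delta, column delta), with row 0 at the bottom
MOVES = {"north": (1, 0), "east": (0, 1), "south": (-1, 0), "west": (0, -1)}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    row, col = divmod(state, N)
    dr, dc = MOVES[action]
    nr, nc = row + dr, col + dc
    if 0 <= nr < N and 0 <= nc < N:
        next_state = nr * N + nc
    else:
        next_state = state           # bumped into a wall: stay put
    if next_state == GOLD:
        return next_state, +10, True
    if next_state == BOMB:
        return next_state, -10, True
    return next_state, -1, False     # empty cell: "effort" cost

def reset():
    """Start a new episode in a uniformly random bottom cell."""
    return random.choice(START_CELLS)
```

For example, `step(2, "north")` returns `(7, -1, False)`, while `step(17, "east")` enters the bomb cell and ends the episode with `(18, -10, True)`.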


In reinforcement learning, the agent's goal is usually to maximize the cumulative reward (also called return). In the gridworld environment this objective is reached (that is, return is maximized) if the agent finds the shortest path to the gold cell from each possible starting position, while avoiding the bomb cell.
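In standard notation (my own addition, following common reinforcement learning conventions rather than anything stated in this article), the return from time step $t$ in an episode that ends at time $T$ can be written as

$$G_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T = \sum_{k=0}^{T-t-1} \gamma^k \, r_{t+k+1},$$

where $\gamma \in [0, 1]$ is a discount factor; the plain cumulative reward corresponds to the undiscounted case $\gamma = 1$.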

I saw how RL algorithms learn to play ATARI games from pixel input. Why are you showing me this boring stuff?


MDP formulation

The gridworld environment can be formulated as an episodic Markov Decision Process (MDP), denoted by the tuple

$$\langle \ \mathcal{S}, \mathcal{A}, P(s' | s, a), P(r | s, a) \ \rangle,$$

where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s' | s, a)$ is the state-transition function that specifies the probability of transitioning to state $s'$ when taking action $a$ in state $s$, and $P(r | s, a)$ is the probability distribution of the reward $r$ that the agent receives when taking action $a$ in state $s$.

For our 5x5 gridworld we can define a corresponding MDP as follows.

- $\mathcal{S}$ is the set of all 25 cells $s_i$ for $i = 0, 1, \dots, 24$. The figure on the right shows one arbitrary assignment of state ids to the cells of the gridworld. Under this specification, state 18 (bomb) and state 23 (gold) are terminal states; all other states are non-terminal.
- The set of actions $\mathcal{A}$ is identical in each state and consists of the four actions $(a_0, a_1, a_2, a_3) = (\text{north}, \text{east}, \text{south}, \text{west})$.
- The gridworld described so far is deterministic in the sense that the agent always moves in the intended direction. For example, if the agent is in state $s_2$ and takes the action $\text{north}$, it transitions to state $s_7$ with certainty. We can write $\mathbb{P}(7 | 2, \text{north}) = 1$ and $\mathbb{P}(s' | 2, \text{north}) = 0$ for all $s' \ne 7$. See further below for an example of a stochastic transition probability distribution.
- The reward function also takes a particularly simple form in this small and deterministic environment. Because it does not matter from which side a cell is entered, $\mathbb{P}(r | s, a) = 1$ for the reward $r = r(s')$ that is given for entering state $s'$ when taking action $a$ in state $s$, and $\mathbb{P}(r | s, a) = 0$ for all other $r$. For example, $\mathbb{P}(-10 | 17, \text{east}) = 1$ and $\mathbb{P}(r | 17, \text{east}) = 0$ for all $r \ne -10$.
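Because the environment is deterministic, the whole transition function can be tabulated: for every non-terminal state and action there is exactly one successor state with probability 1. The sketch below (my own illustration, assuming the state numbering from the figure, with ids increasing left to right and bottom to top) builds that table.

```python
# Tabulate the deterministic transition function P(s' | s, a):
# each (state, action) pair maps to its unique probability-1 successor.
N = 5
TERMINAL = {18, 23}          # bomb and gold cells end the episode
MOVES = {"north": (1, 0), "east": (0, 1), "south": (-1, 0), "west": (0, -1)}

transitions = {}             # (state, action) -> next state
for s in range(N * N):
    if s in TERMINAL:
        continue             # no transitions out of terminal states
    row, col = divmod(s, N)
    for action, (dr, dc) in MOVES.items():
        nr, nc = row + dr, col + dc
        if 0 <= nr < N and 0 <= nc < N:
            transitions[(s, action)] = nr * N + nc
        else:
            transitions[(s, action)] = s   # wall: stay in place
```

For example, `transitions[(2, "north")]` is 7 and `transitions[(17, "east")]` is 18, matching the probabilities written out above.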

Possible rule modifications

Common changes to the basic gridworld described above are often made along the following dimensions:

- Stochasticity. The agent moves in the intended direction (e.g., up for the action "north") with probability $1-\epsilon$ and moves in some other direction with probability $\epsilon$. In the context of a gridworld like the one described here, this effect is sometimes referred to as "wind" or "slipperiness" (of the surface).
- Availability of a model of the environment. In the most basic version of the gridworld, the transition dynamics are not known. That is, the agent does not know beforehand that selecting the action "north" leads to an upward movement. When the transition dynamics are given, the agent can use this information to plan ahead, and thus to learn more efficiently.
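The slippery variant can be sketched as a small action-perturbation function. I assume here that the slip probability $\epsilon$ is split uniformly over the three non-intended directions; other conventions (e.g., slipping only sideways) are equally common, and both the function name and the default $\epsilon$ are my own choices.

```python
# Sketch of "windy"/"slippery" dynamics: with probability 1 - epsilon the
# intended direction is executed; otherwise one of the remaining three
# directions is chosen uniformly at random.
import random

ACTIONS = ["north", "east", "south", "west"]

def slip(intended, epsilon=0.2):
    """Return the direction actually executed under slippery dynamics."""
    if random.random() < 1 - epsilon:
        return intended
    others = [a for a in ACTIONS if a != intended]
    return random.choice(others)
```

Plugging `slip` in front of the deterministic movement rule turns each probability-1 transition into a distribution over four possible successor states.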

This really is boring... why should I care?

ATARI games, Starcraft, and DotA are certainly much more fun to look at than the gridworld. However, it is difficult to get an understanding of how exactly reinforcement learning algorithms actually learn by looking at videos of well-performing agents. The simplicity of a gridworld allows us to visualize, understand, and ultimately compare the learning behaviours of various algorithms.

Is your entire PhD about Gridworlds?


© 2021