SARSA (vs. Q-learning)

This is the fourth post of the blog series Reinforcement learning: line by line. Here we take a look at the tabular Sarsa algorithm and compare it to the Q-learning algorithm (discussed in the previous post). In case you are completely new to reinforcement learning (RL), see here for an informal introduction.

The interactive sketch below shows an implementation of the tabular Sarsa algorithm applied to a version of a simple game called the *Pancakes Gridworld*. See here for more information about the Pancakes Gridworld.

The tabular Sarsa algorithm is conceptually very similar to the Q-learning algorithm in that, in every time step, the agent uses only information from the current transition to improve its action-value estimates. The main conceptual difference between the two algorithms is the update rule (line 6 of the pseudo code), as discussed further below in more detail.

One technical consequence of using this other update rule is that the Sarsa algorithm has a slightly different algorithmic structure. Specifically, in each iteration the Sarsa algorithm requires the current state $S$, the current action $A$, the reward $R$, the next state $S'$, *and the next action* $A'$. The Q-learning update, on the other hand, only requires the variables $S, A, R, S'$. This leads to a different structure of learning updates:

$\text{Q-learning: } \ \rlap{$\overbrace{\phantom{S_0, A_0, R_1, S_1}}^{1^\text{st}
\text{update}}$} S_0, A_0, R_1,
\rlap{$\underbrace{\phantom{S_1, A_1, R_2, S_2}}_{2^\text{nd}
\text{update}}$} S_1, A_1, R_2,
\rlap{$\overbrace{\phantom{S_2, A_2, R_3, S_3}}^{3^\text{rd}
\text{update}}$} S_2, A_2, R_3,
\rlap{$\underbrace{\phantom{S_3, A_3, R_4, S_4}}_{4^\text{th}
\text{update}}$} S_3, A_3, R_4, S_4, A_4, \dots$

$\text{Sarsa: } \ \rlap{$\overbrace{\phantom{S_0, A_0, R_1, S_1, A_1}}^{1^\text{st}
\text{update}}$} S_0, A_0, R_1,
\rlap{$\underbrace{\phantom{S_1, A_1, R_2, S_2, A_2}}_{2^\text{nd}
\text{update}}$} S_1, A_1, R_2,
\rlap{$\overbrace{\phantom{S_2, A_2, R_3, S_3, A_3}}^{3^\text{rd}
\text{update}}$} S_2, A_2, R_3,
\rlap{$\underbrace{\phantom{S_3, A_3, R_4, S_4, A_4}}_{4^\text{th}
\text{update}}$} S_3, A_3, R_4, S_4, A_4, \dots$

For the Q-learning algorithm, only the state is shared between two successive updates. For the Sarsa algorithm, both the state and the action are shared between successive updates. This difference is the reason for the different “rhythm” in the algorithm structure, as further explained in the algorithm description. In particular, the Sarsa algorithm always has to carry around two actions (the current action $A$ and the next action $A'$), whereas the Q-learning algorithm requires only one action variable ($A$).

The remainder of this section goes through the Sarsa pseudo code, line by line. It focuses on the differences to the Q-learning pseudo code, which was explained here (pseudo code lines that are identical in both algorithms are grayed out).

$0: \text{Loop for each episode:}$ (“outer loop”, same as in Q-learning)

$1: S \leftarrow s_0$ (“set agent to starting state”, same as in Q-learning)

$2: \text{Select action } A \text{ from state } S \text{ using } \epsilon\text{-greedy policy}$ - This is similar to code line 3 of Q-learning, but here we select an action before entering the inner loop (code line 4). This is the different “rhythm” mentioned earlier: the Sarsa algorithm requires both the current action and the next action for its update.

$3: \text{Loop until a terminal state is reached:}$ (“inner loop”, same as in Q-learning)

$4: \text{Take action } A, \text{ observe reward } R \text{ and new state } S'$ (same as in Q-learning)

$5: \text{Select action } A' \text{ from state } S' \text{ using } \epsilon\text{-greedy policy}$ - Here the *next action* $A'$ is selected according to an $\epsilon$-greedy policy, based on the current Q-value estimates.
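As a concrete sketch, the $\epsilon$-greedy selection step used in code lines 2 and 5 could look as follows in Python (the `Q` array layout and the `epsilon_greedy` helper name are illustrative choices, not part of the pseudo code above):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick a greedy action w.r.t. the current Q estimates."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(Q[state]))          # exploit

```

Note that the same helper serves both code lines: the only difference is whether it is called with $S$ (line 2) or $S'$ (line 5).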

$6: Q(S, A) \leftarrow (1 - \alpha) \ Q(S, A) + \alpha \ [R + \gamma \ Q(S', A')]$ - The learning update. The same general intuition from the Q-learning update applies here as well (the new estimate is a weighted average of the old estimate and a Bellman estimate, as described here). The difference is that for Sarsa the Bellman estimate is not based on the Bellman equation of the optimal value function (as it is for Q-learning), but on the (deterministic) Bellman equation of the **agent's current policy** $\pi$, given by

$q_{\pi}(s, a) = r(s, a) + q_{\pi}(s', a'), \quad \text{for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A},$

where $s' = s'(s, a)$ is the next state as determined by the transition function, and $a' = \pi(s')$ is the action chosen by the current policy in the next state. (The Bellman equation looks slightly more complicated if the policy and/or the environment is stochastic, yet the intuition remains the same; the general equations are provided in Sutton & Barto's book.) The Sarsa update is said to be *on-policy* because the Bellman estimate uses the Q-value of the next action that is actually chosen by the current policy. The Q-learning update, on the other hand, is said to be *off-policy*, because it always considers the maximum Q-value in the next state, regardless of which action the agent chooses next.
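The on-policy/off-policy distinction boils down to which Q-value the Bellman estimate bootstraps on. A small illustrative sketch (the helper names and the list-of-lists Q-table are my own choices):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=1.0):
    # On-policy: bootstrap on the value of the action A' that the
    # policy actually selected in the next state.
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma=1.0):
    # Off-policy: bootstrap on the value-maximizing action in the
    # next state, regardless of which action is taken next.
    return r + gamma * max(Q[s_next])

```

Whenever the selected next action happens to be the greedy one, the two targets coincide; they differ exactly on exploratory moves.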

$7: S \leftarrow S', \ A \leftarrow A'$ - The “next state” becomes the “current state” of the next iteration, just as in Q-learning. In addition, the “next action” becomes the “current action”.
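Putting the walkthrough above together, a minimal Python sketch of tabular Sarsa might look like this. The `env` interface (`reset()`, `step(action)` returning `(next_state, reward, done)`, and an `n_actions` attribute) is a hypothetical assumption for the sketch, not something defined in this post:

```python
import random
from collections import defaultdict

def sarsa(env, episodes, alpha=0.2, gamma=1.0, epsilon=0.1):
    # Q-values default to 0; terminal states are never updated and
    # stay 0, so bootstrapping on them contributes nothing.
    Q = defaultdict(lambda: [0.0] * env.n_actions)

    def policy(s):
        # epsilon-greedy with respect to the current Q-value estimates
        if random.random() < epsilon:
            return random.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):                 # 0: outer loop over episodes
        s = env.reset()                       # 1: set agent to starting state
        a = policy(s)                         # 2: select A before the inner loop
        done = False
        while not done:                       # 3: inner loop until terminal state
            s2, r, done = env.step(a)         # 4: take A, observe R and S'
            s2_a2 = a2 = policy(s2)           # 5: select next action A'
            target = r + gamma * Q[s2][a2]    # 6: on-policy Bellman estimate ...
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target  # ... weighted average
            s, a = s2, a2                     # 7: S <- S', A <- A'
    return Q

```

Note how $A'$ is selected inside the inner loop (line 5) and then carried over into the next iteration (line 7), which is exactly the two-action “rhythm” discussed earlier.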

What difference does it make to use the Sarsa update rule in place of the Q-learning update rule? Let's first compare the policies learned by the two algorithms in the Pancakes Gridworld and then try to generalize our findings.

You might have noticed that the Pancakes Gridworld in the sketch above is different from the one that was used in the Q-learning post. The version used in the present post imitates the “Cliff Walking” example from Sutton & Barto's book (Example 6.6, page 132, in Sutton & Barto, 2018). It's a great example to show the difference between on-policy and off-policy learning for an explorative, stochastic policy (for example, the $\epsilon$-greedy policy), as you will see further below.

The following sketch shows a **Q-learning** agent in the
same Pancakes environment that was used in the Sarsa-sketch at the top of this
post.

If you run both algorithms for long enough, the typical (non-exploring) trajectories taken by the respective agents will eventually look as follows. (This assumes that both algorithms used a constant, small exploration parameter $\epsilon$ *during* the learning process and then turned off exploration, i.e., set $\epsilon=0$, to create these trajectories.)

*Why are the policies learned by Q-learning and Sarsa different?*

The different policies are a direct consequence of the different update rules used by the two algorithms. In what follows, we'll consider an example transition and compare the corresponding learning updates.

Assume that the agent has already interacted with the environment and has learned the following value estimates (hover your mouse above a grid cell to see numerical values).

Furthermore, assume that the agent transitioned from $s_5$ to $s_6$ and then makes an exploratory move (recall that sometimes the agent needs to randomly explore the environment during the learning process, as described here) and selects action “south”, or more formally:

$S = s_5, \ A = east, \ R = -1, \ S' = s_6, \ A' = south.$

The Sarsa update for this transition is given by

$Q(s_5, east) \leftarrow (1 - \alpha) \ Q(s_5, east) +
\alpha \ [-1 + \gamma Q(s_6, \textcolor{blue}{south})],$

whereas the Q-learning update is given by

$Q(s_5, east) \leftarrow (1 - \alpha) \ Q(s_5, east) +
\alpha \ [-1 + \gamma Q(s_6, \textcolor{red}{east})].$

In words, the Sarsa update uses the value corresponding to the “on-policy action” for its update. Q-learning, on the other hand, uses the value-maximizing action (in $s_6$, the action “$east$” has the highest value according to the current estimates). If we plug in the actual values (setting $\alpha = 0.2$ and $\gamma = 1$), we see a big difference in the new value estimates for $Q(s_5, east)$, depending on which update rule is used. For Sarsa, the new value is given by

$Q(s_5, east) = 0.8 \times (-1.82) +
0.2 \times [-1 + (\textcolor{blue}{-36.00})] = -8.86.$

For Q-learning, the new value estimate for $Q(s_5, east)$ is given by

$Q(s_5, east) = 0.8 \times (-1.82) +
0.2 \times [-1 + \textcolor{red}{1.22}] = -1.41.$

Compared with the value estimate of $Q(s_5, east)$ before the update ($-1.82$), the Q-learning update increases the estimate, whereas the Sarsa update decreases it. Importantly, the value-maximizing action in state $s_5$ is now $north$ for the Sarsa agent, whereas it stays $east$ for the Q-learning agent. This is consistent with the typical trajectories of both algorithms shown further above.
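We can verify the arithmetic of both updates directly, plugging in the values from the example above:

```python
alpha, gamma = 0.2, 1.0
q_old = -1.82          # current estimate of Q(s5, east)
r = -1.0               # reward for the transition

# Sarsa bootstraps on Q(s6, south) = -36.00, the action actually taken:
sarsa_new = (1 - alpha) * q_old + alpha * (r + gamma * (-36.00))

# Q-learning bootstraps on max_a Q(s6, a) = Q(s6, east) = 1.22:
q_learning_new = (1 - alpha) * q_old + alpha * (r + gamma * 1.22)

print(round(sarsa_new, 2), round(q_learning_new, 2))  # prints: -8.86 -1.41

```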

*So... which policy is optimal?*

It depends! The Sarsa algorithm learns the optimal policy for an agent that never ceases to explore. For an exploring agent it is dangerous to walk too close to the mushrooms, and it thus pays off to take the two extra steps required for the middle lane ($s_{10}$ to $s_{14}$), further away from the mushrooms.

The Q-learning algorithm, on the other hand, learns the optimal policy for an agent that at some point stops exploration and thus can safely walk close to the mushrooms without ever accidentally eating them.

In the particular setting of the Pancakes world studied here, nothing speaks against stopping exploration at some point after the environment has been thoroughly explored. Therefore, the policy learned by the Q-learning algorithm seems like the better option. However, in general, one could think of environments that change over time (for example, the positions of gold and/or mushrooms could change over time) and thus may require the agent to continue exploring forever. In such a case the Sarsa algorithm might be the better option.

If you liked this post, please consider following me on Twitter for updates on new blog posts.