Sarsa vs Q-learning

7/7/2018

The most important difference between them is:
Q-Learning: Off-Policy Learning
Sarsa: On-Policy Learning
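
For reference, the two update rules in their standard textbook form are written out below. The notation is assumed here rather than taken from this post: α is the learning rate, γ the discount factor, and s, a, r are the state, action and reward at the indicated time step.

```latex
\begin{align*}
\text{SARSA:} \quad
  & Q(s_t, a_t) \leftarrow Q(s_t, a_t)
    + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \\
\text{Q-learning:} \quad
  & Q(s_t, a_t) \leftarrow Q(s_t, a_t)
    + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\end{align*}
```

The only difference is the bootstrap term: SARSA uses the value of the next action actually selected, a_{t+1}, while Q-learning uses the maximum over all actions a' in the next state.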

In terms of their update rules, SARSA bootstraps from the action value (Q-value) of the action it actually takes in the next state, chosen by the same policy it is following, and hence it is on-policy learning. Q-learning instead looks at the action values of all possible actions in the next state and uses the maximum, regardless of which action the agent goes on to take. It therefore learns about the greedy policy while following a different (exploratory) one, which is what makes it off-policy learning.
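
As a minimal sketch of that difference, here are the two updates for a tabular Q stored as a NumPy array indexed by `[state, action]`. The function names, the default `alpha` and `gamma`, and the tiny two-state table are all made up for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best-valued action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical table: 2 states x 2 actions, with known values in state 1.
Q_sarsa = np.zeros((2, 2))
Q_sarsa[1] = [1.0, 5.0]
Q_qlearn = Q_sarsa.copy()

# Same transition for both; the agent's next action is the *worse* one (0).
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1, alpha=0.5, gamma=1.0)

print(Q_sarsa[0, 0])   # 0.5: moved halfway towards Q[1, 0] = 1.0
print(Q_qlearn[0, 0])  # 2.5: moved halfway towards max(Q[1]) = 5.0
```

Note that `q_learning_update` does not even take the next action as an argument, which is the off-policy property in a nutshell.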

Q-learning has the following advantages and disadvantages compared to SARSA:

Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If you want to learn an optimal policy using SARSA, then you will need to decide on a strategy to decay ϵ in ϵ-greedy action choice, which may become a fiddly hyperparameter to tune.
Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.
SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.
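
The cliff walking effect can be reproduced with a short tabular experiment. The sketch below assumes the usual textbook setup (a 4x12 grid, reward -1 per step, -100 and a reset to the start for stepping off the cliff, ϵ = 0.1, α = 0.5, no discounting); the grid constants and function names are my own, and the exact numbers will vary with the seed.

```python
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left

def step(state, action):
    """Return (next_state, reward, done) for the cliff-walking grid."""
    r = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    c = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if r == 3 and 0 < c < COLS - 1:   # stepped off the cliff
        return START, -100.0, False   # big penalty, back to the start
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, state, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(len(MOVES)))
    return int(np.argmax(Q[state]))

def train(method, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(MOVES)))
    returns = []
    for _ in range(episodes):
        s, total, done = START, 0.0, False
        a = eps_greedy(Q, s, eps, rng)
        while not done:
            s2, reward, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)  # next behaviour action
            if method == "sarsa":
                bootstrap = Q[s2][a2]         # value of the action taken next
            else:
                bootstrap = np.max(Q[s2])     # value of the greedy action
            target = reward + (0.0 if done else gamma * bootstrap)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a, total = s2, a2, total + reward
        returns.append(total)
    return Q, returns

def greedy_return(Q, max_steps=100):
    """Return collected by following the learned Q greedily (no exploration)."""
    s, total = START, 0.0
    for _ in range(max_steps):
        s, reward, done = step(s, int(np.argmax(Q[s])))
        total += reward
        if done:
            break
    return total

Q_sarsa, ret_sarsa = train("sarsa")
Q_qlearn, ret_qlearn = train("q_learning")
print("mean return while learning (last 200 episodes)")
print("  SARSA:     ", np.mean(ret_sarsa[-200:]))
print("  Q-learning:", np.mean(ret_qlearn[-200:]))
print("greedy return  SARSA:", greedy_return(Q_sarsa),
      " Q-learning:", greedy_return(Q_qlearn))
```

With these settings you should typically see SARSA collecting noticeably more reward while learning, because it keeps away from the edge, whereas Q-learning's greedy path tends to hug the cliff and so keeps falling off whenever an exploratory move points the wrong way.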

In practice the last point can make a big difference if mistakes are costly - e.g. you are training a robot not in simulation, but in the real world. You may prefer a more conservative learning algorithm that avoids high risk if real time and money are at stake should the robot be damaged.
If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.
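
If you do go with SARSA and still want it to end up at the optimal policy, ϵ has to shrink over time, as mentioned in the first point. One common option is an exponential decay with a floor. This is only a sketch: the function name and the values of `eps_start`, `eps_min` and `decay` are arbitrary assumptions, and they are exactly the fiddly hyperparameters you would need to tune.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponentially decayed exploration rate, never going below eps_min."""
    return max(eps_min, eps_start * decay ** episode)

# Exploration rate at a few points during training.
for episode in (0, 100, 500, 2000):
    print(episode, round(epsilon_schedule(episode), 4))
```

Decay too fast and the agent stops exploring before it has found the good path; decay too slowly and SARSA stays overly cautious for a long time.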
