VINIT SARODE - Blogs

Sarsa vs Q-learning

7/7/2018

The most important difference between them is as follows:
Q-Learning: Off-Policy Learning
Sarsa: On-Policy Learning
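
For reference, the two update rules are usually written as below. This is the standard textbook form; the symbols α (learning rate) and γ (discount factor) are notation assumed here rather than taken from this post.

```latex
\begin{align}
\text{SARSA:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \\
\text{Q-learning:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
\end{align}
```

The only difference is the bootstrap term: SARSA uses the action a_{t+1} that was actually selected, while Q-learning uses the maximum over all actions in the next state.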

So, as per the above mathematical equations, SARSA updates the action value function (Q-value) using the action which has actually been taken in the next step, and hence it is an on-policy learning method. Q-learning, on the other hand, looks at the action values of all possible actions in the next state and uses the one with the maximum action value, regardless of which action is taken afterwards. This is why Q-learning is an off-policy learning method.
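
A minimal sketch of the two updates for a tabular Q-function may make the difference concrete. The function names, the nested-list layout Q[state][action] and the numbers are illustrative assumptions, not code from this post.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action a_next that was actually chosen."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Q[state][action]: two states, two actions, made-up starting values.
Q_sarsa = [[0.0, 0.0], [1.0, 2.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 2.0]]

# Same transition for both: state 0, action 0, reward 1, next state 1.
# The behaviour policy happened to pick action 0 in the next state.
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1)

print(round(Q_sarsa[0][0], 2))   # 0.95, uses Q[1][0] = 1.0
print(round(Q_qlearn[0][0], 2))  # 1.4, uses max(Q[1]) = 2.0
```

With identical experience, the two methods move Q[0][0] towards different targets as soon as the action taken next is not the greedy one.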

Q-learning has the following advantages and disadvantages compared to SARSA:

Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If you want to learn an optimal policy using SARSA, then you will need to decide on a strategy to decay ϵ in ϵ-greedy action choice, which may become a fiddly hyperparameter to tune.
Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.
SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.
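
The ϵ-greedy action choice and a decay strategy mentioned in the first point could look like the sketch below. The linear schedule and the parameter names (start, end, decay_steps) are assumptions chosen for illustration; other schedules, such as exponential decay, are equally common.

```python
import random


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon shrinks as training progresses and then stays at its floor.
for step in (0, 500, 1000, 5000):
    print(step, round(decayed_epsilon(step), 3))

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # 1
```

How fast ϵ should fall is exactly the fiddly part: too fast and SARSA settles on an overly cautious policy, too slow and it keeps paying for exploratory moves.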

In practice the last point can make a big difference if mistakes are costly - e.g. you are training a robot not in simulation, but in the real world. You may prefer a more conservative learning algorithm that avoids high risk, if there was real time and money at stake if the robot was damaged.
If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.

On-Policy vs Off-Policy Learning

5/29/2018

There are two main ways to train reinforcement learning algorithms:
On-Policy - The agent itself takes actions inside an environment and learns from them. e.g. SARSA, TD(lambda), Actor-critic
Off-Policy - Something else takes actions in an environment; your agent trains on the recorded trajectories of those actions and then tries to act optimally by itself. e.g. Q-learning, R-learning

On-Policy
1. Agent can pick actions.

2. On-policy algorithms can't learn off-policy.

3. They are generally simpler and often faster to converge.

4. The agent always follows its own policy.

5. Start with a simple soft policy.

6. Sample state space with this policy.

7. Improve policy.

Off-Policy
1. The agent does not have to pick the actions itself.

2. Off-policy algorithms can also learn on-policy.

3. Can learn the optimal policy even if the agent takes random actions.

4. Learning from an expert (the expert is imperfect). Learning from sessions (recorded data). Learning with exploration, playing without exploration.

5. Gather information from random moves.

6. Evaluate states as if a greedy policy was used.

7. Slowly reduce randomness.
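
The off-policy idea of gathering information from random moves while evaluating states as if a greedy policy was used can be sketched on a toy problem. Everything here is an assumption made for illustration: a hypothetical five-state chain with a goal at the right end, and tabular Q-learning as the off-policy learner.

```python
import random

N_STATES = 5          # states 0..4, state 4 is the goal (terminal)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: move left or right, reward 1 on reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def learn_from_random_behaviour(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice((LEFT, RIGHT))               # behaviour policy: purely random
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(Q[next_state])  # target policy: greedy
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q


Q = learn_from_random_behaviour()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # [1, 1, 1, 1] -> always move right
```

The agent never acts greedily during training, yet the greedy policy read off from Q at the end heads straight for the goal.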

Experience Replay vs Multi-step Learning

5/29/2018

Experience replay is generally used for off-policy learning. After each action taken from the action space (random actions at first), the tuple <s(t), a, r, s(t+1)> is stored in memory. The stored tuples are later reused to train the algorithm, so the same experience can be learned from multiple times. On-policy learning requires an on-the-spot update of the policy, so it can't use experience replay. TD(lambda), actor-critic and SARSA are examples of on-policy learning, while A3C can be used as either on-policy or off-policy.

The above figure shows the pseudocode for reinforcement learning with experience replay. All the experiences are stored in a database. These experiences are then used to train a neural network that estimates the Q-values.
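
A minimal replay memory could be sketched as follows. The class name ReplayBuffer, its methods and the fixed capacity are assumptions made for this example; it is not the pseudocode from the figure.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch; the same tuple can be reused across batches."""
        size = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                       # made-up transitions s(t) -> s(t+1)
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1)

print(len(buffer))        # 3, only the latest three transitions are kept
batch = buffer.sample(2)  # two tuples drawn at random for one training step
print(batch)
```

Sampling at random from the memory, rather than replaying transitions in order, is what lets the same data be used for many training steps.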

Multi-step Learning

It is simply a look-ahead during training. n-step MC, TD(lambda) and n-step A3C are some examples of it. The agent starts from one state and takes 'n' actions, receiving the following states and rewards. These rewards are then combined into the discounted reward (the return), and the policy is updated using it. This method also requires storing the n tuples in memory, but the tuples are not reused for learning multiple times. It gives the algorithm explicit exploration, since the policy used to explore visits a variety of state-action pairs, which tends to give good results.
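
The discounted return over 'n' steps can be computed as in this sketch. The function name and the optional bootstrap_value argument (the estimated value of the state reached after the n steps) are assumptions for illustration.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of n rewards plus the discounted value of the state reached after them."""
    g = bootstrap_value
    for r in reversed(rewards):   # work backwards: g = r + gamma * g
        g = r + gamma * g
    return g


# Three steps with reward 1 each, gamma = 0.5, and a value estimate of 4 for the final state:
# 1 + 0.5 * 1 + 0.25 * 1 + 0.125 * 4 = 2.25
print(n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0))  # 2.25
```

With bootstrap_value left at zero and the rewards of a whole episode, the same function gives the Monte Carlo return.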

Experience Replay

Stores the experiences in a database so that the network/algorithm can be trained multiple times on the same data.
Used with off-policy algorithms.
Used when exploration is costly, e.g. using robots to explore an environment is an expensive affair.

Multi-step Learning

Stores the experiences in a database to compute the discounted rewards for the forward-view algorithm.
Used with on-policy or off-policy algorithms.
Can be used to improve the algorithm when the learning should be affected by several steps and not only by a single action.

Model-Based vs Model-Free RL

5/27/2018

Reinforcement Learning works with two types of value functions: 1) V(s) and 2) Q(s,a).
V(s) is called the state value function and Q(s,a) is the action value function. The state value function tells how good it is to be in a particular state, while the action value function tells how good it is to take a given action from a given state. This is important background for the rest of the topic.

Let us consider a game on a 4x3 grid where the agent starts at a fixed location (bottom left corner) with a fixed goal (top right corner). Here we already know the probability of reaching a certain new state given the initial state and the action taken. So we already know the dynamics of the game, which means we are relying on a given model of the game. This is an example of Model-Based RL. Now consider an autonomous car driven by an RL network: there are no exact dynamics available. The same holds for Atari games played from the images of the screen. We cannot predict the new state even if we know the action taken. Such environments are known as Model-Free environments, and such an environment is treated as a black box.

We can use either of the two functions mentioned in the first paragraph to find an optimal policy in Model-Based RL, because the action value function can be computed from the environment dynamics/model and the state value function. Both will work similarly. But in the case of Model-Free RL, the state value function on its own will not tell us anything useful for choosing actions. The action value function is the most important thing to compute in this situation, because we do not know which action will lead to which state or what the reward will be. So we need to explore most of the states and actions. In such cases, the action value function tells us the value associated with each state and each action.

The state value function can be found using the action value function and the model of the environment. But in a Model-Free environment, the policy pi(a | s) is unknown and iterating on the state value alone is not effective for finding the policy. The reason is that we cannot obtain the action value function from it, as we have no idea which action will lead to which state. In other words, we may know which states will be better in the future, but we have no clue which action should be selected to reach them.
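
The link between the two value functions can be written out as below. These are the standard Bellman relations for a fixed policy; the symbols P (transition probabilities), R (reward) and γ (discount factor) are notation assumed here, not used in this post.

```latex
\begin{align}
Q^{\pi}(s, a) &= \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma\, V^{\pi}(s') \right] \\
V^{\pi}(s) &= \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
\end{align}
```

The first line needs the transition model P. Without it, knowing V alone cannot be turned into action values, which is the difficulty described above.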

In contrast, if we use the action value function, we know which action from the action space is the best to choose in a given state, and hence we can find the optimal policy with it in a Model-Free environment.
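
The contrast can be sketched in a few lines. The two-state model, its probabilities and the helper names q_from_v and greedy_action are hypothetical, chosen only to show that turning V into action values needs the model, while reading the best action off Q does not.

```python
def q_from_v(model, V, gamma):
    """Model-based: Q(s, a) = sum over outcomes of p * (r + gamma * V[s'])."""
    return {
        s: {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for a, outcomes in actions.items()}
        for s, actions in model.items()
    }


def greedy_action(Q, s):
    """Model-free: with Q in hand, the best action is a simple lookup."""
    return max(Q[s], key=Q[s].get)


# Hypothetical model: model[s][a] = [(probability, next_state, reward), ...]
model = {
    "A": {"left": [(1.0, "A", 0.0)], "right": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 0.0)]},
}
V = {"A": 0.0, "B": 10.0}   # made-up state values

Q = q_from_v(model, V, gamma=0.5)
print(greedy_action(Q, "A"))  # right
```

In a Model-Free setting the `model` dictionary is not available, so Q has to be estimated directly from experience instead of being derived from V.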

What is Reinforcement Learning?

5/23/2018

There may be many concepts or definitions of reinforcement learning. Some say it is an area of Machine Learning inspired by behaviourist psychology: a branch of AI that deals with software agents automatically determining the ideal behaviour within a specific context in order to maximize their performance. According to KDnuggets, "Reinforcement Learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward."

But I will give you a very simple example of Reinforcement Learning which will help you understand the concept. Learning to ride a motorcycle is the best example of RL. From switching on the bike, to controlling the balance, to taking a turn, up to reaching the destination, every stage is a state in the state space of the human agent, and doing it properly gives a positive reward of being fit. Taking a wrong step in a state will lead to an accident. Injury will be the negative reward.

After riding the bike for a few months, we gather a lot of experience of riding in heavy traffic or in crowded places. On this basis, our brain, with its numerous neurons, chooses the optimal policy using what we hear and see. How brilliant our brain is! Once, I was riding my bike and a car crossed in front of me. I slowed down and waited for the car to pass. When the car had half passed, I sped up to go ahead. This needed a proper guess of the speed of the car, the time it would take to cross and the speed of my bike. All this comes with experience, and you will master riding only after a lot of it. Hence it is a very clear example of RL.

Multiprocessing Lock

5/17/2018

We use locks to keep our valuables safe from thieves or mischief. A lock on a door protects the valuables in a house. A lock on a bank locker keeps gold or documents safe. A lock/password on an account keeps it private.
In the same manner, multiple processes require a lock to keep shared memory addresses private for particular operations. The lock ensures that only one process at a time performs an operation on the shared memory, while any other process that requests the lock has to wait until it is released.

Consider banking software with two processes. One is withdrawal and the other is deposit. If both processes run in parallel with the balance as a shared memory address, then the value of the balance at the end of both processes may be wrong. To prevent this, a lock is applied to the shared memory during each operation.

import multiprocessing
...
lock = multiprocessing.Lock()
...
lock.acquire()
...
lock.release()
...

acquire() is the method that locks the shared memory for one particular process for that moment. release() releases the lock, and the memory address is then free to be accessed by another process.
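
A small runnable sketch of the bank example could look like this. The function names, the starting balance and the number of transactions are assumptions made for illustration; it is not the code behind the GitHub link.

```python
import multiprocessing


def deposit(balance, lock, times):
    """Add 1 to the shared balance `times` times, one locked step at a time."""
    for _ in range(times):
        lock.acquire()
        try:
            balance.value += 1    # read-modify-write, protected by the lock
        finally:
            lock.release()


def withdraw(balance, lock, times):
    """Subtract 1 from the shared balance `times` times."""
    for _ in range(times):
        with lock:                # same as acquire() ... release()
            balance.value -= 1


if __name__ == "__main__":
    balance = multiprocessing.Value("i", 200)   # shared integer
    lock = multiprocessing.Lock()
    d = multiprocessing.Process(target=deposit, args=(balance, lock, 10000))
    w = multiprocessing.Process(target=withdraw, args=(balance, lock, 10000))
    d.start()
    w.start()
    d.join()
    w.join()
    print(balance.value)  # 200
```

Without the lock, the two processes can interleave between reading and writing balance.value, and the final balance may differ from 200.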

Github Link

Shared Memory (in Multiprocessing)

5/17/2018

As we already know, each process has its own memory allocation. So, if a variable is declared in one process, we can't update it from another process. Even if we declare a variable as a global variable, it will be defined only for that process.

Suppose the main program is process 1 and the calculation of squares of given numbers is process 2. An empty list named result is declared in both processes. If we append the squares of the numbers to the result list in process 2, it will not update the result list declared in process 1, as the two lists have different memory addresses.

So, we will create a shared memory which can be used by both processes. Below is the syntax to define a shared Array, Value and Queue.

import multiprocessing
...
arr = multiprocessing.Array(data_type, size)
val = multiprocessing.Value(data_type, numerical_value)
q = multiprocessing.Queue()
...

All of these can be shared with another process by passing them as arguments to the process. Updating the value of these variables in any process will update their value in the other processes too.
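
The squares example described above could be sketched as follows. The function name calc_squares and the sample numbers are assumptions; the type code "i" stands for a signed integer.

```python
import multiprocessing


def calc_squares(numbers, result, total, q):
    """Write squares into the shared Array, their sum into the shared Value, and each square into the Queue."""
    for i, n in enumerate(numbers):
        result[i] = n * n
        q.put(n * n)
    total.value = sum(result[:])


if __name__ == "__main__":
    numbers = [2, 3, 5]
    result = multiprocessing.Array("i", len(numbers))  # shared array of ints
    total = multiprocessing.Value("i", 0)              # shared int
    q = multiprocessing.Queue()                        # shared queue

    p = multiprocessing.Process(target=calc_squares, args=(numbers, result, total, q))
    p.start()
    from_queue = [q.get() for _ in numbers]  # drain the queue before joining
    p.join()

    print(result[:])     # [4, 9, 25]
    print(total.value)   # 38
    print(from_queue)    # [4, 9, 25]
```

The child process fills the shared objects, and the parent sees the updated values after the child finishes, which would not happen with an ordinary list.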

To understand the use of this concept check links below:
github link 1
github link 2

Multi-Processing

5/17/2018

Multi-processing is used to do multiple tasks at a time on a computer. It can speed up a program. We have already seen the difference between multiprocessing and threading. So, here we will study how to begin with multiprocessing.

Python contains a multiprocessing module. This module has various functions available to create a process. Below is a piece of code to create a process.

import multiprocessing
...
p = multiprocessing.Process(target=function_name, args=(arg1, arg2, ...))
p.start()
p.join()
...

Process is the class used to create a process. It takes the target function, which is executed when the process begins, and the arguments for that target function. The start() method initiates the process and the join() method waits until the process has finished.
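
A complete minimal program using these calls might look like the sketch below. The worker and compute_squares functions are hypothetical examples and not the code from the GitHub link.

```python
import multiprocessing


def compute_squares(numbers):
    """Plain helper: return the square of each number."""
    return [n * n for n in numbers]


def worker(numbers):
    """Target function: runs inside the child process."""
    print("squares:", compute_squares(numbers))


if __name__ == "__main__":
    # args must be a tuple, hence the trailing comma for a single argument.
    p = multiprocessing.Process(target=worker, args=([1, 2, 3],))
    p.start()   # launch the child process
    p.join()    # block until it has finished
    print("exit code:", p.exitcode)
```

The `if __name__ == "__main__":` guard matters on platforms that start child processes by re-importing the main module; without it the child would try to create processes again.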

To understand in a better way, visit the github link.

Multiprocessing vs Threading

5/16/2018

Both are ways of multi-tasking. A computer can do multiple tasks at a time. For example, running Google Chrome, VLC media player and a Python program at the same time is multi-tasking. Each of these is known as a process. These processes can be seen in the Task Manager (in Windows) and in the terminal (in Linux). Each process has a process ID known as the PID.

Each process has its own memory space. Each thread acts within a process, and there can be multiple threads within a single process. The memory space of the process is shared by all of its threads. If a variable is declared globally in a process, then each thread can access this variable. The above image makes the concept clearer.

On the other hand, processes have their own address spaces. They need inter-process communication techniques to communicate with each other. A globally declared variable in one process can't be accessed by another process. The above image makes this clearer. Files, shared memory and message pipes are some examples of inter-process communication techniques.

Advantages of Processes over Threads:
An error or memory leak in one process won't hurt the execution of another process. But if there is a memory leak in one thread, it can damage the other threads. Hence every piece of software creates its own process, so a failure in one program will not harm the other software on the computer.

Difference between Collaborative and Cooperative Policies

5/7/2018

Collaboration indicates the work of a team towards a certain goal, with an individual task for everyone. Collaboration doesn’t require sharing information with each other. Here the success doesn’t depend on everyone.

A cooperative policy indicates teamwork with everyone working towards the goal, where the results depend on the effort of the complete team.

Let’s understand it with an example! Suppose there is a house which has to be emptied, and there are three people in the house. If each person carries items out of the house without any conversation between them, then it is a Collaborative Policy. No one knows what the others are thinking or what strategy they have in mind. But the task will be accomplished.

On the other hand, suppose there is a sofa in the house. These three people will have to lift the sofa together and carry it out. This requires a lot of coordination among them. They must communicate amongst themselves about their policy to achieve the goal. This is called a Cooperative Policy.
