The most important difference between the two is:

Q-Learning: off-policy learning
SARSA: on-policy learning

As per the above mathematical equations, SARSA updates its action value (Q-value) using the action that is actually taken in the next step, and hence it is on-policy learning. Q-learning instead looks at the action values of all possible actions in the next state and uses the maximum one in its update, regardless of which action the agent actually takes next. This is what makes Q-learning off-policy learning.
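To make the difference concrete, here is a minimal tabular sketch of the two update rules in Python. The function names, the dictionary-based Q-table and the default values of alpha and gamma are assumptions made for illustration, not something taken from the original post.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next the agent really takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


if __name__ == "__main__":
    # Hypothetical two-action problem; unseen entries default to 0.0.
    Q = defaultdict(float)
    Q[(1, "left")], Q[(1, "right")] = 1.0, 2.0
    sarsa_update(Q, 0, "left", 1.0, 1, "left")
    q_learning_update(Q, 0, "right", 1.0, 1, ["left", "right"])
    print(Q[(0, "left")], Q[(0, "right")])
```

The only line that differs between the two functions is the target: SARSA needs a_next as an input, while Q-learning needs the list of available actions.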
Q-learning has the following advantages and disadvantages compared to SARSA:
If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.
There are two main ways to train reinforcement learning algorithms:

On-Policy - The agent itself takes actions inside an environment and learns from them. e.g. SARSA, TD(lambda), Actor-critic
Off-Policy - Something else takes actions in an environment; your agent trains on recorded trajectories with those actions and then tries to act optimally by itself. e.g. Q-learning, R-learning
Experience replay is generally used for off-policy learning. After the agent takes an action (for example a random one from the action space), the tuple <s(t), a, r, s(t+1)> is stored in memory. The stored tuples are later used to train the algorithm, and each tuple can be learned from multiple times. On-policy learning requires an on-the-spot update of the policy, so it can't use experience replay. TD(lambda), actor-critic and SARSA are examples of on-policy learning, while A3C can be used as either on-policy or off-policy. The above figure shows the pseudo code for reinforcement learning with experience replay. All the experiences are stored in a database, and these experiences are then used to train the Q-values using a neural network.

Multi-step Learning

It is simply a look forward in the training. n-step MC, TD(lambda) and n-step A3C are some examples of it. The agent starts from one state and takes 'n' actions, receiving the following states and rewards. These rewards are then used to compute the discounted return, and the policy is updated using it. This method also requires storing the n tuples in memory, but the tuples are not learned from multiple times. This gives explicit exploration to the algorithm, as the policy used to explore will visit various state-action pairs and give good results.
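The two ideas above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name ReplayBuffer, its capacity, and the helper n_step_return with its bootstrap_value argument are hypothetical names, not part of any particular library or of the original pseudo code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of <s(t), a, r, s(t+1)> tuples (hypothetical helper)."""

    def __init__(self, capacity=10_000):
        # Oldest experiences are dropped automatically once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling lets the same tuple be learned from many times.
        items = list(self.memory)
        return random.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.memory)


def n_step_return(rewards, bootstrap_value=0.0, gamma=0.99):
    """Discounted return of n collected rewards plus a bootstrapped tail value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3)
    for t in range(5):
        buffer.store(t, "a", 1.0, t + 1)
    print(len(buffer), buffer.sample(2))
    print(n_step_return([1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5))
```

In a replay setup the sampled batch would be fed to the Q-value network; in the multi-step setup the n rewards are used once to form the return and then discarded.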
Reinforcement learning involves two types of value functions: 1) V(s), called the state value function, and 2) Q(s,a), called the action value function. The state value function tells how good it is to be in a particular state, while the action value function evaluates how good it will be to take a given action from a given state. This is an important explanation for the further topic.

Let us consider a game on a 4x3 grid where the agent starts at a fixed location (bottom left corner) with a fixed goal (top right corner). Here we already know the probability of reaching a certain new state given the initial state and the action taken. So we already know the dynamics of the game, which means we are relying on a given model of the game. This is an example of Model-Based RL.

But suppose an autonomous car is driven by an RL network; then there won't be exact dynamics. Likewise, Atari games can be played using only the images of the screen. So we cannot predict the new state even if we know the action taken. Such an environment is known as a Model-Free environment, or a black box.

We can use either of the two value functions mentioned in the first passage to find an optimal policy for model-based RL, as the action value function can be found using the environment dynamics/model and the state value function. Both will work similarly. But in the case of Model-Free RL, the state value function alone will not indicate anything significant, and the action value function is the most important thing to calculate, because here we do not know which action will lead to which state and what the reward will be. So it is necessary to explore most of the actions and states. In such cases, the action value function will tell us the value associated with each state and each action. The state value function can then be found from the action value function.
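As a small worked sketch of the model-based case, the code below computes Q(s,a) from V(s) when the transition model is known. The dictionary layout of the model, the state names and the numbers are all made up for illustration.

```python
def q_from_v(V, transitions, state, action, gamma=0.9):
    """Q(s,a) = sum over s' of P(s'|s,a) * (reward + gamma * V(s')).

    transitions[(state, action)] is assumed to be a list of
    (probability, next_state, reward) triples, i.e. a known model.
    """
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(state, action)])


def greedy_action(V, transitions, state, actions, gamma=0.9):
    """Pick the action with the highest Q(s,a) derived from V and the model."""
    return max(actions, key=lambda a: q_from_v(V, transitions, state, a, gamma))


if __name__ == "__main__":
    # Hypothetical two-state example: moving right from A usually reaches B.
    V = {"A": 0.0, "B": 10.0}
    transitions = {
        ("A", "right"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
        ("A", "stay"): [(1.0, "A", 0.0)],
    }
    print(q_from_v(V, transitions, "A", "right"))
    print(greedy_action(V, transitions, "A", ["right", "stay"]))
```

Without the transitions table this computation is impossible, which is why a model-free agent has to learn Q(s,a) directly from experience.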
But in a Model-Free environment the dynamics are unknown, so the state value function alone will not be effective for finding the policy pi(a | s). The reason is that we cannot obtain the action value function from it, as we have no idea which action will lead to which state. It means we know which states will be better in the future, but we have no clue which action should be selected to reach those states.
On the other hand, if we choose the action value function, we know which action from the action space is best for a given state, and hence we can find the optimal policy using it in a Model-Free environment.

There may be many concepts or definitions of reinforcement learning. Some say it is an area of Machine Learning inspired by behaviourist psychology: a branch of AI that deals with software agents that automatically determine the ideal behaviour within a specific context in order to maximize their performance. According to KDnuggets, "Reinforcement Learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward." But I will give you a very simple example of reinforcement learning which will make you understand its concept. Learning to ride a motorcycle is the best example of RL. From switching on the bike, to controlling the balance, to taking a turn, to reaching the destination, every stage is a state in the state space for the human agent, and doing it properly gives the positive reward of staying fit. Taking a wrong step in a state will lead to an accident. Injury will be the negative reward.
After riding the bike for a few months, we gather a lot of experience of driving in harsh traffic or crowded places. On this basis our brain, with its numerous neurons, chooses the optimal policy using what we hear and see. How brilliant our brain is! I was riding a bike and a car crossed in front of me. I slowed down and waited for the car to pass. When the car had half passed, I raised my speed to go ahead. This needed a proper guess of the speed of the car, the time it would take to cross, and the speed of my bike. All this comes with experience, and you will master driving after a lot of experience. Hence it is a very clear example of RL.

We use locks to keep our valuables safe from thieves or mischief. A lock on a door is for the valuables in the house. A lock on a bank locker keeps gold or documents safe. A lock/password on an account keeps it private. In the same manner, multiple processes require a lock to keep shared memory addresses private for particular operations. A lock ensures that only one process at a time performs an operation on the shared memory, and the other processes must wait before they can access its value. Consider bank software with two processes: one is withdrawal and the other is deposit. If both processes run in parallel with the balance as a shared memory address, then the value of the balance at the end of both processes may be wrong. To prevent this, a lock is applied to the shared memory during each process.

    import multiprocessing
    ...
    lock = multiprocessing.Lock()
    ...
    lock.acquire()
    ...
    lock.release()
    ...

acquire() is the function that locks the shared memory for one particular process for that moment. release() releases the lock, and the memory address is then free to be accessed by another process.
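The bank example can be sketched as a small runnable program. The function names, the starting balance and the number of transactions are assumptions chosen for illustration; only the multiprocessing calls themselves come from the standard library.

```python
import multiprocessing


def deposit(balance, lock, times):
    """Add 1 to the shared balance `times` times, holding the lock each time."""
    for _ in range(times):
        lock.acquire()
        try:
            balance.value += 1
        finally:
            lock.release()


def withdraw(balance, lock, times):
    """Subtract 1 from the shared balance `times` times, holding the lock."""
    for _ in range(times):
        lock.acquire()
        try:
            balance.value -= 1
        finally:
            lock.release()


def run_transactions(start=100, times=10_000):
    # lock=False gives a raw shared integer, so the explicit Lock below
    # is the only thing protecting the read-modify-write on balance.value.
    balance = multiprocessing.Value("i", start, lock=False)
    lock = multiprocessing.Lock()
    workers = [
        multiprocessing.Process(target=deposit, args=(balance, lock, times)),
        multiprocessing.Process(target=withdraw, args=(balance, lock, times)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return balance.value


if __name__ == "__main__":
    print(run_transactions())
```

Because every deposit is matched by a withdrawal and each update happens under the lock, the final balance equals the starting balance. Removing the acquire/release calls would allow the two processes to overwrite each other's updates.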
As we already know, each process has its own memory allocation. So if a variable is declared in one process, we can't update it from another process. Even if we declare a variable as a global variable, it is defined only for that one process. Suppose the main program is process 1 and the calculation of squares of given numbers is process 2. An empty list named result is declared in both processes. If we append the squares of the numbers to the result list in process 2, it will not update the result list in process 1, as the two lists have different memory addresses. So we will create shared memory which can be used by both processes. Below is the syntax to define a shared Array, Value and Queue.

    import multiprocessing
    ...
    arr = multiprocessing.Array(data_type, size)
    val = multiprocessing.Value(data_type, numerical_value)
    q = multiprocessing.Queue()
    ...

All of these can be shared with another process by passing them as arguments to the process. Updating the value of these variables in any process will then update their value in the other process too.
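Here is a minimal sketch of the squares example using all three objects. The function names compute_squares and run_shared_memory are hypothetical, and the "i" type code (a C int) is just one possible choice of data_type.

```python
import multiprocessing


def compute_squares(numbers, squares, total, queue):
    """Runs in the child process and writes results into the shared objects."""
    for i, n in enumerate(numbers):
        squares[i] = n * n      # shared Array, indexed like a list
        queue.put(n * n)        # Queue passes each value to the parent
    total.value = sum(squares[:])  # shared Value holds a single number


def run_shared_memory(numbers):
    squares = multiprocessing.Array("i", len(numbers))
    total = multiprocessing.Value("i", 0)
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(
        target=compute_squares, args=(numbers, squares, total, queue)
    )
    p.start()
    # Drain the queue before join() so the child is never blocked on a full pipe.
    from_queue = [queue.get() for _ in numbers]
    p.join()
    return squares[:], total.value, from_queue


if __name__ == "__main__":
    print(run_shared_memory([1, 2, 3]))
```

The parent never computes a square itself; it only reads what the child wrote, which shows that the memory really is shared.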
Multi-processing is used to do multiple tasks at a time on a computer, which speeds up the work. We have already seen the difference between multiprocessing and threading, so here we will study how to begin with multiprocessing. Python contains a multiprocessing module, which provides various classes and functions for creating a process. Below is a piece of code to create a process.

    import multiprocessing
    ...
    p = multiprocessing.Process(target=function_name, args=(arg1, arg2, ...))
    p.start()
    p.join()
    ...

Process is the class used to create a process. Its target is the function that will be executed when the process begins, and args holds the arguments for the target function. start() initiates the process, and join() waits until the process has finished.
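A complete, runnable version of the snippet might look like the sketch below. The target function print_squares and the wrapper run_in_process are made-up names for illustration.

```python
import multiprocessing


def print_squares(numbers):
    """Target function: runs inside the new process."""
    for n in numbers:
        print(f"{n} squared is {n * n}")


def run_in_process(numbers):
    # args must be a tuple, hence the trailing comma for a single argument.
    p = multiprocessing.Process(target=print_squares, args=(numbers,))
    p.start()   # launch the child process
    p.join()    # block until the child has finished
    return p.exitcode  # 0 means the target returned without an error


if __name__ == "__main__":
    print("exit code:", run_in_process([1, 2, 3]))
```

The `if __name__ == "__main__":` guard matters on platforms that start child processes by re-importing the main module, since it stops the child from launching processes of its own.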
Both are ways of multi-tasking. A computer can do multiple tasks at a time; running Google Chrome, VLC media player and a Python program at the same time is multi-tasking. Each of these is known as a process. These processes can be seen in the Task Manager (in Windows) and in the terminal (in Linux). Each process has a process ID known as a PID, and each process has its own memory location.

Each thread acts within a process, and there can be multiple threads within a single process. The memory address space is shared by all the threads. If a variable is globally declared in a process, then each thread can access this variable. The above image makes the concept more clear.

On the other hand, processes have their own address spaces. They need inter-process communication techniques to communicate with each other. A globally declared variable in one process can't be accessed by another process. The above image will make it more clear. Files, shared memory and message pipes are some examples of inter-process communication techniques.

Advantages of Process over Threading:
An error or memory leak in one process won't hurt the execution of another process. But if there is a memory leak in one thread, it can damage the other threads. Hence every piece of software creates its own process, so a failure in one program will not harm other software on the computer.

Collaboration indicates the work of a team to achieve a certain goal, with an individual task for everyone. Collaboration doesn’t require sharing of information amongst each other. Here the success doesn’t depend on everyone.
A cooperative policy indicates team work with everyone working towards the goal, and the result depends on the effort of the complete team. Let’s understand it with an example! Suppose there is a house which has to be emptied, and there are three persons in the house. If each person carries items out of the house without any conversation between them, it is a Collaborative Policy. No one knows what the others are thinking or what strategy they have in mind, but the task will be accomplished. On the other hand, suppose there is a sofa in the house. These three persons will have to lift the sofa together and take it out. This requires a lot of coordination among them. They must communicate amongst themselves about their policy to achieve the goal. This is called a Cooperative Policy.
Author: Vinit is a computer scientist and roboticist. His research focuses on making machines intelligent.