

Ch:13: Deep Reinforcement learning — Deep Q-learning and Policy Gradients ( towards AGI ).

One of the most exciting developments in AI is #DeepRL. Today we are gonna talk about that #getready

In this story I only talk about two different algorithms in deep reinforcement learning: Deep Q-learning and Policy Gradients.

Spoiler alert!!!

Before I get started, I assume you have checked out my other stories from the previous chapter, “Reinforcement Learning Part 1 and Part 2”. If not, please check those out first, otherwise it’s difficult to catch up with me.

okay.

Why Deep reinforcement learning???

You know I am a Simon Sinek fan. I always start with “why”, then “how” and “what”.

In the last story we talked about “Q-learning”.

Q-learning estimates the state-action value function Q(S, A) for a target policy that deterministically selects the action of highest value.

Here we have a table Q of size S×A.

To calculate the current state-action value, it takes the value of the next best action from the table Q (max_A' Q(S', A')).

Let’s say we have a problem with a total of 4 different actions and 10 different states; then we get a Q table of size 10×4.

This works perfectly fine if we have a limited state space/action space,

but what if the state space is much bigger?? Let’s take an Atari game as an example, where each frame in the game is treated as a single state; then we have millions of states (depending on the type of game).

Or think of a robot walking/moving (here the action space could be much larger, and even unknown).

So we would need to store millions of records in a table in program memory. That’s something we don’t want to do, so we need a better solution.

So we use function approximation, e.g. a neural network, to calculate the Q values.

This is where we bring in supervised learning techniques to build better things with reinforcement learning.

so how do we do this??

The Q-learning update for estimating Q values, from the last story:

Q(S, A) ← Q(S, A) + α (R + γ max_A' Q(S', A') − Q(S, A))
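To make that concrete, here is a minimal tabular sketch of this update in numpy; the 10×4 table, the learning rate and the example transition are just illustrative.

```python
# A minimal tabular Q-learning sketch for a toy problem with 10 states and 4 actions.
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # the S x A table

alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))"""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# example: after taking action 2 in state 3 we observed reward 1 and landed in state 4
q_update(s=3, a=2, r=1.0, s_next=4)
```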

And this is how the neural network Q function works:


The network takes a state (S) as input and produces the Q-values for every action in the action space (the number of actions depends on the game/environment).

The neural network’s job is to learn the parameters. Assume that the training is done and we have the final network ready, so..

At the time of prediction, we use this trained network to pick the next best action to take in the environment: we give it an input state, the network gives the Q values for all actions, and we take the action corresponding to the maximum Q-value.

best_action = arg max(NN predicted Q-values ).

Which means → for this state , this is the best action to take in the env.
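A minimal sketch of that prediction step, assuming `model` is an already-trained Keras network that maps a single state to a vector of Q-values (one per action):

```python
# Pick the greedy action for one state using a trained Q network.
import numpy as np

def best_action(model, state):
    # add a batch dimension: the network expects a batch of states
    q_values = model.predict(state[np.newaxis, ...], verbose=0)[0]
    # best_action = arg max(NN predicted Q-values)
    return int(np.argmax(q_values))
```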

Simple right!

Now let’s talk about the training part.

This is a regression problem, so we can use any regression loss function to minimize the total error over the training data.

neural network loss function for predicting the Q values

l2_loss = (predicted - actual) ** 2

Assume “predicted” comes from the list of action values output by the neural network, from which we take the maximum value (maximum reward).

Since we don’t know the “actual” value (this is reinforcement learning, not supervised learning, so there are no labels), we have to estimate it:

actual = R + γ max_A' Q(S', A')

R → the current immediate reward

S' → the next state

max_A' Q(S', A') → max(NN output list of Q-values)

γ → the discount factor, γ ∈ [0, 1]

The reason for this is

If you take that particular action, you observe the reward R, and you predict the next maximum reward with some discount.

Assume the next max reward predicted by the model is completely wrong because the model has just started training; we still have the reward R, which is 100% right, so slowly we adjust the network parameters towards reasonable predictions.

In a sense the network predicts its own target value, but since R is not biased (when we step into the environment we observe the reward, which is fully correct), the network updates its parameters using backprop and converges.
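A minimal sketch of how the “actual” target and the L2 loss are formed for one observed transition (s, a, r, s', done), assuming the same kind of Keras `model` as above. Note that in this sketch the “predicted” term is the Q-value of the action actually taken, which is how most DQN implementations define it, and there is no bootstrap term on terminal states.

```python
# Build the TD target ("actual") and the squared error for a single transition.
import numpy as np

gamma = 0.99  # discount factor

def td_target_and_loss(model, s, a, r, s_next, done):
    q_pred = model.predict(s[np.newaxis, ...], verbose=0)[0]        # Q(S, .) from the network
    q_next = model.predict(s_next[np.newaxis, ...], verbose=0)[0]   # Q(S', .) from the network

    # actual = R + gamma * max_A' Q(S', A'); on terminal states the next value is dropped
    actual = r if done else r + gamma * np.max(q_next)

    predicted = q_pred[a]                 # the Q-value of the action we actually took
    l2_loss = (predicted - actual) ** 2
    return actual, l2_loss
```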

So we have the inputs, labels, rewards and loss function; we can build a neural network which predicts the Q value for every action.

which means → for this state , these are the actual values for all actions.

Here is the final image

At prediction time it all seems fine, but at training time you may have some doubts about how this is gonna work??

I expect you to have a doubt here. If you don’t, here is my question to you:

“Since we don’t have labels, we said actual = R + γ max_A' Q(S', A'),

but how do we get the values for R and max_A' Q(S', A') from the environment for a particular state????”

Hint: usually we get the reward R and the next state S' only when we perform an action in the env. #thinkaboutit

There are a few tricks that I have to explain to make you understand it well. For the time being, just assume that we know the action and the reward R (we’ll see more about it down below).

Anyway, that’s how it works: we have labels, rewards, losses and parameters, so we can train the network, and it acts as a function approximator to estimate the Q values.

so now what is Deep reinforcement learning???

Deep reinforcement learning = Deep learning + Reinforcement learning

“Deep learning with no labels and reinforcement learning with no tables”.

I hope you get the idea of Deep RL.

Now let’s take a problem to understand its implementation better.

In 2013, DeepMind developed the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

They applied their method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm, and found that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them. They published this work as a paper.

I am not gonna explain the paper and the code deeply because there are a lot of awesome articles out there talking about it; here are some.

I would mainly focus on the concepts, math and some ideas.

In simple terms

  • Take the image, convert it to grayscale and crop the necessary part of the image (see the preprocessing sketch after this list).
  • Apply some convolution filters and fully connected layers, ending with the output layer.
  • Feed that image (we call it the “state”) to the network, calculate the Q-values, find the error and backpropagate.
  • Repeat this process as long as you want.
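A minimal sketch of the first step (grayscale + crop + resize), assuming OpenCV is available and the raw Atari frame is a 210×160 RGB numpy array; the crop window and the 84×84 output size follow the common DQN setup and may need tweaking for your game.

```python
# Preprocess one raw Atari frame into a small grayscale state.
import numpy as np
import cv2  # opencv-python

def preprocess(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)        # turn it into a grayscale image
    cropped = gray[34:194, :]                             # crop away the score/border area
    small = cv2.resize(cropped, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0               # scale pixels to [0, 1]
```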

There are some problems in this process which led us to the Tricks I mentioned before.

Deep Q learning Tricks

  1. Skipping frames: from just one frame we can’t observe the motion of the picture, e.g. we can’t tell whether the ball is moving up or down by looking at a single frame, so we use this trick called skipping frames, where we only take an action every 4 frames (sketched below).
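A minimal frame-skipping sketch with the classic gym API (four return values from `env.step`): the chosen action is simply repeated for 4 consecutive frames and the rewards are summed. `env` is assumed to be an Atari environment created elsewhere.

```python
# Repeat one action for `skip` frames and accumulate the reward.
def step_with_skip(env, action, skip=4):
    total_reward, done = 0.0, False
    for _ in range(skip):
        obs, reward, done, info = env.step(action)   # old gym API (4 return values)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info
```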

Earlier in this story I asked a question: how do we get the values for R and max_A' Q(S', A') from the environment for a particular state???

The answer is that we can take random actions to collect the data and learn from it. But the environment is a continuous state space, so there is a lot of correlation between one frame and the subsequent one; if we train on them in order, the network is gonna forget what it has seen long ago, which makes the network very biased and bad. So…

2. Experience replay: we store every experience (current state, current action, reward, next state) in a memory, then we take a sample of batches from that memory.

In this way, we can make training better and learning becomes more stable.
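A minimal experience replay sketch: transitions go into a bounded deque and random mini-batches are sampled from it, which breaks the correlation between consecutive frames. The capacity and batch size are just illustrative.

```python
# A simple replay memory for (state, action, reward, next_state, done) tuples.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sample of past experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```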

3. Fixed target network: since we don’t have labels, we use this equation to get the labels for training:

actual = R + γ max_A' Q(S', A')

and max_A' Q(S', A') comes from the NN’s output list of Q-values.

Here we maintain another network, which we call the target network, and we copy the actual network’s weights into the target network every N frames/iterations.

So now max_A' Q(S', A') comes from the target network’s output list of Q-values.

The reason for this is that we update the actual network’s parameters for every frame/action, so that network keeps changing; it’s not feasible to use the same constantly-changing network to calculate the target (“actual”) value.
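A minimal sketch of the fixed target network trick, assuming `q_network` and `target_network` are two Keras models with identical architectures; `sync_target` is the copy step done every N iterations, and `td_target` shows where the target network plugs into the label.

```python
# Fixed target network: copy weights periodically and use the copy for the targets.
import numpy as np

def sync_target(q_network, target_network):
    # copy the online network's weights into the target network (done every N steps)
    target_network.set_weights(q_network.get_weights())

def td_target(target_network, reward, next_state, done, gamma=0.99):
    # actual = R + gamma * max_A' Q_target(S', A'); no bootstrap term on terminal states
    if done:
        return reward
    q_next = target_network.predict(next_state[np.newaxis, ...], verbose=0)[0]
    return reward + gamma * float(np.max(q_next))
```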

4. Reward clipping: every game/environment has its own reward system; some games have points like 100, 200, 300 and so on, while others have 1, 2, 3 and so on. So to normalize the rewards and penalties uniformly across all environments, reward clipping is used (in the DQN paper, every positive reward is clipped to +1 and every negative reward to -1).
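A minimal reward clipping sketch following the DQN paper: any positive reward becomes +1, any negative reward becomes -1, and 0 stays 0.

```python
# Clip game scores of any scale down to {-1, 0, +1}.
import numpy as np

def clip_reward(reward):
    return float(np.sign(reward))
```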

so here is the final algorithm

The training network Q and the target network Q' get updated as we run more and more episodes.
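Putting the tricks together, here is a rough, condensed sketch of the training loop. It reuses the illustrative helpers from the earlier sketches (`preprocess`, `ReplayMemory`, `clip_reward`, `sync_target`) and leaves out details like frame stacking and frame skipping, so treat it as an outline rather than a full implementation.

```python
# A simplified DQN training loop (old gym API, epsilon-greedy exploration).
import random
import numpy as np

def train_dqn(env, q_network, target_network, episodes=1000,
              gamma=0.99, batch_size=32, sync_every=1000,
              epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995):
    memory = ReplayMemory()
    n_actions = env.action_space.n
    step = 0

    for episode in range(episodes):
        state = preprocess(env.reset())
        done = False
        while not done:
            # epsilon-greedy: explore randomly, otherwise act greedily on Q-values
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                q_values = q_network.predict(state[np.newaxis, ...], verbose=0)[0]
                action = int(np.argmax(q_values))

            next_obs, reward, done, _ = env.step(action)      # old gym API
            next_state = preprocess(next_obs)
            memory.add(state, action, clip_reward(reward), next_state, done)
            state = next_state
            step += 1

            # learn from a random batch of past experiences
            if len(memory) >= batch_size:
                batch = memory.sample(batch_size)
                states = np.array([b[0] for b in batch])
                targets = q_network.predict(states, verbose=0)
                next_q = target_network.predict(np.array([b[3] for b in batch]), verbose=0)
                for i, (s, a, r, s2, d) in enumerate(batch):
                    targets[i, a] = r if d else r + gamma * np.max(next_q[i])
                q_network.fit(states, targets, epochs=1, verbose=0)

            # refresh the target network every N steps
            if step % sync_every == 0:
                sync_target(q_network, target_network)

        epsilon = max(epsilon_min, epsilon * epsilon_decay)
```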

I trained this model on a 16-core CPU with 16 GB of RAM (i5 processor); it took me 2 days to produce the results below.

Here is a small snippet of the code (written with #python #keras #tensorflow #gym #ubuntu #sublime).
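A rough Keras sketch of the kind of network involved, roughly following the two-convolution-layer architecture from the 2013 paper; the layer sizes and the RMSprop settings here are common defaults, not necessarily the exact values used in the full code.

```python
# Build a small convolutional Q network: state in, one Q-value per action out.
from tensorflow.keras import layers, models, optimizers

def build_q_network(n_actions, input_shape=(84, 84, 4)):
    model = models.Sequential([
        layers.Input(shape=input_shape),                       # 4 stacked grayscale frames
        layers.Conv2D(16, 8, strides=4, activation="relu"),
        layers.Conv2D(32, 4, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_actions, activation="linear"),          # one Q-value per action
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.00025), loss="mse")
    return model
```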

You can find the full code here,

Note: the code has been slightly modified after these results; I made little tweaks, so feel free to play with it. The final model files are also available there, so you can let the computer play the game.

With that, Deep Q-learning is done!

Since DQN was introduced, a lot of improvements have been made to it, like Prioritized Experience Replay, Double DQN, Dueling DQN, etc., which we will discuss in the next stories.