Deep Reinforcement Learning Algorithm A3C (Asynchronous Advantage Actor-Critic)

阿新 • Published: 2018-11-14

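I have always felt that I only half understood the A3C algorithm, so I am organizing my notes here, both for my own record and as a reference for anyone else who wants to learn it.

　　To understand this algorithm properly you need a fairly solid grasp of DRL algorithms. I recommend getting familiar with Deep Q-learning and Policy Gradient first.

　　DRL algorithms can be roughly divided into two categories, value-based and policy-based, whose classic algorithms are Q-learning and the Policy Gradient method, respectively.

　　A3C, the subject of this post, combines a policy with a value function. The advantages and disadvantages of policy-based methods can be summarized as follows: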

　　Advantages:
　　　　1. Better convergence properties
　　　　2. Effective in high-dimensional or continuous action spaces
　　　　3. Can learn stochastic policies
　　Disadvantages:
　　　　1. Typically converge to a local rather than global optimum
　　　　2. Evaluating a policy is typically inefficient and has high variance

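　　We start with a brief background:

　　The basic RL setting consists of an agent, an environment, actions, states and rewards. The agent interacts with the environment and thereby produces a trajectory: by executing an action it causes the environment's state to change, s -> s', and the environment then gives the agent a reward (positive or negative) for the chosen action. Through repeated interaction the agent accumulates more and more experience and uses it to update its policy, which closes the loop. For simplicity we consider only a deterministic environment, i.e., taking action a in state s always leads to the same next state s'.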

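　　For clarity, we first define some notation:

　　1. A stochastic policy $\pi(s)$ determines the agent's action. Its output is not a single action but a probability distribution over actions, which sums to 1.

　　2. $\pi(a|s)$ denotes the probability of choosing action a in state s.

　　The policy $\pi$ we want to learn is therefore a function of the state s that returns the probabilities of all actions.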

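　　The agent's goal is to maximize the reward it obtains, and we express this using the expected reward. Under a probability distribution P, the expectation of a value X is:

　　$E_{P}[X] = \sum_i P_i X_i$

　　where $X_i$ are all the possible values of X and $P_i$ is the probability of each value. The expectation can be viewed as the average of the values $X_i$ weighted by $P_i$.

　　An important point here: if we had a pool of values X, the ratio of which was given by P, and we randomly picked a number of these, we would expect their mean to be $E_{P}[X]$. The mean would get closer to $E_{P}[X]$ as the number of samples rises.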
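　　To make the point about sampling concrete, here is a small sketch that compares the exact expectation with sample means of increasing size. The values and probabilities are made up for illustration, and the function names are my own.

```python
import random

def expectation(values, probs):
    """Weighted average of the values X_i with weights P_i."""
    return sum(p * x for x, p in zip(values, probs))

def sample_mean(values, probs, n, seed=0):
    """Mean of n values drawn at random according to probs."""
    rng = random.Random(seed)
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n

values = [1.0, 2.0, 10.0]   # illustrative X_i
probs = [0.5, 0.3, 0.2]     # illustrative P_i, summing to 1

exact = expectation(values, probs)
for n in (10, 1000, 100000):
    # The sample mean should drift toward the exact expectation as n grows.
    print(n, sample_mean(values, probs, n), "exact:", exact)
```

　　With more samples the printed mean should settle near the exact value, which is why averaging over recorded samples is a reasonable stand-in for an expectation later on.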

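　　Next we define the value function V(s) of policy $\pi$ as the expected discounted return, which can be written recursively with discount factor $\gamma$:

　　$V(s) = E_{\pi(s)}[\, r + \gamma V(s') \,]$

　　This says that the return obtainable from the current state s is the sum of the reward r received during the transition and the discounted return obtainable from the next state s'.

　　There is also the action-value function Q(s, a), which is closely related to the value function:

　　$Q(s, a) = r + \gamma V(s')$

　　We can now define a new function A(s, a), called the advantage function:

　　$A(s, a) = Q(s, a) - V(s)$

　　It expresses how good it is to choose action a in state s. If action a is better than average, the advantage is positive; otherwise it is negative.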
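　　As a worked example of these three quantities, the sketch below evaluates Q, V and A for a single state with two actions. The rewards, successor values, policy and discount factor are all invented numbers, and the environment is assumed deterministic as stated earlier.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical one-step problem: from state "s", each action yields a reward
# and leads deterministically to a successor state whose value is known.
reward = {"left": 1.0, "right": 0.0}
next_value = {"left": 0.0, "right": 2.0}   # V(s') of each action's successor
policy = {"left": 0.5, "right": 0.5}       # pi(a|s)

def q_value(a):
    """Q(s, a) = r + gamma * V(s')."""
    return reward[a] + GAMMA * next_value[a]

def state_value():
    """V(s): expectation of Q(s, a) under the policy."""
    return sum(policy[a] * q_value(a) for a in policy)

def advantage(a):
    """A(s, a) = Q(s, a) - V(s)."""
    return q_value(a) - state_value()

for a in policy:
    print(a, q_value(a), advantage(a))
```

　　Here "right" is better than the policy's average and gets a positive advantage, while "left" gets a negative one. Weighted by the policy, the advantages sum to zero.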

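　　Policy Gradient:

　　When we built a DQN agent, we used a neural network to approximate the Q(s, a) function. Here we take a different approach: since the policy $\pi$ is a function of the state s, we can estimate the policy directly from the state input.

　　Our network takes the state s as input and outputs an action probability distribution $\pi_\theta$.

　　At run time we can either sample an action from this distribution or simply pick the action with the highest probability.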
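　　The two ways of choosing an action can be sketched as follows. I am assuming the network ends in a softmax over raw scores (logits), which the text does not state explicitly, and the logits here are placeholder numbers rather than real network outputs.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an action index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_action(probs):
    """Pick the action index with the highest probability."""
    return max(range(len(probs)), key=lambda a: probs[a])

logits = [2.0, 1.0, 0.1]          # hypothetical network outputs for one state
probs = softmax(logits)
rng = random.Random(0)

print("pi(s) =", probs)
print("sampled:", sample_action(probs, rng), "greedy:", greedy_action(probs))
```

　　Sampling keeps some exploration during training, while the greedy choice is a common option once you only want to evaluate the learned policy.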

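　　To obtain a better policy, however, we have to update it. How do we optimize it? We need some metric that measures how good a policy is.

　　We define a function $J(\pi)$ as the discounted reward a policy can obtain, averaged over all starting states $s_0$, where $\rho^{s_0}$ denotes the distribution of starting states:

　　$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$

　　This function does express how good a policy is. The problem is that it is hard to estimate. The good news is: we don't have to.

　　All we need to care about is how to improve it. If we know the gradient of this function, that becomes trivial.

　　There is a convenient way to compute this gradient, where $\rho^\pi$ is the distribution of states visited under $\pi$:

　　$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$

　　The jump from the objective to this gradient is a bit abrupt. We skip the derivation for now and take the result as given; I give a more detailed derivation later.

　　For reference, see the original Policy Gradient paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation

　　or David Silver's YouTube course: https://www.youtube.com/watch?v=KHZVXao4qXs

　　Briefly, there are two terms inside the expectation:

　　The first term is the advantage function, i.e., the advantage of choosing this action. It is negative when the action is worse than average and positive when it is better than average. It is a scalar.

　　The second term tells us the direction in which the log-probability of taking action a in state s increases.

　　Multiplying the two together, we find that the likelihood of actions that are better than average is increased, and the likelihood of actions worse than average is decreased.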
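　　Here is a minimal sketch of that effect for a single sample. It assumes the policy is a softmax over per-action logits, which is my simplification and not something the text prescribes. For that parameterization the gradient of log π(a|s) with respect to the logits is the one-hot vector of a minus π.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step along advantage * grad log pi(action)."""
    g = grad_log_prob(logits, action)
    return [z + lr * advantage * gk for z, gk in zip(logits, g)]

logits = [0.0, 0.0, 0.0]   # uniform policy over three actions
better = policy_gradient_step(logits, action=1, advantage=+1.0)
worse = policy_gradient_step(logits, action=1, advantage=-1.0)

print("before:", softmax(logits)[1])
print("A > 0 :", softmax(better)[1])   # probability of action 1 goes up
print("A < 0 :", softmax(worse)[1])    # probability of action 1 goes down
```

　　A real implementation would average this quantity over many samples and let the deep learning framework differentiate through the whole network. The sketch only shows the sign behaviour described above.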

　　Fortunately, running an episode with a policy π yields samples distributed exactly as we need. States encountered and actions taken are indeed an unbiased sample from the $\rho^\pi$ and π(s) distributions. That's great news. We can simply let our agent run in the environment and record the (s, a, r, s') samples. When we have gathered enough of them, we use the formula above to find a good approximation of the gradient $\nabla_\theta\;J(\pi)$. We can then use any of the existing gradient-based optimization techniques to improve our policy.

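　　Actor-Critic:

　　The first thing we need to compute is the advantage function A(s, a). Expanding it:

　　$A(s, a) = Q(s, a) - V(s) = r + \gamma V(s') - V(s)$

　　A sample from a single run gives us an unbiased estimate of Q(s, a). So all we need to know in order to compute A(s, a) is V(s).

　　This value function is easy to approximate with a neural network, just like the action-value function in DQN. It is in fact simpler, because each state has only a single value.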
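　　In code, the one-sample advantage estimate needs only the recorded reward and two outputs of the critic. The sketch below is illustrative: the `done` flag for terminal states and the value of GAMMA are my assumptions, and the critic's predictions are passed in as plain numbers.

```python
GAMMA = 0.99  # assumed discount factor

def advantage_estimate(r, v_s, v_next, done=False):
    """One-sample estimate A(s, a) ~ r + gamma * V(s') - V(s).

    v_s and v_next stand for the critic's predictions of V(s) and V(s').
    If the episode ended at s', there is no future return to bootstrap from.
    """
    target = r if done else r + GAMMA * v_next
    return target - v_s

# Hypothetical sample (s, a, r, s') with critic outputs V(s)=0.5, V(s')=1.0
print(advantage_estimate(r=1.0, v_s=0.5, v_next=1.0))
print(advantage_estimate(r=1.0, v_s=0.5, v_next=1.0, done=True))
```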

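　　We can predict the policy and the value function jointly, with a single network that produces both.

　　Here we have two things to optimize: the actor and the critic.

　　actor: optimizes the policy so that it performs better and better;

　　critic: tries to estimate the value function so that it becomes more accurate;

　　Both roles come from the Policy Gradient Theorem:

　　$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$

　　Put simply: the actor acts, and the critic evaluates whether the chosen action was good or bad.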

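　　Parallel agents:

　　If only a single agent collects the samples, the samples are very likely to be highly correlated, which causes problems for the machine learning model, because such models generally assume that the samples are independent and identically distributed. DQN introduces experience replay to overcome this. But that makes the learning offline: you sample first, store the samples, and only then update your parameters.

　　So the question is: can we learn online and still break this strong correlation?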

　　We can run several agents in parallel, each with its own copy of the environment, and use their samples as they arrive.

　　1. Different agents will likely experience different states and transitions, thus avoiding the correlation.

　　2. Another benefit is that this approach needs much less memory, because we don’t need to store the samples.
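　　The sketch below shows the shape of this idea with plain Python threads: every worker owns its environment copy and hands its samples to a shared queue. `ToyEnv`, the random placeholder policy and the queue-based hand-off are all hypothetical stand-ins of mine, not the structure of the referenced implementation. For a deterministic demo the samples are drained after the workers finish, whereas a real learner would consume them while the workers are still running.

```python
import queue
import random
import threading

class ToyEnv:
    """Hypothetical stand-in environment: a random walk on the integers."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        prev = self.state
        self.state += action
        reward = -abs(self.state)          # illustrative reward: stay near 0
        return prev, action, reward, self.state

def worker(agent_id, n_steps, out):
    env = ToyEnv(seed=agent_id)            # each agent owns its own environment
    for _ in range(n_steps):
        action = env.rng.choice([-1, 1])   # placeholder for sampling from pi(s)
        out.put((agent_id, env.step(action)))

samples = queue.Queue()
threads = [threading.Thread(target=worker, args=(i, 5, samples)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

received = []
while not samples.empty():
    received.append(samples.get())
print(len(received), "samples from agents", sorted({a for a, _ in received}))
```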

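　　Another very important concept is the n-step return.

　　Usually, when we compute the Q(s, a), V(s) or A(s, a) functions, we use only the 1-step return:

　　$V(s_0) \rightarrow r_0 + \gamma V(s_1)$

　　In this case we use the immediate reward from the sample (s0, a0, r0, s1), and the function's predicted value at the next step supplies the rest of the approximation. But we can use more steps to get another estimate:

　　$V(s_0) \rightarrow r_0 + \gamma r_1 + \gamma^2 V(s_2)$

　　or the n-step return:

　　$V(s_0) \rightarrow r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$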

　　The n-step return has the advantage that changes in the approximated function get propagated much more quickly. Let's say that the agent experienced a transition with an unexpected reward. In the 1-step return scenario, the value function would only change slowly, one step backwards with each iteration. With the n-step return, however, the change is propagated n steps backwards each iteration, and thus much more quickly.

　　The n-step return has its drawbacks. It has higher variance because the value depends on a chain of actions which can lead into many different states. This might endanger convergence.
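　　Computing the n-step return is a short backward loop over the recorded rewards. In this sketch the function name and the example numbers are mine, and `bootstrap_value` stands for the critic's estimate of V(s_n).

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n).

    rewards holds r_0 .. r_(n-1); bootstrap_value is the estimate of V(s_n).
    """
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret   # fold one more step in, working backwards
    return ret

# n = 1 reduces to the 1-step target r_0 + gamma * V(s_1)
print(n_step_return([1.0], bootstrap_value=2.0, gamma=0.5))
# n = 2 gives r_0 + gamma * r_1 + gamma^2 * V(s_2)
print(n_step_return([1.0, 1.0], bootstrap_value=4.0, gamma=0.5))
```

　　If the episode ends within the n steps, a common convention is to pass 0 as the bootstrap value, since no return follows a terminal state.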

　　Putting all of this together gives the Asynchronous Advantage Actor-Critic algorithm, A3C.

　　That covers the algorithmic side of A3C. Now let's look at it from a coding perspective:

　　For an implementation based on Python + Keras + gym, see this GitHub link: https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

　　In the overall flow of that implementation, the most important part is the definition of the loss function:

　　$L = L_{\pi} + c_v L_v + c_{reg} L_{reg}$

　　where $L_{\pi}$ is the loss of the policy, $L_v$ is the value error and $L_{reg}$ is a regularization term. The last two parts are multiplied by the constants $c_v$ and $c_{reg}$, which determine which part we stress more.

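　　The three parts are described in turn below:

　　1. Policy Loss:

　　We define the objective function $J(\pi)$ as follows, where $\rho^{s_0}$ is the distribution of starting states:

　　$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$

　　This is the total reward an agent can achieve under policy $\pi$, averaged over all starting states.

　　By the Policy Gradient Theorem, the gradient of this function is:

　　$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$

　　We want to maximize this function, so the corresponding loss is its negative:

　　$L_\pi = -J(\pi)$

　　To let the framework compute the gradient for us, we treat A(s, a) as a constant and rewrite the loss in a form that has exactly this gradient:

　　$L_\pi = -E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \log \pi(a|s) \,]$

　　We then approximate the expectation by averaging over all n samples in the minibatch. The final loss is:

　　$L_\pi = -\frac{1}{n} \sum_{i=1}^{n} A(s_i, a_i) \cdot \log \pi(a_i|s_i)$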
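　　As a plain-Python sketch of the minibatch policy loss: `action_probs` holds π(a_i|s_i) for the actions actually taken and `advantages` holds the matching A(s_i, a_i), which are treated as fixed numbers. Both names and all values are illustrative. In a real framework you would also stop gradients from flowing through the advantages.

```python
import math

def policy_loss(action_probs, advantages):
    """-(1/n) * sum_i A(s_i, a_i) * log pi(a_i|s_i)."""
    n = len(action_probs)
    return -sum(a * math.log(p) for p, a in zip(action_probs, advantages)) / n

# Raising the probability of an action with positive advantage lowers the loss.
print(policy_loss([0.5], [1.0]))
print(policy_loss([0.9], [1.0]))
```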

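　　2. Value Loss:

　　The true value function V(s) should satisfy the Bellman equation:

　　$V(s_0) = r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$

　　Our estimate of V(s) should converge to it, so from the equation above we can compute the error:

　　$e = r_0 + \gamma r_1 + \dots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n) - V(s_0)$

　　This part can be confusing, and it confused me at first too: where does the ground truth come from?

　　It is built from the sampled transitions. We compare the network's estimate for $s_0$ with a target made of the sampled rewards plus the network's estimate for $s_n$, so the error is the gap between these two value estimates.

　　So we define $L_v$ as the mean squared error over all n samples:

　　$L_v = \frac{1}{n} \sum_{i=1}^{n} e_i^2$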
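　　A small sketch of this mean squared error in plain Python. `targets` stands for the return targets built from sampled rewards plus a bootstrapped value, and `predictions` for the critic's current V(s) estimates. The names and numbers are illustrative only.

```python
def value_loss(targets, predictions):
    """Mean squared error between return targets and predicted values."""
    n = len(targets)
    return sum((t - v) ** 2 for t, v in zip(targets, predictions)) / n

# Two hypothetical samples: the first prediction is off by 1, the second is exact.
print(value_loss(targets=[1.0, 2.0], predictions=[0.0, 2.0]))
```

　　As with the advantage in the policy loss, the targets are usually treated as constants, so only the prediction for the starting state is pushed toward the target.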

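　　3. Regularization with Policy Entropy:

　　Why add this term? While the agent interacts with the environment we want to balance exploration and exploitation: we want to try other actions with some probability, so that the collected samples do not become too concentrated. We therefore introduce the entropy $H(\pi(s)) = -\sum_k \pi(a_k|s) \log \pi(a_k|s)$ to keep the output distribution more balanced. For example:

　　The fully deterministic policy [1, 0, 0, 0] has an entropy of 0, while the totally uniform policy [0.25, 0.25, 0.25, 0.25] has the largest entropy possible for a distribution over four values.

　　To make the output distribution more balanced we want to maximize this entropy, which means minimizing the negative entropy: $L_{reg} = -\frac{1}{n} \sum_{i=1}^{n} H(\pi(s_i))$.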
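　　The two example policies can be checked directly. This sketch uses the natural logarithm and the usual convention that terms with zero probability contribute nothing. The function names are my own.

```python
import math

def entropy(probs):
    """H(pi) = -sum_k p_k * log p_k, skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_loss(batch_probs):
    """Negative mean entropy over a batch of action distributions."""
    return -sum(entropy(p) for p in batch_probs) / len(batch_probs)

deterministic = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(deterministic))   # zero: no uncertainty about the action
print(entropy(uniform))         # log(4), the maximum for four actions
print(entropy_loss([deterministic, uniform]))
```

　　Minimizing `entropy_loss` pushes the distributions toward the uniform one, so the constant that multiplies this term controls how strongly exploration is encouraged.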

　　In short, we can use an existing deep learning framework to minimize this total loss and thereby optimize the network parameters.

　　References:

　　1. https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

　　2. https://jaromiru.com/2017/03/26/lets-make-an-a3c-implementation/

　　3. https://www.youtube.com/watch?v=KHZVXao4qXs

　　4. https://github.com/ikostrikov/pytorch-a3c

　　======================================================

　　 Computing the gradient of the Policy Gradient objective:

　　======================================================

　　Reference paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation (NIPS 2000, MIT Press)

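　　Many earlier algorithms were based on value functions. Although they made great progress, this approach has two limitations:
　　First, these methods aim to find a deterministic policy, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities;

　　Second, an arbitrarily small change in the estimated value of an action can decide whether or not that action is selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence guarantees.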

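　　Policy gradient methods look at the problem from a different angle. Our goal is simply to learn a policy that maps states to actions, so do we really have to learn a value function first? We can instead feed in a state, pass it through a neural network and output a distribution over actions, treating the network's parameters $\theta$ as the adjustable parameters of the policy. Let $\rho$ denote the actual performance of the policy, i.e., the averaged reward per step. We can then take the partial derivative of $\rho$ with respect to the parameters and update them in that direction, with a positive step size $\alpha$:

　　$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta}$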