Deep Reinforcement Learning CS294 HW1: Imitation Learning

阿新 • Published: 2018-12-06

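I finally finished the first assignment. The results look fairly poor, but I could not tune them any better, so I am leaving it as it is.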

Section 1

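The first part is setting up the environment, and there is not much to say about it. I used anaconda and ran pip install directly on the requirements.txt in the homework 1 folder. MuJoCo needs a license key to be activated; you can request a free one from the official website with a student email address, and it is valid for one year.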

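The six environments used in this assignment are shown below (the image is borrowed from elsewhere):
[Image: the six environments]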

Section 2

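Now to the main topic. This assignment is about imitation learning in gym simulation environments, which depend on the MuJoCo simulator. Expert policies are provided for the six environments. Running run_expert.py generates the corresponding state-action data, and imitation learning is then performed on that data.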

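The first step is naturally to read and understand run_expert.py and then run it, putting the generated data into the expert_data folder. I used the default arguments: 20 rollouts of at most 1000 steps each, which gives 20000 state-action pairs. In some tasks, of course, a rollout ends before reaching 1000 steps.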
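The data-collection step can be sketched roughly as follows. This is only an illustration, not the actual run_expert.py: `toy_expert`, `toy_step`, the early-termination rule, and the `observations`/`actions` dictionary keys are all hypothetical stand-ins for the gym environment, the provided expert policy, and the saved data format. It shows why the number of collected pairs is at most rollouts × max steps.

```python
import random

def toy_expert(obs):
    # Hypothetical expert: push the state back toward zero.
    return -0.5 * obs

def toy_step(obs, action, rng):
    # Hypothetical dynamics: the action plus a little noise moves the state.
    next_obs = obs + action + rng.gauss(0.0, 0.1)
    done = abs(next_obs) > 2.0  # the episode ends early if the state drifts too far
    return next_obs, done

def collect_rollouts(policy, num_rollouts=20, max_steps=1000, seed=0):
    rng = random.Random(seed)
    observations, actions = [], []
    for _ in range(num_rollouts):
        obs = rng.uniform(-1.0, 1.0)
        for _ in range(max_steps):
            action = policy(obs)
            # Record the state together with the action the expert chose in it.
            observations.append(obs)
            actions.append(action)
            obs, done = toy_step(obs, action, rng)
            if done:
                break  # a rollout may stop before max_steps
    return {"observations": observations, "actions": actions}

data = collect_rollouts(toy_expert)
print(len(data["observations"]), "state-action pairs (at most 20 * 1000)")
```

Any rollout that terminates early contributes fewer than 1000 pairs, so 20000 is an upper bound rather than a guarantee.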

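The second step is imitation learning:
The idea of imitation learning is simple: it is essentially supervised learning. Given state-action pairs, you fit the policy function, which is just a regression problem.
Step one already generated data from the expert policy, so all that remains is to run a regression on it. After hammering out some code, two or three fully connected layers are about enough. The state is the input x and the action is the corresponding target y.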
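The regression described above can be sketched as a small fully connected network trained with mean squared error. This is a minimal NumPy illustration and not the author's implementation: the layer sizes, learning rate, step count, and the synthetic data standing in for the expert rollouts are all assumptions.

```python
import numpy as np

def init_mlp(sizes, rng):
    # One (weights, bias) pair per layer, with small random initial weights.
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # Returns the activations of every layer: tanh on hidden layers, linear output.
    activations = [x]
    for i, (w, b) in enumerate(params):
        z = activations[-1] @ w + b
        activations.append(z if i == len(params) - 1 else np.tanh(z))
    return activations

def mse_grads(params, x, y):
    # Backpropagate the mean-squared-error loss through the network.
    acts = forward(params, x)
    delta = 2.0 * (acts[-1] - y) / y.size
    grads = []
    for i in reversed(range(len(params))):
        w, _ = params[i]
        grads.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            delta = (delta @ w.T) * (1.0 - acts[i] ** 2)  # tanh derivative
    return grads[::-1]

def train_bc(obs, act, hidden=(64, 64), steps=1000, batch_size=200, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    params = init_mlp([obs.shape[1], *hidden, act.shape[1]], rng)
    for _ in range(steps):
        idx = rng.integers(0, len(obs), size=batch_size)  # random minibatch
        grads = mse_grads(params, obs[idx], act[idx])
        params = [(w - lr * gw, b - lr * gb)
                  for (w, b), (gw, gb) in zip(params, grads)]
    return params

def mse(params, x, y):
    return float(np.mean((forward(params, x)[-1] - y) ** 2))

# Synthetic "expert" data standing in for the real state-action pairs.
data_rng = np.random.default_rng(1)
obs = data_rng.normal(size=(2000, 4))                   # states are the inputs x
act = np.tanh(obs @ data_rng.normal(size=(4, 2)))       # actions are the targets y
untrained = init_mlp([4, 64, 64, 2], np.random.default_rng(0))
policy = train_bc(obs, act)
print("MSE before:", mse(untrained, obs, act), "after:", mse(policy, obs, act))
```

A low regression loss on expert states does not by itself guarantee good returns once the policy is run in the environment, since small errors can push it into states the expert never visited.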

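Then training begins. There are no particular requirements on the hyperparameters, but however I tuned them I could not get good results, which was frustrating.
[Image: imitation learning results on the six environments at different numbers of iterations]
These are the results from my earliest training runs. After most of a day of tuning I got considerably better numbers, but nothing fundamentally different; training for a few more epochs probably helped a little. In short, the results were far worse than the expert policy.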

Section 3

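Plain imitation learning evidently does not work very well, but there is a better algorithm, DAgger. DAgger focuses on the mismatch between the states in the training set and the states encountered when the learned policy is actually run. The algorithm is as follows:
[Image: the DAgger algorithm]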

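It is really just a process of repeatedly aggregating datasets. After enough iterations, the training set and the states the policy encounters converge to the same distribution. (It feels a bit like adding the test set to the training set, lol.)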

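There is not much to say about the code: first train a model with the BC algorithm, then run it in the environment. Save the states visited during that run and label them with the expert policy policy_fn (this corresponds to step 3 in the algorithm figure), then merge the data and continue training.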
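The loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the author's code: `expert_policy` is a hypothetical stand-in for policy_fn, the environment is a made-up 1-D system, and the learner is a least-squares linear policy rather than a neural network.

```python
import numpy as np

def expert_policy(obs):
    # Hypothetical expert standing in for policy_fn: steer the state toward zero.
    return -np.clip(obs, -1.0, 1.0)

def rollout(policy, rng, steps=50):
    # Run `policy` in a toy 1-D environment and return the states it visited.
    obs = rng.uniform(-2.0, 2.0, size=1)
    visited = []
    for _ in range(steps):
        visited.append(obs.copy())
        obs = obs + policy(obs) + rng.normal(0.0, 0.05, size=1)
    return np.array(visited)

def features(obs):
    # Linear features: the state itself plus a constant bias term.
    return np.concatenate([obs, np.ones_like(obs)], axis=-1)

def fit_policy(observations, actions):
    # Supervised step: least-squares fit of a linear policy to the labels.
    weights, *_ = np.linalg.lstsq(features(observations), actions, rcond=None)
    return lambda obs: features(obs) @ weights

def dagger(num_iters=5, rollouts_per_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start with behavioral cloning on states visited by the expert.
    obs = np.concatenate([rollout(expert_policy, rng) for _ in range(rollouts_per_iter)])
    act = expert_policy(obs)
    policy = fit_policy(obs, act)
    sizes = [len(obs)]
    for _ in range(num_iters):
        # Run the learned policy to see which states it actually reaches.
        new_obs = np.concatenate([rollout(policy, rng) for _ in range(rollouts_per_iter)])
        # Ask the expert to label those states, then aggregate and retrain.
        new_act = expert_policy(new_obs)
        obs = np.concatenate([obs, new_obs])
        act = np.concatenate([act, new_act])
        policy = fit_policy(obs, act)
        sizes.append(len(obs))
    return policy, sizes

policy, sizes = dagger()
print("dataset size after each round:", sizes)
```

The key point is that the new states come from the learned policy while the labels always come from the expert, so the aggregated dataset gradually covers the states the learner itself tends to visit.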

However, I really could not tune it into producing good results, so I had to call it gg.

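[Image: DAgger compared with BC on one environment]
This figure shows the one environment, out of the six, where my DAgger implementation beats BC most clearly, yet it is still far worse than the expert. The hyperparameters were batch_size=200, Iteration=200, epoch=20. The comparison is not entirely rigorous. The BC baseline has only 20000 samples, so one epoch makes batch_size*iteration/20000 = 2 passes over the data. DAgger uses the same settings, but after every epoch it adds another 20000 new samples. For example, at epoch=3 BC has iterated over 3*200*200 = 120000 samples, but since there are only 20000 in total, each sample has been used 6 times. DAgger has also iterated over 3*200*200 samples, but its dataset holds 60000, so each sample has been used only 2 times.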
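The arithmetic above can be checked with a few lines. The helper name `passes_per_sample` is made up for illustration, and the DAgger figure uses the same simplification as the text: it divides by the dataset size at epoch 3 (60000), even though the dataset was smaller during the earlier epochs.

```python
batch_size = 200
iterations = 200        # gradient steps per epoch
expert_pairs = 20000    # 20 rollouts x 1000 steps

def passes_per_sample(epochs, dataset_size):
    # Total samples drawn so far divided by the number of distinct samples.
    return epochs * batch_size * iterations / dataset_size

epochs = 3
bc_passes = passes_per_sample(epochs, expert_pairs)               # fixed 20000 samples
dagger_passes = passes_per_sample(epochs, epochs * expert_pairs)  # 60000 samples by epoch 3
print(bc_passes, dagger_passes)  # 6.0 2.0
```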

My code is on GitHub (cs294 homework). The results are not great, but a star would still be appreciated.

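While I am at it, I would like to recommend an implementation I saw on Zhihu, 強化學習傳說, whose results are very good. I changed my network architecture and hyperparameters to match his and my results were still barely alive, which I cannot explain. It is probably a version issue: he did the spring offering of the course, and the MuJoCo version is a bit different. Whatever, that is it for this assignment.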
