Reinforcement Learning Series, Part 5: Value Approximation

阿新 • Published: 2019-01-17

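Introduction

The previous posts covered Monte Carlo (MC) and temporal-difference (TD) methods. Those methods mostly target discrete state spaces, whereas continuous states are hard to represent in a table. Reinforcement learning usually handles this case in one of two ways. One works on the value function, which is the value approximation covered in this post; the other is policy gradient, which is covered in a later post.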

The value approximation method

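At its core, value approximation uses a parameterized function to approximate the return of a state. For a state $S$ in the middle of a sequence, we use a parametric function $\hat{v} (S, w)$ to approximate the observed true value $v_{\pi} (S)$. Learning uses ordinary gradient descent, and every step of an observed sequence can serve as one training example. The value function can also take the action $a$ as an input, which gives an approximation of the $Q$ function, $\hat{q} (S, a, w)$.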
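As a minimal sketch of the gradient descent step described above, assume the approximator is linear in a feature vector of the state. The feature vector `x`, the target value, and the step size `alpha` below are hypothetical values chosen only for illustration.

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate: dot product of weights and state features."""
    return float(np.dot(w, x))

def sgd_update(w, x, target, alpha=0.1):
    """One gradient step on 0.5 * (target - v_hat)^2.

    For a linear model the gradient of v_hat with respect to w is x.
    """
    return w + alpha * (target - v_hat(x, w)) * x

w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])  # hypothetical features of a state S
target = 2.0                    # hypothetical observed return for S

error_before = abs(target - v_hat(x, w))
for _ in range(50):
    w = sgd_update(w, x, target)
error_after = abs(target - v_hat(x, w))
```

Each call moves the estimate for this state a little closer to the target, so `error_after` ends up much smaller than `error_before`.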

Example

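Problem description: a car starts at the bottom of a valley and has to drive up to the goal, but its engine is not powerful enough to get there directly. The best strategy is therefore to first drive up the left side of the valley and then accelerate, using the momentum gained to reach the goal.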

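The state is described by the horizontal position $x_{t}$ and the velocity $\dot{x}_{t}$.
The action space has 3 actions, $- 1, 0, 1$, meaning full throttle to the left, no throttle, and full throttle to the right.
The state is updated as follows, where $A_{t}$ is the action taken at step $t$:
$x_{t + 1} = \text{bound} [x_{t} + \dot{x}_{t + 1}]$
$\dot{x}_{t + 1} = \text{bound} [\dot{x}_{t} + 0.001 A_{t} - 0.0025 \cos (3 x_{t})]$

Here bound clips a value to its allowed range: the position satisfies $- 1.2 \leq x_{t} \leq 0.5$ and the velocity satisfies $- 0.07 \leq \dot{x}_{t} \leq 0.07$. When $x_{t}$ reaches the left boundary, the velocity is reset to zero.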
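The update rule can be sketched in a few lines of Python. This is an illustration of the equations, not the gym implementation; the names `bound` and `step` are hypothetical, and the lower position limit of -1.2 follows gym's MountainCar-v0.

```python
import math

X_MIN, X_MAX = -1.2, 0.5  # position limits (lower limit as in gym's MountainCar-v0)
V_MAX = 0.07              # speed limit

def bound(value, low, high):
    """Clip value to the interval [low, high]."""
    return max(low, min(high, value))

def step(x, v, action):
    """Advance one time step; action is -1, 0, or 1."""
    # The velocity is updated first, then the position uses the new velocity.
    v_next = bound(v + 0.001 * action - 0.0025 * math.cos(3 * x), -V_MAX, V_MAX)
    x_next = bound(x + v_next, X_MIN, X_MAX)
    if x_next == X_MIN:
        v_next = 0.0  # reaching the left boundary resets the velocity
    return x_next, v_next

# Roll the dynamics forward with full throttle to the right from near the valley floor.
x, v = -0.5, 0.0
trajectory = [x]
for _ in range(200):
    x, v = step(x, v, 1)
    trajectory.append(x)
```

Inspecting `trajectory` shows how far a single constant action carries the car, which is a quick way to see why the problem needs a learned policy.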

This example uses Q-learning with value approximation, with a linear function representing the Q function.
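For reference, one common way to write Q-learning with a linear approximator is sketched below. The notation is an assumption rather than the article's: $\phi(S)$ is a feature vector of the state, $w_a$ is a separate weight vector for each action (matching the one-regressor-per-action layout of the key code), $\alpha$ is the step size, and $\gamma$ is the discount factor.

```latex
\hat{q}(S, a, w_a) = w_a^{\top} \phi(S)

w_a \leftarrow w_a + \alpha \left[ R + \gamma \max_{a'} \hat{q}(S', a', w_{a'}) - \hat{q}(S, a, w_a) \right] \phi(S)
```

The bracketed term is the TD error: the difference between the bootstrapped target and the current estimate for the action that was actually taken.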

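Experimental environment

The experiment is based on the MountainCar-v0 environment from the gym package provided by OpenAI. OpenAI provides many game environments, all of which can be used for reinforcement learning experiments.
OpenAI currently supports macOS and Linux. The latest version of gym can be installed directly with pip install gym. With Python 2.7, however, installing the latest version, 0.9.6, may fail with the error cannot import name spaces; installing 0.9.5 avoids the problem.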

Key code

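import itertools

import gym
import numpy as np
from sklearn.linear_model import SGDRegressor

# The code below relies on three module-level globals: `env` (created here),
# plus `scaler` and `featurizer`, which must be fitted before Estimator is built.
env = gym.make("MountainCar-v0")


class Estimator(object):
    def __init__(self):
        # One linear model per action; together they represent Q(s, a).
        self.models = []
        for _ in range(env.action_space.n):
            model = SGDRegressor(learning_rate="constant")
            # partial_fit must be called once so that predict() works before any real update.
            model.partial_fit([self.feature_state(env.reset())], [0])
            self.models.append(model)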

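    def predict(self, s, a=None):
        # Return Q(s, a) for one action, or the Q-values of all actions when a is None.
        s = self.feature_state(s)
        if a is not None:  # compare against None so that action 0 is handled correctly
            return self.models[a].predict([s])[0]
        else:
            return [self.models[m].predict([s])[0] for m in range(env.action_space.n)]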

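    def update(self, s, a, target):
        # One SGD step that moves Q(s, a) toward the given target.
        s = self.feature_state(s)
        self.models[a].partial_fit([s], [target])

    def feature_state(self, s):
        # Standardize the raw state, then map it to the RBF feature vector.
        return featurizer.transform(scaler.transform([s]))[0]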

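def make_epsilon_greedy_policy(estimator, nA, epsilon):
    # Build a policy that returns action probabilities for a given observation.
    def epsilon_greedy_policy(observation):
        best_action = np.argmax(estimator.predict(observation))
        # Every action gets epsilon / nA; the greedy action gets the remaining 1 - epsilon on top.
        A = np.ones(nA, dtype=np.float32) * epsilon / nA
        A[best_action] += 1 - epsilon
        return A

    return epsilon_greedy_policy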


def Q_learning_with_value_approximation(env, estimator, epoch_num,
                                        discount_factor=1.0, epsilon=0.1, epsilon_decay=1.0):

    # stats = plotting.EpisodeStats(
    #     episode_lengths=np.zeros(epoch_num),
    #     episode_rewards=np.zeros(epoch_num))
    for i_epoch_num in range(epoch_num):

        # Epsilon shrinks geometrically with the episode index.
        policy = make_epsilon_greedy_policy(
            estimator, env.action_space.n, epsilon * epsilon_decay ** i_epoch_num)
        state = env.reset()

        for it in itertools.count():

            # Sample an action from the epsilon-greedy distribution.
            action_probs = policy(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)

            next_state, reward, done, _ = env.step(action)
            # Q-learning target: reward plus the discounted best Q-value of the next state.
            q_values_next = estimator.predict(next_state)
            td_target = reward + discount_factor * np.max(q_values_next)
            estimator.update(state, action, td_target)

            # stats.episode_rewards[i_epoch_num] += reward
            # stats.episode_lengths[i_epoch_num] = it
            print("\rStep {} @ Episode {}/{}".format(it, i_epoch_num + 1, epoch_num))

            if done:
                print(it)  # index of the last step of this episode
                break
            state = next_state

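Here the two state variables are transformed with RBF kernels into a one-dimensional feature vector of length 400, and an ordinary SGDRegressor is used as the regressor.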
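The key code uses `scaler` and `featurizer` without showing how they are built. One possible construction is sketched below, assuming scikit-learn's `RBFSampler` is used to approximate the RBF kernel. The four `gamma` values, the split into 4 x 100 components, and the uniform sampling of example states from assumed bounds are illustrative choices, not taken from the article.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Assumed state bounds (position, velocity); uniform samples stand in for
# observations that would normally be drawn from the environment.
low = np.array([-1.2, -0.07])
high = np.array([0.5, 0.07])
observation_examples = rng.uniform(low, high, size=(10000, 2))

# Standardize the states so both dimensions have zero mean and unit variance.
scaler = StandardScaler()
scaler.fit(observation_examples)

# Four RBF samplers with different widths, 100 components each: 400 features in total.
featurizer = FeatureUnion([
    ("rbf1", RBFSampler(gamma=5.0, n_components=100, random_state=0)),
    ("rbf2", RBFSampler(gamma=2.0, n_components=100, random_state=1)),
    ("rbf3", RBFSampler(gamma=1.0, n_components=100, random_state=2)),
    ("rbf4", RBFSampler(gamma=0.5, n_components=100, random_state=3)),
])
featurizer.fit(scaler.transform(observation_examples))

def feature_state(s):
    """Map a raw (position, velocity) state to its 400-dimensional feature vector."""
    return featurizer.transform(scaler.transform([s]))[0]

features = feature_state([-0.5, 0.0])
```

Mixing several kernel widths lets the linear model capture both coarse and fine structure in the value function, and the names `scaler` and `featurizer` match the globals that `Estimator.feature_state` expects.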

Results

The cost function value after running 100 episodes:
[figure]

