
【04】Introductory Notes on Monte Carlo Tree Search


Monte Carlo Tree Search Study Notes


1. Reinforcement Learning (RL)

Concept

Reinforcement learning is a field of machine learning that studies how an agent should act in an environment so as to maximize expected cumulative reward. It draws inspiration from behaviorism in psychology: an organism, stimulated by rewards or punishments from its environment, gradually forms expectations about those stimuli and develops the habitual behavior that yields the greatest benefit. Because the approach is so general, it is studied in many other fields as well, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent learning, swarm intelligence, statistics, and genetic algorithms. In the context of operations research and control theory, reinforcement learning is known as "approximate dynamic programming" (ADP). The problem is also studied in optimal control theory, although most of that research concerns the existence and characterization of optimal solutions rather than learning or approximation. In economics and game theory, reinforcement learning is used to explain how equilibrium can emerge under bounded rationality.

In machine learning, the environment is usually formalized as a Markov decision process (MDP), so many reinforcement learning algorithms apply dynamic programming techniques in this setting. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not require knowledge of the MDP and target large MDPs for which exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented and suboptimal actions are never explicitly corrected. Reinforcement learning focuses more on online planning and must strike a balance between exploration (of unknown territory) and exploitation (of current knowledge). This exploration-exploitation trade-off has been studied most thoroughly in the multi-armed bandit problem and in finite MDPs.

Source: https://zh.wikipedia.org/zh-hans/強化學習
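The exploration-exploitation trade-off mentioned above is easiest to see in a multi-armed bandit. Below is a minimal ε-greedy sketch; the function name, the Gaussian reward model, and the parameter values are illustrative assumptions of mine, not part of the quoted text.

import random

def epsilon_greedy(true_means, steps=10000, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit
    the arm with the best estimated value so far."""
    counts = [0] * len(true_means)    # times each arm was pulled
    values = [0.0] * len(true_means)  # running mean reward of each arm
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(len(true_means))                     # explore
        else:
            arm = max(range(len(true_means)), key=lambda a: values[a])  # exploit
        reward = random.gauss(true_means[arm], 1.0)                     # noisy reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]             # incremental mean
    return values, counts

# values, counts = epsilon_greedy([0.1, 0.5, 0.9])  # arm 2 ends up pulled most often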

Basic Components of Reinforcement Learning

  • Environment/state (the standard setting is stationary; the alternative is non-stationary)
  • Agent (the entity that interacts with the environment)
  • Actions (the action space, i.e. the set of actions feasible in the environment; discrete or continuous)
  • Feedback (the reward; it is feedback that lets RL iterate and learn a policy; a toy interaction loop is sketched below)
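As a minimal sketch of how these components fit together (the environment, agent, and reward model below are toy assumptions of mine):

import random

class CoinEnv:
    """Toy environment: a biased coin; the reward is 1.0 for a correct guess."""
    def __init__(self, p_heads=0.7):
        self.p_heads = p_heads
    def step(self, action):  # action: 1 = guess heads, 0 = guess tails
        outcome = 1 if random.random() < self.p_heads else 0
        return 1.0 if action == outcome else 0.0

class CountingAgent:
    """Toy agent: guesses whichever side has paid off more so far."""
    def __init__(self):
        self.payoff = [0.0, 0.0]
    def act(self):
        return 0 if self.payoff[0] > self.payoff[1] else 1
    def learn(self, action, reward):
        self.payoff[action] += reward

env, agent = CoinEnv(), CountingAgent()
for _ in range(1000):  # the RL loop: act -> receive reward -> update
    a = agent.act()
    r = env.step(a)
    agent.learn(a, r)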

2. Markov Decision Processes (MDP)

Markov Processes

In probability theory and statistics, a Markov process (Markov process), also called a Markov chain (Markov chain), is a stochastic process with the Markov property, named after the Russian mathematician Andrey Markov. A Markov process is memoryless (memorylessness): its conditional probabilities depend only on the current state of the system and are independent of its past history and future states. A Markov process can be described by a tuple $\langle S, P \rangle$, where $S$ is the set of states and $P$ is the state transition probability matrix.

Markov Reward Processes

A Markov reward process (Markov Reward Process) extends a Markov process with a reward $R$ and a discount factor $\gamma$: it is the tuple $\langle S, P, R, \gamma \rangle$.

Markov Decision Processes

Compared with a Markov reward process, a Markov decision process (Markov Decision Process) adds a set of actions $A$; it is the tuple $\langle S, A, P, R, \gamma \rangle$.

RL and MDPs

In reinforcement learning, a Markov decision process describes a fully observable environment: the observed state completely determines the features needed for decision-making. Almost all reinforcement learning problems can be formalized as MDPs.

Sources: https://zh.wikipedia.org/zh-hans/馬可夫過程 | https://zhuanlan.zhihu.com/p/28084942
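To make the tuple concrete, here is a toy two-state MDP written out as $\langle S, A, P, R, \gamma \rangle$; the states, transitions, and rewards are invented purely for illustration:

import random

S = ["cool", "hot"]                                 # state set S
A = ["work", "rest"]                                # action set A
P = {                                               # P[(s, a)] = [(next_state, probability), ...]
    ("cool", "work"): [("cool", 0.6), ("hot", 0.4)],
    ("cool", "rest"): [("cool", 1.0)],
    ("hot", "work"):  [("hot", 0.8), ("cool", 0.2)],
    ("hot", "rest"):  [("cool", 0.9), ("hot", 0.1)],
}
R = {("cool", "work"): 2.0, ("cool", "rest"): 0.0,  # expected immediate rewards
     ("hot", "work"): 1.0, ("hot", "rest"): -0.5}
gamma = 0.9                                         # discount factor

def sample_next_state(s, a):
    """The Markov property: the next-state distribution depends only on (s, a)."""
    r, acc = random.random(), 0.0
    for s2, p in P[(s, a)]:
        acc += p
        if r < acc:
            return s2
    return P[(s, a)][-1][0]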

3. Monte Carlo Methods (MCM)

Overview

The Monte Carlo method (Monte Carlo Method), also known as the statistical simulation method, is a class of numerical methods guided by probability and statistics, proposed in the mid-1940s alongside advances in science and technology and the invention of the electronic computer. It refers to the use of random numbers (or, more commonly, pseudo-random numbers) to solve many kinds of computational problems.

In the 1940s, John von Neumann, Stanislaw Ulam, and Nicholas Metropolis invented the Monte Carlo method while working on the nuclear weapons program at Los Alamos National Laboratory. The name comes from the Monte Carlo casino in Monaco, where Ulam's uncle often lost money gambling, and the Monte Carlo method is indeed grounded in probability.

Monte Carlo methods can be roughly divided into two categories. In the first, the problem itself has inherent randomness, and the computational power of a computer can be used to simulate that random process directly. For example, in nuclear physics one analyzes the transport of neutrons in a reactor. Neutron-nucleus interactions are governed by quantum mechanics: one only knows the probability that an interaction occurs, not the exact position of the interaction or the speed and direction of the new neutrons produced by fission. Scientists sample fission positions, speeds, and directions at random according to these probabilities; after simulating the behavior of a large number of neutrons, the statistics yield the range of neutron transport, which serves as a basis for reactor design.

In the second category, the problem can be reduced to some characteristic of a random distribution, such as the probability of a random event or the expected value of a random variable. Random sampling is then used to estimate a probability by the frequency of the corresponding event, or to estimate numerical characteristics of a random variable by the corresponding sample statistics, and these estimates serve as the solution to the problem. This approach is often used to solve complicated multidimensional integration problems.

An Example

Estimating π with the Monte Carlo method: after placing 30,000 random points, the estimate of π differs from the true value by 0.07%.

[Figure: random points in the unit square used to estimate π]

Source: https://zh.wikipedia.org/zh-hans/蒙特卡羅方法
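The same experiment is easy to reproduce in Python; a minimal sketch, assuming points drawn uniformly from the unit square:

import random

def estimate_pi(n_points=30000):
    """The fraction of random points in the unit square that fall inside
    the quarter circle of radius 1 approaches pi/4."""
    inside = sum(1 for _ in range(n_points)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * inside / n_points

# print(estimate_pi())  # typically within a few tenths of a percent of 3.14159...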

Prospects

Using the pure Monte Carlo method to play Go (first proposed in 1993 and taken up again in 2001), we can evaluate a candidate move simply through random games. Starting from the move to be evaluated, both sides play random moves until the game ends. To keep the results reliable, such random games usually have to be played tens of thousands of times; the result of each game is recorded, and the average of these results becomes the evaluation of that move. Finally, the move with the highest evaluation is chosen as the next move to play. In other words, deciding a single move requires the program to play tens of thousands of random games against itself, which places real demands on playout speed. Compared with traditional methods that rely on extensive Go knowledge, the advantage of this approach is obvious: it needs almost no Go expertise; a large number of random games suffices to estimate the value of a move. With some further optimizations, Go programs based on the pure Monte Carlo method have been able to rival the strongest traditional Go programs.

Since the Monte Carlo road looks so bright, we should keep going down it. MCTS incorporates the idea above into tree search, using a tree structure to update and select node values more efficiently.

Source: https://blog.csdn.net/natsu1211/article/details/50986810
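The pure ("flat") Monte Carlo evaluation described above can be sketched against the GameState interface (Clone/DoMove/GetMoves/GetResult) from the sample code in section 6; this sketch is my own illustration, not code from the quoted post:

import random

def flat_monte_carlo_move(rootstate, playouts_per_move=1000):
    """Score each legal move by the average result of random playouts,
    then return the highest-scoring move."""
    player = 3 - rootstate.playerJustMoved  # the player about to move
    best_move, best_score = None, -1.0
    for move in rootstate.GetMoves():
        total = 0.0
        for _ in range(playouts_per_move):
            state = rootstate.Clone()
            state.DoMove(move)
            while state.GetMoves():  # random playout to the end of the game
                state.DoMove(random.choice(state.GetMoves()))
            total += state.GetResult(player)  # 1.0 win / 0.5 draw / 0.0 loss
        score = total / playouts_per_move
        if score > best_score:
            best_move, best_score = move, score
    return best_move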

4. Monte Carlo Tree Search (MCTS)

Overview

Monte Carlo tree search (MCTS) is a heuristic search algorithm for certain kinds of decision processes, most notably those arising in game play. The prime example is computer Go programs, but it is also used in other board games, real-time video games, and games with uncertainty.

Source: https://zh.wikipedia.org/zh-hans/蒙特卡洛樹搜索

Search Steps

Explanation 1
  • Selection: based on the statistics gathered so far for all child moves, pick the best child move.
  • Expansion: when the statistics gathered so far are not sufficient to decide the next move, pick a child move at random.
  • Simulation: simulate the game, advancing to the next position.
  • Backpropagation: based on the final result of the game, update the statistics recorded along the corresponding path.

Source: https://www.cnblogs.com/steven-yang/p/5993205.html

Explanation 2
  • Selection: starting at the root node R, select successive child nodes down to a leaf node L. A way of choosing child nodes is given later that lets the game tree expand toward the most promising moves; this is the essence of Monte Carlo tree search.
  • Expansion: unless L ends the game with a win or loss for either player, create one or more child nodes and choose one of them, C.
  • Simulation: starting from node C, play the game with a random policy; this is also called a playout or rollout.
  • Backpropagation: use the result of the random game to update the information in the nodes on the path from C to R.

[Figure: the four steps of Monte Carlo tree search]

Source: https://zh.wikipedia.org/zh-hans/蒙特卡洛樹搜索

Illustrated Walkthrough

See the link below.

Source: https://www.cnblogs.com/steven-yang/p/5993205.html

Detailed Algorithm

At the start, the search tree has a single node: the position for which we need to make a decision.

Each node in the search tree holds three basic pieces of information: the position it represents, the number of times it has been visited, and its accumulated score.

  1. Selection

    In the selection phase, we start from the root node, i.e. the position R to be decided, and descend to find the node N most urgently in need of expansion; R is the first node examined in each iteration.

    An examined position falls into one of three cases:

    1. All feasible actions of the node have already been expanded.
    2. The node has feasible actions that have not yet been expanded.
    3. The game has already ended at this node (for example, a Gomoku position with five in a row completed).

    For these three cases:

    1. If all feasible actions have been expanded, we use the UCB formula to compute the UCB value of each of the node's children, pick the child with the largest value, and continue examining it, iterating downward repeatedly.
    2. If the examined position still has unexpanded children (say a node has 20 feasible actions but only 19 children have been created in the search tree), we take this node as the target node N of this iteration, pick one of N's not-yet-expanded actions A, and go to step [2].
    3. If the examined node is one where the game has already ended, we go directly from that node to step [4].

    The visit count of every node examined in this phase is incremented.

    After repeated iterations, we arrive at a node near the bottom of the search tree and continue with the following steps.

  2. Expansion

    At the end of the selection phase we have found the node N most urgently in need of expansion, together with one of its not-yet-expanded actions A. We create a new node $N_n$ in the search tree as a new child of N; the position of $N_n$ is the position reached from node N after playing action A.

  3. Simulation

    To give $N_n$ an initial score, we let the game play out randomly from $N_n$ until it reaches an outcome, and that outcome becomes the initial score of $N_n$. The score is usually win/loss, i.e. just 1 or 0.

  4. Backpropagation

    After the simulation from $N_n$ finishes, its parent N and every node on the path from the root to N update their accumulated scores according to the result of this simulation. If a finished game was encountered directly during selection in [1], the scores are updated according to that outcome.

    Each iteration expands the search tree, so the tree keeps growing as the iterations accumulate. After a given number of iterations, or once time runs out, the search stops and the best child of the root node is chosen as the result of this decision.

Source: https://www.zhihu.com/question/39916945/answer/83799720

See also: https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/

Algorithm Pseudocode

[Pseudocode figures omitted]

Sources: https://blog.csdn.net/dinosoft/article/details/50893291 | https://blog.csdn.net/u014397729/article/details/27366363
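Since the original pseudocode figures did not survive, here is an outline in comment form; it is my paraphrase of the standard formulation of the four steps above, and it mirrors the UCT() function in section 6:

# UCT search, in outline:
#   create the root node from the root state
#   while within the computational budget (iterations or time):
#       node, state = root, copy of root state
#       1. Selection: while node is fully expanded and non-terminal,
#          descend to the child with the highest UCB1 value, applying its move
#       2. Expansion: if node has untried moves, play one and add it as a child
#       3. Simulation: play random moves from the current state until the game ends
#       4. Backpropagation: update visit counts and scores from the new node back to the root
#   return the move of the root's most-visited child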

5. Upper Confidence Bounds Applied to Trees (UCT)

UCB1

$$\frac{w_i}{n_i}+c\sqrt{\frac{\ln t}{n_i}}$$

In this formula:

  • $w_i$ is the number of wins after the $i$-th move;
  • $n_i$ is the number of simulations after the $i$-th move;
  • $c$ is the exploration parameter, theoretically equal to $\sqrt{2}$; in practice it is usually chosen empirically;
  • $t$ is the total number of simulations, equal to the sum of all $n_i$.

Source: https://zh.wikipedia.org/zh-hans/蒙特卡洛樹搜索

The larger $c$ is, the more the selection favors children with relatively few visits.

Source: https://zhuanlan.zhihu.com/p/25345778
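The formula translates directly into code; a minimal sketch (the function and argument names are mine):

from math import log, sqrt

def ucb1(wins, visits, parent_visits, c=sqrt(2)):
    """Exploitation term (average win rate) plus an exploration term
    that grows for children that have been visited relatively rarely."""
    return wins / visits + c * sqrt(log(parent_visits) / visits)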

About UCT

The UCT algorithm (Upper Confidence Bounds Applied to Trees) is a game-tree search algorithm that combines Monte Carlo tree search with the UCB formula; when searching extremely large game trees it has time and space advantages over traditional search algorithms.

That is: MCTS + UCB1 = UCT

Source: https://baike.baidu.com/item/UCT算法

The UCB formula in the algorithm can be replaced with variants such as UCB1-tuned.

Source: https://blog.csdn.net/xbinworld/article/details/79372777
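For reference, UCB1-tuned replaces the exploration constant with a variance-based bound. In the notation above it reads as follows (reproduced from Auer et al.'s bandit paper as I recall it, so treat the exact form as an assumption; $X_{i,s}$ is the reward of the $s$-th simulation through move $i$):

$$\frac{w_i}{n_i}+\sqrt{\frac{\ln t}{n_i}\min\left(\frac{1}{4},\,V_i(n_i)\right)},\qquad V_i(n_i)=\left(\frac{1}{n_i}\sum_{s=1}^{n_i}X_{i,s}^2\right)-\left(\frac{w_i}{n_i}\right)^2+\sqrt{\frac{2\ln t}{n_i}}$$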

Advantages

MCTS offers a better approach than traditional tree search.

  • Aheuristic

    MCTS requires no domain-specific strategy or concrete expert knowledge to make reasonable decisions. The algorithm can work effectively with no knowledge of a game beyond its basic rules; this means a single simple MCTS implementation can be reused across many games with only minor adjustments, which makes MCTS a good approach for general game playing.

  • Asymmetric

    MCTS grows the tree asymmetrically, adapting to the topology of the search space. The algorithm visits the more interesting nodes more often and focuses its search time on the more relevant parts of the tree. This makes MCTS well suited to games with large branching factors, such as 19x19 Go. Such a huge combinatorial space causes problems for standard depth- or breadth-based search methods, whereas the adaptivity of MCTS means it can (eventually) find the better moves and concentrate its search effort there.

  • Anytime

    The algorithm can be stopped at any time and return the current best estimate. The search tree built so far can be discarded or kept for later reuse. (Contrast this with brute-force DFS.)

  • Simple

    The algorithm is very easy to implement ( http://mcts.ai/code/python.html ).

Source: https://www.jianshu.com/p/d011baff6b64

Disadvantages

MCTS has few drawbacks, but those it has can be critical.

  • Playing strength

    In its basic form, the MCTS algorithm can fail to find reasonable moves within an acceptable amount of time even for games of moderate size. This is mostly due to the sheer size of the combinatorial move space: key nodes may not be visited often enough to yield reliable estimates.

  • Speed

    MCTS search can require many iterations to converge to a good solution, which is a problem for more general applications that are hard to optimize. For example, the best Go programs may require millions of playouts, together with domain-specific optimizations and enhancements, to reach expert-level play, whereas the best general game playing (GGP) implementations may manage only tens of (domain-independent) playouts per second on more complex games. For acceptable move times, such GGP programs may barely have time to visit each legal move, so very strong search performance is unlikely in that setting.

Source: https://www.jianshu.com/p/d011baff6b64

6. Complete Sample Code

# This is a very simple implementation of the UCT Monte Carlo Tree Search algorithm in Python (originally Python 2.7, converted to Python 3).
# The function UCT(rootstate, itermax, verbose = False) is towards the bottom of the code.
# It aims to have the clearest and simplest possible code, and for the sake of clarity, the code
# is orders of magnitude less efficient than it could be made, particularly by using a 
# state.GetRandomMove() or state.DoRandomRollout() function.
# 
# Example GameState classes for Nim, OXO and Othello are included to give some idea of how you
# can write your own GameState and use UCT in your 2-player game. Change the game to be played in
# the UCTPlayGame() function at the bottom of the code.
# 
# Written by Peter Cowling, Ed Powley, Daniel Whitehouse (University of York, UK) September 2012.
# 
# Licence is granted to freely use and distribute for any sensible/legal purpose so long as this comment
# remains in any distributed code.
# 
# For more information about Monte Carlo Tree Search check out our web site at www.mcts.ai

from math import *
import random


class GameState:
    """ A state of the game, i.e. the game board. These are the only functions which are
        absolutely necessary to implement UCT in any 2-player complete information deterministic 
        zero-sum game, although they can be enhanced and made quicker, for example by using a 
        GetRandomMove() function to generate a random move during rollout.
        By convention the players are numbered 1 and 2.
    """

    def __init__(self):
        self.playerJustMoved = 2  # At the root pretend the player just moved is player 2 - player 1 has the first move

    def Clone(self):
        """ Create a deep clone of this game state.
        """
        st = GameState()
        st.playerJustMoved = self.playerJustMoved
        return st

    def DoMove(self, move):
        """ Update a state by carrying out the given move.
            Must update playerJustMoved.
        """
        self.playerJustMoved = 3 - self.playerJustMoved

    def GetMoves(self):
        """ Get all possible moves from this state.
        """

    def GetResult(self, playerjm):
        """ Get the game result from the viewpoint of playerjm. 
        """

    def __repr__(self):
        """ Don‘t need this - but good style.
        """
        pass


class NimState:
    """ A state of the game Nim. In Nim, players alternately take 1,2 or 3 chips with the 
        winner being the player to take the last chip. 
        In Nim any initial state of the form 4n+k for k = 1,2,3 is a win for player 1
        (by first taking k chips).
        Any initial state of the form 4n is a win for player 2.
    """

    def __init__(self, ch):
        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
        self.chips = ch

    def Clone(self):
        """ Create a deep clone of this game state.
        """
        st = NimState(self.chips)
        st.playerJustMoved = self.playerJustMoved
        return st

    def DoMove(self, move):
        """ Update a state by carrying out the given move.
            Must update playerJustMoved.
        """
        assert move >= 1 and move <= 3 and move == int(move)
        self.chips -= move
        self.playerJustMoved = 3 - self.playerJustMoved

    def GetMoves(self):
        """ Get all possible moves from this state.
        """
        return list(range(1, min([4, self.chips + 1])))

    def GetResult(self, playerjm):
        """ Get the game result from the viewpoint of playerjm. 
        """
        assert self.chips == 0
        if self.playerJustMoved == playerjm:
            return 1.0  # playerjm took the last chip and has won
        else:
            return 0.0  # playerjm's opponent took the last chip and has won

    def __repr__(self):
        s = "Chips:" + str(self.chips) + " JustPlayed:" + str(self.playerJustMoved)
        return s


class OXOState:
    """ A state of the game, i.e. the game board.
        Squares in the board are in this arrangement
        012
        345
        678
        where 0 = empty, 1 = player 1 (X), 2 = player 2 (O)
    """

    def __init__(self):
        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
        self.board = [0, 0, 0, 0, 0, 0, 0, 0, 0]  # 0 = empty, 1 = player 1, 2 = player 2

    def Clone(self):
        """ Create a deep clone of this game state.
        """
        st = OXOState()
        st.playerJustMoved = self.playerJustMoved
        st.board = self.board[:]
        return st

    def DoMove(self, move):
        """ Update a state by carrying out the given move.
            Must update playerJustMoved.
        """
        assert move >= 0 and move <= 8 and move == int(move) and self.board[move] == 0
        self.playerJustMoved = 3 - self.playerJustMoved
        self.board[move] = self.playerJustMoved

    def GetMoves(self):
        """ Get all possible moves from this state.
        """
        return [i for i in range(9) if self.board[i] == 0]

    def GetResult(self, playerjm):
        """ Get the game result from the viewpoint of playerjm. 
        """
        for (x, y, z) in [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]:
            if self.board[x] == self.board[y] == self.board[z]:
                if self.board[x] == playerjm:
                    return 1.0
                else:
                    return 0.0
        if self.GetMoves() == []: return 0.5  # draw
        assert False  # Should not be possible to get here

    def __repr__(self):
        s = ""
        for i in range(9):
            s += ".XO"[self.board[i]]
            if i % 3 == 2: s += "\n"
        return s


class OthelloState:
    """ A state of the game of Othello, i.e. the game board.
        The board is a 2D array where 0 = empty (.), 1 = player 1 (X), 2 = player 2 (O).
        In Othello players alternately place pieces on a square board - each piece played
        has to sandwich opponent pieces between the piece played and pieces already on the 
        board. Sandwiched pieces are flipped.
        This implementation modifies the rules to allow variable sized square boards and
        terminates the game as soon as the player about to move cannot make a move (whereas
        the standard game allows for a pass move). 
    """

    def __init__(self, sz=8):
        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
        self.board = []  # 0 = empty, 1 = player 1, 2 = player 2
        self.size = sz
        assert sz == int(sz) and sz % 2 == 0  # size must be integral and even
        for y in range(sz):
            self.board.append([0] * sz)
        self.board[sz // 2][sz // 2] = self.board[sz // 2 - 1][sz // 2 - 1] = 1  # use // so indices are ints in Python 3
        self.board[sz // 2][sz // 2 - 1] = self.board[sz // 2 - 1][sz // 2] = 2

    def Clone(self):
        """ Create a deep clone of this game state.
        """
        st = OthelloState()
        st.playerJustMoved = self.playerJustMoved
        st.board = [self.board[i][:] for i in range(self.size)]
        st.size = self.size
        return st

    def DoMove(self, move):
        """ Update a state by carrying out the given move.
            Must update playerJustMoved.
        """
        (x, y) = (move[0], move[1])
        assert x == int(x) and y == int(y) and self.IsOnBoard(x, y) and self.board[x][y] == 0
        m = self.GetAllSandwichedCounters(x, y)
        self.playerJustMoved = 3 - self.playerJustMoved
        self.board[x][y] = self.playerJustMoved
        for (a, b) in m:
            self.board[a][b] = self.playerJustMoved

    def GetMoves(self):
        """ Get all possible moves from this state.
        """
        return [(x, y) for x in range(self.size) for y in range(self.size) if
                self.board[x][y] == 0 and self.ExistsSandwichedCounter(x, y)]

    def AdjacentToEnemy(self, x, y):
        """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
        """
        for (dx, dy) in [(0, +1), (+1, +1), (+1, 0), (+1, -1), (0, -1), (-1, -1), (-1, 0), (-1, +1)]:
            if self.IsOnBoard(x + dx, y + dy) and self.board[x + dx][y + dy] == self.playerJustMoved:
                return True
        return False

    def AdjacentEnemyDirections(self, x, y):
        """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
        """
        es = []
        for (dx, dy) in [(0, +1), (+1, +1), (+1, 0), (+1, -1), (0, -1), (-1, -1), (-1, 0), (-1, +1)]:
            if self.IsOnBoard(x + dx, y + dy) and self.board[x + dx][y + dy] == self.playerJustMoved:
                es.append((dx, dy))
        return es

    def ExistsSandwichedCounter(self, x, y):
        """ Does there exist at least one counter which would be flipped if my counter was placed at (x,y)?
        """
        for (dx, dy) in self.AdjacentEnemyDirections(x, y):
            if len(self.SandwichedCounters(x, y, dx, dy)) > 0:
                return True
        return False

    def GetAllSandwichedCounters(self, x, y):
        """ Is (x,y) a possible move (i.e. opponent counters are sandwiched between (x,y) and my counter in some direction)?
        """
        sandwiched = []
        for (dx, dy) in self.AdjacentEnemyDirections(x, y):
            sandwiched.extend(self.SandwichedCounters(x, y, dx, dy))
        return sandwiched

    def SandwichedCounters(self, x, y, dx, dy):
        """ Return the coordinates of all opponent counters sandwiched between (x,y) and my counter.
        """
        x += dx
        y += dy
        sandwiched = []
        while self.IsOnBoard(x, y) and self.board[x][y] == self.playerJustMoved:
            sandwiched.append((x, y))
            x += dx
            y += dy
        if self.IsOnBoard(x, y) and self.board[x][y] == 3 - self.playerJustMoved:
            return sandwiched
        else:
            return []  # nothing sandwiched

    def IsOnBoard(self, x, y):
        return x >= 0 and x < self.size and y >= 0 and y < self.size

    def GetResult(self, playerjm):
        """ Get the game result from the viewpoint of playerjm. 
        """
        jmcount = len([(x, y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == playerjm])
        notjmcount = len(
            [(x, y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 3 - playerjm])
        if jmcount > notjmcount:
            return 1.0
        elif notjmcount > jmcount:
            return 0.0
        else:
            return 0.5  # draw

    def __repr__(self):
        s = ""
        for y in range(self.size - 1, -1, -1):
            for x in range(self.size):
                s += ".XO"[self.board[x][y]]
            s += "\n"
        return s


class Node:
    """ A node in the game tree. Note wins is always from the viewpoint of playerJustMoved.
        Crashes if state not specified.
    """

    def __init__(self, move=None, parent=None, state=None):
        self.move = move  # the move that got us to this node - "None" for the root node
        self.parentNode = parent  # "None" for the root node
        self.childNodes = []
        self.wins = 0
        self.visits = 0
        self.untriedMoves = state.GetMoves()  # future child nodes
        self.playerJustMoved = state.playerJustMoved  # the only part of the state that the Node needs later

    def UCTSelectChild(self):
        """ Use the UCB1 formula to select a child node. Often a constant UCTK is applied so we have
            lambda c: c.wins/c.visits + UCTK * sqrt(2*log(self.visits)/c.visits) to vary the amount of
            exploration versus exploitation.
        """
        s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(2 * log(self.visits) / c.visits))[-1]
        return s

    def AddChild(self, m, s):
        """ Remove m from untriedMoves and add a new child node for this move.
            Return the added child node
        """
        n = Node(move=m, parent=self, state=s)
        self.untriedMoves.remove(m)
        self.childNodes.append(n)
        return n

    def Update(self, result):
        """ Update this node - one additional visit and result additional wins. result must be from the viewpoint of playerJustmoved.
        """
        self.visits += 1
        self.wins += result

    def __repr__(self):
        return "[M:" + str(self.move) + " W/V:" + str(self.wins) + "/" + str(self.visits) + " U:" + str(
            self.untriedMoves) + "]"

    def TreeToString(self, indent):
        s = self.IndentString(indent) + str(self)
        for c in self.childNodes:
            s += c.TreeToString(indent + 1)
        return s

    def IndentString(self, indent):
        s = "\n"
        for i in range(1, indent + 1):
            s += "| "
        return s

    def ChildrenToString(self):
        s = ""
        for c in self.childNodes:
            s += str(c) + "\n"
        return s


def UCT(rootstate, itermax, verbose=False):
    """ Conduct a UCT search for itermax iterations starting from rootstate.
        Return the best move from the rootstate.
        Assumes 2 alternating players (player 1 starts), with game results in the range [0.0, 1.0]."""

    rootnode = Node(state=rootstate)

    for i in range(itermax):
        node = rootnode
        state = rootstate.Clone()

        # Select
        while node.untriedMoves == [] and node.childNodes != []:  # node is fully expanded and non-terminal
            node = node.UCTSelectChild()
            state.DoMove(node.move)

        # Expand
        if node.untriedMoves != []:  # if we can expand (i.e. state/node is non-terminal)
            m = random.choice(node.untriedMoves)
            state.DoMove(m)
            node = node.AddChild(m, state)  # add child and descend tree

        # Rollout - this can often be made orders of magnitude quicker using a state.GetRandomMove() function
        while state.GetMoves() != []:  # while state is non-terminal
            state.DoMove(random.choice(state.GetMoves()))

        # Backpropagate
        while node is not None:  # backpropagate from the expanded node and work back to the root node
            node.Update(state.GetResult(
                node.playerJustMoved))  # state is terminal. Update node with result from POV of node.playerJustMoved
            node = node.parentNode

    # Output some information about the tree - can be omitted
    if (verbose):
        print(rootnode.TreeToString(0))
    else:
        print(rootnode.ChildrenToString())

    return sorted(rootnode.childNodes, key=lambda c: c.visits)[-1].move  # return the move that was most visited


def UCTPlayGame():
    """ Play a sample game between two UCT players where each player gets a different number 
        of UCT iterations (= simulations = tree nodes).
    """
    # state = OthelloState(4)  # uncomment to play Othello on a square board of the given size
    state = OXOState()  # OXO (noughts and crosses) is played by default
    # state = NimState(15)  # uncomment to play Nim with the given number of starting chips
    while (state.GetMoves() != []):
        print(str(state))
        if state.playerJustMoved == 1:
            m = UCT(rootstate=state, itermax=1000, verbose=False)  # play with values for itermax and verbose = True
        else:
            m = UCT(rootstate=state, itermax=100, verbose=False)
        print("Best Move: " + str(m) + "\n")
        state.DoMove(m)
    if state.GetResult(state.playerJustMoved) == 1.0:
        print("Player " + str(state.playerJustMoved) + " wins!")
    elif state.GetResult(state.playerJustMoved) == 0.0:
        print("Player " + str(3 - state.playerJustMoved) + " wins!")
    else:
        print("Nobody wins!")


if __name__ == "__main__":
    """ Play a single game to the end using UCT for both players. 
    """
    UCTPlayGame()

The original code is Python 2; I converted it to Python 3.

Source: http://mcts.ai/code/python.html
