Reinforcement Learning Environments: Gym, from Installation to Getting Started

Gym is a toolkit for testing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your algorithm, and it works with any numerical computation library, such as TensorFlow or Theano.

The gym library is a collection of test problems (environments) that you can use to work out your reinforcement learning algorithms. These environments share a common interface, which lets you write general-purpose algorithms.

Installation

Before you start, you need Python 3.5+. For a simple installation, pip is all you need:

pip install gym

Once that finishes, you are ready to start playing with gym.

Building from source

If you prefer, you can also clone the gym repository from GitHub and install it directly. This is useful when you want to add your own environments or modify the existing ones. Download and install with the following commands:

git clone https://github.com/openai/gym
cd gym
pip install -e .

You can later run the following command to install the dependencies for all of the environments:

pip install -e .[all]

The command above requires some system-level dependencies, such as cmake and a recent version of pip.

Environments

Here is a minimal example that runs one of the environments included in gym. It runs the CartPole-v0 environment for 1000 timesteps, rendering it at each step. When you run it, you will see a window showing the classic cart-pole (inverted pendulum) problem.

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

If you are missing dependencies for a particular environment, you will get a helpful error message telling you which dependency is missing and how to fix it. Installing a missing dependency is generally straightforward. If you want to run Hopper-v1, for example, you will need a MuJoCo license.

Observations

If we want to do better than taking random actions at each step, we need to know what our actions are actually doing to the environment.

The environment's step function returns exactly what we need. In fact, it returns four values at every step:

observation (object): an environment-specific object representing your observation of the environment, for example pixel data from a camera, the joint angles and joint velocities of a robot, or the board state of a board game.

reward (float): the amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

done (boolean): whether it is time to reset the environment again. Most (but not all) tasks are divided into well-defined episodes, and done being True indicates the episode has terminated. (For example, the pole tipped too far, or the cart moved too far from the center.)

info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, evaluations of your agent are not supposed to use this information for learning.

This is the classic agent-environment loop of reinforcement learning: at each timestep, the agent chooses an action, the environment executes it, and the agent receives the next observation (the next state) and a reward.

The process gets started by calling reset(), which returns an initial observation. A more proper way of writing the previous code, respecting the done flag, looks like this:

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
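
Since the goal is always to increase the total reward, a natural next step is to accumulate the per-step rewards into an episode return. Below is a minimal sketch of that bookkeeping, using the same CartPole-v0 environment and random actions (the env.close() call at the end is just tidy-up and is not part of the original example):

import gym
env = gym.make('CartPole-v0')
for i_episode in range(5):
    observation = env.reset()
    episode_return = 0.0  # sum of rewards over this episode
    done = False
    while not done:
        action = env.action_space.sample()  # still a random policy
        observation, reward, done, info = env.step(action)
        episode_return += reward
    print("Episode {} return: {}".format(i_episode, episode_return))
env.close()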

Spaces

In the examples above, we have been sampling random actions from the environment's action space. But what are those actions, actually? Every environment comes with an action_space and an observation_space. These Space attributes describe the format of valid actions and observations:

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

The Discrete space allows a fixed range of non-negative numbers, so in this case valid actions are either 0 or 1. The Box space represents an n-dimensional box, so valid observations are arrays of 4 numbers. We can also check the Box's bounds:

print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

This introspection can be helpful to write generic code that works for many different environments. Box and Discrete are the most common Spaces. You can sample from a Space or check that something belongs to it:

from gym import spaces
space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8
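
For completeness, here is a similar sketch with a Box space. Treat the exact constructor signature as an assumption about a reasonably recent gym release: older versions did not take the dtype argument.

from gym import spaces
import numpy as np

# A 4-dimensional box with every component bounded in [-1, 1]
box = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
x = box.sample()  # a random 4-element array inside the bounds
assert box.contains(x)
print(box.low, box.high)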

Available environments

Gym comes with a diverse suite of environments that range from easy to difficult and involve many different kinds of data. You can browse the full list of environments to get a bird's-eye view.

Algorithmic: perform computations such as adding numbers and reversing sequences. These tasks are generally considered easy for a conventional computer; the challenge is to learn them purely from examples.

Atari: play classic Atari games. The full Arcade Learning Environment (which has had a big impact on reinforcement learning research) is integrated.

2D and 3D robots: control a robot in simulation. These tasks use the MuJoCo physics engine, which was designed for fast and accurate robot simulation. They include some environments from a benchmark by UC Berkeley researchers (who, incidentally, will be joining us this summer). MuJoCo is proprietary software, but it offers free trial licenses.
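
Whichever family you pick, creating an environment looks the same as it did for CartPole; only the id changes. Here is a quick sketch with one of the algorithmic tasks (Copy-v0; whether it is available depends on your gym version and installed extras):

import gym

# Atari and MuJoCo environments are created the same way, provided their extra
# dependencies (and, for MuJoCo, a license) are installed.
env = gym.make('Copy-v0')
observation = env.reset()
print(env.action_space)
print(env.observation_space)
env.close()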

The registry

gym’s main purpose is to provide a large collection of environments that expose a common interface and are versioned to allow for comparisons. To list the environments available in your installation, just ask gym.envs.registry:

from gym import envs
print(envs.registry.all())
#> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0), EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0), EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0), EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0), EnvSpec(Gopher-ram-v0), ...

This will give you a list of EnvSpec objects. These define parameters for a particular task, including the number of trials to run and the maximum number of steps. For example, EnvSpec(Hopper-v1) defines an environment where the goal is to get a 2D simulated robot to hop; EnvSpec(Go9x9-v0) defines a Go game on a 9x9 board.
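
As a small sketch (assuming the classic gym API, where the registry exposes a spec() lookup), you can fetch one of these EnvSpec objects by id and inspect it; the exact attribute names, such as max_episode_steps or trials, vary across gym versions:

from gym import envs

spec = envs.registry.spec('CartPole-v0')  # look up the EnvSpec by id
print(spec.id)
print(getattr(spec, 'max_episode_steps', None))  # None if this version uses a different name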

These environment IDs are treated as opaque strings. In order to ensure valid comparisons for the future, environments will never be changed in a fashion that affects performance, only replaced by newer versions. We currently suffix each environment with a v0 so that future replacements can naturally be called v1, v2, etc.

It’s very easy to add your own environments to the registry, and thus make them available for gym.make(): just register() them at load time.
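
Below is a minimal sketch of what such a registration might look like. The names MyCustomEnv-v0, my_package.my_envs and MyCustomEnv are hypothetical placeholders for your own code, and the max_episode_steps argument is an assumption about a reasonably recent gym version:

from gym.envs.registration import register

# Hypothetical example: my_package/my_envs.py defines a gym.Env subclass MyCustomEnv.
register(
    id='MyCustomEnv-v0',  # the id you will later pass to gym.make()
    entry_point='my_package.my_envs:MyCustomEnv',
    max_episode_steps=200,  # optional; keyword availability varies by gym version
)

# After this runs at load time, the environment behaves like any built-in one:
# env = gym.make('MyCustomEnv-v0')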

Background: Why Gym? (2016)

Reinforcement learning (RL) is the subfield of machine learning concerned with decision making and motor control. It studies how an agent can learn how to achieve goals in a complex, uncertain environment. It’s exciting for two reasons:

  • RL is very general, encompassing all problems that involve making a sequence of decisions: for example, controlling a robot’s motors so that it’s able to run and jump, making business decisions like pricing and inventory management, or playing video games and board games. RL can even be applied to supervised learning problems with sequential or structured outputs.
  • RL algorithms have started to achieve good results in many difficult environments. RL has a long history, but until recent advances in deep learning, it required lots of problem-specific engineering. DeepMind’s Atari results, BRETT from Pieter Abbeel’s group, and AlphaGo all used deep RL algorithms which did not make too many assumptions about their environment, and thus can be applied in other settings.

However, RL research is also slowed down by two factors:

  • The need for better benchmarks. In supervised learning, progress has been driven by large labeled datasets like ImageNet. In RL, the closest equivalent would be a large and diverse collection of environments. However, the existing open-source collections of RL environments don’t have enough variety, and they are often difficult to even set up and use.
  • Lack of standardization of environments used in publications. Subtle differences in the problem definition, such as the reward function or the set of actions, can drastically alter a task’s difficulty. This issue makes it difficult to reproduce published research and compare results from different papers.

Gym is an attempt to fix both problems.