TensorFlow 2.0's new design: have you adapted yet?

TensorFlow · Published 2018-11-16 11:39:18

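Author: P. Galeone

Translated by: Bot

Editor's note: A few days ago TensorFlow celebrated its third birthday. As the most popular machine learning framework today, it has held that position for nearly three years. Neither the mature Keras nor the fast-rising PyTorch has seemed able to shake its standing. In the coming year, 2019, TensorFlow 2.0 will officially arrive and add fuel to the simmering framework wars.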

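If there is one fundamental difference between the two generations, it is that TensorFlow 2.0 focuses on a lower barrier to entry, aiming to let everyone apply machine learning. Since it may become another milestone for machine learning frameworks, this article covers all the (known) differences between the 1.x and 2.x versions, with a focus on the change in mindset they require and on their pros and cons.

After reading it, long-time TensorFlow users can start adjusting their thinking early and adapt to the new version. Newcomers can think the TensorFlow 2.0 way from the start; for now, at least, there is no need to rush off and learn another framework.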

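TensorFlow 2.0: why, and when?

TensorFlow 2.0 was conceived as a simpler, easier-to-use TensorFlow.

The first person to share concrete details of the project publicly was Martin Wicke, an engineer at Google Brain; his post to the announcements mailing list gives the first hints about TensorFlow 2.0. Here is a brief summary: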

Removal of tf.contrib

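In other words, if you have never touched TensorFlow before, you are lucky. But if, like us, you have been using it since version 0.x, you may have to rewrite your entire code base. A conversion tool has been promised to help existing users, but a tool like that is bound to be imperfect and to need some manual intervention.

You will also have to change your way of thinking. That is not easy, but shouldn't a real warrior enjoy a challenge?

So, to face the challenge, let's get used to the first big difference: tf.get_variable, tf.variable_scope and tf.layers are removed, forcing a move to a Keras-based approach, that is, tf.keras.

No exact release date has been announced for TensorFlow 2.0. According to members of the development team, a preview version should be released by the end of this year (2018), and the official release may follow in spring 2019.

So existing users do not have much time left.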

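Keras (OOP) vs TensorFlow 1.x

On GitHub, the RFC "Variables in TensorFlow 2.0" has been accepted. It is probably the RFC with the biggest impact on existing code bases and is worth reading.

We all know that in TensorFlow every variable has a unique name in the computational graph, and we have grown used to designing graphs along these lines:

1. Which operations connect my variable nodes? Define the graph as several connected sub-graphs, each inside its own tf.variable_scope, so that the variables of different sub-graphs live in different scopes and TensorBoard shows a clear picture.
2. Need to use a sub-graph more than once in the same execution step? Be sure to use the reuse parameter of tf.variable_scope; otherwise TensorFlow creates a new sub-graph whose name carries an _n suffix.
3. Graph defined? Create the variable initialization op (how many times have you called tf.global_variables_initializer()?).
4. Load the graph into a Session and run it.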
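The four steps above can be seen in miniature in the sketch below. It is only an illustration under stated assumptions: it uses the 1.x-style API through the tf.compat.v1 module (so it assumes a TensorFlow release that ships that module; on a plain 1.x install the `tf1.` prefix would simply be `tf.`), and the `scale` function and its single variable `w` are hypothetical names, not part of the article.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # assumption: the installed release ships tf.compat.v1


def scale(x, reuse=False):
    """Hypothetical one-variable sub-graph living in its own scope."""
    with tf1.variable_scope("scale", reuse=reuse):
        w = tf1.get_variable(
            "w", shape=(), initializer=tf1.constant_initializer(3.0)
        )
    return x * w


graph = tf.Graph()
with graph.as_default():
    # Steps 1-2: define the sub-graph once, then reuse its variable.
    a = tf1.placeholder(tf.float32, shape=())
    first = scale(a)
    second = scale(first, reuse=True)
    # Every variable is registered globally under its scoped name.
    names = [v.name for v in tf1.global_variables()]
    # Step 3: create the variable initialization op.
    init_op = tf1.global_variables_initializer()

# Step 4: load the graph into a session and run it.
with tf1.Session(graph=graph) as sess:
    sess.run(init_op)
    result = sess.run(second, feed_dict={a: 2.0})

print(names, result)
```

Because the second call passes reuse=True, the graph holds a single variable, scale/w, that is applied twice to the input.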

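Below, we implement a simple GAN in TensorFlow to make these steps more concrete.

A GAN in TensorFlow 1.x

To define the GAN discriminator D we have to use the reuse parameter of tf.variable_scope, because we first feed D the real images, then feed it the generated fake samples, and only at the end compute D's gradient. The parameters of the generator G, by contrast, are never used twice within one iteration, so there is nothing to worry about.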

import tensorflow as tf  # TensorFlow 1.x

def generator(inputs):
    """generator network.
    Args:
        inputs: a (None, latent_space_size) tf.float32 tensor
    Returns:
        G: the generator output node
    """
    with tf.variable_scope("generator"):
        fc1 = tf.layers.dense(inputs, units=64, activation=tf.nn.elu, name="fc1")
        fc2 = tf.layers.dense(fc1, units=64, activation=tf.nn.elu, name="fc2")
        # Feed fc2 (not fc1) so the second hidden layer is actually used
        G = tf.layers.dense(fc2, units=1, name="G")
    return G

def discriminator(inputs, reuse=False):
    """discriminator network
    Args:
        inputs: a (None, 1) tf.float32 tensor
        reuse: python boolean, if we expect to reuse (True) or declare (False) the variables
    Returns:
        D: the discriminator output node
    """
    with tf.variable_scope("discriminator", reuse=reuse):
        fc1 = tf.layers.dense(inputs, units=32, activation=tf.nn.elu, name="fc1")
        D = tf.layers.dense(fc1, units=1, name="D")
    return D

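When these two functions are called, TensorFlow defines two different sub-graphs inside the default graph, each with its own scope (generator/discriminator). Note that the functions return the output tensor of the sub-graph they define, not the sub-graph itself.

To share the sub-graph D, we need to define two inputs (real images and generated samples) and the loss functions required to train G and D.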

# Define the real input, a batch of values sampled from the real data
real_input = tf.placeholder(tf.float32, shape=(None,1))
# Define the discriminator network and its parameters
D_real = discriminator(real_input)

# Arbitrary size of the noise prior vector
latent_space_size = 100
# Define the input noise shape and define the generator
input_noise = tf.placeholder(tf.float32, shape=(None,latent_space_size))
G = generator(input_noise)

# Now that the generator output G is defined, we can feed it to D.
# This call to `discriminator` does not define a new graph; it
# **reuses** the variables previously defined.
D_fake = discriminator(G, reuse=True)

The last thing to do is define the 2 loss functions and the 2 optimizers needed to train D and G respectively.

D_loss_real = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_real, labels=tf.ones_like(D_real))
)

D_loss_fake = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake, labels=tf.zeros_like(D_fake))
)

# D_loss, when invoked it first does a forward pass using the D_loss_real
# then another forward pass using D_loss_fake, sharing the same D parameters.
D_loss = D_loss_real + D_loss_fake

G_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake, labels=tf.ones_like(D_fake))
)

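Defining the losses is not hard. What characterizes adversarial training is that both the real images and the images generated by G are fed to the discriminator D, which outputs its evaluation, and that evaluation is fed back to the generator G as a training signal. Adversarial training is therefore a two-step process: G and D live in the same graph, but when training D we do not want to update G's parameters, and when training G we do not want to update D's.

Since every variable was defined in the default graph and every variable is global, we must collect the right variables into 2 separate lists and define the optimizers accordingly, so that gradients are computed and updates applied to the right sub-graph only.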

# Gather D and G variables
D_vars = tf.trainable_variables(scope="discriminator")
G_vars = tf.trainable_variables(scope="generator")

# Define the optimizers and the train operations
train_D = tf.train.AdamOptimizer(1e-5).minimize(D_loss, var_list=D_vars)
train_G = tf.train.AdamOptimizer(1e-5).minimize(G_loss, var_list=G_vars)

At this point the graph is defined, which brings us to step 3 of the list above: creating the variable initialization op:

init_op = tf.global_variables_initializer()
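The article stops at the initialization op and does not show the session loop itself. The sketch below is one possible end-to-end version, written under several assumptions: it goes through tf.compat.v1 (on a plain 1.x install `tf1.` would be `tf.`), it replaces tf.layers.dense with a hypothetical `dense` helper built on tf.get_variable, and the training data (samples around 10.0), batch size and step count are made up for illustration.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # assumption: the installed release ships tf.compat.v1


def dense(x, units, name, activation=None):
    """Hypothetical stand-in for tf.layers.dense, built on tf.get_variable."""
    with tf1.variable_scope(name):
        w = tf1.get_variable("kernel", shape=(int(x.shape[-1]), units))
        b = tf1.get_variable(
            "bias", shape=(units,), initializer=tf1.zeros_initializer()
        )
    out = tf.matmul(x, w) + b
    return activation(out) if activation is not None else out


def generator(z):
    with tf1.variable_scope("generator"):
        fc1 = dense(z, 64, "fc1", tf.nn.elu)
        fc2 = dense(fc1, 64, "fc2", tf.nn.elu)
        return dense(fc2, 1, "G")


def discriminator(x, reuse=False):
    with tf1.variable_scope("discriminator", reuse=reuse):
        fc1 = dense(x, 32, "fc1", tf.nn.elu)
        return dense(fc1, 1, "D")


def bce(logits, labels):
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    )


graph = tf.Graph()
with graph.as_default():
    real_input = tf1.placeholder(tf.float32, shape=(None, 1))
    input_noise = tf1.placeholder(tf.float32, shape=(None, 100))
    D_real = discriminator(real_input)
    D_fake = discriminator(generator(input_noise), reuse=True)
    D_loss = bce(D_real, tf.ones_like(D_real)) + bce(D_fake, tf.zeros_like(D_fake))
    G_loss = bce(D_fake, tf.ones_like(D_fake))
    D_vars = tf1.trainable_variables(scope="discriminator")
    G_vars = tf1.trainable_variables(scope="generator")
    train_D = tf1.train.AdamOptimizer(1e-5).minimize(D_loss, var_list=D_vars)
    train_G = tf1.train.AdamOptimizer(1e-5).minimize(G_loss, var_list=G_vars)
    init_op = tf1.global_variables_initializer()

losses = []
with tf1.Session(graph=graph) as sess:
    sess.run(init_op)
    for step in range(5):
        real = np.random.normal(10.0, 0.1, size=(32, 1)).astype(np.float32)
        noise = np.random.normal(size=(32, 100)).astype(np.float32)
        # Update D only (var_list=D_vars), then G only (var_list=G_vars).
        _, d_loss = sess.run(
            [train_D, D_loss], feed_dict={real_input: real, input_noise: noise}
        )
        _, g_loss = sess.run([train_G, G_loss], feed_dict={input_noise: noise})
        losses.append((d_loss, g_loss))
```

The two sess.run calls per step are what makes the training alternate: each train op only touches the variables passed in its var_list.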

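Pros/cons

As long as the graph is defined correctly and used inside a training loop and a session, the GAN above trains fine. From a software engineering point of view, though, a few things stand out:

tf.variable_scope changes the (full) name of the variables defined by tf.layers: the same call to a tf.layers.* method in a different variable scope defines a new set of variables under the new scope.
The boolean flag reuse can completely change the behavior of any call to tf.layers.* (define or reuse).
Every variable is global: variables defined through tf.get_variable (which tf.layers calls under the hood) are accessible from anywhere.
Defining sub-graphs is awkward: you cannot just call discriminator and get a new, completely independent discriminator, which is somewhat counterintuitive.
The return value of a sub-graph definition (a call to generator / discriminator) is only its output tensor, not something that carries all the information about the graph inside (you can backtrack from the output, but it is cumbersome).
Defining the variable initialization op is tedious (although tf.train.MonitoredSession and tf.train.MonitoredTrainingSession work around this).

All 6 points can be seen as drawbacks of building a GAN with TensorFlow 1.x.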
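As an illustration of the last point, the sketch below shows how tf.train.MonitoredSession takes care of variable initialization so that no explicit init op is needed. It is written against tf.compat.v1 (an assumption about the installed release; on 1.x the prefix would be `tf.`), and the variable `w` is a hypothetical example.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # assumption: the installed release ships tf.compat.v1

graph = tf.Graph()
with graph.as_default():
    w = tf1.get_variable(
        "w", shape=(), initializer=tf1.constant_initializer(2.0)
    )
    doubled = w * 2.0
    # No tf.global_variables_initializer() call: the monitored session
    # initializes the variables when it is created.
    with tf1.train.MonitoredSession() as sess:
        result = sess.run(doubled)

print(result)
```

Note that the monitored session finalizes the graph, so every op has to be defined before the session is opened.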

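A GAN in TensorFlow 2.x

As mentioned, TensorFlow 2.x removes tf.get_variable, tf.variable_scope and tf.layers and forces a move to a Keras-based approach. Next year, if we want to build a GAN with it, we will have to define the generator G and the discriminator D with tf.keras, which in effect gives us variable sharing for D for free.

Note: tf.layers will be gone next year, so it is best to start defining your models with tf.keras now; this is a necessary step in preparing for the transition to 2.x.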

def generator(input_shape):
    """generator network.
    Args:
        input_shape: the desired input shape (e.g.: (latent_space_size))
    Returns:
        G: The generator model
    """
    inputs = tf.keras.layers.Input(input_shape)
    net = tf.keras.layers.Dense(units=64, activation=tf.nn.elu, name="fc1")(inputs)
    net = tf.keras.layers.Dense(units=64, activation=tf.nn.elu, name="fc2")(net)
    net = tf.keras.layers.Dense(units=1, name="G")(net)
    G = tf.keras.Model(inputs=inputs, outputs=net)
    return G

def discriminator(input_shape):
    """discriminator network.
    Args:
        input_shape: the desired input shape (e.g.: (1,))
    Returns:
        D: the discriminator model
    """
    inputs = tf.keras.layers.Input(input_shape)
    net = tf.keras.layers.Dense(units=32, activation=tf.nn.elu, name="fc1")(inputs)
    net = tf.keras.layers.Dense(units=1, name="D")(net)
    D = tf.keras.Model(inputs=inputs, outputs=net)
    return D

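See the difference from the TensorFlow 1.x version? Here, generator and discriminator both return a tf.keras.Model, not just an output tensor.

In Keras, variables are shared simply by calling the same Keras layer or model more than once; there is no need to think about variable scopes as in TensorFlow 1.x. So here we only need to define one discriminator D and then call it twice.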

# Define the real input, a batch of values sampled from the real data 
real_input = tf.placeholder(tf.float32, shape=(None,1))

# Define the discriminator model
D = discriminator(real_input.shape[1:])

# Arbitrarily set the size of the noise prior vector
latent_space_size = 100
# Define the input noise shape and define the generator
input_noise = tf.placeholder(tf.float32, shape=(None,latent_space_size))
G = generator(input_noise.shape[1:])

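To repeat: here we do not need to define D_fake up front through a second, reuse-flagged function call as before, and we do not have to worry about variable sharing in advance while defining the graph.

Next come the loss functions for G and D: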
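The sharing behaviour described above can be checked in isolation. The sketch below is a minimal illustration, assuming a TensorFlow release with tf.keras available; it rebuilds the discriminator with the string activation "elu" instead of tf.nn.elu, and the `real` and `fake` arrays are made-up inputs.

```python
import numpy as np
import tensorflow as tf


def discriminator(input_shape):
    inputs = tf.keras.layers.Input(input_shape)
    net = tf.keras.layers.Dense(units=32, activation="elu", name="fc1")(inputs)
    net = tf.keras.layers.Dense(units=1, name="D")(net)
    return tf.keras.Model(inputs=inputs, outputs=net)


D = discriminator((1,))
n_before = len(D.trainable_variables)

# Calling the same model twice reuses the same variables.
real = np.ones((4, 1), dtype=np.float32)
fake = np.zeros((4, 1), dtype=np.float32)
out_real = D(real)
out_fake = D(fake)
n_after = len(D.trainable_variables)

# Calling the function again yields a new, independent discriminator.
D2 = discriminator((1,))
independent = D2.trainable_variables[0] is not D.trainable_variables[0]
```

The variable count stays the same before and after the two calls, while a second call to `discriminator` gives a model with its own variables, which is exactly what the 1.x version could not do.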

D_real = D(real_input)
D_loss_real = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_real, labels=tf.ones_like(D_real))
)

G_z = G(input_noise)

D_fake = D(G_z)
D_loss_fake = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake, labels=tf.zeros_like(D_fake))
)

D_loss = D_loss_real + D_loss_fake

G_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake, labels=tf.ones_like(D_fake))
)

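Finally, we define the 2 optimizers that train D and G respectively. Because we are using tf.keras, we do not have to build the lists of variables to update by hand: the tf.keras.Model objects already carry them.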

# Define the optimizers and the train operations
train_D = tf.train.AdamOptimizer(1e-5).minimize(D_loss, var_list=D.trainable_variables)
train_G = tf.train.AdamOptimizer(1e-5).minimize(G_loss, var_list=G.trainable_variables)

So far we are still using a static graph, so we still have to define the variable initialization op:

init_op = tf.global_variables_initializer()

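Pros/cons

Moving from tf.layers to tf.keras: Keras has a counterpart for every tf.layers operation.
tf.keras.Model completely spares us the trouble of variable sharing and graph redefinition.
A tf.keras.Model is not a tensor but a complete model that carries its own variables.
Defining the variable initialization op is still tedious, but as noted earlier, tf.train.MonitoredSession works around it.

That is the first big difference between TensorFlow 1.x and 2.x. Next, let's look at the second: eager mode.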

Eager Execution

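Eager execution is an imperative programming environment in TensorFlow. It evaluates operations immediately without building a graph first: operations return concrete values instead of constructing a computational graph to run later. Its main advantages are:

An intuitive interface. Structure your code naturally and use Python data structures; iterate quickly on small models and small datasets.
Easier debugging. Call ops directly to inspect running models and test changes, and use standard Python debugging tools for immediate error reporting.
Natural control flow. Use Python control flow instead of graph control flow.

In short, with eager execution we no longer need to define a graph first and then evaluate it in a session. Python statements can control the structure of the model.

A typical example is tf.GradientTape, which arrived with eager execution. In graph mode, to compute the gradient of a function we first define a graph, from which TensorFlow knows how the nodes are connected, and then backtrack from the output to the graph's inputs, computing layer by layer until we get the final result.

In eager execution, though, the only way to compute a function's gradient by automatic differentiation is still to build a graph. We first use tf.GradientTape to record the operations applied to watchable elements (such as variables) and then compute the gradient. Below is an example from the tf.GradientTape documentation: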
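A minimal sketch of what this looks like in practice, assuming TensorFlow 2.x where eager execution is on by default (on 1.x one would have to enable it explicitly first). The matrix values and the threshold of 20.0 are arbitrary choices for illustration.

```python
import tensorflow as tf

# Assumption: TensorFlow 2.x, eager execution enabled by default.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)   # evaluated immediately, no session needed
value = y.numpy()     # concrete NumPy array: [[7, 10], [15, 22]]

# Plain Python control flow decides which ops are executed.
total = tf.constant(0.0)
for row in value:
    if row.sum() > 20.0:
        total += tf.reduce_sum(row)

print(value, float(total))
```

Only the second row passes the Python `if`, so only one reduce_sum op is ever run; nothing has to be declared up front.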

x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
dy_dx = g.gradient(y, x) # Will compute to 6.0
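To connect this with the GAN from the previous sections, here is a sketch of how one training step could look in eager mode, with tf.GradientTape taking the place of the optimizer's minimize call on a static graph. It is not the author's code: it assumes TensorFlow 2.x, uses tf.keras.optimizers.Adam, and the `make_mlp` and `bce` helpers, the batch size and the sample data are hypothetical.

```python
import tensorflow as tf


def make_mlp(input_dim, hidden, name):
    """Hypothetical helper: a tiny fully connected Keras model."""
    inputs = tf.keras.layers.Input((input_dim,))
    net = tf.keras.layers.Dense(hidden, activation="elu")(inputs)
    outputs = tf.keras.layers.Dense(1)(net)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name=name)


def bce(logits, labels):
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    )


latent_space_size = 100
G = make_mlp(latent_space_size, 64, "generator")
D = make_mlp(1, 32, "discriminator")
opt_D = tf.keras.optimizers.Adam(1e-5)
opt_G = tf.keras.optimizers.Adam(1e-5)


def train_step(real, noise):
    # Both tapes record the same forward pass; trainable variables
    # are watched automatically, so no explicit watch() is needed.
    with tf.GradientTape() as tape_D, tf.GradientTape() as tape_G:
        fake = G(noise, training=True)
        D_real = D(real, training=True)
        D_fake = D(fake, training=True)  # same model, shared variables
        D_loss = bce(D_real, tf.ones_like(D_real)) + bce(
            D_fake, tf.zeros_like(D_fake)
        )
        G_loss = bce(D_fake, tf.ones_like(D_fake))
    # Each gradient is taken only with respect to one model's variables.
    grads_D = tape_D.gradient(D_loss, D.trainable_variables)
    grads_G = tape_G.gradient(G_loss, G.trainable_variables)
    opt_D.apply_gradients(zip(grads_D, D.trainable_variables))
    opt_G.apply_gradients(zip(grads_G, G.trainable_variables))
    return D_loss, G_loss


real = tf.random.normal((32, 1), mean=10.0, stddev=0.1)
noise = tf.random.normal((32, latent_space_size))
D_loss, G_loss = train_step(real, noise)
```

Passing D.trainable_variables or G.trainable_variables to the tape plays the same role as var_list did in the static-graph version.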

In addition, control flow is expressed with Python statements (such as if statements and loops), unlike the tf.cond and tf.while_loop ops that static graphs require.
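The contrast between the two styles can be sketched as follows. The example assumes TensorFlow 2.x with eager execution, where tf.while_loop and tf.cond still exist and can be called directly; the loop bounds and the "big"/"small" labels are arbitrary.

```python
import tensorflow as tf

# Graph-style control flow: the loop and the branch are themselves ops.
_, graph_style = tf.while_loop(
    cond=lambda i, acc: i < 5,
    body=lambda i, acc: (i + 1, acc + i),
    loop_vars=(tf.constant(0), tf.constant(0)),
)
branch = tf.cond(
    graph_style > 5,
    lambda: tf.constant("big"),
    lambda: tf.constant("small"),
)

# Eager-style control flow: ordinary Python statements.
acc = tf.constant(0)
for i in range(5):
    acc += i
label = "big" if acc > 5 else "small"

print(int(graph_style), branch.numpy(), int(acc), label)
```

Both versions sum the integers 0 to 4 and pick the same branch; the second one reads like any other Python code.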

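Some time ago the TensorFlow team released a tool called AutoGraph, which converts plain Python code into complex graph code; in other words, it lets users write graphs directly in Python. The Python it accepts, however, is not really full Python (for example, you must define a function that returns a list of elements with specified TensorFlow data types), so it cannot exploit the full power of the language.

Personally, I am not very fond of eager execution, because I am used to static graphs, and this change feels a bit like a clumsy imitation of PyTorch. As for the other changes, I cover them briefly below in question-and-answer form.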
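For readers who want to see the idea in code: in the TensorFlow 2.x releases that followed this article, AutoGraph is applied through the tf.function decorator. The sketch below assumes such a release, and the `collatz_steps` function is a hypothetical example chosen only because it needs both a loop and a branch on tensor values.

```python
import tensorflow as tf


@tf.function  # assumption: TF 2.x, where tf.function applies AutoGraph
def collatz_steps(n):
    steps = tf.constant(0)
    while n > 1:            # converted to a graph while-loop
        if n % 2 == 0:      # converted to a graph conditional
            n = n // 2
        else:
            n = 3 * n + 1
        steps += 1
    return steps


result = collatz_steps(tf.constant(6))
print(int(result))
```

The function body is ordinary Python syntax, but because `n` is a tensor, the `while` and `if` are rewritten into graph control flow when the function is traced. The restrictions the author mentions show up here too: both branches must assign `n` a value of the same dtype and shape.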

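Q&A

Below are some questions I expect to be common when moving from TensorFlow 1.x to TensorFlow 2.0.

Q: What if my project depends on tf.contrib?

You can either install a new Python package with pip, or switch from tf.contrib.something to tf.something where the module has been renamed.

Q: What if something that works in TensorFlow 1.x no longer runs in 2.x?

Errors like this should not occur. Double-check that your code was converted correctly, and read the bug reports on GitHub.

Q: My project works fine with a static graph but breaks as soon as I switch to eager execution. What should I do?

I ran into this problem too and do not yet know the exact cause. For now, I suggest not using eager execution.

Q: A tf. function seems to be missing in TensorFlow 2.x. What should I do?

The function has most likely just been moved somewhere else. In TensorFlow 1.x many functions were duplicated or aliased; TensorFlow 2.x consolidates and prunes them and moves some to new locations. The RFC "TensorFlow Namespaces" lists every function that will be added, removed or moved. The conversion tool that is about to be released should also help you adapt to this change.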
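One concrete case of a moved function, as a hedged illustration: the 1.x alias tf.log is reachable in 2.x as tf.math.log, with the old endpoint kept under tf.compat.v1. The sketch assumes a release that ships both tf.math and tf.compat.v1; the input values are arbitrary.

```python
import tensorflow as tf

x = tf.constant([1.0, 10.0, 100.0])

# 1.x spelling: tf.log(x). In 2.x the op lives in the tf.math namespace,
# and the legacy endpoint is kept under tf.compat.v1.
natural_log = tf.math.log(x)
legacy_log = tf.compat.v1.log(x)

same = bool(tf.reduce_all(natural_log == legacy_log))
```

The conversion script that was later shipped with TensorFlow, tf_upgrade_v2, rewrites many such renames automatically, though its output still deserves a manual review.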

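Summary

By now readers should have a rough picture of TensorFlow 2.x and be mentally prepared for it. Overall, just as most products go through iterations, I think TensorFlow 2.x will be a clear improvement over TensorFlow 1.x. Finally, let's look back at TensorFlow's development timeline and recall what it has brought us over the past three years.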
