
TensorFlow Study Notes (13): A Summary of Commonly Used TensorFlow Optimizers

This post is mainly about the various optimizers and how to use them. Most machine learning tasks come down to minimizing a loss; once the loss is defined, the remaining work is handed off to an optimizer.

Since deep learning commonly optimizes by following gradients, the optimizers below are, in the end, all variations and refinements of the gradient descent algorithm.

Commonly used optimizer classes

Ⅰ.class tf.train.Optimizer

The base class of the optimizer classes. It defines the API for adding an op that trains a model. You will basically never use this class directly; instead you use one of its subclasses, such as GradientDescentOptimizer, AdagradOptimizer, MomentumOptimizer, and so on.
The GradientDescentOptimizer class is covered in some detail below; for the other classes only the constructor is described, since the remaining methods are much the same across subclasses.

Ⅱ.class tf.train.GradientDescentOptimizer

This class implements the gradient descent algorithm as an optimizer. (As the theory would suggest, its constructor only needs a learning rate.)

__init__(learning_rate, use_locking=False, name='GradientDescent')

Purpose: create a gradient descent optimizer object.
Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for the update operations.
name: optional name, defaults to "GradientDescent".

compute_gradients(loss, var_list=None, gate_gradients=GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)

Purpose: compute the gradients of the loss with respect to the variables in var_list. It returns a list of (gradient, variable) pairs, where each gradient is the gradient for the corresponding variable. This is the first half of minimize().
Args:
loss: the Tensor value to be minimized.
var_list: the variables to compute gradients for; defaults to the variables in GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
grad_loss: Optional. A Tensor holding the gradient computed for loss.
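
As a quick illustration (a minimal sketch of my own, assuming TensorFlow 1.x; the toy model below is made up and not from the original post), compute_gradients() can be run on its own to inspect the gradients before they are applied:

import tensorflow as tf

# A toy model: one trainable weight and a quadratic loss.
w = tf.Variable(3.0, name="w")
loss = tf.square(w - 1.0)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# A list of (gradient, variable) pairs, here just [(dloss/dw, w)].
grads_and_vars = optimizer.compute_gradients(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for grad, var in grads_and_vars:
        # dloss/dw = 2 * (w - 1) = 4.0 at w = 3.0
        print(var.name, sess.run(grad))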

apply_gradients(grads_and_vars, global_step=None, name=None)

Purpose: apply the gradients to the variables, i.e. perform the gradient descent update on them. This is the second half of minimize(). It returns an op that applies the gradients.
Args:
grads_and_vars: the list of (gradient, variable) pairs returned by compute_gradients().
global_step: Optional Variable to increment by one after the variables have been updated.
name: optional name for the returned operation.
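
Splitting compute_gradients() and apply_gradients() is mainly useful when the gradients should be modified before they are applied. Continuing the sketch above (still my own TF 1.x sketch, not from the original post), a common pattern is gradient clipping:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = optimizer.compute_gradients(loss)

# Clip each gradient into [-1, 1] before applying it.
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]

# global_step is incremented by one each time the update op runs.
global_step = tf.Variable(0, trainable=False, name="global_step")
train_op = optimizer.apply_gradients(clipped, global_step=global_step)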

get_name(): returns the name of the optimizer (the name passed to the constructor).

minimize(loss, global_step=None, var_list=None, gate_gradients=GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, name=None, grad_loss=None)

Purpose: the method you will use most often. It minimizes loss by updating the variables in var_list, and it is simply compute_gradients() followed by apply_gradients(), as the sketch below shows.
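
A sketch of the equivalence (assuming a loss tensor is already defined, as in the sketches above):

# The usual one-liner ...
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

# ... is roughly this two-step sequence:
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_op = opt.apply_gradients(opt.compute_gradients(loss))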

Ⅲ.class tf.train.AdadeltaOptimizer

An optimizer that implements the Adadelta algorithm, which can be regarded as an improved version of the Adagrad algorithm described below.

Constructor:
tf.train.AdadeltaOptimizer.__init__(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')

Purpose: construct a new optimizer that uses the Adadelta algorithm.
Args:
learning_rate: A Tensor or a floating point value. The learning rate.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the grad update.
use_locking: If True use locks for update operations.
name: optional name prefix for the operations, defaults to "Adadelta".
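
For example (a sketch; the values are simply the defaults listed above, and loss is assumed to be defined):

train_op = tf.train.AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-08).minimize(loss)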

Ⅳ.class tf.train.AdagradOptimizer

Optimizer that implements the Adagrad algorithm.

See this paper.
tf.train.AdagradOptimizer.__init__(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')

Construct a new Adagrad optimizer.
Args:

learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad".

Raises:

ValueError: If the initial_accumulator_value is invalid.
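
Usage follows the same pattern as the other optimizers (a sketch; loss is assumed to be defined, and initial_accumulator_value must be positive):

train_op = tf.train.AdagradOptimizer(learning_rate=0.01, initial_accumulator_value=0.1).minimize(loss)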

The Optimizer base class provides methods to compute gradients for a loss and apply gradients to variables. A collection of subclasses implement classic optimization algorithms such as GradientDescent and Adagrad.

You never instantiate the Optimizer class itself, but instead instantiate one of the subclasses.

Ⅴ.class tf.train.MomentumOptimizer

Optimizer that implements the Momentum algorithm.

tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)

Construct a new Momentum optimizer.

Args:

learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".
use_nesterov: If True, use Nesterov momentum.
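
A sketch of both the classic and the Nesterov variant (loss assumed defined; the hyperparameter values are illustrative):

# Classic momentum
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
# Nesterov momentum
# optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9, use_nesterov=True)
train_op = optimizer.minimize(loss)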

Ⅵ.class tf.train.AdamOptimizer

An optimizer that implements the Adam algorithm.
Constructor:
tf.train.AdamOptimizer.__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')

Construct a new Adam optimizer.

Initialization:

m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
The update rule for a variable with gradient g uses an optimization described at the end of section 2 of the paper:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

Note that in the dense implementation of this algorithm, m_t, v_t and the variable are updated even if g is zero, whereas in the sparse implementation they are not updated in iterations where g is zero.

Args:

learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.
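
To make the update rule above concrete, here is a small NumPy sketch of a single Adam step for one scalar parameter (my own illustration following the formulas listed above, not TensorFlow code):

import numpy as np

def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update for a single parameter, following the rule above.
    t = t + 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g        # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment estimate
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# First step from m = v = 0, t = 0 with gradient g = 4.0:
w, m, v, t = adam_step(3.0, 4.0, 0.0, 0.0, 0)
print(w, m, v, t)  # the first step moves w by roughly learning_rate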

3. Examples

I. Linear regression

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Prepare train data
train_X = np.linspace(-1, 1, 100)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.33 + 10

# Define the model
X = tf.placeholder("float")
Y = tf.placeholder("float")
w = tf.Variable(0.0, name="weight")
b = tf.Variable(0.0, name="bias")
loss = tf.square(Y - X*w - b)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Create session to run
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize_all_variables() is deprecated

    epoch = 1
    for i in range(10):
        for (x, y) in zip(train_X, train_Y):
            _, w_value, b_value = sess.run([train_op, w, b], feed_dict={X: x, Y: y})
        print("Epoch: {}, w: {}, b: {}".format(epoch, w_value, b_value))
        epoch += 1


# Draw the training data and the fitted line
plt.plot(train_X, train_Y, "+")
plt.plot(train_X, train_X.dot(w_value) + b_value)
plt.show()

Result:
(screenshots omitted; the script prints the learned w and b after each epoch and plots the fitted line over the training points)

From here you can plug in more of the optimizers above to compare how each one performs and how its hyperparameters need to be tuned; a sketch of the one-line change follows.
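
For instance (a minimal sketch; the learning rates are illustrative and usually need retuning per optimizer), only the train_op line in the script above has to change:

# Any one of these can replace the GradientDescentOptimizer line:
train_op = tf.train.MomentumOptimizer(0.01, momentum=0.9).minimize(loss)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
train_op = tf.train.AdadeltaOptimizer(0.1).minimize(loss)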