
Using L2 regularization in TensorFlow to fix overfitting

How L2 regularization works:

Add the sum of squares of the parameters w to the loss. Training then keeps the values of w small; small weights give a smoother fitted curve, which reduces overfitting. The reference formula is given below.
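(The figure from the original post is not reproduced here; this is the standard form of the objective, with L_{CE} the cross-entropy loss and \lambda the wd factor passed to weight_variable in the code below.)

L_{total} = L_{CE} + \frac{\lambda}{2}\sum_i w_i^2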

Regularization does not prevent you from fitting the curve, and the parameters are not all suppressed blindly. It is a dynamic process, a tug-of-war between cross_entropy and the L2 loss: training fits a reasonable w, regularization pushes w back down, and the two balance out. The irrelevant w_i keep shrinking (though they never reach exactly zero), while the useful w_i stay in a reasonable range. I will not go through more of the theory and derivations here.
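One way to see this balance, in standard gradient-descent notation (learning rate \eta), is the gradient of the combined loss with respect to a single weight:

\frac{\partial L_{total}}{\partial w_i} = \frac{\partial L_{CE}}{\partial w_i} + \lambda w_i
\qquad\Longrightarrow\qquad
w_i \leftarrow w_i - \eta\left(\frac{\partial L_{CE}}{\partial w_i} + \lambda w_i\right)

For a weight that barely affects the fit, the cross-entropy term is near zero and each update shrinks w_i toward zero; for a useful weight, the two terms settle at a moderate value.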

We run MNIST classification training and compare plain cross_entropy against total_loss with the l2 regularization term added.

Because MNIST is not a complex problem to begin with, we cannot stack too many CONV layers before the FC layers; the model then performs too well and the gap becomes hard to see. To show the effect of the l2 norm I keep only one CONV layer (note that FC1 takes h_pool1 as its input, bypassing conv2); the two-conv version serves as a control group.

Because of limited machine performance, the first 1000 training samples are used directly as the validation set and the first 1000 test samples as the test set.

About the code: it is a basic CONV + FC network that predicts labels for the images, measures performance with cross_entropy, and trains on that. Both cross_entropy and the l2 losses are pushed into the collection 'losses':

tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)

total_loss = tf.add_n(tf.get_collection('losses')) then pulls every loss out of the collection and sums them; training on total_loss realizes exactly the regularized objective shown above.
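Distilled to its essentials, the pattern used in the full listing below is:

tf.add_to_collection('losses', weight_decay)       # inside weight_variable(), one term per weight
tf.add_to_collection('losses', cross_entropy)       # the base loss
total_loss = tf.add_n(tf.get_collection('losses'))
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)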

The complete code:


from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# MNIST handwritten digits, 0 to 9
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # y_pre and v_ys are already numpy arrays here, so no placeholders need feeding
    result = sess.run(accuracy)
    return result

def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)

    if wd is not None:
        print('adding weight decay', wd, 'for shape', shape)
        # the penalty must be taken on the Variable itself; l2_loss(initial) would only
        # penalize the random initializer tensor and never constrain the trained weights
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)

    return var

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784])/255.   # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape)  # [n_samples, 28,28,1]

## conv1 layer ##
W_conv1 = weight_variable([5,5, 1,32], 0.) # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                                         # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5,5, 32, 64], 0.) # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                                         # output size 7x7x64

###############################################################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)#do not use conv2
#W_fc1 = weight_variable([7*7*64, 1024], wd = 0.00)#use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])#do not use conv2
#h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])#use conv2
##################################################################################################################



h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd = 0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)


# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))       # loss

tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() is no longer valid
# since 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
    if i % 100 == 0:
        print('train accuracy',compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy',compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))


The training runs are shown below.

No dropout, no l2 norm, 1000 training steps:

weight_variable([1024, 10], wd = 0.)

Train accuracy is clearly better than test accuracy at every step: overfitting!

train accuracy 0.094
test accuracy 0.089
train accuracy 0.892
test accuracy 0.874
train accuracy 0.91
test accuracy 0.893
train accuracy 0.925
test accuracy 0.925
train accuracy 0.945
test accuracy 0.935
train accuracy 0.954
test accuracy 0.944
train accuracy 0.961
test accuracy 0.951
train accuracy 0.965
test accuracy 0.955
train accuracy 0.964
test accuracy 0.959
train accuracy 0.962
test accuracy 0.956

No dropout, l2 norm on the FC layers with a weight decay factor of 0.004, 1000 training steps:

weight_variable([1024, 10], wd = 0.004)

The overfitting is noticeably reduced; sometimes the test set even beats the training set (with validation sets this small, this only shows the rough trend).

train accuracy 0.107
test accuracy 0.145
train accuracy 0.876
test accuracy 0.861
train accuracy 0.91
test accuracy 0.909
train accuracy 0.923
test accuracy 0.919
train accuracy 0.931
test accuracy 0.927
train accuracy 0.936
test accuracy 0.939
train accuracy 0.956
test accuracy 0.949
train accuracy 0.958
test accuracy 0.954
train accuracy 0.947
test accuracy 0.95
train accuracy 0.947
test accuracy 0.953

Control group: no l2 regularization, dropout only. Overfitting is also reduced.

W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)
W_fc2 = weight_variable([1024, 10], wd = 0.)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
train accuracy 0.132
test accuracy 0.104
train accuracy 0.869
test accuracy 0.859
train accuracy 0.898
test accuracy 0.889
train accuracy 0.917
test accuracy 0.906
train accuracy 0.923
test accuracy 0.917
train accuracy 0.928
test accuracy 0.925
train accuracy 0.938
test accuracy 0.94
train accuracy 0.94
test accuracy 0.942
train accuracy 0.947
test accuracy 0.941
train accuracy 0.944
test accuracy 0.947

Control group: two conv layers. The model barely overfits to begin with, so the results are omitted.

Another approach: the regularizer API

You can also add the regularization term directly to the loss expression and train on that:

reg_scale = 0.004  # regularization strength; "lambda" is a Python keyword, so use another name
loss = tf.reduce_mean(tf.square(y_ - y)) + tf.contrib.layers.l2_regularizer(reg_scale)(w)
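Since tf.contrib.layers.l2_regularizer(scale) simply returns a function of a weight tensor, it can be applied to each weight separately and the results summed into the loss. A minimal sketch, reusing the FC weights from the listing above (y_ and y stand for labels and predictions, as in the one-liner):

reg = tf.contrib.layers.l2_regularizer(0.004)
base_loss = tf.reduce_mean(tf.square(y_ - y))
loss = base_loss + reg(W_fc1) + reg(W_fc2)   # one penalty term per weight matrix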

Let us first test the regularizer op on its own (I will not repeat the code that adds it to the loss; just substitute it into the code above):

import tensorflow as tf
CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])
with tf.Session() as sess:
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w))) #11 * CONST_SCALE

    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))#default
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w))) #19.5 * CONST_SCALE

----------------------------------------

[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25.  4.]
 [ 9.  1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75

Note: the L2 regularizer pre-processes the weights as the sum of squares divided by 2. The 1/2 is just a convenience factor, the same convention used in the loss formula above.
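In other words, consistent with the numbers printed above (11 × 0.5 = 5.5 and 19.5 × 0.5 = 9.75), the two ops compute:

l1\_regularizer(s)(w) = s\sum_i |w_i|, \qquad l2\_regularizer(s)(w) = \frac{s}{2}\sum_i w_i^2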

In practice, for a complicated model it is more convenient to throw the base loss and all the regularization terms into a collection than to write the formula out by hand, especially since different weights usually need different decay coefficients; spelling all of that out in a single expression quickly becomes unwieldy.
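For example, with the weight_variable helper defined earlier, giving each layer its own decay strength is just a matter of passing a different wd (the factors below are made up for illustration):

W_fc1 = weight_variable([14*14*32, 1024], wd=0.004)   # heavier decay on the large FC layer
W_fc2 = weight_variable([1024, 10], wd=0.001)         # lighter decay on the output layer
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))    # cross_entropy plus every weight-decay term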