
AI-030: Google Machine Learning Course (ML Crash Course with TensorFlow APIs) Notes 6-7 - Hands-on TF Linear Regression, Synthetic Features, and Outlier Removal

These are my study notes for Google's machine learning course (ML Crash Course with TensorFlow APIs). Course home:

https://developers.google.com/machine-learning/crash-course/ml-intro

6. First Steps with TensorFlow

Code and walkthrough:

https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb

Setup:

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
# Load the data into a pandas DataFrame
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

# Randomize the order of the rows so that SGD behaves well
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

# Scale median_house_value down by a factor of 1,000 to make the data easier to work with
california_housing_dataframe["median_house_value"] /= 1000.0

The data after shuffling and scaling:

# Print summary statistics for a quick overview of the data
california_housing_dataframe.describe()

# Start with total_rooms as the single input feature
my_feature = california_housing_dataframe[["total_rooms"]]

# total_rooms holds numeric data, so define it as a TF numeric feature column
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# The target (label) is median_house_value
targets = california_housing_dataframe["median_house_value"]

feature_columns:
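For reference, printing feature_columns shows a list containing a single numeric column. The repr in the comment below is indicative of what TF 1.x prints and is shown only for illustration:

print(feature_columns)
# e.g. [_NumericColumn(key='total_rooms', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]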

# Use gradient descent as the optimizer, with a learning rate of 0.0000001. This rate is very low; it can be raised gradually to tune the model.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Gradient clipping keeps the gradient magnitude from growing so large during training that gradient descent fails.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regressor with the feature columns and optimizer defined above
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)

Next, define the input function, which packages the pandas data into a TF Dataset for training:

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets (labels)
      batch_size: size of the batches to be passed to the model
      shuffle: True / False. Whether to shuffle the data.
      num_epochs: number of epochs for which the data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for the next data batch
    """

    # Convert the pandas DataFrame into a dict of np arrays. Result: {'total_rooms': array([1410., 2046., 2987., ..., 2478., 9882., 1923.])}
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a TF dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)  # number of examples in the shuffle buffer

    # Return the next batch of features/labels.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
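As a quick sanity check (illustrative; not part of the course code), you can pull a single batch from the input function and inspect it in a TF 1.x session:

# Build one batch of size 2 and evaluate it via a session.
with tf.Session() as sess:
    features_batch, labels_batch = my_input_fn(my_feature, targets, batch_size=2)
    print(sess.run([features_batch, labels_batch]))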

Train the model:

_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)

Evaluate the trained model:

# Create an input function for predictions. Since we make just one prediction per example, there is no need to repeat or shuffle the data: num_epochs=1, shuffle=False
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the trained model
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

# Format the predictions as a NumPy array so we can compute error metrics
predictions = np.array([item['predictions'][0] for item in predictions])

# Compute the mean squared error (MSE) and root mean squared error (RMSE)
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)

# Compute the min, max, and range of the target values
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

As you can see, the model performs poorly so far: the RMSE is huge, spanning nearly half the range of the target values.
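To make "nearly half the range" concrete, here is a minimal sketch comparing the RMSE against the spread of the targets, using the values computed above:

print("RMSE as a fraction of the target range: %0.1f%%"
      % (100.0 * root_mean_squared_error / min_max_difference))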

How can we train a better model?

A first step is to look at how the predictions deviate from the targets:

calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

These statistics also confirm how poor the model currently is.

We can also visualize how well the model fits (since there is only one feature, a 2D plot suffices):

# Sample 300 examples for plotting
sample = california_housing_dataframe.sample(n=300)

# Get the min and max of the feature in the sample
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the trained weight and bias from the model
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Compute the predicted values at the min and max feature values
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias

# Plot the line between the two points; this is the geometric representation of the current model
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the axes
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot the sample points (x = feature value, y = label)
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display the plot
plt.show()

As you can see, the model (red line) fits the samples (blue dots) very poorly.

Raising the learning rate to 0.001 improves the fit somewhat.
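Here is a minimal sketch of that re-run, rebuilding only the optimizer and regressor with the higher rate; everything else is unchanged from the code above:

my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)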

The plots also make it clear that a single feature cannot produce a good predictive model.

Finally, combine all of the code above into a single function:

def train_model(learning_rate, steps, batch_size, input_feature):
    """Trains a linear regression model.

    Args:
      learning_rate: A `float`, the learning rate.
      steps: A non-zero `int`, the total number of training steps. A training step
        consists of a forward and backward pass using a single batch.
      batch_size: A non-zero `int`, the batch size.
      input_feature: A `string` specifying a column from `california_housing_dataframe`
        to use as input feature.

    Returns:
      A Pandas `DataFrame` containing targets and the corresponding predictions done
      after training the model.
    """

    periods = 10
    steps_per_period = steps / periods

    my_feature = input_feature
    my_feature_data = california_housing_dataframe[[my_feature]].astype('float32')
    my_label = "median_house_value"
    targets = california_housing_dataframe[my_label].astype('float32')

    # Create input functions.
    training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
    predict_training_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)

    # Create feature columns.
    feature_columns = [tf.feature_column.numeric_column(my_feature)]

    # Create a linear regressor object.
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        optimizer=my_optimizer
    )

    # Set up to plot the state of our model's line each period.
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title("Learned Line by Period")
    plt.ylabel(my_label)
    plt.xlabel(my_feature)
    sample = california_housing_dataframe.sample(n=300)
    plt.scatter(sample[my_feature], sample[my_label])
    colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]

    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("RMSE (on training data):")
    root_mean_squared_errors = []
    for period in range(0, periods):
        # Train the model, starting from the prior state.
        linear_regressor.train(
            input_fn=training_input_fn,
            steps=steps_per_period,
        )
        # Take a break and compute predictions.
        predictions = linear_regressor.predict(input_fn=predict_training_input_fn)
        predictions = np.array([item['predictions'][0] for item in predictions])

        # Compute loss.
        root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(predictions, targets))
        # Occasionally print the current loss.
        print("  period %02d : %0.2f" % (period, root_mean_squared_error))
        # Add the loss metrics from this period to our list.
        root_mean_squared_errors.append(root_mean_squared_error)
        # Finally, track the weights and biases over time.
        # Apply some math to ensure that the data and line are plotted neatly.
        y_extents = np.array([0, sample[my_label].max()])

        weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]
        bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

        x_extents = (y_extents - bias) / weight
        x_extents = np.maximum(np.minimum(x_extents,
                                          sample[my_feature].max()),
                               sample[my_feature].min())
        y_extents = weight * x_extents + bias
        plt.plot(x_extents, y_extents, color=colors[period])
    print("Model training finished.")

    # Output a graph of loss metrics over periods.
    plt.subplot(1, 2, 2)
    plt.ylabel('RMSE')
    plt.xlabel('Periods')
    plt.title("Root Mean Squared Error vs. Periods")
    plt.tight_layout()
    plt.plot(root_mean_squared_errors)

    # Create a table with calibration data.
    calibration_data = pd.DataFrame()
    calibration_data["predictions"] = pd.Series(predictions)
    calibration_data["targets"] = pd.Series(targets)
    display.display(calibration_data.describe())

    print("Final RMSE (on training data): %0.2f" % root_mean_squared_error)

    return calibration_data
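A typical call looks like this; the hyperparameter values below are illustrative starting points, not tuned results:

calibration_data = train_model(
    learning_rate=0.00002,
    steps=500,
    batch_size=5,
    input_feature="total_rooms")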

7. Synthetic Features and Outliers

A synthetic feature lets one computed feature carry the information of several raw features, reducing the feature count and making the model more efficient; removing outliers improves the model's accuracy.

Add a feature rooms_per_person, computed from two existing features as total_rooms / population, and train the model on the new feature:

california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] / california_housing_dataframe["population"])

calibration_data = train_model(
    learning_rate=0.05,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")

Notice that the RMSE actually increased in the last training period, which suggests we overshot the loss minimum; the learning rate (currently 0.05) should be reduced slightly.
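One way to test that hypothesis is to re-run the training with a slightly lower rate (0.04 here is just an illustrative value):

calibration_data = train_model(
    learning_rate=0.04,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")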

Next, deal with the outliers. Start by plotting the data to see where they lie:

# Analyze the outliers
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
# Scatter plot of predictions (x) vs. targets (y)
plt.scatter(calibration_data["predictions"], calibration_data["targets"])

plt.subplot(1, 2, 2)
# Histogram of the rooms_per_person feature
_ = california_housing_dataframe["rooms_per_person"].hist()

plt.show()

The plots reveal how the outliers are distributed: on the left, predictions (x) vs. targets (y); on the right, the histogram of rooms_per_person. Most values fall between 0 and 5, so values greater than 5 can be filtered out.

Remove the outliers:

# Clip the outliers
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["rooms_per_person"]).apply(lambda x: min(x, 5))
# Plot the rooms_per_person histogram again; values greater than 5 have been clipped away
_ = california_housing_dataframe["rooms_per_person"].hist()

# Retrain the model
calibration_data = train_model(
    learning_rate=0.05,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")
# Scatter plot of predictions (x) vs. targets (y)
_ = plt.scatter(calibration_data["predictions"], calibration_data["targets"])

Now all the feature values fall between 0 and 5.
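As an aside, pandas has a built-in equivalent for this kind of capping; the following alternative (not the course's code) produces the same result as the apply above:

# Cap rooms_per_person at 5 using Series.clip instead of apply
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["rooms_per_person"].clip(upper=5))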

Training results after removing the outliers: