
AI-030: Google Machine Learning Course (ML Crash Course with TensorFlow APIs) Notes 6-7 - Hands-on TF Linear Regression, Synthetic Features, and Outlier Removal

These are my study notes for Google's machine learning course (ML Crash Course with TensorFlow APIs). Course home:

https://developers.google.com/machine-learning/crash-course/ml-intro

6. First Steps with TensorFlow

Code and walkthrough:

https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb

Setup:

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
# Load the data into a pandas DataFrame
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

# Randomize the order of the rows so that SGD behaves well
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

# Scale median_house_value down by a factor of 1,000 to make the data easier to work with
california_housing_dataframe["median_house_value"] /= 1000.0

The data after shuffling and scaling:

# Print summary statistics for a quick overview of the data
california_housing_dataframe.describe()

# Start with total_rooms as the single input feature
my_feature = california_housing_dataframe[["total_rooms"]]

# total_rooms holds numeric data, so define it as a TF numeric feature column
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# The target (label) is median_house_value
targets = california_housing_dataframe["median_house_value"]

feature_columns:
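For reference, printing feature_columns shows a list containing a single numeric column. The repr in the comment below is indicative of what TF 1.x prints and is shown only for illustration:

print(feature_columns)
# e.g. [_NumericColumn(key='total_rooms', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]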

# Use gradient descent as the optimizer, with a learning rate of 0.0000001. This rate is very low; it can be raised gradually to tune the model.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Gradient clipping keeps the gradient magnitude from growing so large during training that gradient descent fails.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regressor with the feature columns and optimizer defined above
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)

Next, define the input function, which packages the pandas data into a TF Dataset for training:

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets (labels)
      batch_size: size of the batches to be passed to the model
      shuffle: True / False. Whether to shuffle the data.
      num_epochs: number of epochs for which the data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for the next data batch
    """

    # Convert the pandas DataFrame into a dict of np arrays. Result: {'total_rooms': array([1410., 2046., 2987., ..., 2478., 9882., 1923.])}
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a TF dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)  # number of examples in the shuffle buffer

    # Return the next batch of features/labels.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
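As a quick sanity check (illustrative; not part of the course code), you can pull a single batch from the input function and inspect it in a TF 1.x session:

# Build one batch of size 2 and evaluate it via a session.
with tf.Session() as sess:
    features_batch, labels_batch = my_input_fn(my_feature, targets, batch_size=2)
    print(sess.run([features_batch, labels_batch]))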

Train the model:

_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)

Evaluate the trained model:

# Create an input function for predictions. Since we make just one prediction per example, there is no need to repeat or shuffle the data: num_epochs=1, shuffle=False
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the trained model
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

# Format the predictions as a NumPy array so we can compute error metrics
predictions = np.array([item['predictions'][0] for item in predictions])

# Compute the mean squared error (MSE) and root mean squared error (RMSE)
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)

# Compute the min, max, and range of the target values
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

As you can see, the model performs poorly so far: the RMSE is huge, spanning nearly half the range of the target values.
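To make "nearly half the range" concrete, here is a minimal sketch comparing the RMSE against the spread of the targets, using the values computed above:

print("RMSE as a fraction of the target range: %0.1f%%"
      % (100.0 * root_mean_squared_error / min_max_difference))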

How can we train a better model?

A first step is to look at how the predictions deviate from the targets:

calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

These statistics also confirm how poor the model currently is.

We can also visualize how well the model fits (since there is only one feature, a 2D plot suffices):

# Sample 300 examples for plotting
sample = california_housing_dataframe.sample(n=300)

# Get the min and max of the feature in the sample
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the trained weight and bias from the model
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Compute the predicted values at the min and max feature values
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias

# Plot the line between the two points; this is the geometric representation of the current model
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the axes
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot the sample points (x = feature value, y = label)
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display the plot
plt.show()

As you can see, the model (red line) fits the samples (blue dots) very poorly.

Raising the learning rate to 0.001 improves the fit somewhat.
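Here is a minimal sketch of that re-run, rebuilding only the optimizer and regressor with the higher rate; everything else is unchanged from the code above:

my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)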

The plots also make it clear that a single feature cannot produce a good predictive model.

Finally, combine all of the code above into a single function:

def train_model(learning_rate, steps, batch_size, input_feature):
    """Trains a linear regression model.

    Args:
      learning_rate: A `float`, the learning rate.
      steps: A non-zero `int`, the total number of training steps. A training step
        consists of a forward and backward pass using a single batch.
      batch_size: A non-zero `int`, the batch size.
      input_feature: A `string` specifying a column from `california_housing_dataframe`
        to use as input feature.

    Returns:
      A Pandas `DataFrame` containing targets and the corresponding predictions done
      after training the model.
    """

    periods = 10
    steps_per_period = steps / periods

    my_feature = input_feature
    my_feature_data = california_housing_dataframe[[my_feature]].astype('float32')
    my_label = "median_house_value"
    targets = california_housing_dataframe[my_label].astype('float32')

    # Create input functions.
    training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
    predict_training_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)

    # Create feature columns.
    feature_columns = [tf.feature_column.numeric_column(my_feature)]

    # Create a linear regressor object.
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        optimizer=my_optimizer
    )

    # Set up to plot the state of our model's line each period.
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title("Learned Line by Period")
    plt.ylabel(my_label)
    plt.xlabel(my_feature)
    sample = california_housing_dataframe.sample(n=300)
    plt.scatter(sample[my_feature], sample[my_label])
    colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]

    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("RMSE (on training data):")
    root_mean_squared_errors = []
    for period in range(0, periods):
        # Train the model, starting from the prior state.
        linear_regressor.train(
            input_fn=training_input_fn,
            steps=steps_per_period,
        )
        # Take a break and compute predictions.
        predictions = linear_regressor.predict(input_fn=predict_training_input_fn)
        predictions = np.array([item['predictions'][0] for item in predictions])

        # Compute loss.
        root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(predictions, targets))
        # Occasionally print the current loss.
        print("  period %02d : %0.2f" % (period, root_mean_squared_error))
        # Add the loss metrics from this period to our list.
        root_mean_squared_errors.append(root_mean_squared_error)
        # Finally, track the weights and biases over time.
        # Apply some math to ensure that the data and line are plotted neatly.
        y_extents = np.array([0, sample[my_label].max()])

        weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]
        bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

        x_extents = (y_extents - bias) / weight
        x_extents = np.maximum(np.minimum(x_extents,
                                          sample[my_feature].max()),
                               sample[my_feature].min())
        y_extents = weight * x_extents + bias
        plt.plot(x_extents, y_extents, color=colors[period])
    print("Model training finished.")

    # Output a graph of loss metrics over periods.
    plt.subplot(1, 2, 2)
    plt.ylabel('RMSE')
    plt.xlabel('Periods')
    plt.title("Root Mean Squared Error vs. Periods")
    plt.tight_layout()
    plt.plot(root_mean_squared_errors)

    # Create a table with calibration data.
    calibration_data = pd.DataFrame()
    calibration_data["predictions"] = pd.Series(predictions)
    calibration_data["targets"] = pd.Series(targets)
    display.display(calibration_data.describe())

    print("Final RMSE (on training data): %0.2f" % root_mean_squared_error)

    return calibration_data
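A typical call looks like this; the hyperparameter values below are illustrative starting points, not tuned results:

calibration_data = train_model(
    learning_rate=0.00002,
    steps=500,
    batch_size=5,
    input_feature="total_rooms")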

7. Synthetic Features and Outliers

A synthetic feature lets one computed feature carry the information of several raw features, reducing the feature count and making the model more efficient; removing outliers improves the model's accuracy.

Add a feature rooms_per_person, computed from two existing features as total_rooms / population, and train the model on the new feature:

california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] / california_housing_dataframe["population"])

calibration_data = train_model(
    learning_rate=0.05,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")

Notice that the RMSE actually increased in the last training period, which suggests we overshot the loss minimum; the learning rate (currently 0.05) should be reduced slightly.
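One way to test that hypothesis is to re-run the training with a slightly lower rate (0.04 here is just an illustrative value):

calibration_data = train_model(
    learning_rate=0.04,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")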

Next, deal with the outliers. Start by plotting the data to see where they lie:

# Analyze the outliers
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
# Scatter plot of predictions (x) vs. targets (y)
plt.scatter(calibration_data["predictions"], calibration_data["targets"])

plt.subplot(1, 2, 2)
# Histogram of the rooms_per_person feature
_ = california_housing_dataframe["rooms_per_person"].hist()

plt.show()

The plots reveal how the outliers are distributed: on the left, predictions (x) vs. targets (y); on the right, the histogram of rooms_per_person. Most values fall between 0 and 5, so values greater than 5 can be filtered out.

Remove the outliers:

# Clip the outliers
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["rooms_per_person"]).apply(lambda x: min(x, 5))
# Plot the rooms_per_person histogram again; values greater than 5 have been clipped away
_ = california_housing_dataframe["rooms_per_person"].hist()

# Retrain the model
calibration_data = train_model(
    learning_rate=0.05,
    steps=500,
    batch_size=5,
    input_feature="rooms_per_person")
# Scatter plot of predictions (x) vs. targets (y)
_ = plt.scatter(calibration_data["predictions"], calibration_data["targets"])

Now all the feature values fall between 0 and 5.
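As an aside, pandas has a built-in equivalent for this kind of capping; the following alternative (not the course's code) produces the same result as the apply above:

# Cap rooms_per_person at 5 using Series.clip instead of apply
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["rooms_per_person"].clip(upper=5))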

Training results after removing the outliers: