
Google Machine Learning: Validation

Learning objectives:

  • Use multiple features, instead of a single feature, to further improve the effectiveness of a model
  • Debug issues in the model's input data
  • Use a test data set to check whether a model is overfitting the validation data
# Initialization: imports and data loading.
from __future__ import print_function
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.cn/mledu-datasets/california_housing_train.csv", sep=",")
# Preprocessing helpers: select input features and build a synthetic feature.
def preprocess_features(california_housing_dataframe):
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets
# Use the first 12,000 examples (of 17,000) for training.
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

# Use the last 5,000 examples (of 17,000) for validation.
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()
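With a properly randomized split, the training and validation sets should have similar feature distributions, so comparing summary statistics side by side is a quick sanity check before plotting. A minimal sketch (the `comparison` DataFrame name is just illustrative):

# Illustrative check: feature means should roughly match across the two splits.
# With the unrandomized head()/tail() split used here, they won't.
comparison = pd.DataFrame({
    "training_mean": training_examples.mean(),
    "validation_mean": validation_examples.mean()})
print(comparison)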

plt.figure(figsize=(13, 8))  # Create the figure.

ax = plt.subplot(1, 2, 1)  # Left subplot: validation data.
ax.set_title("Validation Data")
ax.set_autoscaley_on(False)
ax.set_ylim([32, 43])  # y-axis (latitude) limits.
ax.set_autoscalex_on(False)
ax.set_xlim([-126, -112])  # x-axis (longitude) limits.
plt.scatter(validation_examples["longitude"],
            validation_examples["latitude"],
            cmap=cm.coolwarm,
            c=validation_targets["median_house_value"] / validation_targets["median_house_value"].max())

ax = plt.subplot(1, 2, 2)  # Right subplot: training data.
ax.set_title("Training Data")
ax.set_autoscaley_on(False)
ax.set_ylim([32, 43])
ax.set_autoscalex_on(False)
ax.set_xlim([-126, -112])
plt.scatter(training_examples["longitude"],
            training_examples["latitude"],
            cmap=cm.coolwarm,
            c=training_targets["median_house_value"] / training_targets["median_house_value"].max())
_ = plt.plot()

We can see that the training and validation distributions split into two separate regions, as if the map of California had been cut in two. This is because we did not properly randomize the data before creating the training and validation sets: head() and tail() take contiguous slices, so any ordering in the CSV carries over into the splits.
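One standard fix, used in the original MLCC notebook, is to shuffle the row order before splitting:

# Randomize the row order before taking head()/tail() splits, then rebuild
# the training and validation sets with the code above.
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

After re-running the preprocessing steps and the scatter plots above, the two maps should cover roughly the same area.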

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    # Convert pandas data into a dict of numpy arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, then batch and repeat it.
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)
    # Shuffle the data, if specified. Note the reassignment:
    # Dataset.shuffle() returns a new dataset rather than mutating in place.
    if shuffle:
        ds = ds.shuffle(10000)
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
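To sanity-check the input pipeline, you can pull a single batch and evaluate it in a TF 1.x session. A hedged sketch; the variable names below are illustrative:

# Illustrative: materialize one batch from the input function.
example_features, example_labels = my_input_fn(
    training_examples, training_targets["median_house_value"], batch_size=2)
with tf.Session() as sess:
    print(sess.run(example_labels))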

def construct_feature_columns(input_features):
    # Generate a numeric feature column for each input feature.
    return set([tf.feature_column.numeric_column(my_feature)
                for my_feature in input_features])
# Reusable training routine: trains the regressor and tracks RMSE per period.
def train_model(
    learning_rate,
    steps,
    batch_size,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
    periods = 10  # Number of reporting periods.
    steps_per_period = steps / periods  # Training steps per period.

    # Create a linear regressor object with gradient clipping.
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    # Configure the linear regression model.
    linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=construct_feature_columns(training_examples),
        optimizer=my_optimizer)

    # Create input functions.
    training_input_fn = lambda: my_input_fn(
        training_examples,
        training_targets["median_house_value"],
        batch_size=batch_size)
    predict_training_input_fn = lambda: my_input_fn(
        training_examples,
        training_targets["median_house_value"],
        num_epochs=1,
        shuffle=False)
    predict_validation_input_fn = lambda: my_input_fn(
        validation_examples,
        validation_targets["median_house_value"],
        num_epochs=1,
        shuffle=False)
    
    # Train the model inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("RMSE (on training data):")
    training_rmse = []
    validation_rmse = []
    for period in range(0, periods):
        # Train the model, starting from the prior state (pausing each
        # period to report metrics).
        linear_regressor.train(
            input_fn=training_input_fn,
            steps=steps_per_period)
        # Take a break and compute predictions.
        training_predictions = linear_regressor.predict(input_fn=predict_training_input_fn)
        training_predictions = np.array([item['predictions'][0] for item in training_predictions])
        validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)
        validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])
        # Compute training and validation loss.
        training_root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(training_predictions, training_targets))
        validation_root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(validation_predictions, validation_targets))
        # Print the current training loss.
        print("  period %02d : %0.2f" % (period, training_root_mean_squared_error))
        # Add the loss metrics from this period to our lists.
        training_rmse.append(training_root_mean_squared_error)
        validation_rmse.append(validation_root_mean_squared_error)
    print("Model training finished.")
    
    # Output a graph of loss metrics over periods.
    plt.ylabel("RMSE")
    plt.xlabel("Periods")
    plt.title("Root Mean Squared Error vs. Periods")
    plt.tight_layout()
    plt.plot(training_rmse, label="training")
    plt.plot(validation_rmse, label="validation")
    plt.legend()
    return linear_regressor

# Train the model.
linear_regressor = train_model(
    learning_rate=0.00003,
    steps=500,
    batch_size=5,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

# Evaluate on the test data.
# We have iterated against the validation data many times; a held-out test set
# confirms the model has not overfit the peculiarities of that particular sample.
california_housing_test_data = pd.read_csv("https://download.mlcc.google.cn/mledu-datasets/california_housing_test.csv", sep=",")
test_examples = preprocess_features(california_housing_test_data)
test_targets = preprocess_targets(california_housing_test_data)
predict_test_input_fn = lambda: my_input_fn(
    test_examples, test_targets["median_house_value"],
    num_epochs=1, shuffle=False)
test_predictions = linear_regressor.predict(input_fn=predict_test_input_fn)
test_predictions = np.array([item['predictions'][0] for item in test_predictions])
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(test_predictions, test_targets))
print("Final RMSE (on test data): %0.2f" % root_mean_squared_error)