
Google Machine Learning: Feature Sets

Learning objective: create a feature set with as few features as possible that performs as well as a more complex feature set.

A model with fewer features uses fewer resources and is easier to maintain. Let's see whether we can build a model that uses only a handful of the housing features yet performs as well as a model that uses every feature in the dataset.

from __future__ import print_function

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv(
    "https://download.mlcc.google.cn/mledu-datasets/california_housing_train.csv", sep=",")
# Randomize the row order so that the head/tail split below yields unbiased
# training and validation sets.
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index)
)
def preprocess_features(california_housing_dataframe):
    """Selects the input features and adds a synthetic rooms-per-person feature."""
    selected_features = california_housing_dataframe[[
        "latitude",
        "longitude",
        "housing_median_age",
        "total_rooms",
        "total_bedrooms",
        "population",
        "households",
        "median_income"
    ]]
    processed_features = selected_features.copy()
    # Synthetic feature: average number of rooms per person.
    processed_features["room_per_person"] = (
        california_housing_dataframe["total_rooms"] /
        california_housing_dataframe["population"]
    )
    return processed_features

def preprocess_targets(california_housing_dataframe):
    """Prepares the target: median house value, scaled to units of $1,000."""
    output_targets = pd.DataFrame()
    output_targets["median_house_value"] = (
        california_housing_dataframe["median_house_value"] / 1000.0
    )
    return output_targets
# Use the first 12,000 (of 17,000) examples for training and the last 5,000 for validation.
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
print("Training examples summary:")
display.display(training_examples.describe())
print("Validation examples summary:")
display.display(validation_examples.describe())

print("Training targets summary:")
display.display(training_targets.describe())
print("Validation targets summary:")
display.display(validation_targets.describe())


# The correlation matrix shows pairwise correlations: each feature against the
# target, and each feature against every other feature.
# Here, correlation is the Pearson correlation coefficient:
# -1 = perfect negative correlation, 0 = no correlation, 1 = perfect positive correlation.
correlation_dataframe=training_examples.copy()
correlation_dataframe["target"]=training_targets["median_house_value"]
correlation_dataframe.corr()
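
# As a quick way to read the matrix, a minimal sketch (not part of the original
# exercise) that ranks the features by the absolute value of their Pearson
# correlation with the target, making strong candidates easy to spot.
correlation_dataframe.corr()["target"].abs().sort_values(ascending=False)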


def construct_feature_columns(input_features):
    """Constructs one tf numeric feature column per input feature."""
    return set([tf.feature_column.numeric_column(my_feature)
                for my_feature in input_features])

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Feeds the LinearRegressor one batch of data at a time via the tf.data API."""
    # Convert the pandas data into a dict of numpy arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # from_tensor_slices slices every tensor along its first dimension,
    # producing one dataset element per example.
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)
    if shuffle:
        ds = ds.shuffle(10000)
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
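
# A minimal sketch with toy data (the arrays below are assumptions for
# illustration, not from the dataset) of how Dataset.from_tensor_slices slices
# along the first dimension: a dict of two 3-element arrays becomes a dataset
# of 3 examples, each a dict of scalars such as {"a": 1, "b": 4}.
toy_ds = Dataset.from_tensor_slices(
    {"a": np.array([1, 2, 3]), "b": np.array([4, 5, 6])})
print(toy_ds.output_shapes)  # {'a': TensorShape([]), 'b': TensorShape([])}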



def train_model(
    learning_rate,
    steps,
    batch_size,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
  periods = 10
  steps_per_period = steps / periods

  # Create a linear regressor object.
  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
  linear_regressor = tf.estimator.LinearRegressor(
      feature_columns=construct_feature_columns(training_examples),
      optimizer=my_optimizer
  )
    
  # Create input functions.
  training_input_fn = lambda: my_input_fn(training_examples, 
                                          training_targets["median_house_value"], 
                                          batch_size=batch_size)
  predict_training_input_fn = lambda: my_input_fn(training_examples, 
                                                  training_targets["median_house_value"], 
                                                  num_epochs=1, 
                                                  shuffle=False)
  predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                                    validation_targets["median_house_value"], 
                                                    num_epochs=1, 
                                                    shuffle=False)

  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  print("Training model...")
  print("RMSE (on training data):")
  training_rmse = []
  validation_rmse = []
  for period in range(0, periods):
    # Train the model, starting from the prior state.
    linear_regressor.train(
        input_fn=training_input_fn,
        steps=steps_per_period,
    )
    # Take a break and compute predictions.
    training_predictions = linear_regressor.predict(input_fn=predict_training_input_fn)
    training_predictions = np.array([item['predictions'][0] for item in training_predictions])
    
    validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)
    validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])
    
    # Compute training and validation loss.
    training_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(training_predictions, training_targets))
    validation_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(validation_predictions, validation_targets))
    # Occasionally print the current loss.
    print("  period %02d : %0.2f" % (period, training_root_mean_squared_error))
    # Add the loss metrics from this period to our list.
    training_rmse.append(training_root_mean_squared_error)
    validation_rmse.append(validation_root_mean_squared_error)
  print("Model training finished.")

  
  # Output a graph of loss metrics over periods.
  plt.ylabel("RMSE")
  plt.xlabel("Periods")
  plt.title("Root Mean Squared Error vs. Periods")
  plt.tight_layout()
  plt.plot(training_rmse, label="training")
  plt.plot(validation_rmse, label="validation")
  plt.legend()

  return linear_regressor


# Take 5 minutes to search for a good set of features and training parameters.
# Then check the solution to see which parameters we chose.
# Keep in mind that different features may require different learning parameters.
minimal_features = [
  "median_income",
  "latitude",
]

minimal_training_examples = training_examples[minimal_features]
minimal_validation_examples = validation_examples[minimal_features]

_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=minimal_training_examples,
    training_targets=training_targets,
    validation_examples=minimal_validation_examples,
    validation_targets=validation_targets)
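
# For comparison, a sketch (reusing the hyperparameters above, which is an
# untuned assumption) that trains a baseline on all nine preprocessed features,
# so the two-feature model has something to be measured against.
_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)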


plt.scatter(training_examples["latitude"], training_examples["median_income"])
# There is indeed no linear relationship between the two.
# There are a few peaks, roughly corresponding to Los Angeles and San Francisco.
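
# To make those peaks easier to see, one can mark the approximate latitudes of
# the two cities on the plot (34.05 and 37.77 are rough values added here for
# illustration, not from the exercise).
plt.axvline(x=34.05, color="r", linestyle="--", label="Los Angeles (~34.05)")
plt.axvline(x=37.77, color="g", linestyle="--", label="San Francisco (~37.77)")
plt.legend()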

# zip() takes iterables as arguments and pairs up their corresponding elements
# into tuples; in Python 3 it returns an iterator over those tuples (in Python 2
# it returned a list), e.g. list(zip([1, 2], [3, 4])) == [(1, 3), (2, 4)].
# Try creating some synthetic features that make better use of latitude.
# For example, you could create a feature that maps latitude to the value
# |latitude - 38| and name it distance_from_san_francisco (see the sketch after
# the bucketing run below).
# Alternatively, you could split the space into 10 buckets (e.g. latitude_32_to_33,
# latitude_33_to_34, etc.): the value is 1.0 if latitude falls within the bucket's
# range and 0.0 otherwise.
# Use the correlation matrix to guide the construction of synthetic features;
# if you find one that works well, add it to your model.
# What is the best validation performance you can get?
# Bucketize latitude.
# In Python 3, zip() returns a one-shot iterator, so materialize the bucket
# boundaries as a list to allow iterating over them more than once; this lets a
# single function transform both the training and the validation examples.
LATITUDE_RANGES = list(zip(range(32, 44), range(33, 45)))

def select_and_transform_features(source_df):
  selected_examples = pd.DataFrame()
  selected_examples["median_income"] = source_df["median_income"]
  # One one-hot column per latitude bucket: 1.0 if the example's latitude
  # falls in [lower, upper), else 0.0.
  for r in LATITUDE_RANGES:
    selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
      lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
  return selected_examples

selected_training_examples = select_and_transform_features(training_examples)
selected_validation_examples = select_and_transform_features(validation_examples)
_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=selected_training_examples,
    training_targets=training_targets,
    validation_examples=selected_validation_examples,
    validation_targets=validation_targets)
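
# As an alternative to bucketing, a sketch of the |latitude - 38| feature
# suggested above; the name distance_from_san_francisco comes from the exercise
# prompt, the helper name is illustrative, and the hyperparameters simply reuse
# those of the bucketing run.
def select_and_transform_features_distance(source_df):
  selected_examples = pd.DataFrame()
  selected_examples["median_income"] = source_df["median_income"]
  # Absolute distance, in degrees of latitude, from roughly San Francisco.
  selected_examples["distance_from_san_francisco"] = (
      source_df["latitude"] - 38).abs()
  return selected_examples

_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=select_and_transform_features_distance(training_examples),
    training_targets=training_targets,
    validation_examples=select_and_transform_features_distance(validation_examples),
    validation_targets=validation_targets)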