Kaggle Machine Learning Tutorial Study Notes (Part 3)

阿新 • Published: 2018-04-15


This post takes a detailed look at the data cleaning step. To me, these basics are much the same as the workflow in a mathematical modeling competition.

As the original article puts it:

The first step to data cleaning is removing unwanted observations from your dataset.

This includes duplicate or irrelevant observations.
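The step quoted above could be sketched in pandas roughly as follows. The DataFrame, its column names, and the rule for what counts as "irrelevant" (anything that is not residential) are all assumptions made up for illustration; they do not come from the original tutorial.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "property_type": ["House", "House", "Apartment", "Office"],
    "price": [300000, 300000, 150000, 900000],
})

# Remove exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Remove observations that are irrelevant to the problem at hand.
# Assumed rule for this sketch: only residential properties matter.
relevant = deduped[deduped["property_type"].isin(["House", "Apartment"])]
relevant = relevant.reset_index(drop=True)

print(len(df), len(deduped), len(relevant))
```

Which rows count as irrelevant depends entirely on the question being asked, so the filter is the part most likely to change in a real project.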

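1. Handling duplicate or irrelevant data

Look closely at the two plots below:

[Image: original plot]

[Image: plot after cleaning]

The labels asphalt, shake-shingle, composition, and asphalt,shake-shingle were all either inconsistently capitalized or mistyped; after cleaning, the categories are much tidier.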
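One way to do this kind of label cleanup in pandas is to map every inconsistent spelling to a single canonical label. The column name `roof` and the choice of canonical labels below are assumptions for this sketch, in particular the decision to fold `asphalt,shake-shingle` into `Shake Shingle`.

```python
import pandas as pd

# Hypothetical 'roof' column with inconsistent labels like those named above.
df = pd.DataFrame({"roof": [
    "Asphalt", "asphalt",
    "Shake Shingle", "shake-shingle",
    "Composition", "composition",
    "asphalt,shake-shingle",
]})

# Map each inconsistent spelling to one canonical label.
# The mapping itself is an assumption made for this example.
fixes = {
    "asphalt": "Asphalt",
    "shake-shingle": "Shake Shingle",
    "composition": "Composition",
    "asphalt,shake-shingle": "Shake Shingle",
}
df["roof"] = df["roof"].replace(fixes)

# Count the remaining categories to confirm the cleanup.
counts = df["roof"].value_counts()
print(counts)
```

Checking `value_counts()` before and after is a quick way to spot labels that differ only in case or punctuation.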

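2. Handling outliers

As far as I understand so far, some outliers hurt model performance, while others may carry important information. Handle them case by case: removing an outlier helps when you have a well-founded reason to do so.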
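Because the decision is made case by case, it can help to flag candidate outliers for inspection first rather than delete them outright. The sketch below uses the common 1.5 × IQR rule, which is my own choice for illustration and is not mentioned in the original article; the `lot_size` column and its values are likewise hypothetical.

```python
import pandas as pd

# Hypothetical numeric feature with one suspicious value.
df = pd.DataFrame({"lot_size": [1200, 1500, 1350, 1420, 1280, 250000]})

# Interquartile range of the feature.
q1 = df["lot_size"].quantile(0.25)
q3 = df["lot_size"].quantile(0.75)
iqr = q3 - q1

# Flag values outside 1.5 * IQR of the quartiles (assumed rule of thumb).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["lot_size"].between(lower, upper)

# Inspect the flagged rows before deciding whether removal is justified.
print(df[df["is_outlier"]])
```

Flagging keeps the data intact, so a flagged row can still be kept if it turns out to be a genuine, informative observation.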

3. Handling missing data

The two most common approaches:

(1) Dropping observations that have missing values. This is not a good approach, because:

- The fact that the value was missing may be informative in itself.
- Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!

(2) Imputing the missing values based on other observations.
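For reference, the two approaches above might look like this in pandas. The columns are hypothetical, and the median is just one assumed choice of imputation; the article does not prescribe a particular statistic here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numeric and a categorical column.
df = pd.DataFrame({
    "sqft": [1500.0, np.nan, 2100.0, 1800.0],
    "garage": ["Attached", "Detached", None, "Attached"],
})

# (1) Dropping: every row with any missing value disappears.
dropped = df.dropna()
rows_lost = len(df) - len(dropped)

# (2) Imputing: fill the numeric gap from the other observations
# (the median is an assumption chosen for this sketch).
imputed = df.copy()
imputed["sqft"] = imputed["sqft"].fillna(imputed["sqft"].median())

print(rows_lost)
print(imputed)
```

Here half of the rows vanish under dropping, and the imputed value hides the fact that it was never observed, which is the loss of information the article warns about.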

A nugget of truth: Missing data is like missing a puzzle piece. If you drop it, that's like pretending the puzzle slot isn't there. If you impute it, that's like trying to squeeze in a piece from somewhere else in the puzzle.

Missing categorical data

The best way to handle missing data for categorical features is to simply label them as 'Missing'!

- You're essentially adding a new class for the feature.
- This tells the algorithm that the value was missing.
- This also gets around the technical requirement for no missing values.
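A minimal sketch of the labeling idea above, assuming a hypothetical categorical column named `exterior_walls`:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with missing entries.
df = pd.DataFrame({"exterior_walls": ["Brick", np.nan, "Wood", None]})

# Treat missingness as its own class by labeling it 'Missing'.
df["exterior_walls"] = df["exterior_walls"].fillna("Missing")

print(df["exterior_walls"].value_counts())
```

After this step the column has no missing values, and 'Missing' appears as a regular category that a model can learn from.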

Missing numeric data

For missing numeric data, you should flag and fill the values.

1. Flag the observation with an indicator variable of missingness.
2. Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.

By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
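The flag-and-fill steps above could be written as follows. The column name `sqft` and the indicator name `sqft_missing` are hypothetical names chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with one missing value.
df = pd.DataFrame({"sqft": [1500.0, np.nan, 2100.0]})

# Step 1: flag missingness with an indicator variable (1 = was missing).
df["sqft_missing"] = df["sqft"].isna().astype(int)

# Step 2: fill the original missing value with 0 so no NaN remains.
df["sqft"] = df["sqft"].fillna(0)

print(df)
```

The order matters: the indicator has to be created before filling, because after `fillna(0)` a filled-in 0 can no longer be told apart from a real one.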

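Hmm... while writing this post I increasingly realized that I could do little more than restate the original article, so I ended up writing it as a translation. With my poor English, I could only work through it sentence by sentence with Google Translate, using the context to make sense of it. So if anything is translated badly or misunderstood, please point it out; I only hope I am not misleading anyone!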
