Learn from Top Kagglers: Advanced Feature Engineering II
These are course notes for the Coursera course
How to Win a Data Science Competition: Learn from Top Kagglers
This article covers feature engineering techniques commonly used in data science competitions; it is the second part of the topic.
If you are viewing this article on a computer, see the original post for the accompanying Jupyter notebook.
Statistics and distance based features
This part focuses on two kinds of advanced feature engineering: features built from various statistics of one feature grouped by another, and features obtained by analyzing the neighborhood of a given point.
groupby and nearest neighbor methods
Example: here is some data from a CTR (click-through rate) task.
We can assume that the ad with the lowest price on the page will attract most of the attention, and that the other ads on the page will be less attractive. Features expressing this intuition are easy to compute: for every ad we can add the lowest and highest price per user and web page. The position of the ad with the lowest price can also be used in this case.
Code implementation
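The original snippet is not reproduced here, so below is a minimal pandas sketch of the idea, with hypothetical column names (user_id, page_id, ad_price):

```python
import pandas as pd

# Hypothetical CTR data: one row per ad impression.
df = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2],
    'page_id':  [10, 10, 10, 20, 20],
    'ad_price': [90, 120, 99, 45, 60],
})

gb = df.groupby(['user_id', 'page_id'])['ad_price']
df['min_price'] = gb.transform('min')   # lowest ad price on this user/page
df['max_price'] = gb.transform('max')   # highest ad price on this user/page
# Flag whether this ad is the cheapest one shown to this user on this page.
df['is_cheapest'] = (df['ad_price'] == df['min_price']).astype(int)
```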
More features:
- How many pages the user visited
- Standard deviation of prices
- Most visited page
- Many, many more
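A minimal sketch of such additional groupby statistics, continuing the hypothetical DataFrame from the snippet above:

```python
# More per-user statistics via groupby/agg on the same hypothetical data.
user_stats = df.groupby('user_id').agg(
    n_pages_visited=('page_id', 'nunique'),               # how many pages the user visited
    price_std=('ad_price', 'std'),                        # standard deviation of prices
    most_visited_page=('page_id', lambda s: s.mode()[0]), # most visited page
)
df = df.merge(user_stats.reset_index(), on='user_id', how='left')
```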
What if there is no explicit feature to group by like this? Then we can use nearest neighbors.
Neighbors
- Explicit group is not needed
- More flexible
- Much harder to implement
Examples
- Number of houses within 500m, 1000m, ...
- Average price per square meter within 500m, 1000m, ...
- Number of schools/supermarkets/parking lots within 500m, 1000m, ...
- Distance to closest subway station
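A minimal sketch of such neighborhood features using sklearn's BallTree; the coordinates and object types here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical planar coordinates (in meters) for houses and schools.
rng = np.random.RandomState(0)
houses  = rng.rand(1000, 2) * 5000
schools = rng.rand(50, 2) * 5000

house_tree  = BallTree(houses)
school_tree = BallTree(schools)

features = {}
for r in (500, 1000):
    # Subtract 1 so a house does not count itself.
    features[f'houses_{r}m']  = house_tree.query_radius(houses, r=r, count_only=True) - 1
    features[f'schools_{r}m'] = school_tree.query_radius(houses, r=r, count_only=True)

# Distance from each house to the closest school (same pattern as "closest subway station").
dist, _ = school_tree.query(houses, k=1)
features['dist_closest_school'] = dist.ravel()
```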
The instructor used this approach in the Springleaf competition.
KNN features in Springleaf
- Mean encode all the variables
- For every point, find 2000 nearest neighbors using the Bray-Curtis metric
- Calculate various features from those 2000 neighbors
Evaluate:
- Mean target of nearest 5, 10, 15, 500, 2000 neighbors
- Mean distance to 10 closest neighbors
- Mean distance to 10 closest neighbors with target 1
- Mean distance to 10 closest neighbors with target 0
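A minimal sketch of a few of these features; X and y are hypothetical stand-ins for the mean-encoded variables and the binary target, 10 neighbors replace the lecture's 2000, and the out-of-fold scheme such features need in practice is omitted for brevity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(500, 10)                    # stand-in for mean-encoded features
y = (rng.rand(500) > 0.5).astype(int)    # stand-in for the binary target

# The Bray-Curtis metric requires the brute-force algorithm in sklearn.
nn = NearestNeighbors(n_neighbors=11, metric='braycurtis', algorithm='brute').fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]      # drop each point's self-match

mean_target_10 = y[idx].mean(axis=1)     # mean target of the 10 nearest neighbors
mean_dist_10   = dist.mean(axis=1)       # mean distance to the 10 closest neighbors
# Mean distance to neighbors with target 1 among the 10 (NaN if there are none).
mask = (y[idx] == 1)
with np.errstate(invalid='ignore'):
    mean_dist_10_t1 = (dist * mask).sum(axis=1) / mask.sum(axis=1)
```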
Matrix factorizations for feature extraction
- Example of feature fusion
Notes about Matrix Factorization
- Can be applied only to some columns
- Can provide additional diversity
  - Good for ensembles
- It is a lossy transformation. Its efficiency depends on:
  - The particular task
  - The number of latent factors (usually 5-100)
Implementation
- Several MF methods can be found in sklearn
- SVD and PCA
  - Standard tools for Matrix Factorization
- TruncatedSVD
  - Works with sparse matrices
- Non-negative Matrix Factorization (NMF)
  - Ensures that all latent factors are non-negative
  - Good for count-like data
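A minimal sketch of both tools on a hypothetical sparse count-like matrix (in practice, fit on train, or on train and test together, before transforming):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF, TruncatedSVD

# Hypothetical sparse, non-negative count-like matrix (e.g. bag of words).
X = sparse_random(100, 50, density=0.1, random_state=0)

# TruncatedSVD works directly on sparse input.
X_svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)

# NMF requires non-negative input and suits count-like data.
X_nmf = NMF(n_components=5, random_state=0, max_iter=500).fit_transform(X)
```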
NMF for tree-based methods
Non-negative matrix factorization, NMF for short, transforms data in a way that makes it better suited for decision trees. As the course's plot shows, NMF transforms the data so that it forms lines parallel to the axes.
Factorization trick
The same transformation tricks used for linear models can also be applied before factorizing a matrix.
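My reading of this trick, as a minimal sketch: apply a transform such as log(x + 1), which helps linear models, to hypothetical count data before the factorization:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical skewed count data.
X_counts = np.random.RandomState(0).poisson(3, size=(100, 30)).astype(float)

# Apply the same log(x + 1) transform that helps linear models, then factorize.
X_factors = TruncatedSVD(n_components=5, random_state=0).fit_transform(np.log1p(X_counts))
```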
Conclusion
- Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
- It can be applied to transform categorical features into real-valued ones
- Many of the tricks suitable for linear models are also useful for MF
Feature interactions
All combinations of feature values
- Example: banner selection
Suppose we are building a model that predicts the best ad banner to display on a website.
… | category_ad | category_site | … | is_clicked
---|---|---|---|---
… | auto_part | game_news | … | 0
… | music_tickets | music_news | … | 1
… | mobile_phones | auto_blog | … | 0
Combining the category of the ad banner itself with the category of the site where the banner will be shown forms a very strong feature.
… | ad_site | … | is_clicked
---|---|---|---
… | auto_part_game_news | … | 0
… | music_tickets_music_news | … | 1
… | mobile_phones_auto_blog | … | 0
Joining the two original columns yields the combined feature ad_site.
From a technical point of view, there are two ways to construct such an interaction.
- Example of interactions

Method 1: concatenate the two categorical values into a single string, then one-hot encode the combined feature.
Method 2: one-hot encode each feature separately, then take element-wise products of all pairs of the resulting columns.
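A minimal sketch of both methods (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'category_ad':   ['auto_part', 'music_tickets', 'mobile_phones'],
                   'category_site': ['game_news', 'music_news', 'auto_blog']})

# Method 1: concatenate the values, then one-hot encode the combined feature.
df['ad_site'] = df['category_ad'] + '_' + df['category_site']
ohe_method1 = pd.get_dummies(df['ad_site'])

# Method 2: one-hot encode each feature, then multiply all pairs of columns.
a = pd.get_dummies(df['category_ad'])
b = pd.get_dummies(df['category_site'])
ohe_method2 = pd.DataFrame({f'{ca}_{cb}': a[ca] * b[cb]
                            for ca in a.columns for cb in b.columns})
```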
Similar ideas can also be used for numerical variables. In fact, the interaction is not limited to multiplication; other operations can be used as well:
- Multiplication
- Sum
- Diff
- Division
- ...
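A minimal sketch generating all four operations for every pair of hypothetical numeric columns:

```python
import itertools
import pandas as pd

num = pd.DataFrame({'x1': [1.0, 2.0, 3.0], 'x2': [4.0, 5.0, 6.0]})

inter = {}
for f1, f2 in itertools.combinations(num.columns, 2):
    inter[f'{f1}_mul_{f2}']  = num[f1] * num[f2]
    inter[f'{f1}_sum_{f2}']  = num[f1] + num[f2]
    inter[f'{f1}_diff_{f2}'] = num[f1] - num[f2]
    inter[f'{f1}_div_{f2}']  = num[f1] / num[f2]
interactions = pd.DataFrame(inter)
```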
Practical Notes
- We have a lot of possible interactions: N*N for N features.
  a. Even more if we use several types of interactions
- Need to reduce their number
  a. Dimensionality reduction
  b. Feature selection
This approach generates a huge number of features, so feature selection or dimensionality reduction is used to cut them down. Feature selection serves as the example below.
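A minimal sketch of such selection via random forest feature importances (the dataset and the cutoff of 10 features are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical matrix of original + interaction features.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:10]   # keep the 10 most important
X_selected = X[:, top]
```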
Interactions' order
- We looked at 2nd order interactions.
- Such an approach can be generalized for higher orders.
- It is hard to do generation and selection automatically.
- Manual building of high-order interactions is some kind of art.
Extract features from DT
Look at a decision tree: we can map each leaf to a binary feature, and the index of the leaf an object falls into can be used as the value of a new categorical feature. If we use not a single tree but an ensemble of trees, for example a random forest, this operation can be applied to every tree. This is a powerful way to extract high-order interactions.
How to use it

In sklearn:
tree_model.apply()
In xgboost:
booster.predict(pred_leaf=True)
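A minimal runnable sketch of the sklearn variant, one-hot encoding the leaf indices so another model can consume them (data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

leaves = rf.apply(X)     # shape (n_samples, n_trees): leaf index in each tree
# Treat each tree's leaf index as a categorical feature and one-hot encode it.
leaf_features = OneHotEncoder().fit_transform(leaves)
```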
Conclusion
- We looked at ways to build an interaction of categorical attributes
- Extended this approach to real-valued features
- Learned how to extract features via decision trees
t-SNE
t-SNE is used for exploratory data analysis; it can also be viewed as a method for obtaining new features from data.
Practical Notes
- Results heavily depend on hyperparameters (perplexity)
- Good practice is to use several projections with different perplexities (5-100)
- Due to its stochastic nature, tSNE provides different projections even for the same data/hyperparameters
- Train and test should be projected together
- tSNE runs for a long time when there is a big number of features
  - It is common to do dimensionality reduction before projection
- An implementation of tSNE can be found in the sklearn library
  - But personally I prefer the stand-alone python package tsne due to its faster speed
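A minimal sketch following these notes: project train and test together, reduce dimensionality first, and try several perplexities (data and parameter values are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(300, 100), rng.rand(100, 100)

# Project train and test together; reduce dimensionality first to speed tSNE up.
X_all = np.vstack([X_train, X_test])
X_all = PCA(n_components=30, random_state=0).fit_transform(X_all)

projections = {}
for perplexity in (5, 30, 100):
    proj = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X_all)
    projections[perplexity] = (proj[:len(X_train)], proj[len(X_train):])
```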
Conclusion
- tSNE is a great tool for visualization
- It can be used as a feature as well
- Be careful with interpretation of the results
- Try different perplexities
Additional materials and links

Matrix Factorization:
- Overview of Matrix Decomposition Methods (sklearn): http://scikit-learn.org/stable/modules/decomposition.html

t-SNE:
- Multicore t-SNE implementation: https://github.com/DmitryUlyanov/Multicore-TSNE
- Comparison of Manifold Learning methods (sklearn): http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html
- How to Use t-SNE Effectively (distill.pub blog): https://distill.pub/2016/misread-tsne/
- tSNE homepage (Laurens van der Maaten): https://lvdmaaten.github.io/tsne/
- Example: tSNE with different perplexities (sklearn): http://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py

Interactions:
- Facebook Research's paper on extracting categorical features from trees: https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/
- Example: Feature transformations with ensembles of trees (sklearn): http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html