scikit-learn: 4.2. Feature extraction (feature extraction, not feature selection)

阿新 • Published: 2017-07-20


http://scikit-learn.org/stable/modules/feature_extraction.html

Writing this while sick, in an internet cafe. Support appreciated.

1. First, two concepts need to be kept apart: feature extraction and feature selection (feature extraction is very different from feature selection).

The former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features (choosing the better features from among those already extracted).
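To make the distinction concrete, here is a minimal sketch. The records and the extra "unit" field are made up for illustration, and VarianceThreshold is just one convenient selector, not something the original section uses. DictVectorizer does the extraction (dicts to numbers); VarianceThreshold does the selection (dropping a column that never varies).

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import VarianceThreshold

# Hypothetical records; "unit" is identical everywhere, so it carries no information.
records = [
    {"city": "Dubai", "temperature": 33.0, "unit": "C"},
    {"city": "London", "temperature": 12.0, "unit": "C"},
    {"city": "Paris", "temperature": 18.0, "unit": "C"},
]

# Feature extraction: arbitrary data (dicts) -> numerical feature matrix.
X = DictVectorizer(sparse=False).fit_transform(records)

# Feature selection: keep only columns whose variance is above 0.
X_selected = VarianceThreshold().fit_transform(X)

print(X.shape, X_selected.shape)  # (3, 5) (3, 4)
```

The extraction step yields five columns (three one-hot city columns, temperature, and unit=C); the selection step removes the constant unit=C column.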

The rest is split into four parts (sections 2 to 5 below). The main one is section 4, text feature extraction.

2. Loading features from dicts

This is the job of the class DictVectorizer. One example is enough:

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> # called get_feature_names() before scikit-learn 1.0; that name was removed in 1.2
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)

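DictVectorizer is also handy for extracting feature windows around a particular word. For example, suppose an existing algorithm has already extracted part-of-speech (PoS) tags for the words around 'sat' in the sentence 'The cat sat on the mat.', like this: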

>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

The PoS features above can then be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> # called get_feature_names() before scikit-learn 1.0; that name was removed in 1.2
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the'], ...)
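The "piped into a text.TfidfTransformer" remark could look roughly like the sketch below. The second window (the "dog" one) is hypothetical, added only so there is more than one row, and make_pipeline is just one way to chain the two steps.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

pos_window = [
    {"word-2": "the", "pos-2": "DT", "word-1": "cat",
     "pos-1": "NN", "word+1": "on", "pos+1": "PP"},
    # Hypothetical second window, for illustration only.
    {"word-2": "a", "pos-2": "DT", "word-1": "dog",
     "pos-1": "NN", "word+1": "under", "pos+1": "PP"},
]

# DictVectorizer builds the sparse matrix; TfidfTransformer reweights the
# columns and, with its default norm="l2", scales each row to unit length.
pipeline = make_pipeline(DictVectorizer(), TfidfTransformer())
X = pipeline.fit_transform(pos_window)

row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(X.shape)     # (2, 9)
print(row_norms)   # each row has L2 norm 1 (up to rounding)
```

The three PoS features are shared by both windows and the six word features are not, which gives nine columns in total.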

3. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.

Because it hashes, it stores only the integer index of each feature and not the feature's original string name, so it has no inverse_transform method.

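FeatureHasher accepts mappings (such as dicts), (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. For strings, each value defaults to 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].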
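A small sketch of the string and dict input types follows. The choices n_features=16 and alternate_sign=False are assumptions made here so the numbers are easy to read; they are not recommended settings. With alternate_sign=False every hashed value is added with a positive sign, so the row sum equals the total count even if two features happen to collide in the same column.

```python
from sklearn.feature_extraction import FeatureHasher

# input_type="string": each sample is an iterable of feature names,
# and every occurrence counts as a value of 1.
string_hasher = FeatureHasher(n_features=16, input_type="string",
                              alternate_sign=False)
X_str = string_hasher.transform([["feat1", "feat2", "feat2"]])

# input_type="dict": each sample maps feature name -> value.
dict_hasher = FeatureHasher(n_features=16, input_type="dict",
                            alternate_sign=False)
X_dict = dict_hasher.transform([{"feat1": 1, "feat2": 2}])

print(X_str.shape, X_str.sum())    # (1, 16) 3.0
print(X_dict.shape, X_dict.sum())  # (1, 16) 3.0

# Both inputs describe the same sample, so they hash to the same row.
print((X_str != X_dict).nnz == 0)  # True
```

Only column indices survive the hashing, which is why there is no way to map a column back to 'feat1' or 'feat2'.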

4. Text feature extraction

There is too much material here, so it is written up separately. See this blog post: http://blog.csdn.net/mmc2015/article/details/46997379

5. Image feature extraction

Patch extraction:

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. To rebuild the original image from all of its patches, use reconstruct_from_patches_2d:

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Reconstruction works as follows:

>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works in the same way as extract_patches_2d, except that it accepts multiple images as input:

>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)

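Connectivity graph of an image:

A connectivity matrix records which pairs of pixels in an image are connected. Neighboring pixels are linked, and the difference between their values can serve as the edge weight.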

The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
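As a rough illustration, the two functions can be tried on a tiny 3x3 image. The image itself is made up for this sketch. Both calls return a sparse matrix with one row and one column per pixel.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph, img_to_graph

# A made-up 3x3 grayscale image with pixel values 0..8.
img = np.arange(9, dtype=float).reshape(3, 3)

# Built from the pixel values: edges link neighboring pixels.
graph = img_to_graph(img)

# Built from the shape only: pure connectivity, no pixel values needed.
connectivity = grid_to_graph(3, 3)

print(graph.shape, connectivity.shape)  # (9, 9) (9, 9)

dense = connectivity.toarray()
print(dense[0, 1])  # 1: pixels (0, 0) and (0, 1) are neighbors
print(dense[0, 8])  # 0: opposite corners are not connected
```

A matrix like connectivity is what estimators such as Ward clustering can take to restrict merges to neighboring pixels.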

Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py

Headache... off to sleep.
