scikit-learn: 4.2. Feature extraction (feature extraction, not feature selection)
http://scikit-learn.org/stable/modules/feature_extraction.html
Writing this from an internet cafe while sick. Support appreciated.
1. First, let's clarify two concepts: feature extraction and feature selection. Feature extraction is very different from feature selection. The former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features (that is, picking the better features out of those already extracted).
The rest is split into four main parts. The most important one is still part 4, text feature extraction.
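2. Loading features from dicts
The class DictVectorizer converts features stored as lists of Python dicts into the NumPy/SciPy representation used by scikit-learn estimators. String-valued features are one-hot encoded.
Part-of-speech (PoS) features, for example a dict describing the words and tags in a window around a token, can be vectorized this way into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization).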
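The original example did not survive in this post, so here is a minimal sketch of what vectorizing PoS features could look like. The pos_window dict (features around the word "sat" in "The cat sat on the mat.") and its keys are illustrative assumptions, not data from this post.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical PoS features in a window around the word "sat"
# in the sentence "The cat sat on the mat."
pos_window = [
    {
        "word-2": "the", "pos-2": "DT",
        "word-1": "cat", "pos-1": "NN",
        "word+1": "on", "pos+1": "PP",
    },
]

vec = DictVectorizer()  # sparse output by default
pos_vectorized = vec.fit_transform(pos_window)  # one row per dict

# String values are one-hot encoded, so each column is named "key=value".
feature_names = sorted(vec.vocabulary_)
print(pos_vectorized.shape)
print(feature_names)
```

Each (key, string value) pair becomes its own column, which is why a single dict with six entries gives a row with six columns.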
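3. Feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick".
Because it hashes, it only keeps the integer index of each feature and not the feature's original string name, so there is no inverse_transform method.
FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. For strings, the value defaults to 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat2', 1)]. Values of a feature that occurs several times in one sample are summed, which makes this equivalent to [('feat1', 1), ('feat2', 2)].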
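A small sketch of the string versus dict input described above. The value n_features=16 is an arbitrary choice for illustration. alternate_sign=False is an assumption made here so that the summed values stay positive (by default the hasher may flip signs); it needs a scikit-learn version that has this parameter.

```python
from sklearn.feature_extraction import FeatureHasher

# String input: every string counts as 1, repeats within a sample are summed.
string_hasher = FeatureHasher(n_features=16, input_type="string",
                              alternate_sign=False)
X_str = string_hasher.transform([["feat1", "feat2", "feat2"]])

# Dict input: the same sample written as explicit (feature, value) pairs.
dict_hasher = FeatureHasher(n_features=16, input_type="dict",
                            alternate_sign=False)
X_dict = dict_hasher.transform([{"feat1": 1, "feat2": 2}])

# Both are scipy.sparse matrices with one row and n_features columns.
print(X_str.shape, X_dict.shape)
print((X_str.toarray() == X_dict.toarray()).all())
```

Only hashed column indices are stored, so there is no way to ask the matrix which column belonged to 'feat2'.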
4. Text feature extraction
There is too much material here, so it is written up separately. See this blog post: http://blog.csdn.net/mmc2015/article/details/46997379
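5. Image feature extraction
Patch extraction:
The extract_patches_2d function extracts small patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis.
reconstruct_from_patches_2d rebuilds the original image from all of its patches. Where patches overlap, the reconstruction averages the overlapping regions.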
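The code for this step is missing from the post, so the following is a sketch along the lines of the scikit-learn user guide. The 4x4 RGB "image" built with np.arange is a toy assumption, chosen only so the shapes are easy to follow.

```python
import numpy as np
from sklearn.feature_extraction import image

# Toy 4x4 image with 3 color channels (made-up pixel values).
one_image = np.arange(4 * 4 * 3, dtype=float).reshape((4, 4, 3))

# Extract every 2x2 patch; a 4x4 image has 3 * 3 = 9 of them.
patches = image.extract_patches_2d(one_image, (2, 2))
print(patches.shape)

# Rebuild the image from all patches; overlapping regions are averaged.
reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
print(np.allclose(one_image, reconstructed))
```

Since every patch is an unmodified copy of the image, averaging the overlaps gives back the original pixel values.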
The PatchExtractor class works the same way as extract_patches_2d, except that it can take several images as input at once.
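A sketch of the multi-image case, again with a made-up stack of five 4x4 RGB images as an assumption:

```python
import numpy as np
from sklearn.feature_extraction import image

# Five toy 4x4 RGB images stacked along the first axis (made-up values).
five_images = np.arange(5 * 4 * 4 * 3, dtype=float).reshape(5, 4, 4, 3)

# PatchExtractor is an estimator-style wrapper around the same patch logic.
extractor = image.PatchExtractor(patch_size=(2, 2))
all_patches = extractor.transform(five_images)

# 9 patches per image, 5 images.
print(all_patches.shape)
```

Because it follows the estimator interface, PatchExtractor can sit inside a pipeline, which the plain function cannot.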
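Connectivity graph of an image:
The idea is to decide, from the differences between pixels, whether each pair of pixels in the image is connected. The result is a "connectivity" matrix recording which pixels are connected.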
The function img_to_graph returns a pixel-to-pixel connectivity matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
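A short sketch of both functions on a toy 3x3 grayscale image. The image values are an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph, img_to_graph

# Toy 3x3 grayscale image (made-up values).
img = np.arange(9, dtype=float).reshape(3, 3)

# Sparse matrix with one row/column per pixel; edges link neighboring
# pixels and are weighted by the gradient between them.
graph = img_to_graph(img)

# Same grid structure built from the shape alone, without pixel values.
connectivity = grid_to_graph(*img.shape)

print(graph.shape, connectivity.shape)
```

A matrix like connectivity is what estimators such as Ward clustering accept when only neighboring pixels should be merged.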
Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py
My head hurts. Off to sleep.