Text feature extraction with sklearn

阿新 • Published: 2019-02-10

class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
(Processing is split into three steps: preprocessing, tokenizing, and n-grams generation.)
Parameters (the ones you usually need to set are decode_error, stop_words='english', token_pattern='...' (an important one), max_df, min_df, and max_features):
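The three steps can be inspected one at a time. The following is a minimal sketch, assuming the build_preprocessor(), build_tokenizer() and build_analyzer() helper methods of CountVectorizer; the sample sentence and the ngram_range=(1, 2) setting are illustrative choices, not taken from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) is an illustrative choice so that the n-gram step is visible.
vec = CountVectorizer(ngram_range=(1, 2))

preprocess = vec.build_preprocessor()  # step 1: lowercasing (and accent stripping, if enabled)
tokenize = vec.build_tokenizer()       # step 2: split the text with token_pattern
analyze = vec.build_analyzer()         # steps 1 + 2 + n-gram generation in one callable

text = "Hello, World of sklearn!"      # hypothetical sample document

print(preprocess(text))            # hello, world of sklearn!
print(tokenize(preprocess(text)))  # ['hello', 'world', 'of', 'sklearn']
print(analyze(text))               # unigrams first, then bigrams
```

None of these helpers requires a fitted vectorizer, so they are a convenient way to check a token_pattern before running fit_transform() on a large corpus.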
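input: the default 'content' is usually fine. With 'filename', the items passed to fit are file paths whose contents are read and analyzed; with 'file', the items are file-like objects with a read method.
encoding: the default utf-8 is usually fine; when bytes or files are given, the analyzer decodes the raw document with this encoding.
decode_error: default 'strict', which raises UnicodeDecodeError on bytes that cannot be decoded. 'ignore' skips the undecodable bytes, and 'replace' substitutes a replacement character for them.
strip_accents: default None; can be 'ascii' or 'unicode'. Removes accents from the raw document during the preprocessing step. 'ascii' is a fast method that only works on characters with a direct ASCII mapping; 'unicode' is slightly slower but works on any character.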
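analyzer: the default is usually fine. Can be a string ('word', 'char', or 'char_wb') or a callable such as a function.
preprocessor: None or a callable. A callable overrides the preprocessing step (strip_accents and lowercase) while keeping the tokenizing and n-grams generation steps.
tokenizer: None or a callable. A callable overrides the tokenizing step while keeping the preprocessing and n-grams generation steps; only applies if analyzer == 'word'.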
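ngram_range: the range (min_n, max_n) of n-gram lengths to extract; (1, 1) means unigrams only and (1, 2) means unigrams and bigrams. For detailed usage, see the third code box above section 4.2.3.4 at http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
stop_words: the stop word setting. 'english' uses the built-in English stop word list, a list supplies custom stop words, and None uses no stop word list. With None, setting max_df to a value in [0.7, 1.0) filters stop words automatically based on document frequency within the current corpus.
lowercase: converts all characters to lowercase before tokenizing.
token_pattern: the regular expression that defines a token; only used if analyzer == 'word'. The default pattern selects tokens of 2 or more alphanumeric characters. Punctuation is treated as a token separator and never becomes part of a token.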
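max_df: a float in [0.0, 1.0] or an int with no range limit; default 1.0. It acts as a threshold when building the vocabulary: a term whose document frequency is higher than max_df is excluded. A float is read as a proportion of the documents in the corpus, an int as an absolute number of documents that contain the term. Ignored if vocabulary is given.
min_df: like max_df, except that a term whose document frequency is lower than min_df is excluded from the vocabulary.
max_features: default None; can be an int. Terms are sorted by term frequency across the corpus in descending order and only the top max_features are kept as the vocabulary.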
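vocabulary: default None, in which case the vocabulary is built automatically from the input documents. Can also be a mapping (such as a dict) whose keys are terms and whose values are feature indices, or an iterable over terms.
binary: default False. A term may occur n times in one document; with binary=True every nonzero n is set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
dtype: fit_transform() and transform() of CountVectorizer return a document-term count matrix; dtype sets the numeric type of that matrix.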
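The effect of the most commonly tuned parameters can be seen on a toy corpus. This is a minimal sketch: the four sentences, the package-like strings com.example.app and org.demo.tool, and the custom token_pattern that keeps dots inside tokens are all made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

# Defaults: lowercase, tokens of 2+ word characters, so "a" is dropped.
default_vec = CountVectorizer().fit(docs)
default_terms = sorted(default_vec.vocabulary_)

# Keep terms found in at least 2 documents (min_df=2)
# and in at most 50% of the documents (max_df=0.5).
df_vec = CountVectorizer(min_df=2, max_df=0.5).fit(docs)
df_terms = sorted(df_vec.vocabulary_)
print(df_terms)  # ['cat', 'dog', 'on', 'sat']

# binary=True sets every nonzero count to 1.
counts = CountVectorizer().fit_transform(docs).toarray()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts.max(), flags.max())  # 2 1

# Illustrative token_pattern that allows dots inside a token.
dot_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w.]*\w\b")
dot_vec.fit(["com.example.app org.demo.tool"])
dot_terms = sorted(dot_vec.vocabulary_)
print(dot_terms)  # ['com.example.app', 'org.demo.tool']
```

Here "the" occurs in 3 of 4 documents, which exceeds max_df=0.5, while terms such as "mat" occur in only one document and fall below min_df=2.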

Attributes:
vocabulary_: a dict whose keys are terms and whose values are feature indices, for example:
com.furiousapps.haunt2: 57048
bale.yaowoo: 5025
asia.share.superayiconsumer: 4660
com.cooee.flakes: 38555
com.huahan.autopart: 67364
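The vocabulary is stored as an array. Each key in vocabulary_ is a term and its value is the index of that term in the array. The get_feature_names() method (get_feature_names_out() in scikit-learn 1.0 and later) returns the array, which can be used to verify the entries above: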
ipdb> count_vec.get_feature_names()[57048]
u'com.furiousapps.haunt2'
ipdb> count_vec.get_feature_names()[5025]
u'bale.yaowoo'

stop_words_: a set. The official documentation explains it well:
    Terms that were ignored because they either:
            occurred in too many documents (max_df)
            occurred in too few documents (min_df)
            were cut off by feature selection (max_features).
    This is only available if no vocabulary was given.
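This attribute is mainly for the programmer to inspect which terms were dropped and to check that the stop words are as expected. It is safe to set stop_words_ to None before pickling.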
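The pickling advice can be checked as follows. This is a minimal sketch with a hypothetical corpus; it only shows that a vectorizer whose stop_words_ was set to None still transforms documents identically after a pickle round trip.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vec = CountVectorizer(min_df=2).fit(docs)
before = vec.transform(docs).toarray()

# Drop the introspection-only attribute before serializing.
vec.stop_words_ = None
restored = pickle.loads(pickle.dumps(vec))

# transform() relies on vocabulary_, not on stop_words_.
after = restored.transform(docs).toarray()
print((before == after).all())  # True
```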
