scikit-learn: Extracting term frequencies with CountVectorizer

阿新 • Published: 2019-01-17

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=numpy.int64)

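Purpose: convert a collection of text documents to a matrix of token counts (that is, count how often each term occurs: the term frequency, tf). The result is a SciPy sparse matrix. The documentation at the time of writing names scipy.sparse.coo_matrix; in current releases fit_transform() returns a CSR matrix (scipy.sparse.csr_matrix).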
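To make the sparse output concrete, here is a minimal sketch. The two-document corpus is made up for illustration, and the dense conversion is only sensible for small data.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example
docs = ["dog cat fish", "dog cat cat"]
X = CountVectorizer().fit_transform(docs)

print(sparse.issparse(X))  # the result is a SciPy sparse matrix
print(X.shape)             # (number of documents, number of distinct tokens)

# Dense view for inspection; columns follow the sorted tokens: cat, dog, fish
dense = X.toarray()
print(dense)
```

Each row is one document and each column is one token, so `dense[1][0]` is the number of times "cat" occurs in the second document.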

Going through the parameters shows what CountVectorizer does while extracting tf:

strip_accents : {‘ascii’, ‘unicode’, None}: whether to strip accents (diacritical marks) during preprocessing. Not sure what an accent is? See: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==
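A small sketch of the effect of strip_accents, using an invented one-line document with accented words. It compares the learned vocabulary with and without accent stripping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented document containing accented characters
docs = ["café naïve résumé"]

plain = CountVectorizer().fit(docs)
stripped = CountVectorizer(strip_accents="unicode").fit(docs)

plain_tokens = sorted(plain.vocabulary_)
stripped_tokens = sorted(stripped.vocabulary_)
print(plain_tokens)     # accents are kept
print(stripped_tokens)  # accents are removed: cafe, naive, resume
```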

lowercase : boolean, True by default: convert all characters to lowercase before computing tf. This is usually left as True.

preprocessor : callable or None (default): overrides the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own function here.

tokenizer : callable or None (default): overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. You can supply your own function here.
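The two hooks above can be sketched as follows. The helper functions strip_digits and split_on_pipe are hypothetical names made up for this example. Note that a custom preprocessor replaces the built-in lowercasing and accent stripping, so this one lowercases the text itself; token_pattern=None is passed alongside the tokenizer because the pattern is unused in that case.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_digits(text):
    # Hypothetical preprocessor: lowercase the text and remove digits
    return re.sub(r"\d+", "", text.lower())

def split_on_pipe(text):
    # Hypothetical tokenizer: treat "|" as the only token separator
    return [t.strip() for t in text.split("|") if t.strip()]

cv_pre = CountVectorizer(preprocessor=strip_digits)
cv_pre.fit(["Dog123 CAT 42"])
pre_tokens = sorted(cv_pre.vocabulary_)
print(pre_tokens)

cv_tok = CountVectorizer(tokenizer=split_on_pipe, token_pattern=None)
tok_counts = cv_tok.fit_transform(["New York|San Francisco|New York"]).toarray()
tok_tokens = sorted(cv_tok.vocabulary_)
print(tok_tokens, tok_counts)
```

With the custom tokenizer, multi-word names such as "new york" survive as single tokens, which the default word pattern would split apart.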

stop_words : string {‘english’}, list, or None (default): if ‘english’, a built-in stop word list for English is used. If a list, every stop word in that list is removed from the resulting tokens. If None, no stop words are removed; in that case max_df can be set to a value in the range [0.7, 1.0) to detect and filter stop words automatically, based on the intra-corpus document frequency (df) of terms. Tune this parameter to your needs.
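A minimal sketch of both stop word modes, on an invented sentence. The custom list here is an assumption chosen for the example, not a recommended list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentence for illustration
docs = ["the dog chased the cat and the bird"]

# Remove exactly the words in a user-supplied list
custom = CountVectorizer(stop_words=["the", "and"]).fit(docs)
custom_tokens = sorted(custom.vocabulary_)
print(custom_tokens)

# Use the built-in English stop word list instead
builtin = CountVectorizer(stop_words="english").fit(docs)
builtin_tokens = sorted(builtin.vocabulary_)
print(builtin_tokens)
```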

token_pattern : string: a regular expression. The default selects tokens of 2 or more alphanumeric characters. It is only used when analyzer is set to ‘word’.
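One consequence of the default pattern is that single-character words are dropped. The sketch below, on an invented sentence, contrasts the default with a pattern that also accepts one-character tokens; the alternative pattern is an assumption chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentence containing one-character words
docs = ["I have a cat"]

default_cv = CountVectorizer().fit(docs)
default_tokens = sorted(default_cv.vocabulary_)
print(default_tokens)  # "i" and "a" are missing

# Same pattern as the default, but allowing tokens of length 1
short_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
short_tokens = sorted(short_cv.vocabulary_)
print(short_tokens)
```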

ngram_range : tuple (min_n, max_n): the lower and upper bounds on n. The default is ngram_range=(1, 1). Every n-gram with n inside this range is extracted as a feature. Tune this parameter to your needs.
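For instance, ngram_range=(1, 2) extracts both single words and adjacent word pairs. A minimal sketch on an invented document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented document for illustration
docs = ["dog cat fish"]

# Extract unigrams and bigrams together
cv = CountVectorizer(ngram_range=(1, 2)).fit(docs)
ngram_tokens = sorted(cv.vocabulary_)
print(ngram_tokens)
```

The vocabulary grows quickly as max_n increases, so wider ranges are often combined with min_df or max_features.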

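analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable: whether features are built from word n-grams or character n-grams. ‘char_wb’ builds character n-grams only from text inside word boundaries. If a callable is passed, it is your own function for extracting features from the raw, unprocessed input.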
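The difference between ‘char’ and ‘char_wb’ shows up once n-grams could span a space. A small sketch with character 3-grams on an invented two-word string:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented two-word string for illustration
docs = ["ab cd"]

# 'char' slides over the whole string, so n-grams may cross the space
char_cv = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(docs)
char_tokens = sorted(char_cv.vocabulary_)
print(char_tokens)

# 'char_wb' pads each word with spaces and never crosses a word boundary
wb_cv = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(docs)
wb_tokens = sorted(wb_cv.vocabulary_)
print(wb_tokens)
```

The 3-gram "b c" joins the end of one word to the start of the next, so it appears only with analyzer="char".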

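max_df : float in range [0.0, 1.0] or int, default=1.0; min_df : float in range [0.0, 1.0] or int, default=1: drop word tokens whose document frequency (df) is above max_df or below min_df. A float is read as a proportion of documents, an int as an absolute document count. These only take effect when the vocabulary parameter is None.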
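A minimal sketch of df-based filtering. In the invented corpus below, "the" occurs in all 4 documents, "dog" in 2, and every other word in 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: df(the)=4, df(dog)=2, df(cat)=df(fish)=df(barks)=1
docs = ["the dog", "the cat", "the fish", "the dog barks"]

# Keep tokens found in at least 2 documents and in at most 75% of them
cv = CountVectorizer(min_df=2, max_df=0.75).fit(docs)
kept_tokens = sorted(cv.vocabulary_)
print(kept_tokens)
```

Here min_df=2 removes the rare words and max_df=0.75 removes "the", which behaves like a corpus-specific stop word.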

max_features : int or None, default=None: keep only the max_features tokens with the highest term frequency across the corpus. This only takes effect when the vocabulary parameter is None.
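As a rough illustration of max_features, the invented corpus below has total counts of 4 for "cat", 2 for "dog" and 1 for "fish", so keeping the top 2 drops "fish".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: cat occurs 4 times, dog 2 times, fish once
docs = ["dog cat cat", "cat fish", "dog cat"]

cv = CountVectorizer(max_features=2).fit(docs)
top_tokens = sorted(cv.vocabulary_)
print(top_tokens)
```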

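vocabulary : Mapping or iterable, optional: a user-defined set of word tokens to use as features. If it is not None, tf is computed only for the words in vocabulary. Leaving it as None is usually the safer choice.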
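A short sketch of a fixed vocabulary, using an invented word list. When a list is passed, the columns follow the order of that list, and words outside it are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary chosen for illustration; column 0 is dog, column 1 is cat
cv = CountVectorizer(vocabulary=["dog", "cat"])
fixed_counts = cv.fit_transform(["dog cat fish cat"]).toarray()

print(cv.vocabulary_)
print(fixed_counts)  # "fish" is not in the vocabulary, so it is not counted
```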

binary : boolean, default=False: if True, every non-zero count is set to 1, so each value only records whether a token is present or absent. This is useful for discrete probabilistic models that model binary events rather than integer counts.
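The effect of binary=True can be seen on an invented document where one word repeats:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented document in which "cat" occurs twice
docs = ["dog cat cat"]

# Columns follow the sorted tokens: cat, dog
count_matrix = CountVectorizer().fit_transform(docs).toarray()
binary_matrix = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(count_matrix)   # raw counts
print(binary_matrix)  # presence/absence only
```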

dtype : type, optional: type of the matrix returned by fit_transform() or transform().

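Conclusion: when extracting tf, CountVectorizer can strip accents, lowercase the text, remove stop words, and extract every feature within ngram_range on a word basis (or a character basis, depending on the analyzer parameter), while dropping the features excluded by max_df, min_df and max_features. The tf values can also be made binary.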

Finally, a look at two methods:

fit(raw_documents[, y]) Learn a vocabulary dictionary of all tokens in the raw documents.

fit_transform(raw_documents[, y]) Learn the vocabulary dictionary and return term-document matrix.
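A minimal sketch of the usual split between learning and applying a vocabulary, with invented documents. The vocabulary is learned with fit(), and transform() then counts only the tokens already in it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training documents; the learned columns are cat, dog, fish
cv = CountVectorizer()
cv.fit(["dog cat", "fish"])
learned_tokens = sorted(cv.vocabulary_)

# "bird" was never seen during fit, so transform() ignores it
new_counts = cv.transform(["dog bird dog"]).toarray()

print(learned_tokens)
print(new_counts)
```

fit_transform(docs) gives the same result as fit(docs) followed by transform(docs) on the same documents.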

Reposted from: https://blog.csdn.net/mmc2015/article/details/46866537?utm_source=copy

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
print(cv.vocabulary_)  # mapping from each token to its column index

print(cv_fit.toarray())
print('Total count per feature:', cv_fit.toarray().sum(axis=0))  # .sum(axis=0) sums each column
