
15. [Advanced] Feature Enhancement via Feature Extraction: CountVectorizer and TfidfVectorizer

#Learning goal 1: use CountVectorizer and TfidfVectorizer to extract and vectorize features from unstructured, symbolic data (such as a list of strings)

from sklearn.datasets import fetch_20newsgroups
# Download the newsgroup samples from the internet; subset='all' fetches all ~20,000 documents
# subset : 'train' or 'test', 'all', optional
#     Select the dataset to load: 'train' for the training set, 'test'
#     for the test set, 'all' for both, with shuffled ordering.
news = fetch_20newsgroups(subset='all')

# Split the dataset (sklearn.cross_validation is deprecated; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

# Extract features with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
X_count_train = count_vec.fit_transform(X_train)
X_count_test = count_vec.transform(X_test)
# Train the model and predict with a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
mnb_count = MultinomialNB()
mnb_count.fit(X_count_train, y_train)
y_count_predict = mnb_count.predict(X_count_test)
print('The Accuracy of mnb(CountVectorizer) is', mnb_count.score(X_count_test, y_test))

from sklearn.metrics import classification_report
print(classification_report(y_test, y_count_predict, target_names=news.target_names))
# Output: with CountVectorizer, stop words kept, and a default Naive Bayes classifier, prediction accuracy is 83.977%

# For comparison, vectorize the same text with TfidfVectorizer, again keeping stop words
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)

# Train the model and predict with a Naive Bayes classifier
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, y_train)
y_tfidf_predict = mnb_tfidf.predict(X_tfidf_test)
print('The Accuracy of mnb(TfidfVectorizer) is', mnb_tfidf.score(X_tfidf_test, y_test))
print(classification_report(y_test, y_tfidf_predict, target_names=news.target_names))
# Output: with TfidfVectorizer, stop words kept, and a default Naive Bayes classifier, prediction accuracy is 84.634%
# This shows that when the training corpus is large, letting TfidfVectorizer suppress the interference
# of common words in classification decisions often improves model performance.

# ***************************************************************************************************
# Learning goal 2: with stop words removed, vectorize the text with CountVectorizer and TfidfVectorizer
# respectively, then train and evaluate with Naive Bayes
count_filter_vec, tfidf_filter_vec = CountVectorizer(analyzer='word', stop_words='english'), TfidfVectorizer(analyzer='word', stop_words='english')

# Vectorize the training and test text with the stop-word-filtering CountVectorizer and TfidfVectorizer
X_count_filter_train = count_filter_vec.fit_transform(X_train)
X_count_filter_test = count_filter_vec.transform(X_test)
X_tfidf_filter_train = tfidf_filter_vec.fit_transform(X_train)
X_tfidf_filter_test = tfidf_filter_vec.transform(X_test)

# Train and evaluate with default Naive Bayes classifiers
# 1. CountVectorizer with stop words filtered:
mnb_count.fit(X_count_filter_train, y_train)
y_count_filter_predict = mnb_count.predict(X_count_filter_test)
print('The Accuracy of mnb(CountVectorizer) is', mnb_count.score(X_count_filter_test, y_test))
print(classification_report(y_test, y_count_filter_predict, target_names=news.target_names))
# Output: with CountVectorizer, stop words removed, and a default Naive Bayes classifier, prediction accuracy is 86.375%

# 2. TfidfVectorizer with stop words filtered:
mnb_tfidf.fit(X_tfidf_filter_train, y_train)
y_tfidf_filter_predict = mnb_tfidf.predict(X_tfidf_filter_test)
print('The Accuracy of mnb(TfidfVectorizer) is', mnb_tfidf.score(X_tfidf_filter_test, y_test))
print(classification_report(y_test, y_tfidf_filter_predict, target_names=news.target_names))
# Output: with TfidfVectorizer, stop words removed, and a default Naive Bayes classifier, prediction accuracy is 88.264%

# Summary: with the same model and a large text corpus, performance ranks:
#   TfidfVectorizer with stop words removed > CountVectorizer with stop words removed
#   > TfidfVectorizer with stop words kept > CountVectorizer with stop words kept