15. [Advanced] Feature boosting via feature extraction: CountVectorizer and TfidfVectorizer
阿新 • Published: 2019-02-19
# Learning goal 1: use CountVectorizer and TfidfVectorizer to extract features from, and vectorize, unstructured symbolic data (e.g. a list of strings)
from sklearn.datasets import fetch_20newsgroups
# Download the news samples from the internet on the fly; subset='all' fetches all of the roughly 20,000 text documents
# subset : 'train' or 'test', 'all', optional
# Select the dataset to load: 'train' for the training set, 'test'
# for the test set, 'all' for both, with shuffled ordering.
news = fetch_20newsgroups(subset='all')
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)
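As a small aside, the split proportions are easy to check on toy data (the list below is made up purely for illustration):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 8 labels
data = list(range(8))
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# test_size=0.25 holds out a quarter of the samples for testing
tr_X, te_X, tr_y, te_y = train_test_split(
    data, labels, test_size=0.25, random_state=33)
print(len(tr_X), len(te_X))  # 6 2
```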
# Extract features with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
X_count_train = count_vec.fit_transform(X_train)
X_count_test = count_vec.transform(X_test)
# Train a naive Bayes classifier on the count features and predict on the test set
from sklearn.naive_bayes import MultinomialNB
mnb_count = MultinomialNB()
mnb_count.fit(X_count_train,y_train)
y_count_predict = mnb_count.predict(X_count_test)
print('The accuracy of MultinomialNB (CountVectorizer) is', mnb_count.score(X_count_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_count_predict, target_names=news.target_names))
# The output shows that CountVectorizer, without stop-word removal and with a default-configured naive Bayes classifier, reaches 83.977% prediction accuracy
# For comparison, quantify the text features with TfidfVectorizer, again without removing stop words
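What `fit_transform` returns can be seen on a couple of made-up sentences (toy data, not the news corpus): a sparse document-term matrix of raw token counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only
docs = ["the cat sat on the mat", "the dog sat"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix, one row per document

# The vocabulary is built from every token seen during fit
print(sorted(vec.vocabulary_))  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# Raw counts: 'the' appears twice in the first document
print(X.toarray())
```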
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)
# Train a naive Bayes classifier on the tf-idf features and predict on the test set
mnb_tfidf= MultinomialNB()
mnb_tfidf.fit(X_tfidf_train,y_train)
y_tfidf_predict = mnb_tfidf.predict(X_tfidf_test)
print('The accuracy of MultinomialNB (TfidfVectorizer) is', mnb_tfidf.score(X_tfidf_test, y_test))
print(classification_report(y_test, y_tfidf_predict, target_names=news.target_names))
# The output shows that TfidfVectorizer, without stop-word removal and with the same default naive Bayes classifier, reaches 84.634% prediction accuracy
# This indicates that when the training corpus is large, TfidfVectorizer's suppression of very frequent words' interference with the classification decision often improves model performance
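The suppression effect described above can be demonstrated on toy sentences (made up for illustration): a word that occurs in every document gets a lower idf, and hence a lower tf-idf weight, than a word with the same in-document count but a lower document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'common' appears in every document
docs = ["apple banana common", "cherry banana common", "durian common"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

# In the first document each word occurs exactly once, yet 'common'
# (document frequency 3) is weighted below 'apple' (document frequency 1)
row0 = X[0]
print(row0[vocab['common']] < row0[vocab['apple']])  # True
```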
#***************************************************************************************************
# Learning goal 2: with stop words removed, quantify the text features with CountVectorizer and TfidfVectorizer respectively, then train and evaluate naive Bayes classifiers
count_filter_vec = CountVectorizer(analyzer='word', stop_words='english')
tfidf_filter_vec = TfidfVectorizer(analyzer='word', stop_words='english')
# Vectorize the training and test texts with the stop-word-filtering CountVectorizer and TfidfVectorizer
X_count_filter_train = count_filter_vec.fit_transform(X_train)
X_count_filter_test = count_filter_vec.transform(X_test)
X_tfidf_filter_train = tfidf_filter_vec.fit_transform(X_train)
X_tfidf_filter_test = tfidf_filter_vec.transform(X_test)
# Train and evaluate default-configured naive Bayes classifiers
#1.CountVectorizer with filtering stopwords:
mnb_count.fit(X_count_filter_train,y_train)
y_count_filter_predict = mnb_count.predict(X_count_filter_test)
print('The accuracy of MultinomialNB (CountVectorizer, stop words filtered) is', mnb_count.score(X_count_filter_test, y_test))
print(classification_report(y_test, y_count_filter_predict, target_names=news.target_names))
# The output shows that CountVectorizer, with stop words removed and a default-configured naive Bayes classifier, reaches 86.375% prediction accuracy
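What `stop_words='english'` actually removes can be seen on toy text (illustrative only): words on scikit-learn's built-in English stop-word list are dropped from the vocabulary before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents full of English function words
docs = ["this is the cat", "that was a dog"]

plain = CountVectorizer()
filtered = CountVectorizer(stop_words='english')
plain.fit(docs)
filtered.fit(docs)

print(sorted(plain.vocabulary_))     # 'is', 'the', 'this', ... survive
print(sorted(filtered.vocabulary_))  # only the content words remain
```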
#2.TfidfVectorizer with filtering stopwords:
mnb_tfidf.fit(X_tfidf_filter_train,y_train)
y_tfidf_filter_predict = mnb_tfidf.predict(X_tfidf_filter_test)
print('The accuracy of MultinomialNB (TfidfVectorizer, stop words filtered) is', mnb_tfidf.score(X_tfidf_filter_test, y_test))
print(classification_report(y_test, y_tfidf_filter_predict, target_names=news.target_names))
# The output shows that TfidfVectorizer, with stop words removed and a default-configured naive Bayes classifier, reaches 88.264% prediction accuracy
# In summary: with the same model and a reasonably large text corpus, performance ranks as
# TfidfVectorizer with stop-word filtering > CountVectorizer with stop-word filtering
# > TfidfVectorizer without filtering > CountVectorizer without filtering