
《深度學習Python實踐》Chapter 22: A Text Classification Example

The code is as follows:

1) Algorithm comparison

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt

categories=['alt.atheism', 'rec.sport.hockey', 'sci.crypt', 'comp.sys.ibm.pc.hardware',
            'sci.med', 'comp.sys.mac.hardware', 'sci.space', 'comp.windows.x',
            'soc.religion.christian', 'misc.forsale', 'talk.politics.guns', 'rec.autos',
            'talk.politics.mideast', 'rec.motorcycles', 'talk.politics.misc',
            'rec.sport.baseball', 'talk.religion.misc']

# Load the training data
train_path='/home/duan/下載/20news-bydate/20news-bydate-train'
dataset_train=load_files(container_path=train_path,categories=categories)
# Load the evaluation data
test_path='/home/duan/下載/20news-bydate/20news-bydate-test'
dataset_test=load_files(container_path=test_path,categories=categories)

# Data preparation and understanding
# Compute term counts
count_vect=CountVectorizer(stop_words='english',decode_error='ignore')
X_train_counts=count_vect.fit_transform(dataset_train.data)
# Check the data dimensions; the term-count result is printed below
print(X_train_counts.shape)
# Compute TF-IDF
tf_transformer=TfidfVectorizer(stop_words='english',decode_error='ignore')
X_train_counts_tf=tf_transformer.fit_transform(dataset_train.data)
print(X_train_counts_tf.shape)
# Text features were extracted with the two methods above and the data dimensions checked.
# Next, the TF-IDF features are used to train the classification models.

# Evaluate algorithms
# Set the evaluation baseline
num_folds=10
seed=7
scoring='accuracy'
# Linear algorithm: LR
# Non-linear algorithms: CART, SVM, MNB, KNN
models={}
models['LR']=LogisticRegression()
models['SVM']=SVC()
models['CART']=DecisionTreeClassifier()
models['MNB']=MultinomialNB()
models['KNN']=KNeighborsClassifier()

# Compare the algorithms
results=[]
for key in models:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    cv_result = cross_val_score(models[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' %(key, cv_result.mean(), cv_result.std()))

Output:

(7838, 77172)
(7838, 77172)
KNN: 0.824575 (0.012700)
LR: 0.920900 (0.008155)
CART: 0.703240 (0.013782)
MNB: 0.896786 (0.009055)
SVM: 0.062772 (0.004306)
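The first two lines of the output are the dimensions of the term-count matrix and of the TF-IDF matrix. To make the difference between these two representations concrete, here is a minimal, self-contained sketch on a made-up two-sentence corpus (not part of the original example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, just to contrast raw term counts with TF-IDF weights
docs=['the cat sat on the mat', 'the dog sat on the log']
print(CountVectorizer().fit_transform(docs).toarray())           # integer term counts
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))  # counts reweighted by inverse document frequency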

Comparing the algorithms with a box plot:

# Box plot comparison of the 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
plt.show()

Output:
(box plot comparing the 10-fold cross-validation accuracy of the five algorithms)
The figure shows that the dispersion of the Naive Bayes classifier's results is fairly good, while logistic regression shows a larger skew. The spread of an algorithm's results reflects how well it suits the data, so logistic regression and the Naive Bayes classifier are investigated further by tuning their parameters.

2) Algorithm tuning

The analysis above shows that LR and MNB are worth optimizing further. Next, the parameters of these two algorithms are tuned to further improve accuracy.

(1) Tuning logistic regression

The hyperparameter of logistic regression is C, which constrains the objective function: the smaller C is, the stronger the regularization. To tune C, a fixed set of candidate values is tried each time; if the best parameter turns out to lie on the boundary of that set, the step is repeated with a new set of values around it until the optimum is found.

# Algorithm tuning
# Tune LR
param_grid={}
param_grid['C']=[0.1,5,13,15]
model=LogisticRegression()
kfold=KFold(n_splits=num_folds,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))

Output:
Best: 0.9393978055626435 using {'C': 15}
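C=15 lies on the boundary of the grid that was searched, so following the procedure described above the grid could be widened around it and the search repeated. A minimal sketch; the candidate values below are illustrative, not from the book:

# C=15 sits at the edge of the previous grid, so search a wider range around it
param_grid={'C':[10,15,20,30,50]}
kfold=KFold(n_splits=num_folds,random_state=seed)
grid=GridSearchCV(estimator=LogisticRegression(),param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))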

(2) Tuning Multinomial Naive Bayes

Naive Bayes has an alpha parameter, a smoothing parameter whose default value is 1.0.
This parameter can be tuned to improve the accuracy of the algorithm.
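What alpha does is Lidstone/Laplace smoothing: each per-class word probability is estimated as (N_wc + alpha) / (N_c + alpha * n_features), so larger values pull the estimates towards a uniform distribution. A minimal sketch on a made-up count matrix (not part of the original example):

import numpy as np
# Toy term-count matrix: two documents (one per class), three terms
X_toy=np.array([[5,0,1],[0,4,1]])
y_toy=np.array([0,1])
for a in (0.001, 1.0):
    clf=MultinomialNB(alpha=a).fit(X_toy,y_toy)
    # Larger alpha flattens the estimated word probabilities of each class
    print(a, np.exp(clf.feature_log_prob_).round(3))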

# Algorithm tuning
# Tune MNB
param_grid={}
param_grid['alpha']=[0.001,0.01,0.1,1.5]
model=MultinomialNB()
kfold=KFold(n_splits=num_folds,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))
cv_results=zip(grid_result.cv_results_['mean_test_score'],
               grid_result.cv_results_['std_test_score'],
               grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r'%(mean, std, param)) 

Output:

Best: 0.934804797142128 using {'alpha': 0.01}
0.929829 (0.008380) with {'alpha': 0.001}
0.934805 (0.008096) with {'alpha': 0.01}
0.928043 (0.008024) with {'alpha': 0.1}
0.889640 (0.010375) with {'alpha': 1.5}

The best parameter for MNB is alpha=0.01, i.e. Best: 0.934804797142128 using {'alpha': 0.01}.
The best parameter for LR is C=15, i.e. Best: 0.9393978055626435 using {'C': 15}.

Tuning shows that LR achieves the best accuracy with C=15. Next, ensemble algorithms are examined.

3) Ensemble algorithms

Random Forest (RF)
AdaBoost (AB)

ensembles={}
ensembles['RF']=RandomForestClassifier()
ensembles['AB']=AdaBoostClassifier()
# Compare the ensemble algorithms
results=[]
for key in ensembles:
    kfold = KFold(n_splits= num_folds, random_state=seed)
    cv_result = cross_val_score(ensembles[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' %(key, cv_result.mean(), cv_result.std()))

Output:

RF: 0.773795 (0.017244)
AB: 0.620055 (0.017638)

Box plot:

# Box plot comparison of the 10-fold cross-validation results
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()

(box plot comparing the 10-fold cross-validation accuracy of RF and AB)

The box plot shows that the random forest results are distributed fairly evenly, so it suits this data well and is worth further tuning.

4) Ensemble algorithm tuning

# Ensemble algorithm tuning
# Tune RF
param_grid={}
param_grid['n_estimators']=[10,100,150,200]
model=RandomForestClassifier()
kfold=KFold(n_splits=num_folds,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))

cv_results=zip(grid_result.cv_results_['mean_test_score'],
               grid_result.cv_results_['std_test_score'],
               grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r'%(mean, std, param)) 

Output:

Best: 0.888236795100791 using {'n_estimators': 200}
0.779025 (0.007910) with {'n_estimators': 10}
0.882496 (0.012405) with {'n_estimators': 100}
0.887982 (0.010867) with {'n_estimators': 150}
0.888237 (0.009727) with {'n_estimators': 200}
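Accuracy is still rising at n_estimators=200, the edge of the grid, so by the same logic as the C tuning above, larger forests could be tried before settling. A minimal sketch; the candidate values are illustrative, and training time grows with the number of trees:

param_grid={'n_estimators':[200,300,400,500]}
kfold=KFold(n_splits=num_folds,random_state=seed)
grid=GridSearchCV(estimator=RandomForestClassifier(),param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))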

Finalizing the model

# Finalize the model
# Train LR with the tuned parameter C=15 on the training set
model=LogisticRegression(C=15)
model.fit(X=X_train_counts_tf,y=dataset_train.target)
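# Transform the test documents with the TF-IDF vectorizer fitted on the training data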
X_test_counts=tf_transformer.transform(dataset_test.data)
predictions=model.predict(X_test_counts)
print(accuracy_score(dataset_test.target,predictions))
print(classification_report(dataset_test.target,predictions))

Output:

0.8844163312248419
             precision    recall  f1-score   support

          0       0.85      0.79      0.82       319
          1       0.78      0.84      0.81       392
          2       0.86      0.88      0.87       385
          3       0.91      0.89      0.90       395
          4       0.81      0.90      0.86       390
          5       0.91      0.91      0.91       396
          6       0.97      0.95      0.96       398
          7       0.94      0.97      0.96       397
          8       0.97      0.94      0.96       396
          9       0.92      0.89      0.91       396
         10       0.93      0.95      0.94       394
         11       0.86      0.93      0.89       398
         12       0.91      0.77      0.84       310
         13       0.70      0.62      0.65       251

avg / total       0.89      0.88      0.88      5217
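
As a usage note, the fitted vectorizer and the final model can be used to classify a new document. A minimal sketch; the example sentence is made up:

# Classify a new, made-up document with the fitted TF-IDF vectorizer and the final LR model
new_doc=['The graphics card in my PC overheats when rendering 3D scenes']
pred=model.predict(tf_transformer.transform(new_doc))
print(dataset_train.target_names[pred[0]])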