
Machine Learning Journey: Predicting News Categories with a Python Naive Bayes Classifier


Learning the naive Bayes classification API with Python 3.

This involves extracting feature vectors from raw text strings.

You are welcome to download the source code from my git repository: https://github.com/linyi0604/kaggle
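Before the full script, here is a minimal sketch of how CountVectorizer turns raw strings into count-based feature vectors. The two toy sentences are invented purely for illustration and are not taken from the 20newsgroups data:

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, invented purely for illustration
docs = ["the pens beat the devils", "the devils won the final game"]

vec = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term count matrix
features = vec.fit_transform(docs)

# get_feature_names_out() in scikit-learn >= 1.0 (older versions use get_feature_names())
print(vec.get_feature_names_out())
print(features.toarray())  # each row holds the word counts for one document

The script below does the same thing at scale, except that fit_transform is applied only to the training texts and transform to the test texts, so both are encoded with the vocabulary learned from the training set.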

from sklearn.datasets import fetch_20newsgroups
# train_test_split lives in sklearn.model_selection in newer scikit-learn versions
from sklearn.cross_validation import train_test_split
# Import the text feature vectorization module
from sklearn.feature_extraction.text import CountVectorizer
# Import the naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Model evaluation module
from sklearn.metrics import classification_report

'''
The naive Bayes model is widely used for large-scale internet text classification tasks.
Because it assumes the features are conditionally independent of one another, the number of
parameters that must be estimated for prediction drops from exponential to roughly linear
scale, which saves memory and computation time.
However, the model cannot take relationships between features into account, so it performs
poorly on classification tasks where the features are strongly correlated.
'''

'''
1 Read the data
'''
# This API downloads the data from the internet on the fly
news = fetch_20newsgroups(subset="all")
# Inspect the size and details of the data
# print(len(news.data))
# print(news.data[0])
'''
18846

From: Mamatha Devineni Ratnam <[email protected]>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of
Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
'''

'''
2 Split the data
'''
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

'''
3 Predict news categories with the naive Bayes classifier
'''
# Convert the text into feature vectors
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)
# Initialize the naive Bayes model
mnb = MultinomialNB()
# Train on the training set to estimate the parameters
mnb.fit(x_train, y_train)
# Predict on the test set and keep the predictions
y_predict = mnb.predict(x_test)

'''
4 Model evaluation
'''
print("Accuracy:", mnb.score(x_test, y_test))
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
'''
Accuracy: 0.8397707979626485
Other metrics:
                           precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
'''
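As a usage note, the fitted vectorizer and model from the script above can be reused to classify a new, unseen piece of text. The snippet below is a hypothetical illustration (the sample sentence is invented), assuming vec, mnb and news from the script are still in scope:

# Hypothetical follow-up: classify a new, unseen piece of text
# (assumes vec, mnb and news from the script above are still in scope)
sample = ["The goalie made several great saves in the playoff game last night."]

# New text must go through the SAME fitted vectorizer (transform, not fit_transform)
sample_features = vec.transform(sample)

predicted_label = mnb.predict(sample_features)[0]
print(news.target_names[predicted_label])  # prints one of the 20 newsgroup category names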
