Machine Learning in Practice for Beginners: Naive Bayes on the 20 Newsgroups Dataset

阿新 • Published: 2019-02-10

Naive Bayes on the Newsgroups Dataset

For the theory behind naive Bayes, see: 樸素貝葉斯法 (The Naive Bayes Method)

About the Newsgroups Dataset

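The 20newsgroups dataset is one of the international standard datasets for research in text classification, text mining, and information retrieval. Some of the newsgroups cover very similar topics (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are completely unrelated (e.g. misc.forsale / soc.religion.christian).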

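The 20 newsgroups dataset contains roughly 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the training and test sets is based on whether a message was posted before or after a specific date.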

Hands-On Code

First, as usual, load the dataset:

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
print(len(news.data))
print(news.data[0])

This prints the number of posts followed by one sample post:

18846
From: Mamatha Devineni Ratnam [email protected]
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers’ relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!

Next, split the dataset, again into 75% training and 25% test:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)
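
To make the 75/25 split concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-hold-out split. The helper name `split_75_25` is hypothetical, and this is not scikit-learn's implementation: the shuffle order will differ from what `train_test_split` produces with `random_state=33`. It only illustrates the idea of shuffling once and holding out a fixed fraction.

```python
import math
import random


def split_75_25(samples, labels, test_size=0.25, seed=33):
    """Shuffle once, then hold out a `test_size` fraction for testing."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(samples))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up toy data: eight documents with alternating labels.
docs = ["doc %d" % i for i in range(8)]
targets = [i % 2 for i in range(8)]
x_train, x_test, y_train, y_test = split_75_25(docs, targets)
print(len(x_train), len(x_test))  # 6 2
```

Each document stays paired with its label because both lists are indexed through the same shuffled index list.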

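We need to extract features from the text, and here we use CountVectorizer to do it. CountVectorizer tokenizes the text and turns each document into a vector by counting how often each word occurs (for more on text feature extraction, see https://www.cnblogs.com/Haichao-Zhang/p/5220974.html). We then import the model and fit it to the data.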

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)
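
The fit/transform steps above can be sketched in plain Python to show what a bag-of-words count vector looks like. The helpers `tokenize`, `fit_vocabulary`, and `transform` are hypothetical names, and the token pattern is an assumption meant to approximate CountVectorizer's default behavior (lowercasing, tokens of two or more word characters). The real class returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's defaults, i.e. lowercase the
# text and keep tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every token seen in `documents` to a column index (sorted order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a dense list of token counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens unseen during fitting are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


# Made-up toy documents.
train_docs = ["Pens beat the Devils", "the Pens rule"]
test_docs = ["Devils fans love the Devils"]

vocab = fit_vocabulary(train_docs)
print(vocab)  # {'beat': 0, 'devils': 1, 'pens': 2, 'rule': 3, 'the': 4}
print(transform(train_docs, vocab))  # [[1, 1, 1, 0, 1], [0, 0, 1, 1, 1]]
print(transform(test_docs, vocab))   # [[0, 2, 0, 0, 1]]
```

The vocabulary is built from the training documents only, which is why the test set goes through `transform` rather than `fit_transform`: words such as "fans" and "love" that never appeared in training are simply ignored.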

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train,Y_train)
y_predict = mnb.predict(X_test)
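
For intuition about what the fit and predict calls do, the following is a minimal sketch of a multinomial naive Bayes classifier over count vectors, with additive (Laplace) smoothing. The functions `fit_multinomial_nb` and `predict` are hypothetical helpers, not scikit-learn's code; `alpha=1.0` is chosen to match the default smoothing of MultinomialNB, and the toy data is made up.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    n_features = len(rows[0])
    class_docs = defaultdict(int)
    class_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_docs[label] += 1
        totals = class_counts[label]
        for j, count in enumerate(row):
            totals[j] += count

    model = {}
    for label, totals in class_counts.items():
        log_prior = math.log(class_docs[label] / len(rows))
        denom = sum(totals) + alpha * n_features
        log_likelihood = [math.log((c + alpha) / denom) for c in totals]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, row):
    """Return the class with the highest log prior plus log likelihood."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(model, key=score)


# Columns: counts of "goal", "puck", "disk", "cpu" (made-up toy data).
rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 3]]
labels = ["hockey", "hockey", "hardware", "hardware"]

model = fit_multinomial_nb(rows, labels)
print(predict(model, [1, 1, 0, 0]))  # hockey
print(predict(model, [0, 0, 1, 1]))  # hardware
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never occurred in a class from forcing that class's probability to zero.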

Finally, as before, check the model's accuracy:

from sklearn.metrics import classification_report
print('The Accuracy of Naive Bayes Classifier is',mnb.score(X_test,Y_test))
print(classification_report(Y_test,y_predict,target_names = news.target_names))
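
As a rough guide to reading these numbers, the sketch below computes accuracy and per-class precision, recall, F1, and support by hand on made-up labels. The helper names `accuracy` and `per_class_report` are hypothetical. They illustrate the definitions behind the metrics reported by `mnb.score` and `classification_report`, not the library's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1, support)
    return report


# Made-up toy labels: one hockey post is misclassified as hardware.
y_true = ["hockey", "hockey", "hockey", "hardware", "hardware"]
y_pred = ["hockey", "hockey", "hardware", "hardware", "hardware"]

print("accuracy:", accuracy(y_true, y_pred))  # accuracy: 0.8
for label, (p, r, f1, n) in per_class_report(y_true, y_pred).items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

Precision answers "of the posts predicted as this class, how many were right", recall answers "of the posts truly in this class, how many were found", and F1 is their harmonic mean.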

Code reference: 《Python機器學習及實踐：從零開始通往Kaggle競賽之路》 (Python Machine Learning and Practice: From Zero to Kaggle Competitions)
