Machine Learning in Action Study Notes: 04. Naive Bayes Classification (bayes)

阿新 • Published: 2018-11-03

Keywords: naive Bayes, Python, source code walkthrough
Author: 米倉山下
Date: 2018-10-25
Machine Learning in Action (@author: Peter Harrington)
Source code download: https://www.manning.com/books/machine-learning-in-action
git@github.com:pbharrin/machinelearninginaction.git

*************************************************************
I. Naive Bayes Classification (bayes)

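#How naive Bayes text classification works:
Bayes' formula: P(ci|w) = P(w|ci)P(ci)/p(w)
First compute P(ci), the probability that a document is abusive or non-abusive, i.e. P(1) and P(0). Next compute P(w|ci) under the assumption that words occur independently of one another given the class (this is what "naive" refers to): p(w0,w1,…,wN|ci) = p(w0|ci)p(w1|ci)…p(wN|ci). Then use the formula to compute the probability of each class given the word vector, and return the class with the largest probability.

Training: estimate P(w|ci), the conditional probability of word vector w given class ci, with P(w|ci) = p(w0,w1,…,wN|ci) = p(w0|ci)p(w1|ci)…p(wN|ci), and P(ci), the probability of class ci. p(w), the probability of word vector w, is the same for every class, so it does not change which class wins the comparison and need not be computed. The quantity to solve for is P(ci|w), the conditional probability of class ci given word vector w.

Testing: use the formula to compute the probability of each class given the word vector, and return the class with the largest probability.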
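
To make the formula concrete, here is a minimal sketch with two words and two classes. All probabilities are made-up toy numbers, and the names `prior`, `likelihood`, and `posterior` are illustrative only; they do not come from the book's code.

```python
# Toy numbers, invented for illustration only.
prior = {0: 0.5, 1: 0.5}                       # P(ci)
likelihood = {                                 # P(word | ci)
    0: {"dog": 0.10, "stupid": 0.02},
    1: {"dog": 0.08, "stupid": 0.20},
}

def posterior(words):
    """Return P(ci | w) for each class under the naive independence assumption."""
    score = {}
    for c in prior:
        s = prior[c]                           # start from P(ci)
        for w in words:
            s *= likelihood[c][w]              # multiply by each P(wj | ci)
        score[c] = s
    evidence = sum(score.values())             # p(w): the same number for every class
    return {c: score[c] / evidence for c in score}

post = posterior(["dog", "stupid"])
best = max(post, key=post.get)                 # class with the largest posterior
print(post, best)
```

Because `evidence` divides every class score by the same number, dropping it would not change `best`. That is why the training code never computes p(w).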

--------------------------------------------------------------
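#Naive Bayes text classification: training function
#input: trainMatrix, the training samples (set-of-words model: 1 if a word occurs, 0 otherwise); trainCategory, the vector of class labels
#output: p0Vect (marked [2] in the code), p1Vect (marked [1] in the code), both holding log conditional probabilities; pAbusive, the probability P(1) that a sample is abusive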

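from numpy import ones, log                             #numpy functions used below

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     #number of training samples
    numWords = len(trainMatrix[0])                      #number of features (vocabulary size)
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #P(1): fraction of abusive samples, i.e. P(ci)
    p0Num = ones(numWords); p1Num = ones(numWords)      #word counts start at 1 rather than 0, see [Note 1]
    p0Denom = 2.0; p1Denom = 2.0                        #denominators start at 2.0, see [Note 1]
    for i in range(numTrainDocs):                       #loop over the samples
        if trainCategory[i] == 1:                       #abusive sample
            p1Num += trainMatrix[i]                     #accumulate the word vectors of abusive samples
            p1Denom += sum(trainMatrix[i])              #total number of words seen in abusive samples
        else:                                           #non-abusive sample
            p0Num += trainMatrix[i]                     #accumulate the word vectors of non-abusive samples
            p0Denom += sum(trainMatrix[i])              #total number of words seen in non-abusive samples
    p1Vect = log(p1Num/p1Denom)           #[1] log of P(w|ci) for every vocabulary word, abusive class, see [Note 1]
    p0Vect = log(p0Num/p0Denom)           #[2] log of P(w|ci) for every vocabulary word, non-abusive class
    return p0Vect,p1Vect,pAbusive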

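#[Note 1]: When a document is classified with the Bayes classifier, several probabilities are multiplied to obtain the probability that the document belongs to a class, e.g. p(w0|c=1)p(w1|c=1)p(w2|c=1). If any single factor is zero, the whole product is zero. To avoid this, p0Num and p1Num are initialized to 1 and the denominators p0Denom and p1Denom to 2. In addition, multiplying many small numbers can cause underflow, so the logarithm is taken instead: f(x) and ln(f(x)) increase and decrease together, and ln(a*b) = ln(a) + ln(b), which turns the subsequent products into sums.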
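
Both problems described in the note are easy to reproduce. The sketch below uses made-up numbers (2000 per-word probabilities of 0.01, and a three-word count vector); the variable names are illustrative and not part of bayes.py.

```python
import numpy as np

# Underflow: multiply 2000 hypothetical per-word probabilities of 0.01 each.
probs = np.full(2000, 0.01)
product = np.prod(probs)             # underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(probs))      # stays finite: 2000 * ln(0.01)
print(product, log_sum)

# Zero probabilities: one word never occurs in this class.
counts = np.array([3, 0, 1])         # hypothetical word counts for one class
total = float(counts.sum())
unsmoothed = counts / total          # the unseen word gets probability 0
smoothed = (counts + 1) / (total + 2.0)   # same initialization as trainNB0: +1 and +2.0
print(unsmoothed, smoothed)
```

With the +1 and +2.0 initialization the smoothed values no longer sum exactly to 1 over the vocabulary. Textbook Laplace smoothing adds the vocabulary size to the denominator instead; for comparing classes, the simpler choice used in trainNB0 still removes the zero factors.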

#Naive Bayes classification function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):    #score each class and compare (log and array arithmetic come from numpy)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)       #log of p(w0|c1)p(w1|c1)…p(wN|c1)p(c1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1) #working in logs turns the product into a sum
    if p1 > p0:
        return 1
    else:
        return 0

-------------------------------------------------------------
Test:

>>> import bayes
>>> from numpy import *
>>> listOPosts,listClasses=bayes.loadDataSet()     #load the training data set and class labels
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocalList=bayes.createVocabList(listOPosts)  #build the vocabulary list
>>> myVocalList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
>>> trainMat=[]
>>> for postindoc in listOPosts:                     #convert the training data into word vectors
...   trainMat.append(bayes.setOfWords2Vec(myVocalList,postindoc))
...
>>> trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
>>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)#compute the log conditional probabilities and p(1), the probability of an abusive document
>>> p0v,p1v,pab
(array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
       -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
       -2.15948425, -3.25809654, -3.25809654, -2.56494936, -3.25809654,
       -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
       -2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -1.87180218]), array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -2.35137526, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
       -3.04452244, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
       -3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244,
       -3.04452244, -3.04452244]), 0.5)
>>>
>>> testEntry = ['love', 'my', 'dalmation']
>>> thisDoc = array(bayes.setOfWords2Vec(myVocalList, testEntry))  #convert the test entry into a word vector
>>> thisDoc
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
>>> print testEntry,'classified as: ',bayes.classifyNB(thisDoc,p0v,p1v,pab)#classify the test entry
['love', 'my', 'dalmation'] classified as:  0
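
The result can be checked by hand. The sketch below recomputes the two scores for ['love', 'my', 'dalmation'] from word counts read off the six training posts shown in the session: class 0 contains 24 words in total and class 1 contains 19. The variable names are illustrative, not from bayes.py.

```python
from math import log

# Denominators as in trainNB0: total words per class plus the initial 2.0.
p0_denom = 24 + 2.0
p1_denom = 19 + 2.0

# word: (occurrences in class 0 posts, occurrences in class 1 posts)
counts = {"love": (1, 0), "my": (3, 0), "dalmation": (1, 0)}

p0 = log(0.5)                          # log P(0); pab is 0.5 in the session
p1 = log(0.5)                          # log P(1)
for c0, c1 in counts.values():
    p0 += log((c0 + 1) / p0_denom)     # +1 because the counts start at 1
    p1 += log((c1 + 1) / p1_denom)

label = 1 if p1 > p0 else 0
print(round(p0, 4), round(p1, 4), label)
```

The individual terms match the session output: log(2/26) is the -2.56494936 entry, log(4/26) is the -1.87180218 entry for 'my', and log(1/21) is the -3.04452244 entry. Since p0 is larger than p1, the entry is classified as 0.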

*************************************************************
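II. Example: Filtering Spam with Naive Bayes
The email folder contains two subfolders, ham and spam, which hold the non-spam and spam emails respectively. Of these 50 emails, 40 are used as training samples and 10 as test samples. Training and testing use the bag-of-words model; the word list of each misclassified email is printed, and the error rate is computed.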

>>> import bayes
>>> bayes.spamTest()
classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
the error rate is:  0.3
>>>
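
The two ingredients of this test, a bag-of-words vector and a 40/10 hold-out split, can be sketched as follows. This is a minimal illustration, not the code of spamTest: the function names `bag_of_words_vec` and `holdout_split` are hypothetical, and the sketch assumes the 10 test emails are chosen at random.

```python
import random

def bag_of_words_vec(vocab_list, words):
    """Count how often each vocabulary word occurs (bag-of-words model)."""
    vec = [0] * len(vocab_list)
    for word in words:
        if word in vocab_list:
            vec[vocab_list.index(word)] += 1   # count occurrences instead of setting 1
    return vec

def holdout_split(n_docs, n_test, seed=None):
    """Randomly choose n_test document indices for testing; the rest are for training."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

vocab = ["home", "work", "jar"]
vec = bag_of_words_vec(vocab, ["jar", "jar", "home", "plane"])
train_idx, test_idx = holdout_split(50, 10, seed=0)
print(vec, len(train_idx), len(test_idx))
```

With 10 test emails, the three misclassified emails printed above give an error rate of 3/10 = 0.3. If the split is random, the error rate will differ from run to run, so averaging over several runs gives a more stable estimate.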

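Another example: using naive Bayes to find region-related word usage
How it works: use the training samples to obtain p0Vect and p1Vect (marked [2] and [1] in trainNB0) and pAbusive, i.e. P(w|ci) and P(ci). Adjust how many high-frequency words are removed until the accuracy is reasonably high, then return the words with the highest probabilities.
(code omitted)
See p. 70 of the book for details.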
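
Since the code is omitted here, the following is only a rough sketch of the last step, returning the words with the highest probabilities for one class. The function `top_words` and the four-word vocabulary are hypothetical; the log probabilities are taken from the abusive-class values in section I for illustration.

```python
import numpy as np

def top_words(vocab_list, log_prob_vec, k=3):
    """Return the k vocabulary words with the largest log P(w|ci)."""
    order = np.argsort(log_prob_vec)[::-1][:k]     # indices sorted from high to low
    return [vocab_list[i] for i in order]

vocab = ["cute", "stupid", "dog", "my"]
p1_vec = np.log(np.array([1.0, 4.0, 3.0, 1.0]) / 21.0)   # log P(w|c=1) for each word
print(top_words(vocab, p1_vec, k=2))
```

Taking the logarithm preserves the ordering, so ranking by log probability gives the same words as ranking by probability.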
