Spam Email Classification with the Naive Bayes Algorithm

阿新 • Published: 2018-12-25

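Bayesian Decision Theory

In machine learning, naive Bayes is a simple form of Bayesian decision theory. The basic formula, and the most important one, is:

$$P(c \mid X) = \frac{P(X \mid c)\,P(c)}{P(X)}$$

Here X is an m*n matrix, where m is the number of samples and n is the number of features. What we want is the conditional probability of class c given the observed sample.

$P(X \mid c)$ is the class-conditional probability.

$P(c)$ is the prior probability.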
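As a quick numeric illustration of the formula, the sketch below applies Bayes' rule to a single binary feature. The probabilities are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper name, not something from the program later in this post.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Made-up numbers for illustration only.
p_spam = 0.5                 # prior P(spam)
p_word_given_spam = 0.4      # P(word appears | spam)
p_word_given_ham = 0.1       # P(word appears | ham)

# Total probability of seeing the word, P(x).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these numbers the word shifts the belief in "spam" from the prior of 0.5 up to the posterior, because the word is four times as likely under spam as under ham.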

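Why Naive Bayes Is Needed

There are two reasons:

From statistics, if each feature needs N samples, then 10 features need $N^{10}$ samples. That number is very large and grows rapidly as features are added. If the features are independent of one another, the number of samples drops to 10*N. Independence means that each feature is unrelated to every other feature. This is a strong assumption, and in practice it rarely holds.

Because the features are assumed to be conditionally independent given the class, the conditional probability can be computed without estimating the joint probability, so it can be written as follows, where $x_i$ is the i-th feature of a sample:

$$P(X \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$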
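To make the two points above concrete, here is a small sketch. The value N = 100 and the three per-feature probabilities are assumptions picked for illustration; only the arithmetic is the point.

```python
import math

# Sample counts for 10 features, assuming each feature needs N = 100 samples.
N, n_features = 100, 10
samples_joint = N ** n_features        # modelling all features jointly
samples_independent = n_features * N   # modelling each feature separately
print(samples_joint, samples_independent)

# Hypothetical per-feature conditionals P(x_i | c) for one sample and one class.
feature_probs = [0.5, 0.2, 0.1]

# Under the conditional-independence assumption, P(X | c) is the product
# of the per-feature terms (math.prod needs Python 3.8 or newer).
likelihood = math.prod(feature_probs)
print(likelihood)
```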

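The steps of the algorithm are given below (typing out formulas is tedious, so please bear with the image):
[Figure: steps of the naive Bayes algorithm]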
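For readers who prefer code, the following is a minimal sketch of the usual counting-based naive Bayes procedure for categorical features: estimate the priors, estimate the per-feature conditionals, then pick the class with the largest product. It is not necessarily identical to the author's steps, it applies no smoothing, and the names `fit` and `predict` and the toy data are invented for illustration.

```python
from collections import Counter, defaultdict

def fit(samples, labels):
    """Estimate P(c) and P(x_i = v | c) by counting (no smoothing)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][i][v] = number of class-c samples whose i-th feature equals v
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            cond[c][i][v] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Return the class maximising P(c) * prod_i P(x_i | c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(x):
            score *= cond[c][i][v] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data, invented for illustration: (contains "free", contains "meeting")
samples = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = [1, 1, 1, 0, 0, 0]   # 1 = spam, 0 = ham

model = fit(samples, labels)
print(predict((1, 0), *model))
print(predict((0, 1), *model))
```

In this toy data, one class has a zero count for each query, so its whole product collapses to 0.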

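Corrections to the Algorithm Above

1. The final probability may be 0

During classification, the class is decided by a product of several probabilities, so if any one of them is 0 the whole product is 0. To keep attribute values that never appeared in the training data from wiping out the result, probability estimates are usually smoothed. The common choice is the Laplacian correction. Specifically, when computing $P(c)$, add 1 to the numerator and add the number of classes N to the denominator. Likewise, when computing $P(x_i \mid c)$, add 1 to the numerator and add $N_i$ to the denominator, where $N_i$ is the number of possible values of the i-th attribute.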
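A minimal sketch of the correction just described, with invented counts. The function names are hypothetical; the formulas simply add 1 to each numerator and the number of classes (or attribute values) to each denominator.

```python
def smoothed_prior(class_count, total_count, num_classes):
    """Laplace-corrected estimate of P(c)."""
    return (class_count + 1) / (total_count + num_classes)

def smoothed_conditional(value_count, class_count, num_values):
    """Laplace-corrected estimate of P(x_i = v | c)."""
    return (value_count + 1) / (class_count + num_values)

# Invented counts: 3 spam emails out of 10, two classes.
p_spam = smoothed_prior(3, 10, 2)
# A binary attribute value never seen among the 3 spam emails.
p_unseen = smoothed_conditional(0, 3, 2)
print(p_spam, p_unseen)
```

The unseen value now gets a small positive probability instead of 0, so it can no longer zero out the product on its own.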

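2. Very small values may underflow

In practice, take the logarithm of each probability so that the product becomes a sum. This prevents underflow when many small numbers are multiplied.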
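The effect is easy to reproduce. The sketch below multiplies 400 hypothetical word probabilities of 0.01 each (numbers chosen only to force the problem) and compares the raw product with the sum of logarithms.

```python
import math

# 400 hypothetical per-word probabilities of 0.01 each.
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p          # 1e-800 is below the smallest float, so this ends at 0.0

log_sum = sum(math.log(p) for p in probs)   # stays finite

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing log scores between classes gives the same decision as comparing the original products.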

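Spam Email Classification

The email/spam folder holds 25 spam emails and email/ham holds 25 normal emails. The task is to classify them as spam or not.

Tokenization

The first problem is how to tokenize an email, that is, how to split it into individual words. Regular expressions are a natural fit; there is plenty of material online about them. The Python code below splits one long string into tokens.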

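def textParse(bigString):
    import re          # regular expression library
    # \W+ splits on runs of non-word characters; \W* can match the empty
    # string and would split between every character on Python 3.7+
    listOfTokens=re.split(r'\W+',bigString)   # returns a list
    return [tok.lower() for tok in listOfTokens if len(tok)>2]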

Example:

# see the full program at the end
# the email has been split into tokens
[yqtao@localhost ml]$ python bayes.py   
['peter', 'with', 'jose', 'out', 'town', 'you', 'want', 'meet', 'once', 'while', 'keep', 'things', 'going', 'and', 'some', 'interesting', 'stuff', 'let', 'know', 'eugene']

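The pattern passed to re.split splits the string on every character that is not a digit, a letter, or an underscore. The return value is a list built by a list comprehension, which drops tokens of length 2 or less and converts the rest to lowercase.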
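To see the splitting and filtering on something smaller than a full email, here is a self-contained sketch. `tokenize` is a hypothetical stand-in for textParse and the sentence is invented. It splits on runs of non-word characters with `\W+`.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters, lowercase, drop short tokens."""
    return [tok.lower() for tok in re.split(r'\W+', text) if len(tok) > 2]

sample = "Hi Peter, let's meet in town -- once in a while!"
print(tokenize(sample))
```

Short words such as "Hi", "in", and "a" are dropped by the length filter, and the apostrophe in "let's" splits it into "let" and "s", of which only "let" survives.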

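Building the Vocabulary

After tokenizing every email into a dataSet, build a vocabulary. It is built from a set, so each word appears only once, and it is returned as a list such as:
["cute", "love", "help", "garbage", "quit", ...]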

def createVocabList(dataSet):
    vocabSet=set()
    for docment in dataSet:
        vocabSet=vocabSet| set(docment) # union of two sets
    return list(vocabSet) # convert it to a list
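A small usage sketch with invented documents. One assumption differs from the function above: the result is sorted so that the word order is reproducible, whereas converting a set straight to a list gives an arbitrary order. `create_vocab` is a hypothetical name.

```python
def create_vocab(documents):
    """Union of all tokens across documents, returned as a sorted list."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

docs = [["love", "cute", "love"], ["garbage", "quit", "cute"]]
print(create_vocab(docs))
```

A fixed order matters as soon as word vectors are saved or compared across runs, since each vector position is tied to a vocabulary index.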

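Building Word Vectors

Every word of every email is in the vocabulary, so each email can be turned into a word vector. Each position holds the number of times the corresponding vocabulary word occurs in the email, or 0 if it does not occur. For example, for ["love", "garbage"] the word vector is
[0,1,0,1,0,...]. The positions correspond to the vocabulary, so the word vector has the same dimension as the vocabulary.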

# vocabList is the vocabulary; inputSet is the tokenized email
def bagOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)    # same length as the vocabulary
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1 # look up the word's index
        else: print("the word is not in my vocabulary")
    return returnVec
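The sketch below shows the same counting idea on a tiny invented vocabulary. As a design variation (an assumption, not the author's code), it builds a word-to-index dictionary once instead of calling `list.index` for every token, which avoids a linear scan per word; unknown words are silently skipped.

```python
def bag_of_words(vocab, tokens):
    """Count occurrences of each vocabulary word in tokens."""
    index = {word: i for i, word in enumerate(vocab)}  # constant-time lookups
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec

vocab = ["cute", "garbage", "help", "love", "quit"]
print(bag_of_words(vocab, ["love", "garbage", "love", "unknown"]))
```

The repeated word "love" is counted twice, which is what distinguishes this bag-of-words vector from a 0/1 presence vector.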

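Training the Algorithm

This step is the core of the algorithm. We need to compute:
1. The prior probability
2. $P(x_i \mid c=0)$ and $P(x_i \mid c=1)$, where 0 denotes normal email and 1 denotes spam.
The key point is understanding how step 2 is computed. For example, $P(x_i \mid c=1)$ is the probability of the i-th feature given that the email is spam. First add up all word vectors of class 1, which gives the count of each feature; then divide by the total number of words in class 1 to get the probability of each word in spam. Note that a feature here is a single word of the vocabulary.
The Python code is as follows: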

# trainMat holds the word vectors of the training samples: a matrix with one email's word vector per row
# trainGategory holds the matching classes: 0 for normal, 1 for spam
from numpy import ones, log

def train(trainMat,trainGategory):
    numTrain=len(trainMat)
    numWords=len(trainMat[0])  # vocabulary length
    pAbusive=sum(trainGategory)/float(numTrain)  # prior P(spam)
    p0Num=ones(numWords);p1Num=ones(numWords)    # counts start at 1 (smoothing)
    p0Denom=2.0;p1Denom=2.0                      # denominators start at 2
    for i in range(numTrain):
        if trainGategory[i] == 1:
            p1Num += trainMat[i] # count of each word in class 1
            p1Denom += sum(trainMat[i]) # total number of words in class 1
        else:
            p0Num += trainMat[i]
            p0Denom +=sum(trainMat[i])
    p1Vec=log(p1Num/p1Denom) # log-probability of each word in class 1
    p0Vec=log(p0Num/p0Denom)
    return p0Vec,p1Vec,pAbusive
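To check the arithmetic by hand, here is a pure-Python sketch of the same estimate on an invented 3-word vocabulary. It assumes the same starting values as above (counts at 1, denominators at 2.0) and avoids NumPy so that every intermediate number is easy to follow; `train_counts` is a hypothetical name.

```python
import math

def train_counts(train_mat, labels):
    """Pure-Python sketch of the smoothed log-probability estimate."""
    num_words = len(train_mat[0])
    num = {0: [1.0] * num_words, 1: [1.0] * num_words}  # counts start at 1
    denom = {0: 2.0, 1: 2.0}                            # totals start at 2
    for row, c in zip(train_mat, labels):
        for j, count in enumerate(row):
            num[c][j] += count
        denom[c] += sum(row)
    log_p = {c: [math.log(n / denom[c]) for n in num[c]] for c in (0, 1)}
    p_spam = sum(labels) / len(labels)
    return log_p[0], log_p[1], p_spam

# Invented word-count vectors over a 3-word vocabulary.
train_mat = [[2, 0, 1], [0, 1, 1], [1, 0, 0]]
labels = [1, 0, 1]
p0_vec, p1_vec, p_spam = train_counts(train_mat, labels)
print([round(math.exp(v), 3) for v in p1_vec])
```

For class 1 the two rows sum to [3, 0, 1], so the smoothed counts are [4, 1, 2] over a denominator of 2 + 4 = 6.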

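Data Processing and Validation

First the 50 emails are read into the docList list, and a vocabulary containing every word is built. Then hold-out cross-validation is used: 10 samples are chosen at random for testing and the remaining 40 are used for training.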

# spam email classification
# uses array and random from numpy, imported in the full program below
def spamTest():
    fullTest=[];docList=[];classList=[]
    for i in range(1,26): # there are only 25 documents in each class
        # errors='ignore' skips any bytes that cannot be decoded
        wordList=textParse(open('email/spam/%d.txt' % i, errors='ignore').read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(1)
        wordList=textParse(open('email/ham/%d.txt' % i, errors='ignore').read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)   # create the vocabulary
    trainSet=list(range(50));testSet=[]  # a list, so entries can be deleted
    # choose 10 samples for testing; the values are indices into docList
    for i in range(10):
        randIndex=int(random.uniform(0,len(trainSet))) # index into the remaining trainSet
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    trainMat=[];trainClass=[]
    for docIndex in trainSet:
        trainMat.append(bagOfWords2Vec(vocabList,docList[docIndex]))
        trainClass.append(classList[docIndex])
    p0,p1,pSpam=train(array(trainMat),array(trainClass))
    errCount=0
    for docIndex in testSet:
        wordVec=bagOfWords2Vec(vocabList,docList[docIndex])
        if classfy(array(wordVec),p0,p1,pSpam) != classList[docIndex]:
            errCount +=1
            print("classfication error", docList[docIndex])

    print("the error rate is ", float(errCount)/len(testSet))
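The random split itself can be isolated into a few lines. The sketch below is one possible alternative, not the author's code: it uses `random.sample` from the standard library to draw distinct test indices, and the `seed` parameter is an assumption added so that a split can be reproduced.

```python
import random

def holdout_split(n_samples, n_test, seed=None):
    """Return (train_indices, test_indices) for a random hold-out split."""
    rng = random.Random(seed)
    test = rng.sample(range(n_samples), n_test)   # distinct indices
    test_set = set(test)
    train = [i for i in range(n_samples) if i not in test_set]
    return train, test

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Every index ends up in exactly one of the two lists, which is the same guarantee the delete-from-list loop above provides.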

The complete code is as follows:

import re
from numpy import array, log, ones, random

# create a vocabulary from a set, so each word appears only once

def createVocabList(dataSet):
    vocabSet=set()
    for docment in dataSet:
        vocabSet=vocabSet| set(docment) # union of two sets
    return list(vocabSet) # convert it to a list


def bagOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1
        else: print("the word is not in my vocabulary")
    return returnVec

# training algorithm
# p1Num accumulates how often each word occurs in class 1
def train(trainMat,trainGategory):
    numTrain=len(trainMat)
    numWords=len(trainMat[0])  # vocabulary length
    pAbusive=sum(trainGategory)/float(numTrain)
    p0Num=ones(numWords);p1Num=ones(numWords)
    p0Denom=2.0;p1Denom=2.0
    for i in range(numTrain):
        if trainGategory[i] == 1:
            p1Num += trainMat[i]
            p1Denom += sum(trainMat[i])
        else:
            p0Num += trainMat[i]
            p0Denom +=sum(trainMat[i])
    p1Vec=log(p1Num/p1Denom)
    p0Vec=log(p0Num/p0Denom)
    return p0Vec,p1Vec,pAbusive

# classify function: compare the log-posterior scores of the two classes
def classfy(vec2classfy,p0Vec,p1Vec,pClass1):
    p1=sum(vec2classfy*p1Vec)+log(pClass1)
    p0=sum(vec2classfy*p0Vec)+log(1-pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# split the big string into tokens
def textParse(bigString):
    # \W+ rather than \W*: an empty match would split between every character
    listOfTokens=re.split(r'\W+',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

# spam email classification
def spamTest():
    fullTest=[];docList=[];classList=[]
    for i in range(1,26): # there are only 25 documents in each class
        # errors='ignore' skips any bytes that cannot be decoded
        wordList=textParse(open('email/spam/%d.txt' % i, errors='ignore').read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(1)
        wordList=textParse(open('email/ham/%d.txt' % i, errors='ignore').read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)   # create the vocabulary
    trainSet=list(range(50));testSet=[]  # a list, so entries can be deleted
    # choose 10 samples for testing; the values are indices into docList
    for i in range(10):
        randIndex=int(random.uniform(0,len(trainSet))) # index into the remaining trainSet
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    trainMat=[];trainClass=[]
    for docIndex in trainSet:
        trainMat.append(bagOfWords2Vec(vocabList,docList[docIndex]))
        trainClass.append(classList[docIndex])
    p0,p1,pSpam=train(array(trainMat),array(trainClass))
    errCount=0
    for docIndex in testSet:
        wordVec=bagOfWords2Vec(vocabList,docList[docIndex])
        if classfy(array(wordVec),p0,p1,pSpam) != classList[docIndex]:
            errCount +=1
            print("classfication error", docList[docIndex])

    print("the error rate is ", float(errCount)/len(testSet))

if __name__ == '__main__':
    # the next line demonstrates tokenization; it can stay commented out
    #listWord=textParse(open('email/ham/1.txt').read())
    spamTest()
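The classfy function in the listing is not discussed in the text, so here is a rough sketch of the decision it makes: for each class, add the log prior to the word counts weighted by the per-word log-probabilities, then pick the larger score. The version below uses plain lists instead of NumPy arrays, and the probabilities are invented for illustration.

```python
import math

def classify(vec, log_p0, log_p1, p_class1):
    """Pick the class with the larger log-posterior score."""
    score1 = sum(x * lp for x, lp in zip(vec, log_p1)) + math.log(p_class1)
    score0 = sum(x * lp for x, lp in zip(vec, log_p0)) + math.log(1 - p_class1)
    return 1 if score1 > score0 else 0

# Invented log-probabilities over a 2-word vocabulary.
log_p0 = [math.log(0.8), math.log(0.2)]   # ham favours word 0
log_p1 = [math.log(0.3), math.log(0.7)]   # spam favours word 1
print(classify([0, 3], log_p0, log_p1, 0.5))
print(classify([3, 0], log_p0, log_p1, 0.5))
```

Multiplying a count by a log-probability is the log-space equivalent of raising the probability to that power, so repeated words weigh in once per occurrence.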

Running and testing: run the program directly with python bayes.py. The results are as follows:

[yqtao@localhost ml]$ python bayes.py
classfication error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
classfication error ['scifinance', 'now', 'automatically', 'generates', 'gpu', 'enabled', 'pricing', 'risk', 'model', 'source', 'code', 'that', 'runs', '300x', 'faster', 'than', 'serial', 'code', 'using', 'new', 'nvidia', 'fermi', 'class', 'tesla', 'series', 'gpu', 'scifinance', 'derivatives', 'pricing', 'and', 'risk', 'model', 'development', 'tool', 'that', 'automatically', 'generates', 'and', 'gpu', 'enabled', 'source', 'code', 'from', 'concise', 'high', 'level', 'model', 'specifications', 'parallel', 'computing', 'cuda', 'programming', 'expertise', 'required', 'scifinance', 'automatic', 'gpu', 'enabled', 'monte', 'carlo', 'pricing', 'model', 'source', 'code', 'generation', 'capabilities', 'have', 'been', 'significantly', 'extended', 'the', 'latest', 'release', 'this', 'includes']
the error rate is  0.1

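Because the samples are chosen at random, the program can be run 10 times and the error rates averaged; the results are quite good. One more point: the errors that appear here are always spam misclassified as normal email, which is preferable to normal email being misclassified as spam.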
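Averaging over repeated runs can be sketched as below. This assumes the experiment is wrapped in a function that returns its error rate instead of printing it, which spamTest as written does not do. `mean_error_rate` and `fake_run` are hypothetical names, and `fake_run` returns made-up values purely so the sketch can run on its own.

```python
import random

def mean_error_rate(run_once, n_runs=10, seed=0):
    """Average the error rate returned by run_once over n_runs random splits."""
    rng = random.Random(seed)
    rates = [run_once(rng) for _ in range(n_runs)]
    return sum(rates) / len(rates)

def fake_run(rng):
    # Stand-in for one hold-out experiment; returns a made-up error rate.
    return rng.choice([0.0, 0.0, 0.1])

print(mean_error_rate(fake_run))
```

With only 10 test emails per run, a single run can only report error rates in steps of 0.1, so the average over several runs is a steadier figure.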

