
Text Processing: Naive Bayes Spam Classification

This article explains how to read text files with Python and turn each one into a corresponding word vector. The setting: classify 50 emails (25 ham, 25 spam) with a naive Bayes algorithm.

The main steps are:
① read all emails;
② build the vocabulary list;
③ generate a word vector for each email (set-of-words model);
④ classify with the naive Bayes algorithm in sklearn;
⑤ produce a performance evaluation report.

1. Function walkthrough

First, the helper functions we will need.

1.1 Building the vocabulary list

Idea: build a vocabulary from the given texts; that is, collect every word that appears into a single set with no duplicates, so each word occurs only once.

def createVocabList(dataSet):
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

postingList = [['my', 'dog', 'dog', 'has']]
print(createVocabList(postingList))
>> ['has', 'my', 'dog']

(Since Python sets are unordered, the exact ordering of the returned list may vary between runs.)

1.2 Lowercasing all letters and dropping words of two characters or fewer

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W+', bigString)  # \W+ avoids the empty tokens that \W* produces
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
                            # drop words of two characters or fewer; the threshold 2 is adjustable

s = 'i Love YYUU'
print(textParse(s))
>> ['love', 'yyuu']

1.3 Turning each text into a word vector

There are two ways to build a word vector. The first compares the words appearing in a text against the vocabulary list: if a word is in the vocabulary, the corresponding position is set to 1, otherwise 0. This only records presence or absence, not counts, and is called the set-of-words model. The second also records how many times each word occurs, and is called the bag-of-words model.

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

vocabulary = ['wo', 'do', 'like', 'what', 'go']
text = ['do', 'go', 'what', 'do']
print(setOfWords2Vec(vocabulary, text))
>> [0, 1, 0, 1, 1]

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

vocabulary = ['wo', 'do', 'like', 'what', 'go']
text = ['do', 'go', 'what', 'do']
print(bagOfWords2Vec(vocabulary, text))
>> [0, 2, 0, 1, 1]
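As a side note, scikit-learn ships both models in `CountVectorizer`: the default records counts (bag-of-words), while `binary=True` records presence only (set-of-words). A minimal sketch using the same toy vocabulary as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['do go what do']
vocab = ['wo', 'do', 'like', 'what', 'go']

# Bag-of-words model: each column holds the occurrence count
bag = CountVectorizer(vocabulary=vocab)
print(bag.fit_transform(docs).toarray())   # [[0 2 0 1 1]]

# Set-of-words model: binary=True records presence only
sow = CountVectorizer(vocabulary=vocab, binary=True)
print(sow.fit_transform(docs).toarray())   # [[0 1 0 1 1]]
```

Passing a fixed `vocabulary=` keeps the column order identical to the hand-rolled functions above.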

2. Putting the functions together

The three functions above combined. The code below is written for this specific example, but with minor modifications it adapts to other settings.

def createVocabList(dataSet):  # build the vocabulary list
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):  # build a word vector
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def preProcessing():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)  # read the text
        classList.append(1)       # record its label (1 = spam)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        classList.append(0)       # 0 = ham
    vocabList = createVocabList(docList)  # create the vocabulary list
    data = []
    target = classList
    for docIndex in range(50):  # this example has 50 texts in total
        data.append(setOfWords2Vec(vocabList, docList[docIndex]))  # build the word vectors
    return data, target  # return the finished word vectors and labels

3. Training and prediction

import textProcess as tp
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer sklearn
from sklearn.metrics import classification_report

data, target = tp.preProcessing()


X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25)

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pre = mnb.predict(X_test)
print(y_pre)   # predicted labels
print(y_test)  # actual labels
print('The accuracy of Naive Bayes Classifier is', mnb.score(X_test, y_test))
print(classification_report(y_test, y_pre))
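The email corpus is not bundled here, so the code above only runs with the `email/spam` and `email/ham` folders in place. As a self-contained check of the same train/predict/report flow, the pipeline can be exercised on a few made-up word-count vectors (the data below is invented purely for illustration):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Toy word-count vectors: rows = documents, columns = vocabulary terms
data = [
    [2, 0, 1, 0], [1, 0, 2, 0], [3, 1, 0, 0], [2, 0, 0, 1],  # class 0 (ham-like)
    [0, 2, 0, 1], [0, 1, 0, 2], [1, 3, 0, 1], [0, 2, 1, 2],  # class 1 (spam-like)
]
target = [0, 0, 0, 0, 1, 1, 1, 1]

# random_state fixes the split so the run is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=42)

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pre = mnb.predict(X_test)
print('accuracy:', mnb.score(X_test, y_test))
print(classification_report(y_test, y_pre))
```

With 8 samples and `test_size=0.25`, two documents land in the test set; on the real 50-email corpus the same split yields 13.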

References

  • Machine Learning in Action (機器學習實戰)
  • Python Machine Learning and Practice (Python機器學習及實踐)