
Filtering Spam with Naive Bayes

The split() text-splitting function

mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
ret=mySent.split()
print(ret)

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

Regular expressions can handle the punctuation left attached to the words. Python's re module provides these character classes:

\d matches any decimal digit; equivalent to the class [0-9].
\D matches any non-digit character; equivalent to the class [^0-9].
\s matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
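
As a quick check, here is a minimal sketch exercising a few of these classes (the sample string is made up):

import re

s = 'Order 66 was issued\ton 2005-05-19!'
print(re.findall(r'\d', s))    # ['6', '6', '2', '0', '0', '5', '0', '5', '1', '9']
print(re.findall(r'\s', s))    # [' ', ' ', ' ', '\t', ' ']
print(re.findall(r'\w+', s))   # ['Order', '66', 'was', 'issued', 'on', '2005', '05', '19']

Splitting mySent on \W separates the words from the punctuation:
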
import re
regEx = re.compile(r'\W')    # capital W: split on non-word characters
listOfTokens=regEx.split(mySent)
print(listOfTokens)

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', '', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

To get rid of the empty strings, filter out tokens of length 0:

ret=[tok for tok in listOfTokens if len(tok)>0]
print(ret)

This is equivalent to the following loop:

ret=[]
for tok in listOfTokens:
    if len(tok)>0:
        ret.append(tok)
print(ret)

The expression before the for keyword is what gets appended. To normalize case, return everything in lowercase:

ret=[tok.lower() for tok in listOfTokens if len(tok)>0]
print(ret)

Output:

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

The attached dataset contains many email texts; take any one of them as an example:

Hello,

Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.

For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you’re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.

you have received this mandatory email service announcement to update you about important changes to Google Groups.

In the same way, all of its words can be split out:

emailText = open('email/ham/6.txt').read()
listOfTokens=regEx.split(emailText)
print(listOfTokens)

Define a parsing function that returns the list of words longer than two characters:

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
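
Applied to the sentence from earlier, the one- and two-character tokens ('is', 'on', 'or', 'M', 'L', 'I', plus the empty strings) are dropped and the rest are lowercased:

print(textParse(mySent))

Output:

['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']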

Finally, the file parser and the complete spam-filter test functions:

from numpy import *

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    # dataSet is a list of word lists, so pour all of its elements into the
    # set (order may get scrambled), then return the set as a list
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

# count how often each word occurs, used to build the document vectors
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
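
A quick illustration of these two helpers on made-up documents (the words and the ordering are for illustration only, since a set does not preserve order):

docs = [['python', 'book'], ['offer', 'spam', 'offer']]
vocab = createVocabList(docs)                                # e.g. ['spam', 'python', 'book', 'offer']
print(bagOfWords2VecMN(vocab, ['offer', 'offer', 'book']))   # e.g. [0, 0, 1, 2]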


def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     # number of training documents
    numWords = len(trainMatrix[0])                      # vocabulary size (number of unique words)
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # P(class 1) = class-1 document count / total document count

    # Classification multiplies many conditional probabilities,
    # i.e. p(w0|ci)*p(w1|ci)*...*p(wN|ci); if any one factor is 0,
    # the whole product becomes 0. To soften this, apply Laplace smoothing:
    # add a (usually 1) to each numerator and k*a to the denominator
    # (k is the number of classes), so every word count is initialised
    # to 1 and each denominator to 2*1 = 2.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0

    # For each document: if it was hand-labelled class 1 (spam/abusive),
    # every word occurring in it is counted in p1Num and the class-1 word
    # total p1Denom is updated accordingly; otherwise the same statistics
    # go to the class-0 counters.
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # each word's conditional probability = its smoothed count within the class / the class's total word count.
    # Most of these factors are tiny, so their raw product would round
    # down to 0 (underflow). Taking the natural log turns the product into
    # a sum of log-likelihoods, which avoids underflow and floating-point
    # rounding errors; log is monotonic, so nothing is lost for classification.
    p1Vect = log(p1Num/p1Denom)
    p0Vect = log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
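
A toy sanity check (a made-up three-word vocabulary, one class-1 and one class-0 document). The second word never appears in class 1, yet Laplace smoothing keeps its conditional probability above zero:

toyMat = array([[1, 0, 1],
                [0, 2, 0]])
toyClasses = array([1, 0])
p0V, p1V, pAb = trainNB0(toyMat, toyClasses)
print(pAb)         # 0.5
print(exp(p1V))    # [0.5, 0.25, 0.5]   i.e. (1+1)/4, (0+1)/4, (1+1)/4
print(exp(p0V))    # [0.25, 0.75, 0.25]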


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log p1 = sum_i count(word_i)*log P(word_i|class 1) + log P(class 1)
    # log p0 = sum_i count(word_i)*log P(word_i|class 0) + log P(class 0)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
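
Continuing the toy example: a document containing only the first word is assigned class 1, because that word is twice as likely there (0.5 vs 0.25) and the priors are equal:

print(classifyNB(array([1, 0, 0]), p0V, p1V, pAb))   # 1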


def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        # read the 25 spam emails and the 25 ham emails in turn
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # ham/23.txt contains a ® symbol whose byte is not valid UTF-8, so decode errors must be ignored
        wordList = textParse(open('email/ham/%d.txt' % i,encoding='utf-8',errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # deduplicate into a vocabulary
    vocabList = createVocabList(docList)                #create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        # numpy includes random; random.uniform draws a random float in [0, len(trainingSet))
        randIndex = int(random.uniform(0,len(trainingSet)))
        # randomly select 10 distinct emails from the 50 as the test set
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  

    trainMat=[]; trainClasses = []
    # the remaining 40 emails are used for training
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # estimate the prior and the per-class conditional probabilities
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))

    # the 10 selected emails are used for testing
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        # if the Bayes classifier's result differs from the actual label
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    # error rate on this run's test set
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText


spamTest()
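
Because the 10 test emails are drawn at random, the error rate fluctuates from run to run. A minimal sketch of averaging it over several runs, assuming spamTest is modified to return errorCount/float(len(testSet)) instead of only printing it:

numRuns = 10
totalError = 0.0
for _ in range(numRuns):
    totalError += spamTest()
print('average error rate over %d runs: %f' % (numRuns, totalError/numRuns))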

Note that the 23rd ham email contains a character whose encoding is not valid UTF-8:


SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.

SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.

SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:

That character has to be skipped when reading, hence errors='ignore'.
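
A minimal sketch of the failure and the workaround (the path assumes the dataset layout used above):

# open('email/ham/23.txt', encoding='utf-8').read()    # raises UnicodeDecodeError on the ® byte
text = open('email/ham/23.txt', encoding='utf-8', errors='ignore').read()    # the offending byte is dropped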

Sample data download