
Filtering Spam with Naive Bayes

The split() text-splitting function

mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
ret=mySent.split()
print(ret)

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

Regular expressions can handle the punctuation left attached to the words. Python's re module provides these character classes:

\d matches any decimal digit; equivalent to the class [0-9].
\D matches any non-digit character; equivalent to the class [^0-9].
\s matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
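
As a quick check, here is a minimal sketch exercising a few of these classes (the sample string is made up):

import re

s = 'Order 66 was issued\ton 2005-05-19!'
print(re.findall(r'\d', s))    # ['6', '6', '2', '0', '0', '5', '0', '5', '1', '9']
print(re.findall(r'\s', s))    # [' ', ' ', ' ', '\t', ' ']
print(re.findall(r'\w+', s))   # ['Order', '66', 'was', 'issued', 'on', '2005', '05', '19']

Splitting mySent on \W separates the words from the punctuation:
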
import re
regEx = re.compile(r'\W')    # capital W: split on non-word characters
listOfTokens=regEx.split(mySent)
print(listOfTokens)

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', '', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

To get rid of the empty strings, filter out tokens of length 0:

ret=[tok for tok in listOfTokens if len(tok)>0]
print(ret)

This is equivalent to the following loop:

ret=[]
for tok in listOfTokens:
    if len(tok)>0:
        ret.append(tok)
print(ret)

The expression before the for keyword is what gets appended. To normalize case, return everything in lowercase:

ret=[tok.lower() for tok in listOfTokens if len(tok)>0]
print(ret)

Output:

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

The attached dataset contains many email texts; take any one of them as an example:

Hello,

Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.

For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you’re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.

you have received this mandatory email service announcement to update you about important changes to Google Groups.

In the same way, all of its words can be split out:

emailText = open('email/ham/6.txt').read()
listOfTokens=regEx.split(emailText)
print(listOfTokens)

Define a parsing function that returns the list of words longer than two characters:

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
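
Applied to the sentence from earlier, the one- and two-character tokens ('is', 'on', 'or', 'M', 'L', 'I', plus the empty strings) are dropped and the rest are lowercased:

print(textParse(mySent))

Output:

['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']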

Finally, the file parser and the complete spam-filter test functions:

from numpy import *

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    # dataSet is a list of word lists, so pour all of its elements into the
    # set (order may get scrambled), then return the set as a list
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

# count how often each word occurs, used to build the document vectors
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
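
A quick illustration of these two helpers on made-up documents (the words and the ordering are for illustration only, since a set does not preserve order):

docs = [['python', 'book'], ['offer', 'spam', 'offer']]
vocab = createVocabList(docs)                                # e.g. ['spam', 'python', 'book', 'offer']
print(bagOfWords2VecMN(vocab, ['offer', 'offer', 'book']))   # e.g. [0, 0, 1, 2]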


def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     # number of training documents
    numWords = len(trainMatrix[0])                      # vocabulary size (number of unique words)
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # P(class 1) = class-1 document count / total document count

    # Classification multiplies many conditional probabilities,
    # i.e. p(w0|ci)*p(w1|ci)*...*p(wN|ci); if any one factor is 0,
    # the whole product becomes 0. To soften this, apply Laplace smoothing:
    # add a (usually 1) to each numerator and k*a to the denominator
    # (k is the number of classes), so every word count is initialised
    # to 1 and each denominator to 2*1 = 2.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0

    # For each document: if it was hand-labelled class 1 (spam/abusive),
    # every word occurring in it is counted in p1Num and the class-1 word
    # total p1Denom is updated accordingly; otherwise the same statistics
    # go to the class-0 counters.
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # each word's conditional probability = its smoothed count within the class / the class's total word count.
    # Most of these factors are tiny, so their raw product would round
    # down to 0 (underflow). Taking the natural log turns the product into
    # a sum of log-likelihoods, which avoids underflow and floating-point
    # rounding errors; log is monotonic, so nothing is lost for classification.
    p1Vect = log(p1Num/p1Denom)
    p0Vect = log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
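
A toy sanity check (a made-up three-word vocabulary, one class-1 and one class-0 document). The second word never appears in class 1, yet Laplace smoothing keeps its conditional probability above zero:

toyMat = array([[1, 0, 1],
                [0, 2, 0]])
toyClasses = array([1, 0])
p0V, p1V, pAb = trainNB0(toyMat, toyClasses)
print(pAb)         # 0.5
print(exp(p1V))    # [0.5, 0.25, 0.5]   i.e. (1+1)/4, (0+1)/4, (1+1)/4
print(exp(p0V))    # [0.25, 0.75, 0.25]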


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log p1 = sum_i count(word_i)*log P(word_i|class 1) + log P(class 1)
    # log p0 = sum_i count(word_i)*log P(word_i|class 0) + log P(class 0)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
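
Continuing the toy example: a document containing only the first word is assigned class 1, because that word is twice as likely there (0.5 vs 0.25) and the priors are equal:

print(classifyNB(array([1, 0, 0]), p0V, p1V, pAb))   # 1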


def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        # read the 25 spam emails and the 25 ham emails in turn
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # ham/23.txt contains a ® symbol whose byte is not valid UTF-8, so decode errors must be ignored
        wordList = textParse(open('email/ham/%d.txt' % i,encoding='utf-8',errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # deduplicate into a vocabulary
    vocabList = createVocabList(docList)                #create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        # numpy includes random; random.uniform draws a random float in [0, len(trainingSet))
        randIndex = int(random.uniform(0,len(trainingSet)))
        # randomly select 10 distinct emails from the 50 as the test set
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  

    trainMat=[]; trainClasses = []
    # the remaining 40 emails are used for training
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # estimate the prior and the per-class conditional probabilities
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))

    # the 10 selected emails are used for testing
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        # if the Bayes classifier's result differs from the actual label
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    # error rate on this run's test set
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText


spamTest()
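
Because the 10 test emails are drawn at random, the error rate fluctuates from run to run. A minimal sketch of averaging it over several runs, assuming spamTest is modified to return errorCount/float(len(testSet)) instead of only printing it:

numRuns = 10
totalError = 0.0
for _ in range(numRuns):
    totalError += spamTest()
print('average error rate over %d runs: %f' % (numRuns, totalError/numRuns))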

Note that the 23rd ham email contains a character whose encoding is not valid UTF-8:


SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.

SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.

SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:

That character has to be skipped when reading, hence errors='ignore'.
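
A minimal sketch of the failure and the workaround (the path assumes the dataset layout used above):

# open('email/ham/23.txt', encoding='utf-8').read()    # raises UnicodeDecodeError on the ® byte
text = open('email/ham/23.txt', encoding='utf-8', errors='ignore').read()    # the offending byte is dropped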

Sample data download