Python for Beginners, Machine Learning Series: The Naive Bayes Algorithm

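I. General idea:

1. Collect the data set and build a vocabulary: the set of every distinct word that appears in any document.

2. Convert each document to a 0/1 vector over that vocabulary, with 1 marking a word that appears. Every document then gets a vector of the same length, and stacking the vectors gives a matrix.

3. Compute three probabilities: (a) the probability that a document is abusive, (b) the probability of each word within abusive documents, and (c) the probability of each word within non-abusive documents.

4. To compute (b) and (c), go through the 0/1 vectors, add up the vectors that belong to the same class, and divide by the total number of word occurrences in that class. The result is a vector of per-word probabilities; without smoothing, a word that never appears in a class gets probability 0.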
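The four steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the two-document corpus and the names `docs`, `labels`, `vocab`, and `matrix` are hypothetical and are not the article's data, and the word probabilities are left unsmoothed to match step 4.

```python
import numpy as np

# Hypothetical toy corpus (not the article's data): two short documents.
docs = [["stupid", "dog"], ["cute", "dog"]]
labels = np.array([1, 0])  # 1 = abusive, 0 = not abusive

# Step 1: vocabulary of distinct words (sorted so the column order is stable).
vocab = sorted({word for doc in docs for word in doc})

# Step 2: one 0/1 row per document, one column per vocabulary word.
matrix = np.array([[1 if word in doc else 0 for word in vocab] for doc in docs])

# Step 3 (a): share of abusive documents.
p_abusive = labels.sum() / len(labels)

# Steps 3 (b), (c) and 4: add up the rows of each class, then divide by the
# total number of word occurrences in that class (no smoothing here).
counts1 = matrix[labels == 1].sum(axis=0)
counts0 = matrix[labels == 0].sum(axis=0)
p_word_given_1 = counts1 / counts1.sum()
p_word_given_0 = counts0 / counts0.sum()

print(vocab)
print(matrix)
print(p_abusive, p_word_given_1, p_word_given_0)
```

In this sketch "cute" never appears in the abusive class, so its unsmoothed probability there is exactly 0, which is the case that smoothing is meant to avoid.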

II. Code implementation:

 

import numpy as np
import math

# Sample data: six tokenized posts and their labels (1 = abusive, 0 = not abusive)
def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                 ['maybe','not','take','him','to','dog','park','stupid'],
                 ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
    classVec=[0,1,0,1,0,1]
    return postingList,classVec
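# Build the vocabulary: a list of every distinct word across all documents
def createVocabList(dataSet):
    vocabSet=set()
    for document in dataSet:
        vocabSet |= set(document)   # union with the words of this document
    return list(vocabSet)   # return a list so each word has an index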
# Convert a document to a 0/1 vector over the vocabulary: 1 if the word appears, 0 if not
def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("The word %s is not in the vocabulary" % word)
    return returnVec
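# Estimate the probabilities: log P(word | class 0), log P(word | class 1), and P(abusive)
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # Start the counts at 1 and the denominators at 2 (smoothing),
    # so that no word ends up with probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Return log probabilities: classifyNB adds them to math.log(pClass1),
    # so the word terms must be on the log scale as well
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive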
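# Compare the scores of the two classes and return the more likely one
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    # p0Vec and p1Vec must hold log probabilities: each score is the sum of
    # the log probabilities of the words present plus the log of the class prior
    p1=sum(vec2Classify*p1Vec) + math.log(pClass1)
    p0=sum(vec2Classify*p0Vec) + math.log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0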
# End-to-end test: train on the sample data, then classify the word list ceshi
def testingNB(ceshi):
    postingList, classVec = loadDataSet()
    new_list = createVocabList(postingList)
    train = []
    for i in postingList:
        train.append(setOfWords2Vec(new_list, i))
    p0Vect, p1Vect, pAbusive = trainNB0(train, classVec)
    ce = setOfWords2Vec(new_list, ceshi)
    print(classifyNB(ce, p0Vect, p1Vect, pAbusive))
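classifyNB adds a math.log term rather than multiplying probabilities. One common reason for working on the log scale is numerical: a product of many small probabilities underflows to 0.0 in floating point, while the sum of their logs stays finite. The snippet below is a minimal sketch of that effect; the list `probs` is made up for illustration and does not come from the training data above.

```python
import math

# Hypothetical values: 100 word probabilities of 1e-5 each.
probs = [1e-5] * 100

# Multiplying them directly underflows: 1e-500 is below the smallest float.
product = 1.0
for p in probs:
    product *= p

# Summing the logs carries the same information and stays finite.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

Because the logarithm is increasing, comparing the two log scores picks the same class as comparing the two products would, as long as the products do not underflow.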

Test results: