
Machine Learning in Action — A Simple Naive Bayes Classifier in Python

Basic Formulas

Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B)
Assuming B1, B2, …, Bn are conditionally independent given A, we have: P(B1, B2, …, Bn | A) = P(B1|A) × P(B2|A) × … × P(Bn|A)
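To make the first formula concrete, here is a quick numeric check of Bayes' theorem. The probabilities are made up for illustration and are not taken from the data set below.

```python
# Illustrative (made-up) probabilities, not from the data set below.
p_a = 0.3          # P(A)
p_b_given_a = 0.5  # P(B|A)
p_b = 0.4          # P(B)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.375
```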

Data (fictional)

A1 A2 A3 A4 A5 B
1  1  1  1  3  no
1  1  1  2  2  soft
1  1  2  1  3  no
1  1  2  2  1  hard
1  2  1  1  2  no
1  2  1  2  3  soft
1  2  2  1  1  no
1  2  2  2  2  hard
2  1  1  1  3  no
2  1  1  2  3  soft
2  1  2  1  1  no
2  1  2  2  1  hard
2  2  1  1  2  no
2  2  1  2  3  soft
2  2  2  1  2  soft
2  2  2  2  2  hard
3  1  1  1  1  no
3  1  1  2  2  soft
3  1  2  1  1  no
3  1  2  2  1  hard
3  2  1  1  3  soft
3  2  1  2  1  soft
3  2  2  1  2  no
3  2  2  2  3  no

Five features and one label.

Algorithm Steps

1. Compute the probabilities from the training set:
(1) Compute the priors:
P(B="hard"), P(B="soft"), P(B="no")
(2) Compute the conditional probabilities:
P(A1="1"|B="hard"), P(A1="2"|B="hard"), P(A1="3"|B="hard");
P(A2="1"|B="hard"), P(A2="2"|B="hard"), ...

P(A1="1"|B="soft"), P(A1="2"|B="soft"), P(A1="3"|B="soft");
P(A2="1"|B="soft"), P(A2="2"|B="soft"), ...

P(A1="1"|B="no"), P(A1="2"|B="no"), P(A1="3"|B="no");
P(A2="1"|B="no"), P(A2="2"|B="no"), ...

2. Apply Bayes' theorem to compute the class probabilities for a test sample:
Compute: P(B="hard"|test_A), P(B="soft"|test_A), P(B="no"|test_A)
The class with the largest probability is the naive Bayes classifier's prediction.

Code Implementation

def train(dataSet, labels):
    uniqueLabels = set(labels)
    res = {}
    for label in uniqueLabels:
        res[label] = []
        # prior probability P(B=label)
        res[label].append(labels.count(label) / float(len(labels)))
        for i in range(len(dataSet[0]) - 1):
            # values of Ai over the samples with this label
            tempCols = [l[i] for l in dataSet if l[-1] == label]
            uniqueValues = set(tempCols)
            probs = {}  # renamed from `dict` to avoid shadowing the builtin
            for value in uniqueValues:
                count = tempCols.count(value)
                # conditional probability P(Ai=value | B=label)
                prob = count / float(labels.count(label))
                probs[value] = prob
            res[label].append(probs)

    return res


def test(testVect, probMat):
    hard = probMat['hard']
    soft = probMat['soft']
    no = probMat['no']
    phard = hard[0]
    psoft = soft[0]
    pno = no[0]
    for i in range(len(testVect)):
        if testVect[i] in hard[i + 1]:
            phard *= hard[i + 1][testVect[i]]
        else:
            phard = 0

        if testVect[i] in soft[i + 1]:
            psoft *= soft[i + 1][testVect[i]]
        else:
            psoft = 0

        if testVect[i] in no[i + 1]:
            pno *= no[i + 1][testVect[i]]
        else:
            pno = 0
    res = {}  # missing in the original; collect the three scores here
    res['hard'] = phard
    res['soft'] = psoft
    res['no'] = pno
    print(phard, psoft, pno)
    return max(res, key=res.get)

Loading the Data

def loadDataSet(filename):
    returnMat = []
    labels = []
    with open(filename) as fr:  # `with` ensures the file is closed
        for line in fr:
            # split on any run of whitespace (more robust than split('  '))
            listFromLine = line.strip().split()
            labels.append(listFromLine[-1])
            returnMat.append(listFromLine)
    return returnMat, labels

Computing Probabilities from the Training Set

The `res` returned here is a dictionary holding all of the probabilities described in step 1 of the algorithm above. Its structure is:

{'hard': [P(B="hard"), {'1': P(A1="1"|B="hard"), '2': P(A1="2"|B="hard"), '3': P(A1="3"|B="hard")}, {'1': P(A2="1"|B="hard"), '2': P(A2="2"|B="hard")}, {'1': P(A3="1"|B="hard"),'2':P(A3="2"|B="hard")}, {'1': P(A4="1"|B="hard"),'2':P(A4="2"|B="hard")}, {'1': P(A5="1"|B="hard"), '2': P(A5="2"|B="hard"), '3': P(A5="3"|B="hard")}], 

'soft': [P(B="soft"),  {'1': P(A1="1"|B="soft"), '2': P(A1="2"|B="soft"), '3': P(A1="3"|B="soft")}, {'1': P(A2="1"|B="soft"), '2': P(A2="2"|B="soft")}, {'1': P(A3="1"|B="soft"),'2':P(A3="2"|B="soft")}, {'1': P(A4="1"|B="soft"),'2':P(A4="2"|B="soft")}, {'1': P(A5="1"|B="soft"), '2': P(A5="2"|B="soft"), '3': P(A5="3"|B="soft")}], 

'no': [P(B="no"),  {'1': P(A1="1"|B="no"), '2': P(A1="2"|B="no"), '3': P(A1="3"|B="no")}, {'1': P(A2="1"|B="no"), '2': P(A2="2"|B="no")}, {'1': P(A3="1"|B="no"),'2':P(A3="2"|B="no")}, {'1': P(A4="1"|B="no"),'2':P(A4="2"|B="no")}, {'1': P(A5="1"|B="no"), '2': P(A5="2"|B="no"), '3': P(A5="3"|B="no")}]}

Note that if a probability is 0, the corresponding key-value pair is simply absent from the dictionary.
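Those absent (zero) entries are what make `test()` zero out a class whenever a test sample contains a feature value never seen with that class. A standard remedy, not used in this article's code, is Laplace (add-alpha) smoothing. The sketch below is a hypothetical variant of `train()` under that assumption:

```python
def train_smoothed(dataSet, labels, alpha=1.0):
    """Hypothetical variant of train() with Laplace (add-alpha) smoothing,
    so every observed feature value gets a nonzero conditional probability."""
    res = {}
    n_features = len(dataSet[0]) - 1
    for label in set(labels):
        label_count = labels.count(label)
        res[label] = [label_count / float(len(labels))]  # prior P(B=label)
        for i in range(n_features):
            # values of Ai among samples with this label
            col = [row[i] for row in dataSet if row[-1] == label]
            # every value Ai takes anywhere in the training set
            all_values = set(row[i] for row in dataSet)
            probs = {}
            for value in all_values:
                # (count + alpha) / (label_count + alpha * #values)
                probs[value] = (col.count(value) + alpha) / \
                               (float(label_count) + alpha * len(all_values))
            res[label].append(probs)
    return res
```

Because every value gets a key, the smoothed table has no missing entries and the products in `test()` never collapse to 0 for values seen elsewhere in the training data.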

Computing Class Probabilities for a Test Sample

Test Results

dataSet, labels = loadDataSet("dataset.txt")
probMat = train(dataSet, labels)

res = test(['3', '1', '2', '2', '1'], probMat)
print(res)

Output