Introduction to the Naive Bayes Algorithm and Analysis of a Python Implementation

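Concepts:

  Bayes' theorem: Bayes' theorem is named after Thomas Bayes, an 18th-century theologian. In general, the probability of event A given that event B has occurred is not the same as the probability of event B given that event A has occurred. The two are nevertheless related in a definite way, and Bayes' theorem is the statement of that relationship.

  Naive Bayes: Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class. For a given training data set, it first learns the joint probability distribution of input and output under that independence assumption. Then, for a given input x, it uses Bayes' theorem with the learned model to find the output y that has the highest posterior probability (the maximum a posteriori, or MAP, decision).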

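In plain terms: given a data set and a new, unclassified sample, we use the data to estimate how likely the sample's feature values are within each class, compute a probability for each class from those estimates, and assign the new sample to the class with the highest probability.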
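
Written out, the description above corresponds to the usual naive Bayes decision rule. The notation is introduced here for illustration and does not appear in the original text: y is a class, x_1, ..., x_n are the feature values of the new sample, and the product over features is what the conditional independence assumption provides.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```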

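Formula:

P(h | d) = P(d | h) * P(h) / P(d)

Where:
P(h | d): the probability of hypothesis h given the data d, called the posterior probability
P(d | h): the probability of the data d given that hypothesis h is true
P(h): the probability that hypothesis h is true regardless of the data, called the prior probability of h
P(d): the probability of the data d, regardless of the hypothesis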
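
As a small worked illustration of the formula, the sketch below plugs in made-up numbers and computes the posterior. The values (a 1% prior, a 90% chance of seeing d when h is true, and a 5% chance of seeing d when h is false) are assumptions chosen only for this example, and the function name posterior is hypothetical. P(d) is obtained from the law of total probability.

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """Return P(h | d) using Bayes' theorem."""
    # P(d) = P(d | h) * P(h) + P(d | not h) * P(not h)
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
    return p_d_given_h * p_h / p_d

# Hypothetical numbers, chosen only for illustration.
p_h = 0.01              # P(h): prior probability of the hypothesis
p_d_given_h = 0.90      # P(d | h): probability of the data if h is true
p_d_given_not_h = 0.05  # P(d | not h): probability of the data if h is false

print(round(posterior(p_h, p_d_given_h, p_d_given_not_h), 4))  # prints 0.1538
```

Even with a high P(d | h), the posterior stays modest here because the prior P(h) is small.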

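Implementation steps:

1 Handle data: load the data and split it into training data and test data
2 Summarize data: summarize the training data per class (mean and standard deviation of each attribute) so that probabilities can be computed and predictions made later
3 Make predictions: use the summaries of the training data to predict the class of each test sample
4 Evaluate accuracy: compare the predictions with the test data to measure how accurate they are

Code: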

# Example of Naive Bayes implemented from scratch in Python (Python 3)
import csv
import random
import math

def loadCsv(filename):
    # Read every row of the CSV file and convert each field to float
    with open(filename, "r", newline="") as f:
        dataset = [[float(x) for x in row] for row in csv.reader(f) if row]
    return dataset

def splitDataset(dataset, splitRatio):
    # Randomly move rows into the training set until it reaches the requested size;
    # the rows left in the copy become the test set
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

def separateByClass(dataset):
    # Group rows by class label, which is the last value of each row
    separated = {}
    for vector in dataset:
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    # Sample standard deviation (divides by N - 1)
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    # (mean, stdev) for each column; the last column is the class label, so drop it
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    # Map each class label to the list of (mean, stdev) pairs of its attributes
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculateProbability(x, mean, stdev):
    # Gaussian probability density of x for the given mean and standard deviation
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    # For each class, multiply the densities of all attributes.
    # Note: the class prior P(h) is not included in this product.
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    # Return the class label with the largest computed value
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

def getAccuracy(testSet, predictions):
    # Percentage of test rows whose label matches the prediction
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    filename = "pima-indians-diabetes.data.csv"
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print("Split {0} rows into train={1} and test={2} rows".format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print("Accuracy: {0}%".format(accuracy))

main()
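
The listing multiplies one Gaussian density per attribute. With many attributes such a product can underflow to 0.0 in floating point, so a common variation is to add logarithms instead; because the logarithm is monotonic, the class with the largest sum is also the class with the largest product. The following is a minimal sketch of that variation and is not part of the original listing. The helper names (summarize_by_class, log_gaussian, predict_log) are hypothetical and the toy data is made up. Like the listing, it leaves out the class prior.

```python
import math
from statistics import mean, stdev

def summarize_by_class(rows):
    """Return {label: [(mean, stdev), ...]} for rows whose last value is the label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[-1], []).append(row[:-1])
    return {label: [(mean(col), stdev(col)) for col in zip(*feats)]
            for label, feats in by_class.items()}

def log_gaussian(x, mu, sigma):
    """Natural log of the Gaussian density at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict_log(summaries, features):
    """Pick the class with the largest sum of per-attribute log densities."""
    scores = {label: sum(log_gaussian(x, mu, sigma)
                         for x, (mu, sigma) in zip(features, stats))
              for label, stats in summaries.items()}
    return max(scores, key=scores.get)

# Made-up toy data: two attributes followed by a class label (0.0 or 1.0).
toy = [
    [1.0, 2.1, 0.0], [1.2, 1.9, 0.0], [0.8, 2.0, 0.0],
    [5.0, 7.9, 1.0], [5.3, 8.2, 1.0], [4.7, 8.0, 1.0],
]
model = summarize_by_class(toy)
print(predict_log(model, [1.1, 2.0]))  # a point near the first cluster
print(predict_log(model, [5.1, 8.1]))  # a point near the second cluster
```

On this toy data the two queries fall close to the first and second cluster respectively, so they are assigned the labels 0.0 and 1.0.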

Download link for pima-indians-diabetes.data.csv:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

References:

1 https://en.wikipedia.org/wiki/Naive_Bayes_classifier

2 https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

3 https://machinelearningmastery.com/naive-bayes-for-machine-learning/
