Python for Beginners, Machine Learning Series: The Naive Bayes Algorithm
阿新 • Published: 2018-11-01
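I. General approach:
1. Build the vocabulary: the set of all distinct words that occur across all documents.
2. Convert each document into a 0/1 vector over the vocabulary, with 1 where the word occurs. Every document then becomes a vector of the same length.
3. Compute three probabilities: first, the probability that a document is abusive; second, the probability of each word within abusive documents; third, the probability of each word within non-abusive documents.
4. To compute the second and third, iterate over the 0/1 vectors, add up the vectors that belong to the same class, and divide by the total number of word occurrences in that class. The result is a vector holding the probability of each word; a word that never occurs in the class gets 0 before any smoothing is applied.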
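The four steps above can be sketched on a tiny corpus. This is only an illustration under stated assumptions: the documents, the labels, and the names `docs`, `labels`, and `word_probs` are made up here, and no smoothing is applied, so words unseen in a class get probability 0.

```python
# Hypothetical toy corpus: tokenized documents with labels (1 = abusive, 0 = not).
docs = [["stupid", "dog"], ["cute", "dog"], ["stupid", "garbage"], ["love", "dog"]]
labels = [1, 0, 1, 0]

# Step 1: vocabulary of distinct words (sorted so the column order is reproducible).
vocab = sorted({word for doc in docs for word in doc})

# Step 2: one 0/1 vector per document, all of length len(vocab).
vectors = [[1 if word in doc else 0 for word in vocab] for doc in docs]

# Step 3, first probability: the fraction of documents that are abusive.
p_abusive = sum(labels) / len(labels)


def word_probs(cls):
    """Steps 3 and 4: per-word probability inside one class (no smoothing)."""
    rows = [vec for vec, label in zip(vectors, labels) if label == cls]
    counts = [sum(column) for column in zip(*rows)]  # add up the vectors of the class
    total = sum(counts)                              # all word occurrences in the class
    return [count / total for count in counts]


p_words_abusive = word_probs(1)
p_words_normal = word_probs(0)

print(vocab)
print(p_abusive)
print(dict(zip(vocab, p_words_abusive)))
print(dict(zip(vocab, p_words_normal)))
```

Sorting the vocabulary is a choice made for this sketch so that the vector columns come out in a fixed order; a plain `set` gives no such guarantee.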
II. Code implementation:
import math

import numpy as np


def loadDataSet():
    # Toy corpus: six tokenized posts and their labels (1 = abusive, 0 = not abusive).
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec


def createVocabList(dataSet):
    # Build the vocabulary: the union of the words of all documents, returned as a list.
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    # Turn a document into a 0/1 vector over the vocabulary:
    # 1 if the word occurs in the document, 0 otherwise.
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("The word is not in Vocabulary")
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    # Compute the prior P(abusive) and the per-word log-probabilities of each class.
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Start the counts at 1 (and the denominators at 2.0) so that no word ends up
    # with probability 0, which would make the logarithm below undefined.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Store logarithms: classifyNB adds them to math.log(prior), so the word terms
    # must be on the log scale as well.
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Compare the log scores of the two classes and return the label of the larger one.
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB(ceshi):
    # Final test function that ties everything together:
    # train on the toy corpus, then classify the word list `ceshi`.
    postingList, classVec = loadDataSet()
    new_list = createVocabList(postingList)
    train = []
    for i in postingList:
        train.append(setOfWords2Vec(new_list, i))
    p0Vect, p1Vect, pAbusive = trainNB0(train, classVec)
    ce = setOfWords2Vec(new_list, ceshi)
    print(classifyNB(ce, p0Vect, p1Vect, pAbusive))
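classifyNB scores each class with a sum that includes `math.log(...)` terms rather than with a product of probabilities. A minimal sketch of why working on the log scale helps is given here; the probability values are made up purely for illustration, and the variable names are not part of the code above.

```python
import math

# Hypothetical per-word probabilities for one long document under two classes.
probs_class0 = [1e-5] * 80
probs_class1 = [2e-5] * 80

# Multiplying the probabilities directly underflows to 0.0 for both classes,
# so the two products can no longer be compared.
product0 = math.prod(probs_class0)
product1 = math.prod(probs_class1)

# Summing the logarithms stays finite and keeps the ordering of the two classes.
log_score0 = sum(math.log(p) for p in probs_class0)
log_score1 = sum(math.log(p) for p in probs_class1)

print(product0, product1)       # both products have underflowed
print(log_score0 < log_score1)  # class 1 still scores higher on the log scale
```

Because the logarithm is monotonic, comparing the sums of logs gives the same decision that comparing the exact products would give, without the floating-point underflow.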
Test results: