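The k-Nearest Neighbors Algorithm: A Python Implementation
Most of this material comes from the book Machine Learning in Action, with my own understanding added.
1. A brief description of the KNN algorithm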
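k-Nearest Neighbor (KNN) classification is arguably the simplest machine learning algorithm. It classifies by measuring the distance between feature values. The idea is simple: if most of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to one class, then the sample is assigned to that class as well. The figure below is the classic example that everyone cites.
In the figure there are two classes of data, blue squares and red triangles, distributed in a two-dimensional space. Suppose we have a new data point, the green circle, and need to decide whether it belongs with the blue squares or with the red triangles. How? We first find the points closest to the green circle, because we assume that only the nearest points help determine its class. How many of them should we use? That number is k. If k=3, we look at the 3 points nearest to the green circle; red triangles make up 2/3 of them, so we assign the green circle to the red triangle class. If k=5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue square class. This shows that the choice of k matters a great deal.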
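To make the role of k concrete, here is a minimal sketch of the vote. The neighbor list is hypothetical: the distances are invented to mirror the figure described above (two red triangles closest, then three blue squares), and `majority_label` is an illustrative helper name, not something from the book.

```python
from collections import Counter

# Hypothetical (distance, label) pairs, already sorted by distance
# to the green circle. The numbers are made up; only their order matters.
neighbors = [
    (0.8, "red triangle"),
    (1.0, "red triangle"),
    (1.3, "blue square"),
    (1.9, "blue square"),
    (2.1, "blue square"),
]

def majority_label(sorted_neighbors, k):
    """Return the most common label among the k nearest neighbors."""
    votes = Counter(label for _, label in sorted_neighbors[:k])
    return votes.most_common(1)[0][0]

print(majority_label(neighbors, 3))  # red triangle (2 of 3 votes)
print(majority_label(neighbors, 5))  # blue square (3 of 5 votes)
```

The same query point gets a different class purely because k changed, which is why k is usually tuned on held-out data.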
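In KNN, the neighbors that are consulted are all objects that have already been correctly classified. The classification decision depends only on the class of the nearest one or few samples. Because KNN relies on a limited number of nearby samples rather than on discriminating class regions, it tends to be better suited than other methods to sample sets whose class regions cross or overlap heavily.
One main weakness of the algorithm shows up with imbalanced samples: when one class has many samples and the other classes have few, the k neighbors of a new sample may be dominated by the large class. One remedy is weighting, where neighbors closer to the sample receive larger weights. Another weakness is the computational cost, because for every sample to be classified the distance to all known samples must be computed before its k nearest neighbors can be found. A common remedy is to edit the known samples in advance and remove those that contribute little to classification. The algorithm is better suited to automatic classification of classes with relatively many samples; classes with few samples are more easily misclassified with it.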
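The weighting idea can be sketched as follows. This is only one possible scheme, assuming each vote is weighted by 1/distance; the text above does not prescribe a particular weight function. The helper name `weighted_vote`, the `eps` parameter, and the neighbor data are all made up for illustration.

```python
from collections import defaultdict

def weighted_vote(neighbors, k, eps=1e-9):
    """Pick a label from the k nearest (distance, label) pairs,
    weighting each vote by 1/distance (an assumed weight function)."""
    scores = defaultdict(float)
    for distance, label in sorted(neighbors)[:k]:
        scores[label] += 1.0 / (distance + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)

# Hypothetical neighbors: "big" has more samples nearby,
# but the two "small" samples are much closer to the query point.
neighbors = [(0.2, "small"), (0.3, "small"),
             (1.0, "big"), (1.1, "big"), (1.2, "big")]

print(weighted_vote(neighbors, 5))  # small
```

A plain majority vote over the same five neighbors would return "big" (3 votes against 2); with distance weights the two close "small" samples outweigh the three distant ones.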
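In short, we already have a labeled reference data set. When new, unlabeled data arrives, each of its features is compared with the corresponding feature of the samples in the data set, and the algorithm extracts the class labels of the most similar (nearest) samples. Usually only the k most similar samples in the data set are considered. Finally, the class that occurs most often among those k samples is chosen. The algorithm is:
1) Compute the distance between every point in the labeled data set and the current point;
2) Sort the points by increasing distance;
3) Select the k points closest to the current point;
4) Determine how often each class occurs among those k points;
5) Return the most frequent class among those k points as the predicted class of the current point.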
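The five steps above can be sketched in plain Python, without NumPy, to show the logic on its own. The function name `knn_classify` is my own; the four sample points are the same ones that createDataSet builds later in this post, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(point, samples, labels, k):
    """Classify `point` by majority vote among its k nearest labeled samples."""
    # 1) distance from every known sample to the query point (Euclidean)
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, point)))
                 for sample in samples]
    # 2) sample indices ordered by increasing distance
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # 3) keep the k closest
    nearest = order[:k]
    # 4) count how often each class appears among them
    counts = Counter(labels[i] for i in nearest)
    # 5) the most frequent class is the prediction
    return counts.most_common(1)[0][0]

samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]

print(knn_classify((0.1, 0.3), samples, labels, 3))  # B
print(knn_classify((1.2, 1.0), samples, labels, 3))  # A
```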
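2. The Python code
2.1 Importing data in Python
from numpy import *   # provides array, tile, zeros used in the snippets below

def createDataSet():
    # four 2-D sample points and their class labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
This creates the data set and its labels.
The core of the k-nearest neighbors program, following the five steps of the algorithm description above: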
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet   # tile: build an array by repeating inX dataSetSize times
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                     # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()         # indices that sort the distances in increasing order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # items() works on both Python 2 and 3 (iteritems() exists only on Python 2)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
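I am not sure whether it is an encoding-settings problem, but I could not write the comments in Chinese, only in English. (Python 2 reads source files as ASCII unless the file begins with an encoding declaration such as # -*- coding: utf-8 -*-, which is the likely cause.)
The book applies k-nearest neighbors to improving matches on a dating site. The workflow is as follows.
Preparing the data: parse the data from a text file. The text describes three features: frequent flier miles, time spent playing games, and ice cream consumed. I do not know why the author chose these three features; they seem to have little to do with dating matches.
This part uses many NumPy functions for working with matrices.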
def file2matrix(filename):
    with open(filename) as fr:                  # read the file once and close it afterwards
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)            # number of lines in the file
    returnMat = zeros((numberOfLines,3))        # feature matrix to return
    classLabelVector = []                       # class labels to return
    index = 0
    for line in arrayOLines:
        line = line.strip()                     # remove leading/trailing whitespace, including the newline
        listFromLine = line.split('\t')
        returnMat[index,:] = [float(x) for x in listFromLine[0:3]]   # the 3 features
        classLabelVector.append(int(listFromLine[-1]))               # the class label
        index += 1
    return returnMat,classLabelVector
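The parsing logic can be tried without the data file. The sketch below works on in-memory lines and plain lists instead of a NumPy matrix. The helper name `parse_rows` and the two sample rows are hypothetical, and the column order (flier miles, game time, ice cream, label) is an assumption inferred from the code in this post.

```python
def parse_rows(lines):
    """Split tab-separated lines into 3-feature rows and integer labels."""
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:                 # skip blank lines
            continue
        fields = line.split("\t")
        features.append([float(x) for x in fields[0:3]])
        labels.append(int(fields[-1]))
    return features, labels

# Hypothetical rows: miles, game time, ice cream, class label
sample = ["40000\t8.5\t0.9\t3\n", "15000\t7.0\t1.5\t2\n"]
features, labels = parse_rows(sample)

print(features)  # [[40000.0, 8.5, 0.9], [15000.0, 7.0, 1.5]]
print(labels)    # [3, 2]
```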
Processing the data involves normalizing the values. The dating example has three features, but the flier miles are numerically far larger than the other two. So that all three features carry equal weight, the data is normalized.
def autoNorm(dataSet):
    minVals = dataSet.min(0)                    # minimum of each column
    maxVals = dataSet.max(0)                    # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))   # element-wise division
    return normDataSet, ranges, minVals
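What autoNorm computes for each column is newValue = (oldValue - min) / (max - min), which maps every feature to the range 0 to 1. Below is a small plain-Python sketch of the same formula on made-up numbers. The name `min_max_normalize` is hypothetical, and the guard for a constant column (range 0) is my own addition; autoNorm above has no such guard.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    normalized = [
        # a column with zero range is mapped to 0.0 (added guard)
        [(value - lo) / rng if rng else 0.0
         for value, lo, rng in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins

# Hypothetical rows: miles, game time, ice cream
rows = [[40000.0, 8.0, 1.0], [10000.0, 2.0, 0.5], [20000.0, 5.0, 1.5]]
normalized, ranges, mins = min_max_normalize(rows)

print(normalized[0])  # [1.0, 1.0, 0.5]
print(ranges)         # [30000.0, 6.0, 1.0]
```

Before scaling, a difference of a few thousand miles swamps any difference in the other two columns when distances are computed; after scaling, each column contributes at most 1.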
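Another application is a handwriting recognition system. As with the dating site example, preparing the data requires converting each image to a vector, after which the core k-nearest neighbors routine is called.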
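The image-to-vector step can be illustrated on a tiny in-memory example. This sketch assumes each image is stored as lines of '0' and '1' characters and that the digit label is the part of the file name before the underscore, as the full listing in this post suggests. The helper names, the 4x4 "image", and the file name 3_12.txt are made up for illustration.

```python
def text_image_to_vector(lines, size):
    """Flatten `size` lines of `size` '0'/'1' characters into one list of ints."""
    vector = []
    for row in lines[:size]:
        vector.extend(int(ch) for ch in row[:size])
    return vector

def label_from_filename(name):
    """Extract the digit label from a name such as '3_12.txt'."""
    return int(name.split(".")[0].split("_")[0])

tiny = ["0110", "1001", "1001", "0110"]  # hypothetical 4x4 image; the code in this post uses 32x32

print(text_image_to_vector(tiny, 4))
print(label_from_filename("3_12.txt"))  # 3
```

Each flattened vector then plays the role of one row of the training matrix, so the classifier itself does not need to know that the data came from images.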
Below is all of the code together with the test code. The main block also includes some matplotlib plotting code for testing.
'''
kNN: k Nearest Neighbors
Input:  inX: vector to compare to existing dataset (1xN)
        dataSet: data set of M known vectors (MxN)
        labels: data set labels (1xM vector)
        k: number of neighbors to use for comparison (should be an odd number)
Output: the most popular class label
'''
from numpy import *
import operator
from os import listdir
import matplotlib
import matplotlib.pyplot as plt

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet   # tile: build an array by repeating inX dataSetSize times
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                     # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()         # indices that sort the distances in increasing order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # items() works on both Python 2 and 3 (iteritems() exists only on Python 2)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

def file2matrix(filename):
    with open(filename) as fr:                  # read the file once and close it afterwards
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)            # number of lines in the file
    returnMat = zeros((numberOfLines,3))        # feature matrix to return
    classLabelVector = []                       # class labels to return
    index = 0
    for line in arrayOLines:
        line = line.strip()                     # remove leading/trailing whitespace, including the newline
        listFromLine = line.split('\t')
        returnMat[index,:] = [float(x) for x in listFromLine[0:3]]   # the 3 features
        classLabelVector.append(int(listFromLine[-1]))               # the class label
        index += 1
    return returnMat,classLabelVector

def autoNorm(dataSet):
    minVals = dataSet.min(0)                    # minimum of each column
    maxVals = dataSet.max(0)                    # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))   # element-wise division
    return normDataSet, ranges, minVals

def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data for testing
    datingDataMat,datingLabels = file2matrix(r'E:\PythonMachine Learning in Action\datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    print(m)
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # the first numTestVecs rows are the test set, the remaining rows are the training set
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
    print(errorCount)

def classifyperson():
    resultList = ['not at all','in small doses','in large doses']
    # raw_input is Python 2; on Python 3 use input() instead
    percentTats = float(raw_input('percentage of time spent on games? '))
    ffmiles = float(raw_input('frequent flier miles per year? '))
    iceCream = float(raw_input('liters of ice cream consumed each year? '))
    datingDataMat,datingLabels = file2matrix(r'E:\PythonMachine Learning in Action\datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffmiles,percentTats,iceCream])
    # normalize the input with the same minimums and ranges as the training data
    classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    print("You will probably like this person: %s" % resultList[classifierResult-1])

def img2vector(filename):
    # flatten a 32x32 text image of '0'/'1' characters into a 1x1024 vector
    returnVect = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('E:/PythonMachine Learning in Action/trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]         # take off .txt
        classNumStr = int(fileStr.split('_')[0])    # the digit before the underscore is the label
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('E:/PythonMachine Learning in Action/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('E:/PythonMachine Learning in Action/testDigits')   # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]         # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('E:/PythonMachine Learning in Action/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))

if __name__=='__main__':
    #classifyperson()
    datingClassTest()
    dataSet, labels = createDataSet()
    testX = array([1.2, 1.0])
    k = 3
    outputLabel = classify0(testX, dataSet, labels, k)
    print("Your input is: %s and classified to class: %s" % (testX, outputLabel))
    testX = array([0.1, 0.3])
    outputLabel = classify0(testX, dataSet, labels, k)
    print("Your input is: %s and classified to class: %s" % (testX, outputLabel))
    handwritingClassTest()
    datingDataMat,datingLabels = file2matrix(r'E:\PythonMachine Learning in Action\datingTestSet2.txt')
    print(datingDataMat)
    print(datingLabels[0:20])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot features 2 and 3; marker size and color both depend on the class label
    ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
    plt.show()
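One thing to note here:
    trainingFileList = listdir('E:/PythonMachine Learning in Action/trainingDigits')
Pay attention to how the path is written when calling this function. If you do not want to spell out a long path, simply put the data folders in the same directory as knn.py.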
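If the folders do sit next to the script, one way to avoid hard-coded absolute paths is to build them with os.path.join. This is only a sketch: the helper name `data_path` is hypothetical, and it assumes the folder names used in this post.

```python
import os

def data_path(base_dir, *parts):
    """Join `base_dir` and the given path components with the OS separator."""
    return os.path.join(base_dir, *parts)

# In a script, the directory that contains the script can serve as base_dir:
#     base_dir = os.path.dirname(os.path.abspath(__file__))
base_dir = os.getcwd()  # stand-in so the sketch also runs interactively

print(data_path(base_dir, "trainingDigits"))
print(data_path(base_dir, "testDigits", "3_12.txt"))  # hypothetical file name
```

The result can be passed to listdir or open in place of the string literals in the listing above.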