Python 3 Notes on Machine Learning in Action: the k-Nearest Neighbors Algorithm
阿新 • Published: 2018-12-17
2.1 Implementing the kNN algorithm
The book uses Python 2; the kNN implementation below is ported to Python 3.
import numpy as np
import operator  # operator module, provides itemgetter

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# k-nearest neighbors algorithm
def classify0(inX, dataSet, labels, k):
    # first value of shape: the number of training samples
    dataSetSize = dataSet.shape[0]
    # tile repeats inX dataSetSize times as rows (once across columns);
    # the next lines compute the Euclidean distance to every sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort the array in ascending order
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the label with the most votes among the k nearest neighbors
    return sortedClassCount[0][0]
Test it as follows. Input:
import kNN
group, labels = kNN.createDataSet()
print(kNN.classify0([0, 0], group, labels, 3))
Output:
B
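To see why the query point [0, 0] comes back as B, the sketch below prints the distance from the query to each of its three nearest training points and then takes a majority vote. It uses the same four points as createDataSet. Using collections.Counter for the vote is a substitution made here for brevity, not the book's code.

```python
import numpy as np
from collections import Counter

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0, 0])

# Euclidean distance from the query to every training point
distances = np.sqrt(((group - query) ** 2).sum(axis=1))

# indices of the 3 nearest points, closest first
nearest = distances.argsort()[:3]

# majority vote over the labels of those 3 points
votes = Counter(labels[i] for i in nearest)
prediction = votes.most_common(1)[0][0]

for i in nearest:
    print(labels[i], round(float(distances[i]), 3))
print("prediction:", prediction)
```

Two of the three nearest neighbors carry the label B, so B wins the vote.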
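2.2 Using the k-nearest neighbors algorithm to improve matches on a dating site
Download the supporting material for Machine Learning in Action from https://github.com/frankstar007/kNN. The data set is in 2.2data and can be used as is. (Note that the file is named datingTestSet2; the name used in the book has no 2.)
2.2.1 Add the following code to kNN.py: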
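def file2matrix(filename):
    # open the file; readlines() reads every line up to EOF
    # and returns them as a list that a for ... in ... loop can process
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # zero matrix with numberOfLines rows and 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # empty list for the class labels
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip removes the trailing newline and surrounding whitespace
        line = line.strip()
        # split the line on the tab character \t into a list of fields
        listFromLine = line.split('\t')
        # the first three fields are the features; store them in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # the last field is the class label; store it as an int,
        # which matches the integer labels in the printed output
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector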
Test it in test.py:
import kNN
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])
Output:
[[ 4.09200000e+04 8.32697600e+00 9.53952000e-01]
 [ 1.44880000e+04 7.15346900e+00 1.67390400e+00]
 [ 2.60520000e+04 1.44187100e+00 8.05124000e-01]
 ...,
 [ 2.65750000e+04 1.06501020e+01 8.66627000e-01]
 [ 4.81110000e+04 9.13452800e+00 7.28045000e-01]
 [ 4.37570000e+04 7.88260100e+00 1.33244600e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
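When every field in the file is numeric, the same parsing can probably be done with np.loadtxt instead of a hand-written loop. The sketch below is an alternative, not the book's method. It writes three rows taken from the output above to a temporary file (the file and its contents are stand-ins for datingTestSet2.txt) and then splits the features from the labels.

```python
import os
import tempfile
import numpy as np

# three tab-separated rows copied from the printed output above
sample = ("40920\t8.326976\t0.953952\t3\n"
          "14488\t7.153469\t1.673904\t2\n"
          "26052\t1.441871\t0.805124\t1\n")

# write the sample to a temporary file standing in for the real data file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

data = np.loadtxt(path, delimiter='\t')  # shape (3, 4), all values parsed as floats
os.remove(path)

features = data[:, :3]                         # first three columns: the features
classLabels = data[:, 3].astype(int).tolist()  # last column: integer class labels

print(features.shape, classLabels)
```

This assumes the label column holds integers; a file with text labels would still need the loop in file2matrix.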
2.2.2 Analyzing the data: creating scatter plots with Matplotlib
import matplotlib
import os
import matplotlib.pyplot as plt
from numpy import *
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.show()
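The data set has three features: the percentage of time spent playing video games (column 1), the liters of ice cream consumed per week (column 2), and the frequent flyer miles earned per year (column 0). Plot each pair of features in color: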
import numpy as np
import matplotlib.pyplot as plt

# marker size and color both scale with the class label (1, 2, 3),
# which makes the three classes easier to tell apart
labelValues = 15.0 * np.array(list(map(int, datingLabels)))

fig = plt.figure()
ax = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
ax2.scatter(datingDataMat[:, 1], datingDataMat[:, 2], s=labelValues, c=labelValues)
ax3.scatter(datingDataMat[:, 0], datingDataMat[:, 2], s=labelValues, c=labelValues)
ax4.scatter(datingDataMat[:, 0], datingDataMat[:, 1], s=labelValues, c=labelValues)
plt.show()
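2.2.3 Preparing the data: normalizing numeric values
With the Euclidean distance between two points, the feature with the largest numeric differences has the largest effect on the result. When the features have very different ranges, one of them can dominate the overall distance. Normalization therefore maps features with different ranges onto a common range, usually 0 to 1 or -1 to 1. The following formula converts a feature with any range of values into a value between 0 and 1: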
newValue = (oldValue-min) / (max-min)
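Here newValue and oldValue refer to a single entry in one column. The code below applies the formula to the whole array at once, so the result is an array of new values.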
def autoNorm(dataSet):  # input: the feature matrix
    minVals = dataSet.min(0)  # minimum of each column (a 1-D array)
    maxVals = dataSet.max(0)  # maximum of each column (a 1-D array)
    ranges = maxVals - minVals  # range of each column
    m = dataSet.shape[0]  # number of rows
    normDataSet = dataSet - np.tile(minVals, (m, 1))  # subtract the column minimum from every row
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise division by the column range
    return normDataSet, ranges, minVals  # normalized matrix, ranges, minimum values
Test:
import kNN
from numpy import *
import operator
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
normMat , ranges , minval= kNN.autoNorm(datingDataMat)
print(normMat,'\n' ,ranges,'\n' , minval)
Output:
[[ 0.44832535 0.39805139 0.56233353]
[ 0.15873259 0.34195467 0.98724416]
[ 0.28542943 0.06892523 0.47449629]
...,
[ 0.29115949 0.50910294 0.51079493]
[ 0.52711097 0.43665451 0.4290048 ]
 [ 0.47940793 0.3768091 0.78571804]] # normalized matrix
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00] # ranges: max - min
[ 0. 0. 0.001156] # minimum values
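NumPy broadcasting can apply the same formula without np.tile, because a 1-D array of column minimums is stretched across every row automatically. The sketch below checks that both forms agree on three rows taken from the datingDataMat output shown earlier. Since the minimum and maximum come from those three rows only, the numbers differ from the full-data output above.

```python
import numpy as np

# three rows taken from the datingDataMat output shown earlier
data = np.array([[40920.0, 8.326976, 0.953952],
                 [14488.0, 7.153469, 1.673904],
                 [26052.0, 1.441871, 0.805124]])

minVals = data.min(axis=0)           # per-column minimum
ranges = data.max(axis=0) - minVals  # per-column range

# broadcasting stretches the 1-D minVals and ranges across every row
normalized = (data - minVals) / ranges

# the explicit np.tile version used in autoNorm
m = data.shape[0]
tiled = (data - np.tile(minVals, (m, 1))) / np.tile(ranges, (m, 1))

print(normalized)
print(np.allclose(normalized, tiled))
```

Either form divides by the column range, so a column whose values are all identical would need special handling.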
2.2.4 Testing the algorithm: validating the classifier as a complete program
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the samples for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # samples and their labels
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the samples; also get ranges and minimums
    m = normMat.shape[0]  # number of samples
    numTestVecs = int(m * hoRatio)  # number of test samples
    errorCount = 0.0  # number of misclassified test samples
    for i in range(numTestVecs):  # classify each test sample against the remaining 90%
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("The classifier came back with : %d , the real answer is : %d" % (int(classifierResult), int(datingLabels[i])))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is :%f" % (errorCount / float(numTestVecs)))  # compute and print the error rate
Test code:
import kNN
kNN.datingClassTest()
Output:
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
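The hold-out procedure in datingClassTest can be tried without the data file. The sketch below is a minimal stand-in under stated assumptions: the toy data, the helper knn_predict, and the 40-row size are all hypothetical and chosen so the result is easy to verify by hand. As in datingClassTest, the first 10% of the rows are classified against the remaining 90%, which only gives a fair estimate when the rows are in no particular order.

```python
import numpy as np
from collections import Counter

def knn_predict(x, trainX, trainY, k):
    # majority vote among the k training points closest to x
    dist = np.sqrt(((trainX - x) ** 2).sum(axis=1))
    nearest = dist.argsort()[:k]
    return Counter(trainY[i] for i in nearest).most_common(1)[0][0]

# hypothetical toy data: class 1 lies near x = 0, class 2 near x = 10,
# and the two classes alternate row by row
xs = np.array([[i * 0.1 + (10.0 if i % 2 else 0.0), 0.0] for i in range(40)])
ys = [2 if i % 2 else 1 for i in range(40)]

hoRatio = 0.10
m = xs.shape[0]
numTestVecs = int(m * hoRatio)  # the first 10% of the rows are held out

errorCount = 0
for i in range(numTestVecs):
    predicted = knn_predict(xs[i], xs[numTestVecs:], ys[numTestVecs:], 3)
    if predicted != ys[i]:
        errorCount += 1

errorRate = errorCount / numTestVecs
print("error rate: %f" % errorRate)
```

Because the two toy classes are far apart, every held-out row is classified correctly; real data such as the dating set will generally show some errors.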
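Final step: the dating site prediction function
The last step builds the complete classifier: it reads the feature values you enter and prints the prediction.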
def classfyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # class names, indexed by label - 1
    precentTats = float(input("percentage of time spent playing video games?"))  # read the three features
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # training set
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the training set
    inArr = np.array([ffMiles, precentTats, iceCream])  # feature vector, in the same column order as the data file
    # classify0 takes four arguments: the input vector inX, the training set dataSet,
    # the label vector labels, and the number of nearest neighbors k
    classfierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)  # normalize the input the same way as the training set
    # the class label minus 1 is the index into resultList
    print("You will probably like this person :", resultList[int(classfierResult) - 1])
Output:
You will probably like this person : in small doses