Machine Learning Notes 1: Implementing the k-Nearest Neighbors Algorithm

The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between different feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity.
Applicable data types: numeric and nominal.
The steps are as follows:
1. Compute the distance between the current point and every point in the dataset with known class labels.
2. Sort the distances in ascending order.
3. Take the k points closest to the current point.
4. Count how often each class appears among those k points.
5. Return the most frequent class among the k points as the predicted class of the current point.
The distance between points A and B in a 2D coordinate system is [(xA0-xB0)^2 + (xA1-xB1)^2]^(1/2).
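For example, the distance between [1.0, 1.1] and [0, 0] can be checked directly with numpy (a quick sketch of the same formula):

import numpy
a = numpy.array([1.0, 1.1])
b = numpy.array([0, 0])
dist = numpy.sqrt(((a - b) ** 2).sum())   # sqrt(1.0**2 + 1.1**2) = sqrt(2.21), roughly 1.487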
Suppose the training set is [[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]] and the label vector is ['A','A','B','B']; given an input vector, determine which class it belongs to:
import numpy
import operator


def createDataSet():
    # training set
    group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # label vector
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    # length of the first dimension of the matrix (number of training samples)
    dataSetSize = dataSet.shape[0]
    # array of differences between the input vector and each training sample
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    # distance from the input point to each training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices of the distance array, sorted by distance in ascending order
    sortedDistIndicies = distance.argsort()
    classCount = {}
    # count the votes of each class among the k nearest neighbors
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)
    # return the class with the most votes (the predicted class of the input vector)
    return sortedClassCount[0][0]


def test():
    group, labels = createDataSet()
    print(classify0([0.3, 0.5], group, labels, 2))


if __name__ == '__main__':
    test()
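Running this prints B: with k = 2, the two nearest neighbors of [0.3, 0.5] are [0, 0.1] (distance 0.5) and [0, 0] (distance about 0.58), and both are labeled B.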
A few unfamiliar functions used above:
shape: gives the dimensions of an array; shape[0], for example, is the length of the first dimension (the number of rows).
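A quick check in the interpreter:
>>> numpy.array([[1, 2], [3, 4], [5, 6]]).shape
(3, 2)
>>> numpy.array([[1, 2], [3, 4], [5, 6]]).shape[0]
3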
tile: tile(x, y) builds an array by repeating x according to y (a count or a shape), for example:
>>> numpy.tile([1,1],10)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> numpy.tile([2,1],[2,3])
array([[2, 1, 2, 1, 2, 1],
  [2, 1, 2, 1, 2, 1]])
sum(axis=1): sums along the given axis:
>>> a = numpy.array([1,2])
>>> a.sum()
3
>>> a.sum(axis=1)
Traceback (most recent call last):
 File "<pyshell#27>", line 1, in <module>
a.sum(axis=1)
 File "F:\python3\lib\site-packages\numpy\core\_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds
>>> numpy.array([[1,2,4],[2,4,5]]).sum(axis=1)
array([ 7, 11])
argsort(): returns the indices that would sort the array's values in ascending order (the index of the smallest value comes first):
>>> a = numpy.array([8,6,7,9,10,5,7])
>>> a.argsort()
array([5, 1, 2, 6, 0, 3, 4], dtype=int32)
items(): returns the dictionary's (key, value) pairs as an iterable view.
itemgetter(): returns a callable that fetches the given index (or indices) from its operand; here itemgetter(1) picks the vote count out of each (label, count) pair.
sorted(): sorted(iterable, key=None, reverse=False) sorts the elements of an iterable and returns a new list (the cmp argument from Python 2 no longer exists in Python 3).
iterable -- the iterable to sort.
key -- a one-argument function applied to each element to produce the value used for comparison.
reverse -- the sort order: reverse = True for descending, reverse = False (the default) for ascending.
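Putting items(), itemgetter() and sorted() together, this is how the vote counting in classify0 resolves to a single class:
>>> import operator
>>> classCount = {'A': 1, 'B': 2}
>>> sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
[('B', 2), ('A', 1)]
>>> sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]
'B'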
Example: predict how attractive a person is from the percentage of time spent playing video games, the number of frequent flyer miles earned per year, and the liters of ice cream consumed per week.
1. Parse the data. The data file given in the book has one issue: the class labels are strings, so they need to be converted to integers, for example with a small helper:
    def getValueOfClassLabel(ClassLabel):
        # hand out the next unused integer the first time a label string is seen
        if ClassLabel not in ValueOfClassLabel:
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]
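A self-contained sketch of what this mapping produces (the label strings below are placeholders, not necessarily the ones in the book's data file):

ValueOfClassLabel = {}
Value = [1, 2, 3]

def getValueOfClassLabel(ClassLabel):
    # hand out the next unused integer the first time a label string is seen
    if ClassLabel not in ValueOfClassLabel:
        ValueOfClassLabel[ClassLabel] = Value.pop()
    return ValueOfClassLabel[ClassLabel]

print([getValueOfClassLabel(s) for s in ['labelC', 'labelB', 'labelA', 'labelC']])
# prints [3, 2, 1, 3]: each distinct label gets one integer, reused on repeats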
The complete file-parsing code:
def file2matrix(filename):
    '''
    Parse the training-set file.
    '''
    ValueOfClassLabel = {}
    Value = [1, 2, 3]

    def getValueOfClassLabel(ClassLabel):
        # map each distinct label string to an integer (3, 2, 1 in order of first appearance)
        if ClassLabel not in ValueOfClassLabel:
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]

    file = open(filename)
    arrayOLines = file.readlines()
    file.close()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # the training set to return
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFormLine = line.split('\t')
        returnMat[index, :] = listFormLine[0:3]
        classLabelVector.append(getValueOfClassLabel(str(listFormLine[-1])))
        index += 1
    return returnMat, classLabelVector
The program above turns the file into the training set and label vector we need. Plotting the data gives an intuitive view of how the features relate to each other:
import numpy
import kNN
import matplotlib.pyplot as plt


fig = plt.figure()
ax = fig.add_subplot(111)
datingDataMat, datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
# size and color each point by its class label so the classes are visible
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()
3D plot:
import numpy
import kNN
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
datingDataMat, datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
# pass size and color by keyword so they are not mistaken for the zdir argument
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], datingDataMat[:, 2],
           s=15.0 * numpy.array(datingLabels), c=15.0 * numpy.array(datingLabels))
ax.set_xlabel('Frequent flyer miles earned per year')
ax.set_ylabel('Percentage of time spent playing video games')
ax.set_zlabel('Liters of ice cream consumed per week')
plt.show()
Multiple subplots:
import numpy
import kNN
import matplotlib.pyplot as plt


fig = plt.figure()
# three stacked subplots, one for each pair of features
ax1 = fig.add_subplot(311)
datingDataMat, datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
ax1.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
            15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
ax1.set_xlabel('fly')
ax2 = fig.add_subplot(312)
ax2.scatter(datingDataMat[:, 0], datingDataMat[:, 2],
            15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
ax3 = fig.add_subplot(313)
ax3.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
            15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
plt.show()
Unfamiliar functions:
add_subplot: specifies where the plot goes; for example 111 means the figure is divided into a 1x1 grid and we draw on the first (and only) subplot.
scatter: draws a scatter plot; the x and y coordinates are required, while marker size, color, shape and so on are optional.
zeros: creates an array filled with zeros.
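For example (the exact formatting of the printed array may vary slightly between numpy versions):
>>> numpy.zeros((2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])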
Normalization:
When features have very different value ranges, the values usually need to be normalized. If each feature is rescaled into the range 0 to 1 (or -1 to 1), the following formula maps any range of feature values into the interval 0 to 1:
newValue = (oldValue - min)/(max - min)
where min and max are the smallest and largest values of that feature in the dataset. The code is as follows:
def autoNum(dataSet):
    # minimum of each column
    minVals = dataSet.min(0)
    # maximum of each column
    maxVals = dataSet.max(0)
    # range of each column
    ranges = maxVals - minVals

    # normalize every row into the 0-1 range
    normDataSet = numpy.zeros(numpy.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet / numpy.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
An easy mistake: min(0) returns the minimum of each column, not the minimum of column 0; min() returns the minimum over all values, and min(1) returns the minimum of each row.
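A quick check in the interpreter:
>>> a = numpy.array([[5, 2], [1, 8]])
>>> a.min(0)
array([1, 2])
>>> a.min(1)
array([2, 1])
>>> a.min()
1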
The test program:
def datingClassTest():
    '''
    Test the classifier on a hold-out portion of the data.
    '''
    # use 10% of the data for testing and the rest for training
    hoRatio = 0.10
    datingDataMating, datingLabels = file2matrix('f:\\datingTestSet.txt')
    normMat, ranges, minVals = autoNum(datingDataMating)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d,the real answer is:%d' %
              (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is:%f' % (errorCount / float(numTestVecs)))
Test results:
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:1
the total error rate is:0.050000
The complete program (runs on Python 3):
import numpy
import operator


def createDataSet():
    '''
    Return a training set and its label vector.
    '''
    # training set
    group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # label vector
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    '''
    The k-nearest neighbors classifier: given an input vector, a training set,
    a label vector and a value of k, decide which class the input belongs to.
    '''
    # length of the first dimension of the matrix (number of training samples)
    dataSetSize = dataSet.shape[0]
    # array of differences between the input vector and each training sample
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    # distance from the input point to each training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices of the distance array, sorted by distance in ascending order
    sortedDistIndicies = distance.argsort()
    classCount = {}
    # count the votes of each class among the k nearest neighbors
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)
    # return the class with the most votes (the predicted class of the input vector)
    return sortedClassCount[0][0]


def file2matrix(filename):
    '''
    Parse the training-set file.
    '''
    ValueOfClassLabel = {}
    Value = [1, 2, 3]

    def getValueOfClassLabel(ClassLabel):
        # map each distinct label string to an integer (3, 2, 1 in order of first appearance)
        if ClassLabel not in ValueOfClassLabel:
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]

    file = open(filename)
    arrayOLines = file.readlines()
    file.close()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # the training set to return
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFormLine = line.split('\t')
        returnMat[index, :] = listFormLine[0:3]
        classLabelVector.append(getValueOfClassLabel(str(listFormLine[-1])))
        index += 1
    return returnMat, classLabelVector


def autoNum(dataSet):
    '''
    Normalize the data into the 0-1 range.
    '''
    # minimum of each column
    minVals = dataSet.min(0)
    # maximum of each column
    maxVals = dataSet.max(0)
    # range of each column
    ranges = maxVals - minVals

    # normalize every row
    normDataSet = numpy.zeros(numpy.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet / numpy.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals


def datingClassTest():
    '''
    Test the classifier on a hold-out portion of the data.
    '''
    hoRatio = 0.10
    datingDataMating, datingLabels = file2matrix('f:\\datingTestSet.txt')
    normMat, ranges, minVals = autoNum(datingDataMating)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d,the real answer is:%d' %
              (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is:%f' % (errorCount / float(numTestVecs)))


def test():
    datingClassTest()


if __name__ == '__main__':
    test()