Python 3 Notes on "Machine Learning in Action": The k-Nearest Neighbors Algorithm

阿新 • Published: 2018-12-17

2.1 Implementing the kNN algorithm

This is a Python 3 implementation of kNN. The book uses Python 2, so the code here has been ported to Python 3.

import operator

import numpy as np


def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


# k-nearest neighbors algorithm
def classify0(inX, dataSet, labels, k):
    # Number of rows (samples) in the data set
    dataSetSize = dataSet.shape[0]

    # np.tile repeats inX dataSetSize times along the rows (once along the columns),
    # so the Euclidean distance to every sample can be computed in one step
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    # argsort returns the indices that sort the distances in ascending order
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the labels by vote count, highest first (done once, after the loop)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # Return the most common label among the k nearest neighbors
    return sortedClassCount[0][0]
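
The distance and voting steps in classify0 can also be written with NumPy broadcasting and collections.Counter. The following is a minimal sketch rather than the book's code; the function name classify_broadcast is hypothetical, and it assumes the same inputs as classify0.

```python
from collections import Counter

import numpy as np


def classify_broadcast(in_x, data_set, labels, k):
    """Hypothetical variant of classify0 that uses broadcasting instead of np.tile."""
    # Broadcasting subtracts in_x from every row of data_set.
    diff = np.asarray(data_set, dtype=float) - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the label with the most votes.
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_broadcast([0, 0], group, labels, 3))  # B
print(classify_broadcast([1, 1], group, labels, 3))  # A
```

Broadcasting avoids building the tiled copy of the input vector, which matters little for four samples but keeps the code shorter.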

Test it with the following input:

import kNN
group, labels = kNN.createDataSet()
print(kNN.classify0([0, 0], group, labels, 3))

Output:

B

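2.2 Using k-nearest neighbors to improve matches on a dating site

Download the companion materials for "Machine Learning in Action" from https://github.com/frankstar007/kNN. The dataset is in 2.2data and can be used directly. Note that the file is named datingTestSet2; the filename used in the book has no 2.

2.2.1 Add the following code to kNN.py: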

def file2matrix(filename):
    # Open the file and read every line into a list
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Matrix of zeros with numberOfLines rows and 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Empty list for the class labels
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip removes leading and trailing whitespace, including the newline
        line = line.strip()
        # Split the line on tab characters into a list of fields
        listFromLine = line.split('\t')
        # Store the first three fields in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # Store the last field as an integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

Test it in test.py:

import kNN
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])

[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ..., 
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
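
For a file in this layout, NumPy can also do the parsing directly. The sketch below is an assumption-laden alternative to file2matrix, not part of the book: the function name load_dating_file is hypothetical, and it assumes every line holds three numeric features followed by an integer label, separated by tabs. The three sample rows are the first three rows shown in the output above.

```python
import io

import numpy as np

# First three rows from the output above, in the tab-separated layout
# assumed for datingTestSet2.txt (three features, then an integer label).
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)


def load_dating_file(source):
    """Hypothetical alternative to file2matrix built on np.loadtxt."""
    # ndmin=2 keeps the result two-dimensional even for a single-line file.
    data = np.loadtxt(source, delimiter='\t', ndmin=2)
    features = data[:, :3]
    labels = data[:, 3].astype(int).tolist()
    return features, labels


features, labels = load_dating_file(sample)
print(features.shape)  # (3, 3)
print(labels)          # [3, 2, 1]
```

np.loadtxt accepts a filename as well as a file-like object, so the same function could be called with 'datingTestSet2.txt' if the file matches the assumed layout.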

2.2.2 Analyzing the data: creating scatter plots with Matplotlib

import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
# Column 1 (video game time) against column 2 (ice cream)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.show()


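The dataset has three features: the percentage of time spent playing video games, the liters of ice cream consumed per week, and the frequent flier miles earned per year. Plot each pair of features in color: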

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

# Scale marker size and color by class label so the classes are easy to tell apart
labelScale = 15.0 * np.array(list(map(int, datingLabels)))

ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
ax2.scatter(datingDataMat[:, 1], datingDataMat[:, 2], labelScale, labelScale)
ax3.scatter(datingDataMat[:, 0], datingDataMat[:, 2], labelScale, labelScale)
ax4.scatter(datingDataMat[:, 0], datingDataMat[:, 1], labelScale, labelScale)
plt.show()


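2.2.3 Preparing the data: normalizing values

Given the values in the table and the Euclidean distance between two points, the attribute with the largest numeric differences has the greatest effect on the result. When features have different value ranges, a single feature with a large range can dominate the overall distance. Normalization therefore rescales features with different ranges to a common range, typically 0 to 1 or -1 to 1. The following formula converts a feature of any range to a value between 0 and 1: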

                        newValue = (oldValue-min) / (max-min)

Here new and old refer to a single value within one column. The code below applies the formula to the whole array at once, so the resulting newValue is an array:

def autoNorm(dataSet):  # input: the data set
    minVals = dataSet.min(0)  # minimum of each column (an array)
    maxVals = dataSet.max(0)  # maximum of each column (an array)
    ranges = maxVals - minVals  # value range of each column
    m = dataSet.shape[0]  # number of rows
    normDataSet = dataSet - np.tile(minVals, (m, 1))  # subtract the column minimums
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise division by the ranges
    return normDataSet, ranges, minVals  # normalized matrix, ranges, minimums

Test:

import kNN

datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
normMat, ranges, minval = kNN.autoNorm(datingDataMat)
print(normMat, '\n', ranges, '\n', minval)

Output:

[[ 0.44832535  0.39805139  0.56233353]
 [ 0.15873259  0.34195467  0.98724416]
 [ 0.28542943  0.06892523  0.47449629]
 ..., 
 [ 0.29115949  0.50910294  0.51079493]
 [ 0.52711097  0.43665451  0.4290048 ]
 [ 0.47940793  0.3768091   0.78571804]] # normalized matrix
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00] # value ranges: max - min
 [ 0.        0.        0.001156] # minimums
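
To see why normalization matters, the sketch below compares the distance between two samples before and after min-max scaling. It is an illustration under stated assumptions, not the book's code: the function name min_max_normalize is hypothetical, it uses broadcasting instead of tile, and it normalizes over only the three sample rows printed earlier, so the numbers differ from the full-dataset output above.

```python
import numpy as np

# Three rows from the dating data: flier miles, video game time, ice cream.
rows = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])


def min_max_normalize(data):
    """Hypothetical broadcasting version of autoNorm."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # (oldValue - min) / (max - min), applied column by column via broadcasting.
    return (data - min_vals) / ranges, ranges, min_vals


norm_rows, ranges, min_vals = min_max_normalize(rows)

# Before normalization the distance is almost exactly the gap in flier miles.
raw_dist = np.linalg.norm(rows[0] - rows[1])
miles_gap = abs(rows[0, 0] - rows[1, 0])
print(raw_dist, miles_gap)

# After normalization all three features contribute on the same scale.
norm_dist = np.linalg.norm(norm_rows[0] - norm_rows[1])
print(norm_dist)
```

A column whose values are all identical would have a range of zero and cause a division by zero; neither this sketch nor autoNorm guards against that case.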

2.2.4 Testing the algorithm: validating the classifier as a complete program

def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the samples for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # samples and labels
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize; also returns ranges and minimums
    m = normMat.shape[0]  # number of samples
    numTestVecs = int(m * hoRatio)  # number of test samples
    errorCount = 0.0  # number of misclassified samples
    for i in range(numTestVecs):  # classify each test sample against the remaining 90%
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("The classifier came back with : %d , the real answer is : %d"
              % (int(classifierResult), int(datingLabels[i])))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is :%f" % (errorCount / float(numTestVecs)))  # error rate

Test code:

kNN.datingClassTest()

Output:

The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
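
The hold-out procedure used by datingClassTest can be sketched on synthetic data when the dating file is not at hand. Everything below is an assumption made for illustration: the two Gaussian clusters stand in for real data, knn_predict is a hypothetical compact classifier, and the shuffle is added only because the synthetic samples are generated class by class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: two well-separated 2-D clusters.
class_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
class_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
data = np.vstack([class_a, class_b])
labels = np.array([1] * 50 + [2] * 50)

# Shuffle so that the first 10% of rows contains both classes.
order = rng.permutation(len(data))
data, labels = data[order], labels[order]


def knn_predict(x, train_x, train_y, k):
    """Hypothetical compact kNN: majority label among the k nearest rows."""
    distances = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = train_y[distances.argsort()[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]


ho_ratio = 0.10
num_test = int(len(data) * ho_ratio)
# The first num_test rows are held out; the rest serve as the training set.
train_x, train_y = data[num_test:], labels[num_test:]

errors = sum(
    knn_predict(data[i], train_x, train_y, 3) != labels[i]
    for i in range(num_test)
)
error_rate = errors / num_test
print("error rate: %f" % error_rate)
```

As in datingClassTest, the held-out rows never appear in the training slice, so the error rate reflects samples the classifier has not seen.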

Final step: the dating site prediction function

The last step builds the classifier, reads in your own data, and prints the result.

def classfyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # class names
    precentTats = float(input("percentage of time spent playing video games?"))  # read the features
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # training set
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the training set
    inArr = np.array([ffMiles, precentTats, iceCream])  # feature vector, same column order as the data
    # classify0 takes the input vector inX, the training set dataSet, the label vector labels, and the number of neighbors k
    # The input is normalized with the training minimums and ranges; label n maps to resultList[n - 1]
    classfierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person : ", resultList[int(classfierResult) - 1])

Output:

You will probably like this person :  in small doses
