"Machine Learning in Action" Study Notes (1): kNN (k-Nearest Neighbors)

阿新 • Published: 2019-01-09

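I already covered kNN in an earlier post, "Machine Learning Study Notes from Scratch (20): kNN (k-Nearest Neighbor) Study Notes", so feel free to go back and read it if you are interested. The idea behind kNN is very simple. It needs no training: given a sample to classify, pick the k samples in the data set that are closest to it, and assign the label that occurs most often among those k samples.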
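The rule described above fits in a few lines of plain Python. This is only a minimal sketch, separate from the book's code: the helper name `knn_predict` and the four toy samples are made up for illustration, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(query, samples, labels, k):
    """Hypothetical helper: majority label among the k samples nearest to query."""
    # Euclidean distance from the query to every stored sample
    distances = [math.dist(query, sample) for sample in samples]
    # Indices of the k smallest distances
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for this example
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

prediction = knn_predict((0.1, 0.2), samples, labels, k=3)
print(prediction)  # prints B: two of the three nearest samples are labeled 'B'
```

There is no fitting step at all. The "model" is just the stored samples, so every prediction costs one pass over the whole data set.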
I ran the code from the book under Python 3 and added comments to make it quick to follow. The complete code is on my GitHub for anyone who wants to download it.

'''
Created on Sep 16, 2010
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (MxN)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin

---------------------------
@modified: Kabuto_hui
@date: 2018/12/19
---------------------------
''' 

from numpy import *
import operator
from os import listdir


def classify0(inX, dataSet, labels, k):
    '''
    kNN classifier
    :param inX:         vector to classify
    :param dataSet:     training data set
    :param labels:      labels corresponding to the data set
    :param k:           number of neighbors
    :return:            predicted class label
    '''
    # Number of samples in the training set
    dataSetSize = dataSet.shape[0]
    # Repeat the input vector dataSetSize times along the rows, then subtract the training set;
    # in effect, subtract every training vector from the input vector
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # Squared differences
    sqDiffMat = diffMat ** 2
    # Sum of squared differences per sample
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root of the sum gives the Euclidean distance
    distances = sqDistances ** 0.5
    # Sort distances in ascending order and return the indices
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Walk over the k training samples nearest to the input and pick the most frequent label
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count each label among the k nearest samples
    # Python 2 version: sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) # sort the dict by value, descending
    return sortedClassCount[0][0]   # the key with the largest count is the predicted label


def file2matrix(filename):
    '''
    Read the data file and return the data set and its labels
    :param filename:    file name
    :return:            sample matrix and its label list
    '''
    # Read all lines once; the with block closes the file afterwards
    with open(filename) as fr:
        lines = fr.readlines()
    # Number of lines [number of samples]
    numberOfLines = len(lines)  # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # prepare matrix to return
    classLabelVector = []  # prepare labels return
    index = 0
    # Process the file line by line
    for line in lines:
        line = line.strip()    # strip leading and trailing whitespace
        listFromLine = line.split('\t')         # split on tabs
        returnMat[index, :] = listFromLine[0:3] # the first three columns are the features
        classLabelVector.append(int(listFromLine[-1]))  # the last column is the label
        index += 1
    return returnMat, classLabelVector


def autoNorm(dataSet):
    '''
    Min-max normalization of the features: x' = (x - x_min) / (x_max - x_min)
    :param dataSet: data set
    :return:        normalized data set, per-feature range (max - min), per-feature minimum
    '''
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # Number of samples
    m = dataSet.shape[0]
    # Apply x' = (x - x_min) / (x_max - x_min) to every sample
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element wise divide
    return normDataSet, ranges, minVals


def datingClassTest():
    '''
    Test code for the dating-site experiment
    '''
    # Fraction of the data held out as the test set
    hoRatio = 0.10  # hold out 10%
    # Load the data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    # Normalize
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Number of samples in the data set
    m = normMat.shape[0]
    # Number of test samples
    numTestVecs = int(m * hoRatio)
    # Initialize the error count
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify each test sample against the remaining samples, k=3
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        # Count a misclassification
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


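def img2vector(filename):
    '''
    Convert an image to a vector
    Note: the images supplied with the book are already 32*32 matrices of 0s and 1s;
          this function flattens such a matrix into a vector
    :param filename:    file name
    :return:            the resulting 1x1024 vector
    '''
    returnVect = zeros((1, 1024))
    # The with block closes the file afterwards
    with open(filename) as fr:
        for i in range(32):
            # Read one row
            lineStr = fr.readline()
            for j in range(32):
                # Convert each character in the row to int and store it in the vector
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect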


def handwritingClassTest():
    '''
    Test code for the handwritten digit recognition experiment
    '''
    hwLabels = []
    # List the files in the training directory
    trainingFileList = listdir('trainingDigits')  # load the training set
    # Number of training files
    m = len(trainingFileList)
    # Initialize the training matrix
    trainingMat = zeros((m, 1024))
    # Process every file
    for i in range(m):
        # File name
        fileNameStr = trainingFileList[i]
        # Split the file name to get the label, e.g. 0_0.txt
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        # Store the label of this image
        hwLabels.append(classNumStr)
        # Flatten the image matrix into a vector and store it in the training matrix
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)

    # List the files in the test directory
    testFileList = listdir('testDigits')  # iterate through the test set
    # Initialize the misclassification count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    for i in range(mTest):
        # Label of this test sample
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        # Flatten the test sample into a vector
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        # Classify it with kNN
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        # Count a misclassification
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

if __name__ == '__main__':
    print('-' * 15, 'Test code for the dating-site experiment', '-' * 15)
    datingClassTest()
    print('-' * 15, 'Test code for the handwritten digit recognition experiment', '-' * 15)
    handwritingClassTest()

Related files and full code: kabutohui/Machine_Learning_In_Action

Summary and Reflections

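This chapter uses a lot of matrix addition and subtraction. In particular it uses tile(), a function I had not used before, which repeats a vector n times to form a matrix. When computing the distance from the vector being classified to every training sample, tile() replaces an explicit loop.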
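To make the role of tile() concrete, the sketch below computes the same distances three ways: with tile() as in the chapter, with NumPy broadcasting, and with an explicit loop. The broadcasting variant is an alternative I am adding for comparison and is not part of the book's code; the arrays `query` and `train` are made-up toy values.

```python
import numpy as np

# Toy data, invented for this example
query = np.array([0.0, 0.0])
train = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])

# 1) tile(): repeat the query once per training row, then subtract
tiled = np.tile(query, (train.shape[0], 1))   # shape (3, 2)
dist_tile = np.sqrt(((tiled - train) ** 2).sum(axis=1))

# 2) Broadcasting: NumPy stretches the 1-D query across the rows implicitly
dist_broadcast = np.sqrt(((query - train) ** 2).sum(axis=1))

# 3) Explicit loop over the training rows, shown only for comparison
dist_loop = np.array([np.sqrt(((query - row) ** 2).sum()) for row in train])

print(dist_tile)  # distances 5, 1 and 2 for the three training rows
```

All three give the same distances. tile() materializes the repeated matrix in memory, whereas broadcasting avoids that copy, which can matter for large training sets.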

Another part worth borrowing is the file reading, which relies heavily on str splitting to extract the labels.
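The two splitting patterns used in the listing can be isolated as small functions. This is a sketch only: the helper names `label_from_filename` and `parse_tab_line` are hypothetical, and the sample inputs are invented rather than taken from the book's data files.

```python
def label_from_filename(file_name):
    """Hypothetical helper: digit files are named <label>_<index>.txt, e.g. '0_13.txt'."""
    stem = file_name.split('.')[0]      # drop the extension: '0_13'
    return int(stem.split('_')[0])      # the part before '_' is the label


def parse_tab_line(line):
    """Hypothetical helper: three tab-separated features followed by an integer label."""
    fields = line.strip().split('\t')
    features = [float(value) for value in fields[:3]]
    label = int(fields[-1])
    return features, label


digit_label = label_from_filename('9_45.txt')
features, label = parse_tab_line('1.5\t2.0\t0.25\t3\n')
print(digit_label, features, label)
```

Both helpers assume well-formed input. A file name without an underscore, or a line with a non-numeric field, raises a ValueError at the int() or float() call.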
