Machine Learning in Action: kNN

阿新 • Published: 2018-12-25

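The k-nearest neighbors algorithm (kNN) classifies a sample by measuring the distance between feature vectors, which makes it a very intuitive method. This post records the book's example of using kNN to improve matches on a dating site. The code targets Python 2.

Task 1: The classification algorithm classify0

classify0 uses the Euclidean distance formula to compute the distance between the input and every training sample, selects the k nearest points, and returns the label that occurs most often among those k points as the prediction.

from numpy import *   # tile, zeros, shape
import operator

def classify0(inX, dataset, labels, k):
    # shape returns (rows, columns); shape[0] is the number of rows, i.e. samples
    datasetsize = dataset.shape[0]
    # tile repeats inX so that it has the same shape as dataset
    diffmat = tile(inX, (datasetsize, 1)) - dataset
    # ** is exponentiation
    sqdiffmat = diffmat ** 2
    # sum the squared differences across each row
    sqdistances = sqdiffmat.sum(axis=1)
    distances = sqdistances ** 0.5
    # argsort returns the indices that sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    # count the votes of the k nearest points
    classcount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classcount[voteIlabel] = classcount.get(voteIlabel, 0) + 1
    # sort by the second element (the count), in descending order
    sortedClasscount = sorted(classcount.iteritems(), key=operator.itemgetter(1), reverse=True)
    # return the label that occurs most often
    return sortedClasscount[0][0]

1. Syntax of get():

dict.get(key, default=None)

key: the key to look up in the dictionary.
default: the value returned when the key is not present.

2. Syntax of sorted (Python 2):

sorted(iterable[, cmp[, key[, reverse]]])

Parameters:
iterable: the iterable to sort.
cmp: a comparison function of two arguments, both taken from the iterable. It must return a positive number if the first is greater, a negative number if it is smaller, and 0 if they are equal.
key: a function of one argument, taken from the iterable, that extracts the value each element is compared by.
reverse: the sort order. reverse=True sorts in descending order, reverse=False in ascending order (the default).

3. What the dictionary method iteritems does

It does much the same as items, except that it returns an iterator instead of a list. It exists only in Python 2. Calling iteritems() on a dict x:

>>> f = x.iteritems()
>>> f
<dictionary-itemiterator object at 0xb74d5e3c>
>>> type(f)
<type 'dictionary-itemiterator'>   # an iterator over the dictionary's items
>>> list(f)
[('url', 'www.iplaypy.com'), ('title', 'python web site')]

iteritems() is the best fit when the items only need to be iterated over, and it is efficient because it does not build an intermediate list.

4. Usage of the tile function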
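In classify0, tile serves only to stretch inX to the shape of dataset before the subtraction. As a minimal sketch (Python 3 with NumPy; the sample values and the names in_x, diff_tile and diff_broadcast are made up for illustration), the same difference matrix can also be obtained through NumPy broadcasting, which avoids building the repeated copy:

```python
import numpy as np

# Toy data, assumed for illustration only: 4 samples with 2 features each.
dataset = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
in_x = np.array([0.0, 0.2])

# The tile approach used in classify0: repeat in_x once per row of dataset.
diff_tile = np.tile(in_x, (dataset.shape[0], 1)) - dataset

# Broadcasting: NumPy stretches the 1-D in_x across the rows automatically.
diff_broadcast = in_x - dataset

# Euclidean distance from in_x to every sample.
distances = np.sqrt((diff_broadcast ** 2).sum(axis=1))

# argsort returns the indices that would sort the distances in ascending order.
nearest_first = distances.argsort()
print(nearest_first)
```

Both forms give identical results. The tile form makes the repetition explicit, which is why the plain tile calls are worth looking at on their own.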

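>>> numpy.tile([0, 0], 5)        # repeat [0,0] 5 times along the columns; rows default to 1
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> numpy.tile([0, 0], (1, 1))   # repeat [0,0] once along the columns, once along the rows
array([[0, 0]])
>>> numpy.tile([0, 0], (2, 1))   # repeat [0,0] once along the columns, twice along the rows
array([[0, 0],
       [0, 0]])
>>> numpy.tile([0, 0], (3, 1))
array([[0, 0],
       [0, 0],
       [0, 0]])

5. The argsort function

argsort returns the indices that would sort an array in ascending order. classify0 uses it to find the k nearest points.

Task 2: Reading the data

Note that the book has a mistake here: the file to read is datingTestSet2.txt, not datingTestSet.txt.

def file2matrix(filename):
    fr = open(filename)
    # open the file and read it line by line
    arrayOLines = fr.readlines()
    fr.close()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # create a zero matrix with one row per line and 3 columns
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip leading and trailing whitespace
        line = line.strip()
        # split on the tab separator
        listFromLine = line.split('\t')
        # store the first three fields as the features of this row
        returnMat[index, :] = listFromLine[0:3]
        # the last field is the class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

Task 3: Plotting with Matplotlib

Installing Matplotlib also requires the packages numpy, dateutil, pytz, pyparsing, six and setuptools. All of them are available from a single, fairly complete download page. Copy them into the python27\Lib\site-packages directory. In PowerShell, cd to the folder that contains datingTestSet2.txt, start python, and paste the following commands:

import numpy
import matplotlib
import matplotlib.pyplot as plt
import kNN
reload(kNN)
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# marker size and color both depend on the class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
plt.show()

This produces the plot of the last two features (columns 1 and 2). To plot the first two features instead, change the scatter call to:

ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))

Task 4: Normalization

To keep features with large numeric values from dominating the distance, each feature is rescaled to the range 0 to 1, that is, (value - min) / (max - min).

def autoNorm(dataSet):
    # minimum of each column (feature)
    minVals = dataSet.min(0)
    # maximum of each column (feature)
    maxVals = dataSet.max(0)
    # range between maximum and minimum
    ranges = maxVals - minVals
    # zero matrix with the same shape as the data set
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # subtract the minimum from every sample
    normDataSet = dataSet - tile(minVals, (m, 1))
    # divide element-wise by the range to normalize
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

Task 5: Classifying and testing with the book's data

def datingClassTest():
    # fraction of the data held out to test the classifier
    hoRatio = 0.10
    # load the data from datingTestSet2.txt
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    # number of test samples
    numTestVecs = int(m * hoRatio)
    # number of misclassified samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test sample i against the remaining samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
        if (classifierResult != datingLabels[i]):
            errorCount += 1.0
    # compute the error rate
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))
    print errorCount

The error rate is 5%.

Task 6: Converting an image to a test vector (img2vector)
Task 7: Testing handwritten digit recognition (handwritingClassTest)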

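def img2vector(filename):
    # flatten a 32x32 text image of '0'/'1' characters into a 1x1024 row vector
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            # pixel (i, j) goes to position 32*i + j
            returnVect[0,32*i+j] = int(lineStr[j])
    fr.close()
    return returnVect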
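A minimal Python 3 sketch of the same conversion, assuming each image file holds 32 lines of 32 '0'/'1' characters, as img2vector expects. The name img_to_vector and the temporary sample file are hypothetical and serve only to exercise the function:

```python
import os
import tempfile

import numpy as np


def img_to_vector(filename):
    """Read a 32x32 text image of '0'/'1' characters into a (1, 1024) array."""
    with open(filename) as fr:
        # Take the first 32 characters of each of the first 32 lines.
        rows = [[int(ch) for ch in fr.readline()[:32]] for _ in range(32)]
    return np.array(rows, dtype=float).reshape(1, 1024)


# Build a hypothetical sample image: all zeros except the first row.
sample_lines = ["1" * 32] + ["0" * 32] * 31
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

vec = img_to_vector(sample_path)
os.remove(sample_path)
print(vec.shape, int(vec.sum()))
```

Reshaping row by row places pixel (i, j) at index 32*i + j, the same layout the loop in img2vector produces.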

from os import listdir   # needed to list the digit files

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')           #load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])     #the digit is the part before '_'
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')        #iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))
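The pipeline used in the dating test (normalize, then vote among the k nearest neighbors) can be sketched in Python 3 as follows. This is an illustrative rewrite, not the book's code: the names auto_norm and classify, the use of collections.Counter in place of the iteritems-based sort, and the toy data are all assumptions.

```python
from collections import Counter

import numpy as np


def auto_norm(data):
    """Rescale every column to [0, 1] via (value - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies the per-column min and range to every row.
    return (data - min_vals) / ranges, ranges, min_vals


def classify(in_x, data, labels, k):
    """Return the majority label among the k rows of data nearest to in_x."""
    distances = np.sqrt(((data - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): the second feature is on a much larger scale.
raw = np.array([[1.0, 1000.0], [1.2, 900.0], [0.1, 100.0], [0.0, 0.0]])
labels = ["A", "A", "B", "B"]

norm, ranges, min_vals = auto_norm(raw)
# A new sample must be normalized with the same min and range.
query = (np.array([0.2, 150.0]) - min_vals) / ranges
print(classify(query, norm, labels, k=3))
```

The sketch does not guard against a feature whose range is zero, which would cause a division by zero in auto_norm, and it resolves tied votes by whichever label was counted first.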
