
Machine Learning (Zhou Zhihua), Exercise 10.1 — Solution

Implement a k-nearest-neighbour (KNN) classifier and, on watermelon dataset 3.0α (which has only the density and sugar-content attributes), compare its classification boundary with that of a decision tree.

The KNN uses Euclidean distance, with votes weighted by distance so that closer neighbours count more. Minimizing the error rate on the training set (leave-one-out), k = 1 turns out to be optimal.
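The distance-weighted vote can be sketched on toy data first. This is a minimal illustration, not the full script below: `weighted_knn_vote` is a hypothetical helper, and the inverse-distance weight 1/d is one common choice for "closer counts more".

```python
import numpy as np

def weighted_knn_vote(test, train_X, train_y, k):
	# Euclidean distance from the test point to every training point.
	d = np.linalg.norm(train_X - test, axis=1)
	# Indices of the k nearest neighbours.
	order = np.argsort(d)[:k]
	votes = {}
	for i in order:
		# Each neighbour votes for its class with weight 1/d (closer = heavier).
		votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (d[i] + 1e-12)
	return max(votes, key=votes.get)

# Toy data: two clusters in the unit square.
X = np.array([[0.1, 0.1], [0.2, 0.1], [0.9, 0.9], [0.8, 0.9]])
y = ['no', 'no', 'yes', 'yes']
print(weighted_knn_vote(np.array([0.85, 0.95]), X, y, 3))  # → yes
```

With k = 3 the query near the top-right cluster picks up two 'yes' neighbours with small distances, so their combined weight easily outvotes the single 'no'.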

[Figure: KNN decision surface]

[Figure: decision-tree classification surface (hand-drawn)]

Comparing the two figures, the classification boundaries largely agree, but the decision tree's boundary is more absolute: unlike KNN, it leaves no regions where points of different classes are still mixed together.

The KNN and decision-surface plotting code is as follows:

# coding: utf-8
import numpy as np
from numpy import linalg as LA
from numpy import random
import matplotlib.pyplot as plt

# Load density, sugar content and label (the last three columns), skipping the header row.
file = open('西瓜資料集3.csv', encoding='utf-8')
data = [raw.strip('\n').split(',')[-3:] for raw in file][1:]
for i in range(len(data)):
	for j in [0, 1]:
		data[i][j] = float(data[i][j])

def predict(test, train, k):
	# Euclidean distance from the test point to every training point.
	dist = [LA.norm(np.array(raw[:-1]) - np.array(test)) for raw in train]
	dist = sorted(enumerate(dist), key=lambda x: x[1])
	# Drop zero-distance matches so a training point never votes for itself
	# during the leave-one-out evaluation below.
	dist = [(i, d) for i, d in dist if d > 0]
	count = {}
	for i, d in dist[:k]:
		label = train[i][-1]
		# Weight each vote by inverse distance: closer neighbours count more.
		count[label] = count.get(label, 0) + 1.0 / d
	return max(count, key=lambda x: count[x])

# Leave-one-out error rate on the training set; k = 1 gives the best result.
correct = 0
k = 1
for i in range(len(data)):
	res = predict(data[i][:-1], data, k)
	correct += 1 if res == data[i][-1] else 0
accuracy = correct / len(data)
print(accuracy)

def plot_decision_boundary(train_data, k):
	# Classify 1000 random points in the unit square and colour them by prediction.
	test_data = random.rand(1000, 2)
	x = [i[0] for i in test_data]
	y = [i[1] for i in test_data]
	color = ['g' if predict(test, train_data, k) == '是' else 'b' for test in test_data]
	plt.scatter(x, y, c=color)
	plt.xlabel('density')
	plt.ylabel('sugar content')
	plt.show()

plot_decision_boundary(data, 1)
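Sampling 1000 random points gives a somewhat ragged picture of the boundary; evaluating the classifier on a regular grid is one alternative. A minimal sketch, assuming predictions are the strings '是'/'否' as in the script above (`grid_boundary` and the diagonal stand-in predictor are illustrative, not part of the exercise):

```python
import numpy as np

def grid_boundary(predict_fn, n=100):
	# Build a regular n-by-n grid over [0, 1]^2 (density x sugar content).
	xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
	pts = np.column_stack([xs.ravel(), ys.ravel()])
	# One colour per grid point, by predicted class.
	color = ['g' if predict_fn(p) == '是' else 'b' for p in pts]
	return pts, color

# Hypothetical stand-in predictor: '是' above the line density + sugar = 1.
pts, color = grid_boundary(lambda p: '是' if p[0] + p[1] > 1 else '否', n=50)
# To draw: plt.scatter(pts[:, 0], pts[:, 1], c=color, s=4); plt.show()
print(len(pts), len(set(color)))  # → 2500 2
```

Passing `lambda t: predict(t, data, 1)` as `predict_fn` would render the same KNN surface as above, but on an even grid so the boundary appears as a clean curve rather than scattered dots.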