
Machine Learning: the K-Nearest Neighbors Algorithm, kNN (Part 2)

1. Hyperparameters of the kNN Algorithm

"""
    超引數 :執行機器學習演算法之前需要指定的引數
    模型引數:演算法過程中學習的引數

    kNN演算法沒有模型引數
    kNN演算法中的k是典型的超引數

    尋找最好的k
"""

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()

# Data matrix (features)
X = digits.data
# Labels
Y = digits.target

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

best_score = 0.0
best_k = -1
best_method = ""

# First version: search only over k
# for k in range(1, 11):
#     kNeighborsClassifier = KNeighborsClassifier(n_neighbors=k)
#     kNeighborsClassifier.fit(x_train, y_train)
#     score = kNeighborsClassifier.score(x_test, y_test)
#     if score > best_score:
#         best_k = k
#         best_score = score

# Second version: also search over how neighbors are weighted
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        kNeighborsClassifier = KNeighborsClassifier(n_neighbors=k, weights=method)
        kNeighborsClassifier.fit(x_train, y_train)
        score = kNeighborsClassifier.score(x_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print(best_k)
print(best_score)
print(best_method)
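One caveat: if the best k lands on the boundary of the searched range, the true optimum may lie just outside it. A minimal sketch of extending the search in that case, reusing the variables above:

# If best_k sits on the upper edge of range(1, 11), widen the search
if best_k == 10:
    for k in range(10, 21):
        knn = KNeighborsClassifier(n_neighbors=k, weights=best_method)
        knn.fit(x_train, y_train)
        score = knn.score(x_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print(best_k, best_score)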

2. Using GridSearchCV

"""
    Grid Search
"""

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)],
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
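The extra 'p' in the second grid is the exponent of the Minkowski distance that KNeighborsClassifier uses (p=1 gives Manhattan distance, p=2 Euclidean). As a quick illustration of the metric itself, a minimal sketch with two made-up points:

import numpy as np

def minkowski_distance(a, b, p):
    # (sum_i |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean)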

digits = datasets.load_digits()

# Data matrix (features)
X = digits.data
# Labels
Y = digits.target

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

kNeighborsClassifier = KNeighborsClassifier()

grid_search = GridSearchCV(kNeighborsClassifier, param_grid, verbose=2)

grid_search.fit(x_train, y_train)

result = grid_search.best_estimator_
best_score = grid_search.best_score_
best_params = grid_search.best_params_

print(result)
print(best_score)
print(best_params)
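Note that best_score_ is the mean cross-validation score on the training folds, not the accuracy on the held-out test set. Since GridSearchCV refits the best estimator on the whole training set by default, it can be evaluated on the test data directly:

print(grid_search.best_estimator_.score(x_test, y_test))

Passing n_jobs=-1 to GridSearchCV runs the candidate fits on all CPU cores, which speeds up the search considerably.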

3. Why Normalize the Data

            Tumor size (cm)    Discovery time (days)
Sample 1    1                  200
Sample 2    5                  100

The distance between samples is dominated by the discovery time.
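For example, the Euclidean distance between the two raw samples is sqrt((5 - 1)^2 + (100 - 200)^2) = sqrt(16 + 10000) ≈ 100.1, which is determined almost entirely by the time axis.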

Data normalization

Solution: map all of the data onto the same scale.

Min-max normalization: maps all data into the range [0, 1]

x_scale = (x - x_min) / (x_max - x_min)

Suitable when the distribution has clear boundaries; strongly affected by outliers.

Mean-variance normalization (standardization): maps all data onto a distribution with mean 0 and variance 1

Suitable when the distribution has no clear boundaries and may contain extreme values.

x_scale = (x - x_mean) / S

Note: x_mean is the mean and S is the standard deviation.

Example

import numpy as np

x = np.random.randint(0, 100, size=100)

# Min-max normalization: map each value into [0, 1]
x_data = (x - np.min(x)) / (np.max(x) - np.min(x))

X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)

X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X)) / (np.max(X[:, 1]) - np.min(X[:, 1]))

# Mean-variance normalization (standardization)
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)

X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

print(np.mean(X2[:, 0]))  # approximately 0
print(np.std(X2[:, 0]))   # approximately 1

Normalizing the test set

Normalize the test set with the mean_train and std_train computed from the training data:

(x_test - mean_train) / std_train

This yields the normalized test set.

The test data simulate the real environment.
In the real environment, it is very likely impossible to obtain the mean and variance of all incoming data.
Normalizing the data is therefore part of the algorithm itself.
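Before the sklearn version below, a minimal NumPy sketch of this idea, using hypothetical raw_train / raw_test arrays:

import numpy as np

raw_train = np.random.randint(0, 100, (80, 2)).astype(float)
raw_test = np.random.randint(0, 100, (20, 2)).astype(float)

# Statistics are computed from the training set only
mean_train = np.mean(raw_train, axis=0)
std_train = np.std(raw_train, axis=0)

# Both sets are normalized with the same training statistics
train_std = (raw_train - mean_train) / std_train
test_std = (raw_test - mean_train) / std_train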

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

standardScaler = StandardScaler()
# After fit, the scaler has stored the mean and std needed for standardization
standardScaler.fit(x_train)

x_train = standardScaler.transform(x_train)
x_test_standard = standardScaler.transform(x_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train, y_train)
score = knn_clf.score(x_test_standard, y_test)

print(standardScaler.mean_)
print(standardScaler.scale_)
print(score)  # 1.0

After normalization, the classification accuracy is 1.0.
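For comparison, a sketch of the same classifier on the un-scaled data (re-splitting with the same random_state, since x_train was overwritten above). On iris the four features already share similar scales, so the gap here is small; on data like the tumor example above it can be dramatic:

x_train_raw, x_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

knn_raw = KNeighborsClassifier(n_neighbors=3)
knn_raw.fit(x_train_raw, y_train)
print(knn_raw.score(x_test_raw, y_test))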