Machine Learning: the k-Nearest Neighbors Algorithm, kNN (2)
阿新 • Published: 2019-02-16
1. Hyperparameters in the kNN algorithm
"""
Hyperparameter: a parameter that must be specified before the learning algorithm runs
Model parameter: a parameter learned during training
kNN has no model parameters
The k in kNN is a classic hyperparameter
Goal: search for the best k
"""
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
# data matrix
X = digits.data
# labels
Y = digits.target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

best_score = 0.0
best_k = -1
best_method = ""
# Plain search over k only:
# for k in range(1, 11):
#     kNeighborsClassifier = KNeighborsClassifier(n_neighbors=k)
#     kNeighborsClassifier.fit(x_train, y_train)
#     score = kNeighborsClassifier.score(x_test, y_test)
#     if score > best_score:
#         best_k = k
#         best_score = score
# Search over k and the weighting scheme together:
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        kNeighborsClassifier = KNeighborsClassifier(n_neighbors=k, weights=method)
        kNeighborsClassifier.fit(x_train, y_train)
        score = kNeighborsClassifier.score(x_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print(best_k)
print(best_score)
print(best_method)
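One more knob worth searching by hand when weights="distance" is the Minkowski exponent p (p=1 is Manhattan distance, p=2 Euclidean). A minimal sketch extending the loop above; the fixed random_state and the reduced ranges for k and p are my choices for speed, not from the original:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

best_score, best_k, best_p = 0.0, -1, -1
for k in range(1, 6):
    for p in range(1, 4):  # p=1 Manhattan, p=2 Euclidean, p=3 Minkowski
        clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        clf.fit(x_train, y_train)
        score = clf.score(x_test, y_test)
        if score > best_score:
            best_score, best_k, best_p = score, k, p

print(best_k, best_p, best_score)
```

Note that p only matters in combination with distance weighting here; with uniform weights the extra loop would still change the metric, but the grid would triple in size for every k.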
2. Using GridSearchCV
"""
Grid Search
"""
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)],
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
digits = datasets.load_digits()
# data matrix
X = digits.data
# labels
Y = digits.target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
kNeighborsClassifier = KNeighborsClassifier()
grid_search = GridSearchCV(kNeighborsClassifier, param_grid, verbose=2)
grid_search.fit(x_train, y_train)
result = grid_search.best_estimator_
best_score = grid_search.best_score_
best_params = grid_search.best_params_
print(result)
print(best_score)
print(best_params)
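After the search it is worth checking the winner against the held-out test set, since best_score_ is a cross-validation score computed on the training folds only. A self-contained sketch; the deliberately smaller grid, fixed random_state, and n_jobs=-1 are my choices for speed, not from the original:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

param_grid = [{'weights': ['distance'], 'n_neighbors': [3, 5], 'p': [1, 2]}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
grid_search.fit(x_train, y_train)

# best_score_ is a cross-validation score on the training folds;
# scoring the best estimator on x_test gives an independent estimate
test_score = grid_search.best_estimator_.score(x_test, y_test)
print(grid_search.best_params_)
print(test_score)
```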
3. Why normalize the data
 | Tumor size (cm) | Discovery time (days)
---|---|---
Sample 1 | 1 | 200
Sample 2 | 5 | 100
The distance between samples is dominated by the discovery time.
Data normalization
Solution: map all features onto the same scale.
Min-max normalization: maps all data into the range [0, 1]
x_scale = (x - x_min) / (x_max - x_min)
Suited to distributions with clear boundaries; heavily affected by outliers.
Mean-variance normalization (standardization): maps all data to a distribution with mean 0 and variance 1
Suited to distributions without clear boundaries, where extreme values may exist.
x_scale = (x - x_mean) / S
Note: x_mean is the mean; S is the standard deviation.
Example
import numpy as np

x = np.random.randint(0, 100, size=100)
# min-max normalization
x_data = (x - np.min(x)) / (np.max(x) - np.min(x))

X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))

# mean-variance normalization (standardization)
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])
print(np.mean(X2[:, 0]))
print(np.std(X2[:, 0]))
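The per-column loops above can also be written with broadcasting over axis=0, which scales to any number of features. A quick sketch verifying the defining property of each mapping; the seeded generator is my addition for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(50, 2)).astype(float)

# min-max: each column lands exactly in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: each column gets mean 0 and std 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax.min(), X_minmax.max())        # 0.0 1.0
print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```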
Normalizing the test set
Normalize the test set using the mean_train and std_train computed from the training data:
(x_test - mean_train) / std_train
This yields the normalized test set.
The test data simulate the real environment,
and in a real environment you often cannot obtain the mean and variance of all incoming data.
Normalizing the data is itself part of the algorithm.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
standardScaler = StandardScaler()
# after fit, the scaler stores the mean and standard deviation of the training data
standardScaler.fit(x_train)
x_train = standardScaler.transform(x_train)
x_test_standard = standardScaler.transform(x_test)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train, y_train)
score = knn_clf.score(x_test_standard, y_test)
print(standardScaler.mean_)
print(standardScaler.scale_)
print(score)  # 1.0
After normalization, the accuracy is 1.0.
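The point that normalization is part of the algorithm can be made explicit with scikit-learn's Pipeline: the scaler is fit on the training data only, and its transform is re-applied automatically to anything passed to score or predict. A sketch reproducing the iris experiment above:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

pipe = Pipeline([
    ("scale", StandardScaler()),          # fit sees training data only
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(x_train, y_train)    # scaler.fit + knn.fit in one step
acc = pipe.score(x_test, y_test)  # x_test is transformed internally
print(acc)
```

This removes the main pitfall of manual scaling: it is impossible to accidentally fit the scaler on the test set, or to forget to transform the test set before scoring.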