Learning Data Mining with Python: the Ionosphere dataset, scikit-learn estimators, the K-nearest neighbors classifier, cross-validation, and parameter tuning

Download ionosphere.data from: http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/

Source code and related materials: https://github.com/xxg1413/MachineLearning/tree/master/Learning%20Data%20Mining%20with%20Python/Chapter2

import numpy as np
import csv
data_filename="D:\\python27\\study\\code\\Chapter2\\ionosphere.data"
# Initialize the arrays that will receive the data (351 samples, 34 features each)

X = np.zeros( (351, 34),dtype='float')
y = np.zeros((351,),dtype='bool')

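# Read the data file
with open(data_filename, 'r') as data:
    reader = csv.reader(data)
    for i, row in enumerate(reader):   # enumerate gives the index of each row
        X[i] = [float(datum) for datum in row[:-1]]   # the first 34 values of each sample are the features
        y[i] = row[-1] == 'g'   # label 'g' (good) becomes True, 'b' (bad) becomes False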
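The loop above hard-codes the array shape (351 rows, 34 features). As a rough alternative, the sketch below collects rows into lists and converts them at the end, so the shape is inferred from the file. The helper name `load_ionosphere` and the two sample rows are made up for illustration; they are not part of the original code.

```python
import csv
import io

import numpy as np


def load_ionosphere(lines):
    """Parse ionosphere-style CSV rows into (X, y).

    `lines` is any iterable of text lines, e.g. an open file object.
    The last column is the label: 'g' (good) becomes True, anything else False.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        features.append([float(datum) for datum in row[:-1]])
        labels.append(row[-1] == 'g')
    return np.array(features, dtype=float), np.array(labels, dtype=bool)


# Two made-up rows with 3 features each, only to show the call shape
sample = io.StringIO("1,0,0.5,g\n1,0,-0.25,b\n")
X_demo, y_demo = load_ionosphere(sample)
print(X_demo.shape, y_demo)
```

With the real file, the call would presumably look like `with open(data_filename) as f: X, y = load_ionosphere(f)`.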
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=14)

from sklearn.neighbors import KNeighborsClassifier   # import the K-nearest neighbors classifier and create an instance
estimator = KNeighborsClassifier()

estimator.fit(X_train,y_train)
y_predicted = estimator.predict(X_test)

# Accuracy is the percentage of test samples whose predicted label matches the true label
accuracy = np.mean(y_test == y_predicted) * 100
print("Accuracy:", accuracy)
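To make the prediction step less of a black box, here is a minimal NumPy sketch of the idea behind K-nearest neighbors: find the closest training samples by Euclidean distance and take a majority vote. It assumes the defaults of `KNeighborsClassifier` (5 neighbors, Euclidean distance, uniform weights) and boolean labels as above. The function `knn_predict` and the toy data are illustrative only, and this is not how scikit-learn implements it internally.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, n_neighbors=5):
    """Predict boolean labels by majority vote among the nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=bool)
    X_test = np.asarray(X_test, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the n_neighbors closest training samples for each test sample
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    # Fraction of neighbors labelled True; predict True only on a strict majority
    votes = y_train[nearest].mean(axis=1)
    return votes > 0.5


# Toy data: two well-separated clusters (made up for illustration)
X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_tr = np.array([False, False, False, True, True, True])
X_te = np.array([[0.05, 0.05], [0.95, 0.95]])

y_pred = knn_predict(X_tr, y_tr, X_te, n_neighbors=3)
print(y_pred)  # [False  True]
```

With an even number of neighbors a tie is possible; this sketch simply resolves ties to False.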


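# Cross-validation (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import cross_val_score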

scores = cross_val_score(estimator,X,y,scoring='accuracy')
avg_accuracy = np.mean(scores) * 100
print("Average accuracy:", avg_accuracy)
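Cross-validation repeats the train/test procedure on several different splits so that every sample is used for testing exactly once. The sketch below shows only the splitting step, as a simplified illustration. For classifiers, `cross_val_score` uses stratified folds without shuffling, and its default number of folds is 3 in older scikit-learn releases and 5 from version 0.22 on, whereas this sketch shuffles and ignores class balance. The helper `k_fold_indices` is a hypothetical name.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=3, seed=14):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    # Split the shuffled indices into n_folds nearly equal parts
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx


# 351 samples in 3 folds: each round trains on 234 samples and tests on 117
for train_idx, test_idx in k_fold_indices(351, n_folds=3):
    print(len(train_idx), len(test_idx))
```

In each round one would fit the estimator on `X[train_idx]`, score it on `X[test_idx]`, and average the scores at the end.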

# Tune the parameter to improve how well the algorithm generalizes: vary the number of neighbors
avg_scores = []
all_scores = []

num_size = list(range(1, 21))  # 1 through 20 inclusive

for n_neighbors in num_size:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot(num_size,avg_scores,'-o',linewidth=5, markersize=12)

# As the number of neighbors increases, the accuracy trends downward
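Besides reading the best value off the plot, the neighbor count with the highest mean score can be picked programmatically. The sketch below is one simple way to do that. The helper `best_n_neighbors` is a hypothetical name, and the demo scores are made up for illustration; they are not results from the Ionosphere data.

```python
import numpy as np


def best_n_neighbors(candidates, mean_scores):
    """Return the candidate with the highest mean score, and that score.

    On ties, np.argmax picks the first maximum, i.e. the smallest candidate
    when the candidates are listed in increasing order.
    """
    best = int(np.argmax(mean_scores))
    return candidates[best], float(mean_scores[best])


# Made-up scores purely for illustration
demo_candidates = [1, 2, 3, 4]
demo_scores = [0.80, 0.85, 0.85, 0.70]
k, score = best_n_neighbors(demo_candidates, demo_scores)
print(k, score)  # 2 0.85
```

With the variables from the loop above, the call would be `best_n_neighbors(num_size, avg_scores)`.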