[Python Machine Learning and Practice (6)] Implementing Principal Component Analysis (PCA) with Sklearn

阿新 • Published: 2018-07-19

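1. PCA Principles

Principal Component Analysis (PCA) is a statistical method. It applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables; the transformed variables are called principal components.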

PCA algorithm:

[Figure: the PCA algorithm]
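
A common way to carry out PCA is to center the data, compute the covariance matrix, take its eigen-decomposition, and project the data onto the eigenvectors with the largest eigenvalues. The sketch below follows those steps with NumPy only. It is a minimal illustration, not necessarily the exact procedure of the original post: the function name `pca_eig` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def pca_eig(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                  # 1. center each feature
    cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1][:n_components] # 4. indices of the largest eigenvalues
    components = eigvecs[:, order]                   # columns are the principal directions
    return X_centered @ components, eigvals[order]   # 5. project the centered data


rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Synthetic data (an assumption): two strongly correlated variables plus an independent one.
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=500), rng.normal(size=500)])

Z, variances = pca_eig(X, n_components=2)
C = np.cov(Z, rowvar=False)
print(np.round(C, 3))   # off-diagonal entries are ~0: the new variables are uncorrelated
```

The diagonal of `C` holds the variance captured by each principal component, which equals the corresponding eigenvalue.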

2. Implementing PCA

Dataset:

64-dimensional handwritten digit images

Code:
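
To make "64-dimensional" concrete: each sample is an 8x8 grayscale image flattened into 64 values. The sketch below uses `sklearn.datasets.load_digits`, which scikit-learn documents as a copy of the test portion of the UCI optdigits data; using it instead of the CSV files from the post is an assumption made here so the example runs without network access.

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)     # (1797, 64): one flattened image per row
print(digits.images.shape)   # (1797, 8, 8): the same data as 8x8 images

# Reshaping a 64-value row recovers the 8x8 image it came from.
first = digits.data[0].reshape(8, 8)
print(first.astype(int))
print("label:", digits.target[0])
```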

#coding=utf-8
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from sklearn.svm import LinearSVC
 
from sklearn.metrics import classification_report

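# 1. Initialize a linearly dependent matrix and compute its rank
M = np.array([[1, 2], [2, 4]])   # a 2x2 matrix whose rows are linearly dependent
print(np.linalg.matrix_rank(M, tol=None))  # rank of the matrix: 1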

# 2. Load the training and test datasets.
digits_train = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra', header=None)
digits_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes', header=None)
print(digits_train.shape)   # (3823, 65): 3,823 samples, each with 64 features and 1 label
print(digits_test.shape)    # (1797, 65)

# 3. Reduce the data to 2 dimensions and visualize it

# 3.1 Split the training data into feature vectors and labels
X_digits = digits_train[np.arange(64)]         # the 64 feature columns

y_digits = digits_train[64]                    # the corresponding labels

# 3.2 PCA: reduce to 2 dimensions
estimator = PCA(n_components=2)
X_pca = estimator.fit_transform(X_digits)

# 3.3 Plot the 10 digit classes in the 2-D space produced by PCA
def plot_pca_scatter():
    colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
    labels = y_digits.to_numpy()   # replaces as_matrix(), which was removed in pandas 1.0
    for i in range(len(colors)):
        px = X_pca[:, 0][labels == i]
        py = X_pca[:, 1][labels == i]
        plt.scatter(px, py, c=colors[i])
    plt.legend(np.arange(0, 10).astype(str))
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.show()
plot_pca_scatter()

# 4. Train and predict with an SVM on the original 64-dimensional data and on data reduced to 20 dimensions

# 4.1 Separate feature vectors and class labels for the training and test data
X_train = digits_train[np.arange(64)]
y_train = digits_train[64]
X_test = digits_test[np.arange(64)]
y_test = digits_test[64]

# 4.2 Train an SVM on the 64-dimensional data
svc = LinearSVC()  # linear support vector classifier
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

# 4.3 Train an SVM on the 20-dimensional data
estimator = PCA(n_components=20)   # compress the 64-dimensional images to 20 dimensions
pca_X_train = estimator.fit_transform(X_train)   # learn the 20 orthogonal directions from the training features and transform them
pca_X_test = estimator.transform(X_test)         # project the test features onto the same directions

pca_svc = LinearSVC()
pca_svc.fit(pca_X_train, y_train)
pca_y_pred = pca_svc.predict(pca_X_test)

# 5. Report the results
# Results of the model trained on 64 dimensions
print(svc.score(X_test, y_test))
print(classification_report(y_test, y_pred, target_names=np.arange(10).astype(str)))

# Results of the model trained on 20 dimensions
print(pca_svc.score(pca_X_test, y_test))
print(classification_report(y_test, pca_y_pred, target_names=np.arange(10).astype(str)))

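Results:

1) The data compressed to two dimensions, visualized in the 2-D plane.

[Figure: scatter plot of the ten digit classes, First Principal Component vs. Second Principal Component]

2) SVM results on the 64-dimensional and the 20-dimensional data (in that order)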

0.9220923761825265
             precision    recall  f1-score   support

          0       0.99      0.98      0.99       178
          1       0.97      0.76      0.85       182
          2       0.99      0.98      0.98       177
          3       1.00      0.87      0.93       183
          4       0.95      0.97      0.96       181
          5       0.90      0.97      0.93       182
          6       0.99      0.97      0.98       181
          7       0.99      0.90      0.94       179
          8       0.67      0.97      0.79       174
          9       0.90      0.86      0.88       180

avg / total       0.94      0.92      0.92      1797

0.9248747913188647
             precision    recall  f1-score   support

          0       0.97      0.96      0.96       178
          1       0.88      0.90      0.89       182
          2       0.96      0.99      0.97       177
          3       0.99      0.91      0.95       183
          4       0.92      0.96      0.94       181
          5       0.87      0.96      0.91       182
          6       0.98      0.97      0.98       181
          7       0.98      0.89      0.93       179
          8       0.91      0.83      0.86       174
          9       0.83      0.88      0.85       180

avg / total       0.93      0.92      0.93      1797

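Conclusion: after reducing the data from 64 to 20 dimensions, accuracy stays at essentially the same level (0.9249 versus 0.9221 in this run, a gap small enough that it may not be significant), while the model uses far fewer dimensions. In general, dimensionality reduction can cost some accuracy in exchange for a more compact representation.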
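
One way to judge how many dimensions to keep is to look at how much of the total variance the first k components explain. The sketch below does this with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`. It assumes the bundled `load_digits` data rather than the CSV files used in the post, so the numbers it prints will not match the run above exactly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)   # keep all 64 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Fraction of the total variance retained by the first k components.
for k in (2, 20, 64):
    print(k, round(float(cumulative[k - 1]), 3))

# Smallest number of components that retains at least 95% of the variance.
k95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% variance:", k95)
```

scikit-learn also accepts a fraction directly, for example `PCA(n_components=0.95)`, which selects the number of components for you.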

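3. Advantages and Disadvantages of PCA

The main advantages of PCA are:

1) It measures information solely by variance and is not affected by factors outside the dataset.

2) The principal components are mutually orthogonal, which removes the correlations among the original features.

3) The computation is simple: the main operation is an eigenvalue decomposition, so it is easy to implement.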
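
The orthogonality claim can be checked numerically. The sketch below fits scikit-learn's `PCA` and verifies that the principal directions are orthonormal and that the projected features are uncorrelated. It assumes the bundled `load_digits` data, and it sets `svd_solver='full'` so that the decomposition is exact rather than randomized.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=20, svd_solver='full')
Z = pca.fit_transform(X)

# Rows of components_ are the principal directions; orthonormal means W W^T = I.
W = pca.components_
is_orthonormal = np.allclose(W @ W.T, np.eye(20))
print(is_orthonormal)

# The covariance matrix of the projected data should be diagonal.
C = np.cov(Z, rowvar=False)
max_off_diag = np.abs(C - np.diag(np.diag(C))).max()
print(max_off_diag)   # numerically zero
```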

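The main disadvantages of PCA are:

1) The meaning of each principal component is somewhat ambiguous, so the components are harder to interpret than the original features.

2) Low-variance components outside the retained principal components may still carry important information about differences between samples, so discarding them during dimensionality reduction can affect later processing.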
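
The second disadvantage is easy to reproduce on synthetic data. In the sketch below, the class label depends only on a low-variance feature, while a high-variance feature is pure noise; keeping one principal component retains the noise and discards the signal. The data, the variable names, and the choice of `LogisticRegression` as the classifier are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)                           # two classes
noise = rng.normal(scale=10.0, size=n)                   # high variance, unrelated to the class
signal = (2 * y - 1) + rng.normal(scale=0.1, size=n)     # low variance, determines the class
X = np.column_stack([noise, signal])

# Training accuracy with both original features.
acc_full = LogisticRegression().fit(X, y).score(X, y)

# PCA keeps the direction of largest variance, which here is the noise axis.
X_1d = PCA(n_components=1, svd_solver='full').fit_transform(X)
acc_reduced = LogisticRegression().fit(X_1d, y).score(X_1d, y)

print("accuracy with both features:", acc_full)
print("accuracy after PCA to 1 dimension:", acc_reduced)
```

With both features the classes are separable, whereas after the reduction the classifier does little better than guessing, because PCA never looks at the labels when choosing directions.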
