Artificial Unintelligence Study Notes: Machine Learning (11) PCA Dimensionality Reduction

I. Concept

Principal Component Analysis (PCA) is the most widely used linear dimensionality-reduction method. Its goal is to map high-dimensional data into a lower-dimensional space through a linear projection: the original n features are replaced by a smaller number m of new features, each of which is a linear combination of the old ones. The projection is chosen so that the variance of the data along the projected dimensions is as large as possible and the m new features are as uncorrelated with one another as possible. The mapping from old features to new features captures the intrinsic variability in the data, so fewer dimensions are used while most of the character of the original data points is preserved.

II. Algorithm

1. Center all samples (subtract the mean of each feature)
2. Compute the covariance matrix of the samples
3. Perform an eigenvalue decomposition of the covariance matrix
4. Take the eigenvectors corresponding to the d largest eigenvalues and use them to build the projection matrix W
The dimension d of the low-dimensional space is usually chosen in one of two ways:
1) pick a good d by cross-validation
2) set a threshold based on the algorithm's principle, e.g. t = 0.95, and take the smallest d for which
    (λ1 + λ2 + ... + λd) / (λ1 + λ2 + ... + λn) >= t, where the eigenvalues λi are sorted in descending order
PCA dimensionality reduction follows two criteria:
Minimum reconstruction error: the total error between the reconstructed points and the original points is as small as possible
Maximum separability: the projections of the sample points in the low-dimensional space are spread out as much as possible
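To make the procedure above concrete, here is a minimal NumPy sketch of the same steps (centering, covariance, eigendecomposition, threshold-based choice of d). The function name pca_naive and the random toy data are illustrative assumptions, not part of the original post.

import numpy as np

def pca_naive(X, t=0.95):
    # 1. center the data: subtract the per-feature mean
    X_centered = X - X.mean(axis=0)
    # 2. sample covariance matrix (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # 3. eigendecomposition of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # eigh returns ascending order, so sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. smallest d whose cumulative variance ratio reaches the threshold t
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(ratio, t)) + 1
    W = eigvecs[:, :d]                         # projection matrix, n_features x d
    return X_centered @ W, W, eigvals

X = np.random.rand(100, 4)                     # toy data: 100 samples, 4 features
X_low, W, eigvals = pca_naive(X, t=0.95)
print(X_low.shape, W.shape)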

III. The API provided by sklearn

sklearn's decomposition module provides a family of PCA-style dimensionality-reduction methods:

"""
The :mod:`sklearn.decomposition` module includes matrix decomposition
algorithms, including among others PCA, NMF or ICA. Most of the algorithms of
this module can be regarded as dimensionality reduction techniques.
"""

from .nmf import NMF, non_negative_factorization
from .pca import PCA, RandomizedPCA
from .incremental_pca import IncrementalPCA
from .kernel_pca import KernelPCA
from .sparse_pca import SparsePCA, MiniBatchSparsePCA
from .truncated_svd import TruncatedSVD
from .fastica_ import FastICA, fastica
from .dict_learning import (dict_learning, dict_learning_online, sparse_encode,
                            DictionaryLearning, MiniBatchDictionaryLearning,
                            SparseCoder)
from .factor_analysis import FactorAnalysis
from ..utils.extmath import randomized_svd
from .online_lda import LatentDirichletAllocation

__all__ = ['DictionaryLearning',
           'FastICA',
           'IncrementalPCA',
           'KernelPCA',
           'MiniBatchDictionaryLearning',
           'MiniBatchSparsePCA',
           'NMF',
           'PCA',
           'RandomizedPCA',
           'SparseCoder',
           'SparsePCA',
           'dict_learning',
           'dict_learning_online',
           'fastica',
           'non_negative_factorization',
           'randomized_svd',
           'sparse_encode',
           'FactorAnalysis',
           'TruncatedSVD',
           'LatentDirichletAllocation']

Among these, KernelPCA lets you choose a suitable kernel function and is mainly aimed at reducing the dimensionality of non-linear data. RandomizedPCA performs linear dimensionality reduction using randomized SVD, projecting the data onto the singular vectors of a lower-dimensional space. IncrementalPCA mainly addresses the memory limits of a single machine: it splits the data into multiple batches and calls partial_fit on each batch in turn, step by step arriving at the final reduction (see the sketch below). SparsePCA and MiniBatchSparsePCA use L1 regularization, which drives the influence of many non-principal components to zero and reduces the effect of noise and similar factors on the PCA result. The difference between them is that MiniBatchSparsePCA uses only a subset of the sample features and a given number of iterations, which alleviates the slow factorization on large samples, at the possible cost of some accuracy in the reduction. Both SparsePCA and MiniBatchSparsePCA require tuning the L1 regularization parameter.
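As a rough illustration of the batched IncrementalPCA workflow described above (the data size, batch count and component count below are made-up assumptions for the example):

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.rand(10000, 50)                    # hypothetical dataset too large to decompose in one go

ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 10):        # feed 10 batches of 1000 samples each
    ipca.partial_fit(batch)                # incrementally update the principal components

X_reduced = ipca.transform(X)              # project onto the 10 learned components
print(X_reduced.shape)                     # (10000, 10)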

Main parameters of PCA:

n_components: the number of feature dimensions you want PCA to keep after the reduction
whiten: whether to whiten the output, i.e. normalize each feature of the reduced data
svd_solver: which SVD (singular value decomposition) algorithm to use
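A quick sketch of how these parameters are typically combined (the 0.95 variance threshold is an arbitrary choice for this example, not a recommendation from the original post):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
# keep enough components to explain 95% of the variance (a fractional
# n_components requires svd_solver='full'), and whiten the output so
# each component has unit variance
pca = PCA(n_components=0.95, whiten=True, svd_solver='full')
X_r = pca.fit_transform(X)
print(X_r.shape, pca.explained_variance_ratio_)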

 Parameters
    ----------
    n_components : int, float, None or string
        Number of components to keep.
        if n_components is not set all components are kept::

            n_components == min(n_samples, n_features)

        if n_components == 'mle' and svd_solver == 'full', Minka's MLE is used
        to guess the dimension
        if ``0 < n_components < 1`` and svd_solver == 'full', select the number
        of components such that the amount of variance that needs to be
        explained is greater than the percentage specified by n_components
        n_components cannot be equal to n_features for svd_solver == 'arpack'.

    copy : bool (default True)
        If False, data passed to fit are overwritten and running
        fit(X).transform(X) will not yield the expected results,
        use fit_transform(X) instead.

    whiten : bool, optional (default False)
        When True (False by default) the `components_` vectors are multiplied
        by the square root of n_samples and then divided by the singular values
        to ensure uncorrelated outputs with unit component-wise variances.

        Whitening will remove some information from the transformed signal
        (the relative variance scales of the components) but can sometime
        improve the predictive accuracy of the downstream estimators by
        making their data respect some hard-wired assumptions.

    svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
        auto :
            the solver is selected by a default policy based on `X.shape` and
            `n_components`: if the input data is larger than 500x500 and the
            number of components to extract is lower than 80% of the smallest
            dimension of the data, then the more efficient 'randomized'
            method is enabled. Otherwise the exact full SVD is computed and
            optionally truncated afterwards.
        full :
            run exact full SVD calling the standard LAPACK solver via
            `scipy.linalg.svd` and select the components by postprocessing
        arpack :
            run SVD truncated to n_components calling ARPACK solver via
            `scipy.sparse.linalg.svds`. It requires strictly
            0 < n_components < X.shape[1]
        randomized :
            run randomized SVD by the method of Halko et al.

        .. versionadded:: 0.18.0

    tol : float >= 0, optional (default .0)
        Tolerance for singular values computed by svd_solver == 'arpack'.

        .. versionadded:: 0.18.0

    iterated_power : int >= 0, or 'auto', (default 'auto')
        Number of iterations for the power method computed by
        svd_solver == 'randomized'.

        .. versionadded:: 0.18.0

    random_state : int, RandomState instance or None, optional (default None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``svd_solver`` == 'arpack' or 'randomized'.

def __init__(self, n_components=None, copy=True, whiten=False,
                 svd_solver='auto', tol=0.0, iterated_power='auto',
                 random_state=None):
        self.n_components = n_components
        self.copy = copy
        self.whiten = whiten
        self.svd_solver = svd_solver
        self.tol = tol
        self.iterated_power = iterated_power
        self.random_state = random_state

The output of PCA is Y = WᵀX, which reduces X from its original dimensionality down to k dimensions. In addition, PCA exposes two important attributes:
explained_variance_: the variance of each principal component after the reduction; the larger the variance, the more important the component.
explained_variance_ratio_: the fraction of the total variance accounted for by each principal component; the larger the ratio, the more important the component.
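A common way to use explained_variance_ratio_ is to look at its cumulative sum when deciding how many components to keep (a small sketch, using the iris data purely for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA().fit(load_iris().data)
# cumulative fraction of the total variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))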

The complete example code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets,decomposition,manifold
from itertools import cycle
def load_data():
    # the iris dataset: 150 samples, 4 features, 3 classes
    iris=datasets.load_iris()
    return iris.data,iris.target


# the decomposition methods to compare (FastICA included just for contrast)
PCA_Set=[   
    decomposition.PCA(n_components=None),
    decomposition.PCA(svd_solver = 'randomized'), 
    decomposition.SparsePCA(n_components=None),
    decomposition.IncrementalPCA(n_components=None),
    decomposition.KernelPCA(n_components=None,kernel='linear'),
    decomposition.KernelPCA(n_components=None,kernel='rbf'),
    decomposition.KernelPCA(n_components=None,kernel='poly'),
    decomposition.KernelPCA(n_components=None,kernel='sigmoid'),  
    decomposition.FastICA(n_components=None)
    ]
PCA_Set_Name=[   
    'Default',
    'Randomized',
    'Sparse',
    'Incremental',
    'Kernel(linear)',
    'Kernel(rbf)',
    'Kernel(poly)',
    'Kernel(sigmoid)',  
    'ICA'
    ]    



def plot_PCA(*data):
    # plot the first two original features, then the first two components of each projection
    X,Y=data
    fig=plt.figure("PCA",figsize=(20, 8))

    ax=fig.add_subplot(2,5,1) 
    colors=cycle('rgbcmykw')
    for label,color in zip(np.unique(Y),colors):
        position=Y==label
        ax.scatter(X[position,0],X[position,1],label="target=%d"%label,color=color)
    plt.xticks(fontsize=10, color="darkorange")  
    plt.yticks(fontsize=10, color="darkorange") 
    ax.set_title('Original')  
    
    for i,PCA in enumerate(PCA_Set):
        pca=PCA
        pca.fit(X)       
        X_r=pca.transform(X)

        if i==0:
            print("Explained variance of each principal component: "+str(pca.explained_variance_))
            print("Explained variance ratio of each principal component: "+str(pca.explained_variance_ratio_))

        ax=fig.add_subplot(2,5,i+2)   
        colors=cycle('rgbcmykw')
        for label,color in zip(np.unique(Y),colors):
            position=Y==label
            ax.scatter(X_r[position,0],X_r[position,1],label="target=%d"%label,color=color)
        plt.xticks(fontsize=10, color="darkorange")  
        plt.yticks(fontsize=10, color="darkorange") 
        ax.set_title(PCA_Set_Name[i])           
    plt.show()

X,Y=load_data()
plot_PCA(X,Y)


PS: ICA is Independent Component Analysis. Although it looks somewhat similar to PCA, it does a completely different job: instead of maximizing variance, it tries to separate mixed signals into statistically independent source signals. I only brought it in here as a guest to fill out the figure~
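For contrast, here is a small sketch of what ICA is actually for: blind source separation of mixed signals. The sine/square-wave sources and the mixing matrix are made up for this example.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]      # two independent sources: a sine and a square wave
A = np.array([[1.0, 0.5], [0.5, 2.0]])                # mixing matrix
X = S @ A.T                                           # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)                    # recover the independent sources (up to scale and order)
print(S_estimated.shape)                              # (2000, 2)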


IV. Summary

PCA is one of the older techniques in multivariate analysis; it originates from the Karhunen-Loève (K-L) transform in communication theory. In essence, it applies a linear transformation to the original features and maps them into a lower-dimensional space while representing them as faithfully as possible. PCA tries to preserve as much of the data's intrinsic information as possible after the reduction, and it measures the importance of a projection direction by the variance of the data along it. However, projecting this way does not necessarily help discriminate between classes; it may instead mix the data points together so that they can no longer be separated. This is PCA's biggest weakness, and it is why classification after PCA often performs poorly.

V. Related learning resources