
An Introduction to Cross-Validation in Machine Learning

1. What is cross-validation?

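        Cross-validation is a technique for situations where the data available for an experiment is limited but we still want to train a good model. The idea is to reuse the data: split the given data, combine the resulting pieces into training and test sets, and on that basis repeatedly train, test, and select models. Below I introduce two cross-validation methods I have used. Both rely on the sklearn library, whose functions can be called directly; the real work lies in setting the parameters and matching the method to your application.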

2. Common cross-validation methods

        (1) Simple cross-validation:

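           Randomly split the dataset into two parts, one used as the training set and the other as the test set. Then train models on the training set under various conditions (for example, with different numbers of parameters) to obtain different models, evaluate the test error of each model on the test set, and pick the model with the smallest test error. For example, 70% of the data can serve as the training set and 30% as the test set. In Python this is done as follows: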

sklearn.model_selection.train_test_split(*arrays, **options)
Parameters:	
          *arrays : sequence of indexables with same length / shape[0]

                    Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

          test_size : float, int, None, optional

                    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.

          train_size : float, int, or None, default None

                     If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

          random_state : int, RandomState instance or None, optional (default=None)

                     If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

          shuffle : boolean, optional (default=True)

                     Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

          stratify : array-like or None (default is None)

                     If not None, data is split in a stratified fashion, using this as the class labels.

          Returns:	
                     splitting : list, length=2 * len(arrays)

                     List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.
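
Of the parameters above, stratify is the one most easily overlooked. The following is a minimal sketch, not part of the original documentation, showing how it keeps the class ratio the same in both splits. The toy data (80 samples of class 0 and 20 of class 1) is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape((100, 2))
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Because the labels are 0/1, the mean is the fraction of class 1.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print("class-1 fraction in train:", train_ratio)
print("class-1 fraction in test: ", test_ratio)
```

Both fractions should come out close to the overall 20% share of class 1, whereas a plain random split on a small dataset can drift away from it.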

Typical usage looks like this:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
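
The model-selection step of simple cross-validation (train several models on the training set, compare their test errors, keep the best one) can be sketched as follows. This is an illustration under assumptions of my own: the iris dataset, a decision tree, and max_depth as the setting being varied are all choices made for the example, not something prescribed by the method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70% of the data for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train one model per candidate setting and record its test error.
test_errors = {}
for max_depth in (1, 2, 3, 5):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    test_errors[max_depth] = 1.0 - model.score(X_test, y_test)

# Keep the setting with the smallest test error.
best_depth = min(test_errors, key=test_errors.get)
print(test_errors)
print("best max_depth:", best_depth)
```

Since the same test set is used to pick the winner, the winning error is likely to be somewhat optimistic as an estimate of performance on new data.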

        (2) K-fold cross-validation (K-Fold)

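            How it works: first randomly partition the given data into K disjoint subsets of equal (or nearly equal) size; then train the model on K-1 of the subsets and test it on the remaining one; repeat this for each of the K possible choices of held-out subset; finally, select the model with the smallest average error over the K evaluations. Below is the sklearn source for this cross-validator. When using it, pay attention to how its parameters are set.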
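
Before reading the library source, it may help to see the procedure just described written out as a loop. This is a rough sketch under assumptions of my own (the iris dataset, a decision tree, max_depth as the setting being compared, and K = 5), not the only way to do it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True because the iris samples are stored ordered by class;
# a fixed random_state makes every call to split() return the same folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

mean_errors = {}
for max_depth in (1, 2, 3, 5):
    fold_errors = []
    for train_index, test_index in kf.split(X):
        # Train on K-1 folds, test on the held-out fold.
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        model.fit(X[train_index], y[train_index])
        fold_errors.append(1.0 - model.score(X[test_index], y[test_index]))
    # Average the K test errors for this candidate.
    mean_errors[max_depth] = float(np.mean(fold_errors))

# Select the candidate with the smallest average error.
best_depth = min(mean_errors, key=mean_errors.get)
print(mean_errors)
print("best max_depth:", best_depth)
```

Every sample is used for testing exactly once and for training K-1 times, which is what lets a small dataset be reused.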

class KFold(_BaseKFold):
    """K-Folds cross-validator
    Provides train/test indices to split data in train/test sets. Split
    dataset into k consecutive folds (without shuffling by default).
    Each fold is then used once as a validation while the k - 1 remaining
    folds form the training set.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
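    n_splits : int, default=3  # sets the number of cross-validation rounds; more is not always better. Around 10 is a common choice, but set it to suit your application. (This listing is from an older release; newer scikit-learn versions default to 5.)
        Number of folds. Must be at least 2.
    shuffle : boolean, optional  # optional setting
        Whether to shuffle the data before splitting into batches.
    random_state : int, RandomState instance or None, optional, default=None  # random state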
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``shuffle`` == True.
    Examples
    --------
    >>> from sklearn.model_selection import KFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> kf = KFold(n_splits=2)
    >>> kf.get_n_splits(X)
    2
    >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE
    KFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in kf.split(X):
    ...    print("TRAIN:", train_index, "TEST:", test_index)
    ...    X_train, X_test = X[train_index], X[test_index]
    ...    y_train, y_test = y[train_index], y[test_index]
    TRAIN: [2 3] TEST: [0 1]
    TRAIN: [0 1] TEST: [2 3]
    Notes
    -----
    The first ``n_samples % n_splits`` folds have size
    ``n_samples // n_splits + 1``, other folds have size
    ``n_samples // n_splits``, where ``n_samples`` is the number of samples.
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting ``random_state``
    to an integer.
    See also
    --------
    StratifiedKFold
        Takes group information into account to avoid building folds with
        imbalanced class distributions (for binary or multiclass
        classification tasks).
    GroupKFold: K-fold iterator variant with non-overlapping groups.
    RepeatedKFold: Repeats K-Fold n times.
    """

    def __init__(self, n_splits=3, shuffle=False,
                 random_state=None):
        super(KFold, self).__init__(n_splits, shuffle, random_state)

    def _iter_test_indices(self, X, y=None, groups=None):
        n_samples = _num_samples(X)
        indices = np.arange(n_samples)
        if self.shuffle:
            check_random_state(self.random_state).shuffle(indices)

        n_splits = self.n_splits
        # np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype.
        fold_sizes = (n_samples // n_splits) * np.ones(n_splits, dtype=int)
        fold_sizes[:n_samples % n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            yield indices[start:stop]
            current = stop
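
The fold-size rule in _iter_test_indices (and in the Notes section of the docstring) is easiest to see when the number of samples does not divide evenly. Here is a small check, with 10 samples and 3 folds chosen purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape((10, 2))  # 10 samples, 2 features
kf = KFold(n_splits=3)              # no shuffling, so folds are consecutive

# Collect the test indices of each fold.
test_folds = [test_index for _, test_index in kf.split(X)]
fold_sizes = [len(fold) for fold in test_folds]

print(fold_sizes)  # [4, 3, 3]
for fold in test_folds:
    print(fold)
# [0 1 2 3]
# [4 5 6]
# [7 8 9]
```

Since 10 % 3 == 1, the first fold gets 10 // 3 + 1 = 4 samples and the other two get 3 each, and together the folds cover every index exactly once.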

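Summary:

        This post briefly introduced the two cross-validation methods I use day to day. There are other cross-validation methods I have not yet used; I will add them later as I keep learning.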