An analysis of common data preprocessing and normalization methods in Python's sklearn

Standardization

Rescale the data so that each feature has a mean of 0 and a variance of 1.
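
As a minimal sketch of what "mean 0, variance 1" means in practice, the snippet below standardizes a small made-up matrix column by column using only NumPy. The data and variable names are illustrative assumptions, not part of sklearn.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows) x 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Per-feature statistics: axis=0 runs down each column.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation.
Z = (X - mean) / std

print(Z.mean(axis=0))  # each entry is approximately 0
print(Z.std(axis=0))   # each entry is approximately 1
```

Both features end up on the same scale even though the second column was ten times larger than the first.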

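  • sklearn.preprocessing.scale function: Standardize a dataset along any axis
    • The main source code is pasted below. At first glance it looks messy, but on closer reading it is the basic computation plus some conditional code, such as the checks for sparse matrices.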
# coding=utf-8
# Note: this is an excerpt from an older scikit-learn release; some arguments
# used below (for example warn_on_dtype) may not be accepted by current versions.
import warnings

import numpy as np
from scipy import sparse
from sklearn.utils import check_array
from sklearn.utils.sparsefuncs import inplace_column_scale, mean_variance_axis
from sklearn.utils.validation import FLOAT_DTYPES


def _handle_zeros_in_scale(scale, copy=True):
    ''' Makes sure that whenever scale is zero, we handle it correctly.

    This happens in most scalers when we have constant features.'''

    # if we are fitting on 1D arrays, scale might be a scalar
    if np.isscalar(scale):
        if scale == .0:
            scale = 1.
        return scale
    elif isinstance(scale, np.ndarray):
        if copy:
            # New array to avoid side-effects
            scale = scale.copy()
        scale[scale == 0.0] = 1.0
        return scale


def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
    """Standardize a dataset along any axis

    Center to the mean and component wise scale to unit variance.

    Read more in the :ref:`User Guide <preprocessing_scaler>`.

    Parameters
    ----------
    X : {array-like, sparse matrix}
        The data to center and scale.

    axis : int (0 by default)
        axis used to compute the means and standard deviations along. If 0,
        independently standardize each feature, otherwise (if 1) standardize
        each sample.

    with_mean : boolean, True by default
        If True, center the data before scaling.

    with_std : boolean, True by default
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).

    copy : boolean, optional, default True
        set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array or a scipy.sparse
        CSC matrix and if axis is 1).

    Notes
    -----
    This implementation will refuse to center scipy.sparse matrices
    since it would make them non-sparse and would potentially crash the
    program with memory exhaustion problems.

    Instead the caller is expected to either set explicitly
    `with_mean=False` (in that case, only variance scaling will be
    performed on the features of the CSC matrix) or to call `X.toarray()`
    if he/she expects the materialized dense array to fit in memory.

    To avoid memory copy the caller should pass a CSC matrix.

    See also
    --------
    StandardScaler: Performs scaling to unit variance using the``Transformer`` API
        (e.g. as part of a preprocessing :class:`sklearn.pipeline.Pipeline`).

    """  # noqa
    X = check_array(X, accept_sparse='csc', copy=copy, ensure_2d=False,
                    warn_on_dtype=True, estimator='the scale function',
                    dtype=FLOAT_DTYPES)
    if sparse.issparse(X):
        if with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` instead"
                " See docstring for motivation and alternatives.")
        if axis != 0:
            raise ValueError("Can only scale sparse matrix on axis=0, "
                             " got axis=%d" % axis)
        if with_std:
            _, var = mean_variance_axis(X, axis=0)
            var = _handle_zeros_in_scale(var, copy=False)
            inplace_column_scale(X, 1 / np.sqrt(var))
    else:
        X = np.asarray(X)
        if with_mean:
            mean_ = np.mean(X, axis)
        if with_std:
            scale_ = np.std(X, axis)
        # Xr is a view on the original array broadcasting on the axis in which
        # we are interested in
        # Author's note: the next line is confusing at first, because every
        # operation is applied to Xr and yet the function returns X. The reason
        # is that np.rollaxis returns a view of its input: Xr and X differ only
        # in axis order and share the same underlying data, so in-place changes
        # to Xr also change X. np.shares_memory(X, Xr) confirms this.
        Xr = np.rollaxis(X, axis)
        if with_mean:
            Xr -= mean_
            mean_1 = Xr.mean(axis=0)
            # Verify that mean_1 is 'close to zero'. If X contains very
            # large values, mean_1 can also be very large, due to a lack of
            # precision of mean_. In this case, a pre-scaling of the
            # concerned feature is efficient, for instance by its mean or
            # maximum.
            if not np.allclose(mean_1, 0):
                warnings.warn("Numerical issues were encountered "
                              "when centering the data "
                              "and might not be solved. Dataset may "
                              "contain too large values. You may need "
                              "to prescale your features.")
                Xr -= mean_1
        if with_std:
            scale_ = _handle_zeros_in_scale(scale_, copy=False)
            Xr /= scale_
            if with_mean:
                mean_2 = Xr.mean(axis=0)
                # If mean_2 is not 'close to zero', it comes from the fact that
                # scale_ is very small so that mean_2 = mean_1/scale_ > 0, even
                # if mean_1 was close to zero. The problem is thus essentially
                # due to the lack of precision of mean_. A solution is then to
                # subtract the mean again:
                if not np.allclose(mean_2, 0):
                    warnings.warn("Numerical issues were encountered "
                                  "when scaling the data "
                                  "and might not be solved. The standard "
                                  "deviation of the data is probably "
                                  "very close to 0. ")
                    Xr -= mean_2
    return X
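
The source above operates on Xr throughout and then returns X, which only works because np.rollaxis returns a view rather than a copy. The short sketch below illustrates that behaviour on a made-up array; the variable names mirror the source, but the data is an assumption for illustration.

```python
import numpy as np

# Hypothetical 2 x 3 array of floats.
X = np.arange(6, dtype=float).reshape(2, 3)

# Roll axis 1 to the front: the result has shape (3, 2).
Xr = np.rollaxis(X, 1)

print(Xr.shape)                 # (3, 2)
print(np.shares_memory(X, Xr))  # True: same buffer, different axis order

# An in-place operation on the view is visible through X as well.
Xr -= 1.0
print(X)
```

Because of this, an in-place subtraction or division on Xr standardizes X along the requested axis without allocating a second array.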
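  • Simplified version of the scale code
def scale_mean_var(input_arr, axis=0):
    # Equivalent library call:
    # from sklearn import preprocessing
    # output_arr = preprocessing.scale(input_arr.astype('float'), axis=axis)
    input_arr = np.asarray(input_arr, dtype=float)
    # keepdims=True lets the statistics broadcast against input_arr for any axis
    mean_ = np.mean(input_arr, axis=axis, keepdims=True)
    scale_ = np.std(input_arr, axis=axis, keepdims=True)
    # Subtract the mean
    output_arr = input_arr - mean_
    # Check that the mean is now close to 0; if not, subtract the residual
    mean_1 = output_arr.mean(axis=axis, keepdims=True)
    if not np.allclose(mean_1, 0):
        output_arr -= mean_1
    # Replace standard deviations of 0 (constant features) with 1
    # scale_ = _handle_zeros_in_scale(scale_, copy=False)
    scale_[scale_ == 0.0] = 1.0
    # Divide by the standard deviation
    output_arr /= scale_
    # Check once more that the mean is close to 0
    mean_2 = output_arr.mean(axis=axis, keepdims=True)
    if not np.allclose(mean_2, 0):
        output_arr -= mean_2

    return output_arr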
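
One way to check the simplified steps is to compare a manual computation against the library function. The sketch below assumes scikit-learn is installed and uses made-up data whose last column is constant, so that the zero standard deviation case is exercised.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data; the last column is constant, so its standard deviation is 0.
X = np.array([[1.0, 2.0, 5.0],
              [3.0, 6.0, 5.0],
              [5.0, 7.0, 5.0]])

# Manual standardization: subtract the mean, divide by the standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0.0] = 1.0  # constant feature: divide by 1 instead of 0
manual = (X - mean) / std

# Library result for comparison.
reference = scale(X)

print(np.allclose(manual, reference))
print(manual[:, 2])  # the constant column becomes all zeros
```

Replacing a zero standard deviation with 1 is what keeps the constant column at 0 instead of producing NaN from a 0/0 division.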

Min-max normalization

  • sklearn.preprocessing.minmax_scale function: Transforms features by scaling each feature to a given range.
        X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        X_scaled = X_std * (max - min) + min
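  • The simplified version is straightforward
def max_min(input_arr, o_min, o_max):
    """
    Transforms features by scaling each feature to a given range.
    """
    input_arr = np.asarray(input_arr, dtype=float)
    # Per-feature (column-wise) minimum and maximum, matching the formula above
    i_min = np.min(input_arr, axis=0)
    i_max = np.max(input_arr, axis=0)
    # Guard against division by zero for constant features
    i_range = i_max - i_min
    i_range = np.where(i_range == 0, 1.0, i_range)
    out_arr = (input_arr - i_min) / i_range
    # Map from [0, 1] to [o_min, o_max]; this is a no-op for the range (0, 1)
    return out_arr * (o_max - o_min) + o_min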
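
To make the two-step min-max formula concrete, here is a small worked example in plain NumPy. The data and the target range (-1, 1) are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical data: 3 samples x 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 500.0]])

o_min, o_max = -1.0, 1.0  # assumed target range

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Step 1: map each feature to [0, 1].
X_std = (X - col_min) / (col_max - col_min)
# Step 2: stretch and shift to [o_min, o_max].
X_scaled = X_std * (o_max - o_min) + o_min

print(X_scaled)
```

In each column the smallest value maps to -1 and the largest to 1; the middle value of the second column (300) lands exactly at 0 because it sits halfway between 100 and 500.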

Max-abs normalization

  • maxabs_scale function: Scale each feature by its maximum absolute value.
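def maxabs_scale(input_arr, axis=0):
    """
    Scale each feature to the [-1, 1] range without breaking the sparsity
    """
    if not isinstance(input_arr, np.ndarray):
        input_arr = np.asarray(input_arr).astype(np.float32)
    # Maximum absolute value along the chosen axis, kept broadcastable
    maxabs = np.max(np.abs(input_arr), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero features
    maxabs[maxabs == 0] = 1
    out_array = input_arr / maxabs
    return out_array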
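
The phrase "without breaking the sparsity" can be illustrated by comparing where zeros end up after max-abs scaling and after min-max scaling. The sketch below uses made-up data and plain NumPy; it is an illustration of the idea, not the library implementation.

```python
import numpy as np

# Hypothetical data with several zero entries.
X = np.array([[0.0, -4.0],
              [2.0,  0.0],
              [0.0,  8.0]])

# Max-abs scaling: divide each column by its largest absolute value.
max_abs = np.abs(X).max(axis=0)
X_maxabs = X / max_abs

# Min-max scaling to [0, 1] for comparison.
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Max-abs keeps every zero in place; min-max shifts the data, so it does not.
print(np.array_equal(X == 0, X_maxabs == 0))  # True
print(np.array_equal(X == 0, X_minmax == 0))  # False
```

Because max-abs scaling only divides and never shifts, a zero stays a zero, which is why it suits sparse inputs.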