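Python: A Walkthrough of Common Data Preprocessing and Normalization Methods in sklearn

阿新 • Published: 2018-12-09

Standardization

Normalize the data to zero mean and unit variance.

The sklearn.preprocessing.scale function: Standardize a dataset along any axis
- The main source code is pasted first. At first glance it looks messy, but on closer reading it is just the core logic plus some conditional code, such as the checks for sparse matrices.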

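#coding=utf-8
# Note: this listing follows an older scikit-learn release; some arguments
# used below (e.g. warn_on_dtype) are not accepted by later versions.
import warnings
import numpy as np
from scipy import sparse
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES
from sklearn.utils.sparsefuncs import inplace_column_scale, mean_variance_axis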
def _handle_zeros_in_scale(scale, copy=True):
    ''' Makes sure that whenever scale is zero, we handle it correctly.
    This happens in most scalers when we have constant features.''' 

    # if we are fitting on 1D arrays, scale might be a scalar
    if np.isscalar(scale):
        if scale == .0:
            scale = 1.
        return scale
    elif isinstance(scale, np.ndarray):
        if copy:
            # New array to avoid side-effects
            scale = scale.copy()
        scale[scale == 0.0] = 1.0
        return scale

def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
    """Standardize a dataset along any axis

    Center to the mean and component wise scale to unit variance.

    Read more in the :ref:`User Guide <preprocessing_scaler>`.

    Parameters
    ----------
    X : {array-like, sparse matrix}
        The data to center and scale.

    axis : int (0 by default)
        axis used to compute the means and standard deviations along. If 0,
        independently standardize each feature, otherwise (if 1) standardize
        each sample.

    with_mean : boolean, True by default
        If True, center the data before scaling.

    with_std : boolean, True by default
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).

    copy : boolean, optional, default True
        set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array or a scipy.sparse
        CSC matrix and if axis is 1).

    Notes
    -----
    This implementation will refuse to center scipy.sparse matrices
    since it would make them non-sparse and would potentially crash the
    program with memory exhaustion problems.

    Instead the caller is expected to either set explicitly
    `with_mean=False` (in that case, only variance scaling will be
    performed on the features of the CSC matrix) or to call `X.toarray()`
    if he/she expects the materialized dense array to fit in memory.
    To avoid memory copy the caller should pass a CSC matrix.

    See also
    --------
    StandardScaler: Performs scaling to unit variance using the ``Transformer`` API
        (e.g. as part of a preprocessing :class:`sklearn.pipeline.Pipeline`).
    """  # noqa
    X = check_array(X, accept_sparse='csc', copy=copy, ensure_2d=False,
                    warn_on_dtype=True, estimator='the scale function',
                    dtype=FLOAT_DTYPES)
    if sparse.issparse(X):
        if with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` instead"
                " See docstring for motivation and alternatives.")
        if axis != 0:
            raise ValueError("Can only scale sparse matrix on axis=0, "
                             " got axis=%d" % axis)
        if with_std:
            _, var = mean_variance_axis(X, axis=0)
            var = _handle_zeros_in_scale(var, copy=False)
            inplace_column_scale(X, 1 / np.sqrt(var))
    else:
        X = np.asarray(X)
        if with_mean:
            mean_ = np.mean(X, axis)
        if with_std:
            scale_ = np.std(X, axis)
        # Xr is a view on the original array broadcasting on the axis in which we are interested in
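        # The next line is confusing at first: everything below operates on Xr,
        # yet the function returns X. The reason is that Xr is a view of X.
        # np.rollaxis returns a view of the input array, so the two share the
        # same data and differ only in axis order. With the default axis=0,
        # assert (X == Xr).all() confirms this.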
        Xr = np.rollaxis(X, axis)
        if with_mean:
            Xr -= mean_
            mean_1 = Xr.mean(axis=0)
            # Verify that mean_1 is 'close to zero'. If X contains very
            # large values, mean_1 can also be very large, due to a lack of
            # precision of mean_. In this case, a pre-scaling of the
            # concerned feature is efficient, for instance by its mean or
            # maximum.
            if not np.allclose(mean_1, 0):
                warnings.warn("Numerical issues were encountered "
                              "when centering the data "
                              "and might not be solved. Dataset may "
                              "contain too large values. You may need "
                              "to prescale your features.")
                Xr -= mean_1
        if with_std:
            scale_ = _handle_zeros_in_scale(scale_, copy=False)
            Xr /= scale_
            if with_mean:
                mean_2 = Xr.mean(axis=0)
                # If mean_2 is not 'close to zero', it comes from the fact that
                # scale_ is very small so that mean_2 = mean_1/scale_ > 0, even
                # if mean_1 was close to zero. The problem is thus essentially
                # due to the lack of precision of mean_. A solution is then to
                # subtract the mean again:
                if not np.allclose(mean_2, 0):
                    warnings.warn("Numerical issues were encountered "
                                  "when scaling the data "
                                  "and might not be solved. The standard "
                                  "deviation of the data is probably "
                                  "very close to 0. ")
                    Xr -= mean_2
    return X
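
The view behaviour that the comments in the listing describe can be checked in isolation. The following is a minimal sketch, not part of the sklearn source: the array values are arbitrary, and it uses axis=1 so that the view and the original visibly differ in shape.

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)

# Move axis 1 to the front. The result is a view of X, not a copy.
Xr = np.rollaxis(X, 1)           # shape (3, 2)

# The view and the original share the same underlying buffer.
shares_memory = np.shares_memory(X, Xr)

# An in-place update through Xr is therefore visible in X.
row_means = X.mean(axis=1)       # one mean per sample (row), shape (2,)
Xr -= row_means                  # broadcasts over the trailing axis of Xr
centered_row_means = X.mean(axis=1)
```

After this runs, each row of X is centered even though only Xr was modified, which is why scale can operate on Xr and still return X.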

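Simplified version of the scale code

def scale_mean_var(input_arr, axis=0):
    # Equivalent library call:
    #from sklearn import preprocessing
    #input_arr= preprocessing.scale(input_arr.astype('float'))
    input_arr = np.asarray(input_arr, dtype=float)
    # keepdims=True keeps the reduced axis so broadcasting works for any axis
    mean_ = np.mean(input_arr, axis=axis, keepdims=True)
    scale_ = np.std(input_arr, axis=axis, keepdims=True)
    # Subtract the mean
    output_arr = input_arr - mean_
    # Check whether the mean is now close to 0
    mean_1 = output_arr.mean(axis=axis, keepdims=True)
    if not np.allclose(mean_1, 0):
        output_arr -= mean_1
    # Set standard deviations that are 0 (constant features) to 1
    #scale_ = _handle_zeros_in_scale(scale_, copy=False)
    scale_[scale_ == 0.0] = 1.0
    # Divide by the standard deviation
    output_arr /= scale_
    # Check again whether the mean is close to 0
    mean_2 = output_arr.mean(axis=axis, keepdims=True)
    if not np.allclose(mean_2, 0):
        output_arr -= mean_2

    return output_arr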
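
As a quick sanity check of the standardization idea, the sketch below applies (X - mean) / std column by column with plain NumPy. The sample matrix is made up for illustration; its last column is deliberately constant to show why zero standard deviations are replaced by 1.

```python
import numpy as np

# Hypothetical data: three varying features plus one constant feature.
X = np.array([[1., -1.,  2., 5.],
              [2.,  0.,  0., 5.],
              [0.,  1., -1., 5.]])

mean = X.mean(axis=0)
std = X.std(axis=0)        # population standard deviation (ddof=0)
std[std == 0.0] = 1.0      # constant columns: avoid dividing by zero

X_scaled = (X - mean) / std

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
```

The varying columns end up with mean 0 and standard deviation 1, while the constant column becomes all zeros rather than NaN.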

Min-max normalization

The sklearn.preprocessing.minmax_scale function: Transforms features by scaling each feature to a given range.

        X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        X_scaled = X_std * (max - min) + min

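The simplified code is straightforward

def max_min(input_arr, o_min, o_max):
    """
    Transforms features by scaling each feature to a given range.
    """
    input_arr = np.asarray(input_arr, dtype=float)
    # Per-feature (column-wise) minimum and maximum, matching the formula above
    i_min = np.min(input_arr, axis=0)
    i_max = np.max(input_arr, axis=0)
    # Guard against constant features, where max - min would be 0
    i_range = i_max - i_min
    i_range = np.where(i_range == 0, 1.0, i_range)
    out_arr = (input_arr - i_min) / i_range

    if o_max == 1 and o_min == 0:
        return out_arr
    else:
        out_arr = out_arr * (o_max - o_min) + o_min
        return out_arr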
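
The two-step min-max formula can be worked through on a small matrix. This is a sketch with made-up values; the target range (-1, 1), named lo and hi here, is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
X = np.array([[1., 10.],
              [2., 20.],
              [4., 40.]])
lo, hi = -1.0, 1.0         # assumed target range

X_min = X.min(axis=0)
X_max = X.max(axis=0)

# Step 1: map each feature to [0, 1].
X_std = (X - X_min) / (X_max - X_min)
# Step 2: stretch and shift to [lo, hi].
X_scaled = X_std * (hi - lo) + lo
```

Both columns map to the same scaled values because they differ only by a constant factor, which is the point of removing the unit of measurement.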

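Max-abs normalization

The maxabs_scale function: Scale each feature by its maximum absolute value.

def maxabs_scale(input_arr, axis=0):
    """
    Scale each feature to the [-1, 1] range without breaking the sparsity
    """
    if not isinstance(input_arr, np.ndarray):
        input_arr = np.asarray(input_arr).astype(np.float32)
    # keepdims=True keeps the reduced axis so broadcasting works for any axis
    maxabs = np.max(np.abs(input_arr), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero features
    maxabs[maxabs == 0] = 1
    out_array = input_arr / maxabs
    return out_array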
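
To see what "without breaking the sparsity" means in practice, the sketch below scales a small matrix with many zeros. The values are made up for illustration. Because the data is only divided and never shifted, zero entries stay zero and signs are preserved.

```python
import numpy as np

# Hypothetical data with several zero entries.
X = np.array([[ 1., -2., 0.],
              [ 0.,  4., 0.],
              [-5.,  0., 2.]])

max_abs = np.abs(X).max(axis=0)   # per-feature maximum absolute value
X_scaled = X / max_abs

# The zero pattern is unchanged, so a sparse representation stays sparse.
zeros_preserved = np.array_equal(X_scaled == 0, X == 0)
```

This contrasts with standardization and min-max scaling, both of which subtract an offset and would turn most zeros into nonzero values.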
