Python Data Preprocessing (Sampling and Dataset Conversion)
阿新 • Published: 2019-01-03
Reference: "Python large-scale data processing tricks, part 2: common operations in machine learning" (http://blog.csdn.net/asdfg4381/article/details/51725424)
1. Data Preprocessing
Randomization
Most randomization steps in machine learning can be built with the random module: generate non-repeating random numbers with random.sample, then use them as indices to slice the dataset. The short snippet below covers essentially all of the basic randomized preprocessing operations.
import random

# nSample is the number of samples to draw; nItem is the dataset size
samp_ids = sorted(random.sample(range(nItem), nSample))
Random sampling of a dataset:
import random

nItem = len(df)
nSample = 1000
# nSample is the number of samples to draw
samp_ids = sorted(random.sample(range(nItem), nSample))
# assumes the id column holds the values 0 .. nItem-1
samp_idList = df.id.isin(samp_ids)
df_sample = df[samp_idList]
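Building the index list by hand works everywhere, but pandas also ships DataFrame.sample, which draws rows without replacement in one call. A minimal sketch (the frame and its column names are made up for illustration):

```python
import pandas as pd

# hypothetical toy frame standing in for df above
df = pd.DataFrame({'id': range(10), 'val': range(10, 20)})

# draw 3 distinct rows; random_state makes the draw reproducible
df_sample = df.sample(n=3, random_state=0)
print(len(df_sample))
```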
Splitting a dataset into training and test sets:
import random

nRatio = 2
nTest = int(nSample / nRatio)
nTrain = nSample - nTest
# randomly draw the row ids to hold out for the test set
# (samp_ids is the id list produced by the sampling step above)
samp_ix = [samp_ids[i] for i in sorted(random.sample(range(nSample), nTest))]
# build the boolean selection masks
list_testSamp = df.row_id.isin(samp_ix)
list_trainSamp = ~list_testSamp
samp_test = df[list_testSamp]
samp_train = df[list_trainSamp]
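If scikit-learn is available, the same split is a single call to train_test_split. A sketch on toy arrays rather than the df above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.5 matches nRatio = 2 above (half the data held out);
# random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(X_train.shape, X_test.shape)
```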
Randomizing the order of samples in a dataset:
- Recommended: produce the shuffled indices in one line (note: do not sort the result, or you get back the identity order):
random.sample(range(nSample), nSample)
then traverse the dataset in the order of those random indices.
- Use scikit-learn's built-in support: its estimators take a random_state parameter that controls the random initialization of the data.
- If neither of the above fits, the following approach works: attach two values a and b to every sample, where a is a random value and b is the sample's index.
- Sort the samples by the random value a (ascending or descending both work).
- Walk the sorted index values b to address the original samples, processing them in the sorted order.
import numpy as np
import pandas as pd
from pandas import DataFrame

def randomlizeSample(X_row, y_row):
    nSample, nFeat = np.shape(X_row)
    inx1 = DataFrame(np.random.randn(nSample), columns=['randVal'])
    inx2 = DataFrame(range(nSample), columns=['inxVal'])
    inx = pd.concat([inx1, inx2], axis=1)
    # sort by the random column (sort_index(by=...) is deprecated; use sort_values)
    inx = inx.sort_values(by='randVal', ascending=False)
    cnt = 0
    X = np.zeros((nSample, nFeat))
    y = np.zeros((nSample))
    # preallocate X and y; do not initialize them as []
    for line in inx['inxVal']:
        X[cnt] = X_row[line]
        y[cnt] = y_row[line]
        cnt += 1
    return X, y
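The same shuffle collapses to a few lines with np.random.permutation, which produces the random index order directly and reorders both arrays by fancy indexing. A sketch on toy arrays:

```python
import numpy as np

X_row = np.arange(12).reshape(6, 2)   # row i is [2*i, 2*i + 1]
y_row = np.arange(6)

perm = np.random.permutation(len(y_row))  # a random ordering of 0..5
X, y = X_row[perm], y_row[perm]           # fancy indexing reorders the rows
```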
Sampling for imbalanced classes:
- Keep every sample of the rare classes, and randomly draw majority-class samples according to a ratio.
Relatively efficient code:
import random
import pandas as pd

def sampleBalance(df, labelColumn, th):
    label_counts = df[labelColumn].value_counts()
    # get the current label counts
    mask = (label_counts[df[labelColumn].values] >= th).values
    df = df.loc[mask]
    # drop samples whose label count is below the threshold th
    label_counts = df[labelColumn].value_counts()
    # get the label counts again
    labels = label_counts.sort_values(ascending=False).index
    nLabel = len(labels)
    nSampPerLabel = label_counts[labels[-1]]
    balancedSamples = pd.DataFrame()
    for n in range(nLabel):
        df_label = df[df[labelColumn] == labels[n]]
        nItem = len(df_label)
        df_label = df_label.reset_index(drop=True)
        # renumber the rows 0 .. nItem-1 so positional sampling works
        samp_index = sorted(random.sample(range(nItem), nSampPerLabel))
        df_label = df_label.iloc[samp_index]
        balancedSamples = pd.concat([balancedSamples, df_label], axis=0)
    return balancedSamples, list(labels)
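In pandas 1.1+, DataFrameGroupBy.sample makes the same undersampling idea much shorter: group by the label, then draw the rare-class count from every group. The labels and counts below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'label': ['a'] * 6 + ['b'] * 3 + ['c'] * 9,
                   'x': range(18)})

n_min = df['label'].value_counts().min()  # size of the rarest class (3 here)

# draw n_min rows from every label group, without replacement
balanced = df.groupby('label', group_keys=False).sample(n=n_min, random_state=0)
print(balanced['label'].value_counts().tolist())
```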
Poorly written code from earlier days (not recommended): it makes no use of Python's strengths and is written purely with a C/C++ mindset, so it runs inefficiently.
# assumes numpy, pandas and the randomlizeSample above are in scope
def sampleBalance(X_row, y_row):
    rate_np = 1
    # ratio of negative samples to positive samples
    nSample, nFeat = np.shape(X_row)
    nSample_pos = np.sum(y_row == 1)
    nSample_neg = np.sum(y_row == 0)
    nSample_negOnRate = int(np.floor(nSample_pos * rate_np))
    print(nSample, nSample_pos, nSample_neg)
    X = np.zeros((nSample_pos + nSample_negOnRate, nFeat))
    y = np.zeros((nSample_pos + nSample_negOnRate))
    # copy the positive samples
    id_pos = 0
    id_neg = 0
    X_neg = np.zeros((nSample_neg, nFeat))
    for i in range(nSample):
        if y_row[i] == 1:
            X[id_pos] = X_row[i]
            y[id_pos] = 1
            id_pos += 1
        else:
            X_neg[id_neg] = X_row[i]
            id_neg += 1
    inx1 = DataFrame(np.random.randn(nSample_neg), columns=['randVal'])
    inx2 = DataFrame(range(nSample_neg), columns=['inxVal'])
    inx = pd.concat([inx1, inx2], axis=1)
    inx = inx.sort_values(by='randVal', ascending=False)
    cnt = 0
    for line in inx['inxVal']:
        if cnt >= nSample_negOnRate:
            break
        X[nSample_pos + cnt] = X_neg[line]
        y[nSample_pos + cnt] = 0
        cnt += 1
    X_rand, y_rand = randomlizeSample(X, y)
    return X_rand, y_rand
2. Dataset Conversion
Converting between the formats of different packages
- A brief note on where the built-in data structures, numpy, and pandas each fit (storage layouts are not covered here):
- The built-in structures, list, dict, set, and tuple, are the most general-purpose and the most convenient to use.
- numpy's data structures closely resemble MATLAB's and suit matrix arithmetic; they are also the structures that the machine-learning package scikit-learn expects.
- A DataFrame behaves somewhat like a database table and suits large-scale data processing and analysis.
Converting between numpy and list:
- list to numpy:
import numpy as np

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)
- numpy to list
## Using the numpy method:
data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)
data = arr.tolist()
## Brute-force approach:
data = [[elem for elem in line] for line in arr]
Converting between DataFrame and numpy:
- DataFrame to numpy
X_train = df.values.astype(int)  # convert df to a numpy ndarray with int dtype
- numpy to DataFrame
columns = ['c0', 'c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(X_train, columns = columns)
Converting between DataFrame, Series, and list:
- list to DataFrame and Series
data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
columns = ['c0', 'c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(data, columns = columns)
data = [6, 7.5, 8, 0, 1]
ser = pd.Series(data)
- DataFrame and Series to list
## DataFrame to list
df['c0'].values.tolist()  # convert a single column to a list
df.values.tolist()  # convert the whole DataFrame to a nested list
## Series to list
ser.values.tolist()  # convert the Series values to a list
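Putting the pieces together, a quick round trip through the three representations (values chosen so nothing is lost in the float conversion):

```python
import numpy as np
import pandas as pd

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)                                            # list -> ndarray
df = pd.DataFrame(arr, columns=['c0', 'c1', 'c2', 'c3', 'c4'])  # ndarray -> DataFrame
back = df.values.tolist()                                       # DataFrame -> nested list
print(back == data)   # numerically equal after the round trip
```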