Python Data Preprocessing (Sampling and Dataset Conversion)
阿新 • Published: 2019-01-03
Reference: "Python large-scale data processing tricks, part 2: common operations in machine learning" (http://blog.csdn.net/asdfg4381/article/details/51725424)
1. Data Preprocessing
Randomization
Most randomization steps in machine learning can be built with the random module: generate non-repeating random numbers with random.sample, then use them as indices to slice the dataset. The short snippet below covers essentially all of the basic randomized preprocessing operations.
import random

# nSample is the number of samples to draw; nItem is the dataset size
samp_ids = sorted(random.sample(range(nItem), nSample))
Random sampling of a dataset:
import random

nItem = len(df)
nSample = 1000
# nSample is the number of samples to draw
samp_ids = sorted(random.sample(range(nItem), nSample))
# assumes the id column holds the values 0 .. nItem-1
samp_idList = df.id.isin(samp_ids)
df_sample = df[samp_idList]
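Building the index list by hand works everywhere, but pandas also ships DataFrame.sample, which draws rows without replacement in one call. A minimal sketch (the frame and its column names are made up for illustration):

```python
import pandas as pd

# hypothetical toy frame standing in for df above
df = pd.DataFrame({'id': range(10), 'val': range(10, 20)})

# draw 3 distinct rows; random_state makes the draw reproducible
df_sample = df.sample(n=3, random_state=0)
print(len(df_sample))
```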
Splitting a dataset into training and test sets:
import random

nRatio = 2
nTest = int(nSample / nRatio)
nTrain = nSample - nTest
# randomly draw the row ids to hold out for the test set
# (samp_ids is the id list produced by the sampling step above)
samp_ix = [samp_ids[i] for i in sorted(random.sample(range(nSample), nTest))]
# build the boolean selection masks
list_testSamp = df.row_id.isin(samp_ix)
list_trainSamp = ~list_testSamp
samp_test = df[list_testSamp]
samp_train = df[list_trainSamp]
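If scikit-learn is available, the same split is a single call to train_test_split. A sketch on toy arrays rather than the df above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.5 matches nRatio = 2 above (half the data held out);
# random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(X_train.shape, X_test.shape)
```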
Randomizing the order of samples in a dataset:
- Recommended: produce the shuffled indices in one line (note: do not sort the result, or you get back the identity order):
random.sample(range(nSample), nSample)
then traverse the dataset in the order of those random indices.
- Use scikit-learn's built-in support: its estimators take a random_state parameter that controls the random initialization of the data.
- If neither of the above fits, the following approach works: attach two values a and b to every sample, where a is a random value and b is the sample's index.
- Sort the samples by the random value a (ascending or descending both work).
- Walk the sorted index values b to address the original samples, processing them in the sorted order.
import numpy as np
import pandas as pd
from pandas import DataFrame

def randomlizeSample(X_row, y_row):
    nSample, nFeat = np.shape(X_row)
    inx1 = DataFrame(np.random.randn(nSample), columns=['randVal'])
    inx2 = DataFrame(range(nSample), columns=['inxVal'])
    inx = pd.concat([inx1, inx2], axis=1)
    # sort by the random column (sort_index(by=...) is deprecated; use sort_values)
    inx = inx.sort_values(by='randVal', ascending=False)
    cnt = 0
    X = np.zeros((nSample, nFeat))
    y = np.zeros((nSample))
    # preallocate X and y; do not initialize them as []
    for line in inx['inxVal']:
        X[cnt] = X_row[line]
        y[cnt] = y_row[line]
        cnt += 1
    return X, y
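The same shuffle collapses to a few lines with np.random.permutation, which produces the random index order directly and reorders both arrays by fancy indexing. A sketch on toy arrays:

```python
import numpy as np

X_row = np.arange(12).reshape(6, 2)   # row i is [2*i, 2*i + 1]
y_row = np.arange(6)

perm = np.random.permutation(len(y_row))  # a random ordering of 0..5
X, y = X_row[perm], y_row[perm]           # fancy indexing reorders the rows
```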
Sampling for imbalanced classes:
- Keep every sample of the rare classes, and randomly draw majority-class samples according to a ratio.
Relatively efficient code:
import random
import pandas as pd

def sampleBalance(df, labelColumn, th):
    label_counts = df[labelColumn].value_counts()
    # get the current label counts
    mask = (label_counts[df[labelColumn].values] >= th).values
    df = df.loc[mask]
    # drop samples whose label count is below the threshold th
    label_counts = df[labelColumn].value_counts()
    # get the label counts again
    labels = label_counts.sort_values(ascending=False).index
    nLabel = len(labels)
    nSampPerLabel = label_counts[labels[-1]]
    balancedSamples = pd.DataFrame()
    for n in range(nLabel):
        df_label = df[df[labelColumn] == labels[n]]
        nItem = len(df_label)
        df_label = df_label.reset_index(drop=True)
        # renumber the rows 0 .. nItem-1 so positional sampling works
        samp_index = sorted(random.sample(range(nItem), nSampPerLabel))
        df_label = df_label.iloc[samp_index]
        balancedSamples = pd.concat([balancedSamples, df_label], axis=0)
    return balancedSamples, list(labels)
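In pandas 1.1+, DataFrameGroupBy.sample makes the same undersampling idea much shorter: group by the label, then draw the rare-class count from every group. The labels and counts below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'label': ['a'] * 6 + ['b'] * 3 + ['c'] * 9,
                   'x': range(18)})

n_min = df['label'].value_counts().min()  # size of the rarest class (3 here)

# draw n_min rows from every label group, without replacement
balanced = df.groupby('label', group_keys=False).sample(n=n_min, random_state=0)
print(balanced['label'].value_counts().tolist())
```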
Poorly written code from earlier days (not recommended): it makes no use of Python's strengths and is written purely with a C/C++ mindset, so it runs inefficiently.
# assumes numpy, pandas and the randomlizeSample above are in scope
def sampleBalance(X_row, y_row):
    rate_np = 1
    # ratio of negative samples to positive samples
    nSample, nFeat = np.shape(X_row)
    nSample_pos = np.sum(y_row == 1)
    nSample_neg = np.sum(y_row == 0)
    nSample_negOnRate = int(np.floor(nSample_pos * rate_np))
    print(nSample, nSample_pos, nSample_neg)
    X = np.zeros((nSample_pos + nSample_negOnRate, nFeat))
    y = np.zeros((nSample_pos + nSample_negOnRate))
    # copy the positive samples
    id_pos = 0
    id_neg = 0
    X_neg = np.zeros((nSample_neg, nFeat))
    for i in range(nSample):
        if y_row[i] == 1:
            X[id_pos] = X_row[i]
            y[id_pos] = 1
            id_pos += 1
        else:
            X_neg[id_neg] = X_row[i]
            id_neg += 1
    inx1 = DataFrame(np.random.randn(nSample_neg), columns=['randVal'])
    inx2 = DataFrame(range(nSample_neg), columns=['inxVal'])
    inx = pd.concat([inx1, inx2], axis=1)
    inx = inx.sort_values(by='randVal', ascending=False)
    cnt = 0
    for line in inx['inxVal']:
        if cnt >= nSample_negOnRate:
            break
        X[nSample_pos + cnt] = X_neg[line]
        y[nSample_pos + cnt] = 0
        cnt += 1
    X_rand, y_rand = randomlizeSample(X, y)
    return X_rand, y_rand
2. Dataset Conversion
Converting between the formats of different packages
- A brief note on where the built-in data structures, numpy, and pandas each fit (storage layouts are not covered here):
- The built-in structures, list, dict, set, and tuple, are the most general-purpose and the most convenient to use.
- numpy's data structures closely resemble MATLAB's and suit matrix arithmetic; they are also the structures that the machine-learning package scikit-learn expects.
- A DataFrame behaves somewhat like a database table and suits large-scale data processing and analysis.
Converting between numpy and list:
- list to numpy:
import numpy as np

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)
- numpy to list
## Using the numpy method:
data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)
data = arr.tolist()
## Brute-force approach:
data = [[elem for elem in line] for line in arr]
Converting between DataFrame and numpy:
- DataFrame to numpy
X_train = df.values.astype(int)  # convert df to a numpy ndarray with int dtype
- numpy to DataFrame
columns = ['c0', 'c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(X_train, columns = columns)
Converting between DataFrame, Series, and list:
- list to DataFrame and Series
data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
columns = ['c0', 'c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(data, columns = columns)
data = [6, 7.5, 8, 0, 1]
ser = pd.Series(data)
- DataFrame and Series to list
## DataFrame to list
df['c0'].values.tolist()  # convert a single column to a list
df.values.tolist()  # convert the whole DataFrame to a nested list
## Series to list
ser.values.tolist()  # convert the Series values to a list
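Putting the pieces together, a quick round trip through the three representations (values chosen so nothing is lost in the float conversion):

```python
import numpy as np
import pandas as pd

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)                                            # list -> ndarray
df = pd.DataFrame(arr, columns=['c0', 'c1', 'c2', 'c3', 'c4'])  # ndarray -> DataFrame
back = df.values.tolist()                                       # DataFrame -> nested list
print(back == data)   # numerically equal after the round trip
```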