Python for Data Analysis: Data Wrangling, Part 2 (ETL)

Once I have mastered my studies, shall I marry Rui? @夏瑾墨 by Jooey

3. Concatenating along an axis
NumPy has a concatenate function for joining raw NumPy arrays:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

arr=np.arange(12).reshape((3,4))
print (arr)
print (np.concatenate([arr,arr],axis=1))

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
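
To make the role of the axis argument concrete, here is a minimal sketch that compares the shapes produced by axis=0 and axis=1. The variable names stacked_rows and stacked_cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 stacks the arrays on top of each other, so the row count doubles
stacked_rows = np.concatenate([arr, arr], axis=0)
# axis=1 places the arrays side by side, so the column count doubles
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```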

Suppose we have three Series whose indexes do not overlap:

s1=Series([0,1],index=['a','b'])
s2=Series([2,3,4],index=['c','d','e'])
s3=Series([5,6],index=['f','g'])
print (pd.concat([s1,s2,s3]))

Output:

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0 and produces a new Series. If you pass axis=1, the result becomes a DataFrame (axis=1 refers to the columns):

print (pd.concat([s1,s2,s3],axis=1))

Output:

     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
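
As a quick check of the behavior described above, the following sketch (using two of the Series, with illustrative variable names rows and cols) inspects the type and shape of each result. It also shows why the values print as floats: the missing entries are filled with NaN, which forces the columns to float64.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

rows = pd.concat([s1, s2])           # axis=0: values are appended into one Series
cols = pd.concat([s1, s2], axis=1)   # axis=1: each Series becomes a column

print(type(rows).__name__, rows.shape)
print(type(cols).__name__, cols.shape)
print(cols.dtypes.tolist())          # NaN filling turns the integer columns into floats
```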

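In this case there is no overlap on the other axis, as you can see from the sorted union (outer join) of the indexes. Pass join='inner' to get their intersection instead: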

s4=pd.concat([s1*5,s3])
print (pd.concat([s1,s4],axis=1))
print (pd.concat([s1,s4],axis=1,join='inner'))

Output:

     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6
   0  1
a  0  0
b  1  5
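
The difference between the two join modes can also be verified on the row labels directly. This is a small sketch with the same data; the names outer and inner are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

outer = pd.concat([s1, s4], axis=1)                # default join='outer': union of labels
inner = pd.concat([s1, s4], axis=1, join='inner')  # intersection of labels

print(list(outer.index))
print(list(inner.index))
```

Because the inner join keeps only labels present in both inputs, no NaN is introduced and the integer dtype survives.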

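In older pandas versions you could use join_axes to specify the index to use on the other axis. That argument was removed in pandas 1.0; reindexing the concatenated result gives the same output:

print (pd.concat([s1,s4],axis=1).reindex(['a','c','b','e']))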

Output:

     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN

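NaN stands for "Not a Number"; pandas uses it to mark missing values.
One problem remains: the concatenated pieces cannot be told apart in the result. Suppose you want to create a hierarchical index on the concatenation axis. The keys argument does exactly that: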

result=pd.concat([s1,s2,s3],keys=['one','two','three'])
print (result)
print (result.unstack())

Output:

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
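
One practical benefit of the hierarchical index is that each original piece can be recovered by its key. The sketch below assumes the same three Series; the name piece is only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

print(result.index.nlevels)   # the index now has two levels
piece = result.loc['two']     # select by the outer key to get one piece back
print(piece)
```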

If you combine the Series along axis=1, the keys become the DataFrame column headers:

print (pd.concat([s1,s2,s3],axis=1,keys=['one','two','three']))

Output:

   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0

The same logic applies to DataFrame objects:

df5=DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df6=DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
print (pd.concat([df5,df6],axis=1,keys=['level1','level2']))

Output:

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0

If you pass a dict instead of a list, the dict's keys are used as the value of the keys option:

print (pd.concat({'level1':df5,'level2':df6},axis=1))

Output:

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
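
The following sketch checks that the dict form and the list-plus-keys form produce the same frame, and shows how the top-level key selects one block of columns. The names from_list, from_dict and block are only illustrative.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

from_list = pd.concat([df5, df6], axis=1, keys=['level1', 'level2'])
from_dict = pd.concat({'level1': df5, 'level2': df6}, axis=1)

print(from_list.equals(from_dict))   # both forms build the same result
block = from_dict['level1']          # selecting a top-level key returns that block
print(list(block.columns))
```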

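Two more arguments, names and levels, control how the hierarchical index is created (see the table below). For example, names labels the levels: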

print (pd.concat([df5,df6],axis=1,keys=['level1','level2'],names=['upper','lower']))

Output:

upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
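
Naming the levels pays off when selecting data, because a level can then be referred to by name rather than by position. A minimal sketch, assuming the same two frames; named and sub are illustrative names.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

named = pd.concat([df5, df6], axis=1, keys=['level1', 'level2'],
                  names=['upper', 'lower'])

print(list(named.columns.names))   # the two column levels now carry names

# Take a cross-section of the columns by referring to the level by name
sub = named.xs('level2', axis=1, level='upper')
print(list(sub.columns))
```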

In Python 3, you pass these arguments in a function call simply by separating them with commas.
[Figure: table of concat function arguments (image not available)]
The last thing to consider is a DataFrame whose row index is irrelevant to the analysis at hand. In that case, pass ignore_index=True:

df7=DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df8=DataFrame(np.random.randn(2,3),columns=['b','d','a'])
print (df7)
print (df8)
print (pd.concat([df7,df8],ignore_index=True))

Output:

       a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
          b         d         a
0 -1.377938 -0.616348  0.936278
1  0.400851  2.066192  0.127229
          a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
3  0.936278 -1.377938       NaN -0.616348
4  0.127229  0.400851       NaN  2.066192
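
Because the example above uses random numbers, its values change on every run. The sketch below uses deterministic data (np.arange in place of np.random.randn, an assumption made only for reproducibility) to compare the row labels with and without ignore_index. The names kept and fresh are illustrative.

```python
import numpy as np
import pandas as pd

df7 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
df8 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['b', 'd', 'a'])

kept = pd.concat([df7, df8])                      # original row labels are kept, so they repeat
fresh = pd.concat([df7, df8], ignore_index=True)  # rows are relabeled 0..n-1

print(list(kept.index))
print(list(fresh.index))
print(int(fresh['c'].isna().sum()))  # df8 has no column 'c', so its rows get NaN there
```

Columns are still aligned by name, so the different column order in df8 does not matter.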
