Data Analysis with Python: Data Wrangling 2 (ETL)
阿新 • Published: 2019-02-16
Once I have completed my studies, may Rui and I be wed? @夏瑾墨 by Jooey
3. Concatenating Along an Axis
NumPy has a concatenate function for joining raw NumPy arrays:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
arr=np.arange(12).reshape((3,4))
print (arr)
print (np.concatenate([arr,arr],axis=1))
Output:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
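To make the role of the axis argument concrete, here is a minimal sketch that joins the same array along both axes and compares the resulting shapes. The variable names stacked_rows and stacked_cols are illustrative only.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 appends rows: the column count stays at 4.
stacked_rows = np.concatenate([arr, arr], axis=0)

# axis=1 appends columns: the row count stays at 3.
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```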
Suppose we have three Series with no overlapping index labels:
s1=Series([0,1],index=['a','b'])
s2=Series([2,3,4],index=['c','d','e'])
s3=Series([5,6],index=['f','g'])
print (pd.concat([s1,s2,s3]))
Output:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
By default, concat works along axis=0 and produces a new Series. If you pass axis=1, the result becomes a DataFrame instead (axis=1 is the column axis):
print (pd.concat([s1,s2,s3],axis=1))
Output:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
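The difference between the two axes can also be checked programmatically. The following is a small sketch, with along_rows and along_cols as illustrative names, that inspects the type and shape of each result and counts the NaN values introduced per column.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

along_rows = pd.concat([s1, s2, s3])          # axis=0 is the default
along_cols = pd.concat([s1, s2, s3], axis=1)  # one column per input Series

print(type(along_rows).__name__, along_rows.shape)  # Series (7,)
print(type(along_cols).__name__, along_cols.shape)  # DataFrame (7, 3)

# Labels that an input Series lacks become NaN in that Series' column.
print(along_cols.isna().sum().tolist())  # [5, 4, 5]
```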
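In this case there is no overlap on the other axis, which you can see from the result's index being the sorted union (an outer join) of the input indexes. Pass join='inner' to get their intersection instead: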
s4=pd.concat([s1*5,s3])
print (pd.concat([s1,s4],axis=1))
print (pd.concat([s1,s4],axis=1,join='inner'))
Output:
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6
   0  1
a  0  0
b  1  5
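As a rough check of the union and intersection behaviour described above, this sketch compares the index of each result with the set operations on the input indexes. The names outer and inner are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

outer = pd.concat([s1, s4], axis=1)                # join='outer' is the default
inner = pd.concat([s1, s4], axis=1, join='inner')  # keep only shared labels

print(list(outer.index))  # ['a', 'b', 'f', 'g']
print(list(inner.index))  # ['a', 'b']

# The same label sets computed directly from the indexes.
print(list(s1.index.union(s4.index)))         # ['a', 'b', 'f', 'g']
print(list(s1.index.intersection(s4.index)))  # ['a', 'b']
```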
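In older versions of pandas you could pass join_axes to specify the index to use on the other axes, as in pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']]). That argument was deprecated in pandas 0.25 and removed in pandas 1.0; reindexing the concatenated result gives the same output:
print (pd.concat([s1,s4],axis=1).reindex(['a','c','b','e']))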
Output:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN
NaN stands for Not a Number.
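One problem is that the concatenated pieces cannot be told apart in the result. Suppose you want to create a hierarchical index on the concatenation axis instead. The keys argument does exactly that: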
result=pd.concat([s1,s2,s3],keys=['one','two','three'])
print (result)
print (result.unstack())
Output:
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
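One practical benefit of the hierarchical index is that each original piece can be pulled back out by its key. A minimal sketch, where piece and single are illustrative names:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

# The index now has two levels: the key and the original label.
print(result.index.nlevels)  # 2

# Selecting on the outer level recovers the original piece.
piece = result.loc['two']
print(piece.equals(s2))  # True

# A (key, label) tuple selects a single value.
single = result.loc[('three', 'g')]
print(single)  # 6
```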
If you combine the Series along axis=1, the keys become the DataFrame column headers instead:
print (pd.concat([s1,s2,s3],axis=1,keys=['one','two','three']))
Output:
   one  two  three
a    0  NaN    NaN
b    1  NaN    NaN
c  NaN    2    NaN
d  NaN    3    NaN
e  NaN    4    NaN
f  NaN  NaN      5
g  NaN  NaN      6
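Because the keys become ordinary column labels here, each input can be selected by name afterwards. A short sketch (wide is an illustrative name); note that recent pandas versions display these columns as floats such as 2.0 because of the NaN values.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

wide = pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

print(list(wide.columns))  # ['one', 'two', 'three']

# Dropping the NaN padding leaves the values of the original s2.
print(wide['two'].dropna().tolist())  # [2.0, 3.0, 4.0]
```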
The same logic extends to DataFrame objects:
df5=DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df6=DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
print (pd.concat([df5,df6],axis=1,keys=['level1','level2']))
Output:
  level1     level2
     one two  three four
a      0   1      5    6
b      2   3    NaN  NaN
c      4   5      7    8
If you pass a dict instead of a list, the dict's keys are used as the value of the keys option:
print (pd.concat({'level1':df5,'level2':df6},axis=1))
Output:
  level1     level2
     one two  three four
a      0   1      5    6
b      2   3    NaN  NaN
c      4   5      7    8
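The equivalence of the list-plus-keys form and the dict form can be verified directly. This is a small sketch with from_list and from_dict as illustrative names; it also shows that a top-level key selects the sub-frame that came from one input.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

from_list = pd.concat([df5, df6], axis=1, keys=['level1', 'level2'])
from_dict = pd.concat({'level1': df5, 'level2': df6}, axis=1)

# Both calls build the same two-level column index and the same data.
print(from_list.equals(from_dict))  # True

# Selecting a top-level key returns the columns contributed by that input.
print(list(from_dict['level2'].columns))  # ['three', 'four']
```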
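Two further arguments control how the hierarchical index is created: levels and names. The following call uses names to label the generated levels: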
print (pd.concat([df5,df6],axis=1,keys=['level1','level2'],names=['upper','lower']))
Output:
upper level1     level2
lower    one two  three four
a          0   1      5    6
b          2   3    NaN  NaN
c          4   5      7    8
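Once the levels are named, they can be referred to by name rather than by position. A minimal sketch, where named is an illustrative variable name:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

named = pd.concat([df5, df6], axis=1,
                  keys=['level1', 'level2'], names=['upper', 'lower'])

# The column index carries the level names passed via names.
print(list(named.columns.names))  # ['upper', 'lower']

# Level values can then be looked up by name.
print(named.columns.get_level_values('upper').tolist())
# ['level1', 'level1', 'level2', 'level2']
print(named.columns.get_level_values('lower').tolist())
# ['one', 'two', 'three', 'four']
```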
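In Python 3, the arguments of a function call are simply written one after another, separated by commas.
A last consideration is a DataFrame whose row index carries no information relevant to the analysis. In that case, pass ignore_index=True: concat discards the existing row labels and numbers the rows of the result from 0.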
df7=DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df8=DataFrame(np.random.randn(2,3),columns=['b','d','a'])
print (df7)
print (df8)
print (pd.concat([df7,df8],ignore_index=True))
Output:
          a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
          b         d         a
0 -1.377938 -0.616348  0.936278
1  0.400851  2.066192  0.127229
          a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
3  0.936278 -1.377938       NaN -0.616348
4  0.127229  0.400851       NaN  2.066192
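To see what ignore_index changes, the sketch below repeats the concatenation with and without it. It uses fixed integers in place of the random values above so that the result is reproducible; kept and renumbered are illustrative names.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the random frames used above.
df7 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
df8 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['b', 'd', 'a'])

kept = pd.concat([df7, df8])                           # original row labels kept
renumbered = pd.concat([df7, df8], ignore_index=True)  # fresh 0..n-1 labels

print(list(kept.index))        # [0, 1, 2, 0, 1]
print(kept.index.is_unique)    # False
print(list(renumbered.index))  # [0, 1, 2, 3, 4]

# Columns are matched by name; df8 has no 'c', so its rows get NaN there.
print(int(renumbered['c'].isna().sum()))  # 2
```

Without ignore_index the duplicated labels 0 and 1 make label-based selection ambiguous, which is why discarding a meaningless index is usually the safer choice.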