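Pandas Data Basics (Indexing, Sorting, Joining, Deduplication, Binning, Outlier Handling)
阿新 • Published: 2019-02-06
To use pandas, first import the packages (NumPy is also needed for the np.arange calls used later):
from pandas import Series, DataFrame
import pandas as pd
import numpy as np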
I. Creating Series and DataFrame objects
1. Creating a Series
a. From a list
obj = Series([4, 7, -5, 3])
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # specify the index
b. From a dict
sdata = {'Ohio':35000, 'Texas':7100, 'Oregon':1600,'Utah' :500}
obj3 = Series(sdata)
c. From a dict plus an index
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
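When an index is specified, the three values in sdata whose keys match entries in states are found and placed at the corresponding positions. 'California' has no matching key in sdata, so its value is NaN.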
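The label alignment described above can be checked directly. The following is a minimal sketch that reuses sdata and states from the text; the names missing and present are introduced here only for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 7100, 'Oregon': 1600, 'Utah': 500}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# Keys are matched against the requested index: 'Utah' is dropped because
# it is not in `states`, and 'California' gets NaN because sdata has no such key.
obj4 = pd.Series(sdata, index=states)

# Hypothetical helper names, not part of the original post.
missing = obj4[obj4.isna()].index.tolist()       # labels with no data
present = obj4.dropna().astype(int).to_dict()    # labels that matched

print(missing)
print(present)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 even though every value in sdata is an integer.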
2. Creating a DataFrame
a. From a dict
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2011, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order
frame3 = DataFrame(data, columns=['year', 'state', 'pop'],
                   index=['one', 'two', 'three', 'four', 'five'])  # specify columns and index
b. From a list
>>> errors = [('c', 1, 'right'), ('b', 2, 'wrong')]
>>> df = pd.DataFrame(errors)
>>> df
   0  1      2
0  c  1  right
1  b  2  wrong
>>> df = pd.DataFrame(errors, columns=['name', 'count', 'result'])  # specify column names
>>> df
  name  count result
0    c      1  right
1    b      2  wrong
c. From a nested dict (a dict of dicts): outer keys become columns, inner keys become the row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame4 = DataFrame(pop)
frame4
Out[138]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
d. Combining Series
Building a DataFrame row by row
In [4]: a = pd.Series([1, 2, 3])
In [5]: b = pd.Series([2, 3, 4])
In [6]: c = pd.DataFrame([a, b])
In [7]: c
Out[7]:
   0  1  2
0  1  2  3
1  2  3  4
Building a DataFrame column by column
In [8]: c = pd.DataFrame({'a': a, 'b': b})
In [9]: c
Out[9]:
   a  b
0  1  2
1  2  3
2  3  4
II. Selection
Given a DataFrame:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
1. Selecting a column returns a Series
>>> data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
2. Selecting a row returns a Series (the .ix indexer of older pandas has been removed; use .loc for labels)
>>> data.loc['Ohio']
one      0
two      1
three    2
four     3
Name: Ohio, dtype: int64
3. Selecting rows and columns together, by row and column label with .loc or by integer position with .iloc
>>> data.loc['Ohio', ['two', 'three']]
two      1
three    2
Name: Ohio, dtype: int64
>>> data.loc[data['three'] > 3, data.columns[:3]]  # boolean row filter, first three columns
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
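As a rough sketch of how the selections above map onto current pandas, the snippet below contrasts label-based .loc with position-based .iloc on the same data frame. The variable names on the left-hand side are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# One row, selected by label and by integer position.
row_by_label = data.loc['Ohio']
row_by_pos = data.iloc[0]

# One row restricted to two named columns.
cell_subset = data.loc['Ohio', ['two', 'three']]

# Boolean row filter combined with the first three columns,
# written once with labels and once with positions.
filtered = data.loc[data['three'] > 3, data.columns[:3]]
same_filtered = data[data['three'] > 3].iloc[:, :3]

print(filtered)
```

Keeping labels in .loc and positions in .iloc avoids the ambiguity that the old .ix indexer had when an index contained integers.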
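III. Iteration and Aggregation
1. Iterating over rows
for ix, row in df.iterrows():
    print(ix, row.tolist())  # ix is the row label, row is a Series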
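2. Iterating over columns
for label, col in df.items():  # iteritems() was removed in pandas 2.0
    print(label, col.tolist())  # label is the column name, col is a Series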
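The two loops can be tried on the small df built from the errors list earlier. This is only a sketch; row_names and col_lengths are hypothetical names used to collect what each loop yields.

```python
import pandas as pd

df = pd.DataFrame([('c', 1, 'right'), ('b', 2, 'wrong')],
                  columns=['name', 'count', 'result'])

# iterrows() yields (row label, row as a Series).
row_names = []
for ix, row in df.iterrows():
    row_names.append((ix, row['name']))

# items() yields (column label, column as a Series).
col_lengths = {}
for label, col in df.items():
    col_lengths[label] = len(col)

print(row_names)
print(col_lengths)
```

Row iteration builds a new Series for every row, so for large frames a vectorized operation is usually the better choice where one exists.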
3. Aggregation
In[95]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In[99]: frame.sum()  # column-wise sums; current pandas keeps the dict's column order
Out[99]:
b    10
a     2
dtype: int64
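IV. Sorting
1. Sorting by index
Sorting along the axis index
A Series is sorted by index with sort_index() and by value with sort_values() (the sort() and order() methods of older pandas have been removed).
A DataFrame is sorted by index with sort_index() and by column values with sort_values(by=...); the old DataFrame.sort() has also been removed.
In[73]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])
In[74]: obj.sort_index()
Out[74]:
a    1
b    2
c    3
d    0
dtype: int64
In[78]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In[79]: frame
Out[79]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In[86]: frame.sort_index()
Out[86]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In[87]: frame.sort_index(axis=0)  # explicit row axis, same result; stands in for the removed frame.sort()
Out[87]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3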
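To make the index-versus-value distinction concrete, here is a small sketch using the same obj and frame. It assumes a pandas release where sort_values() is available (0.17 or later); the result names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
by_label = obj.sort_index()                   # order by index labels
by_value = obj.sort_values(ascending=False)   # order by the stored values

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
rows_sorted = frame.sort_index()         # axis=0: sort the row labels
cols_sorted = frame.sort_index(axis=1)   # axis=1: sort the column labels

print(by_label)
print(cols_sorted)
```

All four calls return new objects, so obj and frame keep their original order.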
2. Sorting the column labels (axis=1)
In[89]: frame.sort_index(axis=1, ascending=False)
Out[89]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
3. Sorting a Series by value in place (Series only)
In[90]: obj.sort_values(inplace=True)  # replaces the removed obj.sort()
In[91]: obj
Out[91]:
d    0
a    1
b    2
c    3
dtype: int64
4. Sorting by value
Series:
In[92]: obj = Series([4, 7, -3, 2])
In[94]: obj.sort_values()  # replaces the removed obj.order()
Out[94]:
2   -3
3    2
0    4
1    7
dtype: int64
DataFrame:
In[95]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In[97]: frame.sort_values(by='b')  # replaces the removed frame.sort_index(by='b')
Out[97]:
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
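Sorting by value also extends to several columns. The sketch below is an illustration rather than part of the original post: it sorts the same frame by one key, then by two keys with a different direction for each.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Single key: rows ordered by column 'b', ascending.
by_b = frame.sort_values(by='b')

# Two keys: 'a' ascending first, then 'b' descending within each 'a' group.
by_a_then_b = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(by_b)
print(by_a_then_b)
```

The original row labels travel with the rows, which is why the index of the result is no longer in 0, 1, 2, 3 order.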
V. Deletion
1. Dropping entries from an axis
That is, removing elements of a Series or rows (columns) of a DataFrame, using the object's .drop(labels, axis=0) method:
Dropping one element from a Series:
In[11]: ser = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In[13]: ser.drop('c')
Out[13]:
d    4.5
b    7.2
a   -5.3
dtype: float64
Dropping rows or columns of a DataFrame:
In[17]: df = DataFrame(np.arange(9).reshape(3, 3), index=['a', 'c', 'd'], columns=['oh', 'te', 'ca'])
In[18]: df
Out[18]:
   oh  te  ca
a   0   1   2
c   3   4   5
d   6   7   8
In[19]: df.drop('a')  # drop a row by label
Out[19]:
   oh  te  ca
c   3   4   5
d   6   7   8
In[20]: df.drop(['oh', 'te'], axis=1)  # drop columns
Out[20]:
   ca
a   2
c   5
d   8
.drop() returns a new object; the original object is not modified.
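The copy semantics can be verified with a short sketch. It uses the columns= keyword as an alternative to axis=1, which assumes pandas 0.21 or later; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['oh', 'te', 'ca'])

without_a = df.drop('a')                       # drop a row by label
without_cols = df.drop(columns=['oh', 'te'])   # same effect as axis=1

# df itself still has all three rows and all three columns.
print(df.shape)
print(without_a.index.tolist())
print(without_cols.columns.tolist())
```

To make the removal stick, assign the result back, for example df = df.drop('a').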
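VI. Combining DataFrames
1. Arithmetic (+, -, *, /)
Arithmetic is applied element-wise to entries whose row and column labels match; labels present in only one DataFrame produce NaN.
In[5]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
In[6]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
In[9]: df1 + df2
Out[9]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN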
Passing a fill value
In[11]: df1.add(df2, fill_value=0)
Out[11]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
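The difference between the + operator and add() with fill_value can be compared side by side. This is a minimal sketch that rebuilds df1 and df2 as in the text; plain and filled are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain addition: any label missing from one operand yields NaN.
plain = df1 + df2

# fill_value=0 substitutes 0 for the side that is missing,
# so column 'e' and row 3 keep df2's values.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

A cell stays NaN under fill_value only when it is missing from both operands, which does not happen in this example.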
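2. pandas.merge
pandas.merge joins rows from different DataFrames on one or more keys.
By default merge performs an "inner" join, so the keys in the result are the intersection; the other options are "left", "right", and "outer". An "outer" join takes the union of the keys, combining the effects of the left and right joins. The printed results are from the pandas version used when this post was written; row and column order may differ in newer releases.
Inner join
In[14]: df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In[15]: df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
In[18]: pd.merge(df1, df2)  # or explicitly: pd.merge(df1, df2, on='key')
Out[18]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0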
Outer join
In[19]: pd.merge(df1, df2, how='outer')
Out[19]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
6      3   c    NaN
7    NaN   d      2
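The join types can be compared by looking at which keys survive each one. The sketch below is an illustration: it adds the indicator=True option, which is not used in the original post, to label where each row of the outer join came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

inner = pd.merge(df1, df2, on='key', how='inner')   # keys in both frames
left = pd.merge(df1, df2, on='key', how='left')     # every row of df1
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# The extra _merge column records 'left_only', 'right_only', or 'both'.
print(sorted(set(inner['key'])))
print(sorted(set(outer['key'])))
print(outer[['key', '_merge']])
```

Key 'c' exists only in df1 and key 'd' only in df2, so both appear in the outer join with NaN on the side that has no match.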
Concatenation along an axis
This kind of data combination is known as concatenation, binding, or stacking.
For Series
In[23]: s1 = Series([0, 1], index=['a','b'])
In[24]: s2 = Series([2, 3, 4], index=['c','d','e'])
In[25]: s3 = Series([5, 6], index=['f','g'])
In[26]: pd.concat([s1,s2,s3])
Out[26]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6<