Data Aggregation and Group Operations (notes on Chapter 9 of "Python for Data Analysis")
阿新 • Published 2019-01-17
pandas's groupby facility can compute group statistics and pivot tables, and lets you slice, dice, and summarize a dataset flexibly
GroupBy mechanics
"split-apply-combine"
import numpy as np
from pandas import DataFrame,Series
df=DataFrame({'key1':['a','a','b','b','a'],
              'key2':['one','two','one','two','one'],
              'data1':np.random.randn(5),
              'data2':np.random.randn(5)})
df
data1 | data2 | key1 | key2 | |
---|---|---|---|---|
0 | 1.160760 | 0.360555 | a | one |
1 | -0.992606 | -0.120562 | a | two |
2 | -0.616727 | 0.856179 | b | one |
3 | -1.921879 | -0.690846 | b | two |
4 | -0.458540 | -0.093610 | a | one |
grouped=df['data1'].groupby(df['key1'])
grouped #a SeriesGroupBy object; nothing is computed until you call a method on it
grouped.mean()
key1
a -0.096796
b -1.269303
Name: data1, dtype: float64
#If you pass several arrays at once, you get a different result
means=df['data1'].groupby([df['key1'],df['key2']]).mean()
means #the result now has a hierarchical index
key1 key2
a one 0.351110
two -0.992606
b one -0.616727
two -1.921879
Name: data1, dtype: float64
means.unstack() #pivot the hierarchical index into columns
key2 | one | two |
---|---|---|
key1 | ||
a | 0.351110 | -0.992606 |
b | -0.616727 | -1.921879 |
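The split-apply-combine flow above can be sketched with deterministic, made-up values (the book's random data differs from run to run, so the numbers below are illustrative only):

```python
import pandas as pd

# Made-up values so the results are reproducible (unlike np.random.randn)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
# split rows by (key1, key2), apply mean to each piece, combine into a Series
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
table = means.unstack()  # pivot the inner index level into columns
print(table)
```

Group ('a', 'one') averages rows 0 and 4, so its mean is (1.0 + 5.0) / 2 = 3.0.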
#The group keys can be any arrays of the right length
states=np.array(['Ohio','California','California','Ohio','Ohio'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California 2005 -0.992606
2006 -0.616727
Ohio 2005 -0.380560
2006 -0.458540
Name: data1, dtype: float64
#You can also use column names (strings, numbers, or other Python objects) as the group keys
df.groupby(['key1']).mean()
data1 | data2 | |
---|---|---|
key1 | ||
a | -0.096796 | 0.048794 |
b | -1.269303 | 0.082666 |
df.groupby(['key1','key2']).mean()
data1 | data2 | ||
---|---|---|---|
key1 | key2 | ||
a | one | 0.351110 | 0.133473 |
two | -0.992606 | -0.120562 | |
b | one | -0.616727 | 0.856179 |
two | -1.921879 | -0.690846 |
#Because df['key2'] is not numeric data, it is excluded from the result; by default, all numeric columns are aggregated
#GroupBy's size method returns a Series containing the group sizes
df.groupby(['key1','key2']).size()
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
Iterating over groups
for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
data1 data2 key1 key2
0 1.160760 0.360555 a one
1 -0.992606 -0.120562 a two
4 -0.458540 -0.093610 a one
b
data1 data2 key1 key2
2 -0.616727 0.856179 b one
3 -1.921879 -0.690846 b two
#With multiple keys, the first element of the tuple is itself a tuple of key values
for (k1,k2),group in df.groupby(['key1','key2']):
    print(k1,k2)
    print(group)
a one
data1 data2 key1 key2
0 1.16076 0.360555 a one
4 -0.45854 -0.093610 a one
a two
data1 data2 key1 key2
1 -0.992606 -0.120562 a two
b one
data1 data2 key1 key2
2 -0.616727 0.856179 b one
b two
data1 data2 key1 key2
3 -1.921879 -0.690846 b two
#You can do whatever you like with the group pieces; a common operation is to collect them into a dict
pieces=dict(list(df.groupby('key1')))
pieces['b']
data1 | data2 | key1 | key2 | |
---|---|---|---|---|
2 | -0.616727 | 0.856179 | b | one |
3 | -1.921879 | -0.690846 | b | two |
#groupby groups on axis=0 by default, but you can group on any of the other axes
#For example, we can group the columns here by dtype
df.dtypes
data1 float64
data2 float64
key1 object
key2 object
dtype: object
grouped=df.groupby(df.dtypes,axis=1)
dict(list(grouped))
{dtype('float64'): data1 data2
0 1.160760 0.360555
1 -0.992606 -0.120562
2 -0.616727 0.856179
3 -1.921879 -0.690846
4 -0.458540 -0.093610, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
Selecting a column or subset of columns
#Indexing a GroupBy object created from a DataFrame with a column name (a string) or an
#array of column names selects those columns for aggregation.
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
#The two lines above are syntactic sugar for the following
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
df.groupby(['key1','key2'])[['data2']].mean()
data2 | ||
---|---|---|
key1 | key2 | |
a | one | 0.133473 |
two | -0.120562 | |
b | one | 0.856179 |
two | -0.690846 |
s_grouped=df.groupby(['key1','key2'])['data2']
s_grouped
s_grouped.mean()
key1 key2
a one 0.133473
two -0.120562
b one 0.856179
two -0.690846
Name: data2, dtype: float64
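The sugar/longhand equivalence can be checked directly; a small sketch with fixed, made-up values (not the book's data):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1.0, 2.0, 3.0, 4.0],
                   'data2': [10.0, 20.0, 30.0, 40.0]})
sugar = df.groupby('key1')['data1'].mean()           # column selection on the GroupBy
longhand = df['data1'].groupby(df['key1']).mean()    # select the column first, then group
frame_result = df.groupby('key1')[['data2']].mean()  # double brackets give a DataFrame back
```

Single brackets yield a grouped Series; double brackets yield a grouped DataFrame.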
Grouping with dicts and Series
Grouping information can exist in forms other than arrays
people=DataFrame(np.random.randn(5,5),
columns=['a','b','c','d','e'],
index=['Joe','Steve','Wes','Jim','Travis'])
people.loc['Wes',['b','c']]=np.nan #add a few NA values
people
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | 0.246182 | 0.556642 | 0.530663 | 0.072457 | 0.769930 |
Steve | -0.735543 | -0.046147 | 0.092191 | 0.659066 | 0.563112 |
Wes | -0.671631 | NaN | NaN | 0.351555 | 0.320022 |
Jim | 0.730654 | -0.554864 | -0.013574 | -0.238270 | -1.276084 |
Travis | -0.246124 | 0.494404 | 0.782177 | -1.856125 | 0.838289 |
#Suppose we know a grouping for the columns and want to sum the columns by group
mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
#Just pass the dict above to groupby
by_column=people.groupby(mapping,axis=1)
by_column.sum()
blue | red | |
---|---|---|
Joe | 0.603120 | 1.572753 |
Steve | 0.751258 | -0.218579 |
Wes | 0.351555 | -0.351610 |
Jim | -0.251844 | -1.100294 |
Travis | -1.073948 | 1.086570 |
#Series has the same functionality; it can be viewed as a fixed-size mapping. For the example
#above, if we use a Series as the group key, pandas checks it to make sure its index is aligned with the axis being grouped
map_series=Series(mapping)
map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
people.groupby(map_series,axis=1).count()
blue | red | |
---|---|---|
Joe | 2 | 3 |
Steve | 2 | 3 |
Wes | 1 | 2 |
Jim | 2 | 3 |
Travis | 2 | 3 |
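The same dict-based mechanism also works when grouping along the row index. A sketch with made-up labels and values (not the People frame above); note that mapping keys with no matching label, like 'Travis' here, are simply ignored:

```python
import pandas as pd

scores = pd.Series([1.0, 2.0, 3.0, 4.0],
                   index=['Joe', 'Steve', 'Wes', 'Jim'])
# Map each index label to a group name; 'Travis' is not in the index and is ignored
mapping = {'Joe': 'east', 'Steve': 'west', 'Wes': 'east',
           'Jim': 'west', 'Travis': 'west'}
totals = scores.groupby(mapping).sum()
```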
Grouping with functions
#Any function passed as a group key is called once per index value, and its return value is used as the group name
#Continuing the previous example: the index values are people's names, and we want to group by the length of the names
people.groupby(len).sum()
a | b | c | d | e | |
---|---|---|---|---|---|
3 | 0.305204 | 0.001778 | 0.517089 | 0.185742 | -0.186132 |
5 | -0.735543 | -0.046147 | 0.092191 | 0.659066 | 0.563112 |
6 | -0.246124 | 0.494404 | 0.782177 | -1.856125 | 0.838289 |
#Mixing functions with arrays, lists, dicts, or Series is no problem,
# since everything gets converted to an array internally
key_list=['one','one','one','two','two']
people.groupby([len,key_list]).min()
a | b | c | d | e | ||
---|---|---|---|---|---|---|
3 | one | -0.671631 | 0.556642 | 0.530663 | 0.072457 | 0.320022 |
two | 0.730654 | -0.554864 | -0.013574 | -0.238270 | -1.276084 | |
5 | one | -0.735543 | -0.046147 | 0.092191 | 0.659066 | 0.563112 |
6 | two | -0.246124 | 0.494404 | 0.782177 | -1.856125 | 0.838289 |
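To make the function-as-key behavior concrete, here is a small deterministic sketch (values made up): len is called on each index label, its result becomes the group key, and a function can be mixed with a list in the same call:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=['Joe', 'Steve', 'Wes', 'Jim'])
by_len = s.groupby(len).sum()  # len('Joe') == 3, len('Steve') == 5, ...
# Mixing a function with a list: the keys become (length, label) pairs
mixed = s.groupby([len, ['one', 'one', 'two', 'two']]).min()
```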
Grouping by index levels
#A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels
#of the index. To do this, pass the level number or name via the level keyword
import pandas as pd
columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
[1,3,5,1,3]],names=['city','tenor'])
hier_df=DataFrame(np.random.randn(4,5),columns=columns)
hier_df
city | US | JP | |||
---|---|---|---|---|---|
tenor | 1 | 3 | 5 | 1 | 3 |
0 | -0.729876 | -0.490356 | 1.200420 | -1.594183 | -0.571277 |
1 | -1.336457 | -2.033271 | -0.356616 | 0.915616 | -0.234895 |
2 | -0.065620 | -0.102485 | 0.605027 | -0.518972 | 1.190415 |
3 | 0.985298 | 0.923531 | 1.784194 | 1.815795 | -1.261107 |
hier_df.groupby(level='city',axis=1).count()
city | JP | US |
---|---|---|
0 | 2 | 3 |
1 | 2 | 3 |
2 | 2 | 3 |
3 | 2 | 3 |
Data aggregation
#Aggregation here means any data transformation that produces scalar values from arrays, such as mean, count, min, and sum.
#You can also define your own aggregation functions
df
data1 | data2 | key1 | key2 | |
---|---|---|---|---|
0 | 1.160760 | 0.360555 | a | one |
1 | -0.992606 | -0.120562 | a | two |
2 | -0.616727 | 0.856179 | b | one |
3 | -1.921879 | -0.690846 | b | two |
4 | -0.458540 | -0.093610 | a | one |
grouped=df.groupby('key1')
grouped['data1'].quantile(0.9) #quantile here is a Series method
key1
a 0.836900
b -0.747242
Name: data1, dtype: float64
#To use your own aggregation function, just pass it to the aggregate or agg method
def peak_to_peak(arr):
    return arr.max()-arr.min()
grouped.agg(peak_to_peak)
data1 | data2 | |
---|---|---|
key1 | ||
a | 2.153366 | 0.481117 |
b | 1.305152 | 1.547025 |
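With fixed inputs the custom aggregation is easy to verify by hand; a sketch (values made up):

```python
import pandas as pd

def peak_to_peak(arr):
    # any function that reduces an array to a scalar can be passed to agg
    return arr.max() - arr.min()

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1.0, 5.0, 2.0, 10.0]})
spread = df.groupby('key1')['data1'].agg(peak_to_peak)
```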
#Note that some methods (such as describe) also work here, even though strictly speaking they are not aggregations
grouped.describe()
data1 | data2 | ||
---|---|---|---|
key1 | |||
a | count | 3.000000 | 3.000000 |
mean | -0.096796 | 0.048794 | |
std | 1.121334 | 0.270329 | |
min | -0.992606 | -0.120562 | |
25% | -0.725573 | -0.107086 | |
50% | -0.458540 | -0.093610 | |
75% | 0.351110 | 0.133473 | |
max | 1.160760 | 0.360555 | |
b | count | 2.000000 | 2.000000 |
mean | -1.269303 | 0.082666 | |
std | 0.922882 | 1.093912 | |
min | -1.921879 | -0.690846 | |
25% | -1.595591 | -0.304090 | |
50% | -1.269303 | 0.082666 | |
75% | -0.943015 | 0.469422 | |
max | -0.616727 | 0.856179 |
#Custom aggregation functions are generally much slower than the optimized GroupBy methods (count, sum, mean, median, std, var
# (unbiased, n-1 denominator), min, max, prod (product of non-NA values), first and last (first and last non-NA values))
#To demonstrate some more advanced aggregation features, we will use a dataset on restaurant tips
tips=pd.read_csv('ch08/tips.csv')
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
#Add a 'tip percentage of the total bill' column
tips['tip_pct']=tips['tip']/tips['total_bill']
tips[:6]
total_bill | tip | sex | smoker | day | time | size | tip_pct | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | 0.186240 |
Column-wise and multiple function application
#Group tips by sex and smoker
grouped=tips.groupby(['sex','smoker'])
#Function names can be passed to agg as strings
grouped_pct=grouped['tip_pct']
grouped_pct.agg('mean')
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
Name: tip_pct, dtype: float64
#If you pass a list of functions or function names, the columns of the resulting DataFrame are named after the functions
grouped_pct.agg(['mean','std',peak_to_peak])
mean | std | peak_to_peak | ||
---|---|---|---|---|
sex | smoker | |||
Female | No | 0.156921 | 0.036421 | 0.195876 |
Yes | 0.182150 | 0.071595 | 0.360233 | |
Male | No | 0.160669 | 0.041849 | 0.220186 |
Yes | 0.152771 | 0.090588 | 0.674707 |
#If you pass a list of (name,function) tuples, the first element of each tuple is used as the
#DataFrame column name (you can think of such a list of 2-tuples as an ordered mapping)
grouped_pct.agg([('foo','mean'),('bar',np.std)])
foo | bar | ||
---|---|---|---|
sex | smoker | ||
Female | No | 0.156921 | 0.036421 |
Yes | 0.182150 | 0.071595 | |
Male | No | 0.160669 | 0.041849 |
Yes | 0.152771 | 0.090588 |
#With a DataFrame you can specify a list of functions to apply to all of the columns, or different functions per column.
#Suppose we want to compute the same three statistics for the tip_pct and total_bill columns
functions=['count','mean','max']
result=grouped[['tip_pct','total_bill']].agg(functions)
result
tip_pct | total_bill | ||||||
---|---|---|---|---|---|---|---|
count | mean | max | count | mean | max | ||
sex | smoker | ||||||
Female | No | 54 | 0.156921 | 0.252672 | 54 | 18.105185 | 35.83 |
Yes | 33 | 0.182150 | 0.416667 | 33 | 17.977879 | 44.30 | |
Male | No | 97 | 0.160669 | 0.291990 | 97 | 19.791237 | 48.33 |
Yes | 60 | 0.152771 | 0.710345 | 60 | 22.284500 | 50.81 |
#The resulting DataFrame has hierarchical columns, the same as if you aggregated each column separately
#and used concat to glue the results together (with the column names as the keys argument)
result['tip_pct']
count | mean | max | ||
---|---|---|---|---|
sex | smoker | |||
Female | No | 54 | 0.156921 | 0.252672 |
Yes | 33 | 0.182150 | 0.416667 | |
Male | No | 97 | 0.160669 | 0.291990 |
Yes | 60 | 0.152771 | 0.710345 |
#As before, a list of tuples with custom names can be passed
ftuples=[('Durchschnitt','mean'),('Abweichung',np.var)]
grouped[['tip_pct','total_bill']].agg(ftuples)
tip_pct | total_bill | ||||
---|---|---|---|---|---|
Durchschnitt | Abweichung | Durchschnitt | Abweichung | ||
sex | smoker | ||||
Female | No | 0.156921 | 0.001327 | 18.105185 | 53.092422 |
Yes | 0.182150 | 0.005126 | 17.977879 | 84.451517 | |
Male | No | 0.160669 | 0.001751 | 19.791237 | 76.152961 |
Yes | 0.152771 | 0.008206 | 22.284500 | 98.244673 |
#Now suppose you want to apply different functions to different columns. Pass agg a dict mapping column names to functions
grouped.agg({'tip':np.max,'size':'sum'})
size | tip | ||
---|---|---|---|
sex | smoker | ||
Female | No | 140 | 5.2 |
Yes | 74 | 6.5 | |
Male | No | 263 | 9.0 |
Yes | 150 | 10.0 |
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
size | tip_pct | |||||
---|---|---|---|---|---|---|
sum | min | max | mean | std | ||
sex | smoker | |||||
Female | No | 140 | 0.056797 | 0.252672 | 0.156921 | 0.036421 |
Yes | 74 | 0.056433 | 0.416667 | 0.182150 | 0.071595 | |
Male | No | 263 | 0.071804 | 0.291990 | 0.160669 | 0.041849 |
Yes | 150 | 0.035638 | 0.710345 | 0.152771 | 0.090588 |
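A deterministic sketch of the dict form, using made-up tip-like values rather than the real dataset:

```python
import pandas as pd

tips = pd.DataFrame({'tip': [1.0, 2.0, 3.0, 4.0],
                     'size': [2, 3, 2, 4],
                     'smoker': ['No', 'No', 'Yes', 'Yes']})
# One function per column: max of tip, sum of size, applied group by group
result = tips.groupby('smoker').agg({'tip': 'max', 'size': 'sum'})
```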
Returning aggregated data in "unindexed" form
#Pass as_index=False to groupby to disable the hierarchical index formed from the group keys
tips.groupby(['sex','smoker'],as_index=False).mean()
sex | smoker | total_bill | tip | size | tip_pct | |
---|---|---|---|---|---|---|
0 | Female | No | 18.105185 | 2.773519 | 2.592593 | 0.156921 |
1 | Female | Yes | 17.977879 | 2.931515 | 2.242424 | 0.182150 |
2 | Male | No | 19.791237 | 3.113402 | 2.711340 | 0.160669 |
3 | Male | Yes | 22.284500 | 3.051167 | 2.500000 | 0.152771 |
#Of course, calling reset_index on the result achieves the same thing
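Both routes produce the same flat frame; a sketch with fixed values:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})
flat = df.groupby('key', as_index=False).mean()     # keys stay as a regular column
via_reset = df.groupby('key').mean().reset_index()  # equivalent result
```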
Group-wise operations and transformations
Aggregation is only one kind of group operation: it accepts functions that reduce a one-dimensional array to a scalar. This section introduces the transform and apply methods, which can perform many other kinds of group operations
#Task: add a column holding the mean of each index group. One way is to aggregate first and then merge
df
data1 | data2 | key1 | key2 | |
---|---|---|---|---|
0 | 1.160760 | 0.360555 | a | one |
1 | -0.992606 | -0.120562 | a | two |
2 | -0.616727 | 0.856179 | b | one |
3 | -1.921879 | -0.690846 | b | two |
4 | -0.458540 | -0.093610 | a | one |
k1_means=df.groupby('key1').mean().add_prefix('mean_')
k1_means
mean_data1 | mean_data2 | |
---|---|---|
key1 | ||
a | -0.096796 | 0.048794 |
b | -1.269303 | 0.082666 |
pd.merge(df,k1_means,left_on='key1',right_index=True)
data1 | data2 | key1 | key2 | mean_data1 | mean_data2 | |
---|---|---|---|---|---|---|
0 | 1.160760 | 0.360555 | a | one | -0.096796 | 0.048794 |
1 | -0.992606 | -0.120562 | a | two | -0.096796 | 0.048794 |
4 | -0.458540 | -0.093610 | a | one | -0.096796 | 0.048794 |
2 | -0.616727 | 0.856179 | b | one | -1.269303 | 0.082666 |
3 | -1.921879 | -0.690846 | b | two | -1.269303 | 0.082666 |
#This works, but it is inflexible. The process can be viewed as transforming the two data columns
# with np.mean. Let's use the People DataFrame from earlier as an example
key=['one','two','one','two','one']
people.groupby(key).mean()
a | b | c | d | e | |
---|---|---|---|---|---|
one | -0.223858 | 0.525523 | 0.656420 | -0.477371 | 0.642747 |
two | -0.002445 | -0.300505 | 0.039309 | 0.210398 | -0.356486 |
people.groupby(key).transform(np.mean)
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | -0.223858 | 0.525523 | 0.656420 | -0.477371 | 0.642747 |
Steve | -0.002445 | -0.300505 | 0.039309 | 0.210398 | -0.356486 |
Wes | -0.223858 | 0.525523 | 0.656420 | -0.477371 | 0.642747 |
Jim | -0.002445 | -0.300505 | 0.039309 | 0.210398 | -0.356486 |
Travis | -0.223858 | 0.525523 | 0.656420 | -0.477371 | 0.642747 |
#transform applies a function to each group and places the results in the appropriate locations. If each group
#produces a scalar value, that value is broadcast.
#For example, suppose the task is to subtract the group mean from each value. Create a demeaning function,
# then pass it to transform
def demean(arr):
    return arr-arr.mean()
demeaned=people.groupby(key).transform(demean)
demeaned
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | 0.470039 | 0.031119 | -0.125757 | 0.549828 | 0.127183 |
Steve | -0.733099 | 0.254358 | 0.052883 | 0.448668 | 0.919598 |
Wes | -0.447773 | NaN | NaN | 0.828926 | -0.322725 |
Jim | 0.733099 | -0.254358 | -0.052883 | -0.448668 | -0.919598 |
Travis | -0.022266 | -0.031119 | 0.125757 | -1.378754 | 0.195542 |
#Check that the group means of demeaned are now zero
demeaned.groupby(key).mean()
a | b | c | d | e | |
---|---|---|---|---|---|
one | 1.850372e-17 | -5.551115e-17 | 0.000000e+00 | 0.000000e+00 | 1.110223e-16 |
two | 0.000000e+00 | 2.775558e-17 | -3.469447e-18 | -2.775558e-17 | 0.000000e+00 |
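Because transform aligns its output with the original rows, demeaning can be verified exactly with fixed inputs; a sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
key = ['one', 'two', 'one', 'two']
# transform returns a result the same size as s, aligned row for row
demeaned = s.groupby(key).transform(lambda arr: arr - arr.mean())
check = demeaned.groupby(key).mean()  # should be (numerically) zero per group
```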
apply: general split-apply-combine
Like aggregate, transform is a specialized function with rigid requirements: the passed function can produce only two kinds of results, either a scalar value that can be broadcast
(like np.mean) or a transformed array of the same size. The most general GroupBy method is apply, which this section focuses on.
apply splits the object being processed into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.
#Continuing with the tipping dataset: suppose we want to select the top five tip_pct values by group.
# First, write a function that finds the largest values in a given column and selects the rows containing them
def top(df,n=5,column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
total_bill | tip | sex | smoker | day | time | size | tip_pct | |
---|---|---|---|---|---|---|---|---|
109 | 14.31 | 4.00 | Female | Yes | Sat | Dinner | 2 | 0.279525 |
183 | 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 | 0.280535 |
232 | 11.61 | 3.39 | Male | No | Sat | Dinner | 2 | 0.291990 |
67 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 0.325733 |
178 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 | 0.416667 |
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | 0.710345 |
#Now, if we group by smoker and call apply with this function, we get:
tips.groupby('smoker').apply(top)
total_bill | tip | sex | smoker | day | time | size | tip_pct | ||
---|---|---|---|---|---|---|---|---|---|
smoker | |||||||||
No | 88 | 24.71 | 5.85 | Male | No | Thur | Lunch | 2 | 0.236746 |
185 | 20.69 | 5.00 | Male | No | Sun | Dinner | 5 | 0.241663 | |
51 | 10.29 | 2.60 | Female | No | Sun | Dinner | 2 | 0.252672 | |
149 | 7.51 | 2.00 | Male | No | Thur | Lunch | 2 | 0.266312 | |
232 | 11.61 | 3.39 | Male | No | Sat | Dinner | 2 | 0.291990 | |
Yes | 109 | 14.31 | 4.00 | Female | Yes | Sat | Dinner | 2 | 0.279525 |
183 | 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 | 0.280535 | |
67 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 0.325733 | |
178 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 | 0.416667 | |
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | 0.710345 |
#The top function is called on each piece of the DataFrame, and the results are glued together with pandas.concat,
#labeled with the group names. The final result therefore has a hierarchical index whose inner level contains index values from the original DataFrame
# If the function passed to apply takes other arguments or keywords, you can pass them after the function name
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
total_bill | tip | sex | smoker | day | time | size | tip_pct | |||
---|---|---|---|---|---|---|---|---|---|---|
smoker | day | |||||||||
No | Fri | 94 | 22.75 | 3.25 | Female | No | Fri | Dinner | 2 | 0.142857 |
Sat | 212 | 48.33 | 9.00 | Male | No | Sat | Dinner | 4 | 0.186220 | |
Sun | 156 | 48.17 | 5.00 | Male | No | Sun | Dinner | 6 | 0.103799 | |
Thur | 142 | 41.19 | 5.00 | Male | No | Thur | Lunch | 5 | 0.121389 | |
Yes | Fri | 95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 | 0.117750 |
Sat | 170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 0.196812 | |
Sun | 182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 0.077178 | |
Thur | 197 | 43.11 | 5.00 | Female | Yes | Thur | Lunch | 4 | 0.115982 |
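A reduced sketch of the same pattern with made-up rows, small enough to check by hand (top 1 per group):

```python
import pandas as pd

def top(df, n=2, column='val'):
    # rows with the n largest values in the given column
    return df.sort_values(by=column)[-n:]

frame = pd.DataFrame({'grp': ['x', 'x', 'x', 'y', 'y'],
                      'val': [3.0, 1.0, 2.0, 5.0, 4.0]})
best = frame.groupby('grp').apply(top, n=1)  # one row per group
```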
Beyond these basics, how much mileage you get out of apply depends on your creativity: what the passed function does is entirely up to you, as long as it returns a pandas object or a scalar.
#We called describe on a GroupBy object earlier
result=tips.groupby('smoker')['tip_pct'].describe()
result
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
Name: tip_pct, dtype: float64
result.unstack('smoker')
smoker | No | Yes |
---|---|---|
count | 151.000000 | 93.000000 |
mean | 0.159328 | 0.163196 |
std | 0.039910 | 0.085119 |
min | 0.056797 | 0.035638 |
25% | 0.136906 | 0.106771 |
50% | 0.155625 | 0.153846 |
75% | 0.185014 | 0.195059 |
max | 0.291990 | 0.710345 |
#Inside GroupBy, when you invoke a method like describe, it is really just shorthand for the two lines below
f=lambda x:x.describe()
grouped.apply(f)
Suppressing the group keys
In the examples above, the group keys combine with the indexes of the original object to form a hierarchical index in the result. You can disable this by passing group_keys=False to groupby
tips.groupby('smoker',group_keys=False).apply(top)
total_bill | tip | sex | smoker | day | time | size | tip_pct | |
---|---|---|---|---|---|---|---|---|
88 | 24.71 | 5.85 | Male | No | Thur | Lunch | 2 | 0.236746 |
185 | 20.69 | 5.00 | Male | No | Sun | Dinner | 5 | 0.241663 |
51 | 10.29 | 2.60 | Female | No | Sun | Dinner | 2 | 0.252672 |
149 | 7.51 | 2.00 | Male | No | Thur | Lunch | 2 | 0.266312 |
232 | 11.61 | 3.39 | Male | No | Sat | Dinner | 2 | 0.291990 |
109 | 14.31 | 4.00 | Female | Yes | Sat | Dinner | 2 | 0.279525 |
183 | 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 | 0.280535 |
67 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 0.325733 |
178 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 | 0.416667 |
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | 0.710345 |
Quantile and bucket analysis
pandas has some tools (notably cut and qcut) for slicing data into buckets with bins of your choosing or by sample quantiles. Combining
these functions with groupby makes it easy to perform bucket or quantile analysis on a dataset.
#Using a simple random dataset as an example, slice it into equal-length buckets with cut
frame=DataFrame({'data1':np.random.randn(100),
'data2':np.random.randn(100)})
frame.head()
data1 | data2 | |
---|---|---|
0 | 1.421652 | -0.133642 |
1 | 1.663593 | 1.570306 |
2 | 0.072588 | 1.445291 |
3 | -1.117481 | 0.485219 |
4 | 0.673224 | -0.565916 |
factor=pd.cut(frame.data1,4)
factor[:10]
0 (0.592, 1.913]
1 (0.592, 1.913]
2 (-0.73, 0.592]
3 (-2.0564, -0.73]
4 (0.592, 1.913]
5 (0.592, 1.913]
6 (0.592, 1.913]
7 (-0.73, 0.592]
8 (-0.73, 0.592]
9 (-2.0564, -0.73]
Name: data1, dtype: category
Categories (4, object): [(-2.0564, -0.73] < (-0.73, 0.592] < (0.592, 1.913] < ...]
#The Categorical object returned by cut can be passed directly to groupby.
def get_stats(group):
    return {'min':group.min(),'max':group.max(),
            'count':group.count(),'mean':group.mean()}
grouped=frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()
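The example above depends on random data; here is a deterministic variant, with fixed made-up values and agg given a list of statistic names instead of a dict-returning function (the more direct spelling in recent pandas):

```python
import pandas as pd

frame = pd.DataFrame({'data1': [-2.0, -1.0, 0.0, 1.0, 2.0],
                      'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})
factor = pd.cut(frame['data1'], 2)  # two equal-width buckets over [-2, 2]
# per-bucket statistics of data2; observed=False keeps empty buckets too
stats = frame['data2'].groupby(factor, observed=False).agg(['count', 'min', 'max', 'mean'])
```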