
DataFrame Group-wise Operations and Transformations

Contents

Introduction

Suppressing the group keys

Introduction

Suppose we want to add columns to a DataFrame that hold the group mean for each key. One way to do this is to aggregate first and then merge.

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> k1_means = df.groupby('key1').mean().add_prefix('mean_')
>>> k1_means
      mean_data1  mean_data2
key1                        
a      -0.380460   -0.332537
b      -0.314586   -0.605574
>>> pd.merge(df,k1_means,left_on='key1',right_index=True)
      data1     data2 key1 key2  mean_data1  mean_data2
0 -0.291328  0.257737    a  one   -0.380460   -0.332537
1 -1.390843 -1.081238    a  two   -0.380460   -0.332537
4  0.540790 -0.174112    a  one   -0.380460   -0.332537
2  0.574857  0.202979    b  one   -0.314586   -0.605574
3 -1.204029 -1.414127    b  two   -0.314586   -0.605574
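The aggregate-then-merge approach can be tried on a frame small enough to check by hand. The sketch below does not reproduce the df from the session above; its values are made up for illustration, and the numeric columns are selected explicitly before calling mean().

```python
import pandas as pd

# Hypothetical data, chosen so the group means are easy to verify by hand.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Step 1: aggregate. One row per key, columns renamed with a mean_ prefix.
k1_means = df.groupby("key1")[["data1", "data2"]].mean().add_prefix("mean_")

# Step 2: merge the per-group means back onto every original row.
merged = pd.merge(df, k1_means, left_on="key1", right_index=True)
print(merged.sort_index())
```

Every row with key1 equal to 'a' receives the same mean_data1 and mean_data2, which is the value that transform produces directly in the next part.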

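This time we use the transform method on the GroupBy object.

transform applies a function to each group and then places the results in the appropriate positions, so the output has the same index as the input.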

>>> people = DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
>>> key = ['one', 'two', 'one', 'two', 'one']
>>> people.groupby(key).mean()
            a         b         c         d         e
one  0.684081  0.110111 -0.122685 -0.392944  0.676586
two  0.295614 -0.488849  0.111023 -0.452018 -0.593795
>>> people.groupby(key).transform(np.mean)
               a         b         c         d         e
Joe     0.684081  0.110111 -0.122685 -0.392944  0.676586
Steve   0.295614 -0.488849  0.111023 -0.452018 -0.593795
Wes     0.684081  0.110111 -0.122685 -0.392944  0.676586
Jim     0.295614 -0.488849  0.111023 -0.452018 -0.593795
Travis  0.684081  0.110111 -0.122685 -0.392944  0.676586
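To see the difference in shape between aggregating and transforming, here is a minimal sketch with made-up numbers. It passes the string 'mean' instead of np.mean, since some newer pandas releases warn when a NumPy function is passed to transform; treat that choice as an assumption about your installed version.

```python
import pandas as pd

# Hypothetical values; the index and key mirror the session above.
people = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 8.0], "b": [10.0, 20.0, 30.0, 40.0, 80.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
key = ["one", "two", "one", "two", "one"]

# Aggregation: one row per group.
group_means = people.groupby(key).mean()

# Transformation: the group mean is broadcast back to every original row.
broadcast = people.groupby(key).transform("mean")

print(group_means.shape, broadcast.shape)
print(broadcast)
```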

Suppose you want to subtract the mean from each group. To do this, we first write a demeaning function and then pass it to transform.

>>> def demean(arr):
...     return arr - arr.mean()
... 
>>> demeaned = people.groupby(key).transform(demean)
>>> demeaned
               a         b         c         d         e
Joe    -0.779960  0.893851 -1.448675 -0.091887 -0.162785
Steve  -0.323736  0.072072  0.659981 -0.131960 -0.498387
Wes     0.305050 -1.817776  0.450697 -0.454107 -0.952844
Jim     0.323736 -0.072072 -0.659981  0.131960  0.498387
Travis  0.474909  0.923925  0.997978  0.545994  1.115629

You can check that the group means of demeaned are now zero (up to floating-point error).
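One way to run that check is to group the demeaned frame by the same key and take the mean again. This is a sketch with freshly generated random data (the seed is arbitrary), so the numbers will not match the session above.

```python
import numpy as np
import pandas as pd

def demean(arr):
    # Subtract the group's own mean from every value in the group.
    return arr - arr.mean()

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
people = pd.DataFrame(
    rng.standard_normal((5, 5)),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
key = ["one", "two", "one", "two", "one"]

demeaned = people.groupby(key).transform(demean)

# Group means of the demeaned data; expect values on the order of 1e-17.
check = demeaned.groupby(key).mean()
print(np.allclose(check.to_numpy(), 0.0))
```

Comparing with np.allclose rather than == 0 avoids false alarms from rounding error.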

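apply: General split-apply-combine

Suppose you want to select the five largest tip_pct values within each group. First, write a function that returns the rows with the largest values in a given column.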

>>> def top(df,n=5,columns='tip_pct'):
...     return df.sort_values(by=columns)[-n:]
>>> top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345

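If we group by smoker and day and call apply with this function, top is called on each piece of the DataFrame, and the results are then glued together with pandas.concat.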

>>> tips.groupby(['smoker','day']).apply(top)
                 total_bill   tip smoker   day    time  size   tip_pct
smoker day                                                            
No     Fri  99        12.46  1.50     No   Fri  Dinner     2  0.120385
            94        22.75  3.25     No   Fri  Dinner     2  0.142857
            91        22.49  3.50     No   Fri  Dinner     2  0.155625
            223       15.98  3.00     No   Fri   Lunch     3  0.187735
       Sat  228       13.28  2.72     No   Sat  Dinner     2  0.204819
            108       18.24  3.76     No   Sat  Dinner     2  0.206140
            110       14.00  3.00     No   Sat  Dinner     2  0.214286
            20        17.92  4.08     No   Sat  Dinner     2  0.227679
            232       11.61  3.39     No   Sat  Dinner     2  0.291990
       Sun  46        22.23  5.00     No   Sun  Dinner     2  0.224921
            17        16.29  3.71     No   Sun  Dinner     3  0.227747
            6          8.77  2.00     No   Sun  Dinner     2  0.228050
            185       20.69  5.00     No   Sun  Dinner     5  0.241663
            51        10.29  2.60     No   Sun  Dinner     2  0.252672
       Thur 81        16.66  3.40     No  Thur   Lunch     2  0.204082
            139       13.16  2.75     No  Thur   Lunch     2  0.208967
            87        18.28  4.00     No  Thur   Lunch     2  0.218818
            88        24.71  5.85     No  Thur   Lunch     2  0.236746
            149        7.51  2.00     No  Thur   Lunch     2  0.266312
Yes    Fri  226       10.09  2.00    Yes   Fri   Lunch     2  0.198216
            100       11.35  2.50    Yes   Fri  Dinner     2  0.220264
            222        8.58  1.92    Yes   Fri   Lunch     1  0.223776
            221       13.42  3.48    Yes   Fri   Lunch     2  0.259314
            93        16.32  4.30    Yes   Fri  Dinner     2  0.263480
       Sat  171       15.81  3.16    Yes   Sat  Dinner     2  0.199873
            63        18.29  3.76    Yes   Sat  Dinner     4  0.205577
            214       28.17  6.50    Yes   Sat  Dinner     3  0.230742
            109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
            67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       Sun  174       16.82  4.00    Yes   Sun  Dinner     2  0.237812
            181       23.33  5.65    Yes   Sun  Dinner     2  0.242177
            183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
            178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
            172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
       Thur 204       20.53  4.00    Yes  Thur   Lunch     4  0.194837
            205       16.47  3.23    Yes  Thur   Lunch     3  0.196114
            191       19.81  4.19    Yes  Thur   Lunch     2  0.211509
            200       18.71  4.00    Yes  Thur   Lunch     3  0.213789
            194       16.58  4.00    Yes  Thur   Lunch     2  0.241255
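The same pattern can be followed on a six-row frame. The sketch below uses invented data, not the tips dataset, and selects the columns to pass to top explicitly, because how apply treats the grouping column differs between pandas versions. The parameter name column is my own.

```python
import pandas as pd

# Invented stand-in for the tips data.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 24.0, 36.0],
    "tip": [1.0, 3.0, 6.0, 3.0, 3.0, 3.6],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

def top(df, n=5, column="tip_pct"):
    # Sort ascending and keep the last n rows, i.e. the n largest values.
    return df.sort_values(by=column)[-n:]

# top runs once per smoker group; the pieces are concatenated afterwards.
result = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(top, n=2)
print(result)
```

The result carries a two-level index: the group key on the outside and the original row label on the inside.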

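Suppressing the group keys

In the examples above, the result has a hierarchical index formed from the group keys together with the index of each piece of the original object. Passing group_keys=False to groupby disables this.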

>>> tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
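A side-by-side comparison makes the effect on the index easier to see. As before, this is a sketch on invented data with an explicit column selection, not a rerun of the tips session.

```python
import pandas as pd

tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 24.0, 36.0],
    "tip": [1.0, 3.0, 6.0, 3.0, 3.0, 3.6],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

def top(df, n=5, column="tip_pct"):
    return df.sort_values(by=column)[-n:]

cols = ["total_bill", "tip_pct"]

# Default: the group key becomes the outer level of the result index.
keyed = tips.groupby("smoker")[cols].apply(top, n=2)

# group_keys=False: only the original row labels remain.
flat = tips.groupby("smoker", group_keys=False)[cols].apply(top, n=2)

print(keyed.index.nlevels, flat.index.nlevels)
print(list(flat.index))
```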

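Quantile and bucket analysis

pd.cut slices data into equal-length buckets, while pd.qcut slices it into equal-size buckets based on sample quantiles. The object either one returns can be passed directly to groupby.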

>>> frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
>>> factor = pd.cut(frame.data1,4)
>>> factor[:10]
0     (-1.6, -0.026]
1     (-1.6, -0.026]
2     (1.548, 3.123]
3     (-1.6, -0.026]
4    (-0.026, 1.548]
5    (-0.026, 1.548]
6     (-1.6, -0.026]
7    (-0.026, 1.548]
8    (-0.026, 1.548]
9     (-1.6, -0.026]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.181, -1.6] < (-1.6, -0.026] < (-0.026, 1.548] <
                                    (1.548, 3.123]]
>>> def get_stats(group):
...     return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
... 
>>> grouped = frame.data2.groupby(factor)
>>> grouped.apply(get_stats).unstack()
                 count       max      mean       min
data1                                               
(-3.181, -1.6]    47.0  1.560586  0.067778 -3.094980
(-1.6, -0.026]   431.0  2.920156 -0.031899 -2.778233
(-0.026, 1.548]  460.0  2.339734 -0.057856 -2.739892
(1.548, 3.123]    62.0  1.728365 -0.143399 -2.449822
>>> grouping = pd.qcut(frame.data1,10,labels=False)
>>> grouped = frame.data2.groupby(grouping)
>>> grouped.apply(get_stats).unstack()
       count       max      mean       min
data1                                     
0      100.0  2.248114  0.069002 -3.094980
1      100.0  1.923236 -0.237785 -2.743977
2      100.0  2.920156  0.115480 -2.778233
3      100.0  2.481512 -0.060810 -2.581747
4      100.0  2.793314  0.030760 -2.595131
5      100.0  2.337741 -0.142877 -2.332392
6      100.0  2.339734 -0.046468 -2.589412
7      100.0  2.275533 -0.008744 -2.588843
8      100.0  1.901215 -0.095933 -2.739892
9      100.0  2.229256 -0.083296 -2.449822
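The contrast between the two bucketing functions shows up clearly in the per-bucket counts. The sketch below swaps the apply/unstack approach from the session for agg with a list of function names, which yields the same kind of table; the seed and the choice of four buckets are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # arbitrary seed
frame = pd.DataFrame({
    "data1": rng.standard_normal(1000),
    "data2": rng.standard_normal(1000),
})

# Equal-width buckets: counts vary, the tails hold few points.
width_bins = pd.cut(frame["data1"], 4)
width_counts = frame["data2"].groupby(width_bins, observed=False).count()

# Equal-size buckets: quartile numbers 0..3, 250 points each.
size_bins = pd.qcut(frame["data1"], 4, labels=False)
stats = frame["data2"].groupby(size_bins).agg(["min", "max", "count", "mean"])

print(width_counts)
print(stats)
```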

Example: Filling missing values with group-specific values

Filling NA values with the mean:

>>> s = Series(np.random.randn(6))
>>> s[::2] = np.nan
>>> s
0         NaN
1   -1.430336
2         NaN
3    0.937739
4         NaN
5    0.236223
dtype: float64
>>> s.fillna(s.mean())
0   -0.085458
1   -1.430336
2   -0.085458
3    0.937739
4   -0.085458
5    0.236223
dtype: float64

Suppose you need the fill value to vary by group. Just group the data and use apply with a function that calls fillna on each data chunk.

>>> states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
>>> group_key = ['East']*4 + ['West'] *4
>>> data = Series(np.random.randn(8),index=states)
>>> data[['Vermont','Nevada','Idaho']] = np.nan
>>> data
Ohio         -0.734886
New York      1.573174
Vermont            NaN
Florida      -1.172843
Oregon        0.988466
Nevada             NaN
California   -1.872393
Idaho              NaN
dtype: float64
>>> data.groupby(group_key).mean()
East   -0.111518
West   -0.441964
dtype: float64

We can fill the NA values using the group means:

>>> fill_mean = lambda g: g.fillna(g.mean())
>>> data.groupby(group_key).apply(fill_mean)
Ohio         -0.734886
New York      1.573174
Vermont      -0.111518
Florida      -1.172843
Oregon        0.988466
Nevada       -0.441964
California   -1.872393
Idaho        -0.441964
dtype: float64
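With hand-picked numbers the fill values can be verified directly. This sketch passes group_keys=False, an addition of mine: in recent pandas versions the default would otherwise prepend the group label (East/West) to the result index. The data values are invented.

```python
import numpy as np
import pandas as pd

states = ["Ohio", "New York", "Vermont", "Florida",
          "Oregon", "Nevada", "California", "Idaho"]
group_key = ["East"] * 4 + ["West"] * 4

# Invented values: East mean is (1 + 3 + 8) / 3 = 4, West mean is (2 + 4) / 2 = 3.
data = pd.Series([1.0, 3.0, np.nan, 8.0, 2.0, np.nan, 4.0, np.nan], index=states)

def fill_mean(g):
    # g is one group's Series; its NaNs are replaced by that group's mean.
    return g.fillna(g.mean())

filled = data.groupby(group_key, group_keys=False).apply(fill_mean)
print(filled)
```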

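We can also predefine the fill value for each group in code. Because each group passed to the function has a name attribute set to its group key, we can use it to look up the value: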

>>> fill_values = {'East':0.5,'West':-1}
>>> fill_func = lambda g: g.fillna(fill_values[g.name])
>>> data.groupby(group_key).apply(fill_func)
Ohio         -0.734886
New York      1.573174
Vermont       0.500000
Florida      -1.172843
Oregon        0.988466
Nevada       -1.000000
California   -1.872393
Idaho        -1.000000
dtype: float64
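The lookup through g.name can be isolated in a few lines. Again a sketch with invented data and with group_keys=False added so the result keeps the original state labels regardless of pandas version.

```python
import numpy as np
import pandas as pd

states = ["Ohio", "Vermont", "Oregon", "Idaho"]
group_key = ["East", "East", "West", "West"]
data = pd.Series([1.0, np.nan, 2.0, np.nan], index=states)

fill_values = {"East": 0.5, "West": -1.0}

def fill_func(g):
    # During apply, g.name holds the key of the current group.
    return g.fillna(fill_values[g.name])

filled = data.groupby(group_key, group_keys=False).apply(fill_func)
print(filled)
```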

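Example: Random sampling and permutation

One way to draw a random sample is to take the first K elements of np.random.permutation(N), where N is the size of the complete dataset and K is the desired sample size.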

>>> suits = ['H','S','C','D']
>>> card_val = (list(range(1,11))+[10]*3)*4
>>> card_val
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> base_names = ["A"] + list(range(2,11)) + ['J','K','Q']
>>> base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
>>> carda = []
>>> for suit in suits:
...     carda.extend(str(num) + suit for num in base_names)
>>> deck = Series(card_val,index=carda)
>>> deck[:13]
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
>>> def draw(deck,n=5):
...     return deck.take(np.random.permutation(len(deck))[:n])
... 
>>> draw(deck)
3H      3
KS     10
QC     10
JS     10
10C    10
dtype: int64
>>> get_suit = lambda card: card[-1]
>>> deck.groupby(get_suit).apply(draw,n=2)
C  9C      9
   JC     10
D  4D      4
   AD      1
H  4H      4
   10H    10
S  3S      3
   KS     10
dtype: int64
>>> deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
7C     7
4C     4
AD     1
5D     5
9H     9
4H     4
6S     6
KS    10
dtype: int64
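For reference, here is the whole deck example condensed into a Python 3 script. It is a sketch, not the original session: it uses a seeded NumPy Generator in place of np.random.permutation purely so repeated runs give the same hand, and it replaces the lambda with a named function.

```python
import numpy as np
import pandas as pd

suits = ["H", "S", "C", "D"]
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

rng = np.random.default_rng(0)  # seeded only for repeatability

def draw(deck, n=5):
    # Shuffle the positions 0..len(deck)-1 and keep the first n of them.
    return deck.take(rng.permutation(len(deck))[:n])

def get_suit(card):
    # The last character of each label is the suit.
    return card[-1]

# Two random cards from each suit, without the suit as an extra index level.
hand = deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
print(hand)
```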

Example: Group weighted average and correlation

>>> df = DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
>>> df
  category      data   weights
0        a -1.493554  0.300840
1        a -2.008278  0.693407
2        a  1.006548  0.736280
3        a -1.226051  0.128157
4        b -0.981050  0.327538
5        b -0.487632  0.201700
6        b -1.262182  0.201121
7        b -0.205049  0.206801
>>> grouped = df.groupby('category')
>>> get_wavg = lambda g: np.average(g['data'],weights = g['weights'])
>>> grouped.apply(get_wavg)
category
a   -0.676769
b   -0.763949
dtype: float64
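Since the session uses random numbers, the result is hard to verify by eye. The sketch below repeats the computation on invented values where the weighted averages can be worked out by hand: (1*1 + 3*3) / 4 = 2.5 for group a and (2*3 + 6*1) / 4 = 3.0 for group b. Selecting the data and weights columns before apply is my addition, to stay independent of how the grouping column is handled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "data": [1.0, 3.0, 2.0, 6.0],
    "weights": [1.0, 3.0, 3.0, 1.0],
})

def get_wavg(g):
    # Weighted mean of one group's data column, using its weights column.
    return np.average(g["data"], weights=g["weights"])

wavg = df.groupby("category")[["data", "weights"]].apply(get_wavg)
print(wavg)
```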