
pandas basic attributes and methods, miscellaneous notes (4): a worked example covering several topics

  • Source data format:
    Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
    61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
    61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
    61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71

  • Import:
    import pandas as pd
    data = pd.read_table('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data', sep=r'\s+', parse_dates=[[0, 1, 2]])
    Note:
    the parse_dates argument of pd.read_table():
    list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
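
Recent pandas releases (2.2 and later, as far as I know) warn that combining columns through a nested parse_dates list is deprecated. A minimal sketch of one alternative, assuming the same Yr/Mo/Dy layout and assuming that two-digit years belong to the 1900s. The in-memory sample stands in for the remote file, and the names `sample`, `raw`, `parts` and `wind` are only illustrative:

```python
import io

import pandas as pd

# Three rows copied from the sample above; stands in for wind.data.
sample = """Yr Mo Dy RPT VAL ROS
61 1 1 15.04 14.96 13.17
61 1 2 14.71 NaN 10.83
61 1 3 18.50 16.88 12.33
"""

raw = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Assumption: two-digit years mean 19xx.
parts = pd.DataFrame({
    "year": 1900 + raw["Yr"],
    "month": raw["Mo"],
    "day": raw["Dy"],
})

# pd.to_datetime assembles dates from year/month/day columns.
raw.insert(0, "DATE", pd.to_datetime(parts))
wind = raw.drop(columns=["Yr", "Mo", "Dy"])
print(wind)
```

With this approach the date column is already a datetime dtype, so no later conversion from object is needed.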

  • Rename a column:

data.rename(columns = {'Yr_Mo_Dy':'DATE'}, inplace=True)
data.iloc[:,0:9].head(3)
Out[158]: 
         DATE    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA
0  1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25
1  1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04
2  1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN
  • Change the date column's dtype: object -> datetime64[ns]
    data.info() shows that the DATE column holds dates but has dtype 'object'; it should be converted to 'datetime64[ns]'.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6574 entries, 0 to 6573
Data columns (total 13 columns):
DATE    6574 non-null object
RPT     6568 non-null float64
VAL     6571 non-null float64
ROS     6572 non-null float64
KIL     6569 non-null float64
SHA     6572 non-null float64
BIR     6574 non-null float64
DUB     6571 non-null float64
CLA     6572 non-null float64
MUL     6571 non-null float64
CLO     6573 non-null float64
BEL     6574 non-null float64
MAL     6570 non-null float64
dtypes: float64(12), object(1)
memory usage: 719.0+ KB

Method 1: .astype()

data['DATE'].astype('datetime64[ns]').dtype
Out[169]: dtype('<M8[ns]')

Method 2: pd.to_datetime(…)  # copes with non-standard formats; recommended

data['DATE'] = pd.to_datetime(data['DATE'])
data['DATE'].dtype
Out[178]: dtype('<M8[ns]')
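
To illustrate why pd.to_datetime is the more forgiving of the two, here is a small sketch on made-up strings (the series `dates` is hypothetical, not taken from the wind data). It passes an explicit `format` and uses `errors="coerce"` so that an unparseable entry becomes NaT instead of raising:

```python
import pandas as pd

# Hypothetical input: two clean ISO dates and one bad entry.
dates = pd.Series(["1961-01-01", "1961-01-02", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)

# astype is fine when every string is already a clean ISO date.
clean = dates.iloc[:2].astype("datetime64[ns]")
print(clean.dtype)
```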
  • Set the date column as the index (column -> index):
data = data.set_index('DATE')
data.iloc[:,0:9].head(3)
Out[186]: 
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL
DATE                                                                    
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50
  • Count missing values per column:
    Note: notnull() is the negation of isnull().
data.isnull().sum()
Out[192]: 
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
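
The relationship between the two methods can be checked on a toy frame. This is only a sketch with invented values; `df`, `missing` and `present` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RPT": [15.04, 14.71, np.nan],
    "VAL": [14.96, np.nan, np.nan],
})

missing = df.isnull().sum()   # NaN count per column
present = df.notnull().sum()  # non-NaN count per column

# Every cell is either null or not null, so the two counts add up to the row count.
print(missing + present)
```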
  • Descriptive statistics:
    Problem 1:
    Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days.
    Approach: data.describe().loc[['min','max','mean','std'], :].T
loc_stats = data.describe().loc[['min','max','mean','std'],:].T
loc_stats
Out[207]: 
      min    max       mean       std
RPT  0.67  35.80  12.362987  5.618413
VAL  0.21  33.37  10.644314  5.267356
ROS  1.50  33.84  11.660526  5.008450
KIL  0.00  28.46   6.306468  3.605811
SHA  0.13  37.54  10.455834  4.936125
BIR  0.00  26.16   7.092254  3.968683
DUB  0.00  30.37   9.797343  4.977555
CLA  0.00  31.08   8.495053  4.499449
MUL  0.00  25.88   8.493590  4.166872
CLO  0.04  28.21   8.707332  4.503954
BEL  0.13  42.38  13.121007  5.835037
MAL  0.67  42.54  15.599079  6.699794

Problem 2:
Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
Approach: data.T.describe().loc[['min','max','mean','std'], :].T
Note: transposing with df.T swaps rows and columns, so describe() summarizes each day instead of each location.

day_stats = data.T.describe().loc[['min','max','mean','std'],:].T
day_stats.head()
Out[229]: 
             min    max       mean       std
DATE                                        
1961-01-01  9.29  18.50  13.018182  2.808875
1961-01-02  6.50  17.54  11.336364  3.188994
1961-01-03  6.17  18.50  11.641818  3.681912
1961-01-04  1.79  11.75   6.619167  3.198126
1961-01-05  6.17  13.33  10.630000  2.445356
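
The same per-day table can also be built without transposing, by reducing along axis=1. The sketch below uses a tiny invented frame rather than the wind data and compares the result with the transpose-and-describe approach; it is an alternative for illustration, not the method the exercise prescribes:

```python
import numpy as np
import pandas as pd

# Invented values: two days, three locations.
df = pd.DataFrame(
    {"RPT": [1.0, 4.0], "VAL": [2.0, 6.0], "ROS": [3.0, 8.0]},
    index=pd.to_datetime(["1961-01-01", "1961-01-02"]),
)

# axis=1 reduces across the columns, giving one value per row (day).
day_stats = pd.DataFrame({
    "min": df.min(axis=1),
    "max": df.max(axis=1),
    "mean": df.mean(axis=1),
    "std": df.std(axis=1),
})

# The transpose-and-describe form from the text, for comparison.
via_describe = df.T.describe().loc[["min", "max", "mean", "std"], :].T

print(day_stats)
print(np.allclose(day_stats.to_numpy(), via_describe.to_numpy()))
```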

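  • Conditional queries: df.query('cond…')
Problem:
Find the average windspeed in January for each location.

  • Method 1: add a helper column data['Mon'], then data[data['Mon']==1].mean()  # a boolean mask acts as the filter

    That is, array[relational expression]:
    the relational expression evaluates to a boolean array whose True entries mark the elements that satisfy the condition; indexing with that array selects exactly those elements.

  • Method 2: helper column data['Mon'] plus .query(); it does the same job as mask indexing and is sometimes more convenient
  • Method 3: no helper column; aggregate by month with groupby()
    data.groupby(lambda x: x.month).mean().T[1]
data['Mon'] = data.index.month  # DATE is the index at this point, so take the month from the index
data[data['Mon']==1].mean()     # data['Mon']==1 is the boolean mask; the output also lists Year and day helper columns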
Out[248]: 
RPT       14.847325
VAL       12.914560
ROS       13.299624
KIL        7.199498
SHA       11.667734
BIR        8.054839
DUB       11.819355
CLA        9.512047
MUL        9.543208
CLO       10.053566
BEL       14.550520
MAL       18.028763
Mon        1.000000
Year    1969.500000
day       16.000000
dtype: float64

Compared with the mask form data[data['Mon']==1].mean(), the .query form is shorter and easier to read: data.query('Mon == 1').mean()

data.query('Mon == 1').mean()
Out[267]: 
RPT       14.847325
VAL       12.914560
ROS       13.299624
KIL        7.199498
SHA       11.667734
BIR        8.054839
DUB       11.819355
CLA        9.512047
MUL        9.543208
CLO       10.053566
BEL       14.550520
MAL       18.028763
Mon        1.000000
Year    1969.500000
day       16.000000
dtype: float64

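Notes on method 3:
a) Drop the helper column:
data.drop(['Mon'], axis=1, inplace=True)
b) Pass a lambda to .groupby(); it is called on each index value (a timestamp), so the rows are grouped by month:
mondata_loc_per = data.groupby(lambda x: x.month).mean().T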

mondata_loc_per.head()
Out[311]: 
            1          2          3          4          5          6   \
RPT  14.847325  13.710906  13.158687  12.555648  11.724032  10.451317   
VAL  12.914560  12.111122  11.505842  10.429759  10.145619   8.949704   
ROS  13.299624  12.879132  12.648118  12.204815  11.550394  10.361315   
KIL   7.199498   6.942411   7.265907   6.898037   6.307487   5.652278   
SHA  11.667734  11.551772  11.554516  10.677667  10.224301   9.529926   

           7          8          9          10         11         12  
RPT  9.992007  10.213411  11.458519  12.660610  13.200722  14.446398  
VAL  8.357778   8.415143   9.981002  11.010681  11.639500  12.353602  
ROS  9.349642   9.993441  10.756883  11.453943  12.293407  13.212276  
KIL  5.416935   5.270681   5.615176   6.065215   6.247611   6.829910  
SHA  9.302634   8.901559   9.766315  10.550251  10.501130  11.301254  

c) Indexing:
mondata_loc_per[4]: the N in [N] is the month number (Mon)

mondata_loc_per[4]
Out[312]: 
RPT       12.555648
VAL       10.429759
ROS       12.204815
KIL        6.898037
SHA       10.677667
BIR        7.441389
DUB       10.221315
CLA        8.909056
MUL        8.930870
CLO        9.158019
BEL       12.664759
MAL       14.937611
Mon        4.000000
Year    1969.500000
day       15.500000
Name: 4, dtype: float64
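
The three methods can be compared side by side on a small frame. This sketch uses four invented rows with a DatetimeIndex in place of the wind data; `tmp`, `by_mask`, `by_query` and `by_group` are illustrative names:

```python
import pandas as pd

# Invented values: two January days and two February days.
idx = pd.date_range("1961-01-30", periods=4, freq="D")
df = pd.DataFrame(
    {"RPT": [10.0, 14.0, 8.0, 6.0], "VAL": [2.0, 4.0, 6.0, 8.0]},
    index=idx,
)

# Methods 1 and 2 need a helper column holding the month.
tmp = df.assign(Mon=df.index.month)
by_mask = tmp[tmp["Mon"] == 1].mean()     # boolean mask
by_query = tmp.query("Mon == 1").mean()   # same filter as a query string

# Method 3: the lambda receives each index value (a Timestamp).
by_group = df.groupby(lambda ts: ts.month).mean().T[1]

print(by_mask)
print(by_query)
print(by_group)
```

Methods 1 and 2 carry the helper column into the result (Mon averages to 1.0), while the groupby form returns only the location columns.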
  • Resampling with resample:
data_rew = data.resample('W',closed='right',kind='period').agg(['min','max','mean','std'])
data_rew.iloc[:,0:7].head()
Out[418]: 
                         RPT                                VAL         \
                         min    max       mean       std    min    max   
DATE                                                                     
1960-12-26/1961-01-01  15.04  15.04  15.040000       NaN  14.96  14.96   
1961-01-02/1961-01-08  10.58  18.50  13.541429  2.631321   6.63  16.88   
1961-01-09/1961-01-15   9.04  19.75  12.468571  3.555392   3.54  12.08   
1961-01-16/1961-01-22   4.92  19.83  13.204286  5.337402   3.42  14.37   
1961-01-23/1961-01-29  13.62  25.04  19.880000  4.619061   9.96  23.91 


data.resample('A', kind='period',axis=0,label='right').mean().head(3)
Out[411]: 
            RPT        VAL        ROS       KIL        SHA       BIR  \
DATE                                                                   
1961  12.299583  10.351796  11.362369  6.958227  10.881763  7.729726   
1962  12.246923  10.110438  11.732712  6.960440  10.657918  7.393068   
1963  12.813452  10.836986  12.541151  7.330055  11.724110  8.434712   
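
A small sketch of weekly resampling on invented daily values, written without the `kind` and `axis` arguments, which newer pandas releases flag as deprecated as far as I know. The column name `speed` and the values are made up; 1961-01-01 falls on a Sunday, so with the default Sunday-anchored weeks the first bin holds a single day, just as in the output above:

```python
import pandas as pd

# Invented daily values for 1961-01-01 .. 1961-01-14.
idx = pd.date_range("1961-01-01", periods=14, freq="D")
daily = pd.DataFrame({"speed": [float(v) for v in range(1, 15)]}, index=idx)

# 'W' means weeks ending on Sunday; bins are closed and labeled on the right.
weekly = daily.resample("W").agg(["min", "max", "mean", "std"])
print(weekly)
```

The result has a two-level column index (column name, statistic), and the std of the one-day first week is NaN.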

Excerpt from help():

parse_dates : boolean or list of ints or names or list of lists or dict, default False

        * boolean. If True -> try parsing the index.
        * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
          each as a separate date column.
        * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
          a single date column.
        * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
          'foo'

        If a column or index contains an unparseable date, the entire column or
        index will be returned unaltered as an object data type. For non-standard
        datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``