
Pandas in 10 Minutes (Annotated Official Documentation, Part 2)


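This article continues from annotated part 1. The previous part focused on how to create a pandas object; this part focuses on how to inspect the objects that have already been created.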

【Viewing Data】

  • See the top & bottom rows of the frame
>>> import numpy as np
>>> import pandas as pd
>>> long_series = pd.Series(np.random.randn(1000))
>>> long_series
0      0.526507
1     -0.085210
2      1.292113
3     -1.948114
4     -1.386582
5     -2.596821
6      0.268965
7     -0.635905
8     -1.839953
9     -1.240820
10     0.122215
.......

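The output above is the complete Series, which holds 1000 values. To look only at the beginning and the end, use the head() and tail() methods. Both return 5 entries by default, and you can pass the number of entries you want, as follows: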

>>> long_series.head()
0    0.526507
1   -0.085210
2    1.292113
3   -1.948114
4   -1.386582
dtype: float64
>>> long_series.tail(6)    # lst: take the last 6 values
994   -1.300574
995    0.659815
996   -0.340045
997    0.685664
998   -0.972145
999    0.410191
dtype: float64
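
As a small check of the defaults described above, the sketch below swaps the random values for a deterministic stand-in (np.arange, an assumption made only so that the result is predictable) and confirms how many entries head() and tail() return.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random series: the values are 0..999.
long_series = pd.Series(np.arange(1000))

first_five = long_series.head()   # no argument: the first 5 entries
last_six = long_series.tail(6)    # explicit count: the last 6 entries

print(first_five.index.tolist())  # [0, 1, 2, 3, 4]
print(last_six.index.tolist())    # [994, 995, 996, 997, 998, 999]
```
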
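  • Display the index, columns, and underlying NumPy data

pandas makes these easy to get: each one is exposed as an attribute, as shown below.

>>> dates = pd.date_range('20170101', periods=6)    # dates comes from part 1; recreated here so the snippet runs on its own
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))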
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.index    # get the row index
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns   # get the column index
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values    # get the underlying values
array([[ 0.90624543,  1.81592368,  0.12335647, -1.79857091],
       [-0.45964616,  0.52009988,  0.51113763,  0.1839755 ],
       [ 0.46332631, -0.97048662, -1.12078016, -0.61448135],
       [ 1.50546445, -1.74331294,  1.02090281, -1.04904748],
       [-0.70936561,  1.37802983,  1.87495471, -1.01754786],
       [ 1.11355431, -0.95196258, -1.2668023 , -0.58657136]])
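
The three attributes can be tried on a small frame with predictable contents. This is a sketch under assumptions: the frame is filled with np.arange instead of random numbers, and it also calls to_numpy(), which recent pandas documentation recommends over .values; the original text only uses .values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
# Illustrative data: 0.0 .. 23.0 laid out as 6 rows x 4 columns.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row_labels = df.index          # DatetimeIndex holding the 6 dates
col_labels = list(df.columns)  # ['A', 'B', 'C', 'D']
arr = df.to_numpy()            # the data as a plain NumPy array

print(row_labels[0])
print(col_labels)
print(arr.shape)               # (6, 4)
```
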
  • Quick summary statistics of the data
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.469930  0.008049  0.190462 -0.813707
std    0.886775  1.439019  1.222903  0.656284
min   -0.709366 -1.743313 -1.266802 -1.798571
25%   -0.228903 -0.965856 -0.809746 -1.041173
50%    0.684786 -0.215931  0.317247 -0.816015
75%    1.061727  1.163547  0.893462 -0.593549
max    1.505464  1.815924  1.874955  0.183975

Note that these statistics are computed separately for each dimension (that is, for each column).
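
To make the per-column behaviour concrete, here is a minimal sketch on a hand-made frame (the values are illustrative, chosen so the statistics are easy to verify by hand).

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0]})

stats = df.describe()

# Each column of the result summarises one column of df.
print(stats.loc["mean", "A"])   # 2.5
print(stats.loc["mean", "B"])   # 25.0
print(stats.loc["50%", "A"])    # 2.5, the median of 1, 2, 3, 4
```
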

  • Transposing the data (swapping rows and columns)
>>> df.T
   2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
A    0.906245   -0.459646    0.463326    1.505464   -0.709366    1.113554
B    1.815924    0.520100   -0.970487   -1.743313    1.378030   -0.951963
C    0.123356    0.511138   -1.120780    1.020903    1.874955   -1.266802
D   -1.798571    0.183975   -0.614481   -1.049047   -1.017548   -0.586571
  • Sorting along an axis
>>> df.sort_index(axis=1,ascending=False)
                   D         C         B         A
2017-01-01 -1.798571  0.123356  1.815924  0.906245
2017-01-02  0.183975  0.511138  0.520100 -0.459646
2017-01-03 -0.614481 -1.120780 -0.970487  0.463326
2017-01-04 -1.049047  1.020903 -1.743313  1.505464
2017-01-05 -1.017548  1.874955  1.378030 -0.709366
2017-01-06 -0.586571 -1.266802 -0.951963  1.113554
  • Sorting by values (lst: earlier versions used sort(columns=xxx); that method is being retired, and the official docs now use sort_values)
>>> df.sort_values(by='B')
                   A         B         C         D
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-01  0.906245  1.815924  0.123356 -1.798571
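
The two kinds of sorting are easy to confuse, so the sketch below puts them side by side on a tiny illustrative frame: sort_index reorders labels along an axis, while sort_values reorders rows by the contents of a column. Both return a new object and leave df unchanged.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]},
                  index=["x", "y", "z"])

# axis=1 sorts the column labels; descending order gives B before A.
by_cols_desc = df.sort_index(axis=1, ascending=False)

# Sort the rows by the values in column B (ascending by default).
by_b = df.sort_values(by="B")

print(list(by_cols_desc.columns))  # ['B', 'A']
print(list(by_b.index))            # ['y', 'x', 'z']
```
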

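【Selecting Data】

Note: while standard Python/NumPy expressions are intuitive and work for selecting data, we recommend the optimized pandas access methods: .at, .iat, .loc, .iloc and .ix (.ix was deprecated in later pandas releases and removed in pandas 1.0, so use .loc or .iloc with current versions). For details, see Indexing and Selecting Data and MultiIndex / Advanced Indexing.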

  • Getting

Selecting a single column

>>> df['A']
2017-01-01    0.906245
2017-01-02   -0.459646
2017-01-03    0.463326
2017-01-04    1.505464
2017-01-05   -0.709366
2017-01-06    1.113554
Freq: D, Name: A, dtype: float64   

Selecting rows by slicing

>>> df[1:3]
                   A         B         C         D
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481

Selecting by label (the row for 2017-01-01)

>>> df.loc[dates[0]]
A    0.906245
B    1.815924
C    0.123356
D   -1.798571
Name: 2017-01-01 00:00:00, dtype: float64

Selecting on multiple axes by label

>>> df.loc[:,['A','C']]
                   A         C
2017-01-01  0.906245  0.123356
2017-01-02 -0.459646  0.511138
2017-01-03  0.463326 -1.120780
2017-01-04  1.505464  1.020903
2017-01-05 -0.709366  1.874955
2017-01-06  1.113554 -1.266802

Label slicing (both endpoints are included)

>>> df.loc['20170101':'20170103',['A','B']]
                   A         B
2017-01-01  0.906245  1.815924
2017-01-02 -0.459646  0.520100
2017-01-03  0.463326 -0.970487
  • Reducing the dimensions of the returned object
>>> df.loc['20170103',['A','B']]
A    0.463326
B   -0.970487
Name: 2017-01-03 00:00:00, dtype: float64
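
The two label-based selections above differ only in whether the row selector is a slice or a single label. A minimal sketch with deterministic data (np.arange is an assumption used in place of the random values) shows the effect on the result type:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# A label slice keeps BOTH endpoints, so this returns 3 rows, not 2.
block = df.loc["20170101":"20170103", ["A", "B"]]

# A single row label drops one dimension: the result is a Series.
row = df.loc["20170103", ["A", "B"]]

print(block.shape)         # (3, 2)
print(type(row).__name__)  # Series
```
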

Getting a scalar value

>>> df.loc[dates[0],'A']
0.90624542800545049

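Fast access to a scalar (same result as above; .at accepts only a single row label and a single column label, which lets it skip the general-purpose handling in .loc)

>>> df.at[dates[0],'A']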
0.90624542800545049

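Most of the selections above use loc; the ones below mainly use iloc. The key difference is that loc selects by label, whereas iloc selects by integer position. Looking closely at the code above, what we varied was mostly the first argument, the row label; with iloc below, what varies is the row number.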
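
The distinction is clearest when the labels are themselves integers that do not match the positions. The Series below is a hypothetical example built for that purpose; it does not appear in the official docs.

```python
import pandas as pd

# Labels 10, 20, 30 sit at positions 0, 1, 2.
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[10]      # looks up the LABEL 10
by_position = s.iloc[0]   # looks up POSITION 0, the same element

try:
    s.iloc[10]            # a 3-element Series has no position 10
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_label, by_position, out_of_bounds)
```
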

  • Selection by position

Selecting with an integer position

>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.iloc[3]
A    1.505464
B   -1.743313
C    1.020903
D   -1.049047
Name: 2017-01-04 00:00:00, dtype: float64

Slicing by integer position

>>> df.iloc[3:5,0:2]    # note: the start of the slice is included, the end is excluded
                   A         B
2017-01-04  1.505464 -1.743313
2017-01-05 -0.709366  1.378030

Selecting with lists of integer positions

>>> df.iloc[[1,2,4],[0,2]]
                   A         C
2017-01-02 -0.459646  0.511138
2017-01-03  0.463326 -1.120780
2017-01-05 -0.709366  1.874955

Slicing rows or columns

>>> df.iloc[1:3,:]
                   A         B         C         D
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
>>> df.iloc[:,1:3]
                   B         C
2017-01-01  1.815924  0.123356
2017-01-02  0.520100  0.511138
2017-01-03 -0.970487 -1.120780
2017-01-04 -1.743313  1.020903
2017-01-05  1.378030  1.874955
2017-01-06 -0.951963 -1.266802

Getting a specific value

>>> df.iloc[1,1]
0.52009988180243594
>>> df.iat[1,1]
0.52009988180243594
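
The positional forms above can be checked on a frame whose cell values encode their own position. This is a sketch with assumed data: np.arange(24) laid out as 6 rows by 4 columns, so the cell at row r and column c holds 4*r + c.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Half-open slices: rows 3 and 4, columns 0 and 1.
sub = df.iloc[3:5, 0:2]

# Explicit position lists: rows 1, 2, 4 and columns 0, 2.
picked = df.iloc[[1, 2, 4], [0, 2]]

# Single cell at row 1, column 1.
cell = df.iat[1, 1]

print(sub.shape)                 # (2, 2)
print(picked.to_numpy().tolist())  # [[4, 6], [8, 10], [16, 18]]
print(cell)                      # 5
```
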
  • Boolean indexing (selecting data by the result of a condition)

Using the values of a single column to select data

>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df[df.A>0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571

Selecting values from a DataFrame where a boolean condition is met.

(Select every value in the DataFrame that satisfies the condition; values that do not are shown as NaN.)

>>> df[df>0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356       NaN
2017-01-02       NaN  0.520100  0.511138  0.183975
2017-01-03  0.463326       NaN       NaN       NaN
2017-01-04  1.505464       NaN  1.020903       NaN
2017-01-05       NaN  1.378030  1.874955       NaN
2017-01-06  1.113554       NaN       NaN       NaN
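
The two boolean forms behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and blanks out the cells that fail the test. A minimal sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0],
                   "B": [-1.0, 5.0, -6.0]})

# Row filter: keep only the rows where column A is positive.
rows = df[df["A"] > 0]

# Element-wise mask: same shape as df, NaN wherever the value is not positive.
masked = df[df > 0]

print(rows.index.tolist())             # [0, 2]
print(int(masked.isna().sum().sum()))  # 3 cells were not positive
```
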

Filtering with isin()

>>> df2 = df.copy()
>>> df2['E'] = ['one','one','two','three','four','three']
>>> df2
                   A         B         C         D      E
2017-01-01  0.906245  1.815924  0.123356 -1.798571    one
2017-01-02 -0.459646  0.520100  0.511138  0.183975    one
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481    two
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  three
2017-01-05 -0.709366  1.378030  1.874955 -1.017548   four
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  three
>>> df2[df2['E'].isin(['two','four'])]
                   A         B         C         D     E
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481   two
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  four

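lst: The official example here is a little involved. The isin method of a Series returns a Series of booleans indicating whether each element of the source object is contained in the values passed in (DataFrame.isin behaves the same way, element by element), as the following shows:

>>> df3 = pd.DataFrame({'A':[1,2,3],'B':['a','b','c']})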
>>> df3
   A  B
0  1  a
1  2  b
2  3  c
>>> df3.isin([1,3])
       A      B
0   True  False
1  False  False
2   True  False

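In the official example, however, the result is a DataFrame. The reason is that once isin has determined which rows of df2 have 'two' or 'four' in column E, the resulting boolean Series is passed back into df2 as a mask, and df2 returns only the rows where the mask is True.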
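
Splitting the official one-liner into two steps makes the intermediate boolean Series visible. The frame below is a small illustrative stand-in for df2, and the names mask and filtered are introduced here only for the example.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3, 4],
                    "E": ["one", "two", "three", "four"]})

# Step 1: isin returns one boolean per row of column E.
mask = df2["E"].isin(["two", "four"])

# Step 2: indexing df2 with that mask keeps the rows marked True.
filtered = df2[mask]

print(mask.tolist())           # [False, True, False, True]
print(filtered["A"].tolist())  # [2, 4]
```
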

  • Setting data

Adding a new column, aligned by index

>>> s3 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20170101',periods=6))
>>> s3
2017-01-01    1
2017-01-02    2
2017-01-03    3
2017-01-04    4
2017-01-05    5
2017-01-06    6
Freq: D, dtype: int64
>>> df['F'] = s3
>>> df
                   A         B         C         D   E  F
2017-01-01  0.906245  1.815924  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6
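
When a Series is assigned as a new column, pandas matches its values to rows by index label rather than by order. The sketch below uses a Series that deliberately starts one day late (an assumption made to expose the alignment), so the first row has no matching value.

```python
import pandas as pd

dates = pd.date_range("20170101", periods=4)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# This Series covers 2017-01-02 .. 2017-01-04 only.
s = pd.Series([10, 20, 30], index=pd.date_range("20170102", periods=3))

df["F"] = s   # values are placed by matching dates, not by position

print(df["F"].tolist())   # [nan, 10.0, 20.0, 30.0]
```
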

Updating a value by label

>>> df.at[dates[0],'A'] = 1.5
>>> df.at[dates[0],'A']
1.5

Updating a value by position

>>> df.iat[0,1]=2.5
>>> df.iat[0,1]
2.5
>>> df
                   A         B         C         D   E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6

Updating a column with a NumPy array

>>> df.loc[:,'E'] = np.array([5]*len(df))
>>> df
                   A         B         C         D  E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571  5  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975  5  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481  5  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  5  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  5  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  5  6

Updating values with a where-style condition

>>> df4= df.copy()
>>> df4[df4<0] = 3.6
>>> df4
                   A        B         C         D  E  F
2017-01-01  1.500000  2.50000  0.123356  3.600000  5  1
2017-01-02  3.600000  0.52010  0.511138  0.183975  5  2
2017-01-03  0.463326  3.60000  3.600000  3.600000  5  3
2017-01-04  1.505464  3.60000  1.020903  3.600000  5  4
2017-01-05  3.600000  1.37803  1.874955  3.600000  5  5
2017-01-06  1.113554  3.60000  3.600000  3.600000  5  6
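
The conditional assignment above modifies df4 in place. As a sketch of an alternative, DataFrame.mask returns a new frame with the same replacement applied; the original text does not use it, and the values below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.5, -0.5],
                   "B": [-2.0, 0.25]})

# In-place form, as in the text: overwrite every negative cell of a copy.
df4 = df.copy()
df4[df4 < 0] = 3.6

# Non-mutating form: mask() replaces the cells where the condition is True.
replaced = df.mask(df < 0, 3.6)

print(df4.to_numpy().tolist())   # [[1.5, 3.6], [3.6, 0.25]]
print(df4.equals(replaced))      # True
```
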
