Learning pandas in Depth: 10 Minutes to pandas (Part 1)
阿新 • Published: 2018-11-01
10 Minutes to pandas
Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Object Creation
pandas has two commonly used data structures: Series and DataFrame.
Create a Series:
s = pd.Series([1,3,5,np.nan,6,8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
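The Series above reports float64 even though most of its entries were written as integers. A minimal sketch of why, assuming nothing beyond pandas and NumPy (the names s_int and s_nan are illustrative only): NaN is a floating-point value, so its presence makes pandas store the whole Series as floats.

```python
import numpy as np
import pandas as pd

# A list of plain integers is stored with an integer dtype.
s_int = pd.Series([1, 3, 5])

# Adding np.nan forces a float dtype, because NaN is a floating-point value.
s_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_int.dtype)         # an integer dtype
print(s_nan.dtype)         # float64
print(s_nan.isna().sum())  # count of missing entries: 1
```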
Create a DataFrame from a NumPy array, using a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
A B C D
2013-01-01 0.828772 -0.681941 -0.736688 0.497738
2013-01-02 -1.744554 1.840190 1.108693 0.718830
2013-01-03 1.022257 0.956576 -1.538469 -0.097789
2013-01-04 -0.818469 0.017786 0.365621 0.687680
2013-01-05 0.418984 1.301549 1.248974 -0.712357
2013-01-06 0.949965 -0.778907 0.029515 0.200063
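Because np.random.randn draws fresh random numbers, the values printed above will differ on every run. If reproducible numbers are wanted, one option is a seeded generator. This is a sketch under assumptions: the helper make_frame and the seed value 0 are hypothetical and not part of the original tutorial, and np.random.default_rng requires a reasonably recent NumPy.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

def make_frame(seed):
    # Hypothetical helper: a generator created with a fixed seed
    # always produces the same sequence of draws.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list('ABCD'))

df_a = make_frame(0)
df_b = make_frame(0)
print(df_a.equals(df_b))  # True: same seed, same data
```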
Create a DataFrame from a dict whose values are scalars or array-like objects (scalars are broadcast to every row):
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Each column has its own dtype:
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
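Since every column carries its own dtype, columns can also be picked by dtype. The sketch below rebuilds df2 and uses DataFrame.select_dtypes; the variable names numeric and categorical are illustrative only.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# 'number' matches every numeric dtype (float64, float32, int32 here).
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))      # ['A', 'C', 'D']

# 'category' matches only the Categorical column.
categorical = df2.select_dtypes(include='category')
print(list(categorical.columns))  # ['E']
```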
In IPython, pressing Tab auto-completes attribute and method names.
Viewing Data
View the first few rows (5 by default) and the last few rows:
df.head()
A B C D
2013-01-01 0.828772 -0.681941 -0.736688 0.497738
2013-01-02 -1.744554 1.840190 1.108693 0.718830
2013-01-03 1.022257 0.956576 -1.538469 -0.097789
2013-01-04 -0.818469 0.017786 0.365621 0.687680
2013-01-05 0.418984 1.301549 1.248974 -0.712357
df.tail(2)
A B C D
2013-01-05 0.418984 1.301549 1.248974 -0.712357
2013-01-06 0.949965 -0.778907 0.029515 0.200063
View the index, the columns, and the underlying values:
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[ 0.82877213, -0.68194098, -0.73668773, 0.49773805],
[-1.74455398, 1.84018976, 1.10869309, 0.71882978],
[ 1.02225702, 0.95657618, -1.53846868, -0.0977893 ],
[-0.81846875, 0.01778553, 0.36562078, 0.68767993],
[ 0.41898388, 1.30154949, 1.24897433, -0.71235652],
[ 0.94996532, -0.77890722, 0.02951456, 0.2000626 ]])
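In pandas 0.24 and later, the documentation recommends DataFrame.to_numpy() over the .values attribute for getting the underlying array. A short sketch, assuming such a version is installed, showing that both return the same data for this all-float frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

arr = df.to_numpy()  # index and column labels are dropped
print(type(arr).__name__)              # ndarray
print(arr.shape)                       # (6, 4)
print(np.array_equal(arr, df.values))  # True
```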
A quick statistical summary:
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.109493 0.442542 0.079608 0.215694
std 1.135895 1.085575 1.076593 0.550501
min -1.744554 -0.778907 -1.538469 -0.712357
25% -0.509106 -0.507009 -0.545137 -0.023326
50% 0.623878 0.487181 0.197568 0.348900
75% 0.919667 1.215306 0.922925 0.640194
max 1.022257 1.840190 1.248974 0.718830
Transpose:
df.T
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A 0.828772 -1.744554 1.022257 -0.818469 0.418984 0.949965
B -0.681941 1.840190 0.956576 0.017786 1.301549 -0.778907
C -0.736688 1.108693 -1.538469 0.365621 1.248974 0.029515
D 0.497738 0.718830 -0.097789 0.687680 -0.712357 0.200063
Sort by column labels in descending order:
df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 0.497738 -0.736688 -0.681941 0.828772
2013-01-02 0.718830 1.108693 1.840190 -1.744554
2013-01-03 -0.097789 -1.538469 0.956576 1.022257
2013-01-04 0.687680 0.365621 0.017786 -0.818469
2013-01-05 -0.712357 1.248974 1.301549 0.418984
2013-01-06 0.200063 0.029515 -0.778907 0.949965
Sort by the values of one column:
df.sort_values(by='D')
A B C D
2013-01-05 0.418984 1.301549 1.248974 -0.712357
2013-01-03 1.022257 0.956576 -1.538469 -0.097789
2013-01-06 0.949965 -0.778907 0.029515 0.200063
2013-01-01 0.828772 -0.681941 -0.736688 0.497738
2013-01-04 -0.818469 0.017786 0.365621 0.687680
2013-01-02 -1.744554 1.840190 1.108693 0.718830
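sort_values sorts in ascending order by default and returns a new object, leaving the original untouched. The sketch below reuses only column D from the output above (the frame name d is illustrative) and sorts it in descending order instead.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Column D copied from the tutorial output; other columns omitted for brevity.
d = pd.DataFrame({'D': [0.497738, 0.718830, -0.097789,
                        0.687680, -0.712357, 0.200063]}, index=dates)

desc = d.sort_values(by='D', ascending=False)
print(desc.index[0].date())  # 2013-01-02, the row with the largest D
print(d.index[0].date())     # 2013-01-01, the original order is unchanged
```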
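Selecting Data
Note: df['A'] returns a Series (a single column of values, printed without a column header), whereas df[['A']] returns a one-column DataFrame that keeps the column label.
Selection by Label
With .loc you can select any combination of row and column labels: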
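The difference between single and double brackets can be checked directly. A minimal sketch (the names col and frame are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

col = df['A']      # single brackets: a Series
frame = df[['A']]  # double brackets: a one-column DataFrame

print(type(col).__name__, col.shape)      # Series (6,)
print(type(frame).__name__, frame.shape)  # DataFrame (6, 1)
print(col.name)                           # A
```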
df.loc[:, ['A', 'B']]
A B
2013-01-01 0.828772 -0.681941
2013-01-02 -1.744554 1.840190
2013-01-03 1.022257 0.956576
2013-01-04 -0.818469 0.017786
2013-01-05 0.418984 1.301549
2013-01-06 0.949965 -0.778907
df.loc[[dates[1],dates[3]],:]
A B C D
2013-01-02 -1.744554 1.840190 1.108693 0.71883
2013-01-04 -0.818469 0.017786 0.365621 0.68768
Two ways to get a single value by label (.at is the faster scalar accessor):
df.loc[dates[0], 'C']
df.at[dates[0], 'C']
-0.7366877311446127
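One detail worth knowing about label-based slicing: with .loc, both endpoints of a slice are included. The sketch below uses np.arange instead of random data so the results are predictable; that substitution is an assumption made for illustration, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Deterministic values 0..23 so the selected cells are easy to verify.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slices include both the start and the end label.
sub = df.loc['20130102':'20130104', 'A':'B']
print(sub.shape)  # (3, 2)

# .loc and .at return the same scalar for a single row/column label pair.
print(df.loc[dates[0], 'C'], df.at[dates[0], 'C'])  # 2 2
```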
Selection by Position
df.iloc[[1,3], [2,3]]
C D
2013-01-02 1.108693 0.71883
2013-01-04 0.365621 0.68768
Two ways to get a single value by position (.iat is the faster scalar accessor):
df.iloc[1,3]
df.iat[1,3]
0.7188297799502227
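Position-based slicing with .iloc follows ordinary Python slice rules, so the end position is excluded. A short sketch, again using np.arange data as an illustrative assumption so the output is predictable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Rows 1 and 2, columns 0 and 1: the end positions 3 and 2 are excluded.
sub = df.iloc[1:3, 0:2]
print(sub.shape)  # (2, 2)

# .iloc and .iat return the same scalar for a single position.
print(df.iloc[1, 3], df.iat[1, 3])  # 7 7
```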