
Learning pandas in Depth: 10 Minutes to pandas (Part 1)

10 Minutes to pandas

Import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating objects

pandas has two commonly used data structures: Series and DataFrame.
Create a Series:

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
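
A small sketch of why the Series above prints as float64 even though most of the inputs are integers: np.nan is a float, so pandas upcasts the whole Series. The names `ints` and `with_nan` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Integers alone keep an integer dtype.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so mixing it in upcasts every value to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype.kind)         # 'i' (an integer dtype)
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # 1 missing value
```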

Create a DataFrame from a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
	A	B	C	D
2013-01-01	0.828772	-0.681941	-0.736688	0.497738
2013-01-02	-1.744554	1.840190	1.108693	0.718830
2013-01-03	1.022257	0.956576	-1.538469	-0.097789
2013-01-04	-0.818469	0.017786	0.365621	0.687680
2013-01-05	0.418984	1.301549	1.248974	-0.712357
2013-01-06	0.949965	-0.778907	0.029515	0.200063
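
Because np.random.randn is unseeded, the numbers shown here will differ on every run. If you want reproducible values, one option is a seeded generator. This is only a sketch: the seed value 0 and the name `rng` are arbitrary choices, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# A seeded generator produces the same "random" numbers on every run.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df.shape)      # (6, 4)
print(df.index[0])   # 2013-01-01 00:00:00
```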

Create a DataFrame from a dict:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

Each column has its own data type:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
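
The dict constructor mixes scalars with array-like values. As a rough sketch (the frame `small` is a cut-down, hypothetical version of df2), scalars are repeated on every row, while the array-like column sets the number of rows and keeps its own dtype:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': 1.0,                                # scalar, repeated on every row
    'D': np.array([3] * 4, dtype='int32'),   # array-like, sets the row count
    'F': 'foo',                              # scalar string, also repeated
})

print(small.shape)          # (4, 3)
print(small['A'].tolist())  # [1.0, 1.0, 1.0, 1.0]
print(small['D'].dtype)     # int32
```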

In IPython, pressing Tab autocompletes the function or attribute name you are typing.

Viewing data

View the first few rows (5 by default) and the last few rows:

df.head()
	A	B	C	D
2013-01-01	0.828772	-0.681941	-0.736688	0.497738
2013-01-02	-1.744554	1.840190	1.108693	0.718830
2013-01-03	1.022257	0.956576	-1.538469	-0.097789
2013-01-04	-0.818469	0.017786	0.365621	0.687680
2013-01-05	0.418984	1.301549	1.248974	-0.712357
df.tail(2)
	A	B	C	D
2013-01-05	0.418984	1.301549	1.248974	-0.712357
2013-01-06	0.949965	-0.778907	0.029515	0.200063

View the index, columns, and values:

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[ 0.82877213, -0.68194098, -0.73668773,  0.49773805],
       [-1.74455398,  1.84018976,  1.10869309,  0.71882978],
       [ 1.02225702,  0.95657618, -1.53846868, -0.0977893 ],
       [-0.81846875,  0.01778553,  0.36562078,  0.68767993],
       [ 0.41898388,  1.30154949,  1.24897433, -0.71235652],
       [ 0.94996532, -0.77890722,  0.02951456,  0.2000626 ]])
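
In newer pandas versions, DataFrame.to_numpy() is the recommended way to get the same array that df.values returns. A minimal sketch, using np.arange data (an assumption made here so the values are predictable) instead of random numbers:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

arr = df.to_numpy()        # plain NumPy array, without index or column labels

print(arr.shape)           # (6, 4)
print(list(df.columns))    # ['A', 'B', 'C', 'D']
print(len(df.index))       # 6
```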

Quick statistical summary:

df.describe()
	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.109493	0.442542	0.079608	0.215694
std	1.135895	1.085575	1.076593	0.550501
min	-1.744554	-0.778907	-1.538469	-0.712357
25%	-0.509106	-0.507009	-0.545137	-0.023326
50%	0.623878	0.487181	0.197568	0.348900
75%	0.919667	1.215306	0.922925	0.640194
max	1.022257	1.840190	1.248974	0.718830
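
To make the rows of describe() less of a black box, here is a sketch that compares a few of them against the corresponding column methods. The seeded data is an assumption for reproducibility. Note that std is the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for this sketch
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=list('ABCD'))

desc = df.describe()

# Each row of describe() matches the equivalent column-wise method.
mean_ok = np.isclose(desc.loc['mean', 'A'], df['A'].mean())
std_ok = np.isclose(desc.loc['std', 'A'], df['A'].to_numpy().std(ddof=1))
median_ok = np.isclose(desc.loc['50%', 'A'], df['A'].median())

print(mean_ok, std_ok, median_ok)
```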

Transpose:

df.T
	2013-01-01 00:00:00	2013-01-02 00:00:00	2013-01-03 00:00:00	2013-01-04 00:00:00	2013-01-05 00:00:00	2013-01-06 00:00:00
A	0.828772	-1.744554	1.022257	-0.818469	0.418984	0.949965
B	-0.681941	1.840190	0.956576	0.017786	1.301549	-0.778907
C	-0.736688	1.108693	-1.538469	0.365621	1.248974	0.029515
D	0.497738	0.718830	-0.097789	0.687680	-0.712357	0.200063
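
A quick sketch of what transposing does to the labels: the index becomes the columns and the columns become the index. The np.arange data is an assumption used here to keep the example deterministic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

t = df.T

print(t.shape)           # (4, 6)
print(list(t.index))     # ['A', 'B', 'C', 'D']
print(t.columns[0])      # 2013-01-01 00:00:00
```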

Sort by column labels in descending order:

df.sort_index(axis=1, ascending=False)
	D	C	B	A
2013-01-01	0.497738	-0.736688	-0.681941	0.828772
2013-01-02	0.718830	1.108693	1.840190	-1.744554
2013-01-03	-0.097789	-1.538469	0.956576	1.022257
2013-01-04	0.687680	0.365621	0.017786	-0.818469
2013-01-05	-0.712357	1.248974	1.301549	0.418984
2013-01-06	0.200063	0.029515	-0.778907	0.949965

Sort by the values of one column:

df.sort_values(by='D')
	A	B	C	D
2013-01-05	0.418984	1.301549	1.248974	-0.712357
2013-01-03	1.022257	0.956576	-1.538469	-0.097789
2013-01-06	0.949965	-0.778907	0.029515	0.200063
2013-01-01	0.828772	-0.681941	-0.736688	0.497738
2013-01-04	-0.818469	0.017786	0.365621	0.687680
2013-01-02	-1.744554	1.840190	1.108693	0.718830
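
Both sorting methods return a new DataFrame and leave the original untouched by default. A minimal sketch on a tiny, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'B': [2, 1, 3], 'A': [9, 8, 7], 'D': [0.5, -0.7, 0.2]})

by_cols = df.sort_index(axis=1, ascending=False)   # order columns by label, Z to A
by_d = df.sort_values(by='D')                      # order rows by the values in D

print(list(by_cols.columns))   # ['D', 'B', 'A']
print(by_d['D'].tolist())      # [-0.7, 0.2, 0.5]
print(list(df.columns))        # ['B', 'A', 'D'], the original is unchanged
```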

Selecting data

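Note: df['A'] returns a Series (a single column of values, with no column header), while df[['A']] returns a one-column DataFrame that keeps the column name.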
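
The difference between single and double brackets is easy to check by looking at the types and shapes. A small sketch on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

col = df['A']      # single brackets: a Series
frame = df[['A']]  # double brackets (a list of labels): a DataFrame

print(type(col).__name__, col.shape)       # Series (2,)
print(type(frame).__name__, frame.shape)   # DataFrame (2, 1)
print(col.name)                            # A
```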

Selection by label

With .loc you can slice by label in whatever way you need:

df.loc[:, ['A', 'B']]
	A	B
2013-01-01	0.828772	-0.681941
2013-01-02	-1.744554	1.840190
2013-01-03	1.022257	0.956576
2013-01-04	-0.818469	0.017786
2013-01-05	0.418984	1.301549
2013-01-06	0.949965	-0.778907
df.loc[[dates[1],dates[3]],:]
	A	B	C	D
2013-01-02	-1.744554	1.840190	1.108693	0.71883
2013-01-04	-0.818469	0.017786	0.365621	0.68768
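
One detail worth knowing about label-based selection: a slice with .loc includes both endpoints, unlike ordinary Python slicing. A sketch using deterministic np.arange data (an assumption made for this example):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

rows = df.loc['20130102':'20130104']   # both end dates are included
cols = df.loc[:, 'A':'C']              # both end columns are included

print(len(rows))            # 3
print(list(cols.columns))   # ['A', 'B', 'C']
```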

Two ways to get a single value by label:

df.loc[dates[0], 'C']
df.at[dates[0], 'C']
-0.7366877311446127
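
Both calls return the same scalar. The difference is that .at only does single-value lookups, while .loc also accepts lists and slices. A sketch with predictable data (the np.arange values are an assumption for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

v_loc = df.loc[dates[0], 'C']   # general label-based lookup
v_at = df.at[dates[0], 'C']     # scalar-only label-based lookup

print(v_loc, v_at)   # 2.0 2.0
```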

Selection by position

df.iloc[[1,3], [2,3]]
	C	D
2013-01-02	1.108693	0.71883
2013-01-04	0.365621	0.68768

Two ways to get a single value by position:

df.iloc[1,3]
df.iat[1,3]
0.7188297799502227
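
Position-based selection follows normal Python rules: positions start at 0 and the end of a slice is excluded. A sketch with np.arange data (an assumption for this example) that also checks that .iloc and .iat agree on a single cell:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

block = df.iloc[1:3, 0:2]   # rows 1 and 2, columns 0 and 1 (end positions excluded)

print(block.shape)                   # (2, 2)
print(df.iloc[1, 3], df.iat[1, 3])   # 7.0 7.0
```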