pandas data-processing function demos: creation, row/column operations, inspection, and file operations
阿新 • Published: 2019-02-04
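pandas is a powerful data-analysis tool for Python. Most of the code in this post comes from 10 Minutes to pandas; I re-ran and modified the examples. They cover most basic operations, although pandas' more advanced features can achieve the same results with far less effort. The original guide is here: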
http://pandas.pydata.org/pandas-docs/stable/10min.html
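For people new to pandas, the biggest headache is often table operations such as merge and join. The official guide below uses diagrams to show how these operations work: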
http://pandas.pydata.org/pandas-docs/stable/merging.html
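As a quick illustration of the merge and join operations mentioned above, here is a minimal sketch. The two frames and their column names (`key`, `lval`, `rval`) are made up for this example and do not come from the official guide.

```python
import pandas as pd

# Two small, hypothetical tables that share a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Inner merge: keep only the keys present in both frames.
inner = pd.merge(left, right, on="key", how="inner")

# Left merge: keep every row of `left`; keys missing from `right` get NaN.
left_join = pd.merge(left, right, on="key", how="left")

# DataFrame.join aligns on the index, so move the key into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="outer")

print(inner)
print(left_join)
print(joined)
```

`merge` matches rows on column values, while `join` matches on index labels; the `how` argument controls which keys survive in both cases.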
Creating a DataFrame
DataFrame and Series are the two main data structures in pandas; they are the carrier and foundation of all data processing.
import numpy as np
import pandas as pd

def create():
    # Create a Series; np.nan marks a missing value
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)
    # Create a DataFrame with a datetime index and labelled columns
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    # Create a DataFrame by passing a dict of objects that can be converted to Series-like
    df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    print(df2)
    # Each column keeps its own dtype
    print(df2.dtypes)
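To make the dict-based construction above easier to check, here is a small sketch that builds a similar frame and collects the resulting dtypes. The names `frame` and `dtype_names` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# A Series with one missing value; NaN forces a floating-point dtype.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Scalars ("A", "F") are broadcast to the length set by the other columns.
frame = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Map each column name to the name of its dtype.
dtype_names = {col: str(dtype) for col, dtype in frame.dtypes.items()}

print(s.dtype)
print(frame.shape)
print(dtype_names)
```

The dtype reported for the string column `F` can differ between pandas versions, so it is printed here rather than assumed.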
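Inspecting DataFrame attributes
After generating data or loading it from a file, the first step is to check whether it meets our needs: the number of rows and columns, basic statistics for each column, and so on. This information helps us understand the data and verify that it is correct: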
def see():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    # See the top and bottom rows of the frame
    print(df.head(2))
    print(df.tail(1))
    # Display the index, the columns, the underlying NumPy data,
    # and the number of rows and columns
    print(df.index)
    print(df.columns)
    print(df.values)
    print(df.shape[0])
    print(df.shape[1])
    # describe() shows a quick statistical summary of the data
    print(df.describe())
    # Transpose the data
    print(df.T)
    # Sort by axis labels: axis=0 is the row index, axis=1 the column labels;
    # ascending=True sorts in ascending order, False in descending order
    print(df.sort_index(axis=0, ascending=False))
    # Sort by the values in column B
    print(df.sort_values(by='B'))
    # Count how often each value occurs in a column (any existing column label works)
    print(df['A'].value_counts())
    # Check the column dtypes and convert selected columns to float
    print(df.dtypes)
    df[['A', 'B']] = df[['A', 'B']].astype(float)
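Because the frame above holds random numbers, the output of the inspection calls changes on every run. The sketch below repeats a few of them (counting values, converting a dtype, sorting by a column) on a small fixed frame; the frame `small` and its columns `city` and `score` are hypothetical.

```python
import pandas as pd

# A tiny fixed frame; "score" is stored as strings on purpose.
small = pd.DataFrame({
    "city": ["x", "y", "x", "z", "x"],
    "score": ["3", "1", "2", "5", "4"],
})

# Number of rows and columns.
n_rows, n_cols = small.shape

# How often each distinct value occurs in a column.
counts = small["city"].value_counts()

# Convert the string column to integers so it sorts numerically.
small["score"] = small["score"].astype(int)

# Sort rows by the values of one column, largest first.
by_score = small.sort_values(by="score", ascending=False)

print(counts)
print(small.dtypes)
print(by_score)
```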
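Selecting data
Once we know the basic shape of the data, we usually want to trim it. In many cases we do not need all of it, so we need to know how to pick out the rows and columns of interest. The following examples show how: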
def selection():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    # Selecting a single column, which yields a Series; equivalent to df.A
    print(df['A'])
    print(df.A)
    # Selecting via [], which slices the rows
    print(df[0:3])
    print(df['20130102':'20130104'])
    # Selection by label
    # Getting a cross section using a label
    print(df.loc[dates[0]])
    # Selecting on multiple axes by label
    print(df.loc[:, ['A', 'B']])
    # Label slicing: both endpoints are included
    print(df.loc['20130102':'20130104', ['A', 'B']])
    # Getting a scalar value
    print(df.loc[dates[0], 'A'])
    print(df.at[dates[0], 'A'])
    # Selection by position
    # Select via the position of the passed integer
    print(df.iloc[3])
    # By integer slices, acting like numpy/python (end excluded)
    print(df.iloc[3:5, 0:2])
    # By lists of integer positions, similar to the numpy/python style
    print(df.iloc[[1, 2, 4], [0, 2]])
    # Slicing rows explicitly
    print(df.iloc[1:3, :])
    # Getting a value explicitly
    print(df.iloc[1, 1])
    print(df.iat[1, 1])
    # Boolean indexing
    # Using a single column's values to select data
    print(df[df.A > 0])
    # Using the isin() method for filtering
    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    print(df2[df2['E'].isin(['two', 'four'])])
    # A where operation for getting
    print(df[df > 0])
    # A where operation with setting; use an all-numeric copy,
    # because comparing the string column E with 0 would fail
    df3 = df.copy()
    df3[df3 > 0] = -df3
    # Setting
    # Setting a new column automatically aligns the data by the index
    s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
    df['F'] = s1
    print(df)
    # Setting values by label and by position
    df.at[dates[0], 'A'] = 0
    df.iat[0, 1] = 0
    print(df)
    # Setting by assigning a NumPy array
    df.loc[:, 'D'] = np.array([5] * len(df))
    print(df)
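Two points in the selection code are easy to miss: label slices include both endpoints while positional slices exclude the end, and assigning a Series as a new column aligns it on the index. The sketch below shows both on a frame with fixed values (built with `np.arange` instead of random numbers, an assumption made so the results are reproducible).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Fixed values 0..23 so every result below is predictable.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: 2013-01-02, -03 and -04 are all included (3 rows).
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Positional slice: rows 1 and 2 only, the end position 3 is excluded.
by_position = df.iloc[1:3, 0:2]

# Boolean mask: column C holds 2, 6, 10, 14, 18, 22, so three rows exceed 10.
large_c = df[df["C"] > 10]

# The Series starts one day later than the frame. Assignment aligns on the
# index: the first row gets NaN and the extra date 2013-01-07 is dropped.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

print(by_label)
print(by_position)
print(large_c)
print(df)
```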
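File operations
Often we do not generate the data ourselves but read it from files, and those files come from all kinds of sources. The example below shows how to load and save data. pandas provides readers and writers for many formats, including delimited text and CSV, which covers most data-analysis needs; libsvm files are not read by pandas itself and need a separate loader.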
def file():
    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
    df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                      columns=['A', 'B', 'C', 'D'])
    # Write the frame to a CSV file first, then read it back
    df.to_csv('foo.csv')
    print(pd.read_csv('foo.csv'))
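By default `read_csv` returns the saved index as an ordinary first column. A minimal round-trip sketch, assuming we want the datetime index back, passes `index_col=0` and `parse_dates=True`. The temporary directory and the names `original` and `restored` are choices made for this example so that it leaves no file behind.

```python
import os
import tempfile

import numpy as np
import pandas as pd

index = pd.date_range("1/1/2000", periods=5)
# Fixed values keep the comparison below deterministic.
original = pd.DataFrame(np.arange(20.0).reshape(5, 4), index=index, columns=list("ABCD"))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.csv")
    # to_csv writes the index as the first column of the file.
    original.to_csv(path)
    # index_col=0 restores that column as the index; parse_dates=True
    # asks pandas to parse the index values as dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

same_values = np.allclose(original.to_numpy(), restored.to_numpy())
same_index = list(original.index) == list(restored.index)

print(restored)
print(same_values, same_index)
```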