pandas data-processing demos of common functions: creation, row/column operations, inspection, and file operations

阿新 • Published: 2019-02-04

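pandas is a powerful data analysis tool for Python. The code in this article comes mainly from 10 Minutes to pandas; I re-ran and modified the examples. They cover most everyday operations, though the more advanced features of pandas can get the same work done with far less effort. The original is here:
http://pandas.pydata.org/pandas-docs/stable/10min.html
For people new to pandas, the biggest headache is usually table operations such as merge and join. The official manual below illustrates how these operations work with diagrams:
http://pandas.pydata.org/pandas-docs/stable/merging.html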
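As a small illustration of the table operations that manual covers, here is a minimal sketch of `pd.merge` on two toy frames. The frame names `left` and `right`, the `key` column, and the sample values are invented for this example and do not come from the original article.

```python
import pandas as pd

# Two hypothetical toy tables sharing a 'key' column (illustrative data only)
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# Inner join: keep only the keys present in both tables
inner = pd.merge(left, right, on='key', how='inner')

# Left join: keep every row of `left`; keys without a match get NaN
left_join = pd.merge(left, right, on='key', how='left')

# Outer join: keep the union of the keys from both tables
outer = pd.merge(left, right, on='key', how='outer')

print(inner)
print(left_join)
print(outer)
```

The `how` argument is the main thing to decide on: it controls which keys survive the join, while `on` names the column used for matching.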

Creating a DataFrame

DataFrame and Series are the two main data structures in pandas; they are the carrier and foundation of all data processing.

import numpy as np
import pandas as pd

def create():

    # Create a Series; np.nan marks a missing value
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)

    # Create a DataFrame with a datetime index and labelled columns
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # Create a DataFrame by passing a dict of objects that can be converted to series-like
    df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    print(df2)

    # The columns have specific dtypes
    print(df2.dtypes)
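To make the dict-based construction easier to check, the following sketch builds a similar frame and inspects how scalars are broadcast and which dtype each column ends up with. The variable names `frame` and `s` are illustrative only.

```python
import numpy as np
import pandas as pd

# Scalars ('A', 'F') are broadcast to the length implied by the other columns
frame = pd.DataFrame({
    'A': 1.,
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo',
})

print(frame.shape)
print(frame.dtypes)

# Missing values in a Series are stored as NaN and skipped by default in reductions
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.isna().sum())   # number of missing entries
print(s.mean())         # mean of the non-missing values
```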

Inspecting DataFrame attributes

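After generating data or loading it from a file, the first step is to check that it is what we expect: the number of rows and columns, basic statistics for each column, and so on. This information helps us understand the characteristics of the data and verify that it is correct: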

def see():

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # See the top and bottom rows of the frame
    print(df.head(2))
    print(df.tail(1))

    # Display the index, the columns, the underlying numpy data,
    # and the number of rows and columns
    print(df.index)
    print(df.columns)
    print(df.values)
    print(df.shape[0])
    print(df.shape[1])

    # describe() shows a quick statistical summary of the data
    print(df.describe())

    # Transpose the data
    print(df.T)

    # Sort by an axis: axis=0 sorts by the row labels, axis=1 by the column labels;
    # ascending=True gives ascending order, False gives descending order
    print(df.sort_index(axis=0, ascending=False))

    # Sort by the values of a column
    print(df.sort_values(by='B'))

    # Count how often each value occurs in a column; any existing column label works
    print(df['A'].value_counts())

    # Check the dtypes and convert selected columns
    print(df.dtypes)
    df[['A', 'B']] = df[['A', 'B']].astype(float)
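Because the frame above holds random floats, the results of `value_counts` and of the sorts differ on every run. The sketch below repeats the same inspection steps on a small hand-written frame so the output is predictable. The column names `city` and `score` and all values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Taipei', 'Tainan', 'Taipei', 'Kaohsiung', 'Taipei'],
    'score': ['3', '1', '2', '5', '4'],   # numbers stored as strings on purpose
})

# Number of rows and columns
n_rows, n_cols = df.shape

# value_counts() returns the frequency of each distinct value, most frequent first
counts = df['city'].value_counts()
print(counts)

# Convert the string column to a numeric dtype before sorting or summarising it
df['score'] = df['score'].astype(int)

# sort_values() orders the rows by the values of a column
by_score = df.sort_values(by='score', ascending=False)
print(by_score)
print(df['score'].describe())
```

Converting with `astype` before sorting matters here: sorting the original strings would compare them character by character rather than numerically.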

Selecting data

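Once we know the basics of the data, we usually want to trim it. In many cases we do not need everything, so we have to learn how to pick out the rows and columns we are interested in. The following examples show how to cut the data down: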

def selection():

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

    # Selecting a single column, which yields a Series; equivalent to df.A
    print(df['A'])
    print(df.A)

    # Selecting via [], which slices the rows
    print(df[0:3])
    print(df['20130102':'20130104'])

    # Selection by label

    # Getting a cross section using a label
    print(df.loc[dates[0]])

    # Selecting on multiple axes by label
    print(df.loc[:, ['A', 'B']])

    # Label slicing: both endpoints are included
    print(df.loc['20130102':'20130104', ['A', 'B']])

    # Getting a scalar value
    print(df.loc[dates[0], 'A'])
    print(df.at[dates[0], 'A'])


    # Selection by position

    # Select via the position of the passed integer
    print(df.iloc[3])

    # By integer slices, acting like numpy/python
    print(df.iloc[3:5, 0:2])

    # By lists of integer positions, like the numpy/python style
    print(df.iloc[[1, 2, 4], [0, 2]])

    # Slicing rows explicitly
    print(df.iloc[1:3, :])

    # Getting a value explicitly
    print(df.iloc[1, 1])
    print(df.iat[1, 1])


    # Boolean indexing

    # Using a single column's values to select data
    print(df[df.A > 0])

    # Using the isin() method for filtering
    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    print(df2[df2['E'].isin(['two', 'four'])])

    # A where operation: entries that fail the condition become NaN
    print(df[df > 0])
    # Flip the sign of the positive entries on a numeric-only copy
    # (comparing the string column 'E' of df2 with 0 would raise a TypeError)
    df3 = df.copy()
    df3[df3 > 0] = -df3

    # Setting
    # Setting a new column automatically aligns the data by the index
    s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
    df['F'] = s1
    print(df)

    # Setting values by label and by position
    df.at[dates[0], 'A'] = 0
    df.iat[0, 1] = 0
    print(df)

    # Setting by assigning a numpy array
    df.loc[:, 'D'] = np.array([5] * len(df))
    print(df)
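Two details of the selection code are easy to miss: `.loc` slices include both endpoints while `.iloc` slices exclude the end, and assigning a Series to a new column aligns on the index. The sketch below checks both on a frame filled with `np.arange` values (a substitution made here so that the results are deterministic; the original uses random data).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slicing with .loc includes both endpoints: 3 rows here
by_label = df.loc['20130102':'20130104', ['A', 'B']]

# Position slicing with .iloc excludes the end, like Python lists: 2 rows here
by_position = df.iloc[1:3, 0:2]

# Boolean indexing keeps only the rows where the condition holds
big_a = df[df['A'] > 8]

# Assigning a Series aligns on the index; dates missing from s1 become NaN,
# and entries of s1 whose date is not in df are dropped
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1

print(by_label)
print(by_position)
print(big_a)
print(df)
```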

File operations

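Often our data is not generated by ourselves but read from files, and those files come from all kinds of sources. The following shows how to load and save data. pandas provides readers and writers for many formats, such as delimited text (txt/csv); libsvm-format data is not read by pandas directly and has to be loaded with another library first. Together these cover most data-analysis needs.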

def file():
    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
    df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                      columns=['A', 'B', 'C', 'D'])
    # Write the frame first so that foo.csv exists, then read it back
    df.to_csv('foo.csv')
    print(pd.read_csv('foo.csv'))
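A CSV round trip does not restore the index automatically. The sketch below, which writes to a temporary directory so that it leaves no files behind, shows one way to get the datetime index back using the `index_col` and `parse_dates` parameters of `read_csv`. The small `np.arange` frame is an assumption made here to keep the output short and deterministic.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B', 'C', 'D'])

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'foo.csv')
    df.to_csv(path)

    # Without index_col, the saved index comes back as an ordinary column
    plain = pd.read_csv(path)

    # index_col=0 and parse_dates=True restore the datetime index
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored.index)
```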
