pandas Learning Series (3): DataFrame

Posted by 阿新 | Published: 2018-12-17

Author: chen_h | WeChat & QQ: 862251340 | WeChat official account: coderpai

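If you work in data science and are moving from Excel-based analysis to Python scripting and automated analysis, you will quickly run into pandas, a very popular data-manipulation library. Development of pandas began in 2008 with Wes McKinney as the lead developer, and the library has since become a standard for data analysis and management in Python. For any Python-based data professional, pandas is an essential tool.

This article aims to help beginners get to grips with the basic pandas data format, the DataFrame. We will look at basic ways of creating a DataFrame and at how a DataFrame works.

The topics covered in this article are:

Loading data from a file into a pandas DataFrame;
Examining basic statistics of the data;
Changing some values;
Finally, writing the results to a new file.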

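What is a DataFrame?

The pandas library defines a DataFrame as a two-dimensional, size-mutable data structure with rows and columns. Put simply, you can think of a DataFrame as a table of data, that is, a single set of formatted two-dimensional data with the following characteristics:

The data can have multiple rows and columns;
Each row represents one sample of data;
Each column contains a different variable describing the samples;
The data in each column is usually of the same type, for example numbers, strings or dates;
Typically, unlike an Excel data set, a DataFrame avoids missing values, and there are no gaps or empty cells between rows or columns. (pandas can still represent missing values, as NaN, when the source data has them.)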

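For example, the following data sets would fit well in a pandas DataFrame:

In a school system DataFrame, each row could represent a single student in the school, and the columns could hold the student's name (string), age (number), date of birth (date) and address (string);
In an economics DataFrame, each row could represent a single city or geographical area, and the columns might include the name of the area (string), the population (number), the average age of the population (number), the number of households (number), the number of schools in each area (number), and so on;
In an e-commerce system or shop, each row in a DataFrame could represent a customer, with columns for the number of items purchased (number), the date of original registration (date) and the credit card (string).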
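As a minimal sketch of the school example above, the snippet below builds a small DataFrame in which each row is a student and each column holds one type of data. The student records and column names are invented purely for illustration.

```python
import pandas as pd

# Hypothetical student records, invented for illustration only.
students = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [14, 15, 14],
        "date_of_birth": pd.to_datetime(["2004-03-01", "2003-07-15", "2004-11-30"]),
        "address": ["1 Main St", "22 Oak Ave", "5 Hill Rd"],
    }
)

# One row per student, one column per variable, one data type per column.
print(students.shape)
print(students.dtypes)
```

The printed dtypes show that pandas keeps a separate type for each column: an integer type for age, a datetime type for the date of birth, and a string-like type for the text columns.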

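Creating a pandas DataFrame

We will look at two ways of creating a DataFrame: manually, and from a comma-separated values (CSV) file.

Manually entering data

Every data science project starts with getting useful data into an analysis environment, in this case Python. There are several ways to create a DataFrame in Python; the simplest method is to type the data into Python by hand, which obviously only works for tiny data sets.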

import pandas as pd

data = {"column_1": [1,2,3,4,5], 
        "another_column": ["this", "column", "has", "strings", "inside"], 
        "float_column": [0.1,0.5,33,48,42.5558],
        "binary_solo": [True, False, True, True, False]
       }
new_dataframe = pd.DataFrame(data)
new_dataframe

another_column	binary_solo	column_1	float_column
0	this	True	1	0.1000
1	column	False	2	0.5000
2	has	True	3	33.0000
3	strings	True	4	48.0000
4	inside	False	5	42.5558

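Creating a DataFrame from Python dictionaries and lists only suits small data sets that you can type in by hand. There are other ways to format manually entered data, which you can find in the official documentation. Note that recent pandas versions keep the columns in the order in which they appear in the dictionary, whereas the output above, produced with an older version, sorts them alphabetically.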
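As a rough sketch of two of those other formats, the same small table can also be built from a list of dictionaries (one per row) or from a list of lists with explicit column names. The variable names here are illustrative only.

```python
import pandas as pd

# Option 1: a list of dictionaries, one dictionary per row.
rows = [
    {"column_1": 1, "another_column": "this"},
    {"column_1": 2, "another_column": "column"},
]
from_records = pd.DataFrame(rows)

# Option 2: a list of lists (one inner list per row) plus explicit column names.
from_lists = pd.DataFrame(
    [[1, "this"], [2, "column"]],
    columns=["column_1", "another_column"],
)

print(from_records)
print(from_lists)
```

Both calls produce the same two-row, two-column table, so the choice mostly depends on the shape your data already has.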

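Note that by convention the pandas library is imported as pd (import pandas as pd). This is the form recommended in the official documentation, and it is the one we use day to day.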

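Loading CSV data into pandas

Once you know the path to the file, creating a DataFrame from a csv file is very simple with the pandas read_csv() function. A csv file is a text file containing data in tabular form, where columns are separated by the comma character "," and each row sits on its own line.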
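To see the format at work without needing a file on disk, the sketch below feeds read_csv() a small hand-typed CSV held in memory. The three column names mirror the data set used later in this article, but the text itself is typed in here only as an illustration.

```python
import io

import pandas as pd

# A tiny CSV held in memory: a header line, then one line per row.
csv_text = (
    "Area,Item,Y2013\n"
    "Afghanistan,Wheat and products,4895\n"
    "Afghanistan,Maize and products,200\n"
)

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text becomes the column names, and each remaining line becomes one row of the DataFrame.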

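If your data is in another form, such as a SQL database or an Excel (XLS / XLSX) file, you can look at the other functions that read from these sources into a DataFrame, namely read_excel and read_sql. For simplicity, however, it is sometimes best to extract the data straight to csv and work from there.

Let's work through an example using data downloaded from the data science competition site Kaggle; the original post linked directly to the download. The data is nicely formatted, and you can open it in Excel first for a preview.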


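The sample data contains 21,477 rows, each corresponding to a food source from a specific country or region. The first 10 columns hold information about the country and the food item, and the remaining columns hold the food production for each year from 1961 to 2013 (63 columns in total).

Next, we can load this csv data with pandas as follows: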

path_to_file = './Downloads/FAO+database.csv'
data = pd.read_csv(path_to_file, encoding='ISO-8859-1')
print(type(data))

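Previewing and examining data in a pandas DataFrame

Once the data is in Python, you will want to see that it has loaded and confirm that the expected rows and columns are present.

Printing the data

If you are using Jupyter, simply typing the name of the DataFrame gives nicely formatted output. Printing is a convenient way to preview the loaded data: you can confirm that the column names were imported correctly, that the data formats are as expected, and whether there are any missing values.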

data.head()

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
0	AF	2	Afghanistan	2511	Wheat and products	5142	Food	1000 tonnes	33.94	67.71	…	3249.0	3486.0	3704.0	4164.0	4252.0	4538.0	4605.0	4711.0	4810	4895
1	AF	2	Afghanistan	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	…	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	…	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360
3	AF	2	Afghanistan	2513	Barley and products	5142	Food	1000 tonnes	33.94	67.71	…	185.0	43.0	44.0	48.0	62.0	55.0	60.0	72.0	78	89
4	AF	2	Afghanistan	2514	Maize and products	5521	Feed	1000 tonnes	33.94	67.71	…	120.0	208.0	233.0	249.0	247.0	195.0	178.0	191.0	200	200

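Getting the rows and columns of a DataFrame

The shape attribute gives information about the size of the data set: it returns a tuple containing the number of rows and the number of columns in the DataFrame. Another descriptive attribute is ndim, which gives the number of dimensions in the data, typically 2.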

data.shape

(21477, 63)

data.ndim

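From the results above we can see that our food production data contains 21,477 rows, each with 63 columns, as shown by the output of .shape. We have two dimensions, that is, a 2D data frame with height and width. A DataFrame reports an ndim of 2 even when it has only one column; a single column selected as a Series reports an ndim of 1.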
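The difference between a one-column DataFrame and a Series can be checked on a toy example. This is a small sketch with made-up values; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
one_column_frame = frame[["a"]]  # list of names: still a DataFrame
series = frame["a"]              # single name: a Series

print(frame.shape, frame.ndim)                        # (3, 2) 2
print(one_column_frame.shape, one_column_frame.ndim)  # (3, 1) 2
print(series.shape, series.ndim)                      # (3,) 1
```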

Previewing a DataFrame with head() and tail()

By default, DataFrame.head() shows the first 5 rows of the DataFrame, while DataFrame.tail() shows the last 5 rows.

To print a specific number of rows, pass that number to head() or tail(). For example, head(10) prints the first 10 rows.

data.head()

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
0	AF	2	Afghanistan	2511	Wheat and products	5142	Food	1000 tonnes	33.94	67.71	…	3249.0	3486.0	3704.0	4164.0	4252.0	4538.0	4605.0	4711.0	4810	4895
1	AF	2	Afghanistan	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	…	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	…	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360
3	AF	2	Afghanistan	2513	Barley and products	5142	Food	1000 tonnes	33.94	67.71	…	185.0	43.0	44.0	48.0	62.0	55.0	60.0	72.0	78	89
4	AF	2	Afghanistan	2514	Maize and products	5521	Feed	1000 tonnes	33.94	67.71	…	120.0	208.0	233.0	249.0	247.0	195.0	178.0	191.0	200	200

5 rows × 63 columns

data.tail()

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
21472	ZW	181	Zimbabwe	2948	Milk - Excluding Butter	5142	Food	1000 tonnes	-19.02	29.15	…	373.0	357.0	359.0	356.0	341.0	385.0	418.0	457.0	426	451
21473	ZW	181	Zimbabwe	2960	Fish, Seafood	5521	Feed	1000 tonnes	-19.02	29.15	…	5.0	4.0	9.0	6.0	9.0	5.0	15.0	15.0	15	15
21474	ZW	181	Zimbabwe	2960	Fish, Seafood	5142	Food	1000 tonnes	-19.02	29.15	…	18.0	14.0	17.0	14.0	15.0	18.0	29.0	40.0	40	40
21475	ZW	181	Zimbabwe	2961	Aquatic Products, Other	5142	Food	1000 tonnes	-19.02	29.15	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0	0
21476	ZW	181	Zimbabwe	2928	Miscellaneous	5142	Food	1000 tonnes	-19.02	29.15	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0	0

5 rows × 63 columns

data.tail(10)

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
21467	ZW	181	Zimbabwe	2943	Meat	5142	Food	1000 tonnes	-19.02	29.15	…	222.0	228.0	233.0	238.0	242.0	265.0	262.0	277.0	280	258
21468	ZW	181	Zimbabwe	2945	Offals	5142	Food	1000 tonnes	-19.02	29.15	…	20.0	20.0	21.0	21.0	21.0	21.0	21.0	21.0	22	22
21469	ZW	181	Zimbabwe	2946	Animal fats	5142	Food	1000 tonnes	-19.02	29.15	…	26.0	26.0	29.0	29.0	27.0	31.0	30.0	25.0	26	20
21470	ZW	181	Zimbabwe	2949	Eggs	5142	Food	1000 tonnes	-19.02	29.15	…	15.0	18.0	18.0	21.0	22.0	27.0	27.0	24.0	24	25
21471	ZW	181	Zimbabwe	2948	Milk - Excluding Butter	5521	Feed	1000 tonnes	-19.02	29.15	…	21.0	21.0	21.0	21.0	21.0	23.0	25.0	25.0	30	31
21472	ZW	181	Zimbabwe	2948	Milk - Excluding Butter	5142	Food	1000 tonnes	-19.02	29.15	…	373.0	357.0	359.0	356.0	341.0	385.0	418.0	457.0	426	451
21473	ZW	181	Zimbabwe	2960	Fish, Seafood	5521	Feed	1000 tonnes	-19.02	29.15	…	5.0	4.0	9.0	6.0	9.0	5.0	15.0	15.0	15	15
21474	ZW	181	Zimbabwe	2960	Fish, Seafood	5142	Food	1000 tonnes	-19.02	29.15	…	18.0	14.0	17.0	14.0	15.0	18.0	29.0	40.0	40	40
21475	ZW	181	Zimbabwe	2961	Aquatic Products, Other	5142	Food	1000 tonnes	-19.02	29.15	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0	0
21476	ZW	181	Zimbabwe	2928	Miscellaneous	5142	Food	1000 tonnes	-19.02	29.15	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0	0

10 rows × 63 columns

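Column data types (dtypes)

Many DataFrames have mixed data types: some columns are numbers, some are strings, some are dates, and so on. Internally, a csv file carries no information about which data type each column holds; all of the data is just characters. pandas infers the data types when it loads the data. For example, if a column contains only numbers, pandas sets that column's data type to a numeric one, either integer or float.

You can check the type of each column in our example with the .dtypes attribute of the DataFrame.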

data.dtypes

Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
Element Code           int64
Element               object
Unit                  object
latitude             float64
longitude            float64
Y1961                float64
Y1962                float64
Y1963                float64
Y1964                float64
Y1965                float64
Y1966                float64
Y1967                float64
Y1968                float64
Y1969                float64
Y1970                float64
Y1971                float64
Y1972                float64
Y1973                float64
Y1974                float64
Y1975                float64
Y1976                float64
Y1977                float64
Y1978                float64
Y1979                float64
Y1980                float64
....

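In some cases the automatic type inference can give unexpected results. Note that strings are loaded as the "object" data type. To change the data type of a specific column, use the .astype() method. It returns a converted copy rather than modifying the column, so assign the result back if you want to keep it. For example, to treat the "Item Code" column as a string, use: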

data['Item Code'].astype(str)
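A short sketch of that conversion, with the result assigned back to the column. The small DataFrame below is a stand-in for the real data; only the "Item Code" column name comes from the data set.

```python
import pandas as pd

# Stand-in for the real data: three item codes loaded as integers.
codes = pd.DataFrame({"Item Code": [2511, 2805, 2513]})

# astype returns a new Series; assign it back to keep the change.
codes["Item Code"] = codes["Item Code"].astype(str)

print(codes["Item Code"].tolist())  # ['2511', '2805', '2513']
```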

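Describing data with .describe()

Finally, to see some core statistics about a particular column, we can use the describe() function.

For numeric columns, describe() returns basic statistics: the count of values, mean, standard deviation, minimum, maximum, and the 25th, 50th (median) and 75th percentiles;
For string columns, describe() returns the count of values, the number of unique entries, the most frequently occurring value (top), and the number of times that value appears (freq).

Select the column to describe with [] and call describe(), as follows: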

data['Y2013'].describe()

count     21477.000000
mean        575.557480
std        6218.379479
min        -246.000000
25%           0.000000
50%           8.000000
75%          90.000000
max      489299.000000
Name: Y2013, dtype: float64

data['Area'].describe()

count     21477
unique      174
top       Spain
freq        150
Name: Area, dtype: object

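Use the describe() function to get basic statistics for the columns of a DataFrame. Note the difference between columns with a numeric data type and columns of strings and characters.

Note that if describe() is called on the entire DataFrame, by default only statistics for the columns with a numeric data type are returned, in DataFrame format.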

data.describe()

Area Code	Item Code	Element Code	latitude	longitude	Y1961	Y1962	Y1963	Y1964	Y1965	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
count	21477.000000	21477.000000	21477.000000	21477.000000	21477.000000	17938.000000	17938.000000	17938.000000	17938.000000	17938.000000	…	21128.000000	21128.000000	21373.000000	21373.000000	21373.000000	21373.000000	21373.000000	21373.000000	21477.000000	21477.000000
mean	125.449411	2694.211529	5211.687154	20.450613	15.794445	195.262069	200.782250	205.464600	209.925577	217.556751	…	486.690742	493.153256	496.319328	508.482104	522.844898	524.581996	535.492069	553.399242	560.569214	575.557480
std	72.868149	148.973406	146.820079	24.628336	66.012104	1864.124336	1884.265591	1861.174739	1862.000116	2014.934333	…	5001.782008	5100.057036	5134.819373	5298.939807	5496.697513	5545.939303	5721.089425	5883.071604	6047.950804	6218.379479
min	1.000000	2511.000000	5142.000000	-40.900000	-172.100000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-169.000000	-246.000000
25%	63.000000	2561.000000	5142.000000	6.430000	-11.780000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	120.000000	2640.000000	5142.000000	20.590000	19.150000	1.000000	1.000000	1.000000	1.000000	1.000000	…	6.000000	6.000000	7.000000	7.000000	7.000000	7.000000	7.000000	8.000000	8.000000	8.000000
75%	188.000000	2782.000000	5142.000000	41.150000	46.870000	21.000000	22.000000	23.000000	24.000000	25.000000	…	75.000000	77.000000	78.000000	80.000000	82.000000	83.000000	83.000000	86.000000	88.000000	90.000000
max	276.000000	2961.000000	5521.000000	64.960000	179.410000	112227.000000	109130.000000	106356.000000	104234.000000	119378.000000	…	360767.000000	373694.000000	388100.000000	402975.000000	425537.000000	434724.000000	451838.000000	462696.000000	479028.000000	489299.000000

8 rows × 58 columns

The statistics returned by describe() come back as another DataFrame.
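If you also want the string columns summarised in the same call, describe() accepts an include argument. The sketch below uses a made-up three-row table to contrast the default behaviour with include="all".

```python
import pandas as pd

# Made-up values; only the column names echo the real data set.
mixed = pd.DataFrame(
    {"Area": ["Spain", "Spain", "Ireland"], "Y2013": [10.0, 20.0, 30.0]}
)

numeric_only = mixed.describe()             # default: numeric columns only
everything = mixed.describe(include="all")  # numeric and string columns

print(numeric_only)
print(everything)
```

With include="all", statistics that do not apply to a column (for example mean for a string column, or top for a numeric one) are shown as NaN.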

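Selecting and manipulating data

The data selection methods in pandas are very flexible. In this article we look at the basic operations on columns and rows.

Selecting columns

There are three main ways to select columns in pandas:

Using dot notation, e.g. data.column_name (this works only when the column name is a valid Python identifier, so not for names containing spaces such as 'Area Code');
Using square brackets with the column name as a string, e.g. data['column_name'];
Using numeric indexing with the iloc selector, data.iloc[:, <column_number>].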

data.Area.head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: Area, dtype: object

data['Area'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: Area, dtype: object

data.iloc[:,2].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: Area, dtype: object

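Whichever of these methods you use to select a column, the result is a Series. A pandas Series is a one-dimensional data structure. It is useful to know the basic operations that can be performed on a Series, including summing (.sum()), averaging (.mean()), counting (.count()), getting the median (.median()), and replacing missing values (.fillna(new_value)).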

[data['Y2007'].sum(), # Total sum of the column values
 data['Y2007'].mean(), # Mean of the column values
 data['Y2007'].median(), # Median of the column values
 data['Y2007'].nunique(), # Number of unique entries
 data['Y2007'].max(), # Maximum of the column values
 data['Y2007'].min()] # Minimum of the column values

[10867788.0, 508.48210358863986, 7.0, 1994, 402975.0, 0.0]
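The example above does not show .count() or .fillna(), so here is a small sketch of how they interact with missing values. The Series below is made up for illustration; in the real data, the early year columns contain missing entries of the same kind.

```python
import pandas as pd

# A made-up yearly column with one missing value.
production = pd.Series([10.0, None, 30.0], name="Y1961")

print(production.count())  # 2: only non-missing values are counted
print(production.mean())   # 20.0: missing values are skipped by default

# fillna returns a new Series; the original is left unchanged.
filled = production.fillna(0)
print(filled.mean())       # the zero now counts towards the mean
```

Whether to fill missing values with 0, with the mean, or not at all depends on what a missing entry means in your data.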

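Selecting several columns at once extracts a new DataFrame from the existing one. To select multiple columns, the syntax is:

Square-bracket selection with a list of column names, e.g. data[['column_name_1', 'column_name_2']];
Numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,3,4]].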

data[['Area Code', 'Area']].head()

Area Code	Area
0	2	Afghanistan
1	2	Afghanistan
2	2	Afghanistan
3	2	Afghanistan
4	2	Afghanistan

data.iloc[:,[1,2]].head()

Area Code	Area
0	2	Afghanistan
1	2	Afghanistan
2	2	Afghanistan
3	2	Afghanistan
4	2	Afghanistan

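Selecting rows

Rows in a DataFrame are usually selected with the iloc / loc selection methods, or with logical selectors (selection based on the value of another column or variable). The basic ways of selecting rows are:

Numeric selection with the iloc selector, e.g. data.iloc[0:10, :], which selects the first 10 rows;
Label-based row selection with the loc selector, e.g. data.loc[2, :];
Logic-based row selection with an evaluated expression, e.g. data[data["Area"] == "Ireland"], which selects the rows whose Area value is Ireland.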

data.iloc[[1,2], : ].head()

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
1	AF	2	Afghanistan	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	…	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	…	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360

2 rows × 63 columns

data.loc[2, : ]

Area Abbreviation                     AF
Area Code                              2
Area                         Afghanistan
Item Code                           2513
Item                 Barley and products
Element Code                        5521
Element                             Feed
Unit                         1000 tonnes
latitude                           33.94
longitude                          67.71
Y1961                                 76
....

data[ data["Area"] == 'Ireland' ].head()

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
9533	IE	104	Ireland	2511	Wheat and products	5521	Feed	1000 tonnes	53.41	-8.24	…	968.0	976.0	902.0	685.0	1063.0	804.0	783.0	760.0	650	600
9534	IE	104	Ireland	2511	Wheat and products	5142	Food	1000 tonnes	53.41	-8.24	…	395.0	423.0	501.0	449.0	470.0	493.0	512.0	502.0	494	500
9535	IE	104	Ireland	2805	Rice (Milled Equivalent)	5521	Feed	1000 tonnes	53.41	-8.24	…	3.0	3.0	3.0	4.0	5.0	4.0	4.0	4.0	4	4
9536	IE	104	Ireland	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	53.41	-8.24	…	11.0	6.0	6.0	9.0	14.0	15.0	16.0	14.0	14	14
9537	IE	104	Ireland	2513	Barley and products	5521	Feed	1000 tonnes	53.41	-8.24	…	993.0	908.0	1047.0	904.0	1242.0	1290.0	1283.0	1182.0	1146	1380

5 rows × 63 columns

These row and column selection methods can be combined flexibly to work with exactly the data you need.
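One possible way to combine them is to pass a boolean mask for the rows and a list of names for the columns to loc in a single call. The sketch below uses a made-up three-row table whose column names echo the food data.

```python
import pandas as pd

# Made-up values; the column names follow the food production data.
food = pd.DataFrame(
    {
        "Area": ["Ireland", "Ireland", "Spain"],
        "Element": ["Feed", "Food", "Food"],
        "Y2013": [600, 500, 700],
    }
)

# Rows where both conditions hold, restricted to two columns.
mask = (food["Area"] == "Ireland") & (food["Element"] == "Food")
subset = food.loc[mask, ["Area", "Y2013"]]
print(subset)
```

Each condition needs its own parentheses because & binds more tightly than == in Python.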

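Deleting rows and columns (drop)

To delete rows and columns from a DataFrame, pandas provides the drop function.

To delete one or more columns, pass the column name(s) and set axis to 1. Alternatively, as in the example below, pandas added a "columns" parameter so that the axis does not need to be specified. drop returns a new DataFrame with the columns removed. To modify the original DataFrame instead, set the inplace parameter to True, in which case nothing is returned.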

# Deleting columns
# The first three statements are alternatives: run only one of them,
# because "Area" no longer exists once it has been dropped.

# Delete the "Area" column from the dataframe
data = data.drop("Area", axis=1)
# alternatively, delete columns using the columns parameter of drop
data = data.drop(columns="Area")
# Delete the Area column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("Area", axis=1, inplace=True)
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)

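Rows can also be deleted with drop by specifying axis=0. drop() removes rows by label rather than by numeric position; to delete rows by numeric position / index, use iloc to reassign the DataFrame, as shown below: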

data.head(3)

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
0	AF	2	Afghanistan	2511	Wheat and products	5142	Food	1000 tonnes	33.94	67.71	…	3249.0	3486.0	3704.0	4164.0	4252.0	4538.0	4605.0	4711.0	4810	4895
1	AF	2	Afghanistan	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	…	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	…	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360

3 rows × 63 columns

data.drop([0,1], axis=0).head(3)

Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	…	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	…	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360
3	AF	2	Afghanistan	2513	Barley and products	5142	Food	1000 tonnes	33.94	67.71	…	185.0	43.0	44.0	48.0	62.0	55.0	60.0	72.0	78	89
4	AF	2	Afghanistan	2514	Maize and products	5521	Feed	1000 tonnes	33.94	67.71	…	120.0	208.0	233.0	249.0	247.0	195.0	178.0	191.0	200	200

3 rows × 63 columns

The drop() function in pandas deletes rows from a DataFrame when the axis is set to 0. As before, the inplace parameter can be used to change the DataFrame without reassignment.

# Delete the rows with labels 0,1,2
data = data.drop([0,1,2], axis=0)
# Delete the rows with label "Ireland"
# For label-based deletion, set the index first on the dataframe:
data = data.set_index("Area")
data = data.drop("Ireland", axis=0)  # Delete all rows with label "Ireland"
# Delete the first five rows using iloc selector
data = data.iloc[5:]
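The label-versus-position distinction is easiest to see when the index is not numeric. In this sketch, a toy DataFrame with letter labels is trimmed in three equivalent ways; the data and names are made up.

```python
import pandas as pd

letters = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = letters.drop(["a", "b"], axis=0)        # drop by index label
by_position = letters.drop(letters.index[[0, 1]])  # look up labels by position, then drop
by_slice = letters.iloc[2:]                        # keep everything from position 2 on

print(by_label)
print(by_label.equals(by_position) and by_position.equals(by_slice))  # True
```

Looking labels up through letters.index is a handy bridge when you know the positions you want to remove but drop() expects labels.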

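Renaming columns

Renaming columns is easy in pandas with the DataFrame rename function, which is simple to use and quite flexible. Columns can be renamed in two ways:

By using a dictionary that maps old names to new names, in the form {"old_column_name": "new_column_name", …};
By providing a function that changes the column names. The function is applied to every column name.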

# Rename columns using a dictionary to map values
# Rename the Area column to 'place_name'
data = data.rename(columns={"Area": "place_name"})
# Again, the inplace parameter will change the dataframe without assignment
data.rename(columns={"Area": "place_name"}, inplace=True)
# Rename multiple columns in one go with a larger dictionary
data.rename(
    columns={
        "Area": "place_name",
        "Y2001": "year_2001"
    },
    inplace=True
)
# Rename all columns using a function, e.g. convert all column names to lower case:
data = data.rename(columns=str.lower)

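In many cases I use a tidying function on the column names to keep variable names in a standard snake_case format. When loading data from potentially unstructured data sets, a lambda function that replaces spaces with underscores and lowercases all column names is useful: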

# Quickly lowercase all column names and replace spaces with underscores (snake_case)
data = pd.read_csv("/path/to/csv/file.csv")
data = data.rename(columns=lambda x: x.lower().replace(' ', '_'))
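Since the snippet above needs a real file, here is a self-contained sketch of the same tidying step on an empty DataFrame that only has column names. The helper name tidy is illustrative, and the extra strip() call is an addition that removes stray leading or trailing spaces.

```python
import pandas as pd

# An empty DataFrame is enough to demonstrate renaming.
raw = pd.DataFrame(columns=["Area Code", "Item Code", "Y2013"])

# rename returns a new DataFrame; raw keeps its original column names.
tidy = raw.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

print(list(tidy.columns))  # ['area_code', 'item_code', 'y2013']
print(list(raw.columns))   # ['Area Code', 'Item Code', 'Y2013']
```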

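Exporting and saving pandas DataFrames

After manipulation or calculation, the next step is to save the data back to a csv file. Writing data out of pandas is as easy as loading it.

You only need to know two functions: to_csv writes a DataFrame to a csv file, and to_excel writes a DataFrame to a Microsoft Excel file.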

# Output data to a CSV file
# Typically, I don't want row numbers in my output file, hence index=False.
# To avoid character issues, I typically use utf8 encoding for input/output.
data.to_csv("output_filename.csv", index=False, encoding='utf8')
# Output data to an Excel file.
# For the excel output to work, you may need to install the "xlsxwriter" (or "openpyxl") package.
data.to_excel("output_excel_file.xlsx", sheet_name="Sheet 1", index=False)
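To check that nothing is lost on the way out, you can write a DataFrame to CSV and read it straight back. The sketch below does the round trip through an in-memory buffer instead of a file, with made-up values, so it runs without touching the disk.

```python
import io

import pandas as pd

original = pd.DataFrame({"Area": ["Ireland", "Spain"], "Y2013": [500, 700]})

buffer = io.StringIO()
original.to_csv(buffer, index=False)  # index=False omits the row labels
buffer.seek(0)                        # rewind before reading back
restored = pd.read_csv(buffer)

print(restored)
```

If index=False were left out, the row labels would be written as an extra unnamed column and would show up again on reading.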

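Other useful functions

Grouping and aggregating data

Once the data is loaded, you will often need to group it by one value or another and then run some calculations. We will cover this in a later article.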
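As a brief preview of what grouping looks like, the sketch below sums a yearly column per area on a made-up four-row table. It is only meant to show the shape of the call, not to stand in for the fuller treatment in the later article.

```python
import pandas as pd

# Made-up values; the column names follow the food production data.
food = pd.DataFrame(
    {
        "Area": ["Ireland", "Ireland", "Spain", "Spain"],
        "Element": ["Feed", "Food", "Feed", "Food"],
        "Y2013": [600, 500, 900, 700],
    }
)

# One total per area.
totals = food.groupby("Area")["Y2013"].sum()

# Several statistics per (area, element) pair.
by_element = food.groupby(["Area", "Element"])["Y2013"].agg(["sum", "mean"])

print(totals)
print(by_element)
```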

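Plotting pandas DataFrames: bar charts and lines

pandas has relatively extensive built-in plotting functionality that can be used for initial graphical exploration, especially when you are doing data analysis in Jupyter.

You need the matplotlib plotting package installed to generate the figures, and importing matplotlib.pyplot as plt lets you add figure and axis labels to the charts. The native pandas plot() command offers a great deal of functionality.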

import matplotlib.pyplot as plt

data['latitude'].plot(kind='hist', bins=100)
plt.xlabel('Latitude Value')
plt.show()


plot_data = data[data["Element"] == 'Food']
plot_data = plot_data.groupby('Area')['Y2013'].sum()
plot_data.sort_values()[-10:].plot(kind='bar')
plt.title("Top Ten Food Producers")
plt.ylabel("Food produced (1000 tonnes)")
plt.show()


The pandas DataFrame plotting commands let you combine data analysis, data grouping and final plotting.
