Data Science and Artificial Intelligence Technical Notes, Part 19: Data Wrangling (Part 2)
阿新 • Published: 2019-01-05
Author: Chris Albon
Translator: 飛龍
Joining and Merging DataFrames
# Import modules
import pandas as pd
from IPython.display import display
from IPython.display import Image
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung' ],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a
| | subject_id | first_name | last_name |
---|---|---|---|
0 | 1 | Alex | Anderson |
1 | 2 | Amy | Ackerman |
2 | 3 | Allen | Ali |
3 | 4 | Alice | Aoni |
4 | 5 | Ayoung | Atiches |
# Create a second DataFrame
raw_data = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran' , 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_b
| | subject_id | first_name | last_name |
---|---|---|---|
0 | 4 | Billy | Bonder |
1 | 5 | Brian | Black |
2 | 6 | Bran | Balwner |
3 | 7 | Bryce | Brice |
4 | 8 | Betty | Btisan |
# Create a third DataFrame
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])
df_n
| | subject_id | test_id |
---|---|---|
0 | 1 | 51 |
1 | 2 | 15 |
2 | 3 | 15 |
3 | 4 | 61 |
4 | 5 | 16 |
5 | 7 | 14 |
6 | 8 | 15 |
7 | 9 | 1 |
8 | 10 | 61 |
9 | 11 | 16 |
# Concatenate the two DataFrames along rows (stack them vertically)
df_new = pd.concat([df_a, df_b])
df_new
| | subject_id | first_name | last_name |
---|---|---|---|
0 | 1 | Alex | Anderson |
1 | 2 | Amy | Ackerman |
2 | 3 | Allen | Ali |
3 | 4 | Alice | Aoni |
4 | 5 | Ayoung | Atiches |
0 | 4 | Billy | Bonder |
1 | 5 | Brian | Black |
2 | 6 | Bran | Balwner |
3 | 7 | Bryce | Brice |
4 | 8 | Betty | Btisan |
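Note that the concatenated result above keeps each frame's original row labels, so the labels 0 to 4 appear twice. The following is a minimal sketch, using small made-up frames named `top` and `bottom` (not part of the original notes), of how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two small illustrative frames; both use the default 0..n-1 index
top = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
bottom = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

# Default behaviour: each frame keeps its own labels, so 0 and 1 repeat
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```

Unique row labels matter if you later select rows with `.loc`, since a repeated label returns more than one row.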
# Concatenate the two DataFrames along columns (side by side)
pd.concat([df_a, df_b], axis=1)
| | subject_id | first_name | last_name | subject_id | first_name | last_name |
---|---|---|---|---|---|---|
0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
1 | 2 | Amy | Ackerman | 5 | Brian | Black |
2 | 3 | Allen | Ali | 6 | Bran | Balwner |
3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
# Merge the two DataFrames on subject_id
pd.merge(df_new, df_n, on='subject_id')
| | subject_id | first_name | last_name | test_id |
---|---|---|---|---|
0 | 1 | Alex | Anderson | 51 |
1 | 2 | Amy | Ackerman | 15 |
2 | 3 | Allen | Ali | 15 |
3 | 4 | Alice | Aoni | 61 |
4 | 4 | Billy | Bonder | 61 |
5 | 5 | Ayoung | Atiches | 16 |
6 | 5 | Brian | Black | 16 |
7 | 7 | Bryce | Brice | 14 |
8 | 8 | Betty | Btisan | 15 |
# Merge the two DataFrames, naming the key column of the left and right frame separately
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
| | subject_id | first_name | last_name | test_id |
---|---|---|---|---|
0 | 1 | Alex | Anderson | 51 |
1 | 2 | Amy | Ackerman | 15 |
2 | 3 | Allen | Ali | 15 |
3 | 4 | Alice | Aoni | 61 |
4 | 4 | Billy | Bonder | 61 |
5 | 5 | Ayoung | Atiches | 16 |
6 | 5 | Brian | Black | 16 |
7 | 7 | Bryce | Brice | 14 |
8 | 8 | Betty | Btisan | 15 |
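Merge with an outer join.
"Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null." – [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)
pd.merge(df_a, df_b, on='subject_id', how='outer')
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |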
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
5 | 6 | NaN | NaN | Bran | Balwner |
6 | 7 | NaN | NaN | Bryce | Brice |
7 | 8 | NaN | NaN | Betty | Btisan |
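To see which side each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. This is a small sketch on made-up two-row frames (`left` and `right` are illustrative names, not from the original notes).

```python
import pandas as pd

# Illustrative frames sharing only subject_id '4'
left = pd.DataFrame({'subject_id': ['1', '4'], 'first_name': ['Alex', 'Alice']})
right = pd.DataFrame({'subject_id': ['4', '6'], 'first_name': ['Billy', 'Bran']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(left, right, on='subject_id', how='outer', indicator=True)

# Map each key to the side(s) it was found on
origin = dict(zip(merged['subject_id'], merged['_merge'].astype(str)))
print(origin)
```

Rows marked `left_only` or `right_only` are exactly the ones that an inner join would drop.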
Merge with an inner join.
"Inner join produces only the set of records that match in both Table A and Table B." – Source
pd.merge(df_a, df_b, on='subject_id', how='inner')
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
---|---|---|---|---|---|
0 | 4 | Alice | Aoni | Billy | Bonder |
1 | 5 | Ayoung | Atiches | Brian | Black |
# Merge with a right join
pd.merge(df_a, df_b, on='subject_id', how='right')
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
---|---|---|---|---|---|
0 | 4 | Alice | Aoni | Billy | Bonder |
1 | 5 | Ayoung | Atiches | Brian | Black |
2 | 6 | NaN | NaN | Bran | Balwner |
3 | 7 | NaN | NaN | Bryce | Brice |
4 | 8 | NaN | NaN | Betty | Btisan |
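Merge with a left join.
"Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null." – Source
pd.merge(df_a, df_b, on='subject_id', how='left')
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |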
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
# Add custom suffixes to the duplicated column names when merging
pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))
| | subject_id | first_name_left | last_name_left | first_name_right | last_name_right |
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
# Merge on the index
pd.merge(df_a, df_b, right_index=True, left_index=True)
| | subject_id_x | first_name_x | last_name_x | subject_id_y | first_name_y | last_name_y |
---|---|---|---|---|---|---|
0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
1 | 2 | Amy | Ackerman | 5 | Brian | Black |
2 | 3 | Allen | Ali | 6 | Bran | Balwner |
3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
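An index-based merge can also be written with `DataFrame.join`. The sketch below uses small made-up frames (`left` and `right` are illustrative names) and passes the `_x`/`_y` suffixes explicitly, because `join` does not add them by default. One difference to keep in mind: `join` defaults to `how='left'`, while `pd.merge` defaults to `how='inner'`; the two agree here only because both frames have the same index.

```python
import pandas as pd

# Illustrative frames with identical default indexes
left = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
right = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

# Index-on-index merge, as in the example above
by_merge = pd.merge(left, right, left_index=True, right_index=True)

# The same alignment with join; overlapping column names need explicit suffixes
by_join = left.join(right, lsuffix='_x', rsuffix='_y')

print(list(by_join.columns))
```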
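Listing Unique Values in a pandas Column
Special thanks to Bob Haffner for pointing out a better way to do this.
# Import modules
import pandas as pd
# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)
# Create an example DataFrame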
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
| | name | reports | year |
---|---|---|---|
Cochice | Jason | 4 | 2012 |
Pima | Molly | 24 | 2012 |
Santa Cruz | Tina | 31 | 2013 |
Maricopa | Jake | 2 | 2014 |
Yuma | Amy | 3 | 2014 |
# List the unique values in df['name']
df['name'].unique()
# array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object)
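Every name in the example above is already unique, so a column with repeats shows the behaviour more clearly. This sketch rebuilds only the `year` column from the example and adds two related calls, `nunique()` and `value_counts()`.

```python
import pandas as pd

# Same 'name' and 'year' values as the example DataFrame above
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'year': [2012, 2012, 2013, 2014, 2014]})

# Distinct values, in order of first appearance
unique_years = df['year'].unique()

# Number of distinct values
n_years = df['year'].nunique()

# How many rows carry each value
counts = df['year'].value_counts().to_dict()

print(list(unique_years), n_years, counts)
```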
Loading a JSON File
# Load library
import pandas as pd
# URL of the JSON file (a local file path also works)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'
# Load the JSON file into a DataFrame
df = pd.read_json(url, orient='columns')
# View the first ten rows
df.head(10)
| | category | datetime | integer |
---|---|---|---|
0 | 0 | 2015-01-01 00:00:00 | 5 |
1 | 0 | 2015-01-01 00:00:01 | 5 |
10 | 0 | 2015-01-01 00:00:10 | 5 |
11 | 0 | 2015-01-01 00:00:11 | 5 |
12 | 0 | 2015-01-01 00:00:12 | 8 |
13 | 0 | 2015-01-01 00:00:13 | 9 |
14 | 0 | 2015-01-01 00:00:14 | 8 |
15 | 0 | 2015-01-01 00:00:15 | 8 |
16 | 0 | 2015-01-01 00:00:16 | 2 |
17 | 0 | 2015-01-01 00:00:17 | 1 |
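The `orient` argument tells `read_json` how the JSON is laid out. As a sketch that does not depend on the remote file, the snippet below parses a small in-memory string laid out as a list of records; the `json_text` content is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical JSON laid out as a list of records (one object per row)
json_text = '[{"category": 0, "integer": 5}, {"category": 0, "integer": 8}]'

# Wrap the string in StringIO so read_json treats it as a file-like object
df_records = pd.read_json(io.StringIO(json_text), orient='records')

print(df_records)
```

With `orient='columns'`, as used above, the JSON is instead expected to map each column name to a mapping of row label to value.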
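Loading an Excel File
# Load library
import pandas as pd
# URL of the Excel file (a local file path also works)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx'
# Load the first sheet of the Excel file into a DataFrame
# (the keyword is sheet_name; the older spelling sheetname is no longer accepted)
df = pd.read_excel(url, sheet_name=0, header=1)
# View the first ten rows
df.head(10)
| | 5 | 2015-01-01 00:00:00 | 0 |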
---|---|---|---|
0 | 5 | 2015-01-01 00:00:01 | 0 |
1 | 9 | 2015-01-01 00:00:02 | 0 |
2 | 6 | 2015-01-01 00:00:03 | 0 |
3 | 6 | 2015-01-01 00:00:04 | 0 |
4 | 9 | 2015-01-01 00:00:05 | 0 |
5 | 7 | 2015-01-01 00:00:06 | 0 |
6 | 1 | 2015-01-01 00:00:07 | 0 |
7 | 6 | 2015-01-01 00:00:08 | 0 |
8 | 9 | 2015-01-01 00:00:09 | 0 |
9 | 5 | 2015-01-01 00:00:10 | 0 |
Loading an Excel Spreadsheet as a DataFrame
# Import modules
import pandas as pd
# Load the Excel file and assign it to xls_file
xls_file = pd.ExcelFile('../data/example.xls')
xls_file
# <pandas.io.excel.ExcelFile at 0x111912be0>
# View the sheet names in the spreadsheet
xls_file.sheet_names
# ['Sheet1']
# Load Sheet1 of the xls file as a DataFrame
df = xls_file.parse('Sheet1')
df
| | year | deaths_attacker | deaths_defender | soldiers_attacker | soldiers_defender | wounded_attacker | wounded_defender |
---|---|---|---|---|---|---|---|
0 | 1945 | 425 | 423 | 2532 | 37235 | 41 | 14 |
1 | 1956 | 242 | 264 | 6346 | 2523 | 214 | 1424 |
2 | 1964 | 323 | 1231 | 3341 | 2133 | 131 | 131 |
3 | 1969 | 223 | 23 | 6732 | 1245 | 12 | 12 |
4 | 1971 | 783 | 23 | 12563 | 2671 | 123 | 34 |
5 | 1981 | 436 | 42 | 2356 | 7832 | 124 | 124 |
6 | 1982 | 324 | 124 | 253 | 2622 | 264 | 1124 |
7 | 1992 | 3321 | 631 | 5277 | 3331 | 311 | 1431 |
8 | 1999 | 262 | 232 | 2732 | 2522 | 132 | 122 |
9 | 2004 | 843 | 213 | 6278 | 26773 | 623 | 2563 |
Loading a CSV
# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, ".", "."],
'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
| | first_name | last_name | age | preTestScore | postTestScore |
---|---|---|---|---|---|
0 | Jason | Miller | 42 | 4 | 25,000 |
1 | Molly | Jacobson | 52 | 24 | 94,000 |
2 | Tina | . | 36 | 31 | 57 |
3 | Jake | Milner | 24 | . | 62 |
4 | Amy | Cooze | 73 | . | 70 |
# Save the DataFrame as a CSV file in the working directory
df.to_csv('pandas_dataframe_importing_csv/example.csv')
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
df
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
---|---|---|---|---|---|---|
0 | 0 | Jason | Miller | 42 | 4 | 25,000 |
1 | 1 | Molly | Jacobson | 52 | 24 | 94,000 |
2 | 2 | Tina | . | 36 | 31 | 57 |
3 | 3 | Jake | Milner | 24 | . | 62 |
4 | 4 | Amy | Cooze | 73 | . | 70 |
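The `Unnamed: 0` column above is the old row index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. Here is a minimal sketch of two ways to avoid it, using an in-memory buffer in place of the file on disk (the two-row frame is illustrative).

```python
import io
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly'], 'age': [42, 52]})

# Option 1: do not write the index at all
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
no_index = pd.read_csv(buffer)

# Option 2: write the index, then read the first column back as the index
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(no_index.columns), list(restored.columns))
```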
# Load a CSV without treating any row as the header
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None)
df
| | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | NaN | first_name | last_name | age | preTestScore | postTestScore |
1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 |
2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
3 | 2.0 | Tina | . | 36 | 31 | 57 |
4 | 3.0 | Jake | Milner | 24 | . | 62 |
5 | 4.0 | Amy | Cooze | 73 | . | 70 |
# Specify the column names when loading the CSV
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
| | UID | First Name | Last Name | Age | Pre-Test Score | Post-Test Score |
---|---|---|---|---|---|---|
0 | NaN | first_name | last_name | age | preTestScore | postTestScore |
1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 |
2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
3 | 2.0 | Tina | . | 36 | 31 | 57 |
4 | 3.0 | Jake | Milner | 24 | . | 62 |
5 | 4.0 | Amy | Cooze | 73 | . | 70 |
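In the output above, the file's own header line ends up as the first data row, because passing `names` alone tells pandas the file has no header. When the file does contain a header that you want to replace, combining `names` with `header=0` is one way to do it. A sketch on a made-up in-memory CSV:

```python
import io
import pandas as pd

# Hypothetical CSV text that includes its own header line
csv_text = "first_name,age\nJason,42\nMolly,52\n"
new_names = ['First Name', 'Age']

# names alone: the original header line is read as data
as_data = pd.read_csv(io.StringIO(csv_text), names=new_names)

# header=0 plus names: the original header line is replaced
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)

print(len(as_data), len(replaced))
```

Replacing the header also lets pandas infer a numeric type for `Age`, since the column no longer contains the string `age`.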
# Load the CSV with UID as the index column
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
UID | First Name | Last Name | Age | Pre-Test Score | Post-Test Score |
---|---|---|---|---|---|
NaN | first_name | last_name | age | preTestScore | postTestScore |
0.0 | Jason | Miller | 42 | 4 | 25,000 |
1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
2.0 | Tina | . | 36 | 31 | 57 |
3.0 | Jake | Milner | 24 | . | 62 |
4.0 | Amy | Cooze | 73 | . | 70 |
# Load the CSV with first name and last name as the index columns
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
First Name | Last Name | UID | Age | Pre-Test Score | Post-Test Score |
---|---|---|---|---|---|
first_name | last_name | NaN | age | preTestScore | postTestScore |
Jason | Miller | 0.0 | 42 | 4 | 25,000 |
Molly | Jacobson | 1.0 | 52 | 24 | 94,000 |
Tina | . | 2.0 | 36 | 31 | 57 |
Jake | Milner | 3.0 | 24 | . | 62 |
Amy | Cooze | 4.0 | 73 | . | 70 |
# Treat '.' as a missing value when loading the CSV
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.'])
pd.isnull(df)
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | True | False | False | False |
3 | False | False | False | False | True | False |
4 | False | False | False | False | True | False |
# Load the CSV, treating '.' and 'NA' as missing in the last_name column and '.' as missing in the preTestScore column
# (the dictionary keys must match the column names as they appear in the file)
sentinels = {'last_name': ['.', 'NA'], 'preTestScore': ['.']}
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
---|---|---|---|---|---|---|
0 | 0 | Jason | Miller | 42 | 4.0 | 25,000 |
1 | 1 | Molly | Jacobson | 52 | 24.0 | 94,000 |
2 | 2 | Tina | NaN | 36 | 31.0 | 57 |
3 | 3 | Jake | Milner | 24 | NaN | 62 |
4 | 4 | Amy | Cooze | 73 | NaN | 70 |
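Per-column sentinels only take effect when the dictionary keys match the column names as they appear in the file. The following sketch shows this on a made-up in-memory CSV (`csv_text` is illustrative, not the file used above).

```python
import io
import pandas as pd

# Hypothetical CSV text using '.' as a placeholder for missing values
csv_text = ("first_name,last_name,preTestScore\n"
            "Jason,Miller,4\n"
            "Tina,.,31\n"
            "Jake,Milner,.\n")

# Keys are the column names in the file; values list that column's sentinels
sentinels = {'last_name': ['.', 'NA'], 'preTestScore': ['.']}

df = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)
missing = df.isnull()

print(missing)
```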
# Skip the first 3 lines when loading the CSV
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3)
df
| | 2 | Tina | . | 36 | 31 | 57 |
---|---|---|---|---|---|---|
0 | 3 | Jake | Milner | 24 | . | 62 |
1 | 4 | Amy | Cooze | 73 | . | 70 |
# Load the CSV, interpreting ',' inside numeric strings as a thousands separator
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',')
df
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
---|---|---|---|---|---|---|
0 | 0 | Jason | Miller | 42 | 4 | 25000 |
1 | 1 | Molly | Jacobson | 52 | 24 | 94000 |
2 | 2 | Tina | . | 36 | 31 | 57 |
3 | 3 | Jake | Milner | 24 | . | 62 |
4 | 4 | Amy | Cooze | 73 | . | 70 |
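The effect of `thousands=','` is easier to see by comparing the parsed values with and without it. This is a sketch on a made-up in-memory CSV in which the comma-formatted numbers are quoted, as they are in the file written above.

```python
import io
import pandas as pd

# Hypothetical CSV text; quoted fields protect the embedded commas
csv_text = ('first_name,postTestScore\n'
            'Jason,"25,000"\n'
            'Molly,"94,000"\n'
            'Tina,57\n')

# Without thousands=',' the column cannot be parsed as numbers and stays text
as_text = pd.read_csv(io.StringIO(csv_text))

# With thousands=',' the separators are stripped and the column becomes numeric
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(as_text['postTestScore'].tolist())
print(as_numbers['postTestScore'].tolist())
```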
Long to Wide Format
# Import modules
import pandas as pd
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
| | patient | obs | treatment | score |
---|---|---|---|---|
0 | 1 | 1 | 0 | 6252 |
1 | 1 | 2 | 1 | 24243 |
2 | 1 | 3 | 0 | 2345 |
3 | 2 | 1 | 1 | 2342 |
4 | 2 | 2 | 0 | 23525 |
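Make the data "wide".
Now we will create a "wide" DataFrame in which the rows are indexed by patient number, the columns by observation number, and the cell values are the scores.
df.pivot(index='patient', columns='obs', values='score')
patient / obs | 1 | 2 | 3 |
---|---|---|---|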
1 | 6252.0 | 24243.0 | 2345.0 |
2 | 2342.0 | 23525.0 | NaN |
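The reshaping can be reversed. The sketch below rebuilds the same data (without the `treatment` column), pivots it to wide form, and then uses `melt` to go back to long form; dropping the `NaN` row and re-sorting are assumptions about how you would want the long table to look, not steps from the original notes.

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 1, 2, 2],
                   'obs': [1, 2, 3, 1, 2],
                   'score': [6252, 24243, 2345, 2342, 23525]})

# Long to wide, as above
wide = df.pivot(index='patient', columns='obs', values='score')

# Wide back to long: one row per (patient, obs) pair
long_again = (wide.reset_index()
                  .melt(id_vars='patient', var_name='obs', value_name='score')
                  .dropna(subset=['score'])        # patient 2 has no third observation
                  .sort_values(['patient', 'obs'])
                  .reset_index(drop=True))

print(long_again)
```

The scores come back as floats because the wide table needed `NaN` for the missing cell.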
Lowercasing Column Names in a DataFrame
# Import modules
import pan