Data Science and Artificial Intelligence Technical Notes, Part 19: Data Wrangling (Part 2)

Author: Chris Albon

Translator: 飛龍

License: CC BY-NC-SA 4.0

Joining and merging DataFrames

# Import modules
import pandas as pd
from IPython.display import display
from IPython.display import Image

# Create the first DataFrame
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
# Create the second DataFrame
raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_b
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
# Create the third DataFrame
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])
df_n
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
# Concatenate the two DataFrames along rows
df_new = pd.concat([df_a, df_b])
df_new
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
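
The stacked result above keeps each frame's own row labels, so the labels 0 to 4 appear twice. A minimal sketch, assuming a fresh 0..n-1 index is wanted, is to pass `ignore_index=True`. The frames `left` and `right` are small illustrative stand-ins, not the data above.

```python
import pandas as pd

# Small illustrative frames (hypothetical stand-ins for df_a and df_b)
left = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
right = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

# Default behaviour: each frame keeps its own labels, so 0 and 1 repeat
stacked = pd.concat([left, right])

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([left, right], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

A unique index matters mostly when rows are later selected by label, since `stacked.loc[0]` would return two rows here.
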
# Concatenate the two DataFrames along columns
pd.concat([df_a, df_b], axis=1)
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
# Merge the two DataFrames on subject_id
pd.merge(df_new, df_n, on='subject_id')
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
# Merge the two DataFrames using subject_id as the key on both the left and the right
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
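
In the merge above, subject_id values 4 and 5 each occur twice on the left side, so the matching test_id is repeated for both rows. One way to state that expectation explicitly is the `validate` argument of `pd.merge`. The sketch below uses small hypothetical frames named `people` and `tests`, not the data above.

```python
import pandas as pd

# Hypothetical frames: the key '4' repeats on the left, keys are unique on the right
people = pd.DataFrame({'subject_id': ['4', '4', '5'],
                       'first_name': ['Alice', 'Billy', 'Brian']})
tests = pd.DataFrame({'subject_id': ['4', '5'], 'test_id': [61, 16]})

# many_to_one: keys may repeat on the left but must be unique on the right
merged = pd.merge(people, tests, on='subject_id', validate='many_to_one')

# one_to_one would require unique keys on both sides, so it is rejected here
try:
    pd.merge(people, tests, on='subject_id', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(merged)
print(one_to_one_ok)
```
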

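Merge with an outer join.

“Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null.” – [Source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)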

pd.merge(df_a, df_b, on='subject_id', how='outer')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
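
With an outer join it is often useful to know which side each row came from. A hedged sketch using the `indicator=True` argument of `pd.merge`, which adds a `_merge` column; the frames `a` and `b` are small illustrative stand-ins.

```python
import pandas as pd

# Illustrative frames that share only the key '4'
a = pd.DataFrame({'subject_id': ['1', '4'], 'first_name': ['Alex', 'Alice']})
b = pd.DataFrame({'subject_id': ['4', '6'], 'first_name': ['Billy', 'Bran']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(a, b, on='subject_id', how='outer', indicator=True)

# Map each key to the side(s) it was found on
origin = dict(zip(merged['subject_id'], merged['_merge'].astype(str)))
print(origin)
```
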

Merge with an inner join.

“Inner join produces only the set of records that match in both Table A and Table B.” – Source

pd.merge(df_a, df_b, on='subject_id', how='inner')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
# Merge with a right join
pd.merge(df_a, df_b, on='subject_id', how='right')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
2 6 NaN NaN Bran Balwner
3 7 NaN NaN Bryce Brice
4 8 NaN NaN Betty Btisan

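Merge with a left join.

“Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.” – Source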

pd.merge(df_a, df_b, on='subject_id', how='left')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
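
The four join types differ only in which keys survive. As a compact check, the sketch below rebuilds just the `subject_id` and `first_name` columns of the two frames under the assumed names `a` and `b`, and counts the rows each value of `how` returns.

```python
import pandas as pd

# Keys 1-5 on the left and 4-8 on the right; only 4 and 5 overlap
a = pd.DataFrame({'subject_id': ['1', '2', '3', '4', '5'],
                  'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']})
b = pd.DataFrame({'subject_id': ['4', '5', '6', '7', '8'],
                  'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']})

# Row count of the merged result for each join type
sizes = {how: len(pd.merge(a, b, on='subject_id', how=how))
         for how in ['inner', 'left', 'right', 'outer']}

print(sizes)  # {'inner': 2, 'left': 5, 'right': 5, 'outer': 8}
```
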
# Add suffixes to the duplicated column names when merging
pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))
subject_id first_name_left last_name_left first_name_right last_name_right
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
# Merge based on the indexes
pd.merge(df_a, df_b, right_index=True, left_index=True)
subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
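
When both frames are aligned on their index, `DataFrame.join` is a shorter spelling of much the same operation. Note one difference: `join` defaults to a left join, while `pd.merge` defaults to an inner join, so the two agree here only because both indexes are identical. The frames `a` and `b` are illustrative.

```python
import pandas as pd

# Two frames with the same default index (0, 1) and overlapping column names
a = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
b = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

# Index-on-index merge; overlapping names get the default _x / _y suffixes
via_merge = pd.merge(a, b, left_index=True, right_index=True)

# join aligns on the index too, but the suffixes must be given explicitly
via_join = a.join(b, lsuffix='_x', rsuffix='_y')

print(via_merge.columns.tolist())
print(via_join.columns.tolist())
```
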

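Listing unique values in a pandas column

Special thanks to Bob Haffner for pointing out a better way of doing this.

# Import modules
import pandas as pd

# Set the maximum number of rows pandas will display
pd.set_option('display.max_rows', 1000)

# Set the maximum number of columns pandas will display
pd.set_option('display.max_columns', 50)

# Create an example DataFrame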
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
name reports year
Cochice Jason 4 2012
Pima Molly 24 2012
Santa Cruz Tina 31 2013
Maricopa Jake 2 2014
Yuma Amy 3 2014
# List the unique values in df['name']
df.name.unique()

# array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object) 
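
Every name in the example is distinct, so `unique()` returns the whole column. A column with repeats, such as `year`, shows the effect more clearly. The sketch below also uses two related methods, `nunique()` and `value_counts()`, on a reduced copy of the example data.

```python
import pandas as pd

# Reduced copy of the example frame: only the columns needed here
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'year': [2012, 2012, 2013, 2014, 2014]})

# Distinct values, in order of first appearance
unique_years = df['year'].unique()

# Number of distinct values
n_years = df['year'].nunique()

# How many times each value occurs
counts = df['year'].value_counts().to_dict()

print(unique_years, n_years, counts)
```
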

Loading a JSON file

# Load library
import pandas as pd

# URL of the JSON file (this could also be a file path)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'

# Load the JSON file into a DataFrame
df = pd.read_json(url, orient='columns')

# View the first ten rows
df.head(10)
category datetime integer
0 0 2015-01-01 00:00:00 5
1 0 2015-01-01 00:00:01 5
10 0 2015-01-01 00:00:10 5
11 0 2015-01-01 00:00:11 5
12 0 2015-01-01 00:00:12 8
13 0 2015-01-01 00:00:13 9
14 0 2015-01-01 00:00:14 8
15 0 2015-01-01 00:00:15 8
16 0 2015-01-01 00:00:16 2
17 0 2015-01-01 00:00:17 1
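
The rows above appear in the order 0, 1, 10, 11, and so on, which is the order that results when row labels are sorted as text rather than as numbers. A hedged sketch of restoring numeric order, using a small in-memory JSON string (hypothetical, not the file at the URL above):

```python
import io
import pandas as pd

# Hypothetical JSON in 'columns' orientation, with keys in text order
json_text = '{"integer": {"0": 5, "1": 5, "10": 8, "2": 9}}'

df = pd.read_json(io.StringIO(json_text), orient='columns')

# Make sure the row labels are integers, then sort them numerically
df.index = df.index.astype(int)
df_sorted = df.sort_index()

print(df_sorted)
```
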

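Loading an Excel file

# Load library
import pandas as pd

# URL of the Excel file (this could also be a file path)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx'

# Load the first sheet of the Excel file into a DataFrame
# (the keyword is sheet_name; the older spelling sheetname was removed from pandas)
df = pd.read_excel(url, sheet_name=0, header=1)

# View the first ten rows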
df.head(10)
5 2015-01-01 00:00:00 0
0 5 2015-01-01 00:00:01 0
1 9 2015-01-01 00:00:02 0
2 6 2015-01-01 00:00:03 0
3 6 2015-01-01 00:00:04 0
4 9 2015-01-01 00:00:05 0
5 7 2015-01-01 00:00:06 0
6 1 2015-01-01 00:00:07 0
7 6 2015-01-01 00:00:08 0
8 9 2015-01-01 00:00:09 0
9 5 2015-01-01 00:00:10 0

Loading an Excel spreadsheet as a DataFrame

# Import modules
import pandas as pd

# Load the Excel file and assign it to xls_file
xls_file = pd.ExcelFile('../data/example.xls')
xls_file

# <pandas.io.excel.ExcelFile at 0x111912be0> 

# View the names of the sheets in the file
xls_file.sheet_names

# ['Sheet1'] 

# Load Sheet1 of the xls file into a DataFrame
df = xls_file.parse('Sheet1')
df
year deaths_attacker deaths_defender soldiers_attacker soldiers_defender wounded_attacker wounded_defender
0 1945 425 423 2532 37235 41 14
1 1956 242 264 6346 2523 214 1424
2 1964 323 1231 3341 2133 131 131
3 1969 223 23 6732 1245 12 12
4 1971 783 23 12563 2671 123 34
5 1981 436 42 2356 7832 124 124
6 1982 324 124 253 2622 264 1124
7 1992 3321 631 5277 3331 311 1431
8 1999 262 232 2732 2522 132 122
9 2004 843 213 6278 26773 623 2563

Loading a CSV

# Import modules
import pandas as pd
import numpy as np

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25,000
1 Molly Jacobson 52 24 94,000
2 Tina . 36 31 57
3 Jake Milner 24 . 62
4 Amy Cooze 73 . 70
# Save the DataFrame as a csv in the working directory
df.to_csv('pandas_dataframe_importing_csv/example.csv')

df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
df
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4 25,000
1 1 Molly Jacobson 52 24 94,000
2 2 Tina . 36 31 57
3 3 Jake Milner 24 . 62
4 4 Amy Cooze 73 . 70
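
The `Unnamed: 0` column above is the old row index, which `to_csv` writes by default and `read_csv` then reads back as ordinary data. Two common ways to avoid it are sketched below, using an in-memory buffer and a small illustrative frame instead of a file on disk.

```python
import io
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({'first_name': ['Jason', 'Molly'], 'age': [42, 52]})

# Option 1: do not write the index at all
csv_without_index = df.to_csv(index=False)
round_trip_1 = pd.read_csv(io.StringIO(csv_without_index))

# Option 2: write the index, then tell read_csv that column 0 is the index
csv_with_index = df.to_csv()
round_trip_2 = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(round_trip_1.columns.tolist())
print(round_trip_2.columns.tolist())
```
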
# Load a csv with no header row
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None)
df
0 1 2 3 4 5
0 NaN first_name last_name age preTestScore postTestScore
1 0.0 Jason Miller 42 4 25,000
2 1.0 Molly Jacobson 52 24 94,000
3 2.0 Tina . 36 31 57
4 3.0 Jake Milner 24 . 62
5 4.0 Amy Cooze 73 . 70
# Load a csv while specifying the column names
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
UID First Name Last Name Age Pre-Test Score Post-Test Score
0 NaN first_name last_name age preTestScore postTestScore
1 0.0 Jason Miller 42 4 25,000
2 1.0 Molly Jacobson 52 24 94,000
3 2.0 Tina . 36 31 57
4 3.0 Jake Milner 24 . 62
5 4.0 Amy Cooze 73 . 70
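
In the output above the file's own header line shows up as row 0, because passing `names` alone tells pandas that the file has no header. When the file does have one, a minimal sketch of replacing it is to pass `header=0` together with `names`. The CSV text here is a small hypothetical sample held in memory.

```python
import io
import pandas as pd

# Hypothetical CSV text with its own header line
csv_text = "first_name,age\nJason,42\nMolly,52\n"

# names alone: the original header line is read as a data row
as_data = pd.read_csv(io.StringIO(csv_text), names=['First Name', 'Age'])

# header=0 plus names: the original header line is discarded and replaced
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=['First Name', 'Age'])

print(len(as_data), len(replaced))
```
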
# Load a csv while setting the index column to UID
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
First Name Last Name Age Pre-Test Score Post-Test Score
UID
NaN first_name last_name age preTestScore postTestScore
0.0 Jason Miller 42 4 25,000
1.0 Molly Jacobson 52 24 94,000
2.0 Tina . 36 31 57
3.0 Jake Milner 24 . 62
4.0 Amy Cooze 73 . 70
# Load a csv while setting the index columns to First Name and Last Name
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
UID Age Pre-Test Score Post-Test Score
First Name Last Name
first_name last_name NaN age preTestScore postTestScore
Jason Miller 0.0 42 4 25,000
Molly Jacobson 1.0 52 24 94,000
Tina . 2.0 36 31 57
Jake Milner 3.0 24 . 62
Amy Cooze 4.0 73 . 70
# Load a csv while specifying '.' as a missing value
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.'])
pd.isnull(df)
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 False False False False False False
1 False False False False False False
2 False False True False False False
3 False False False False True False
4 False False False False True False
# Load a csv while specifying '.' and 'NA' as missing values in the last_name column and '.' as a missing value in the preTestScore column
# (the dictionary keys must match the column names as they appear in the file)
sentinels = {'last_name': ['.', 'NA'], 'preTestScore': ['.']}

df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4.0 25,000
1 1 Molly Jacobson 52 24.0 94,000
2 2 Tina NaN 36 31.0 57
3 3 Jake Milner 24 NaN 62
4 4 Amy Cooze 73 NaN 70
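
The keys of a `na_values` dictionary are looked up against the column names read from the file, and a key that matches no column is silently ignored. The sketch below contrasts matching and non-matching keys on a small hypothetical CSV held in memory.

```python
import io
import pandas as pd

# Hypothetical CSV text that uses '.' as a placeholder for missing values
csv_text = "last_name,preTestScore\nMiller,4\n.,31\nMilner,.\n"

# Keys equal to the header names: '.' becomes NaN in those columns
matching = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'last_name': ['.', 'NA'], 'preTestScore': ['.']})

# Keys that match no column in the file: nothing is converted
mismatched = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']})

print(matching.isna().sum().to_dict())
print(mismatched.isna().sum().to_dict())
```
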
# Load a csv while skipping the first 3 rows
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3)
df
2 Tina . 36 31 57
0 3 Jake Milner 24 . 62
1 4 Amy Cooze 73 . 70
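
An integer `skiprows` counts from the very top of the file, so the header line is skipped as well and the first surviving line becomes the header, as in the output above. A sketch of keeping the header, assuming only specific data lines should go, is to pass a list of 0-based line numbers. The CSV text is a hypothetical sample.

```python
import io
import pandas as pd

# Hypothetical CSV text: line 0 is the header, lines 1-3 are data
csv_text = "name,age\nJason,42\nMolly,52\nTina,36\n"

# Integer: drops the first two lines, so "Molly,52" is treated as the header
int_skip = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# List: drops only lines 1 and 2, so the real header is kept
list_skip = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

print(int_skip.columns.tolist())
print(list_skip.columns.tolist())
```
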
# Load a csv while interpreting ',' in numeric strings as a thousands separator
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',')
df
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4 25000
1 1 Molly Jacobson 52 24 94000
2 2 Tina . 36 31 57
3 3 Jake Milner 24 . 62
4 4 Amy Cooze 73 . 70
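
The effect of `thousands=','` is easiest to see on the column's values: without it a value such as "25,000" forces the whole column to text, and with it the column is parsed as numbers. A minimal sketch on a hypothetical in-memory CSV, where the quoted field keeps the comma from being read as a delimiter:

```python
import io
import pandas as pd

# Hypothetical CSV text; "25,000" is quoted so its comma is not a field separator
csv_text = 'name,postTestScore\nJason,"25,000"\nTina,57\n'

# Default: the column cannot be parsed as numbers, so it stays text
as_text = pd.read_csv(io.StringIO(csv_text))

# thousands=',': the separators are stripped and the column becomes integers
as_number = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(as_text['postTestScore'].tolist())
print(as_number['postTestScore'].tolist())
```
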

Long to wide format

# Import modules
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2], 
        'obs': [1, 2, 3, 1, 2], 
        'treatment': [0, 1, 0, 1, 0],
        'score': [6252, 24243, 2345, 2342, 23525]} 
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
patient obs treatment score
0 1 1 0 6252
1 1 2 1 24243
2 1 3 0 2345
3 2 1 1 2342
4 2 2 0 23525

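Make “wide” data.

Now we will create a “wide” DataFrame with one row per patient number, one column per observation number, and the score as the cell value.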

df.pivot(index='patient', columns='obs', values='score')
obs 1 2 3
patient
1 6252.0 24243.0 2345.0
2 2342.0 23525.0 NaN
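
The reverse direction, wide back to long, can be sketched with `melt`. This is an illustration under the assumption that the cell left empty by the pivot (patient 2, observation 3) should simply be dropped again; it rebuilds the example without the `treatment` column.

```python
import pandas as pd

# The long-format example data, without the treatment column
df = pd.DataFrame({'patient': [1, 1, 1, 2, 2],
                   'obs': [1, 2, 3, 1, 2],
                   'score': [6252, 24243, 2345, 2342, 23525]})

# Long to wide: one row per patient, one column per observation
wide = df.pivot(index='patient', columns='obs', values='score')

# Wide to long: unpivot, drop the empty cell, restore the original ordering
long_again = (wide.reset_index()
                  .melt(id_vars='patient', var_name='obs', value_name='score')
                  .dropna(subset=['score'])
                  .astype({'obs': int})
                  .sort_values(['patient', 'obs'])
                  .reset_index(drop=True))

print(long_again)
```

The scores come back as floats because the wide frame needed NaN for the missing cell.
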

Lowercasing column names in a DataFrame

# Import modules
import pan