
Pandas Cookbook -- 09 Combining Pandas Objects and Databases

Combining Pandas Objects and Databases

Based on SeanCheney's translation on Jianshu; I adjusted the formatting and restructured the table of contents to make it easier to read and look things up later.

import pandas as pd
import numpy as np

Inserting into a DataFrame

Read the names dataset

names = pd.read_csv('data/names.csv')
names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2

Assign a new row directly with loc

new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1

A dictionary can also be used to assign a new row

names.loc[len(names)] = {'Name':'Zayd', 'Age':2}

The dictionary (or Series) keys can be given in any order, regardless of the column order

names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
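A caveat worth noting for this loc-based trick: it appends a row only when `len(names)` is not already an index label. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Cornelia', 'Abbas'], 'Age': [70, 69]})

# len(df) == 2 is not an existing label, so .loc creates a new row
df.loc[len(df)] = {'Name': 'Zayd', 'Age': 2}

# Caveat: this appends only when the index is the default RangeIndex;
# if the label len(df) already exists, .loc silently overwrites that
# row instead of adding one.
print(df)
```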

Add new rows with append

1. Append a dictionary directly

names.append({'Name':'Aria2222', 'Age':12222},ignore_index=True)
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Aria2222 12222
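Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. On current pandas, the same result comes from wrapping the new row in a one-row DataFrame and using `pd.concat`:

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Cornelia', 'Abbas'], 'Age': [70, 69]})

# DataFrame.append was removed in pandas 2.0; pd.concat is the
# replacement. Wrap the new row in a one-row DataFrame first.
new_row = pd.DataFrame([{'Name': 'Aria2222', 'Age': 12222}])
result = pd.concat([names, new_row], ignore_index=True)
print(result)
```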

2. append can also concatenate a Series to a DataFrame

Create a Series object

s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
names.append(s)
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Zach 3

append can add multiple rows at once by passing the objects in a list

s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Zach 3
USA Zayd 2

Add a new column with insert

DataFrame.insert(loc, column, value, allow_duplicates=False)

Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.


  • loc : int
    • Insertion index. Must verify 0 <= loc <= len(columns)
  • column : string, number, or hashable object
    • label of the inserted column
  • value : int, Series, or array-like
  • allow_duplicates : bool, optional
names.insert(0,'id',np.random.randint(len(names)))
names
id Name Age
0 6 Cornelia 70
1 6 Abbas 69
2 6 Penelope 4
3 6 Niko 2
4 6 Aria 1
5 6 Zayd 2
6 6 Dean 32

Combining DataFrames

Read the stocks datasets for 2016, 2017, and 2018, using Symbol as the row index

years = 2016, 2017, 2018
stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
stocks_2016, stocks_2017, stocks_2018 = stock_tables

names = ['prices', 'transactions']
food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]

concat

Put the two DataFrames in a list and connect them with pandas' concat function

s_list = [stocks_2016, stocks_2017]
pd.concat(s_list)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

concat is the only function that can concatenate DataFrames vertically

The keys parameter names each DataFrame; the labels appear at the outermost level of the row index, creating a MultiIndex. The names parameter names each index level

pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
Shares Low High
Year Symbol
2016 AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
2017 AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

Concatenation can also be horizontal: just set the axis parameter to 'columns' or 1

pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None],sort=True)
Year 2016 2017
Shares Low High Shares Low High
AAPL 80.0 95.0 110.0 50.0 120.0 140.0
GE NaN NaN NaN 100.0 30.0 40.0
IBM NaN NaN NaN 87.0 75.0 95.0
SLB NaN NaN NaN 20.0 55.0 85.0
TSLA 50.0 80.0 130.0 100.0 100.0 300.0
TXN NaN NaN NaN 500.0 15.0 23.0
WMT 40.0 55.0 70.0 NaN NaN NaN

By default, concat uses an outer join, keeping every row from each DataFrame. Setting the join parameter switches to an inner join:

pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
Year 2016 2017
Shares Low High Shares Low High
Symbol
AAPL 80 95 110 50 120 140
TSLA 50 80 130 100 100 300

concat tips

Passing a dict names each DataFrame before concatenating

pd.concat(dict(zip(years,stock_tables)), axis='columns',sort=True)
2016 2017 2018
Shares Low High Shares Low High Shares Low High
AAPL 80.0 95.0 110.0 50.0 120.0 140.0 40.0 135.0 170.0
AMZN NaN NaN NaN NaN NaN NaN 8.0 900.0 1125.0
GE NaN NaN NaN 100.0 30.0 40.0 NaN NaN NaN
IBM NaN NaN NaN 87.0 75.0 95.0 NaN NaN NaN
SLB NaN NaN NaN 20.0 55.0 85.0 NaN NaN NaN
TSLA 50.0 80.0 130.0 100.0 100.0 300.0 50.0 220.0 400.0
TXN NaN NaN NaN 500.0 15.0 23.0 NaN NaN NaN
WMT 40.0 55.0 70.0 NaN NaN NaN NaN NaN NaN

append is a greatly simplified version of concat; internally it simply calls concat. The second example in this section, pd.concat(s_list), can also be written as:

stocks_2016.append(stocks_2017)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

join

Use join to connect DataFrames; if any column names overlap, set lsuffix or rsuffix to distinguish them

stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
Shares_2016 Low_2016 High_2016 Shares_2017 Low_2017 High_2017
Symbol
AAPL 80.0 95.0 110.0 50.0 120.0 140.0
GE NaN NaN NaN 100.0 30.0 40.0
IBM NaN NaN NaN 87.0 75.0 95.0
SLB NaN NaN NaN 20.0 55.0 85.0
TSLA 50.0 80.0 130.0 100.0 100.0 300.0
TXN NaN NaN NaN 500.0 15.0 23.0
WMT 40.0 55.0 70.0 NaN NaN NaN

merge

Take a look at the two small datasets, food_prices and food_transactions

food_prices, food_transactions = food_tables
food_prices
item store price Date
0 pear A 0.99 2017
1 pear B 1.99 2017
2 peach A 2.99 2017
3 peach B 3.49 2017
4 banana A 0.39 2017
5 banana B 0.49 2017
6 steak A 5.99 2017
7 steak B 6.99 2017
8 steak B 4.99 2015
food_transactions
custid item store quantity
0 1 pear A 5
1 1 banana A 10
2 2 steak B 3
3 2 pear B 1
4 2 peach B 2
5 2 steak B 1
6 2 coconut B 4

Merge food_transactions and food_prices on the keys item and store

food_transactions.merge(food_prices, on=['item', 'store'])
custid item store quantity price Date
0 1 pear A 5 0.99 2017
1 1 banana A 10 0.39 2017
2 2 steak B 3 6.99 2017
3 2 steak B 3 4.99 2015
4 2 steak B 1 6.99 2017
5 2 steak B 1 4.99 2015
6 2 pear B 1 1.99 2017
7 2 peach B 2 3.49 2017

Because steak appears twice in each table (for store B), the merge produces a Cartesian product, yielding four steak rows in the result
Because coconut has no corresponding price, it is absent from the result
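To diagnose which rows failed to match (like coconut here), `merge` accepts an `indicator` parameter that adds a `_merge` column marking each row's origin. A small sketch on made-up mini versions of the two tables:

```python
import pandas as pd

# Hypothetical mini versions of the transactions and prices tables
transactions = pd.DataFrame({'item': ['pear', 'coconut'], 'quantity': [5, 4]})
prices = pd.DataFrame({'item': ['pear'], 'price': [0.99]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = transactions.merge(prices, on='item', how='left', indicator=True)
print(merged)

# Rows that found no matching price
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['item'].tolist())
```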

Below, only the 2017 prices are merged in

food_transactions.merge(food_prices.query('Date == 2017'), how='left')
custid item store quantity price Date
0 1 pear A 5 0.99 2017.0
1 1 banana A 10 0.39 2017.0
2 2 steak B 3 6.99 2017.0
3 2 pear B 1 1.99 2017.0
4 2 peach B 2 3.49 2017.0
5 2 steak B 1 6.99 2017.0
6 2 coconut B 4 NaN NaN

Differences between concat, join, and merge

concat:

  1. A pandas function
  2. Can combine two or more pandas objects vertically or horizontally
  3. Aligns only on the index
  4. Errors when duplicates appear in the index
  5. Defaults to an outer join (inner is also available)
  6. The only function that can concatenate DataFrames vertically

join:

  1. A DataFrame method
  2. Can only combine two or more pandas objects horizontally
  3. Aligns the calling DataFrame's columns or row index with the other objects' row index (never their columns)
  4. Handles duplicate index values with a Cartesian product
  5. Defaults to a left join (inner, outer, and right are also available)

merge:

  1. A DataFrame method
  2. Can only combine two DataFrame objects horizontally
  3. Aligns the calling DataFrame's columns or row index with the other DataFrame's columns or row index
  4. Handles duplicate index values with a Cartesian product
  5. Defaults to an inner join (left, outer, and right are also available)
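Despite these differences, the three can often produce the same result. A sketch, on made-up frames, of a horizontal index-aligned combination done all three ways:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'b': [3, 4]}, index=['x', 'y'])

# Three equivalent ways to align two frames on their row index
c = pd.concat([left, right], axis='columns')
j = left.join(right, how='outer')
m = left.merge(right, left_index=True, right_index=True, how='outer')

assert c.equals(j) and j.equals(m)
print(c)
```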

Connecting to a SQL database

Create a SQLAlchemy engine

Before reading the chinook database, create a SQLAlchemy engine

from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')
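If the chinook.db file is not at hand, the same round trip can be tried against an in-memory SQLite engine (the table contents here are made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# An in-memory SQLite database; nothing is written to disk
engine = create_engine('sqlite://')

# Write a small DataFrame out, then read it back with read_sql_table
df = pd.DataFrame({'GenreId': [1, 2], 'Name': ['Rock', 'Jazz']})
df.to_sql('genres', engine, index=False)
roundtrip = pd.read_sql_table('genres', engine)
print(roundtrip)
```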

The read_sql_table function

read_sql_table reads an entire table: the first argument is the table name, the second is the engine; it returns a DataFrame

tracks = pd.read_sql_table('tracks', engine)
tracks.iloc[:5,:5]
TrackId Name AlbumId MediaTypeId GenreId
0 1 For Those About To Rock (We Salute You) 1 1 1
1 2 Balls to the Wall 2 2 1
2 3 Fast As a Shark 3 2 1
3 4 Restless and Wild 3 2 1
4 5 Princess of the Dawn 3 2 1
genres = pd.read_sql_table('genres', engine)
genres.head()
GenreId Name
0 1 Rock
1 2 Jazz
2 3 Metal
3 4 Alternative & Punk
4 5 Rock And Roll

Find the average song length for each genre

genre_track = genres.merge(
    tracks[['GenreId', 'Milliseconds']], 
    on='GenreId', 
    how='left').drop('GenreId', axis='columns')
genre_track.head()
Name Milliseconds
0 Rock 343719
1 Rock 342562
2 Rock 230619
3 Rock 252051
4 Rock 375418

Compute the mean of Milliseconds per genre, then convert the result to the timedelta data type

genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
Name
Rock And Roll        00:02:14
Opera                00:02:54
Hip Hop/Rap          00:02:58
Easy Listening       00:03:09
Bossa Nova           00:03:39
R&B/Soul             00:03:40
World                00:03:44
Pop                  00:03:49
Latin                00:03:52
Alternative & Punk   00:03:54
Name: Milliseconds, dtype: timedelta64[ns]

Find each customer's total spending

cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
cust_inv.head()
CustomerId FirstName LastName InvoiceId UnitPrice Quantity
0 1 Luís Gonçalves 98 1.99 1
1 1 Luís Gonçalves 98 1.99 1
2 1 Luís Gonçalves 121 0.99 1
3 1 Luís Gonçalves 121 0.99 1
4 1 Luís Gonçalves 121 0.99 1

Now multiply quantity by unit price to find each customer's total spending

total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cols = ['CustomerId', 'FirstName', 'LastName']
cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
CustomerId  FirstName  LastName  
6           Helena     Holý          49.62
26          Richard    Cunningham    47.62
57          Luis       Rojas         46.62
46          Hugh       O'Reilly      45.62
45          Ladislav   Kovács        45.62
Name: Total, dtype: float64

The read_sql_query function

pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
TrackId Name AlbumId MediaTypeId GenreId
0 1 For Those About To Rock (We Salute You) 1 1 1
1 2 Balls to the Wall 2 2 1
2 3 Fast As a Shark 3 2 1
3 4 Restless and Wild 3 2 1
4 5 Princess of the Dawn 3 2 1

A long SQL string can also be passed to read_sql_query

sql_string1 = '''
          select 
              Name, 
              time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
          from (
                  select 
                      g.Name, 
                      t.Milliseconds
                  from 
                      genres as g 
                  join
                      tracks as t
                      on 
                          g.genreid == t.genreid
              )
          group by 
              Name
          order by 
              avg_time
          '''
pd.read_sql_query(sql_string1, engine)[:10]
Name avg_time
0 Rock And Roll 00:02:14
1 Opera 00:02:54
2 Hip Hop/Rap 00:02:58
3 Easy Listening 00:03:09
4 Bossa Nova 00:03:39
5 R&B/Soul 00:03:40
6 World 00:03:44
7 Pop 00:03:49
8 Latin 00:03:52
9 Alternative & Punk 00:03:54
sql_string2 = '''
          select 
                c.customerid, 
                c.FirstName, 
                c.LastName, 
                sum(ii.quantity *  ii.unitprice) as Total
          from
                customers as c
          join
                invoices as i
                     on c.customerid = i.customerid
          join
                invoice_items as ii
                     on i.invoiceid = ii.invoiceid
          group by
                c.customerid, c.FirstName, c.LastName
          order by
                Total desc
          '''
pd.read_sql_query(sql_string2, engine)[:10]
CustomerId FirstName LastName Total
0 6 Helena Holý 49.62
1 26 Richard Cunningham 47.62
2 57 Luis Rojas 46.62
3 45 Ladislav Kovács 45.62
4 46 Hugh O'Reilly 45.62
5 37 Fynn Zimmermann 43.62
6 24 Frank Ralston 43.62
7 28 Julia Barnett 43.62
8 25 Victor Stevens 42.62
9 7 Astrid Gruber 42.62