Python Basics | Combining and Reshaping DataFrames in pandas (merge & reshape)
阿新 • Published: 2020-04-05
[toc]
[Download the sample data for this article](https://pan.baidu.com/s/1lQIpvwThXRkUJ16Fl4ERNA), password: **vwy3**
```python
import pandas as pd
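# The data is article metadata previously scraped from cnblogs
df = pd.read_csv('./data/SQL測試用資料_20200325.csv',encoding='utf-8')
# Draw two samples for the demonstrations that follow
df1 = df.sample(n=500,random_state=123)
df2 = df.sample(n=600,random_state=234)
# Both samples come from the same df, so they share a sizeable overlap
# random_state makes each draw reproducible; two different seeds give two different samples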
```
## Row-wise union
[pandas official user guide](https://pandas.pydata.org/docs/user_guide/merging.html)
### pd.concat
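Main parameters of **pd.concat**:
- the DataFrames to combine, wrapped in `[]`, e.g. `[df1,df2,df3]`;
- **axis**=0: the direction of concatenation; 0 stacks rows and 1 places frames side by side, although pd.concat is rarely used for column joins
- **join**='outer': how labels on the other axis are combined; 'outer' keeps the union and 'inner' keeps the intersection
- **ignore_index**: bool = False, whether to reset the index of the result

To get the effect of SQL `union all`, the DataFrames being concatenated must:
- have matching column names (pd.concat aligns columns by name)
- have corresponding dtypes in each column

If the column names differ, new columns are created.
Mismatched dtypes do not necessarily raise an error; pandas upcasts to a common type where it can (often `object`), so the outcome depends on the specific combination.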
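
To make the effect of mismatched column names concrete, here is a minimal sketch on two toy frames. The frames `a` and `b` and their values are made up for illustration and are not part of the sample data.

```python
import pandas as pd

# Toy frames (hypothetical data); the second column name differs on purpose.
a = pd.DataFrame({"href": ["u1", "u2"], "title": ["t1", "t2"]})
b = pd.DataFrame({"href": ["u3"], "title_2": ["t3"]})

# join='outer' (the default) keeps the union of the columns and fills gaps with NaN.
outer = pd.concat([a, b], axis=0, join="outer", ignore_index=True)

# join='inner' keeps only the columns that exist in every frame.
inner = pd.concat([a, b], axis=0, join="inner", ignore_index=True)

print(outer.shape, list(outer.columns))
print(inner.shape, list(inner.columns))
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating the original row labels.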
```python
df2.columns
```
Output:
`Index(['href', 'title', 'create_time', 'read_cnt', 'blog_name', 'date',
'weekday', 'hour'],
dtype='object')`
```python
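# Deliberately rename the second column
df2.columns = ['href', 'title_2', 'create_time', 'read_cnt', 'blog_name', 'date','weekday', 'hour']
print(df1.shape,df2.shape)
# join='inner' drops the columns that do not exist in both frames
# Concatenation defaults to stacking rows (axis=0)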
df_m = pd.concat([df1,df2],axis=0,join='inner')
print(df_m.shape)
```
Output:
(500, 8) (600, 8)
(1100, 7)
```python
# Check the size of the dataset after removing duplicates
df_m.drop_duplicates(subset='href').shape
```
Output:
(849, 7)
### df.append
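Differences from pd.concat:
- append can only union rows
- append always performs an **outer join** on the columns, so the renamed `title_2` column appears as a ninth column in the outputs below

Similarities:
- append also accepts several DataFrames at once
- append is roughly equivalent to `pd.concat([df1,df2],axis=0,join='outer')`

Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use `pd.concat` instead.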
```python
df1.append(df2).shape
```
Output:
(1100, 9)
```python
df1.append([df2,df2]).shape
```
Output:
(1700, 9)
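
On pandas versions without `DataFrame.append`, the two calls above can be written with `pd.concat`. This is a minimal sketch on toy frames (the data is made up), not a drop-in replacement for every use of append.

```python
import pandas as pd

# Toy frames (hypothetical data) with one mismatched column name.
a = pd.DataFrame({"href": ["u1", "u2"], "title": ["t1", "t2"]})
b = pd.DataFrame({"href": ["u2", "u3"], "title_2": ["t2", "t3"]})

# Counterpart of a.append(b): rows stacked, union of columns.
stacked = pd.concat([a, b], axis=0, join="outer")

# Counterpart of a.append([b, b]): any number of frames in one list.
stacked_many = pd.concat([a, b, b], axis=0, join="outer")

print(stacked.shape)
print(stacked_many.shape)
```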
## Column-wise join
### pd.concat
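**pd.concat** can also perform a join, but it matches rows on the **index** rather than on column values.
Because the match is index-based, pd.concat can join more than two DataFrames in a single call.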
```python
# Concatenate column-wise by setting axis=1
# inner join
print(df1.shape,df2.shape)
df_m_c = pd.concat([df1,df2], axis=1, join='inner')
print(df_m_c.shape)
```
Output:
(500, 8) (600, 8)
(251, 16)
The result has 251 rows. To verify, take the index of each DataFrame and compute their intersection:
```python
set1 = set(df1.index)
set2 = set(df2.index)
set_join = set1.intersection(set2)
print(len(set1), len(set2), len(set_join))
```
Output:
500 600 251
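
The same index-based matching works for more than two frames. The sketch below uses three toy frames with made-up index labels to show that only labels present in every frame survive an inner join.

```python
import pandas as pd

# Three toy frames (hypothetical data) with partially overlapping indexes.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
mid = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])
right = pd.DataFrame({"c": [100, 200, 300]}, index=[2, 3, 4])

# axis=1 places the frames side by side and aligns rows on the index;
# join='inner' keeps only the labels found in all three frames.
joined = pd.concat([left, mid, right], axis=1, join="inner")
print(joined)
```

Only index label 2 appears in all three frames, so the result has a single row.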
### pd.merge
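Main parameters of **pd.merge**:
- **left**: the DataFrame on the left side of the join
- **right**: the DataFrame on the right side of the join; merge can only join two DataFrames at a time
- **how**: str = 'inner', the join type; defaults to inner
- **on**=None: the join key(s); when both DataFrames use **the same key names**, setting on is enough and left_on/right_on can be ignored
- **left_on**=None: the join key(s) in the left table
- **right_on**=None: the join key(s) in the right table; set left_on and right_on when the key names differ between the two DataFrames
- **suffixes**=('_x', '_y'): the suffixes appended to overlapping column names from the left and right tables after the join (the join keys are excluded)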
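
As an illustration of `left_on`/`right_on` and `suffixes`, here is a minimal sketch on two toy tables. The table names `posts` and `stats`, the `url` column, and the values are assumptions made for this example. It also uses `indicator=True`, an existing pd.merge option that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy tables (hypothetical data); the key column is named differently in each.
posts = pd.DataFrame({"href": ["u1", "u2", "u3"], "read_cnt": [10, 20, 30]})
stats = pd.DataFrame({"url": ["u2", "u3", "u4"], "read_cnt": [21, 31, 41]})

# Key names differ, so left_on/right_on are used instead of on.
# Both tables have a read_cnt column, so the suffixes tell them apart.
merged = pd.merge(
    left=posts, right=stats, how="inner",
    left_on="href", right_on="url",
    suffixes=("_left", "_right"),
)
print(merged.columns.tolist())

# indicator=True labels each row as left_only, right_only, or both.
outer = pd.merge(
    left=posts, right=stats, how="outer",
    left_on="href", right_on="url",
    indicator=True,
)
print(outer["_merge"].value_counts())
```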
```python
print(df1.shape,df2.shape)
df_m = pd.merge(left=df1, right=df2,
                how='inner',
                on=['href','blog_name'])
print(df_m.shape)
```
Output:
(500, 8) (600, 8)
(251, 14)
```python
print(df1.shape,df2.shape)
df_m = pd.merge(left=df1, right=df2,
                how='inner',
                left_on='href', right_on='href')
print(df_m.shape)
```
Output:
(500, 8) (600, 8)
(251, 15)
```python
# Compare the different join modes
print(df1.shape,df2.shape)
# inner join
df_inner = pd.merge(left=df1, right=df2,
                    how='inner',
                    on=['href','blog_name'])
# full outer join
df_full_outer = pd.merge(left=df1, right=df2,
                         how='outer',
                         on=['href','blog_name'])
# left outer join
df_left_outer = pd.merge(left=df1, right=df2,
                         how='left',
                         on=['href','blog_name'])
# right outer join
df_right_outer = pd.merge(left=df1, right=df2,
                          how='right',
                          on=['href','blog_name'])
print('inner join (left ∩ right): ' + str(df_inner.shape))
print('full outer join (left ∪ right): ' + str(df_full_outer.shape))
print('left outer join (all left rows): ' + str(df_left_outer.shape))
print('right outer join (all right rows): ' + str(df_right_outer.shape))
```
Output:
(500, 8) (600, 8)
inner join (left ∩ right): (251, 14)
full outer join (left ∪ right): (849, 14)
left outer join (all left rows): (500, 14)
right outer join (all right rows): (600, 14)
### df.join
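Main parameters of **df.join**:
- other: the right table
- on: the column(s) of the left table to match against the index of the right table; when omitted, the join is index-to-index, just like a column-wise pd.concat
- how='left': the join type
- lsuffix='': suffix for overlapping columns from the left table
- rsuffix='': suffix for overlapping columns from the right table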
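
The two ways of matching in df.join can be sketched as follows. All frames here are toy data invented for the example, including the `blogs` lookup table and its `city` column.

```python
import pandas as pd

# Index-to-index join: both frames share a column name, so suffixes are required.
left = pd.DataFrame({"title": ["t1", "t2", "t3"]}, index=[0, 1, 2])
right = pd.DataFrame({"title": ["s2", "s3", "s4"]}, index=[1, 2, 3])
by_index = left.join(right, how="inner", lsuffix="_1", rsuffix="_2")
print(by_index)

# on=: a column of the left frame is matched against the index of the right frame.
posts = pd.DataFrame({"blog_name": ["a", "b", "a"], "read_cnt": [1, 2, 3]})
blogs = pd.DataFrame({"city": ["X", "Y"]}, index=["a", "b"])  # hypothetical lookup table
by_column = posts.join(blogs, on="blog_name", how="left")
print(by_column)
```

Without `lsuffix`/`rsuffix`, the first call would raise an error because the `title` column exists on both sides.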
```python
print(df1.shape,df2.shape)
df_m = df1.join(df2, how='inner',lsuffix='1',rsuffix='2')
df_m.shape
```
Output:
(500, 8) (600, 8)
(251, 16)
## Reshaping rows and columns
[pandas official user guide](https://pandas.pydata.org/docs/user_guide/reshaping.html)
```python
# Prepare the data
import math
# Bucket the hour of day into 8-hour blocks
df['time_mark'] = df['hour'].apply(lambda x:math.ceil(int(x)/8))
# Per (weekday, time_mark): total reads and number of articles
df_stat_raw = df.pivot_table(values=['read_cnt','href'],
                             index=['weekday','time_mark'],
                             aggfunc={'read_cnt':'sum','href':'count'})
df_stat = df_stat_raw.reset_index()
```
```python
df_stat.head(3)
```
As shown above, df_stat has two dimensions, weekday and time_mark,
and two measures, href (article count) and read_cnt (total reads).
### pivot
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224719083-1086734497.png)
```python
# In a pivot, both index and columns are dimensions
res = df_stat.pivot(index='weekday',columns='time_mark',values='href').reset_index(drop=True)
res
```
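
To see what pivot does without the sample file, here is a minimal sketch on a small long-format table. The numbers are made up and only mimic the shape of df_stat.

```python
import pandas as pd

# Toy long-format table (hypothetical data): one row per (weekday, time_mark).
long_df = pd.DataFrame({
    "weekday": [1, 1, 2, 2],
    "time_mark": [0, 1, 0, 1],
    "href": [4, 32, 5, 40],
})

# One row per weekday, one column per time_mark, cells filled from href.
wide = long_df.pivot(index="weekday", columns="time_mark", values="href")
print(wide)

# reset_index() without drop=True keeps weekday as an ordinary column.
print(wide.reset_index())
```

pivot requires each (index, columns) pair to be unique and raises a `ValueError` otherwise; `pivot_table` with an `aggfunc` is the usual choice when duplicates need to be aggregated.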
### stack & unstack
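- stack moves a column level into the index; by default it takes the lowest (innermost) column level
- unstack moves an index level into the columns; by default it takes the last (innermost) index level and places it as the lowest column level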
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224754525-1496237473.png)
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224803192-1465526029.png)
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224815129-1283620786.png)
```python
# The result produced by pivot_table looks like this
df_stat_raw
```
```python
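# By default unstack moves the last index level to the columns (as the lowest column level)
df_stat_raw.unstack()
# unstack can also take a named index level, which becomes the lowest column level
df_stat_raw.unstack('weekday')
# This statement has the same effect; the index level is given by position
# df_stat_raw.unstack(0)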
```
```python
# stack moves the lowest column level into the index
df_stat_raw.unstack().stack().head(5)
```
```python
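# After two stacks no column levels remain; the result is a Series with a multi-level index
# Like peeling an onion, each stack moves one column level to the index on the left (appended as the last index level)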
df_stat_raw.unstack().stack().stack().head(5)
```
Output:
weekday time_mark
1 0 href 4
read_cnt 2386
1 href 32
read_cnt 31888
2 href 94
dtype: int64
```python
pd.DataFrame(df_stat_raw.unstack().stack().stack()).reset_index().head(5)
```
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224834242-705351825.png)
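
The stack and unstack steps above can be reproduced on a small frame with a two-level index. The sketch below builds such a frame by hand; the values are made up and only imitate the layout of df_stat_raw.

```python
import pandas as pd

# Toy frame (hypothetical data) indexed by (weekday, time_mark), like df_stat_raw.
idx = pd.MultiIndex.from_product([[1, 2], [0, 1]], names=["weekday", "time_mark"])
stat = pd.DataFrame(
    {"href": [4, 32, 5, 40], "read_cnt": [100, 200, 300, 400]},
    index=idx,
)

# unstack: the last index level (time_mark) becomes the lowest column level.
wide = stat.unstack()
print(wide.shape)

# unstack by name: weekday goes to the columns instead.
wide_by_day = stat.unstack("weekday")
print(wide_by_day.index.name)

# stack: the lowest column level goes back to the end of the index.
back = wide.stack()
print(back.shape)

# Stacking the original frame leaves no columns, so the result is a Series.
series = stat.stack()
print(type(series).__name__, len(series))
```

Recent pandas versions may emit a `FutureWarning` about a new `stack` implementation; the shapes shown here are not affected by it.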
### melt
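In melt, `id_vars` specifies which columns to keep as **dimensions (identifiers)**; all remaining columns are treated as **values**.
In addition, a new dimension column named **variable** is created, recording the name of the column each value came from.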
![](https://img2020.cnblogs.com/blog/1977069/202004/1977069-20200404224848327-1711812023.png)
```python
print(df_stat.head(5))
df_stat.melt(id_vars=['weekday']).head(5)
```
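
The effect of `id_vars` on melt is easier to see on a two-row table. In this sketch the frame `stat` is built by hand, and the names `metric` and `amount` passed to `var_name` and `value_name` are arbitrary choices for the example.

```python
import pandas as pd

# Two-row toy table with the same columns as df_stat.
stat = pd.DataFrame({
    "weekday": [1, 1],
    "time_mark": [0, 1],
    "href": [4, 32],
    "read_cnt": [2386, 31888],
})

# Only weekday is kept as an identifier, so time_mark is melted as well.
long_one = stat.melt(id_vars=["weekday"])
print(long_one)

# Keeping both dimensions melts only the two measures;
# var_name and value_name rename the default variable/value columns.
long_two = stat.melt(
    id_vars=["weekday", "time_mark"],
    var_name="metric",
    value_name="amount",
)
print(long_two)
```

Each melted column contributes one output row per input row, so the first call returns 6 rows and the second returns 4.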
```python
df_stat.melt(id_vars=['weekday','time_mark']).head