和上文一樣,先匯入後面會頻繁使用到的模組:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', family='Arial Unicode MS')
plt.rc('axes', unicode_minus=False)
pd.__version__
Out[1]:
'1.1.3'
 

注意:我這裡是Mac系統,用matplotlib畫圖時設定字型為Arial Unicode MS支援中文顯示,如果是deepin系統可以設定字型為WenQuanYi Micro Hei,即:

In [2]:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# plt.rc('font', family='WenQuanYi Micro Hei')
# plt.rc('axes', unicode_minus=False)
 

如果其他系統畫圖時中文亂碼,可以用以下幾行程式碼檢視系統字型,然後自行尋找支援中文的字型:

In [3]:
# from matplotlib.font_manager import FontManager
# fonts = set([x.name for x in FontManager().ttflist])
# print(fonts)
 

話不多說,繼續pandas的學習。

 
 

資料合併

 

在實際的業務處理中,往往需要將多個資料集、文件合併後再進行分析。

 

concat

 
Signature:
pd.concat(
objs: Union[Iterable[~FrameOrSeries], Mapping[Union[Hashable, NoneType], ~FrameOrSeries]],
axis=0,
join='outer',
ignore_index: bool = False,
keys=None,
levels=None,
names=None,
verify_integrity: bool = False,
sort: bool = False,
copy: bool = True,
) -> Union[ForwardRef('DataFrame'), ForwardRef('Series')]
Docstring:
Concatenate pandas objects along a particular axis with optional set logic
along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.
 

資料準備:

In [4]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1
Out[4]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [5]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2
Out[5]:
 


  A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
In [6]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
df3
Out[6]:
 


  A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

基本連線:

 

In [7]:
# 將三個有相同列的表合併到一起
pd.concat([df1, df2, df3])
Out[7]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

可以給每個表指定一個一級索引,形成多層索引:

 

In [8]:
pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
Out[8]:
 


    A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

也等同於下面這種方式:

In [9]:
pd.concat({'x': df1, 'y': df2, 'z': df3})
Out[9]:
 


    A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

合併時不保留原索引,啟用新的自然索引:

 

In [10]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
df4
Out[10]:
 


  B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
In [11]:
pd.concat([df1, df4], ignore_index=True, sort=False)
Out[11]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 NaN B2 NaN D2 F2
5 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

有沒有類似於資料庫中的outer join呢?

 

In [12]:
pd.concat([df1, df4], axis=1, sort=False)
Out[12]:
 


  A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
6 NaN NaN NaN NaN B6 D6 F6
7 NaN NaN NaN NaN B7 D7 F7
 

很自然地聯想到inner join:

 

In [13]:
pd.concat([df1, df4], axis=1, join='inner')
Out[13]:
 


  A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
 
join : {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis (or axes).
 

這裡並沒有看到left join或者right join,那麼如何達到left join的效果呢?

 

In [14]:
pd.concat([df1, df4.reindex(df1.index)], axis=1)
Out[14]:
 


  A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
 

與序列合併:

 

In [15]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
pd.concat([df1, s1], axis=1)
Out[15]:
 


  A B C D X
0 A0 B0 C0 D0 X0
1 A1 B1 C1 D1 X1
2 A2 B2 C2 D2 X2
3 A3 B3 C3 D3 X3
 

當然,也是可以使用df.assign()來定義一個新列。
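例如(這裡沿用上面的df1和s1,僅作示意):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
s1 = pd.Series(['X0', 'X1'], name='X')

# assign會返回新增列後的新DataFrame,不會修改原物件
result = df1.assign(X=s1)
print(result)
```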

 

如果序列沒有名稱,合併後會自動以自然索引作為列名,如下:

In [16]:
s2 = pd.Series(['_A', '_B', '_C', '_D'])
s3 = pd.Series(['_a', '_b', '_c', '_d'])
pd.concat([df1, s2, s3], axis=1)
Out[16]:
 


  A B C D 0 1
0 A0 B0 C0 D0 _A _a
1 A1 B1 C1 D1 _B _b
2 A2 B2 C2 D2 _C _c
3 A3 B3 C3 D3 _D _d
 

ignore_index=True會取消原有列名:

In [17]:
pd.concat([df1, s1], axis=1, ignore_index=True)
Out[17]:
 


  0 1 2 3 4
0 A0 B0 C0 D0 X0
1 A1 B1 C1 D1 X1
2 A2 B2 C2 D2 X2
3 A3 B3 C3 D3 X3
 

同理,多個Series也可以合併:

In [18]:
s3 = pd.Series(['李尋歡', '令狐沖', '張無忌', '花無缺'])
s4 = pd.Series(['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '絕代雙驕'])
s5 = pd.Series(['小李飛刀', '獨孤九劍', '九陽神功', '移花接玉'])
pd.concat([s3, s4, s5], axis=1)
Out[18]:
 


  0 1 2
0 李尋歡 多情劍客無情劍 小李飛刀
1 令狐沖 笑傲江湖 獨孤九劍
2 張無忌 倚天屠龍記 九陽神功
3 花無缺 絕代雙驕 移花接玉
 

也可以指定keys使用新的列名:

In [19]:
pd.concat([s3, s4, s5], axis=1, keys=['name', 'book', 'skill'])
Out[19]:
 


  name book skill
0 李尋歡 多情劍客無情劍 小李飛刀
1 令狐沖 笑傲江湖 獨孤九劍
2 張無忌 倚天屠龍記 九陽神功
3 花無缺 絕代雙驕 移花接玉
 

merge

 
Signature:
pd.merge(
left,
right,
how: str = 'inner',
on=None,
left_on=None,
right_on=None,
left_index: bool = False,
right_index: bool = False,
sort: bool = False,
suffixes=('_x', '_y'),
copy: bool = True,
indicator: bool = False,
validate=None,
) -> 'DataFrame'
Docstring:
Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.

Parameters
----------
left : DataFrame
right : DataFrame or named Series
Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
Type of merge to be performed.

* left: use only keys from left frame, similar to a SQL left outer join;
  preserve key order.
* right: use only keys from right frame, similar to a SQL right outer join;
  preserve key order.
* outer: use union of keys from both frames, similar to a SQL full outer
  join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
  join; preserve the order of the left keys.
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : bool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
`left` and `right` respectively. Pass a value of `None` instead
of a string to indicate that the column name from `left` or
`right` should be left as-is, with no suffix. At least one of the
values must not be None.
copy : bool, default True
If False, avoid copy if possible.
indicator : bool or str, default False
If True, adds a column to the output DataFrame called "_merge" with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of "left_only" for observations whose merge key only
appears in the left DataFrame, "right_only" for observations
whose merge key only appears in the right DataFrame, and "both"
if the observation's merge key is found in both DataFrames.
validate : str, optional
If specified, checks if merge is of specified type.

* "one_to_one" or "1:1": check if merge keys are unique in both
  left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left
  dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right
  dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
 

這裡的merge與關係型資料庫中的join非常類似。下面根據例項看看如何使用。

 
  • on:根據某個欄位進行連線,該欄位必須同時存在於兩個DataFrame中(若未同時存在,則需要分別用left_on、right_on來設定)
 

In [20]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
Out[20]:
 


  key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
 

也可以有多個連線鍵:

 

In [21]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
Out[21]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
 
  • how:指定資料用哪種方式進行合併,對應不上的內容會為NaN,預設值為inner
 

上面沒有指定how,預設就是inner,下面分別看看left, right, outer的效果。

 

左外連線:

 

In [22]:
pd.merge(left, right, how='left', on=['key1', 'key2'])
Out[22]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
 

右外連線:

 

In [23]:
pd.merge(left, right, how='right', on=['key1', 'key2'])
Out[23]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
 

全外連線:

 

In [24]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
Out[24]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
 

如果將indicator設定為True,則會增加名為_merge的一列,顯示每行資料的來源,其中left_only表示只在左表中,right_only表示只在右表中,both表示兩個表中都有:

In [25]:
pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)
Out[25]:
 


  key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
5 K2 K0 NaN NaN C3 D3 right_only
 

如果左、右兩邊連線的欄位名稱不同,可以分別設定left_on、right_on:

In [26]:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, left_on='key1', right_on='key2')
Out[26]:
 


  key1 A B key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 K3 C3 D3
 

非關聯欄位名稱相同時,會怎樣?

In [27]:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'B': [30, 50, 70, 90]})
pd.merge(left, right, left_on='key1', right_on='key2')
Out[27]:
 


  key1 A B_x key2 C B_y
0 K0 A0 B0 K0 C0 30
1 K1 A1 B1 K1 C1 50
2 K2 A2 B2 K2 C2 70
3 K3 A3 B3 K3 C3 90
 

預設suffixes=('_x', '_y'),也可以自行修改:

In [28]:
pd.merge(left, right, left_on='key1', right_on='key2',
         suffixes=('_left', '_right'))
Out[28]:
 


  key1 A B_left key2 C B_right
0 K0 A0 B0 K0 C0 30
1 K1 A1 B1 K1 C1 50
2 K2 A2 B2 K2 C2 70
3 K3 A3 B3 K3 C3 90
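另外,簽名中還有一個validate引數,可用來校驗連線關係的型別,不符合預期時會丟擲MergeError。下面是一個簡單的示意(資料為演示自擬):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K0'], 'C': ['C0', 'C1']})

# 左表key唯一、右表key重複,符合一對多關係,校驗透過
ok = pd.merge(left, right, on='key', validate='one_to_many')

# 若要求一對一關係,右表key重複會導致報錯
try:
    pd.merge(left, right, on='key', validate='one_to_one')
except pd.errors.MergeError as e:
    print('校驗失敗:', e)
```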
 

append

 

append可以追加資料,並返回一個新物件,也是一種簡單常用的資料合併方式。

 
Signature:
df.append(other, ignore_index=False, verify_integrity=False, sort=False) -> 'DataFrame'
Docstring:
Append rows of `other` to the end of caller, returning a new object.
 

引數解釋:

  • other: 要追加的其他DataFrame或Series
  • ignore_index: 如果為True則重新進行自然索引
  • verify_integrity: 如果為True則遇到重複索引內容時報錯
  • sort: 是否進行排序
 

追加同結構的資料:

 

In [29]:
df1.append(df2)
Out[29]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
 

追加不同結構的資料時,原本沒有的列會增加,沒有對應內容的位置會為NaN:

 

In [30]:
df1.append(df4, sort=False)
Out[30]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
2 NaN B2 NaN D2 F2
3 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

追加多個DataFrame:

In [31]:
df1.append([df2, df3])
Out[31]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

忽略原索引:

 

In [32]:
df1.append(df4, ignore_index=True, sort=False)
Out[32]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 NaN B2 NaN D2 F2
5 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

追加Series

 

In [33]:
s2 = pd.Series(['X0', 'X1', 'X2', 'X3'],
               index=['A', 'B', 'C', 'D'])
df1.append(s2, ignore_index=True)
Out[33]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 X0 X1 X2 X3
 

追加字典列表:

 

In [34]:
d = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
     {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
df1.append(d, ignore_index=True, sort=False)
Out[34]:
 


  A B C D X Y
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 NaN NaN
3 A3 B3 C3 D3 NaN NaN
4 1 2 3 NaN 4.0 NaN
5 5 6 7 NaN NaN 8.0
 

來個實戰案例。在使用Excel的時候,常常會在資料最後,增加一行彙總資料,比如求和,求平均值等。現在用Pandas如何實現呢?

In [35]:
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 4)),
                  columns=['a', 'b', 'c', 'd'])
df
Out[35]:
 


  a b c d
0 8 5 6 1
1 6 2 8 5
2 7 2 2 3
In [36]:
df.append(pd.Series(df.sum(), name='total'))
Out[36]:
 


  a b c d
0 8 5 6 1
1 6 2 8 5
2 7 2 2 3
total 21 9 16 9
 

資料清洗

 

資料清洗是指發現並糾正資料集中可識別的錯誤的一個過程,包括檢查資料一致性,處理無效值、缺失值等。資料清洗是為了最大限度地提高資料集的準確性。

 

缺失值

 

由於資料來源的複雜性、不確定性,資料中難免會存在欄位值不全、缺失等情況,下面先介紹如何找出這些缺失的值。

 

下面以電影資料為例,資料集來自GitHub;為了方便測試,下載壓縮後上傳到了部落格園。原始資料連結:

https://github.com/LearnDataSci/articles/blob/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv

In [37]:
movies = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/IMDB-Movie-Data.csv.zip')
movies.tail(2)
Out[37]:
 


  Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0
 

上面可以看到Revenue (Millions)列有個NaN,這裡的NaN是一個缺失值標識。

 

NaN (not a number) is the standard missing data marker used in pandas.
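順帶一提,None傳入pandas後同樣會被視為缺失值,示意如下:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, np.nan])
# None與np.nan都會被isna()識別為缺失值
print(s.isna().tolist())
```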

 

判斷資料集中是否有缺失值:

In [38]:
np.any(movies.isna())
Out[38]:
True
 

或者判斷資料集中是否不含缺失值:

In [39]:
np.all(movies.notna())
Out[39]:
False
 

統計每列有多少個缺失值:

In [40]:
movies.isna().sum(axis=0)
Out[40]:
Rank                    0
Title 0
Genre 0
Description 0
Director 0
Actors 0
Year 0
Runtime (Minutes) 0
Rating 0
Votes 0
Revenue (Millions) 128
Metascore 64
dtype: int64
 

統計每行有多少個缺失值:

In [41]:
movies.isna().sum(axis=1)
Out[41]:
0      0
1 0
2 0
3 0
4 0
..
995 1
996 0
997 0
998 1
999 0
Length: 1000, dtype: int64
 

統計一共有多少個缺失值:

In [42]:
movies.isna().sum().sum()
Out[42]:
192
 

篩選出有缺失值的列:

In [43]:
movies.isnull().any(axis=0)
Out[43]:
Rank                  False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
 

統計有缺失值的列的個數:

In [44]:
movies.isnull().any(axis=0).sum()
Out[44]:
2
 

篩選出有缺失值的行:

In [45]:
movies.isnull().any(axis=1)
Out[45]:
0      False
1 False
2 False
3 False
4 False
...
995 True
996 False
997 False
998 True
999 False
Length: 1000, dtype: bool
 

統計有缺失值的行的個數:

In [46]:
movies.isnull().any(axis=1).sum()
Out[46]:
162
 

檢視Metascore列缺失的資料:

In [47]:
movies[movies['Metascore'].isna()][['Rank', 'Title', 'Votes', 'Metascore']]
Out[47]:
 


  Rank Title Votes Metascore
25 26 Paris pieds nus 222 NaN
26 27 Bahubali: The Beginning 76193 NaN
27 28 Dead Awake 523 NaN
39 40 5- 25- 77 241 NaN
42 43 Don't Fuck in the Woods 496 NaN
... ... ... ... ...
967 968 The Walk 92378 NaN
969 970 The Lone Ranger 190855 NaN
971 972 Disturbia 193491 NaN
989 990 Selma 67637 NaN
992 993 Take Me Home Tonight 45419 NaN

64 rows × 4 columns

In [48]:
movies.shape
Out[48]:
(1000, 12)
In [49]:
movies.count()
Out[49]:
Rank                  1000
Title 1000
Genre 1000
Description 1000
Director 1000
Actors 1000
Year 1000
Runtime (Minutes) 1000
Rating 1000
Votes 1000
Revenue (Millions) 872
Metascore 936
dtype: int64
 

可以看出,每列本應有1000筆資料,count()並沒有把缺失值算進來。

 

對於缺失值的處理,應根據具體的業務場景以及資料完整性的要求選擇較合適的方案。常見的處理方案包括刪除存在缺失值的資料(dropna)、替換缺失值(fillna)。

 

刪除

 

某些場景下,有缺失值會認為該樣本資料無效,就需要對整行或者整列資料進行刪除(dropna)。

 

統計無缺失值的行的個數:

In [50]:
movies.notna().all(axis=1).sum()
Out[50]:
838
 

刪除所有含缺失值的行:

In [51]:
data = movies.dropna()
data.shape
Out[51]:
(838, 12)
In [52]:
movies.shape
Out[52]:
(1000, 12)
 

data裡838行資料都是無缺失值的。但是,值得注意的是,dropna()預設並不會在原來的資料集上刪除,除非指定dropna(inplace=True)。下面演示一下:

In [53]:
# 複製一份完整的資料做原地刪除演示
movies_copy = movies.copy()
print(movies_copy.shape)

# 原地刪除
movies_copy.dropna(inplace=True)

# 檢視原地刪除是否生效
movies_copy.shape
 
(1000, 12)
Out[53]:
(838, 12)
 

統計無缺失值的列的個數:

In [54]:
movies.notna().all(axis=0).sum()
Out[54]:
10
 

刪除含缺失值的列:

In [55]:
data = movies.dropna(axis=1)
data.shape
Out[55]:
(1000, 10)
 

how : {'any', 'all'}, default 'any'. Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • 'any' : If any NA values are present, drop that row or column.
  • 'all' : If all values are NA, drop that row or column.
 

刪除所有值都缺失的行:

In [56]:
# Drop the rows where all elements are missing
data = movies.dropna(how='all')
data.shape
Out[56]:
(1000, 12)
 

這裡的資料不存在所有值都缺失的行,所以how='all'的dropna()對此處的資料集沒有任何影響。

 

subset: Define in which columns to look for missing values.

 

subset引數可以指定在哪些列中尋找缺失值:

In [57]:
# 指定在Title、Metascore兩列中尋找缺失值
data = movies.dropna(subset=['Title', 'Metascore'])
data.shape
Out[57]:
(936, 12)
 

統計Title、Metascore這兩列無缺失值的行數。按理說,應該等於上面刪除缺失值後的行數。

In [58]:
movies[['Title', 'Metascore']].notna().all(axis=1).sum()
Out[58]:
936
 

thresh : int, optional. Require that many non-NA values.

In [59]:
# Keep only the rows with at least 2 non-NA values.
data = movies[['Title', 'Metascore', 'Revenue (Millions)']].dropna(thresh=2)
data.shape
Out[59]:
(970, 3)
 

由於Title列沒有缺失值,相當於刪除掉Metascore、Revenue (Millions)兩列都為缺失值的行,如下:

In [60]:
data = movies[['Metascore', 'Revenue (Millions)']].dropna(how='all')
data.shape
Out[60]:
(970, 2)
 

填充

 

處理缺失值的另外一種常用方法是填充(fillna)。填充值雖然不絕對準確,但對獲得真實結果的影響並不大時,可以嘗試一用。

 

先看個簡單的例子,然後再應用到上面的電影資料中去。因為電影資料比較多,演示起來並不直觀。

In [61]:
hero = pd.DataFrame(data={'score': [97, np.nan, 96, np.nan, 95],
                          'wins': [np.nan, 9, np.nan, 11, 10],
                          'author': ['古龍', '金庸', np.nan, np.nan, np.nan],
                          'book': ['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '射鵰英雄傳', '絕代雙驕'],
                          'skill': ['小李飛刀', '獨孤九劍', '九陽神功', '降龍十八掌', '移花接玉'],
                          'wife': [np.nan, '任盈盈', np.nan, '黃蓉', np.nan],
                          'child': [np.nan, np.nan, np.nan, '郭襄', np.nan]},
                    index=['李尋歡', '令狐沖', '張無忌', '郭靖', '花無缺'])
hero
Out[61]:
 


  score wins author book skill wife child
李尋歡 97.0 NaN 古龍 多情劍客無情劍 小李飛刀 NaN NaN
令狐沖 NaN 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 NaN
張無忌 96.0 NaN NaN 倚天屠龍記 九陽神功 NaN NaN
郭靖 NaN 11.0 NaN 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 NaN 絕代雙驕 移花接玉 NaN NaN
 

全部填充為unknown

In [62]:
hero.fillna('unknown')
Out[62]:
 


  score wins author book skill wife child
李尋歡 97 unknown 古龍 多情劍客無情劍 小李飛刀 unknown unknown
令狐沖 unknown 9 金庸 笑傲江湖 獨孤九劍 任盈盈 unknown
張無忌 96 unknown unknown 倚天屠龍記 九陽神功 unknown unknown
郭靖 unknown 11 unknown 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95 10 unknown 絕代雙驕 移花接玉 unknown unknown
 

每列只替換第一個缺失值(limit=1):

In [63]:
hero.fillna('unknown', limit=1)
Out[63]:
 


  score wins author book skill wife child
李尋歡 97 unknown 古龍 多情劍客無情劍 小李飛刀 unknown unknown
令狐沖 unknown 9 金庸 笑傲江湖 獨孤九劍 任盈盈 NaN
張無忌 96 NaN unknown 倚天屠龍記 九陽神功 NaN NaN
郭靖 NaN 11 NaN 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95 10 NaN 絕代雙驕 移花接玉 NaN NaN
 

不同列替換不同的值:

In [64]:
hero.fillna(value={'score': 100, 'author': '匿名', 'wife': '保密'})
Out[64]:
 


  score wins author book skill wife child
李尋歡 97.0 NaN 古龍 多情劍客無情劍 小李飛刀 保密 NaN
令狐沖 100.0 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 NaN
張無忌 96.0 NaN 匿名 倚天屠龍記 九陽神功 保密 NaN
郭靖 100.0 11.0 匿名 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 匿名 絕代雙驕 移花接玉 保密 NaN
 

method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None

上面是填充固定值,此外還能指定填充方法。

 

pad / ffill: propagate last valid observation forward to next valid

In [65]:
# 使用前一個有效值填充
hero.fillna(method='ffill')
Out[65]:
 


  score wins author book skill wife child
李尋歡 97.0 NaN 古龍 多情劍客無情劍 小李飛刀 NaN NaN
令狐沖 97.0 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 NaN
張無忌 96.0 9.0 金庸 倚天屠龍記 九陽神功 任盈盈 NaN
郭靖 96.0 11.0 金庸 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 金庸 絕代雙驕 移花接玉 黃蓉 郭襄
 

backfill / bfill: use next valid observation to fill gap.

In [66]:
# 使用後一個有效值填充
hero.fillna(method='bfill')
Out[66]:
 


  score wins author book skill wife child
李尋歡 97.0 9.0 古龍 多情劍客無情劍 小李飛刀 任盈盈 郭襄
令狐沖 96.0 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 郭襄
張無忌 96.0 11.0 NaN 倚天屠龍記 九陽神功 黃蓉 郭襄
郭靖 95.0 11.0 NaN 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 NaN 絕代雙驕 移花接玉 NaN NaN
 

score、wins兩列的缺失值用平均值來填充:

In [67]:
hero.fillna(hero[['score', 'wins']].mean())
Out[67]:
 


  score wins author book skill wife child
李尋歡 97.0 10.0 古龍 多情劍客無情劍 小李飛刀 NaN NaN
令狐沖 96.0 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 NaN
張無忌 96.0 10.0 NaN 倚天屠龍記 九陽神功 NaN NaN
郭靖 96.0 11.0 NaN 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 NaN 絕代雙驕 移花接玉 NaN NaN
 

指定列填充為unknown,並原地替換:

In [68]:
hero.child.fillna('unknown', inplace=True)
hero
Out[68]:
 


  score wins author book skill wife child
李尋歡 97.0 NaN 古龍 多情劍客無情劍 小李飛刀 NaN unknown
令狐沖 NaN 9.0 金庸 笑傲江湖 獨孤九劍 任盈盈 unknown
張無忌 96.0 NaN NaN 倚天屠龍記 九陽神功 NaN unknown
郭靖 NaN 11.0 NaN 射鵰英雄傳 降龍十八掌 黃蓉 郭襄
花無缺 95.0 10.0 NaN 絕代雙驕 移花接玉 NaN unknown
 

再回到上面的實際的電影資料案例。現在用平均值替換缺失值:

In [69]:
filled_movies = movies.fillna(
    movies[['Revenue (Millions)', 'Metascore']].mean())
np.any(filled_movies.isna())
Out[69]:
False
 

可見,填充後的電影資料中已經不存在缺失值了。

 

資料替換

 

資料替換常用於資料清洗整理、列舉轉換、資料修正等場景。

 

先看下replace()方法的介紹:

 
Signature:
replace(
to_replace=None,
value=None,
inplace=False,
limit=None,
regex=False,
method='pad',
)
Docstring:
Replace values given in `to_replace` with `value`.
 

再看幾個例子:

In [70]:
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
s
Out[70]:
0    a
1 b
2 c
3 d
4 e
dtype: object
In [71]:
s.replace('a', 'aa')
Out[71]:
0    aa
1 b
2 c
3 d
4 e
dtype: object
In [72]:
s.replace({'d': 'dd', 'e': 'ee'})
Out[72]:
0     a
1 b
2 c
3 dd
4 ee
dtype: object
In [73]:
s.replace(['a', 'b', 'c'], ['aa', 'bb', 'cc'])
Out[73]:
0    aa
1 bb
2 cc
3 d
4 e
dtype: object
In [74]:
# 將c替換為它前一個值
s.replace('c', method='ffill')
Out[74]:
0    a
1 b
2 b
3 d
4 e
dtype: object
In [75]:
# 將c替換為它後一個值
s.replace('c', method='bfill')
Out[75]:
0    a
1 b
2 d
3 d
4 e
dtype: object
In [76]:
df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
'B': [0, -20, 2, 5, 16]})
df
Out[76]:
 


  A B
0 0 0
1 -11 -20
2 2 2
3 3 5
4 35 16
In [77]:
df.replace(0, 5)
Out[77]:
 


  A B
0 5 5
1 -11 -20
2 2 2
3 3 5
4 35 16
In [78]:
df.replace([0, 2, 3, 5], 10)
Out[78]:
 


  A B
0 10 10
1 -11 -20
2 10 10
3 10 10
4 35 16
In [79]:
df.replace([0, 2], [100, 200])
Out[79]:
 


  A B
0 100 100
1 -11 -20
2 200 200
3 3 5
4 35 16
In [80]:
df.replace({0: 10, 2: 22})
Out[80]:
 


  A B
0 10 10
1 -11 -20
2 22 22
3 3 5
4 35 16
In [81]:
df.replace({'A': 0, 'B': 2}, 100)
Out[81]:
 


  A B
0 100 0
1 -11 -20
2 2 100
3 3 5
4 35 16
In [82]:
df.replace({'A': {2: 200, 3: 300}})
Out[82]:
 


  A B
0 0 0
1 -11 -20
2 200 2
3 300 5
4 35 16
 

對一些極端值,如過大或者過小的資料,可以使用df.clip(lower, upper)來修剪:當資料大於upper時,用upper的值替換;小於lower時,用lower的值替換,類似numpy.clip方法。

 

在修剪之前,再看一眼原始資料:

In [83]:
df
Out[83]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 0 0
1 -11 -20
2 2 2
3 3 5
4 35 16
In [84]:
# 修剪成最小為2,最大為10
df.clip(2, 10)
Out[84]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 2 2
1 2 2
2 2 2
3 3 5
4 10 10
In [85]:
# 對每列元素的最小值和最大值進行不同的限制

# 將A列數值修剪成[-3, 3]之間
# 將B列數值修剪成[-5, 5]之間
df.clip([-3, -5], [3, 5], axis=1)
Out[85]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 0 0
1 -3 -5
2 2 2
3 3 5
4 3 5
In [86]:
# 對每行元素的最小值和最大值進行不同的限制

# 將第1行數值修剪成[5, 10]之間
# 將第2行數值修剪成[-15, -12]之間
# 將第3行數值修剪成[6, 10]之間
# 將第4行數值修剪成[4, 10]之間
# 將第5行數值修剪成[20, 30]之間
df.clip([5, -15, 6, 4, 20],
[10, -12, 10, 10, 30],
axis=0)
Out[86]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 5 5
1 -12 -15
2 6 6
3 4 5
4 30 20
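補充一點:clip的lower和upper都可以單獨省略,只修剪單邊。下面是一個小示例(沿用上面的df資料,非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
                   'B': [0, -20, 2, 5, 16]})

# 只限制下界:所有小於0的值被修剪為0,上界不做任何限制
clipped = df.clip(lower=0)
print(clipped)
```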
 

另外,可以將無效值先替換為nan,再做缺失值處理。這樣就能應用上前面講到的缺失值處理相關的知識。

 

比如這裡的df,我們認為小於0的資料都是無效資料,可以:

In [87]:
df.replace([-11, -20], np.nan)
Out[87]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 0.0 0.0
1 NaN NaN
2 2.0 2.0
3 3.0 5.0
4 35.0 16.0
 

當然,也可以像下面這樣把無效資料變為nan:

In [88]:
df[df >= 0]
Out[88]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 0.0 0.0
1 NaN NaN
2 2.0 2.0
3 3.0 5.0
4 35.0 16.0
 

此時,上面講到的缺失值處理就能派上用場了。
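比如,可以把替換出來的NaN再用fillna填充,一步完成「無效值替換+缺失值處理」。下面是一個小示例(沿用上面的df資料,非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
                   'B': [0, -20, 2, 5, 16]})

# 先把無效值(這裡假定負數無效)變為NaN,再用0填充缺失值
cleaned = df[df >= 0].fillna(0)
print(cleaned)
```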

 

文字內容比較複雜時,可以使用正則進行匹配替換。下面看幾個例子:

In [89]:
df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
'B': ['abc', 'bar', 'xyz']})
df
Out[89]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 bat abc
1 foo bar
2 bait xyz
In [90]:
# 利用正則將ba開頭且總共3個字元的文字替換為new
df.replace(to_replace=r'^ba.$', value='new', regex=True)
Out[90]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 new abc
1 foo new
2 bait xyz
In [91]:
# 如果多列正則不同的情況下可以按以下格式對應傳入
df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
Out[91]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 new abc
1 foo bar
2 bait xyz
In [92]:
df.replace(regex=r'^ba.$', value='new')
Out[92]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 new abc
1 foo new
2 bait xyz
In [93]:
# 不同正則替換不同的值
df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
Out[93]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 new abc
1 xyz new
2 bait xyz
In [94]:
# 多個正則替換為同一個值
df.replace(regex=[r'^ba.$', 'foo'], value='new')
Out[94]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  A B
0 new abc
1 new new
2 bait xyz
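另外,regex=True時value中還可以使用反向引用(與re.sub的用法一致),保留匹配到的部分內容。下面是一個小示例(非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})

# \1引用正則中第一個分組捕獲到的字元:bat -> new_t,bar -> new_r
out = df.replace(regex=r'^ba(.)$', value=r'new_\1')
print(out)
```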
 

重複值

 

重複值在資料清洗中可能需要刪除。下面介紹Pandas如何識別重複值以及如何刪除重複值。

 
Signature:
df.duplicated(
subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
keep: Union[str, bool] = 'first',
) -> 'Series'
Docstring:
Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters
----------
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to mark.
- ``first`` : Mark duplicates as ``True`` except for the first occurrence.
- ``last`` : Mark duplicates as ``True`` except for the last occurrence.
- False : Mark all duplicates as ``True``.
 

看官方給的例子:

In [95]:
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
Out[95]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
In [96]:
# 預設情況下,對於每行重複的值,第一次出現都設定為False,其他為True
df.duplicated()
Out[96]:
0    False
1 True
2 False
3 False
4 False
dtype: bool
In [97]:
# 將每行重複值的最後一次出現設定為False,其他為True
df.duplicated(keep='last')
Out[97]:
0     True
1 False
2 False
3 False
4 False
dtype: bool
In [98]:
# 所有重複行都為True
df.duplicated(keep=False)
Out[98]:
0     True
1 True
2 False
3 False
4 False
dtype: bool
In [99]:
# 引數subset可以在指定列上查詢重複值
df.duplicated(subset=['brand'])
Out[99]:
0    False
1 True
2 False
3 True
4 True
dtype: bool
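由於duplicated()返回的是布林Series,可以直接用它做布林索引,把所有重複的行取出來檢查。下面是一個小示例(沿用上面的df資料):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# keep=False把每組重複行全部標記為True,配合布林索引取出所有重複行
dups = df[df.duplicated(keep=False)]
print(dups)
```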
 

再看如何刪除重複值:

 
Signature:
df.drop_duplicates(
subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
keep: Union[str, bool] = 'first',
inplace: bool = False,
ignore_index: bool = False,
) -> Union[ForwardRef('DataFrame'), NoneType]
Docstring:
Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes
are ignored.

Parameters
----------
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to keep.
- ``first`` : Drop duplicates except for the first occurrence.
- ``last`` : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace : bool, default False
Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
 

同樣繼續官方給的例子:

In [100]:
# By default, it removes duplicate rows based on all columns
df.drop_duplicates()
Out[100]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
In [101]:
# To remove duplicates on specific column(s), use `subset`
df.drop_duplicates(subset=['brand'])
Out[101]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
In [102]:
# To remove duplicates and keep last occurences, use `keep`
df.drop_duplicates(subset=['brand', 'style'], keep='last')
Out[102]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
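去重後索引會保留原來的編號(如上面結果中的1、2、4),如果希望重新從0開始編號,可以加上ignore_index=True(pandas 1.0+支援)。下面是一個小示例(沿用上面的df資料):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# ignore_index=True讓去重後的結果重新從0開始編號
out = df.drop_duplicates(ignore_index=True)
print(out)
```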
 

分組與聚合

 

 

在資料統計與分析中,分組與聚合非常常見。如果是SQL,對應的就是Group By和聚合函式(Aggregation Functions)。下面看看pandas是怎麼玩的。

 
Signature:
df.groupby(
by=None,
axis=0,
level=None,
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
squeeze: bool = <object object at 0x7f3df810e750>,
observed: bool = False,
dropna: bool = True,
) -> 'DataFrameGroupBy'
Docstring:
Group DataFrame using a mapper or by a Series of columns.
 

groupby()方法可以按指定欄位對DataFrame進行分組,生成一個分組器物件,然後再把這個物件的各個欄位按一定的聚合方法輸出。

 

其中by為分組欄位,由於是第一個位置引數,引數名可以省略,也可以用列表傳入多個欄位。groupby()會返回一個DataFrameGroupBy物件;如果不接聚合方法,它並不會返回DataFrame。

 

準備演示資料:

In [103]:
df = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/team.csv.zip')
df
Out[103]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  name team Q1 Q2 Q3 Q4
0 Liver E 89 21 24 64
1 Arry C 36 37 37 57
2 Ack A 57 60 18 84
3 Eorge C 93 96 71 78
4 Oah D 65 49 61 86
... ... ... ... ... ... ...
95 Gabriel C 48 59 87 74
96 Austin7 C 21 31 30 43
97 Lincoln4 C 98 93 1 20
98 Eli E 11 74 58 91
99 Ben E 21 43 41 74

100 rows × 6 columns

In [104]:
# 按team分組後對應列求和
df.groupby('team').sum()
Out[104]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Q1 Q2 Q3 Q4
team        
A 1066 639 875 783
B 975 1218 1202 1136
C 1056 1194 1068 1127
D 860 1191 1241 1199
E 963 1013 881 1033
In [105]:
# 按team分組後對應列求平均值
df.groupby('team').mean()
Out[105]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Q1 Q2 Q3 Q4
team        
A 62.705882 37.588235 51.470588 46.058824
B 44.318182 55.363636 54.636364 51.636364
C 48.000000 54.272727 48.545455 51.227273
D 45.263158 62.684211 65.315789 63.105263
E 48.150000 50.650000 44.050000 51.650000
In [106]:
# 按team分組後不同列使用不同的聚合方式
df.groupby('team').agg({'Q1': sum, # 求和
'Q2': 'count', # 計數
'Q3': 'mean', # 求平均值
'Q4': max}) # 求最大值
Out[106]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Q1 Q2 Q3 Q4
team        
A 1066 17 51.470588 97
B 975 22 54.636364 99
C 1056 22 48.545455 98
D 860 19 65.315789 99
E 963 20 44.050000 98
 

If by is a function, it's called on each value of the object's index.

In [107]:
# team在C之前(包括C)分為一組,C之後的分為另外一組
df.set_index('team').groupby(lambda team: 'team1' if team <= 'C' else 'team2')['name'].count()
Out[107]:
team1    61
team2 39
Name: name, dtype: int64
 

或者下面這種寫法也行:

In [108]:
df.groupby(lambda idx: 'team1' if df.loc[idx]['team'] <= 'C' else 'team2')['name'].count()
Out[108]:
team1    61
team2 39
Name: name, dtype: int64
In [109]:
# 按name的長度(length)分組,並取出每組中name的第一個值和最後一個值
df.groupby(df['name'].apply(lambda x: len(x))).agg({'name': ['first', 'last']})
Out[109]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead tr th { text-align: left }
.dataframe thead tr:last-of-type th { text-align: right }

  name
  first last
name    
3 Ack Ben
4 Arry Leon
5 Liver Aiden
6 Harlie Jamie0
7 William Austin7
8 Harrison Lincoln4
9 Alexander Theodore3
In [110]:
# 只對部分分組
df.set_index('team').groupby({'A': 'A組', 'B': 'B組'})['name'].count()
Out[110]:
A組    17
B組 22
Name: name, dtype: int64
 

可以將以上方法混合組成列表進行分組:

In [111]:
# 按team,name長度分組,取分組中最後一行
df.groupby(['team', df['name'].apply(lambda x: len(x))]).last()
Out[111]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

    name Q1 Q2 Q3 Q4
team name          
A 3 Ack 57 60 18 84
4 Toby 52 27 17 68
5 Aaron 96 75 55 8
6 Nathan 87 77 62 13
7 Stanley 69 71 39 97
B 3 Kai 66 45 13 48
4 Liam 2 80 24 25
5 Lewis 4 34 77 28
6 Jamie0 39 97 84 55
7 Albert0 85 38 41 17
8 Grayson7 59 84 74 33
C 4 Adam 90 32 47 39
5 Calum 14 91 16 82
6 Connor 62 38 63 46
7 Austin7 21 31 30 43
8 Lincoln4 98 93 1 20
9 Sebastian 1 14 68 48
D 3 Oah 65 49 61 86
4 Ezra 16 56 86 61
5 Aiden 20 31 62 68
6 Reuben 70 72 76 56
7 Hunter3 38 80 82 40
8 Benjamin 15 88 52 25
9 Theodore3 43 7 68 80
E 3 Ben 21 43 41 74
4 Leon 38 60 31 7
5 Roman 73 1 25 44
6 Dexter 73 94 53 20
7 Zachary 12 71 85 93
8 Jackson5 6 10 15 33
 

We can groupby different levels of a hierarchical index using the level parameter.

In [112]:
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
index=index)
df
Out[112]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

    Max Speed
Animal Type  
Falcon Captive 390.0
Wild 350.0
Parrot Captive 30.0
Wild 20.0
In [113]:
# df.groupby(level=0).mean()
df.groupby(level="Animal").mean()
Out[113]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Max Speed
Animal  
Falcon 370.0
Parrot 25.0
In [114]:
# df.groupby(level=1).mean()
df.groupby(level="Type").mean()
Out[114]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Max Speed
Type  
Captive 210.0
Wild 185.0
 

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.

In [115]:
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df
Out[115]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
In [116]:
df.groupby(by=["b"]).sum()
Out[116]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  a c
b    
1.0 2 3
2.0 2 5
In [117]:
df.groupby(by=["b"], dropna=False).sum()
Out[117]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  a c
b    
1.0 2 3
2.0 2 5
NaN 1 4
 

上面體驗了一下pandas分組聚合的基本使用後,接下來看看分組聚合的一些過程細節。

 

分組

 

有以下動物最大速度資料:

In [118]:
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
('bird', 'Psittaciformes', 24.0),
('mammal', 'Carnivora', 80.2),
('mammal', 'Primates', np.nan),
('mammal', 'Carnivora', 58)],
index=['falcon', 'parrot', 'lion',
'monkey', 'leopard'],
columns=('class', 'order', 'max_speed'))
df
Out[118]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [119]:
# 分組數
df.groupby('class').ngroups
Out[119]:
2
In [120]:
# 檢視分組
df.groupby('class').groups
Out[120]:
{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
In [121]:
df.groupby('class').size()
Out[121]:
class
bird 2
mammal 3
dtype: int64
In [122]:
# 檢視鳥類分組內容
df.groupby('class').get_group('bird')
Out[122]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
 

獲取分組中的第幾個值:

In [123]:
# 每組中索引為1的行(即第二行)
df.groupby('class').nth(1)
Out[123]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  order max_speed
class    
bird Psittaciformes 24.0
mammal Primates NaN
In [124]:
# 最後一個
df.groupby('class').nth(-1)
Out[124]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  order max_speed
class    
bird Psittaciformes 24.0
mammal Carnivora 58.0
In [125]:
# 每組中索引為1和2的行(即第二、三行)
df.groupby('class').nth([1, 2])
Out[125]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  order max_speed
class    
bird Psittaciformes 24.0
mammal Primates NaN
mammal Carnivora 58.0
In [126]:
# 每組顯示前2個
df.groupby('class').head(2)
Out[126]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
In [127]:
# 每組最後2個
df.groupby('class').tail(2)
Out[127]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [128]:
# 分組序號
df.groupby('class').ngroup()
Out[128]:
falcon     0
parrot 0
lion 1
monkey 1
leopard 1
dtype: int64
In [129]:
# 返回每個元素在所在組內的編號;ascending=False表示從組尾倒序編號
df.groupby('class').cumcount(ascending=False)
Out[129]:
falcon     1
parrot 0
lion 2
monkey 1
leopard 0
dtype: int64
In [130]:
# 按class列的首字母分組
df.groupby(df['class'].str[0]).groups
Out[130]:
{'b': ['falcon', 'parrot'], 'm': ['lion', 'monkey', 'leopard']}
In [131]:
# 按class列的第一個和第二個字母分組
df.groupby([df['class'].str[0], df['class'].str[1]]).groups
Out[131]:
{('b', 'i'): ['falcon', 'parrot'], ('m', 'a'): ['lion', 'monkey', 'leopard']}
In [132]:
# 在組內的排名
df.groupby('class').rank()
Out[132]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  max_speed
falcon 2.0
parrot 1.0
lion 2.0
monkey NaN
leopard 1.0
 

聚合

 

對資料進行分組後,接下來就可以收穫果實了:給分組指定統計方法,最終得到分組聚合的結果。除了常見的數學統計方法,還可以使用agg()、transform()等函式進行操作。

In [133]:
# 描述性統計
df.groupby('class').describe()
Out[133]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead tr th { text-align: left }
.dataframe thead tr:last-of-type th { text-align: right }

  max_speed
  count mean std min 25% 50% 75% max
class                
bird 2.0 206.5 258.093975 24.0 115.25 206.5 297.75 389.0
mammal 2.0 69.1 15.697771 58.0 63.55 69.1 74.65 80.2
In [134]:
# 一列使用多個聚合方法
df.groupby('class').agg({'max_speed': ['min', 'max', 'sum']})
Out[134]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead tr th { text-align: left }
.dataframe thead tr:last-of-type th { text-align: right }

  max_speed
  min max sum
class      
bird 24.0 389.0 413.0
mammal 58.0 80.2 138.2
In [135]:
df.groupby('class')['max_speed'].agg(
Max='max', Min='min', Diff=lambda x: x.max() - x.min())
Out[135]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  Max Min Diff
class      
bird 389.0 24.0 365.0
mammal 80.2 58.0 22.2
In [136]:
df.groupby('class').agg(max_speed=('max_speed', 'max'),
count_order=('order', 'count'))
Out[136]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  max_speed count_order
class    
bird 389.0 2
mammal 80.2 3
In [137]:
df.groupby('class').agg(
max_speed=pd.NamedAgg(column='max_speed', aggfunc='max'),
count_order=pd.NamedAgg(column='order', aggfunc='count')
)
Out[137]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  max_speed count_order
class    
bird 389.0 2
mammal 80.2 3
 

transform類似於agg,但不同的是它返回一個與原資料形狀相同的DataFrame,會將原來的每個值一一替換成其所在組統計後的值。比如按組計算平均值,那麼返回的新DataFrame中每個值就是它所在組的平均值。

In [138]:
df.groupby('class').agg(np.mean)
Out[138]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  max_speed
class  
bird 206.5
mammal 69.1
In [139]:
df.groupby('class').transform(np.mean)
Out[139]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  max_speed
falcon 206.5
parrot 206.5
lion 69.1
monkey 69.1
leopard 69.1
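transform常見的用法是把組內統計值回填成新列,這樣每行都能直接拿到自己所在組的統計結果。下面是一個小示例(資料仿照上面的動物速度表,非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame([('bird', 389.0), ('bird', 24.0),
                   ('mammal', 80.2), ('mammal', 58.0)],
                  columns=('class', 'max_speed'))

# 新列class_mean中,每行的值都是該行所在class組的平均速度
df['class_mean'] = df.groupby('class')['max_speed'].transform('mean')
print(df)
```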
 

分組後篩選原始資料:

 
Signature:
DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)
Docstring:
Return a copy of a DataFrame excluding filtered elements.

Elements from groups are filtered if they do not satisfy the
boolean criterion specified by func.
In [140]:
# 篩選出 按class分組後,分組內max_speed平均值大於100的元素
df.groupby(['class']).filter(lambda x: x['max_speed'].mean() > 100)
Out[140]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
In [141]:
# 取出分組後index
df.groupby('class').apply(lambda x: x.index.to_list())
Out[141]:
class
bird [falcon, parrot]
mammal [lion, monkey, leopard]
dtype: object
In [142]:
# 取出分組後每組中max_speed最大的前N個
df.groupby('class').apply(lambda x: x.sort_values(
by='max_speed', ascending=False).head(1))
Out[142]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

    class order max_speed
class        
bird falcon bird Falconiformes 389.0
mammal lion mammal Carnivora 80.2
In [143]:
df.groupby('class').apply(lambda x: pd.Series({
'speed_max': x['max_speed'].max(),
'speed_min': x['max_speed'].min(),
'speed_mean': x['max_speed'].mean(),
}))
Out[143]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  speed_max speed_min speed_mean
class      
bird 389.0 24.0 206.5
mammal 80.2 58.0 69.1
 

按分組匯出Excel檔案:

In [144]:
for group, data in df.groupby('class'):
data.to_excel(f'data/{group}.xlsx')
In [145]:
# 每組去重值後數量
df.groupby('class').order.nunique()
Out[145]:
class
bird 2
mammal 2
Name: order, dtype: int64
In [146]:
# 每組去重後的值
df.groupby("class")['order'].unique()
Out[146]:
class
bird [Falconiformes, Psittaciformes]
mammal [Carnivora, Primates]
Name: order, dtype: object
In [147]:
# 統計每組資料值的數量
df.groupby("class")['order'].value_counts()
Out[147]:
class   order
bird Falconiformes 1
Psittaciformes 1
mammal Carnivora 2
Primates 1
Name: order, dtype: int64
In [148]:
# 每組最大的1個
df.groupby("class")['max_speed'].nlargest(1)
Out[148]:
class
bird falcon 389.0
mammal lion 80.2
Name: max_speed, dtype: float64
In [149]:
# 每組最小的2個
df.groupby("class")['max_speed'].nsmallest(2)
Out[149]:
class
bird parrot 24.0
falcon 389.0
mammal leopard 58.0
lion 80.2
Name: max_speed, dtype: float64
In [150]:
# 每組值是否單調遞增
df.groupby("class")['max_speed'].is_monotonic_increasing
Out[150]:
class
bird False
mammal False
Name: max_speed, dtype: bool
In [151]:
# 每組值是否單調遞減
df.groupby("class")['max_speed'].is_monotonic_decreasing
Out[151]:
class
bird True
mammal False
Name: max_speed, dtype: bool
 

堆疊與透視

 

實際生產中,我們拿到的原始資料的表現形狀可能並不符合當前需求,比如說不是期望的維度、資料不夠直觀、表現力不夠等等。此時,可以對原始資料進行適當的變形,比如堆疊、透視、行列轉置等。

 

堆疊

 

看個簡單的例子就能明白講的是什麼:

In [152]:
df = pd.DataFrame([[19, 136, 180, 98], [21, 122, 178, 96]], index=['令狐沖', '李尋歡'],
columns=['age', 'weight', 'height', 'score'])
df
Out[152]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  age weight height score
令狐沖 19 136 180 98
李尋歡 21 122 178 96
In [153]:
# 有點像寬表變高表, 我是這樣覺得的
df.stack()
Out[153]:
令狐沖  age        19
weight 136
height 180
score 98
李尋歡 age 21
weight 122
height 178
score 96
dtype: int64
In [154]:
# 有點像高表變寬表
df.stack().unstack()
Out[154]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  age weight height score
令狐沖 19 136 180 98
李尋歡 21 122 178 96
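unstack()預設展開最內層索引,也可以透過level引數指定要展開哪一層。下面是一個小示例(非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame([[19, 136], [21, 122]],
                  index=['令狐沖', '李尋歡'],
                  columns=['age', 'weight'])

s = df.stack()
# level=0把最外層索引(姓名)展開成列,相當於轉置了原表
wide = s.unstack(level=0)
print(wide)
```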
 

透視表

 

 
Signature:
df.pivot(index=None, columns=None, values=None) -> 'DataFrame'
Docstring:
Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a "pivot" table) based on column values. Uses
unique values from specified `index` / `columns` to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns.
In [155]:
df = pd.DataFrame({'name': ['江小魚', '江小魚', '江小魚', '花無缺', '花無缺',
'花無缺'],
'bug_level': ['A', 'B', 'C', 'A', 'B', 'C'],
'bug_count': [2, 3, 5, 1, 5, 6]})
df
Out[155]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  name bug_level bug_count
0 江小魚 A 2
1 江小魚 B 3
2 江小魚 C 5
3 花無缺 A 1
4 花無缺 B 5
5 花無缺 C 6
 

把上面的bug等級與bug數統計表變形如下,還是原來的資料,但是不是更加直觀呢?

In [156]:
df.pivot(index='name', columns='bug_level', values='bug_count')
Out[156]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

bug_level A B C
name      
江小魚 2 3 5
花無缺 1 5 6
 

如果原始資料中有重複的統計呢?就比如說上面的例子中來自不同產品線的bug統計,就可能出現兩行這樣的資料['江小魚','B',3]、['江小魚','B',4],先試下用pivot會怎樣?

In [157]:
df = pd.DataFrame({'name': ['江小魚', '江小魚', '江小魚', '江小魚', '江小魚', '花無缺', '花無缺',
'花無缺', '花無缺', '花無缺', ],
'bug_level': ['A', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
'bug_count': [2, 3, 5, 4, 6, 1, 5, 6, 3, 1],
'score': [70, 80, 90, 76, 86, 72, 82, 88, 68, 92]})
df
Out[157]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  name bug_level bug_count score
0 江小魚 A 2 70
1 江小魚 B 3 80
2 江小魚 C 5 90
3 江小魚 B 4 76
4 江小魚 C 6 86
5 花無缺 A 1 72
6 花無缺 B 5 82
7 花無缺 C 6 88
8 花無缺 A 3 68
9 花無缺 B 1 92
In [158]:
try:
df.pivot(index='name', columns='bug_level', values='bug_count')
except ValueError as e:
print(e)
 
Index contains duplicate entries, cannot reshape
 

原來,pivot()只能將資料進行reshape,不支援聚合。遇到上面這種含重複值、需要進行聚合計算的情況,應使用pivot_table()。它能實現類似Excel那樣的高階資料透視功能。

In [159]:
# 統計員工來自不同產品線不同級別的bug總數
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum)
Out[159]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

bug_level A B C
name      
江小魚 2 7 11
花無缺 4 6 6
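順便一提,pivot_table這種「分組求和再展開」的效果,也可以用groupby加unstack近似實現。下面是一個小示例(資料為簡化後的示意,非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame({'name': ['江小魚', '江小魚', '花無缺', '花無缺'],
                   'bug_level': ['A', 'B', 'A', 'B'],
                   'bug_count': [2, 7, 4, 6]})

# 先按(name, bug_level)分組求和,再把bug_level這層索引展開成列
out = df.groupby(['name', 'bug_level'])['bug_count'].sum().unstack()
print(out)
```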
 

當然,這裡的聚合可以非常靈活:

 
In [161]:
df.pivot_table(index=['name'], columns=['bug_level'], aggfunc={
'bug_count': np.sum, 'score': [max, np.mean]})
Out[161]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead tr th { text-align: left }
.dataframe thead tr:last-of-type th { text-align: right }

  bug_count score
  sum max mean
bug_level A B C A B C A B C
name                  
江小魚 2 7 11 70 80 90 70 78 88
花無缺 4 6 6 72 92 88 70 87 88
 

還可以給每列每行加個彙總,如下所示:

In [162]:
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum, margins=True, margins_name='彙總')
Out[162]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

bug_level A B C 彙總
name        
江小魚 2 7 11 20
花無缺 4 6 6 16
彙總 6 13 17 36
 

交叉表

 

交叉表是用於統計分組頻率的特殊透視表。簡單來說,就是將兩個或者多個列中不重複的元素組成新DataFrame的行和列,行和列交叉位置上的值為該組合在原資料中出現的數量。

 

還是來個例子比較直觀。有如下學生選專業資料:

In [163]:
df = pd.DataFrame({'name': ['楊過', '小龍女', '郭靖', '黃蓉', '李尋歡', '孫小紅', '張無忌',
'趙敏', '令狐沖', '任盈盈'],
'gender': ['男', '女', '男', '女', '男', '女', '男', '女', '男', '女'],
'major': ['機械工程', '軟體工程', '金融工程', '工商管理', '機械工程', '金融工程', '軟體工程', '工商管理', '軟體工程', '工商管理']})
df
Out[163]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

  name gender major
0 楊過 男 機械工程
1 小龍女 女 軟體工程
2 郭靖 男 金融工程
3 黃蓉 女 工商管理
4 李尋歡 男 機械工程
5 孫小紅 女 金融工程
6 張無忌 男 軟體工程
7 趙敏 女 工商管理
8 令狐沖 男 軟體工程
9 任盈盈 女 工商管理
 

若想了解學生選專業是否與性別有關,可以做如下統計:

In [164]:
pd.crosstab(df['gender'], df['major'])
Out[164]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

major 工商管理 機械工程 軟體工程 金融工程
gender        
女 3 0 1 1
男 0 2 2 1
 

同時,回憶一下上篇講到的 https://www.cnblogs.com/bytesfly/p/pandas-1.html#畫圖

In [165]:
# 男、女生填報專業比例餅狀圖
pd.crosstab(df['gender'], df['major']).T.plot(
kind='pie', subplots=True, figsize=(12, 8), autopct="%.0f%%")
plt.show()
 
 

換個角度看下:

In [166]:
# 各專業男女生填報人數柱狀圖
pd.crosstab(df['gender'], df['major']).T.plot(
kind='bar', stacked=True, rot=0, title='各專業男女生填報人數柱狀圖', xlabel='', figsize=(10, 6))
plt.show()
 
 

再回到上面所講的交叉表相關知識。

In [167]:
# 對交叉結果進行歸一化
pd.crosstab(df['gender'], df['major'], normalize=True)
Out[167]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

major 工商管理 機械工程 軟體工程 金融工程
gender        
女 0.3 0.0 0.1 0.1
男 0.0 0.2 0.2 0.1
In [168]:
# 對交叉結果按行進行歸一化
pd.crosstab(df['gender'], df['major'], normalize='index')
Out[168]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

major 工商管理 機械工程 軟體工程 金融工程
gender        
女 0.6 0.0 0.2 0.2
男 0.0 0.4 0.4 0.2
In [169]:
# 對交叉結果按列進行歸一化
pd.crosstab(df['gender'], df['major'], normalize='columns')
Out[169]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

major 工商管理 機械工程 軟體工程 金融工程
gender        
女 1.0 0.0 0.333333 0.5
男 0.0 1.0 0.666667 0.5
 

同樣,也可以給每列每行加個彙總,如下:

In [170]:
pd.crosstab(df['gender'], df['major'], margins=True, margins_name='彙總')
Out[170]:
 

.dataframe tbody tr th:only-of-type { vertical-align: middle }
.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

major 工商管理 機械工程 軟體工程 金融工程 彙總
gender          
女 3 0 1 1 5
男 0 2 2 1 5
彙總 3 2 3 2 10
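交叉表預設統計的是頻數,如果傳入values和aggfunc,交叉位置上就會變成對values的聚合值。下面是一個小示例(score列為虛構的演示資料,非原文程式碼):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['男', '女', '男', '女'],
                   'major': ['機械工程', '軟體工程', '機械工程', '軟體工程'],
                   'score': [80, 90, 70, 60]})

# 交叉位置統計的不再是人數,而是score的平均值
out = pd.crosstab(df['gender'], df['major'],
                  values=df['score'], aggfunc='mean')
print(out)
```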
 
