As before, start by importing the modules that will be used frequently below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('font', family='Arial Unicode MS')
plt.rc('axes', unicode_minus=False)
pd.__version__
'1.1.3'
Note: I'm on macOS, so when plotting with matplotlib I set the font to Arial Unicode MS for Chinese character support. On a deepin system you can use WenQuanYi Micro Hei instead, i.e.:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# plt.rc('font', family='WenQuanYi Micro Hei')
# plt.rc('axes', unicode_minus=False)
If Chinese characters come out garbled when plotting on another system, the following lines list the fonts matplotlib can see, so you can pick one that supports Chinese yourself:
# from matplotlib.font_manager import FontManager
# fonts = set([x.name for x in FontManager().ttflist])
# print(fonts)
Without further ado, let's continue with pandas.

Data merging

In real-world work you often need to combine several datasets or files before analyzing them.
concat
Signature:
pd.concat(
objs: Union[Iterable[~FrameOrSeries], Mapping[Union[Hashable, NoneType], ~FrameOrSeries]],
axis=0,
join='outer',
ignore_index: bool = False,
keys=None,
levels=None,
names=None,
verify_integrity: bool = False,
sort: bool = False,
copy: bool = True,
) -> Union[ForwardRef('DataFrame'), ForwardRef('Series')]
Docstring:
Concatenate pandas objects along a particular axis with optional set logic
along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.
Preparing the data:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df1
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df2
A | B | C | D | |
---|---|---|---|---|
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df3
A | B | C | D | |
---|---|---|---|---|
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
Basic concatenation:
# Combine three frames that share the same columns
pd.concat([df1, df2, df3])
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
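The signature above also lists verify_integrity; when it is True, concat raises if the concatenation axis would end up with duplicate labels. A minimal sketch (the two one-row frames are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({'A': ['A0']}, index=[0])
b = pd.DataFrame({'A': ['A1']}, index=[0])  # same index label as a

try:
    pd.concat([a, b], verify_integrity=True)
    msg = ''
except ValueError as e:
    msg = str(e)  # reports the overlapping index values

# Without the check, the duplicate label is silently kept
dup = pd.concat([a, b])
```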
You can also give each frame a top-level key, creating a hierarchical (multi-level) index:
pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
A | B | C | D | ||
---|---|---|---|---|---|
x | 0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 | |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
y | 4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 | |
6 | A6 | B6 | C6 | D6 | |
7 | A7 | B7 | C7 | D7 | |
z | 8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 | |
10 | A10 | B10 | C10 | D10 | |
11 | A11 | B11 | C11 | D11 |
This is equivalent to passing a dict:
pd.concat({'x': df1, 'y': df2, 'z': df3})
A | B | C | D | ||
---|---|---|---|---|---|
x | 0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 | |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
y | 4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 | |
6 | A6 | B6 | C6 | D6 | |
7 | A7 | B7 | C7 | D7 | |
z | 8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 | |
10 | A10 | B10 | C10 | D10 | |
11 | A11 | B11 | C11 | D11 |
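A nice property of the keyed result is that each original frame can be pulled back out with .loc on the first index level. A small sketch (with two stand-in frames shaped like df1 and df2):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3']}, index=[2, 3])

combined = pd.concat([df1, df2], keys=['x', 'y'])

# Selecting on the first level recovers one of the original frames
recovered = combined.loc['y']
```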
To drop the original indexes when combining and use a fresh integer index instead:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
df4
B | D | F | |
---|---|---|---|
2 | B2 | D2 | F2 |
3 | B3 | D3 | F3 |
6 | B6 | D6 | F6 |
7 | B7 | D7 | F7 |
pd.concat([df1, df4], ignore_index=True, sort=False)
A | B | C | D | F | |
---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN |
1 | A1 | B1 | C1 | D1 | NaN |
2 | A2 | B2 | C2 | D2 | NaN |
3 | A3 | B3 | C3 | D3 | NaN |
4 | NaN | B2 | NaN | D2 | F2 |
5 | NaN | B3 | NaN | D3 | F3 |
6 | NaN | B6 | NaN | D6 | F6 |
7 | NaN | B7 | NaN | D7 | F7 |
Is there an equivalent of a database outer join?
pd.concat([df1, df4], axis=1, sort=False)
A | B | C | D | B | D | F | |
---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | D2 | F2 |
3 | A3 | B3 | C3 | D3 | B3 | D3 | F3 |
6 | NaN | NaN | NaN | NaN | B6 | D6 | F6 |
7 | NaN | NaN | NaN | NaN | B7 | D7 | F7 |
Which naturally brings to mind an inner join:
pd.concat([df1, df4], axis=1, join='inner')
A | B | C | D | B | D | F | |
---|---|---|---|---|---|---|---|
2 | A2 | B2 | C2 | D2 | B2 | D2 | F2 |
3 | A3 | B3 | C3 | D3 | B3 | D3 | F3 |
join : {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis (or axes).
Neither left join nor right join appears among the options, so how do we get a left-join effect? Reindex the second frame to the first frame's index:
pd.concat([df1, df4.reindex(df1.index)], axis=1)
A | B | C | D | B | D | F | |
---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | D2 | F2 |
3 | A3 | B3 | C3 | D3 | B3 | D3 | F3 |
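By the same trick, a right-join effect comes from reindexing the other way, i.e. aligning df1 onto df4's index. A sketch with stand-in single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']}, index=[0, 1, 2, 3])
df4 = pd.DataFrame({'F': ['F2', 'F3', 'F6', 'F7']}, index=[2, 3, 6, 7])

# Keep exactly df4's rows; df1's columns become NaN where it has no row
right_joined = pd.concat([df1.reindex(df4.index), df4], axis=1)
```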
Concatenating with a Series:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
pd.concat([df1, s1], axis=1)
A | B | C | D | X | |
---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | X0 |
1 | A1 | B1 | C1 | D1 | X1 |
2 | A2 | B2 | C2 | D2 | X2 |
3 | A3 | B3 | C3 | D3 | X3 |
Of course, you could also use df.assign() to define the new column.
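For reference, the df.assign() route might look like this (a sketch; the stand-in df1 and s1 mirror the frames defined above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']})
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')

# assign returns a new frame with the extra column; df1 is unchanged
with_x = df1.assign(X=s1)
```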
If a Series has no name, it is automatically assigned a positional integer column name, as follows:
s2 = pd.Series(['_A', '_B', '_C', '_D'])
s3 = pd.Series(['_a', '_b', '_c', '_d'])
pd.concat([df1, s2, s3], axis=1)
A | B | C | D | 0 | 1 | |
---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | _A | _a |
1 | A1 | B1 | C1 | D1 | _B | _b |
2 | A2 | B2 | C2 | D2 | _C | _c |
3 | A3 | B3 | C3 | D3 | _D | _d |
ignore_index=True discards the original column names:
pd.concat([df1, s1], axis=1, ignore_index=True)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | X0 |
1 | A1 | B1 | C1 | D1 | X1 |
2 | A2 | B2 | C2 | D2 | X2 |
3 | A3 | B3 | C3 | D3 | X3 |
Likewise, multiple Series can be combined:
s3 = pd.Series(['李尋歡', '令狐沖', '張無忌', '花無缺'])
s4 = pd.Series(['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '絕代雙驕'])
s5 = pd.Series(['小李飛刀', '獨孤九劍', '九陽神功', '移花接玉'])
pd.concat([s3, s4, s5], axis=1)
0 | 1 | 2 | |
---|---|---|---|
0 | 李尋歡 | 多情劍客無情劍 | 小李飛刀 |
1 | 令狐沖 | 笑傲江湖 | 獨孤九劍 |
2 | 張無忌 | 倚天屠龍記 | 九陽神功 |
3 | 花無缺 | 絕代雙驕 | 移花接玉 |
You can also pass keys to supply new column names:
pd.concat([s3, s4, s5], axis=1, keys=['name', 'book', 'skill'])
name | book | skill | |
---|---|---|---|
0 | 李尋歡 | 多情劍客無情劍 | 小李飛刀 |
1 | 令狐沖 | 笑傲江湖 | 獨孤九劍 |
2 | 張無忌 | 倚天屠龍記 | 九陽神功 |
3 | 花無缺 | 絕代雙驕 | 移花接玉 |
merge
Signature:
pd.merge(
left,
right,
how: str = 'inner',
on=None,
left_on=None,
right_on=None,
left_index: bool = False,
right_index: bool = False,
sort: bool = False,
suffixes=('_x', '_y'),
copy: bool = True,
indicator: bool = False,
validate=None,
) -> 'DataFrame'
Docstring:
Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.

Parameters
----------
----------
left : DataFrame
right : DataFrame or named Series
Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
Type of merge to be performed.
* left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
* right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
* outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : bool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
`left` and `right` respectively. Pass a value of `None` instead
of a string to indicate that the column name from `left` or
`right` should be left as-is, with no suffix. At least one of the
values must not be None.
copy : bool, default True
If False, avoid copy if possible.
indicator : bool or str, default False
If True, adds a column to the output DataFrame called "_merge" with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of "left_only" for observations whose merge key only
appears in the left DataFrame, "right_only" for observations
whose merge key only appears in the right DataFrame, and "both"
if the observation's merge key is found in both DataFrames.
validate : str, optional
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left
dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right
dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
merge here is closely analogous to a join in a relational database. Let's see how to use it through examples.

on: the field to join on, which must exist in both DataFrames (if it doesn't exist in both, set left_on and right_on separately instead):
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
Multiple join keys are also supported:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
how: specifies how the merge is performed; entries with no match become NaN. The default is inner.

We didn't pass how above, so inner was used. Below we look at left, right, and outer in turn.
Left outer join:
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
Right outer join:
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Full outer join:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
5 | K2 | K0 | NaN | NaN | C3 | D3 |
If indicator is set to True, a column named _merge is added showing where each row came from: left_only means the key appears only in the left frame, right_only only in the right frame, and both means it appears in both:
pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
5 | K2 | K0 | NaN | NaN | C3 | D3 | right_only |
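Per the docstring above, indicator also accepts a string, which names the indicator column instead of the default _merge. A sketch with made-up two-row frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
right = pd.DataFrame({'key': ['K1', 'K2'], 'B': [3, 4]})

# A string argument renames the indicator column
out = pd.merge(left, right, on='key', how='outer', indicator='source')
```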
When the join fields have different names on the left and right, set left_on and right_on separately:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, left_on='key1', right_on='key2')
key1 | A | B | key2 | C | D | |
---|---|---|---|---|---|---|
0 | K0 | A0 | B0 | K0 | C0 | D0 |
1 | K1 | A1 | B1 | K1 | C1 | D1 |
2 | K2 | A2 | B2 | K2 | C2 | D2 |
3 | K3 | A3 | B3 | K3 | C3 | D3 |
What happens when a non-key column has the same name in both frames?
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'B': [30, 50, 70, 90]})
pd.merge(left, right, left_on='key1', right_on='key2')
key1 | A | B_x | key2 | C | B_y | |
---|---|---|---|---|---|---|
0 | K0 | A0 | B0 | K0 | C0 | 30 |
1 | K1 | A1 | B1 | K1 | C1 | 50 |
2 | K2 | A2 | B2 | K2 | C2 | 70 |
3 | K3 | A3 | B3 | K3 | C3 | 90 |
The default is suffixes=('_x', '_y'), which you can override:
pd.merge(left, right, left_on='key1', right_on='key2',
suffixes=('_left', '_right'))
key1 | A | B_left | key2 | C | B_right | |
---|---|---|---|---|---|---|
0 | K0 | A0 | B0 | K0 | C0 | 30 |
1 | K1 | A1 | B1 | K1 | C1 | 50 |
2 | K2 | A2 | B2 | K2 | C2 | 70 |
3 | K3 | A3 | B3 | K3 | C3 | 90 |
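One more option from the signature worth knowing is validate, which makes merge check key uniqueness and raise pandas.errors.MergeError when the check fails. A sketch (the frames are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K0'], 'B': [3, 4]})  # duplicate key

try:
    # '1:1' requires the keys to be unique on both sides
    pd.merge(left, right, on='key', validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
```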
append
append adds rows and returns a new object; it is another simple and commonly used way to combine data.
Signature:
df.append(other, ignore_index=False, verify_integrity=False, sort=False) -> 'DataFrame'
Docstring:
Append rows of `other` to the end of caller, returning a new object.
Parameter reference:
other: the DataFrame or Series to append
ignore_index: if True, reset to a fresh integer index
verify_integrity: if True, raise an error on duplicate index values
sort: whether to sort the columns
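A heads-up before relying on append: it was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same result comes from pd.concat:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3']}, index=[2, 3])

# Equivalent to df1.append(df2), but works on every pandas version
stacked = pd.concat([df1, df2])
```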
Appending data with the same structure:
df1.append(df2)
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
Appending data with a different structure: columns that don't exist yet are added, and cells with no corresponding content become NaN:
df1.append(df4, sort=False)
A | B | C | D | F | |
---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN |
1 | A1 | B1 | C1 | D1 | NaN |
2 | A2 | B2 | C2 | D2 | NaN |
3 | A3 | B3 | C3 | D3 | NaN |
2 | NaN | B2 | NaN | D2 | F2 |
3 | NaN | B3 | NaN | D3 | F3 |
6 | NaN | B6 | NaN | D6 | F6 |
7 | NaN | B7 | NaN | D7 | F7 |
Appending multiple DataFrames:
df1.append([df2, df3])
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
Discarding the original index:
df1.append(df4, ignore_index=True, sort=False)
A | B | C | D | F | |
---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN |
1 | A1 | B1 | C1 | D1 | NaN |
2 | A2 | B2 | C2 | D2 | NaN |
3 | A3 | B3 | C3 | D3 | NaN |
4 | NaN | B2 | NaN | D2 | F2 |
5 | NaN | B3 | NaN | D3 | F3 |
6 | NaN | B6 | NaN | D6 | F6 |
7 | NaN | B7 | NaN | D7 | F7 |
Appending a Series:
s2 = pd.Series(['X0', 'X1', 'X2', 'X3'],
               index=['A', 'B', 'C', 'D'])
df1.append(s2, ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | X0 | X1 | X2 | X3 |
Appending a list of dicts:
d = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
     {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
df1.append(d, ignore_index=True, sort=False)
A | B | C | D | X | Y | |
---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN |
4 | 1 | 2 | 3 | NaN | 4.0 | NaN |
5 | 5 | 6 | 7 | NaN | NaN | 8.0 |
Now for a practical case. In Excel it's common to add a summary row at the bottom of the data, such as a sum or an average. How do we do that with pandas?
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 4)),
columns=['a', 'b', 'c', 'd'])
df
a | b | c | d | |
---|---|---|---|---|
0 | 8 | 5 | 6 | 1 |
1 | 6 | 2 | 8 | 5 |
2 | 7 | 2 | 2 | 3 |
df.append(pd.Series(df.sum(), name='total'))
a | b | c | d | |
---|---|---|---|---|
0 | 8 | 5 | 6 | 1 |
1 | 6 | 2 | 8 | 5 |
2 | 7 | 2 | 2 | 3 |
total | 21 | 9 | 16 | 9 |
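An alternative way to add the summary row is to assign df.sum() to a new index label with .loc; unlike append, this mutates the frame in place. A sketch with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Assigning to a new label appends the row in place
df.loc['total'] = df.sum()
```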
Data cleaning

Data cleaning is the process of detecting and correcting identifiable errors in a dataset: checking data consistency and handling invalid and missing values. Its goal is to maximize the accuracy of the data.

Missing values

Because data sources are complex and uncertain, incomplete and missing field values are inevitable. Let's first look at how to find them.
We'll use a movie dataset as the example. It comes from GitHub; for easier testing I zipped it and re-hosted it on cnblogs. The original data is at:
https://github.com/LearnDataSci/articles/blob/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv
movies = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/IMDB-Movie-Data.csv.zip')
movies.tail(2)
Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
Above you can see that the Revenue (Millions) column has a NaN; the NaN here marks a missing value.
NaN (not a number) is the standard missing data marker used in pandas.
Check whether the dataset has any missing values:
np.any(movies.isna())
True
Or check whether the dataset has no missing values at all:
np.all(movies.notna())
False
Count the missing values in each column:
movies.isna().sum(axis=0)
Rank 0
Title 0
Genre 0
Description 0
Director 0
Actors 0
Year 0
Runtime (Minutes) 0
Rating 0
Votes 0
Revenue (Millions) 128
Metascore 64
dtype: int64
Count the missing values in each row:
movies.isna().sum(axis=1)
0 0
1 0
2 0
3 0
4 0
..
995 1
996 0
997 0
998 1
999 0
Length: 1000, dtype: int64
Count the total number of missing values:
movies.isna().sum().sum()
192
Find the columns that contain missing values:
movies.isnull().any(axis=0)
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
Count how many columns contain missing values:
movies.isnull().any(axis=0).sum()
2
Find the rows that contain missing values:
movies.isnull().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
995 True
996 False
997 False
998 True
999 False
Length: 1000, dtype: bool
Count how many rows contain missing values:
movies.isnull().any(axis=1).sum()
162
Look at the rows where the Metascore column is missing:
movies[movies['Metascore'].isna()][['Rank', 'Title', 'Votes', 'Metascore']]
Rank | Title | Votes | Metascore | |
---|---|---|---|---|
25 | 26 | Paris pieds nus | 222 | NaN |
26 | 27 | Bahubali: The Beginning | 76193 | NaN |
27 | 28 | Dead Awake | 523 | NaN |
39 | 40 | 5- 25- 77 | 241 | NaN |
42 | 43 | Don't Fuck in the Woods | 496 | NaN |
... | ... | ... | ... | ... |
967 | 968 | The Walk | 92378 | NaN |
969 | 970 | The Lone Ranger | 190855 | NaN |
971 | 972 | Disturbia | 193491 | NaN |
989 | 990 | Selma | 67637 | NaN |
992 | 993 | Take Me Home Tonight | 45419 | NaN |
64 rows × 4 columns
movies.shape
(1000, 12)
movies.count()
Rank 1000
Title 1000
Genre 1000
Description 1000
Director 1000
Actors 1000
Year 1000
Runtime (Minutes) 1000
Rating 1000
Votes 1000
Revenue (Millions) 872
Metascore 936
dtype: int64
As you can see, each column should have 1000 values, but count() does not include the missing ones.

How to handle missing values depends on the business scenario and how complete the data needs to be. Common approaches are dropping the data containing missing values (dropna) and replacing the missing values (fillna).
Dropping

In some scenarios a sample with missing values is considered invalid, and the whole row or column needs to be dropped (dropna).
Count the rows with no missing values:
movies.notna().all(axis=1).sum()
838
Drop all rows that contain missing values:
data = movies.dropna()
data.shape
(838, 12)
movies.shape
(1000, 12)
All 838 rows in data are free of missing values. Note, however, that dropna() does not by default delete from the original dataset, unless you specify dropna(inplace=True). A quick demonstration:
# Copy the full dataset to demonstrate in-place dropping
movies_copy = movies.copy()
print(movies_copy.shape)

# Drop in place
movies_copy.dropna(inplace=True)

# Check that the in-place drop took effect
movies_copy.shape
(1000, 12)
(838, 12)
Count the columns with no missing values:
movies.notna().all(axis=0).sum()
10
Drop the columns that contain missing values:
data = movies.dropna(axis=1)
data.shape
(1000, 10)
how : {'any', 'all'}, default 'any'. Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- 'any' : If any NA values are present, drop that row or column.
- 'all' : If all values are NA, drop that row or column.
Drop the rows where all values are missing:
# Drop the rows where all elements are missing
data = movies.dropna(how='all')
data.shape
(1000, 12)
This dataset has no rows where every value is missing, so dropna() with how='all' has no effect here.
subset: Define in which columns to look for missing values.
The subset parameter specifies which columns to search for missing values:
# Look for missing values only in the Title and Metascore columns
data = movies.dropna(subset=['Title', 'Metascore'])
data.shape
(936, 12)
Counting the rows where both Title and Metascore are non-missing should, naturally, match the row count after the drop above:
movies[['Title', 'Metascore']].notna().all(axis=1).sum()
936
thresh : int, optional. Require that many non-NA values.
# Keep only the rows with at least 2 non-NA values.
data = movies[['Title', 'Metascore', 'Revenue (Millions)']].dropna(thresh=2)
data.shape
(970, 3)
Since the Title column has no missing values, this is equivalent to dropping the rows where both Metascore and Revenue (Millions) are missing:
data = movies[['Metascore', 'Revenue (Millions)']].dropna(how='all')
data.shape
(970, 2)
Filling

The other common way to handle missing values is filling (fillna). Filled values are not exactly accurate, but when their effect on the real result is small, they are worth trying.

Let's look at a simple example first and then apply it to the movie data above; the movie dataset is large, which makes it less intuitive for demonstration.
hero = pd.DataFrame(data={'score': [97, np.nan, 96, np.nan, 95],
'wins': [np.nan, 9, np.nan, 11, 10],
'author': ['古龍', '金庸', np.nan, np.nan, np.nan],
'book': ['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '射鵰英雄傳', '絕代雙驕'],
'skill': ['小李飛刀', '獨孤九劍', '九陽神功', '降龍十八掌', '移花接玉'],
'wife': [np.nan, '任盈盈', np.nan, '黃蓉', np.nan],
'child': [np.nan, np.nan, np.nan, '郭襄', np.nan]},
index=['李尋歡', '令狐沖', '張無忌', '郭靖', '花無缺'])
hero
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | NaN | 古龍 | 多情劍客無情劍 | 小李飛刀 | NaN | NaN |
令狐沖 | NaN | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | NaN |
張無忌 | 96.0 | NaN | NaN | 倚天屠龍記 | 九陽神功 | NaN | NaN |
郭靖 | NaN | 11.0 | NaN | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | NaN | 絕代雙驕 | 移花接玉 | NaN | NaN |
Fill all missing values with unknown:
hero.fillna('unknown')
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97 | unknown | 古龍 | 多情劍客無情劍 | 小李飛刀 | unknown | unknown |
令狐沖 | unknown | 9 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | unknown |
張無忌 | 96 | unknown | unknown | 倚天屠龍記 | 九陽神功 | unknown | unknown |
郭靖 | unknown | 11 | unknown | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95 | 10 | unknown | 絕代雙驕 | 移花接玉 | unknown | unknown |
Fill only the first missing value in each column (limit=1):
hero.fillna('unknown', limit=1)
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97 | unknown | 古龍 | 多情劍客無情劍 | 小李飛刀 | unknown | unknown |
令狐沖 | unknown | 9 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | NaN |
張無忌 | 96 | NaN | unknown | 倚天屠龍記 | 九陽神功 | NaN | NaN |
郭靖 | NaN | 11 | NaN | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95 | 10 | NaN | 絕代雙驕 | 移花接玉 | NaN | NaN |
Fill different columns with different values:
hero.fillna(value={'score': 100, 'author': '匿名', 'wife': '保密'})
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | NaN | 古龍 | 多情劍客無情劍 | 小李飛刀 | 保密 | NaN |
令狐沖 | 100.0 | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | NaN |
張無忌 | 96.0 | NaN | 匿名 | 倚天屠龍記 | 九陽神功 | 保密 | NaN |
郭靖 | 100.0 | 11.0 | 匿名 | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | 匿名 | 絕代雙驕 | 移花接玉 | 保密 | NaN |
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
The examples above fill fixed values; you can also specify a fill method.
pad / ffill: propagate last valid observation forward to next valid
# Fill forward with the previous valid value
hero.fillna(method='ffill')
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | NaN | 古龍 | 多情劍客無情劍 | 小李飛刀 | NaN | NaN |
令狐沖 | 97.0 | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | NaN |
張無忌 | 96.0 | 9.0 | 金庸 | 倚天屠龍記 | 九陽神功 | 任盈盈 | NaN |
郭靖 | 96.0 | 11.0 | 金庸 | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | 金庸 | 絕代雙驕 | 移花接玉 | 黃蓉 | 郭襄 |
backfill / bfill: use next valid observation to fill gap.
# Fill backward with the next valid value
hero.fillna(method='bfill')
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | 9.0 | 古龍 | 多情劍客無情劍 | 小李飛刀 | 任盈盈 | 郭襄 |
令狐沖 | 96.0 | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | 郭襄 |
張無忌 | 96.0 | 11.0 | NaN | 倚天屠龍記 | 九陽神功 | 黃蓉 | 郭襄 |
郭靖 | 95.0 | 11.0 | NaN | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | NaN | 絕代雙驕 | 移花接玉 | NaN | NaN |
Fill the missing values in the score and wins columns with their means:
hero.fillna(hero[['score', 'wins']].mean())
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | 10.0 | 古龍 | 多情劍客無情劍 | 小李飛刀 | NaN | NaN |
令狐沖 | 96.0 | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | NaN |
張無忌 | 96.0 | 10.0 | NaN | 倚天屠龍記 | 九陽神功 | NaN | NaN |
郭靖 | 96.0 | 11.0 | NaN | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | NaN | 絕代雙驕 | 移花接玉 | NaN | NaN |
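Since the mean is sensitive to outliers, filling with the column median is a common alternative and plugs in the same way. A sketch on a toy column:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'score': [97.0, np.nan, 96.0, np.nan, 95.0]})

# Fill with the column median instead of the mean
filled = scores.fillna(scores[['score']].median())
```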
Fill a specific column with unknown, replacing in place:
hero.child.fillna('unknown', inplace=True)
hero
score | wins | author | book | skill | wife | child | |
---|---|---|---|---|---|---|---|
李尋歡 | 97.0 | NaN | 古龍 | 多情劍客無情劍 | 小李飛刀 | NaN | unknown |
令狐沖 | NaN | 9.0 | 金庸 | 笑傲江湖 | 獨孤九劍 | 任盈盈 | unknown |
張無忌 | 96.0 | NaN | NaN | 倚天屠龍記 | 九陽神功 | NaN | unknown |
郭靖 | NaN | 11.0 | NaN | 射鵰英雄傳 | 降龍十八掌 | 黃蓉 | 郭襄 |
花無缺 | 95.0 | 10.0 | NaN | 絕代雙驕 | 移花接玉 | NaN | unknown |
再回到上面實際的電影資料案例,現在用平均值替換缺失值:
filled_movies = movies.fillna(
    movies[['Revenue (Millions)', 'Metascore']].mean())
np.any(filled_movies.isna())
False
可見,填充後的電影資料中已經不存在缺失值了。
資料替換
資料替換常用於資料清洗整理、列舉轉換、資料修正等場景。
先看下replace()方法的介紹:
Signature:
replace(
    to_replace=None,
    value=None,
    inplace=False,
    limit=None,
    regex=False,
    method='pad',
)

Docstring:
Replace values given in `to_replace` with `value`.
再看幾個例子:
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
s
0 a
1 b
2 c
3 d
4 e
dtype: object
s.replace('a', 'aa')
0 aa
1 b
2 c
3 d
4 e
dtype: object
s.replace({'d': 'dd', 'e': 'ee'})
0 a
1 b
2 c
3 dd
4 ee
dtype: object
s.replace(['a', 'b', 'c'], ['aa', 'bb', 'cc'])
0 aa
1 bb
2 cc
3 d
4 e
dtype: object
# 將c替換為它前一個值
s.replace('c', method='ffill')
0 a
1 b
2 b
3 d
4 e
dtype: object
# 將c替換為它後一個值
s.replace('c', method='bfill')
0 a
1 b
2 d
3 d
4 e
dtype: object
df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
'B': [0, -20, 2, 5, 16]})
df
A | B | |
---|---|---|
0 | 0 | 0 |
1 | -11 | -20 |
2 | 2 | 2 |
3 | 3 | 5 |
4 | 35 | 16 |
df.replace(0, 5)
A | B | |
---|---|---|
0 | 5 | 5 |
1 | -11 | -20 |
2 | 2 | 2 |
3 | 3 | 5 |
4 | 35 | 16 |
df.replace([0, 2, 3, 5], 10)
A | B | |
---|---|---|
0 | 10 | 10 |
1 | -11 | -20 |
2 | 10 | 10 |
3 | 10 | 10 |
4 | 35 | 16 |
df.replace([0, 2], [100, 200])
A | B | |
---|---|---|
0 | 100 | 100 |
1 | -11 | -20 |
2 | 200 | 200 |
3 | 3 | 5 |
4 | 35 | 16 |
df.replace({0: 10, 2: 22})
A | B | |
---|---|---|
0 | 10 | 10 |
1 | -11 | -20 |
2 | 22 | 22 |
3 | 3 | 5 |
4 | 35 | 16 |
df.replace({'A': 0, 'B': 2}, 100)
A | B | |
---|---|---|
0 | 100 | 0 |
1 | -11 | -20 |
2 | 2 | 100 |
3 | 3 | 5 |
4 | 35 | 16 |
df.replace({'A': {2: 200, 3: 300}})
A | B | |
---|---|---|
0 | 0 | 0 |
1 | -11 | -20 |
2 | 200 | 2 |
3 | 300 | 5 |
4 | 35 | 16 |
對一些極端值,如過大或者過小,可以使用df.clip(lower, upper)來修剪:當資料大於upper時,使用upper的值;小於lower時,用lower的值,類似numpy.clip的方法。
在修剪之前,再看一眼原始資料:
df
A | B | |
---|---|---|
0 | 0 | 0 |
1 | -11 | -20 |
2 | 2 | 2 |
3 | 3 | 5 |
4 | 35 | 16 |
# 修剪成最小為2,最大為10
df.clip(2, 10)
A | B | |
---|---|---|
0 | 2 | 2 |
1 | 2 | 2 |
2 | 2 | 2 |
3 | 3 | 5 |
4 | 10 | 10 |
# 對每列元素的最小值和最大值進行不同的限制
# 將A列數值修剪成[-3, 3]之間
# 將B列數值修剪成[-5, 5]之間
df.clip([-3, -5], [3, 5], axis=1)
A | B | |
---|---|---|
0 | 0 | 0 |
1 | -3 | -5 |
2 | 2 | 2 |
3 | 3 | 5 |
4 | 3 | 5 |
# 對每行元素的最小值和最大值進行不同的限制
# 將第1行數值修剪成[5, 10]之間
# 將第2行數值修剪成[-15, -12]之間
# 將第3行數值修剪成[6, 10]之間
# 將第4行數值修剪成[4, 10]之間
# 將第5行數值修剪成[20, 30]之間
df.clip([5, -15, 6, 4, 20],
[10, -12, 10, 10, 30],
axis=0)
A | B | |
---|---|---|
0 | 5 | 5 |
1 | -12 | -15 |
2 | 6 | 6 |
3 | 4 | 5 |
4 | 30 | 20 |
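順帶一提,clip()的lower和upper也可以只給其中一個,未給的一側不做修剪。下面是一個小示例(資料同上面的df):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
                   'B': [0, -20, 2, 5, 16]})

# 只限制下限:小於0的值全部變為0,上限不做修剪
clipped = df.clip(lower=0)
```

這在「把負數歸零」這類場景下很常用。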
另外,可以將無效值先替換為nan,再做缺失值處理,這樣就能應用上前面講到的缺失值處理相關的知識。比如這裡的df,我們認為小於0的資料都是無效資料,可以:
df.replace([-11, -20], np.nan)
A | B | |
---|---|---|
0 | 0.0 | 0.0 |
1 | NaN | NaN |
2 | 2.0 | 2.0 |
3 | 3.0 | 5.0 |
4 | 35.0 | 16.0 |
當然,也可以像下面這樣把無效資料變為nan:
df[df >= 0]
A | B | |
---|---|---|
0 | 0.0 | 0.0 |
1 | NaN | NaN |
2 | 2.0 | 2.0 |
3 | 3.0 | 5.0 |
4 | 35.0 | 16.0 |
此時,上面講到的缺失值處理就能派上用場了。
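把上面兩步串起來,「無效值→NaN→填充」的完整小流程大致如下(資料同上面的df,用列平均值填充只是其中一種選擇):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
                   'B': [0, -20, 2, 5, 16]})

# 第一步:小於0的無效值變為NaN
masked = df[df >= 0]

# 第二步:用各列的平均值(忽略NaN)填充
cleaned = masked.fillna(masked.mean())
```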
文字內容比較複雜時,可以使用正則進行匹配替換。下面看幾個例子:
df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
'B': ['abc', 'bar', 'xyz']})
df
A | B | |
---|---|---|
0 | bat | abc |
1 | foo | bar |
2 | bait | xyz |
# 利用正則將ba開頭且總共3個字元的文字替換為new
df.replace(to_replace=r'^ba.$', value='new', regex=True)
A | B | |
---|---|---|
0 | new | abc |
1 | foo | new |
2 | bait | xyz |
# 如果多列正則不同的情況下可以按以下格式對應傳入
df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
A | B | |
---|---|---|
0 | new | abc |
1 | foo | bar |
2 | bait | xyz |
df.replace(regex=r'^ba.$', value='new')
A | B | |
---|---|---|
0 | new | abc |
1 | foo | new |
2 | bait | xyz |
# 不同正則替換不同的值
df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
A | B | |
---|---|---|
0 | new | abc |
1 | xyz | new |
2 | bait | xyz |
# 多個正則替換為同一個值
df.replace(regex=[r'^ba.$', 'foo'], value='new')
A | B | |
---|---|---|
0 | new | abc |
1 | new | new |
2 | bait | xyz |
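另外,正則替換的value中還可以使用反向引用(如\1)保留捕獲組的內容,行為與re.sub一致。下面是一個示例(new_前綴僅作演示):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})

# 捕獲ba後面的那個字元,並在替換結果中保留它
out = df.replace(regex=r'^ba(.)$', value=r'new_\1')
```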
重複值
重複值在資料清洗中可能需要刪除。下面介紹Pandas如何識別重複值以及如何刪除重複值。
Signature:
df.duplicated(
    subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
    keep: Union[str, bool] = 'first',
) -> 'Series'

Docstring:
Return boolean Series denoting duplicate rows. Considering certain columns is optional.

Parameters
----------
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by
    default use all of the columns.
keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to mark.

    - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
    - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
    - False : Mark all duplicates as ``True``.
看官方給的例子:
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand | style | rating | |
---|---|---|---|
0 | Yum Yum | cup | 4.0 |
1 | Yum Yum | cup | 4.0 |
2 | Indomie | cup | 3.5 |
3 | Indomie | pack | 15.0 |
4 | Indomie | pack | 5.0 |
# 預設情況下,對於每行重複的值,第一次出現都設定為False,其他為True
df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
# 將每行重複值的最後一次出現設定為False,其他為True
df.duplicated(keep='last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
# 所有重複行都為True
df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
# 引數subset可以在指定列上查詢重複值
df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
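由於duplicated()返回的是布林Series,可以直接拿來對原資料做布林過濾,保留每組重複中的第一行:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# 取反後過濾,效果等同於預設引數的drop_duplicates()
deduped = df[~df.duplicated()]
```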
再看如何刪除重複值:
Signature:
df.drop_duplicates(
    subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
    keep: Union[str, bool] = 'first',
    inplace: bool = False,
    ignore_index: bool = False,
) -> Union[ForwardRef('DataFrame'), NoneType]

Docstring:
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters
----------
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by
    default use all of the columns.
keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to keep.

    - ``first`` : Drop duplicates except for the first occurrence.
    - ``last`` : Drop duplicates except for the last occurrence.
    - False : Drop all duplicates.
inplace : bool, default False
    Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.
同樣繼續官方給的例子:
# By default, it removes duplicate rows based on all columns
df.drop_duplicates()
brand | style | rating | |
---|---|---|---|
0 | Yum Yum | cup | 4.0 |
2 | Indomie | cup | 3.5 |
3 | Indomie | pack | 15.0 |
4 | Indomie | pack | 5.0 |
# To remove duplicates on specific column(s), use `subset`
df.drop_duplicates(subset=['brand'])
brand | style | rating | |
---|---|---|---|
0 | Yum Yum | cup | 4.0 |
2 | Indomie | cup | 3.5 |
# To remove duplicates and keep last occurrences, use `keep`
df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand | style | rating | |
---|---|---|---|
1 | Yum Yum | cup | 4.0 |
2 | Indomie | cup | 3.5 |
4 | Indomie | pack | 5.0 |
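補充一下ignore_index引數的效果:設為True時,去重後的結果會重新編號為0到n-1,而不是保留原來的行索引:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# 去重並重置索引
out = df.drop_duplicates(ignore_index=True)
```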
分組與聚合
在資料統計與分析中,分組與聚合非常常見。如果是SQL,對應的就是Group By和聚合函式(Aggregation Functions)。下面看看pandas是怎麼玩的。
Signature:
df.groupby(
by=None,
axis=0,
level=None,
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
squeeze: bool = <object object at 0x7f3df810e750>,
observed: bool = False,
dropna: bool = True,
) -> 'DataFrameGroupBy'

Docstring:
Group DataFrame using a mapper or by a Series of columns.
groupby()方法可以按指定欄位對DataFrame進行分組,生成一個分組器物件,然後再對這個物件的各個欄位按一定的聚合方法輸出。其中by為分組欄位,由於是第一個位置引數,可以省略引數名,也可以傳入列表按多個欄位分組。groupby()返回的是一個DataFrameGroupBy物件,如果不施加聚合方法,它並不會返回DataFrame。
準備演示資料:
df = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/team.csv.zip')
df
name | team | Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|---|---|
0 | Liver | E | 89 | 21 | 24 | 64 |
1 | Arry | C | 36 | 37 | 37 | 57 |
2 | Ack | A | 57 | 60 | 18 | 84 |
3 | Eorge | C | 93 | 96 | 71 | 78 |
4 | Oah | D | 65 | 49 | 61 | 86 |
... | ... | ... | ... | ... | ... | ... |
95 | Gabriel | C | 48 | 59 | 87 | 74 |
96 | Austin7 | C | 21 | 31 | 30 | 43 |
97 | Lincoln4 | C | 98 | 93 | 1 | 20 |
98 | Eli | E | 11 | 74 | 58 | 91 |
99 | Ben | E | 21 | 43 | 41 | 74 |
100 rows × 6 columns
# 按team分組後對應列求和
df.groupby('team').sum()
Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|
team | ||||
A | 1066 | 639 | 875 | 783 |
B | 975 | 1218 | 1202 | 1136 |
C | 1056 | 1194 | 1068 | 1127 |
D | 860 | 1191 | 1241 | 1199 |
E | 963 | 1013 | 881 | 1033 |
# 按team分組後對應列求平均值
df.groupby('team').mean()
Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|
team | ||||
A | 62.705882 | 37.588235 | 51.470588 | 46.058824 |
B | 44.318182 | 55.363636 | 54.636364 | 51.636364 |
C | 48.000000 | 54.272727 | 48.545455 | 51.227273 |
D | 45.263158 | 62.684211 | 65.315789 | 63.105263 |
E | 48.150000 | 50.650000 | 44.050000 | 51.650000 |
# 按team分組後不同列使用不同的聚合方式
df.groupby('team').agg({'Q1': sum, # 求和
'Q2': 'count', # 計數
'Q3': 'mean', # 求平均值
'Q4': max}) # 求最大值
Q1 | Q2 | Q3 | Q4 | |
---|---|---|---|---|
team | ||||
A | 1066 | 17 | 51.470588 | 97 |
B | 975 | 22 | 54.636364 | 99 |
C | 1056 | 22 | 48.545455 | 98 |
D | 860 | 19 | 65.315789 | 99 |
E | 963 | 20 | 44.050000 | 98 |
If by is a function, it's called on each value of the object's index.
# team在C之前(包括C)分為一組,C之後的分為另外一組
df.set_index('team').groupby(lambda team: 'team1' if team <= 'C' else 'team2')['name'].count()
team1 61
team2 39
Name: name, dtype: int64
或者下面這種寫法也行:
df.groupby(lambda idx: 'team1' if df.loc[idx]['team'] <= 'C' else 'team2')['name'].count()
team1 61
team2 39
Name: name, dtype: int64
# 按name的長度(length)分組,並取出每組中name的第一個值和最後一個值
df.groupby(df['name'].apply(lambda x: len(x))).agg({'name': ['first', 'last']})
name | ||
---|---|---|
first | last | |
name | ||
3 | Ack | Ben |
4 | Arry | Leon |
5 | Liver | Aiden |
6 | Harlie | Jamie0 |
7 | William | Austin7 |
8 | Harrison | Lincoln4 |
9 | Alexander | Theodore3 |
# 透過字典對映只對部分值分組,未對映的索引值會被排除
df.set_index('team').groupby({'A': 'A組', 'B': 'B組'})['name'].count()
A組 17
B組 22
Name: name, dtype: int64
可以將以上方法混合組成列表進行分組:
# 按team,name長度分組,取分組中最後一行
df.groupby(['team', df['name'].apply(lambda x: len(x))]).last()
name | Q1 | Q2 | Q3 | Q4 | ||
---|---|---|---|---|---|---|
team | name | |||||
A | 3 | Ack | 57 | 60 | 18 | 84 |
4 | Toby | 52 | 27 | 17 | 68 | |
5 | Aaron | 96 | 75 | 55 | 8 | |
6 | Nathan | 87 | 77 | 62 | 13 | |
7 | Stanley | 69 | 71 | 39 | 97 | |
B | 3 | Kai | 66 | 45 | 13 | 48 |
4 | Liam | 2 | 80 | 24 | 25 | |
5 | Lewis | 4 | 34 | 77 | 28 | |
6 | Jamie0 | 39 | 97 | 84 | 55 | |
7 | Albert0 | 85 | 38 | 41 | 17 | |
8 | Grayson7 | 59 | 84 | 74 | 33 | |
C | 4 | Adam | 90 | 32 | 47 | 39 |
5 | Calum | 14 | 91 | 16 | 82 | |
6 | Connor | 62 | 38 | 63 | 46 | |
7 | Austin7 | 21 | 31 | 30 | 43 | |
8 | Lincoln4 | 98 | 93 | 1 | 20 | |
9 | Sebastian | 1 | 14 | 68 | 48 | |
D | 3 | Oah | 65 | 49 | 61 | 86 |
4 | Ezra | 16 | 56 | 86 | 61 | |
5 | Aiden | 20 | 31 | 62 | 68 | |
6 | Reuben | 70 | 72 | 76 | 56 | |
7 | Hunter3 | 38 | 80 | 82 | 40 | |
8 | Benjamin | 15 | 88 | 52 | 25 | |
9 | Theodore3 | 43 | 7 | 68 | 80 | |
E | 3 | Ben | 21 | 43 | 41 | 74 |
4 | Leon | 38 | 60 | 31 | 7 | |
5 | Roman | 73 | 1 | 25 | 44 | |
6 | Dexter | 73 | 94 | 53 | 20 | |
7 | Zachary | 12 | 71 | 85 | 93 | |
8 | Jackson5 | 6 | 10 | 15 | 33 |
We can groupby different levels of a hierarchical index using the level parameter.
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
df
Max Speed | ||
---|---|---|
Animal | Type | |
Falcon | Captive | 390.0 |
Wild | 350.0 | |
Parrot | Captive | 30.0 |
Wild | 20.0 |
# df.groupby(level=0).mean()
df.groupby(level="Animal").mean()
Max Speed | |
---|---|
Animal | |
Falcon | 370.0 |
Parrot | 25.0 |
# df.groupby(level=1).mean()
df.groupby(level="Type").mean()
Max Speed | |
---|---|
Type | |
Captive | 210.0 |
Wild | 185.0 |
We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df
a | b | c | |
---|---|---|---|
0 | 1 | 2.0 | 3 |
1 | 1 | NaN | 4 |
2 | 2 | 1.0 | 3 |
3 | 1 | 2.0 | 2 |
df.groupby(by=["b"]).sum()
a | c | |
---|---|---|
b | ||
1.0 | 2 | 3 |
2.0 | 2 | 5 |
df.groupby(by=["b"], dropna=False).sum()
a | c | |
---|---|---|
b | ||
1.0 | 2 | 3 |
2.0 | 2 | 5 |
NaN | 1 | 4 |
上面體驗了一下pandas分組聚合的基本使用後,接下來看看分組聚合的一些過程細節。
分組
有以下動物最大速度資料:
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
('bird', 'Psittaciformes', 24.0),
('mammal', 'Carnivora', 80.2),
('mammal', 'Primates', np.nan),
('mammal', 'Carnivora', 58)],
index=['falcon', 'parrot', 'lion',
'monkey', 'leopard'],
columns=('class', 'order', 'max_speed'))
df
class | order | max_speed | |
---|---|---|---|
falcon | bird | Falconiformes | 389.0 |
parrot | bird | Psittaciformes | 24.0 |
lion | mammal | Carnivora | 80.2 |
monkey | mammal | Primates | NaN |
leopard | mammal | Carnivora | 58.0 |
# 分組數
df.groupby('class').ngroups
2
# 檢視分組
df.groupby('class').groups
{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
df.groupby('class').size()
class
bird 2
mammal 3
dtype: int64
# 檢視鳥類分組內容
df.groupby('class').get_group('bird')
class | order | max_speed | |
---|---|---|---|
falcon | bird | Falconiformes | 389.0 |
parrot | bird | Psittaciformes | 24.0 |
獲取分組中的第幾個值:
# 每組索引為1的元素(即第二個)
df.groupby('class').nth(1)
order | max_speed | |
---|---|---|
class | ||
bird | Psittaciformes | 24.0 |
mammal | Primates | NaN |
# 最後一個
df.groupby('class').nth(-1)
order | max_speed | |
---|---|---|
class | ||
bird | Psittaciformes | 24.0 |
mammal | Carnivora | 58.0 |
# 每組索引為1、2的元素(即第二、第三個)
df.groupby('class').nth([1, 2])
order | max_speed | |
---|---|---|
class | ||
bird | Psittaciformes | 24.0 |
mammal | Primates | NaN |
mammal | Carnivora | 58.0 |
# 每組顯示前2個
df.groupby('class').head(2)
class | order | max_speed | |
---|---|---|---|
falcon | bird | Falconiformes | 389.0 |
parrot | bird | Psittaciformes | 24.0 |
lion | mammal | Carnivora | 80.2 |
monkey | mammal | Primates | NaN |
# 每組最後2個
df.groupby('class').tail(2)
class | order | max_speed | |
---|---|---|---|
falcon | bird | Falconiformes | 389.0 |
parrot | bird | Psittaciformes | 24.0 |
monkey | mammal | Primates | NaN |
leopard | mammal | Carnivora | 58.0 |
# 分組序號
df.groupby('class').ngroup()
falcon 0
parrot 0
lion 1
monkey 1
leopard 1
dtype: int64
# 返回每個元素在所在組內的序號(這裡ascending=False,倒序編號)
df.groupby('class').cumcount(ascending=False)
falcon 1
parrot 0
lion 2
monkey 1
leopard 0
dtype: int64
# 按class列的首字母分組
df.groupby(df['class'].str[0]).groups
{'b': ['falcon', 'parrot'], 'm': ['lion', 'monkey', 'leopard']}
# 按class列的第一個字母和第二個字母分組
df.groupby([df['class'].str[0], df['class'].str[1]]).groups
{('b', 'i'): ['falcon', 'parrot'], ('m', 'a'): ['lion', 'monkey', 'leopard']}
# 在組內的排名
df.groupby('class').rank()
max_speed | |
---|---|
falcon | 2.0 |
parrot | 1.0 |
lion | 2.0 |
monkey | NaN |
leopard | 1.0 |
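組內排名rank()同樣支援ascending、method等引數。比如想要組內降序的密集名次,可以這樣(資料同上面的動物速度df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))

# 組內降序密集排名:速度最快的為1,NaN不參與排名
ranks = df.groupby('class')['max_speed'].rank(ascending=False, method='dense')
```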
聚合
對資料進行分組後,接下來就可以收穫果實了:給分組指定統計方法,最終得到分組聚合的結果。除了常見的數學統計方法,還可以使用agg()和transform()等函式進行操作。
# 描述性統計
df.groupby('class').describe()
max_speed | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
class | ||||||||
bird | 2.0 | 206.5 | 258.093975 | 24.0 | 115.25 | 206.5 | 297.75 | 389.0 |
mammal | 2.0 | 69.1 | 15.697771 | 58.0 | 63.55 | 69.1 | 74.65 | 80.2 |
# 一列使用多個聚合方法
df.groupby('class').agg({'max_speed': ['min', 'max', 'sum']})
max_speed | |||
---|---|---|---|
min | max | sum | |
class | |||
bird | 24.0 | 389.0 | 413.0 |
mammal | 58.0 | 80.2 | 138.2 |
df.groupby('class')['max_speed'].agg(
Max='max', Min='min', Diff=lambda x: x.max() - x.min())
Max | Min | Diff | |
---|---|---|---|
class | |||
bird | 389.0 | 24.0 | 365.0 |
mammal | 80.2 | 58.0 | 22.2 |
df.groupby('class').agg(max_speed=('max_speed', 'max'),
count_order=('order', 'count'))
max_speed | count_order | |
---|---|---|
class | ||
bird | 389.0 | 2 |
mammal | 80.2 | 3 |
df.groupby('class').agg(
max_speed=pd.NamedAgg(column='max_speed', aggfunc='max'),
count_order=pd.NamedAgg(column='order', aggfunc='count')
)
max_speed | count_order | |
---|---|---|
class | ||
bird | 389.0 | 2 |
mammal | 80.2 | 3 |
transform類似於agg,但不同的是它返回的是一個與原資料行數相同的DataFrame,會將原來的每個值替換成其所在組統計後的值。比如按組計算平均值,那麼返回的新DataFrame中每個值就是它所在組的平均值。
df.groupby('class').agg(np.mean)
max_speed | |
---|---|
class | |
bird | 206.5 |
mammal | 69.1 |
df.groupby('class').transform(np.mean)
max_speed | |
---|---|
falcon | 206.5 |
parrot | 206.5 |
lion | 69.1 |
monkey | 69.1 |
leopard | 69.1 |
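transform一個很典型的用法,是把組內統計值作為新列加回原表,方便逐行對比(資料同上面的動物速度df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))

# 每行都帶上其所在組的平均速度,行數與原表一致
df['group_mean'] = df.groupby('class')['max_speed'].transform('mean')
```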
分組後篩選原始資料:
Signature:
DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)

Docstring:
Return a copy of a DataFrame excluding filtered elements.

Elements from groups are filtered if they do not satisfy the
boolean criterion specified by func.
# 篩選出 按class分組後,分組內max_speed平均值大於100的元素
df.groupby(['class']).filter(lambda x: x['max_speed'].mean() > 100)
class | order | max_speed | |
---|---|---|---|
falcon | bird | Falconiformes | 389.0 |
parrot | bird | Psittaciformes | 24.0 |
# 取出分組後index
df.groupby('class').apply(lambda x: x.index.to_list())
class
bird [falcon, parrot]
mammal [lion, monkey, leopard]
dtype: object
# 取出分組後每組中max_speed最大的前N個(這裡N=1)
df.groupby('class').apply(lambda x: x.sort_values(
by='max_speed', ascending=False).head(1))
class | order | max_speed | ||
---|---|---|---|---|
class | ||||
bird | falcon | bird | Falconiformes | 389.0 |
mammal | lion | mammal | Carnivora | 80.2 |
df.groupby('class').apply(lambda x: pd.Series({
'speed_max': x['max_speed'].max(),
'speed_min': x['max_speed'].min(),
'speed_mean': x['max_speed'].mean(),
}))
speed_max | speed_min | speed_mean | |
---|---|---|---|
class | |||
bird | 389.0 | 24.0 | 206.5 |
mammal | 80.2 | 58.0 | 69.1 |
按分組匯出Excel檔案:
for group, data in df.groupby('class'):
data.to_excel(f'data/{group}.xlsx')
# 每組去重值後數量
df.groupby('class').order.nunique()
class
bird 2
mammal 2
Name: order, dtype: int64
# 每組去重後的值
df.groupby("class")['order'].unique()
class
bird [Falconiformes, Psittaciformes]
mammal [Carnivora, Primates]
Name: order, dtype: object
# 統計每組資料值的數量
df.groupby("class")['order'].value_counts()
class order
bird Falconiformes 1
Psittaciformes 1
mammal Carnivora 2
Primates 1
Name: order, dtype: int64
# 每組最大的1個
df.groupby("class")['max_speed'].nlargest(1)
class
bird falcon 389.0
mammal lion 80.2
Name: max_speed, dtype: float64
# 每組最小的2個
df.groupby("class")['max_speed'].nsmallest(2)
class
bird parrot 24.0
falcon 389.0
mammal leopard 58.0
lion 80.2
Name: max_speed, dtype: float64
# 每組值是否單調遞增
df.groupby("class")['max_speed'].is_monotonic_increasing
class
bird False
mammal False
Name: max_speed, dtype: bool
# 每組值是否單調遞減
df.groupby("class")['max_speed'].is_monotonic_decreasing
class
bird True
mammal False
Name: max_speed, dtype: bool
堆疊與透視
實際生產中,我們拿到的原始資料的表現形狀可能並不符合當前需求,比如說不是期望的維度、資料不夠直觀、表現力不夠等等。此時,可以對原始資料進行適當的變形,比如堆疊、透視、行列轉置等。
堆疊
看個簡單的例子就能明白講的是什麼:
df = pd.DataFrame([[19, 136, 180, 98], [21, 122, 178, 96]], index=['令狐沖', '李尋歡'],
columns=['age', 'weight', 'height', 'score'])
df
age | weight | height | score | |
---|---|---|---|---|
令狐沖 | 19 | 136 | 180 | 98 |
李尋歡 | 21 | 122 | 178 | 96 |
# 有點像寬表變高表, 我是這樣覺得的
df.stack()
令狐沖 age 19
weight 136
height 180
score 98
李尋歡 age 21
weight 122
height 178
score 96
dtype: int64
# 有點像高表變寬表
df.stack().unstack()
age | weight | height | score | |
---|---|---|---|---|
令狐沖 | 19 | 136 | 180 | 98 |
李尋歡 | 21 | 122 | 178 | 96 |
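補充一點:stack()返回的其實是帶多層索引(MultiIndex)的Series,而unstack()預設展開最內層索引;指定level=0則展開最外層,相當於換個方向展開:

```python
import pandas as pd

df = pd.DataFrame([[19, 136, 180, 98], [21, 122, 178, 96]],
                  index=['令狐沖', '李尋歡'],
                  columns=['age', 'weight', 'height', 'score'])

# stack()的結果是帶(姓名, 指標)兩層索引的Series
s = df.stack()

# 展開最外層索引:姓名變為列,指標變為行
t = s.unstack(0)
```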
透視表
Signature:
df.pivot(index=None, columns=None, values=None) -> 'DataFrame'

Docstring:
Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a "pivot" table) based on column values. Uses
unique values from specified `index` / `columns` to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns.
df = pd.DataFrame({'name': ['江小魚', '江小魚', '江小魚', '花無缺', '花無缺',
'花無缺'],
'bug_level': ['A', 'B', 'C', 'A', 'B', 'C'],
'bug_count': [2, 3, 5, 1, 5, 6]})
df
name | bug_level | bug_count | |
---|---|---|---|
0 | 江小魚 | A | 2 |
1 | 江小魚 | B | 3 |
2 | 江小魚 | C | 5 |
3 | 花無缺 | A | 1 |
4 | 花無缺 | B | 5 |
5 | 花無缺 | C | 6 |
把上面的bug等級與bug數統計表變形如下,還是原來的資料,但是不是更加直觀呢?
df.pivot(index='name', columns='bug_level', values='bug_count')
bug_level | A | B | C |
---|---|---|---|
name | |||
江小魚 | 2 | 3 | 5 |
花無缺 | 1 | 5 | 6 |
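順便一提,pivot在效果上等價於先set_index再unstack,可以幫助理解它與上一節堆疊操作的關係:

```python
import pandas as pd

df = pd.DataFrame({'name': ['江小魚', '江小魚', '江小魚',
                            '花無缺', '花無缺', '花無缺'],
                   'bug_level': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'bug_count': [2, 3, 5, 1, 5, 6]})

# 先把(name, bug_level)設為多層索引,再展開bug_level,等價於pivot
out = df.set_index(['name', 'bug_level'])['bug_count'].unstack()
```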
如果原始資料中有重複的統計呢?就比如說上面的例子中來自不同產品線的bug統計,就可能出現兩行這樣的資料:['江小魚','B',3]、['江小魚','B',4],先試下用pivot會怎樣?
df = pd.DataFrame({'name': ['江小魚', '江小魚', '江小魚', '江小魚', '江小魚', '花無缺', '花無缺',
'花無缺', '花無缺', '花無缺', ],
'bug_level': ['A', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
'bug_count': [2, 3, 5, 4, 6, 1, 5, 6, 3, 1],
'score': [70, 80, 90, 76, 86, 72, 82, 88, 68, 92]})
df
name | bug_level | bug_count | score | |
---|---|---|---|---|
0 | 江小魚 | A | 2 | 70 |
1 | 江小魚 | B | 3 | 80 |
2 | 江小魚 | C | 5 | 90 |
3 | 江小魚 | B | 4 | 76 |
4 | 江小魚 | C | 6 | 86 |
5 | 花無缺 | A | 1 | 72 |
6 | 花無缺 | B | 5 | 82 |
7 | 花無缺 | C | 6 | 88 |
8 | 花無缺 | A | 3 | 68 |
9 | 花無缺 | B | 1 | 92 |
try:
df.pivot(index='name', columns='bug_level', values='bug_count')
except ValueError as e:
print(e)
Index contains duplicate entries, cannot reshape
原來,pivot()只能將資料進行reshape,不支援聚合。遇到上面這種含重複值需進行聚合計算的情況,應使用pivot_table(),它能實現類似Excel那樣的高階資料透視功能。
# 統計員工來自不同產品線不同級別的bug總數
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum)
bug_level | A | B | C |
---|---|---|---|
name | |||
江小魚 | 2 | 7 | 11 |
花無缺 | 4 | 6 | 6 |
當然,這裡的聚合可以非常靈活:
df.pivot_table(index=['name'], columns=['bug_level'], aggfunc={
'bug_count': np.sum, 'score': [max, np.mean]})
bug_count | score | ||||||||
---|---|---|---|---|---|---|---|---|---|
sum | max | mean | |||||||
bug_level | A | B | C | A | B | C | A | B | C |
name | |||||||||
江小魚 | 2 | 7 | 11 | 70 | 80 | 90 | 70 | 78 | 88 |
花無缺 | 4 | 6 | 6 | 72 | 92 | 88 | 70 | 87 | 88 |
還可以給每列每行加個彙總,如下所示:
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum, margins=True, margins_name='彙總')
bug_level | A | B | C | 彙總 |
---|---|---|---|---|
name | ||||
江小魚 | 2 | 7 | 11 | 20 |
花無缺 | 4 | 6 | 6 | 16 |
彙總 | 6 | 13 | 17 | 36 |
交叉表
交叉表是用於統計分組頻率的特殊透視表。簡單來說,就是將兩個或者多個列中不重複的元素組成一個新的 DataFrame,新資料的行和列交叉的部分值為其組合在原資料中出現的數量。
還是來個例子比較直觀。有如下學生選專業資料:
df = pd.DataFrame({'name': ['楊過', '小龍女', '郭靖', '黃蓉', '李尋歡', '孫小紅', '張無忌',
'趙敏', '令狐沖', '任盈盈'],
'gender': ['男', '女', '男', '女', '男', '女', '男', '女', '男', '女'],
'major': ['機械工程', '軟體工程', '金融工程', '工商管理', '機械工程', '金融工程', '軟體工程', '工商管理', '軟體工程', '工商管理']})
df
| | name | gender | major |
|---|---|---|---|
| 0 | 楊過 | 男 | 機械工程 |
| 1 | 小龍女 | 女 | 軟體工程 |
| 2 | 郭靖 | 男 | 金融工程 |
| 3 | 黃蓉 | 女 | 工商管理 |
| 4 | 李尋歡 | 男 | 機械工程 |
| 5 | 孫小紅 | 女 | 金融工程 |
| 6 | 張無忌 | 男 | 軟體工程 |
| 7 | 趙敏 | 女 | 工商管理 |
| 8 | 令狐沖 | 男 | 軟體工程 |
| 9 | 任盈盈 | 女 | 工商管理 |
To see whether students' choice of major is related to gender, compute:
pd.crosstab(df['gender'], df['major'])
| gender \ major | 工商管理 | 機械工程 | 軟體工程 | 金融工程 |
|---|---|---|---|---|
| 女 | 3 | 0 | 1 | 1 |
| 男 | 0 | 2 | 2 | 1 |
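Since a crosstab is just a frequency count, the same table can be built by hand with a groupby. A sketch using the student data from above:

```python
import pandas as pd

df = pd.DataFrame({'name': ['楊過', '小龍女', '郭靖', '黃蓉', '李尋歡', '孫小紅', '張無忌',
                            '趙敏', '令狐沖', '任盈盈'],
                   'gender': ['男', '女', '男', '女', '男', '女', '男', '女', '男', '女'],
                   'major': ['機械工程', '軟體工程', '金融工程', '工商管理', '機械工程',
                             '金融工程', '軟體工程', '工商管理', '軟體工程', '工商管理']})

ct = pd.crosstab(df['gender'], df['major'])

# Same frequencies via groupby: count each (gender, major) pair,
# then pivot the major level out to columns, filling absent pairs with 0
manual = df.groupby(['gender', 'major']).size().unstack(fill_value=0)

assert ct.equals(manual)
```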
Meanwhile, recall the plotting techniques covered in the previous post: https://www.cnblogs.com/bytesfly/p/pandas-1.html#畫圖
# Pie charts of the share of majors chosen by male and female students
pd.crosstab(df['gender'], df['major']).T.plot(
kind='pie', subplots=True, figsize=(12, 8), autopct="%.0f%%")
plt.show()

Viewed from a different angle:
# Stacked bar chart of the number of male and female applicants per major
pd.crosstab(df['gender'], df['major']).T.plot(
kind='bar', stacked=True, rot=0, title='各專業男女生填報人數柱狀圖', xlabel='', figsize=(10, 6))
plt.show()

Now, back to the crosstab topic above.
# Normalize the cross-tabulation over all cells
pd.crosstab(df['gender'], df['major'], normalize=True)
| gender \ major | 工商管理 | 機械工程 | 軟體工程 | 金融工程 |
|---|---|---|---|---|
| 女 | 0.3 | 0.0 | 0.1 | 0.1 |
| 男 | 0.0 | 0.2 | 0.2 | 0.1 |
# Normalize each row of the cross-tabulation
pd.crosstab(df['gender'], df['major'], normalize='index')
| gender \ major | 工商管理 | 機械工程 | 軟體工程 | 金融工程 |
|---|---|---|---|---|
| 女 | 0.6 | 0.0 | 0.2 | 0.2 |
| 男 | 0.0 | 0.4 | 0.4 | 0.2 |
# Normalize each column of the cross-tabulation
pd.crosstab(df['gender'], df['major'], normalize='columns')
| gender \ major | 工商管理 | 機械工程 | 軟體工程 | 金融工程 |
|---|---|---|---|---|
| 女 | 1.0 | 0.0 | 0.333333 | 0.5 |
| 男 | 0.0 | 1.0 | 0.666667 | 0.5 |
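The three normalize options correspond to dividing by the grand total, by row sums, and by column sums, respectively. A sketch that reproduces them manually on the same student data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['男', '女', '男', '女', '男', '女', '男', '女', '男', '女'],
                   'major': ['機械工程', '軟體工程', '金融工程', '工商管理', '機械工程',
                             '金融工程', '軟體工程', '工商管理', '軟體工程', '工商管理']})

ct = pd.crosstab(df['gender'], df['major'])

# normalize=True: divide every cell by the grand total (cells sum to 1)
total_norm = ct / ct.to_numpy().sum()

# normalize='index': divide each row by its row sum (each row sums to 1)
row_norm = ct.div(ct.sum(axis=1), axis=0)

# normalize='columns': divide each column by its column sum (each column sums to 1)
col_norm = ct.div(ct.sum(axis=0), axis=1)
```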
Likewise, a total can be added for each row and column:
pd.crosstab(df['gender'], df['major'], margins=True, margins_name='彙總')
| gender \ major | 工商管理 | 機械工程 | 軟體工程 | 金融工程 | 彙總 |
|---|---|---|---|---|---|
| 女 | 3 | 0 | 1 | 1 | 5 |
| 男 | 0 | 2 | 2 | 1 | 5 |
| 彙總 | 3 | 2 | 3 | 2 | 10 |