Handling missing values in pandas [dropna, drop, fillna]
阿新 • Published: 2018-12-11
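There are three ways to handle missing values:
- option 1: drop the samples (rows) that contain missing values
- option 2: drop the columns (feature vectors) that contain missing values
- option 3: fill the missing values with some value (0, the mean, the median, etc.)
Both DataFrame and Series provide dropna and fillna; this post focuses on the DataFrame versions.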
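Since the Series versions are not covered further, here is a minimal sketch of how they behave. The Series `s` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

dropped = s.dropna()   # keeps only the non-missing entries (index labels 0 and 2)
filled = s.fillna(0)   # replaces every missing entry with 0

print(dropped.tolist())  # [1.0, 3.0]
print(filled.tolist())   # [1.0, 0.0, 3.0, 0.0]
```

Both calls return a new Series and leave `s` unchanged, just like the DataFrame versions with inplace=False.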
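For option 1:
Use DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
- axis:
  - axis=0: drop rows that contain missing values
  - axis=1: drop columns that contain missing values
- how: works together with axis
  - how='any': drop the row or column if it contains any missing value
  - how='all': drop the row or column only if all of its values are missing
- thresh: keep a row or column (along axis) only if it has at least thresh non-missing values; otherwise drop it
  For example, axis=0, thresh=10 means that a row with fewer than 10 non-missing values is dropped
- subset: list
  The labels in which to look for missing values (column labels when dropping rows)
- inplace: whether to operate on the original data. If True, the DataFrame is modified and None is returned; otherwise a new copy with the missing values removed is returned
It helps to spell out the arguments explicitly when calling dropna, so the intent is clear at a glance. Note that recent pandas releases raise a TypeError if how and thresh are both passed in the same call, so pass only one of the two.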
examples:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ['Alfred', 'Batman', 'Catwoman'],
     "toy": [np.nan, 'Batmobile', 'Bullwhip'],
     "born": [pd.NaT, pd.Timestamp("1940-04-25"),
              pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
# Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25
# Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman
# Drop the rows where all elements are missing.
>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
# Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
# Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born'])
     name        toy       born
1  Batman  Batmobile 1940-04-25
# Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25
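The thresh and inplace behaviour described above can be checked with a short sketch. It rebuilds the same example DataFrame and compares dropna(thresh=2) against a hand-written boolean mask; the names `non_missing`, `by_thresh`, `by_mask` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alfred", "Batman", "Catwoman"],
     "toy": [np.nan, "Batmobile", "Bullwhip"],
     "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Count the non-missing values in each row.
non_missing = df.notna().sum(axis=1)

# thresh=2 keeps the rows that have at least 2 non-missing values.
by_thresh = df.dropna(thresh=2)
by_mask = df[non_missing >= 2]
print(by_thresh.equals(by_mask))  # True

# inplace=True modifies df itself and returns None.
result = df.dropna(inplace=True)
print(result)             # None
print(df.index.tolist())  # [1]
```

Because the in-place call returns None, writing `df = df.dropna(inplace=True)` would replace the DataFrame with None; use either the assignment or inplace=True, not both.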
For option 2:
You can use either dropna or drop.
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
- labels: the list of row or column labels to drop
- axis: 0 for rows; 1 for columns
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
# Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
# Drop rows (by index label)
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
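The drop examples above remove columns by name. To tie this back to option 2, the following sketch drops exactly the columns that contain missing values, once with drop and once with dropna. The small DataFrame and the name `cols_with_na` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns A and C contain missing values, B does not.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [4.0, 5.0, 6.0],
                   "C": [np.nan, np.nan, 9.0]})

# Labels of the columns that contain at least one missing value.
cols_with_na = df.columns[df.isna().any()]

via_drop = df.drop(columns=cols_with_na)
via_dropna = df.dropna(axis=1, how="any")

print(list(cols_with_na))           # ['A', 'C']
print(via_drop.equals(via_dropna))  # True
```

dropna is the shorter route when the rule is "any column with a missing value"; drop is the better fit when you already know which columns to remove.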
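For option 3:
Use DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- value: scalar, dict, Series, or DataFrame
  A dict specifies which fill value to use for each column (or for each index label, in the case of a Series)
- method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
  By default it operates down each column
  - ffill / pad: fill a missing value with the previous valid value
  - backfill / bfill: fill a missing value with the next valid value
- limit: the maximum number of missing values to fill. Probably rarely needed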
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
# Replace all missing values with 0
>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
# Fill missing values with the preceding or the following value
>>> df.fillna(method='ffill')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
>>> df.fillna(method='bfill')
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  3.0 NaN  5
3  NaN  3.0 NaN  4
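A note on the two calls above: pandas 2.1 and later deprecate the method argument of fillna in favour of the dedicated DataFrame.ffill() and DataFrame.bfill() methods. A minimal sketch of the equivalent calls on the same data, with `forward` and `backward` as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

forward = df.ffill()    # carry the previous valid value down each column
backward = df.bfill()   # carry the next valid value up each column

print(forward["A"].tolist())   # [nan, 3.0, 3.0, 3.0]
print(backward["B"].tolist())  # [2.0, 4.0, 3.0, 3.0]
```

Column C stays entirely NaN in both results, because there is no valid value to propagate.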
# Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively,
# i.e. use a different fill value for each column.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
# Replace only the first missing value in each column
>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
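Option 3 also mentions filling with the mean or the median. One way to do that, sketched here on made-up data, is to pass the per-column statistics straight to fillna, since a Series indexed by column name works like the dict above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 8.0],
                   "B": [10.0, 20.0, np.nan, 60.0]})

# df.mean() and df.median() skip missing values by default and return
# one value per column, which fillna then applies column by column.
filled_mean = df.fillna(df.mean())
filled_median = df.fillna(df.median())

print(filled_mean["A"].tolist())    # [1.0, 4.0, 3.0, 8.0]
print(filled_median["A"].tolist())  # [1.0, 3.0, 3.0, 8.0]
print(filled_mean["B"].tolist())    # [10.0, 20.0, 30.0, 60.0]
print(filled_median["B"].tolist())  # [10.0, 20.0, 20.0, 60.0]
```

The median is less sensitive to outliers than the mean, which is why the two results differ here.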
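Housing price analysis:
In this problem only the total_bedrooms column has missing values. The code for the three approaches is:
# Each call below returns a new object and leaves housing unchanged; assign the result to keep it.
# option 1: drop the rows that contain missing values
housing.dropna(subset=["total_bedrooms"])
# option 2: drop the "total_bedrooms" column from the data
housing.drop("total_bedrooms", axis=1)
# option 3: fill the missing values with the median of "total_bedrooms"
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)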
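To try the three options without the real dataset, here is a self-contained sketch on a tiny stand-in DataFrame. The values and the names `opt1`, `opt2` and `opt3` are invented for illustration; only the column name total_bedrooms comes from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; only total_bedrooms has a missing value.
housing = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
                        "total_bedrooms": [129.0, np.nan, 190.0, 235.0]})

# option 1: keep only the rows where total_bedrooms is present
opt1 = housing.dropna(subset=["total_bedrooms"])

# option 2: remove the total_bedrooms column
opt2 = housing.drop("total_bedrooms", axis=1)

# option 3: fill with the median, assigning the result into a copy
median = housing["total_bedrooms"].median()
opt3 = housing.copy()
opt3["total_bedrooms"] = opt3["total_bedrooms"].fillna(median)

print(opt1.shape)                               # (3, 2)
print(list(opt2.columns))                       # ['total_rooms']
print(median)                                   # 190.0
print(housing["total_bedrooms"].isna().sum())   # 1, the original is untouched
```

Keeping each result in its own variable makes it easy to compare the options. If you go with option 3, keep the computed median, since the same value is typically needed later to fill missing entries in new data.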