XGBoost-Based Happiness Prediction: Alibaba Tianchi Learning Competition

阿新 • Published: 2020-12-20

> This article is based on the Alibaba Tianchi learning competition 《快來一起挖掘幸福感！》 ("Let's Dig Up Happiness Together!").

## Loading the data

The full version of the data, `happiness_train_complete.csv`, is loaded.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
```

```python
# Use the id column as the DataFrame index and parse survey_time as datetimes
data_origin = pd.read_csv('./data/happiness_train_complete.csv', index_col='id', parse_dates=['survey_time'], encoding='gbk')
```

## Exploring basic information about the dataset

Print the first 5 rows for a quick look.

```python
data_origin.head()
```

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
1	4	1	12	32	59	2015-08-04 14:18:00	1	1959	1	1	...	4	50	60	50	50	30.0	30	50	50	50
2	4	2	18	52	85	2015-07-21 15:04:00	1	1992	1	1	...	3	90	70	70	80	85.0	70	90	60	60
3	4	2	29	83	126	2015-07-21 13:24:00	2	1967	1	0	...	4	90	80	75	79	80.0	90	90	90	75
4	5	2	10	28	51	2015-07-25 17:33:00	2	1943	1	1	...	3	100	90	70	80	80.0	90	90	80	80
5	4	1	7	18	36	2015-08-10 09:50:00	2	1994	1	1	...	2	50	50	50	50	50.0	50	50	50	50

5 rows × 139 columns

Inspect the details of the data: 8,000 records and 139 features in total. In the output, the second column is the feature name, the third is the number of non-null records, and the fourth is the data type of the feature.

```python
# Note: null_counts was removed in pandas 2.0; use show_counts=True there
data_origin.info(verbose=True, null_counts=True)
```

    Int64Index: 8000 entries, 1 to 8000
    Data columns (total 139 columns):
     #    Column                Non-Null Count  Dtype
    ---   ------                --------------  -----
     0    happiness             8000 non-null   int64
     1    survey_type           8000 non-null   int64
     2    province              8000 non-null   int64
     3    city                  8000 non-null   int64
     4    county                8000 non-null   int64
     5    survey_time           8000 non-null   datetime64[ns]
     6    gender                8000 non-null   int64
     7    birth                 8000 non-null   int64
     8    nationality           8000 non-null   int64
     9    religion              8000 non-null   int64
     10   religion_freq         8000 non-null   int64
     11   edu                   8000 non-null   int64
     12   edu_other             3 non-null      object
     13   edu_status            6880 non-null   float64
     14   edu_yr                6028 non-null   float64
     15   income                8000 non-null   int64
     16   political             8000 non-null   int64
     17   join_party            824 non-null    float64
     18   floor_area            8000 non-null   float64
     19   property_0            8000 non-null   int64
     20   property_1            8000 non-null   int64
     21   property_2            8000 non-null   int64
     22   property_3            8000 non-null   int64
     23   property_4            8000 non-null   int64
     24   property_5            8000 non-null   int64
     25   property_6            8000 non-null   int64
     26   property_7            8000 non-null   int64
     27   property_8            8000 non-null   int64
     28   property_other        66 non-null     object
     29   height_cm             8000 non-null   int64
     30   weight_jin            8000 non-null   int64
     31   health                8000 non-null   int64
     32   health_problem        8000 non-null   int64
     33   depression            8000 non-null   int64
     34   hukou                 8000 non-null   int64
     35   hukou_loc             7996 non-null   float64
     36   media_1               8000 non-null   int64
     37   media_2               8000 non-null   int64
     38   media_3               8000 non-null   int64
     39   media_4               8000 non-null   int64
     40   media_5               8000 non-null   int64
     41   media_6               8000 non-null   int64
     42   leisure_1             8000 non-null   int64
     43   leisure_2             8000 non-null   int64
     44   leisure_3             8000 non-null   int64
     45   leisure_4             8000 non-null   int64
     46   leisure_5             8000 non-null   int64
     47   leisure_6             8000 non-null   int64
     48   leisure_7             8000 non-null   int64
     49   leisure_8             8000 non-null   int64
     50   leisure_9             8000 non-null   int64
     51   leisure_10            8000 non-null   int64
     52   leisure_11            8000 non-null   int64
     53   leisure_12            8000 non-null   int64
     54   socialize             8000 non-null   int64
     55   relax                 8000 non-null   int64
     56   learn                 8000 non-null   int64
     57   social_neighbor       7204 non-null   float64
     58   social_friend         7204 non-null   float64
     59   socia_outing          8000 non-null   int64
     60   equity                8000 non-null   int64
     61   class                 8000 non-null   int64
     62   class_10_before       8000 non-null   int64
     63   class_10_after        8000 non-null   int64
     64   class_14              8000 non-null   int64
     65   work_exper            8000 non-null   int64
     66   work_status           2951 non-null   float64
     67   work_yr               2951 non-null   float64
     68   work_type             2951 non-null   float64
     69   work_manage           2951 non-null   float64
     70   insur_1               8000 non-null   int64
     71   insur_2               8000 non-null   int64
     72   insur_3               8000 non-null   int64
     73   insur_4               8000 non-null   int64
     74   family_income         7999 non-null   float64
     75   family_m              8000 non-null   int64
     76   family_status         8000 non-null   int64
     77   house                 8000 non-null   int64
     78   car                   8000 non-null   int64
     79   invest_0              8000 non-null   int64
     80   invest_1              8000 non-null   int64
     81   invest_2              8000 non-null   int64
     82   invest_3              8000 non-null   int64
     83   invest_4              8000 non-null   int64
     84   invest_5              8000 non-null   int64
     85   invest_6              8000 non-null   int64
     86   invest_7              8000 non-null   int64
     87   invest_8              8000 non-null   int64
     88   invest_other          29 non-null     object
     89   son                   8000 non-null   int64
     90   daughter              8000 non-null   int64
     91   minor_child           6934 non-null   float64
     92   marital               8000 non-null   int64
     93   marital_1st           7172 non-null   float64
     94   s_birth               6282 non-null   float64
     95   marital_now           6230 non-null   float64
     96   s_edu                 6282 non-null   float64
     97   s_political           6282 non-null   float64
     98   s_hukou               6282 non-null   float64
     99   s_income              6282 non-null   float64
     100  s_work_exper          6282 non-null   float64
     101  s_work_status         2565 non-null   float64
     102  s_work_type           2565 non-null   float64
     103  f_birth               8000 non-null   int64
     104  f_edu                 8000 non-null   int64
     105  f_political           8000 non-null   int64
     106  f_work_14             8000 non-null   int64
     107  m_birth               8000 non-null   int64
     108  m_edu                 8000 non-null   int64
     109  m_political           8000 non-null   int64
     110  m_work_14             8000 non-null   int64
     111  status_peer           8000 non-null   int64
     112  status_3_before       8000 non-null   int64
     113  view                  8000 non-null   int64
     114  inc_ability           8000 non-null   int64
     115  inc_exp               8000 non-null   float64
     116  trust_1               8000 non-null   int64
     117  trust_2               8000 non-null   int64
     118  trust_3               8000 non-null   int64
     119  trust_4               8000 non-null   int64
     120  trust_5               8000 non-null   int64
     121  trust_6               8000 non-null   int64
     122  trust_7               8000 non-null   int64
     123  trust_8               8000 non-null   int64
     124  trust_9               8000 non-null   int64
     125  trust_10              8000 non-null   int64
     126  trust_11              8000 non-null   int64
     127  trust_12              8000 non-null   int64
     128  trust_13              8000 non-null   int64
     129  neighbor_familiarity  8000 non-null   int64
     130  public_service_1      8000 non-null   int64
     131  public_service_2      8000 non-null   int64
     132  public_service_3      8000 non-null   int64
     133  public_service_4      8000 non-null   int64
     134  public_service_5      8000 non-null   float64
     135  public_service_6      8000 non-null   int64
     136  public_service_7      8000 non-null   int64
     137  public_service_8      8000 non-null   int64
     138  public_service_9      8000 non-null   int64
    dtypes: datetime64[ns](1), float64(25), int64(110), object(3)
    memory usage: 8.5+ MB

View the summary statistics of the data.

```python
data_origin.describe()
```

	happiness	survey_type	province	city	county	gender	birth	nationality	religion	religion_freq	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
count	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.00000	8000.000000	8000.00000	8000.000000	8000.000000	...	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.00000	8000.000000	8000.000000
mean	3.850125	1.405500	15.155375	42.564750	70.619000	1.53000	1964.707625	1.37350	0.772250	1.427250	...	3.722250	70.809500	68.170000	62.737625	66.320125	62.794187	67.064000	66.09625	65.626750	67.153750
std	0.938228	0.491019	8.917100	27.187404	38.747503	0.49913	16.842865	1.52882	1.071459	1.408441	...	1.143358	21.184742	20.549943	24.771319	22.049437	23.463162	21.586817	23.08568	23.827493	22.502203
min	-8.000000	1.000000	1.000000	1.000000	1.000000	1.00000	1921.000000	-8.00000	-8.000000	-8.000000	...	-8.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.00000	-3.000000	-3.000000
25%	4.000000	1.000000	7.000000	18.000000	37.000000	1.00000	1952.000000	1.00000	1.000000	1.000000	...	3.000000	60.000000	60.000000	50.000000	60.000000	55.000000	60.000000	60.00000	60.000000	60.000000
50%	4.000000	1.000000	15.000000	42.000000	73.000000	2.00000	1965.000000	1.00000	1.000000	1.000000	...	4.000000	79.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.00000	70.000000	70.000000
75%	4.000000	2.000000	22.000000	65.000000	104.000000	2.00000	1977.000000	1.00000	1.000000	1.000000	...	5.000000	80.000000	80.000000	80.000000	80.000000	80.000000	80.000000	80.00000	80.000000	80.000000
max	5.000000	2.000000	31.000000	89.000000	134.000000	2.00000	1997.000000	8.00000	1.000000	9.000000	...	5.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.00000	100.000000	100.000000

8 rows × 135 columns
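The negative minima in the summary above (for example `-8` for `happiness` and `-3` for the `public_service_*` columns) are questionnaire codes rather than measurements, as the missing-value section explains later. A minimal sketch of how such codes could be counted per column before deciding how to treat them follows. The toy frame and the helper name `count_negative_codes` are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd


def count_negative_codes(df: pd.DataFrame) -> pd.Series:
    """Return, for each numeric column, how many of its values are negative."""
    numeric = df.select_dtypes(include="number")
    counts = (numeric < 0).sum()
    # Keep only the columns that actually contain negative codes
    return counts[counts > 0]


# Toy data (made up) standing in for the survey table
toy = pd.DataFrame({
    "happiness": [4, -8, 5, 3],
    "public_service_1": [50, -3, -2, 80],
    "gender": [1, 2, 2, 1],
})
negative_counts = count_negative_codes(toy)
print(negative_counts)
```

On the real data the same call would be `count_negative_codes(data_origin)`, which shows at a glance which features carry special codes.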

## Data preprocessing

### Missing-value handling

Check how much is missing in three subsets of the features, where

- `required_list` holds the features that are mandatory items in the questionnaire
- `continuous_list` holds the continuous variables
- `categorical_list` holds the categorical (nominal) variables

All remaining features are ordinal categorical variables.

```python
required_list = ['survey_type', 'province', 'city', 'county', 'survey_time', 'gender', 'birth',
                 'nationality', 'religion', 'religion_freq', 'edu', 'income', 'political',
                 'floor_area', 'height_cm', 'weight_jin', 'health', 'health_problem', 'depression',
                 'hukou', 'socialize', 'relax', 'learn', 'equity', 'class', 'work_exper',
                 'work_status', 'work_yr', 'work_type', 'work_manage', 'family_income', 'family_m',
                 'family_status', 'house', 'car', 'marital', 'status_peer', 'status_3_before',
                 'view', 'inc_ability']
continuous_list = ['birth', 'edu_yr', 'income', 'floor_area', 'height_cm', 'weight_jin', 'work_yr',
                   'family_income', 'family_m', 'house', 'son', 'daughter', 'minor_child',
                   'marital_1st', 's_birth', 'marital_now', 's_income', 'f_birth', 'm_birth',
                   'inc_exp', 'public_service_1', 'public_service_2', 'public_service_3',
                   'public_service_4', 'public_service_5', 'public_service_6', 'public_service_7',
                   'public_service_8', 'public_service_9']
categorical_list = ['survey_type', 'province', 'gender', 'nationality']
```

#### Missing values in required items

Check the missing values among the required items.

```python
data_origin[required_list].isna().sum()[data_origin[required_list].isna().sum() > 0].to_frame().T
```

	work_status	work_yr	work_type	work_manage	family_income
0	5049	5049	5049	5049	1

The columns are:

- `work_status`: current employment status
- `work_yr`: total number of years worked
- `work_type`: nature of the current job
- `work_manage`: management responsibilities in the current job
- `family_income`: total household income for the whole of last year

Start with the four features prefixed with `work_`. Their missing counts are identical, which suggests that the questions were skipped by the design of the questionnaire rather than left blank at random. The questionnaire confirms this: depending on the answer to `work_exper` (work experience and status), the four questions above are skipped. The questionnaire item for `work_exper` is shown below.

![圖片](https://tva1.sinaimg.cn/large/006VTcCxly1glpgvo45lvj30yy096q4h.jpg)

Every answer to `work_exper` other than `1` skips the four questions. So print `work_exper` for the records with a missing required item and check whether they are all non-`1` records. The output below shows that, where the four features are missing, `work_exper` is almost always a value other than `1`.

```python
# Rows with at least one missing required item
missing_required = data_origin[required_list].isna().sum(axis=1) > 0
work_exper_missing = data_origin.loc[missing_required, 'work_exper']
work_exper_missing.to_frame().plot.hist()
pd.value_counts(work_exper_missing)
```

    5    1968
    3    1242
    4    1065
    2     387
    6     380
    1       7
    Name: work_exper, dtype: int64

![output_15_1](https://tva4.sinaimg.cn/large/006VTcCxly1gluiojypjsj30av06v0so.jpg)

Look more closely at the records whose value is `1`.

```python
data_origin[missing_required & (data_origin['work_exper'] == 1)]
```

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
692	4	2	21	64	101	2015-07-20 11:12:00	2	1975	1	1	...	5	80	70	80	80	80.0	80	80	80	80
841	4	2	31	88	133	2015-08-17 13:49:00	2	1971	1	0	...	4	50	30	-2	-2	-2.0	50	50	50	70
1411	4	2	2	2	9	2015-07-23 09:25:00	1	1967	8	1	...	4	90	85	80	90	90.0	92	93	94	90
3117	4	1	4	7	18	2015-10-03 16:02:00	1	1980	1	1	...	2	30	35	30	40	60.0	40	30	70	70
4783	5	2	22	65	103	2015-07-08 18:45:00	1	1955	1	1	...	5	90	90	90	90	80.0	90	80	90	90
5589	5	2	16	46	78	2015-07-29 11:34:00	2	1964	1	1	...	3	89	63	67	75	74.0	67	65	78	79
7368	4	2	21	64	101	2015-07-19 08:32:00	2	1963	1	1	...	5	70	70	70	60	70.0	70	60	60	60

7 rows × 139 columns

Seven records have `work_exper` equal to `1` and still lack the work items, so they are dropped.

```python
# Rows with a missing required item even though work_exper == 1
missing_required = data_origin[required_list].isna().sum(axis=1) > 0
data_origin.drop(data_origin[missing_required & (data_origin['work_exper'] == 1)].index, inplace=True)
```

Only one record is missing `family_income`. Removing it does not affect the size of the data, so it is dropped directly.

```python
data_origin.drop(data_origin[data_origin['family_income'].isna()].index, inplace=True)
```

#### Missing values in continuous features

Check the missing values among the continuous features.

```python
data_origin[continuous_list].isna().sum()[data_origin[continuous_list].isna().sum() > 0].to_frame().T
```

	edu_yr	work_yr	minor_child	marital_1st	s_birth	marital_now	s_income
0	1970	5041	1066	828	1718	1770	1718

The columns are:

- `edu_yr`: the year in which the highest completed level of education was obtained
- `work_yr`: total years worked, from the first non-agricultural job to the current job
- `minor_child`: number of children under 18
- `marital_1st`: time of the first marriage
- `s_birth`: birth year of the current spouse or cohabiting partner
- `marital_now`: year of marriage to the current spouse
- `s_income`: total income of the spouse or cohabiting partner for the whole of last year

For `edu_yr`, the year in which the highest completed level of education was obtained, look at the distribution of `edu_status` among the records where it is missing.

```python
data_origin[data_origin['edu_yr'].isna()]['edu_status'].plot.hist()
pd.value_counts(data_origin[data_origin['edu_yr'].isna()]['edu_status'])
```

    2.0    746
    3.0    103
    4.0      1
    1.0      1
    Name: edu_status, dtype: int64

![output_26_1](https://tvax4.sinaimg.cn/large/006VTcCxly1gluiot98ydj30ap06vdfq.jpg)

Only records with `edu_status` equal to `4` (graduated) are expected to give a graduation year in `edu_yr`, so the record with `edu_status == 4` and a missing `edu_yr` is dropped.

```python
data_origin.drop(data_origin[(data_origin['edu_status'] == 4) & (data_origin['edu_yr'].isna())].index, inplace=True)
```

```python
data_origin.shape
```

    (7991, 139)

For `minor_child`, check two other features of the records where it is missing: `son` and `daughter`, the number of sons and daughters. If both are 0, `minor_child` can be filled with 0 as well.

```python
print(data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'son'].sum())
print(data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'daughter'].sum())
data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'son':'daughter']
```

    0
    0

	son	daughter
id
2	0	0
5	0	0
9	0	0
29	0	0
31	0	0
...	...	...
7967	0	0
7972	0	0
7991	0	0
7999	0	0
8000	0	0

1066 rows × 2 columns

For the records with a missing `minor_child`, the numbers of sons and daughters are also 0, so the missing `minor_child` values are filled with 0.

```python
data_origin['minor_child'] = data_origin['minor_child'].fillna(0)
```

For the records with a missing `marital_1st`, check whether the corresponding `marital` value is `1`, meaning unmarried.

```python
marital_of_missing = data_origin[data_origin['marital_1st'].isna()]['marital']
# If every value is 1 (unmarried), the sum equals the number of records
print(marital_of_missing.sum() == marital_of_missing.shape[0])
marital_of_missing.plot.hist()
pd.value_counts(marital_of_missing)
```

    True
    1    828
    Name: marital, dtype: int64

![output_35_2](https://tva1.sinaimg.cn/large/006VTcCxly1gluiozohf6j30ap06vt8m.jpg)

The output shows that every record with a missing `marital_1st` belongs to an unmarried respondent, so these missing values are legitimate.

Next, look at the missing values of `s_birth`, the birth year of the current spouse or cohabiting partner. First check the `marital` status of the records where it is missing, to see whether they have no spouse or cohabiting partner.

```python
data_origin[data_origin['s_birth'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_birth'].isna()]['marital'])
```

    1    828
    7    718
    6    171
    2      1
    Name: marital, dtype: int64

![output_37_1](https://tvax4.sinaimg.cn/large/006VTcCxly1gluip4t3mtj30ap06vwee.jpg)

The `marital` values `1`, `6` and `7` mean unmarried, divorced and widowed respectively, so a missing `s_birth` is legitimate for them. Only one record with a missing value has `2` (cohabiting), so it is simply dropped.

```python
data_origin.drop(data_origin[data_origin['s_birth'].isna() & (data_origin['marital'] == 2)].index, inplace=True)
```

For `marital_now`, the year of marriage to the current spouse, first print `marital` to check whether the respondents are indeed not currently married.

```python
data_origin[data_origin['marital_now'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['marital_now'].isna()]['marital'])
```

    1    828
    7    718
    6    171
    2     51
    3      1
    Name: marital, dtype: int64

![output_41_1](https://tva4.sinaimg.cn/large/006VTcCxly1gluip988tnj30ap06vwee.jpg)

Values `1` and `2` both describe respondents who are not married, so a missing value is legitimate. Values `3`, `6` and `7` mean first marriage with a spouse, divorced and widowed. Only `3` describes a respondent who is currently married to a spouse, so that record should be dropped.

```python
data_origin.drop(data_origin[data_origin['marital_now'].isna() & (data_origin['marital'] == 3)].index, inplace=True)
```

```python
data_origin.shape
```

    (7989, 139)

For the missing values of `s_income`, the total income of the spouse or cohabiting partner for the whole of last year, check `marital` to see whether the respondents have no spouse or partner.

```python
data_origin[data_origin['s_income'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_income'].isna()]['marital'])
```

    1    828
    7    718
    6    171
    Name: marital, dtype: int64

![output_46_1](https://tvax4.sinaimg.cn/large/006VTcCxly1gluipdoqzpj30ap06vwee.jpg)

For every record with a missing `s_income`, the marital status is unmarried, divorced or widowed, so the missing `s_income` values are legitimate.

#### Missing values in categorical variables

Check the missing values among the categorical variables. All counts are 0, so nothing is missing.

```python
data_origin[categorical_list].isna().sum().to_frame().T
```

	survey_type	province	gender	nationality
0	0	0	0	0
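The checks in the preceding subsections all follow one pattern: a field may legitimately be missing when a gating answer means the question was skipped, and a record is suspect only when the field is missing although the gating answer says it should have been filled in. A minimal sketch of that pattern as a reusable helper follows. The function name `unexpected_missing`, its parameters and the toy data are assumptions made for illustration; the original notebook writes each check out by hand.

```python
import numpy as np
import pandas as pd


def unexpected_missing(df, field, gate, answered_values):
    """Rows where `field` is NaN although `gate` says the question applies."""
    should_answer = df[gate].isin(answered_values)
    return df[df[field].isna() & should_answer]


# Toy data (made up): work_status is only asked when work_exper == 1
toy = pd.DataFrame(
    {
        "work_exper": [1, 1, 3, 5],
        "work_status": [2.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index([10, 11, 12, 13], name="id"),
)

suspect = unexpected_missing(toy, "work_status", "work_exper", [1])
print(suspect.index.tolist())

# Drop only the suspect rows; rows that skipped the question by design stay
cleaned = toy.drop(suspect.index)
```

Keeping the gating values in an explicit argument makes each decision easy to audit against the questionnaire.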

#### Missing values across all features

Check the missing values across all features.

```python
data_origin.isna().sum()[data_origin.isna().sum() > 0].to_frame().T
```

	edu_other	edu_status	edu_yr	join_party	property_other	hukou_loc	social_neighbor	social_friend	work_status	work_yr	...	marital_1st	s_birth	marital_now	s_edu	s_political	s_hukou	s_income	s_work_exper	s_work_status	s_work_type
0	7986	1119	1969	7167	7923	4	795	795	5038	5038	...	828	1717	1768	1717	1717	1717	1717	1717	5427	5427

1 rows × 23 columns
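The one-row summaries of missing counts used throughout this section repeat the same expression twice per call. A small helper could compute the counts once and filter them, as sketched below. The name `missing_summary` and the toy frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(df, columns=None):
    """One-row frame with the NaN count of every column that has missing values."""
    subset = df if columns is None else df[columns]
    counts = subset.isna().sum()
    return counts[counts > 0].to_frame().T


# Toy data (made up)
toy = pd.DataFrame({
    "edu_yr": [2001.0, np.nan, np.nan],
    "income": [1000, 2000, 3000],
    "s_birth": [np.nan, 1970.0, 1980.0],
})
summary = missing_summary(toy)
print(summary)
```

With the notebook's data, `missing_summary(data_origin, required_list)` would correspond to the check on required items, and `missing_summary(data_origin)` to the check on all features.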

Start with `edu_other`, which is only filled in when `edu` is `14`. Check whether any record with a missing `edu_other` has `edu` equal to `14`. If so, `edu_other` should not be missing, and the record should be dropped.

```python
data_origin[data_origin['edu_other'].isna() & (data_origin['edu'] == 14)]
```

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
1242	4	2	3	6	13	2015-09-24 17:58:00	1	1971	1	1	...	5	100	90	60	80	70.0	80	70	60	50
3651	3	2	3	6	13	2015-09-24 20:25:00	1	1953	1	1	...	5	100	100	60	50	70.0	50	30	70	40
5330	2	2	3	6	13	2015-09-25 07:57:00	1	1953	1	1	...	5	100	100	100	100	100.0	100	30	100	50

3 rows × 139 columns

Among the records with `edu` equal to `14`, three also have a missing `edu_other`, so these three records are dropped.

```python
data_origin.drop(data_origin[data_origin['edu_other'].isna() & (data_origin['edu'] == 14)].index, inplace=True)
```

For the records with a missing `edu_status`, first check which values `edu` takes.

```python
data_origin[data_origin['edu_status'].isna()]['edu'].plot.hist()
pd.value_counts(data_origin[data_origin['edu_status'].isna()]['edu'])
```

    1    1052
    2      65
    3       2
    Name: edu, dtype: int64

![output_57_1](https://tvax1.sinaimg.cn/large/006VTcCxly1gluipl0cmij30av06v0sm.jpg)

For the records with a missing `edu_status`, the education levels in `edu` are no education at all (`1`), old-style private school or literacy class (`2`), and primary school (`3`). Values `1` and `2` skip the question, so a missing `edu_status` is legitimate for them. The records with `edu` equal to `3` are therefore dropped.

```python
data_origin.drop(data_origin[data_origin['edu_status'].isna() & (data_origin['edu'] == 3)].index, inplace=True)
```

`join_party` is the time at which a respondent who is currently a Party member joined the Party. A missing value is only valid when the respondent is not a Party member. Check the distribution.

```python
data_origin[data_origin['join_party'].isna()]['political'].plot.hist()
pd.value_counts(data_origin[data_origin['join_party'].isna()]['political'])
```

    1     6703
    2      402
    -8      41
    3       11
    4        5
    Name: political, dtype: int64

![output_61_1](https://tvax1.sinaimg.cn/large/006VTcCxly1gluipqiffij30av06yjra.jpg)

The histogram and the counts show five records where `political` is `4` but the date of joining the Party was not filled in, so these five records are dropped.

```python
data_origin.drop(data_origin[data_origin['join_party'].isna() & (data_origin['political'] == 4)].index, inplace=True)
```

For `hukou_loc`, the place where the household registration (hukou) is currently held, check `hukou` for the records where it is missing. All of them have the value `7`, meaning no hukou, so the missing values are legitimate.

```python
data_origin[data_origin['hukou_loc'].isna()]['hukou'].to_frame()
```

	hukou
id
589	7
3657	7
3799	7
7811	7

For `social_neighbor` and `social_friend`, described as how often the respondent takes part in social and recreational activities with other friends and how many nights were spent away from home on holiday or visiting relatives and friends, first look at the distribution of `socialize` among the records with missing values.

```python
data_origin[data_origin['social_neighbor'].isna()]['socialize'].plot.hist()
pd.value_counts(data_origin[data_origin['social_neighbor'].isna()]['socialize'])
```

    1    793
    Name: socialize, dtype: int64

![output_67_1](https://tva2.sinaimg.cn/large/006VTcCxly1gluipvuibkj30ap06vt8m.jpg)

For every record with a missing `social_neighbor` or `social_friend`, `socialize` (how often free time is spent socializing) is `1`, meaning never. The missing values of both features can therefore be filled with `1`.

```python
data_origin['social_neighbor'] = data_origin['social_neighbor'].fillna(1)
data_origin['social_friend'] = data_origin['social_friend'].fillna(1)
```

The features from `s_edu` to `s_work_exper` all have the same number of missing records, so the missing records probably come from the same group of respondents. First look at the distribution of `marital` among the records with a missing `s_edu`.

```python
data_origin[data_origin['s_edu'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_edu'].isna()]['marital'])
```

    1    827
    7    717
    6    171
    Name: marital, dtype: int64

![output_71_1](https://tva2.sinaimg.cn/large/006VTcCxly1gluiq1nsc8j30ap06vwee.jpg)

Every record with a missing `s_edu` is unmarried, divorced or widowed, that is, without a spouse or cohabiting partner, so the missing values are legitimate. The same holds for `s_political` through `s_work_exper`.

For `s_work_status`, the current employment status of the spouse or cohabiting partner, first consult the questionnaire.

![圖片](https://tvax3.sinaimg.cn/large/006VTcCxly1glq05rnq47j30xy09kq4n.jpg)

`s_work_status` and `s_work_type` are only to be filled in when `s_work_exper` is `1`. All other answers skip them, which makes those missing values legitimate. Now look at the distribution of `s_work_exper` among the records with a missing `s_work_status`.

```python
data_origin[data_origin['s_work_status'].isna()]['s_work_exper'].plot.hist()
pd.value_counts(data_origin[data_origin['s_work_status'].isna()]['s_work_exper'])
```

    5.0    1424
    3.0    1017
    4.0     823
    6.0     221
    2.0     217
    1.0       1
    Name: s_work_exper, dtype: int64

![output_73_1](https://tvax2.sinaimg.cn/large/006VTcCxly1gluiq62r3vj30av06vgli.jpg)

Only one record has `s_work_exper` equal to `1`, so it is simply dropped.

```python
data_origin.drop(data_origin[data_origin['s_work_status'].isna() & (data_origin['s_work_exper'] == 1)].index, inplace=True)
```

Throughout the questionnaire the negative codes share a common meaning: `-1` is not applicable, `-2` is don't know, `-3` is refused to answer, and `-8` is unable to answer. Here every negative value is replaced by a median computed per feature. In the code below this is the median of the distinct values found in the rows that contain no negative code at all.

```python
data_origin.shape
```

    (7978, 139)

```python
# Index of the rows that contain no negative code in any numeric column
negative_per_row = (data_origin.drop(['survey_time', 'edu_other', 'property_other', 'invest_other'], axis=1) < 0).sum(axis=1)
no_ne_rows_index = negative_per_row[negative_per_row == 0].index
```

```python
for column, content in data_origin.items():
    if pd.api.types.is_numeric_dtype(content):
        # Median of the distinct values among rows without negative codes
        fill_value = pd.Series(data_origin.loc[no_ne_rows_index, column].unique()).median()
        # NaN < 0 is False, so missing values pass through unchanged
        data_origin[column] = content.apply(lambda x: fill_value if x < 0 else x)
```

Once all negative values have been replaced, every `NaN` is filled with a single common value, `-1`.

```python
data_origin.fillna(-1, inplace=True)
```

At this point the missing values of all features have been handled.

### Text data processing

Three of the features, `edu_other`, `property_other` and `invest_other`, contain strings and need to be converted with ordinal encoding. First look at what was entered for `edu_other`.

```python
data_origin[data_origin['edu_other'] != -1]['edu_other'].to_frame()
```

	edu_other
id
1170	夜校
2513	夜校
4926	夜校

Every `edu_other` entry is 夜校 (night school). Convert the strings to ordinal codes.

```python
data_origin['edu_other'] = data_origin['edu_other'].astype('category').values.codes + 1
```

Next is `property_other`, which records who holds the property rights to the home. First look at what respondents entered.

```python
data_origin[data_origin['property_other'] != -1]['property_other'].to_frame()
```

	property_other
id
76	無產權
92	已購買，但未過戶
99	家庭共同所有
132	待辦
455	沒有產權
...	...
7376	家人共有
7746	全家人共有
7776	兄弟共有
7821	未分家，全家所有
7917	家人共有

66 rows × 1 columns

Many of the entries mean the same thing. For example, `家庭共同所有` (jointly owned by the family) and `全家所有` (owned by the whole family) are equivalent. Here they are merged by hand, record by record.

```python
#data_origin.loc[[8009, 9212, 9759, 10517], 'property_other'] = '多人擁有'
#data_origin.loc[[8014, 8056, 10264], 'property_other'] = '未過戶'
#data_origin.loc[[8471, 8825, 9597, 9810, 9842, 9967, 10069, 10166, 10203, 10469], 'property_other'] = '全家擁有'
#data_origin.loc[[8553, 8596, 9605, 10421, 10814], 'property_other'] = '無產權'
```

```python
# '無產權': no property rights
data_origin.loc[[76, 132, 455, 495, 1415, 2511, 2792, 2956, 3647, 4147, 4193, 4589, 5023, 5382, 5492, 6102, 6272, 6339, 6507, 7184, 7239], 'property_other'] = '無產權'
# '未過戶': title not yet transferred
data_origin.loc[[92, 1888, 2703, 3381, 5654], 'property_other'] = '未過戶'
# '全家擁有': owned by the whole family
data_origin.loc[[99, 619, 2728, 3062, 3222, 3251, 3696, 5283, 6191, 7295, 7376, 7746, 7821, 7917], 'property_other'] = '全家擁有'
# '多人擁有': owned by several people
data_origin.loc[[1597, 4993, 5398, 5899, 7240, 7776], 'property_other'] = '多人擁有'
# '小產權': limited ("minor") property rights
data_origin.loc[[6469, 6891], 'property_other'] = '小產權'
```

Encode the strings as integer ordinal codes.

```python
data_origin['property_other'] = data_origin['property_other'].astype('category').values.codes + 1
```

Look at what was entered for `invest_other`, the investment activities the respondent engages in.

```python
pd.DataFrame(data_origin[data_origin['invest_other'] != -1]['invest_other'].unique())
```

	0
0	理財產品
1	民間借貸
2	銀行理財
3	儲蓄存款
4	理財
5	銀行存款利息
6	活期儲蓄
7	投資服務業、傢俱業
8	銀行存款
9	個人融資
10	租房
11	老人家不清楚
12	家中有部分土地承包出去
13	沒有
14	高利貸
15	彩票
16	自己沒有，兒女不清楚
17	網上理財
18	統籌
19	福利車票
20	其他理財產品
21	商業萬能保險
22	投資開發區
23	字畫、茶壺
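Several of the free-text answers listed above look like near-synonyms, much as with `property_other`. One possible way to merge them before encoding is a dictionary that maps each variant to a canonical label, sketched below on toy data. The synonym map is hypothetical and the original notebook does not apply it to `invest_other`, which it encodes as entered. The sketch also casts everything to `str` first, an added assumption so that the `-1` placeholder and the strings sort consistently.

```python
import pandas as pd

# Toy answers (made up, in the style of the free-text column); -1 marks "not filled in"
answers = pd.Series(["理財產品", "理財", "銀行存款", "儲蓄存款", "彩票", -1])

# Hypothetical synonym map: each key is rewritten to its canonical label
canonical = {
    "理財": "理財產品",      # "wealth management" -> "wealth-management product"
    "儲蓄存款": "銀行存款",  # "savings deposit" -> "bank deposit"
}
normalized = answers.map(lambda value: canonical.get(value, value))

# Cast to str so all categories are comparable, then take the category codes
codes = normalized.astype(str).astype("category").cat.codes + 1
print(pd.DataFrame({"answer": normalized, "code": codes}))
```

Answers that were merged receive the same code, so the encoded feature has fewer, better-populated levels.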

Likewise, convert it to integer ordinal codes.

```python
data_origin['invest_other'] = data_origin['invest_other'].astype('category').values.codes + 1
```

### Outlier handling

```python
data_nona = data_origin.copy()
```

Draw box plots to examine outliers in the features, and drop the outlying records.

```python
sns.boxplot(x=data_nona['house'])
```

![output_100_1](https://tvax1.sinaimg.cn/large/006VTcCxly1gluiqawemkj309p078dfn.jpg)

```python
data_nona.drop(data_nona[data_nona['house'] > 25].index, inplace=True)
```

```python
sns.boxplot(x=data_nona['family_m'])
```

![output_102_1](https://tva4.sinaimg.cn/large/006VTcCxly1gluiqflfmrj309p078jr7.jpg)

```python
data_nona.drop(data_nona[data_nona['family_m'] > 40].index, inplace=True)
```

```python
sns.boxplot(x=data_nona['inc_exp'])
```

![output_104_1](https://tvax1.sinaimg.cn/large/006VTcCxly1gluiqk82lqj309p0780sk.jpg)

```python
data_nona.drop(data_nona[data_nona['inc_exp'] > 0.6e8].index, inplace=True)
```

Look at the distribution of the survey month. Because all questionnaires were filled in during 2015, only the month needs to be checked for outliers.

![圖片](https://tvax1.sinaimg.cn/large/006VTcCxly1glqnzxouv4j30kh0tkgmv.jpg)

The figure shows that the survey started in June, so the questionnaires dated February are anomalous and should be dropped.

```python
sns.boxplot(x=data_nona['survey_time'].dt.month)
```

![output_107_1](https://tvax3.sinaimg.cn/large/006VTcCxly1gluiqoqodgj309p078web.jpg)

```python
data_nona.drop(data_nona[data_nona['survey_time'].dt.month < 6].index, inplace=True)
```

## Feature construction

Feature construction is also referred to as feature crossing, feature combination or data transformation.

### Discretizing continuous variables

Besides some computational benefits, discretization can introduce non-linearity and makes it easy to build cross features. Discrete features are easy to add and remove, which helps fast model iteration. In addition, in very noisy settings discretization can reduce the noise carried by a feature and improve its expressiveness.

```python
pd.DataFrame(continuous_list)
```

	0
0	birth
1	edu_yr
2	income
3	floor_area
4	height_cm
5	weight_jin
6	work_yr
7	family_income
8	family_m
9	house
10	son
11	daughter
12	minor_child
13	marital_1st
14	s_birth
15	marital_now
16	s_income
17	f_birth
18	m_birth
19	inc_exp
20	public_service_1
21	public_service_2
22	public_service_3
23	public_service_4
24	public_service_5
25	public_service_6
26	public_service_7
27	public_service_8
28	public_service_9
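To make the idea of discretization concrete, here is a minimal sketch of equal-frequency (quantile) binning with `pd.qcut` on a single toy column. The income values are made up, and the extreme last value is included only to show that quantile bins are not stretched by one outlier.

```python
import pandas as pd

# Toy incomes (made up); the last value is an extreme outlier
income = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 1_000_000])

# q=5 asks for five equal-frequency bins; duplicates='drop' merges bins whose
# edges coincide, so a column with many tied values can end up with fewer bins
binned = pd.qcut(income, q=5, duplicates="drop")

# Integer index of the bin each value falls into
codes = binned.cat.codes

print(pd.DataFrame({"income": income, "bin": codes}))
```

Each bin receives two of the ten values, and the outlier simply lands in the top bin together with 9000. This is one reason binned features can be less sensitive to noise than the raw values.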

Bin all continuous variables, then encode each interval to produce new discrete features.

```python
for column in continuous_list:
    # Up to five equal-frequency bins; bins with identical edges are merged
    cut = pd.qcut(data_nona[column], q=5, duplicates='drop')
    cat = cut.values
    # Integer index of the bin each value falls into
    codes = cat.codes
    data_nona[column + '_discrete'] = codes
```

```python
# Cast every numeric column to int
for column, content in data_nona.items():
    if pd.api.types.is_numeric_dtype(content):
        data_nona[column] = content.astype('int')
```

## Feature selection

Discretizing the continuous variables produced new features with the suffix `_discrete`, so the original continuous features are dropped. A copy that still contains them is saved first and is used for the feature analysis.

```python
data_nona.to_csv('./data/happiness_train_complete_analysis.csv')
```

```python
data_nona.drop(continuous_list, axis=1, inplace=True)
```

```python
data_nona.to_csv('./data/happiness_train_complete_nona.csv')
```

## Feature analysis

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
```

```python
data = pd.read_csv('./data/happiness_train_complete_analysis.csv', index_col='id', parse_dates=['survey_time'])
```

```python
data.head()
```

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	inc_exp_discrete	public_service_1_discrete	public_service_2_discrete	public_service_3_discrete	public_service_4_discrete	public_service_5_discrete	public_service_6_discrete	public_service_7_discrete	public_service_8_discrete	public_service_9_discrete
id
1	4	1	12	32	59	2015-08-04 14:18:00	1	1959	1	1	...	2	0	0	0	0	0	0	0	0	0
2	4	2	18	52	85	2015-07-21 15:04:00	1	1992	1	1	...	2	4	1	2	3	4	1	4	0	0
3	4	2	29	83	126	2015-07-21 13:24:00	2	1967	1	0	...	3	4	2	3	3	3	4	4	4	2
4	5	2	10	28	51	2015-07-25 17:33:00	2	1943	1	1	...	0	4	3	2	3	3	4	4	3	2
5	4	1	7	18	36	2015-08-10 09:50:00	2	1994	1	1	...	4	0	0	0	0	0	0	0	0	0

5 rows × 168 columns

```python
data.describe()
```

	happiness	survey_type	province	city	county	gender	birth	nationality	religion	religion_freq	...	inc_exp_discrete	public_service_1_discrete	public_service_2_discrete	public_service_3_discrete	public_service_4_discrete	public_service_5_discrete	public_service_6_discrete	public_service_7_discrete	public_service_8_discrete	public_service_9_discrete
count	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	...	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000
mean	3.866466	1.405120	15.158258	42.572164	70.631903	1.530748	1964.710216	1.399724	0.880271	1.452560	...	1.725653	1.665537	1.272214	1.841365	1.613328	1.848519	1.643449	1.651732	1.654869	1.302962
std	0.818844	0.490946	8.915876	27.183764	38.736751	0.499085	16.845155	1.466409	0.324665	1.358444	...	1.338535	1.420309	1.108440	1.342524	1.499494	1.297290	1.533445	1.544477	1.511468	1.078601
min	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1921.000000	1.000000	0.000000	1.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.000000	1.000000	7.000000	18.000000	37.000000	1.000000	1952.000000	1.000000	1.000000	1.000000	...	1.000000	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000
50%	4.000000	1.000000	15.000000	42.000000	73.000000	2.000000	1965.000000	1.000000	1.000000	1.000000	...	1.000000	2.000000	1.000000	2.000000	1.000000	2.000000	1.000000	1.000000	1.000000	1.000000
75%	4.000000	2.000000	22.000000	65.000000	104.000000	2.000000	1977.000000	1.000000	1.000000	1.000000	...	3.000000	2.000000	2.000000	3.000000	3.000000	3.000000	3.000000	3.000000	3.000000	2.000000
max	5.000000	2.000000	31.000000	89.000000	134.000000	2.000000	1997.000000	8.000000	1.000000	9.000000	...	4.000000	4.000000	3.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	3.000000

8 rows × 167 columns

```python
data.info(verbose=True, null_counts=True)
```

    Int64Index: 7968 entries, 1 to 8000
    Data columns (total 168 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   happiness    7968 non-null   int64
     1   survey_type  7968 non-null   int64
     2   province     7968 non-null   int64
     3   city         7968 non-null   int64
     4   county       7968 non-null   int64
     5   survey_time  7968 non-null   datetime64[ns]
     6   gender       7968 non-null   int64
     7   birth
