
Kaggle starter competition Titanic | a top-5% solution and a walkthrough of common data-analysis techniques

Titanic:

Rose, a woman travelling in first class, had a survival rate as high as 95%, while Jack, a man travelling in second class, had a survival rate below 10%. Their fates seemed to have been sealed the moment they boarded.

Author: 李子羽
Student ID: 2015300009
Date: 2018/10/07

1. Task Description

The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and pushed countries to adopt better ship-safety regulations.

One reason the shipwreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although survival involved some luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, you are asked to complete an analysis of which kinds of people were likely to survive. In particular, you are asked to apply machine learning tools to predict which passengers survived the tragedy.

2. Data Analysis

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set_style('white')

from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, learning_curve, ShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report

Define the load_data function

def load_data():
    home_path = "data"
    train_path = os.path.join(home_path, "train.csv")
    test_path = os.path.join(home_path, "test.csv")
    submit_path = os.path.join(home_path, "gender_submission.csv")
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)
    submit_data = pd.read_csv(submit_path)
    # drop the PassengerId column
    train_data.drop("PassengerId", axis=1, inplace=True)
    test_data.drop("PassengerId", axis=1, inplace=True)
#     train_data.drop("Name", axis=1, inplace=True)
#     test_data.drop("Name", axis=1, inplace=True)
    return train_data, test_data, submit_data

Overview

The training set has 891 samples and the test set has 418. Apart from the target variable (survived or not), there are 10 feature variables.

trainData, testData, submitData = load_data()
print('train Data:{}   testData:{}'.format(trainData.shape, testData.shape))
train Data:(891, 11)   testData:(418, 10)

Feature meanings

The ten features have the following meanings:

Pclass: passenger class (1st, 2nd, 3rd)
Name: passenger name
Sex: sex
Age: age
SibSp: number of siblings or spouses the passenger had aboard
Parch: number of parents or children the passenger had aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

There is also one target variable:

Survived: survived (1), did not survive (0)

What does the data look like?

trainData.head()
   Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
print(trainData.describe())
         Survived      Pclass         Age       SibSp       Parch        Fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Which columns have missing values?

trainData.isnull().sum()
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
testData.isnull().sum()
Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64

How many people survived?

plt.subplots(figsize=[6,6])
trainData['Survived'].value_counts().plot(kind='pie',autopct='%.1f%%')
<matplotlib.axes._subplots.AxesSubplot at 0x2551337c9b0>

[figure: pie chart of survival proportions]

Correlation

# print(trainData.corr())
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(trainData.corr(), annot=True, cmap=cmap, linewidths=.5, ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x255136c0ac8>

[figure: correlation heatmap of the numeric features]

The first column of the heatmap shows how each raw feature relates to whether a passenger survived.
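To read that column off directly (a small addition, not in the original notebook), the correlations with Survived can be sorted:

# Sketch: correlation of each numeric feature with Survived, highest first
print(trainData.corr()['Survived'].sort_values(ascending=False))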

Feature analysis

Variable types

Feature variables can usually be divided into categorical features, ordinal features, and continuous features.

Categorical features represent the category of something, e.g. Sex and Embarked.

Ordinal features also represent categories, but ones that can be ordered; height, for example, can be split into tall, medium, and short. Pclass is an ordinal feature.

Continuous features can take any value within some range, e.g. Age.
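As a small aid (not part of the original notebook), here is a minimal sketch that uses dtype and cardinality to suggest which columns are categorical/ordinal candidates and which look continuous; the threshold of 10 unique values is an arbitrary choice, and high-cardinality text columns such as Name and Ticket still need human judgment:

# Hypothetical helper: suggest a feature type for each column based on
# its dtype and number of distinct values (the threshold is arbitrary)
def suggest_feature_types(df, max_categories=10):
    types = {}
    for col in df.columns:
        if df[col].dtype == object or df[col].nunique() <= max_categories:
            types[col] = 'categorical / ordinal'
        else:
            types[col] = 'continuous'
    return types

print(suggest_feature_types(trainData))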

Sex

f, ax = plt.subplots(1, 2, figsize=(10, 5))
trainData[['Survived', 'Sex']].groupby(['Sex']).mean().plot(kind='bar', ax=ax[0])
sns.countplot('Sex',hue='Survived',data=trainData,ax=ax[1])
ax[0].set_title('Survived Rate vs Sex')
ax[1].set_title('Survived number vs Sex')
# sns.countplot(data=trainData, hue='Sex','Survived', ax=ax[1])
Text(0.5,1,'Survived number vs Sex')

[figure: survival rate and survivor counts by Sex]

It is clear that women had a much higher survival probability than men.

Pclass

f, ax =plt.subplots(1, 2, figsize=(10, 5))
trainData[['Pclass', 'Survived']].groupby(['Pclass']).mean().plot(kind='bar', ax=ax[0])
ax[0].set_title('Survived Rate vs Pclass')
sns.countplot('Pclass', hue='Survived', data=trainData, ax=ax[1])
ax[1].set_title('Survived number vs Pclass')
Text(0.5,1,'Survived number vs Pclass')

[figure: survival rate and survivor counts by Pclass]

We can see that the higher the passenger's class, the higher the survival rate.

Embarked

f, ax =plt.subplots(1, 2, figsize=(10, 5))
trainData[['Embarked', 'Survived']].groupby(['Embarked']).mean().plot(kind='bar', ax=ax[0])
ax[0].set_title('Survived Rate vs Embarked')
sns.countplot('Embarked', hue='Survived', data=trainData, ax=ax[1])
ax[1].set_title('Survived number vs Embarked')
Text(0.5,1,'Survived number vs Embarked')

[figure: survival rate and survivor counts by Embarked]

Interestingly, the nominal variable Embarked is also related to survival: passengers who boarded at different ports had different survival rates.

Fare

sns.FacetGrid(trainData, hue='Survived', aspect=2, size=5).map(sns.kdeplot, 'Fare', shade=True).add_legend()
<seaborn.axisgrid.FacetGrid at 0x255136c09e8>

[figure: KDE of Fare by survival]

We can see that the Fare distribution is heavily skewed, so we apply a log(x + 1) transform.

x = pd.DataFrame()
x['Fare'] = trainData['Fare'].map(lambda x: np.log(x+1))
x['Survived'] = trainData['Survived']
sns.FacetGrid(x, hue='Survived', aspect=2, size=5).map(sns.kdeplot, 'Fare', shade=True).add_legend()
<seaborn.axisgrid.FacetGrid at 0x25513ae6630>

[figure: KDE of log(Fare + 1) by survival]

We can see that wealthier passengers had a noticeably higher survival probability.

Combined analysis

f = sns.factorplot('Pclass', 'Survived', hue='Sex', data=trainData, size=4, aspect=1.5)

[figure: survival rate vs Pclass, split by Sex]

We can see that upper-class women had a very high survival rate, close to 100%.

sns.FacetGrid(trainData, row='Sex' , hue='Survived', aspect=3, size=3).map(
    sns.kdeplot , 'Age' , shade=True ).set(xlim=(trainData['Age'].min(),trainData['Age'].max())).add_legend()
<seaborn.axisgrid.FacetGrid at 0x25513bd17f0>

[figure: KDE of Age by survival, one row per Sex]

We can see that young and middle-aged men and adolescent girls had lower survival rates, while young boys, elderly men, and middle-aged women had higher survival rates.

sns.FacetGrid(trainData, row='Pclass', hue='Survived', aspect=3, size=3).map(
    sns.kdeplot, 'Age', shade=True).set(xlim=(trainData['Age'].min(),trainData['Age'].max())).add_legend()
<seaborn.axisgrid.FacetGrid at 0x25513e00630>

[figure: KDE of Age by survival, one row per Pclass]

Seaborn's pairplot function lets us examine:

- the relationship between each pair of numeric variables (the off-diagonal plots)
- how the distribution of each numeric feature differs by the target variable 'Survived' (the diagonal plots)
# the 'Parch'-'Parch' panel has a rendering bug here, but it runs fine in PyCharm
sns.pairplot(trainData,vars = ['Pclass', 'SibSp', 'Parch', 'Fare'], aspect = 2.5, size = 2.5, 
             hue = 'Survived', diag_kind = 'kde', diag_kws=dict(shade=True))
<seaborn.axisgrid.PairGrid at 0x2551416fcf8>

[figure: pairplot of Pclass, SibSp, Parch and Fare, coloured by Survived]

Data preprocessing

Converting categorical variables to numeric: Sex, Embarked

Categorical variables are usually converted to numeric as follows: a binary categorical variable becomes a single 0/1 column, and a categorical variable with n (n > 2) classes becomes n 0/1 dummy columns.

Embarked and Sex are categorical variables: Sex has two classes and Embarked has three.

full = trainData.append(testData, ignore_index=True, sort=False)
Sex = pd.Series(np.where(full['Sex']=='male', 1, 0), name='Sex')
Embarked= pd.get_dummies(full['Embarked'],prefix='Embarked')
print(Embarked.head())
   Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1
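As a side note (not part of the original solution), pd.get_dummies can cover both cases in a single call; a minimal sketch on the same full DataFrame, where drop_first=True keeps one column for the binary Sex variable and k-1 columns for Embarked:

# Sketch: one-hot encode Sex and Embarked together, dropping one level of each
encoded = pd.get_dummies(full[['Sex', 'Embarked']], prefix=['Sex', 'Embarked'], drop_first=True)
print(encoded.head())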

Extracting the title from the name: Name → title

full['title'] = 0
full['title'] = full['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip(',. '))
# show all distinct title categories
titleUnique = full['title'].drop_duplicates()
print(np.array(titleUnique).T)
['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer' 'Dona']
full['title'].replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','the Countess','Jonkheer','Col','Rev','Capt','Sir','Dona', 'Don'],
    ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mrs','Mr'],inplace=True)
title = pd.get_dummies(full['title'],prefix='Titile')
title.head()
   Titile_Master  Titile_Miss  Titile_Mr  Titile_Mrs  Titile_Other
0              0            0          1           0             0
1              0            0          0           1             0
2              0            1          0           0             0
3              0            0          0           1             0
4              0            0          1           0             0

We can see that the title variable is quite discriminative.

full[['title', 'Survived']].groupby('title').mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x25516c60550>

[figure: survival rate by title]

Filling in missing values: Fare, Age

For numeric variables, the simplest approach is to fill missing values with the mean. The function for filling NaN values is [DataFrame].fillna().

We fill missing Age values based on the passenger's title, and fill the missing Fare with the mean.

full.loc[full['Fare'].isnull(),'Fare'] = full['Fare'].mean()
fare = full['Fare'].map(lambda x:np.log(x+1))
full.groupby('title')['Age'].mean()
title
Master     5.482642
Miss      21.834533
Mr        32.545531
Mrs       37.046243
Other     44.923077
Name: Age, dtype: float64
# fill only the missing ages, using the (rounded) mean age of each title group
full.loc[(full['Age'].isnull()) & (full['title']=='Master'), 'Age'] = 6
full.loc[(full['Age'].isnull()) & (full['title']=='Miss'), 'Age'] = 22
full.loc[(full['Age'].isnull()) & (full['title']=='Mr'), 'Age'] = 33
full.loc[(full['Age'].isnull()) & (full['title']=='Mrs'), 'Age'] = 37
full.loc[(full['Age'].isnull()) & (full['title']=='Other'), 'Age'] = 45
age = full['Age']
age.isnull().any()
False
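As an alternative to the hard-coded per-title means above, a minimal sketch (not from the original notebook) that fills each missing Age with its title group's mean Age in one line:

# Sketch: fill missing ages with the mean Age of passengers sharing the same title
full['Age'] = full['Age'].fillna(full.groupby('title')['Age'].transform('mean'))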

Constructing a new feature: Cabin (4 groups)

cabin = pd.DataFrame()
full['Cabin'] = full.Cabin.fillna('U')
# this would raise an error ('Carbin' is misspelled): cabin['Cabin'] = full['Carbin'].fillna('U')
cabin['Cabin'] = full.Cabin.map(lambda cabin:cabin[0])
full['Cabin'] = cabin['Cabin']
# cabin = pd.get_dummies(cabin['Cabin'], prefix='Cabin')
# print(cabin.head())
f, ax = plt.subplots(1, 2, figsize=(20,5))
full[['Cabin', 'Survived']].groupby(['Cabin']).mean().plot(kind='bar',ax=ax[0])
sns.countplot('Cabin', hue='Survived', data=full, ax=ax[1])
ax[0].set_title('Survived Rate vs Cabin')
ax[1].set_title('Survived count vs Cabin')
plt.show()
print(full['Cabin'].value_counts())

[figure: survival rate and survivor counts by Cabin letter]

U    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: Cabin, dtype: int64

Merge the cabin letters with few samples into groups whose survival rates are similar, and merge letters with similar survival rates together:

Merge T and U; merge A and G; merge B, D, and E; merge C and F (a more compact equivalent mapping is sketched after the output below).

full.loc[full.Cabin=='T', 'Cabin'] = 1
full.loc[full.Cabin=='U', 'Cabin'] = 1
full.loc[full.Cabin=='A', 'Cabin'] = 2
full.loc[full.Cabin=='G', 'Cabin'] = 2
full.loc[full.Cabin=='C', 'Cabin'] = 3
full.loc[full.Cabin=='F', 'Cabin'] = 3
full.loc[full.Cabin=='B', 'Cabin'] = 4
full.loc[full.Cabin=='D', 'Cabin'] = 4
full.loc[full.Cabin=='E', 'Cabin'] = 4
cabin = pd.get_dummies(full['Cabin'], prefix='Cabin')
cabin.head()
   Cabin_1  Cabin_2  Cabin_3  Cabin_4
0        1        0        0        0
1        0        0        1        0
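For reference, the grouping above can be expressed more compactly with a single mapping; this is only a sketch of an equivalent form, not code from the original notebook:

# Sketch: equivalent to the nine .loc assignments above, using one dict
cabin_group = {'T': 1, 'U': 1, 'A': 2, 'G': 2, 'C': 3, 'F': 3, 'B': 4, 'D': 4, 'E': 4}
# full['Cabin'] = full['Cabin'].map(cabin_group)  # use instead of the .loc lines above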