Kaggle Competition: Titanic Survivor Prediction

阿新 • Published: 2018-12-19

Kaggle Competition: Titanic Survivor Prediction (Part 1)


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Loading the data

titanic = pd.read_csv(r'E:\DataScience\ML\Titanic\train.csv')

titanic_test = pd.read_csv(r'E:\DataScience\ML\Titanic\test.csv')

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

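Variable	Meaning	Key
survival	Survived or not	0 = No, 1 = Yes
pclass	Social class (ticket class)	1 = upper (elite), 2 = middle, 3 = lower (ordinary)
sex	Sex
Age	Age
sibsp	Number of siblings/spouses aboard
parch	Number of parents/children aboard
ticket	Ticket number
fare	Ticket fare
cabin	Cabin number
embarked	Port of embarkation	C = Cherbourg, Q = Queenstown, S = Southampton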

# Basic summary statistics

titanic.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

# Overview of columns, dtypes and non-null counts

titanic.info()

# Count missing values per column

print(titanic.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

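Data cleaning

Handling missing values

# You can fill the missing values of the whole DataFrame
# titanic.fillna(0)

# Or fill a single column
# titanic.Age.fillna(0)

# Fill missing ages with a placeholder value; assign the result back
# rather than calling fillna(..., inplace=True) on a column selection
titanic['Age'] = titanic['Age'].fillna(-30)

# Check the remaining missing values
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64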
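As a small illustration of the fill step, the sketch below uses a made-up four-row frame (`toy` is a hypothetical name, not the competition data). It fills `Age` with the same placeholder while also keeping a flag for the rows that were originally missing. The `Age_missing` column is an addition for illustration only and is not used elsewhere in this article.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                    'Cabin': ['C85', None, None, 'E46']})

# Remember which ages were missing before overwriting them
toy['Age_missing'] = toy['Age'].isnull().astype(int)

# Assign the filled column back instead of relying on inplace=True
toy['Age'] = toy['Age'].fillna(-30)

# Age now has no missing values; Cabin is untouched
print(toy.isnull().sum())
```

Keeping such a flag lets later steps tell real ages apart from the placeholder.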

Data analysis

Effect of sex (Sex) on survival

# Simple summary counts
titanic.groupby(['Sex','Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

# Survival rate by sex

df_sex = titanic[['Sex','Survived']].groupby(['Sex']).mean()
df_sex

	Survived
Sex
female	0.742038
male	0.188908

# Bar chart

df_sex.plot(kind='bar',
            figsize=(8,6),
            rot=0,
            fontsize=18,
            stacked=True)
plt.grid(True, linestyle='--')

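This contradicts the common intuition that men are better at surviving than women: about 74% of women survived against about 19% of men, which suggests that "ladies first" played a large part.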
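The survival rate can be read straight from `mean()` because `Survived` is coded as 0/1, so the mean of a group equals survivors divided by group size. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('Sex')['Survived']

# Mean of a 0/1 column = share of ones
rate = grouped.mean()

# The same number computed explicitly as survivors / passengers
by_hand = grouped.sum() / grouped.count()

print(rate)
print(by_hand)
```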

Relationship between social class (Pclass) and survival

# Counts
titanic.groupby(['Pclass', 'Survived'])['Pclass'].count()

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
Name: Pclass, dtype: int64

df_pclass = titanic[['Pclass', 'Survived']].groupby(['Pclass']).mean()
df_pclass

	Survived
Pclass
1	0.629630
2	0.472826
3	0.242363

# Bar chart

df_pclass.plot(kind='bar',
               rot=0,
               fontsize=18,
               figsize=(8,6))
plt.show()

The higher the class, the higher the survival rate. Does "ladies first" hold across class boundaries?

df_psex = titanic[['Pclass', 'Sex', 'Survived']].groupby(['Pclass', 'Sex']).mean()
df_psex

		Survived
Pclass	Sex
1	female	0.968085
1	male	0.368852
2	female	0.921053
2	male	0.157407
3	female	0.500000
3	male	0.135447

df_psex.plot(kind='bar',
             rot=0,
             fontsize=12,
             figsize=(8,6))
plt.show()

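"Ladies first" does cross class boundaries: women in third class (50.0%) survived at a higher rate than men in first class (36.9%).
Still, the differences in survival rate between classes cannot be ignored.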
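To compare the two sexes within each class side by side, the grouped result can be pivoted with `unstack`. The sketch below rebuilds the rates from the table above by hand (so it runs without the CSV); the `gap` column is an illustrative addition.

```python
import pandas as pd

# Survival rates copied from the grouped table above
idx = pd.MultiIndex.from_product([[1, 2, 3], ['female', 'male']],
                                 names=['Pclass', 'Sex'])
rates = pd.Series([0.968085, 0.368852, 0.921053, 0.157407, 0.500000, 0.135447],
                  index=idx, name='Survived')

# One row per class, one column per sex
wide = rates.unstack('Sex')

# Difference between female and male survival within each class
wide['gap'] = wide['female'] - wide['male']

print(wide.round(3))
```

With the real data, `df_psex['Survived'].unstack('Sex')` should give the same layout.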

Effect of age (Age) on survival

Plot the age distribution by class and by sex, and how it relates to survival.

# Age distribution by class and by sex, split by survival
# (seaborn 0.12+ requires x and y to be passed as keyword arguments)

fig, ax = plt.subplots(1, 2, figsize=(18,8))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=titanic, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age  vs  Survived',size=18)
ax[0].set_yticks(range(0, 110, 10))

sns.violinplot(x='Sex', y='Age', hue='Survived', data=titanic, split=True, ax=ax[1])
ax[1].set_title('Sex and Age  vs  Survived',size=18)
ax[1].set_yticks(range(0, 110, 10))
plt.show()

# Overall age distribution
plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
titanic['Age'].hist(bins=20)
plt.xlabel('Age')
plt.ylabel('Num')

plt.subplot(1,2,2)
titanic.boxplot(column='Age', showfliers=False)
plt.show()

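Because of the way missing ages were filled, one bar of the histogram is far taller than its neighbours: all 177 placeholder values fall into the same bin.

page = sns.FacetGrid(titanic, hue="Survived",aspect=4)
# fill=True replaces the deprecated shade=True (seaborn 0.11+)
page.map(sns.kdeplot,'Age',fill=True)
page.set(xlim=(-40, titanic['Age'].max()))
page.add_legend()
plt.show()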

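Children and middle-aged passengers were more likely to be rescued, so the rule looks like "women and children first". The placeholder group (missing age) contains more deaths, presumably because the ages of those who died often could not be recorded.

f, ax = plt.subplots(figsize=(8,3))
ax.set_title('Sex Age dist', size=20)
# kdeplot replaces the deprecated distplot(hist=False).
# Note: dropna() drops every row with any missing value (mostly Cabin), not only missing ages.
sns.kdeplot(titanic[titanic.Sex=='female'].dropna().Age, color='pink', label='female', ax=ax)
sns.kdeplot(titanic[titanic.Sex=='male'].dropna().Age, color='blue', label='male', ax=ax)
ax.legend(fontsize=15)
plt.show()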

Women skew somewhat younger; among children and among middle-aged and older passengers, men are more numerous.

f, ax = plt.subplots(figsize=(8,3))
ax.set_title('Pclass Age dist', size=20)
# kdeplot + rugplot replace the deprecated distplot(hist=False, rug=True)
for pclass, color in [(1, 'pink'), (2, 'blue'), (3, 'g')]:
    ages = titanic[titanic.Pclass==pclass].dropna().Age
    sns.kdeplot(ages, color=color, label='P%d' % pclass, ax=ax)
    sns.rugplot(ages, color=color, ax=ax)
ax.legend(fontsize=15)
plt.show()

The higher the class, the older the passengers tend to be.

Effect of having siblings or a spouse aboard (SibSp) on survival

# Split the data into passengers with and without siblings/spouses aboard

df_sibsp = titanic[titanic['SibSp'] != 0]
df_sibsp_no = titanic[titanic['SibSp'] == 0]

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
# sort_index() fixes the slice order to 0, 1 so that it always matches the labels
df_sibsp['Survived'].value_counts().sort_index().plot(kind='pie',labels=['Not survived', 'Survived'], autopct = '%1.1f%%')
plt.xlabel('sibsp',fontsize=18)

plt.subplot(1,2,2)
df_sibsp_no['Survived'].value_counts().sort_index().plot(kind='pie',labels=['Not survived', 'Survived'], autopct = '%1.1f%%')
plt.xlabel('sibsp_no',fontsize=18)

plt.show()

Passengers travelling with a sibling or spouse appear more likely to survive than those travelling without one.
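One caveat when pie slices are labelled by position: `value_counts()` orders its result by frequency, not by value, so the first slice is whichever outcome is more common in that subset. A minimal sketch on a made-up series (hypothetical data) shows the difference and how `sort_index()` pins the order:

```python
import pandas as pd

# Made-up outcomes where survivors (1) outnumber non-survivors (0)
survived = pd.Series([1, 1, 1, 0, 0], name='Survived')

# Ordered by frequency: 1 comes first here
by_frequency = survived.value_counts()

# Ordered by label: always 0, then 1
by_label = survived.value_counts().sort_index()

print(by_frequency.index.tolist())
print(by_label.index.tolist())
```

With a fixed order, a positional label list such as `['Not survived', 'Survived']` cannot end up swapped.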

Effect of having parents or children aboard (Parch) on survival

Same method as above.

# Split by whether the passenger has parents/children aboard
df_parch = titanic[titanic['Parch'] != 0]
df_parch_no = titanic[titanic['Parch'] == 0]

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
# Plot the Parch groups (not the SibSp groups); sort_index() keeps the slice order 0, 1
df_parch['Survived'].value_counts().sort_index().plot(kind='pie',labels=['Not survived', 'Survived'], autopct = '%1.1f%%')
plt.xlabel('Parch',fontsize=18)

plt.subplot(1,2,2)
df_parch_no['Survived'].value_counts().sort_index().plot(kind='pie',labels=['Not survived', 'Survived'], autopct = '%1.1f%%')
plt.xlabel('Parch_no',fontsize=18)

plt.show()

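As seen earlier, children were given priority, and children usually travel with their parents. Even among adults, helping one another likely raises the chance of survival.

Effect of the number of relatives on survival

Does having more relatives aboard mean a better chance of survival?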

fig,ax = plt.subplots(1, 2, figsize=(12,8))
titanic[['Parch','Survived']].groupby(['Parch']).mean().plot(kind='bar',ax=ax[0])
ax[0].set_title('Parch and Survived')

titanic[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')
plt.show()


titanic['fam_size'] = titanic['SibSp'] + titanic['Parch'] + 1
titanic[['fam_size','Survived']].groupby(['fam_size']).mean().plot.bar(figsize=(8,6))
plt.show()

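The chart shows that small families of up to four people have the highest survival rates, presumably because such a group is large enough for mutual help yet small enough to move quickly.
The survival rate rises again for families of seven; one possible explanation is that with more members, the chance that at least one of them survives goes up.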
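Because the large family sizes contain very few passengers, it may help to look at the group size next to the rate. A minimal sketch on made-up rows (the `toy` frame is hypothetical), using the same `fam_size` definition as above:

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'SibSp':    [0, 1, 1, 4, 0],
                    'Parch':    [0, 0, 2, 2, 0],
                    'Survived': [0, 1, 1, 0, 1]})

# Family size = siblings/spouses + parents/children + the passenger
toy['fam_size'] = toy['SibSp'] + toy['Parch'] + 1

# Survival rate together with the number of passengers behind it
summary = toy.groupby('fam_size')['Survived'].agg(['mean', 'count'])
print(summary)
```

A rate computed from a handful of passengers is noisy, so the `count` column indicates how much weight each bar deserves.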

Effect of fare (Fare) on survival

# Fare distribution
titanic['Fare'].plot(kind='hist',bins=100,figsize=(10,6), grid=True)

titanic.boxplot(column='Fare', by='Pclass',showfliers=False,figsize=(10,6))
plt.show()


titanic['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

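# Fare of survivors versus non-survivors
titanic.boxplot(column='Fare', by='Survived',showfliers=False,showmeans=True)

Survivors generally paid higher fares, which is consistent with the earlier finding that higher classes had better survival rates.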

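Effect of cabin number (Cabin) on survival

From what I have read, a passenger's cabin should matter a great deal for survival, especially for passengers on the lower decks: the lower cabins flooded quickly, and it is easy to imagine the routes to the boat deck being chaotic, which greatly reduced the chance of survival. However, this field has 687 missing values, so only a simple analysis is done here. (I believe fare and cabin should correspond to each other; historical records mapping fares to cabins would be ideal.)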

titanic.Cabin.isnull().value_counts()

True     687
False    204
Name: Cabin, dtype: int64

titanic.groupby(by=titanic.Cabin.isnull())['Survived'].mean()

Cabin
False    0.666667
True     0.299854
Name: Survived, dtype: float64

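As shown above, passengers with a missing Cabin value have a much lower survival rate (about 30% versus 67%), so whether Cabin is missing can itself be used as a feature.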
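A minimal sketch of such a missingness feature, on made-up rows (the `toy` frame and the `Has_cabin` column name are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'Cabin':    ['C85', np.nan, 'E46', np.nan, np.nan],
                    'Survived': [1, 0, 1, 0, 1]})

# 1 if a cabin is recorded, 0 if the value is missing
toy['Has_cabin'] = toy['Cabin'].notnull().astype(int)

# Survival rate for each value of the indicator
rate = toy.groupby('Has_cabin')['Survived'].mean()
print(rate)
```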

titanic['Cabin_fir'] = titanic.Cabin.fillna('0').str.split(' ').apply(lambda x: x[0][0])
df_cabin_fir = titanic.groupby(by='Cabin_fir')['Survived'].mean()
print(df_cabin_fir)

df_cabin_fir.plot(kind='bar',
                 rot=0,
                 legend=True,figsize=(10,8),
                 fontsize=12)
plt.show()

Cabin_fir
0    0.299854
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
Name: Survived, dtype: float64

# Select several columns with a list; tuple-style ['Fare','Survived'] no longer works in pandas 2.x
df_cabin_fare = titanic.groupby(by='Cabin_fir')[['Fare','Survived']].mean()
df_cabin_fare

	Fare	Survived
Cabin_fir
0	19.157325	0.299854
A	39.623887	0.466667
B	113.505764	0.744681
C	100.151341	0.593220
D	57.244576	0.757576
E	46.026694	0.750000
F	18.696792	0.615385
G	13.581250	0.500000
T	35.500000	0.000000

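Among passengers with a recorded cabin, decks B and C look like the top-tier suites with the highest fares, D and E look like premium cabins with mid-range fares, and the rest are ordinary cabins. Survival rates roughly follow this ordering. Why deck C has a lower survival rate than B, D and E is not analysed here; it may be related to cabin location, a higher share of men, or older passengers.

Effect of port of embarkation (Embarked) on survival

The Titanic sailed from Southampton in England and called at Cherbourg-Octeville in France and Queenstown in Ireland. (Source: Baidu Baike)

S = Southampton, C = Cherbourg (Cherbourg-Octeville), Q = Queenstown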

titanic.groupby(by='Embarked')['Survived'].mean().plot(kind='bar', rot=0, fontsize=15, legend=True)
plt.show()

# Select several columns with a list; tuple-style ['Survived','Fare'] no longer works in pandas 2.x
df_embarked = titanic.groupby(by='Embarked')[['Survived','Fare']].agg(['mean', 'count'])
df_embarked

	Survived		Fare
	mean	count	mean	count
Embarked
C	0.553571	168	59.954144	168
Q	0.389610	77	13.276030	77
S	0.336957	644	27.079812	644

ax = plt.figure(figsize=(10,6)).add_subplot(111)
ax.set_xlim([-40, 80])
sns.kdeplot(titanic[titanic.Embarked=='C'].Age, ax=ax, label='C')
sns.kdeplot(titanic[titanic.Embarked=='Q'].Age, ax=ax, label='Q')
sns.kdeplot(titanic[titanic.Embarked=='S'].Age, ax=ax, label='S')
ax.legend(fontsize=18)
plt.show()

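Passengers who boarded at C and S have similar age distributions; many of those who boarded at Q have no recorded age.
Comparing C and S, port C has more children and elderly passengers.

Effect of name (Name) on survival

A first look at the Name field shows that a name reveals not only sex but also social status, age, occupation and so on,
for example Master or Miss.

# Count the titles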
titanic['Title'] = titanic.Name.apply(lambda x: x.split(',')[1].split('.')[0])
titanic['Title'].value_counts()

 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Mlle              2
 Major             2
 Col               2
 the Countess      1
 Ms                1
 Don               1
 Capt              1
 Mme               1
 Sir               1
 Lady              1
 Jonkheer          1
Name: Title, dtype: int64

# Count given names (the text after the title)
titanic.Name.apply(lambda x: x.split(',')[1].split('.')[1]).value_counts()[:10]

 John             9
 James            7
 William          6
 Mary             6
 William Henry    4
 Ivan             4
 William John     4
 Bertha           4
 Anna             3
 Victor           3
Name: Name, dtype: int64

titanic[['Title','Survived']].groupby(['Title']).mean()

	Survived
Title
Capt	0.000000
Col	0.500000
Don	0.000000
Dr	0.428571
Jonkheer	0.000000
Lady	1.000000
Major	0.500000
Master	0.575000
Miss	0.697802
Mlle	1.000000
Mme	1.000000
Mr	0.156673
Mrs	0.792000
Ms	1.000000
Rev	0.000000
Sir	1.000000
the Countess	1.000000

# Survival rate by title
titanic[['Title','Survived']].groupby(['Title']).mean().plot.bar(rot=45, figsize=(15,6), fontsize=12)
plt.show()

Title is indeed related to survival, because a title usually reflects a person's sex and social status.
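The split-based extraction used above leaves a leading space on every title (visible in the counts). As an alternative sketch, a regular expression with `str.extract` returns the title without surrounding whitespace. The three names are taken from the first rows of the data shown earlier; the pattern itself is an assumption about the "Surname, Title. Given names" layout.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Heikkinen, Miss. Laina',
                   'Futrelle, Mrs. Jacques Heath (Lily May Peel)'])

# Split-based version from the article: keeps the space after the comma
split_titles = names.apply(lambda x: x.split(',')[1].split('.')[0])

# Regex version: capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(split_titles.tolist())
print(titles.tolist())
```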

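From another angle: Western names often include family names, noble titles and the like, so might a longer name reflect a family's history and standing? Does name length indicate status and thereby affect whether a passenger was rescued?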

titanic['name_len'] = titanic['Name'].apply(len)
df_namelen = titanic[['name_len','Survived']].groupby(['name_len'],as_index=False).mean()
df_namelen.plot.bar(x='name_len',y='Survived',figsize=(18,6),rot=0,colormap='Blues_r',alpha=0.6,fontsize=12)
plt.show()

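The guess appears to hold: name length does show some relationship with survival.

Ticket

Ticket has many distinct values. On inspection, the first character of the ticket number seems to indicate an area of the ship, so it is extracted and analysed.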

titanic['Ticket_Lett'] = titanic['Ticket'].apply(lambda x: str(x)[0])
titanic['Ticket_Lett'] = titanic['Ticket_Lett'].apply(lambda x: str(x))
titanic.groupby(titanic['Ticket_Lett'])['Survived'].mean()

Ticket_Lett
1    0.630137
2    0.464481
3    0.239203
4    0.200000
5    0.000000
6    0.166667
7    0.111111
8    0.000000
9    1.000000
A    0.068966
C    0.340426
F    0.571429
L    0.250000
P    0.646154
S    0.323077
W    0.153846
Name: Survived, dtype: float64

titanic.groupby(titanic['Ticket_Lett'])['Survived'].mean().plot.bar(rot=0)

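Survival rates differ by the first character of the ticket, so it can serve as a feature.

The analysis above shows that survival depended on several factors, including sex, age and class. In this disaster, physically strong men died at an unusually high rate while women and children were more likely to survive. That is abnormal in one sense yet normal in another, and is probably a result of civilised norms.

So, had you been aboard the Titanic, would you have been rescued? The next part uses machine learning algorithms to predict whether another group of passengers survived.

Feature engineering

Variable conversion

The purpose of variable conversion is to turn the data into a form a model can use. Different models accept different data types; scikit-learn requires numeric input, so non-numeric raw columns must be converted to numeric ones.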
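As a minimal sketch of what such a conversion looks like, the example below one-hot encodes two text columns of a made-up frame (the `toy` frame is hypothetical) with `pd.get_dummies`:

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'Sex':      ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'S']})

# Each category becomes its own 0/1 column, named <column>_<category>
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked']).astype(int)
print(encoded)
```

One-hot columns avoid implying an order between categories, which a single integer code would.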

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')

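os.chdir(r'E:\DataScience\ML\Titanic')  # raw string so the backslashes are taken literally
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
# ignore_index=True gives the combined frame a unique 0..n-1 index
combine = pd.concat([data_train,data_test], ignore_index=True)

Feature engineering means extracting, from the raw fields, features that influence the outcome to a greater or lesser degree, and using them as the basis for training the model. We generally start with the features that contain missing values.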

Embarked

This column has only a couple of missing values, so we fill them with the mode:

# Fill missing values; the mode is S
combine['Embarked'] = combine['Embarked'].fillna('S')

# One-hot (dummy) encoding
df = pd.get_dummies(combine['Embarked'], prefix='Embarked')
combine = pd.concat([combine, df], axis=1).drop('Embarked', axis=1)

Name_length

combine['Name_length'] = combine['Name'].apply(len)

Title

combine['Title'] = combine['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x:x.split('.')[0])
combine['Title'] = combine['Title'].apply(lambda x: x.strip())
combine['Title'] = combine['Title'].replace(['Major','Capt','Rev','Col','Dr'],'officer')
combine['Title'] = combine['Title'].replace(['Mlle','Miss'], 'Miss')
combine['Title'] = combine['Title'].replace(['Mme','Ms','Mrs'], 'Mrs')
combine['Title'] = combine['Title'].replace(['Master','Jonkheer'], 'Master')
combine['Title'] = combine['Title'].replace(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty')
df = pd.get_dummies(combine['Title'],prefix='Title')
combine = pd.concat([combine,df], axis=1)

Fare

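This column has only one missing value; we fill it with the mean fare of the passenger's class.

combine['Fare'] = combine['Fare'].fillna(combine.groupby('Pclass')['Fare'].transform('mean'))

A quick count of Ticket shows that some ticket numbers repeat. Combined with the number of relatives, the names, the fares and the classes, this suggests that some tickets were group tickets, so the fare of a group ticket needs to be split across the people sharing it.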

combine['Group_Ticket'] = combine['Fare'].groupby(by=combine['Ticket']).transform('count')
combine['Fare'] = combine['Fare'] / combine['Group_Ticket']
combine.drop(['Group_Ticket'], axis=1, inplace=True)
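The per-person split can be checked on a made-up example (the `toy` frame and the `Fare_per_person` column are hypothetical). It assumes, as the article does, that `Fare` holds the total price of a shared ticket.

```python
import pandas as pd

# Made-up data: three passengers share ticket A1, one holds B2
toy = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'A1'],
                    'Fare':   [30.0, 30.0, 8.0, 30.0]})

# Number of passengers on each ticket, repeated on every row of that ticket
toy['Group_Ticket'] = toy.groupby('Ticket')['Fare'].transform('count')

# Split the ticket price evenly across its holders
toy['Fare_per_person'] = toy['Fare'] / toy['Group_Ticket']
print(toy)
```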

# Bin the per-person fare into four bands (cut points are the quartiles of the raw Fare shown earlier)
combine['Fare_1'] = np.where(combine['Fare'] <= 7.91,1,0)
combine['Fare_2'] = np.where((combine['Fare'] > 7.91) & (combine['Fare'] <= 14.454),1,0)
combine['Fare_3'] = np.where((combine['Fare'] > 14.454)& (combine['Fare'] <= 31),1,0)
combine['Fare_4'] = np.where((combine['Fare'] > 31),1,0)
combine = combine.drop('Fare',axis=1)
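The four `np.where` lines can also be written as one binning step. The sketch below is an alternative, not the article's method: it uses `pd.cut` with the same cut points on a made-up series and checks that the first band agrees with the `np.where` form.

```python
import numpy as np
import pandas as pd

# Made-up fares, including values that sit exactly on the cut points
fare = pd.Series([0.0, 7.91, 8.05, 14.454, 20.0, 31.0, 71.2833])

# Right-closed intervals: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf)
bins = [-np.inf, 7.91, 14.454, 31, np.inf]
band = pd.cut(fare, bins=bins, labels=['Fare_1', 'Fare_2', 'Fare_3', 'Fare_4'])

# One 0/1 column per band
dummies = pd.get_dummies(band).astype(int)

# Same first band as the np.where version used in the article
manual = np.where(fare <= 7.91, 1, 0)
print(dummies)
```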

Dead_female_family & Survive_male_family

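The earlier analysis suggests that families tend to share the same fate: if one woman in a family died, the other women in it also tended to die, and if a man in a family survived, the other men also tended to survive. To keep the model from blindly predicting that every woman survives and every man dies, these two cases are flagged separately.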

combine['Fname'] = combine['Name'].apply(lambda x:x.split(',')[0])
combine['Familysize'] = combine['SibSp']+combine['Parch']
dead_female_Fname = list(set(combine[(combine.Sex=='female') & (combine.Age>=12) & (combine.Survived==0) & (combine.Familysize>1)]['Fname'].values))
survive_male_Fname = list(set(combine[(combine.Sex=='male') & (combine.Age>=12) & (combine.Survived==1) & (combine.Familysize>1)]['Fname'].values))
combine['Dead_female_family'] = np.where(combine['Fname'].isin(dead_female_Fname),1,0)
combine['Survive_male_family'] = np.where(combine['Fname'].isin(survive_male_Fname),1,0)
combine = combine.drop(['Name','Fname','Familysize'],axis=1)

Age

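Age has many missing values. They can be filled with a group statistic or predicted with a machine learning model; here we take the first approach and fill with the median age of each Title and Pclass group.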

group = combine.groupby(['Title', 'Pclass'])['Age']
combine['Age'] = group.transform(lambda x: x.fillna(x.median()))
combine['IsChild'] = np.where(combine['Age']<=12,1,0)
# combine['Age'] = pd.cut(combine['Age'],5)
combine = combine.drop(['Title'],axis=1)
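A small made-up example of the group-wise median fill (the `toy` frame is hypothetical) makes it easier to see what `transform` does:

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age in each (Title, Pclass) group
toy = pd.DataFrame({'Title':  ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
                    'Pclass': [3, 3, 3, 1, 1, 1],
                    'Age':    [20.0, 30.0, np.nan, 4.0, np.nan, 10.0]})

# Within each group, replace missing ages by that group's median
group = toy.groupby(['Title', 'Pclass'])['Age']
toy['Age'] = group.transform(lambda s: s.fillna(s.median()))

# Flag children using the same threshold as above
toy['IsChild'] = np.where(toy['Age'] <= 12, 1, 0)
print(toy)
```

Note that a group whose ages are all missing would stay missing, since its median is undefined.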

Cabin

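Cabin has too many missing values, but the earlier analysis shows that whether the value is present is related to survival, so we split it into two categories.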

combine['Cabin_0'] = np.where(combine['Cabin'].isnull(),1,0)
combine['Cabin_1'] = np.where(combine['Cabin'].isnull(),0,1)
combine = combine.drop('Cabin',axis=1)

Pclass

Pclass only needs to be converted to dummy variables.

df = pd.get_dummies(combine['Pclass'], prefix='Pclass')
combine = pd.concat([combine, df], axis=1).drop('Pclass',axis=1)

Ticket

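Ticket is hard to analyse as a whole because it mixes letters and digits. Looking only at its leading character, though, the prefix seems to carry location information, and location affects the escape route, so this part is extracted as a feature.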

combine['Ticket_Lett'] = combine['Ticket'].apply(lambda x: str(x)[0])
combine['Ticket_Lett'] = combine['Ticket_Lett'].apply(lambda x: str(x))

combine['High_Survival_Ticket'] = np.where(combine['Ticket_Lett'].isin(['1', '2', 'P','9','F']),1,0)
combine['mid_Survival_Ticket'] = np.where(combine['Ticket_Lett'].isin(['3','4','L','S']),1,0)
combine['Low_Survival_Ticket'] = np.where(combine['Ticket_Lett'].isin(['A','W','6','7']),1,0)
combine = combine.drop(['Ticket','Ticket_Lett'],axis=1)

Sex

One-hot encode Sex.

df = pd.get_dummies(combine['Sex'], prefix='Sex')
combine = pd.concat([combine, df],axis=1).drop('Sex',axis=1)

Parch and SibSp

The number of relatives aboard affects survival, so the two columns are merged into one.

combine['Family_size'] = np.where((combine['Parch']+combine['SibSp']==0),'Alone',
                                  np.where((combine['Parch']+combine['SibSp']<=3),'Small','Big'))

df = pd.get_dummies(combine['Family_size'], prefix='Family_size')
combine = pd.concat([combine,df],axis=1).drop(['SibSp','Parch','Family_size'],axis=1)

Encoding all features as numeric values

features = combine.drop(["PassengerId","Survived"], axis=1).columns
le = LabelEncoder()
for feature in features:
    le = le.fit(combine[feature])
    combine[feature] = le.transform(combine[feature])

Splitting the training and test data

x_train = combine.iloc[:891,:].drop(['PassengerId', 'Survived'],axis=1)
y_train = combine.iloc[:891,:]['Survived']
x_test = combine.iloc[891:,:].drop(['PassengerId','Survived'], axis=1)

Model comparison

# Logistic Regression (all scores below are accuracy on the training set)
Logreg = LogisticRegression()
Logreg.fit(x_train,y_train)
y_pred = Logreg.predict(x_test)
acc_logreg = round(Logreg.score(x_train, y_train) * 100,2)

# Support Vector Machines
svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
acc_svc = round(svc.score(x_train, y_train) *100,2)

# K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
acc_knn = round(knn.score(x_train, y_train) * 100, 2)

# Random Forest
rf = RandomForestClassifier(n_estimators=300,min_samples_leaf=4,class_weight={0:0.745,1:0.255})
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
acc_rf = round(rf.score(x_train, y_train) * 100, 2)

# Decision Tree
dec_tree = DecisionTreeClassifier()
dec_tree.fit(x_train, y_train)
y_pred = dec_tree.predict(x_test)
acc_dec_tree = round(dec_tree.score(x_train,y_train) * 100,2)


# XGBoost
xgb = XGBClassifier()
xgb.fit(x_train,y_train)
y_pred = xgb.predict(x_test)
acc_xgb = round(xgb.score(x_train,y_train) * 100, 2)


models = pd.DataFrame({'model':['Logreg','svc','knn','rf','dec_tree','xgb'],
                       'Score':[acc_logreg,acc_svc,acc_knn,acc_rf,acc_dec_tree,acc_xgb]})

print(models.sort_values(by='Score', ascending=False))

   Score     model
4  99.21  dec_tree
5  89.11       xgb
2  87.32       knn
1  87.09       svc
0  86.31    Logreg
3  85.41        rf
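These scores are measured on the same rows the models were trained on, which is why an unconstrained decision tree can reach 99.21 while scoring much lower on unseen data. A minimal sketch of a fairer comparison, assuming synthetic data from `make_classification` as a stand-in for `x_train` and `y_train` (the numbers it prints say nothing about the Titanic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)

# Accuracy on the rows the tree was fitted on
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# Mean accuracy over 5 held-out folds
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(round(train_acc, 3), round(cv_acc, 3))
```

On the real data, the same call would be `cross_val_score(model, x_train, y_train, cv=5)` for each model in the table.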

# XGB
xgb = XGBClassifier()
xgb.fit(x_train,y_train)
y_pred = xgb.predict(x_test).astype(int)  # this column must be integer, otherwise the submission format is invalid and scores 0 (don't ask how I know)
# Scored only 78

# logistic Regression
# Logreg = LogisticRegression()
# Logreg.fit(x_train,y_train)
# y_pred = Logreg.predict(x_test).astype(int)
# Scored only 78

# Random Forest
# rf = RandomForestClassifier(n_estimators=100)
# rf.fit(x_train, y_train)
# y_pred = rf.predict(x_test).astype(int)

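submission = pd.DataFrame({"PassengerId": data_test["PassengerId"],"Survived": y_pred})
submission.to_csv('submission.csv',index=False)

After submitting, the result ranked in the top 11%. No model ensembling was done here, the hyperparameter tuning is not very practised and the feature engineering is fairly basic, so there is still plenty of room for improvement.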
