
# Titanic Data Analysis Report

## 1.1 Data Loading and Descriptive Statistics

Load the data and the required Python libraries.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns
train = pd.read_csv("D:/學習/資料探勘與機器學習/Titanic/train.csv")
The dataset has 12 fields: PassengerId (passenger ID), Survived (whether the passenger survived), Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Ticket (ticket number), Fare, Cabin (cabin number), and Embarked (port of embarkation). There are 891 passengers in total; 177 are missing Age, 2 are missing Embarked, and 687 are missing Cabin.
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
train.info()
train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
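As a cross-check, the missing-value counts quoted above (`count` is 714 for Age, so 177 ages are missing) can be computed with `isnull().sum()`. A minimal sketch on a tiny synthetic frame standing in for `train.csv`:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for train.csv; only the missing-value pattern matters.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})
# isnull().sum() counts missing entries per column.
missing = df.isnull().sum()
print(missing["Age"], missing["Embarked"], missing["Cabin"])
# 2 1 3
```

On the real training set this reports 177 for Age, 2 for Embarked, and 687 for Cabin.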
## 1.2 Univariate Exploration

### 1.2.1 Age and Fare

Plot histograms of passenger age and fare for the training set, as shown below. Most passengers are between 20 and 40 years old, and age is roughly normally distributed. Most fares are low, between 0 and 100; a small number of passengers paid considerably more.
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
train["Age"].hist(ax=ax[0])
ax[0].set_title("Hist plot of Age")
train["Fare"].hist(ax=ax[1])
ax[1].set_title("Hist plot of Fare")
### 1.2.2 Survival

Plot a bar chart of survived versus not survived, as shown below. Most passengers did not survive.
fig,ax = plt.subplots(figsize=(7,5))
train["Survived"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Not Survived","Survived"),  rotation= "horizontal" )
ax.set_title("Bar plot of Survived ")
### 1.2.3 Sex

Plot a bar chart of passenger sex, as shown below. Most passengers were male.
fig,ax = plt.subplots(figsize=(7,5))
train["Sex"].value_counts().plot(kind="bar")
ax.set_xticklabels(("male","female"),rotation= "horizontal"  )
ax.set_title("Bar plot of Sex ")
### 1.2.4 Ticket Class

Plot a bar chart of Pclass, as shown below. Most passengers travelled in third class; first and second class each held roughly 200 passengers.
fig,ax = plt.subplots(figsize=(7,5))
train["Pclass"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Class3","Class1","Class2"),rotation= "horizontal"  )
ax.set_title("Bar plot of Pclass ")
### 1.2.5 Cabin

Fill the missing Cabin values with "Unknown". The first letter of each cabin value appears to encode the deck, so extract that character and keep it as the Cabin value.
train["Cabin"] = train["Cabin"].fillna("Unknown")
# Keep only the first character, the deck letter; the vectorized .str[0]
# replaces the element-wise loop and avoids SettingWithCopyWarning.
train["Cabin"] = train["Cabin"].str[0]
Plot a bar chart of the resulting deck letters, as shown below. For most passengers the deck is unknown.
fig,ax = plt.subplots(figsize=(7,5))
train.Cabin.value_counts().plot(kind="bar")
ax.set_title("Bar plot of Cabin ")
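The deck-letter extraction above can be sketched on toy data; note that the fill value "Unknown" collapses to "U", which is why U appears as a category:

```python
import pandas as pd

# .str[0] takes the first character of each string, here the deck letter.
cabins = pd.Series(["C85", "Unknown", "B28", "Unknown"])
decks = cabins.str[0]
print(decks.tolist())
# ['C', 'U', 'B', 'U']
```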
### 1.2.6 Siblings and Spouses

Plot a bar chart of the number of siblings and spouses aboard, as shown below. Most passengers had none; about 200 had exactly one.
fig,ax = plt.subplots(figsize=(7,5))
train["SibSp"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of SibSp ")
### 1.2.7 Parents and Children

Plot a bar chart of the number of parents and children aboard, as shown below. Most passengers had none, over 100 had one parent or child aboard, and about 90 had two.
fig,ax = plt.subplots(figsize=(7,5))
train["Parch"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of Parch ")
### 1.2.8 Port of Embarkation

Plot a bar chart of the port of embarkation, as shown below. Most passengers boarded at Southampton, fewer than 200 at Cherbourg, and fewer than 100 at Queenstown.
fig,ax = plt.subplots(figsize=(7,5))
train["Embarked"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Southampton","Cherbourg","Queenstown"),rotation= "horizontal"  )
ax.set_title("Bar plot of Embarked ")
## 1.3 Multivariate Exploration

### 1.3.1 Sex and Survival

Produce a crosstab and bar chart of sex against survival, as shown below. Women were much more likely to survive, while the survival rate for men was low.
pd.crosstab(train["Sex"],train["Survived"])
Survived 0 1
Sex
female 81 233
male 468 109
pd.crosstab(train["Sex"],train["Survived"]).plot(kind="bar")
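A sketch of what `pd.crosstab` computes, on toy data: it tabulates the joint counts of two categorical series, exactly the Sex-by-Survived table above.

```python
import pandas as pd

# Joint counts of two series; rows are sex, columns are survival status.
sex = pd.Series(["male", "female", "male", "male"], name="Sex")
survived = pd.Series([0, 1, 0, 1], name="Survived")
ct = pd.crosstab(sex, survived)
print(ct.loc["male", 0], ct.loc["female", 1])
# 2 1
```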
### 1.3.2 Ticket Class and Survival

Produce a crosstab and bar chart of ticket class against survival, as shown below. First-class passengers survived at the highest rate, above 50%; second-class passengers survived at close to 50%; third-class passengers survived at a much lower rate.
pd.crosstab(train["Pclass"],train["Survived"])
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119
pd.crosstab(train["Pclass"],train["Survived"]).plot(kind="bar")
### 1.3.3 Siblings/Spouses and Survival

Produce a crosstab and bar chart of the number of siblings and spouses against survival, as shown below. Passengers with 1 or 2 siblings or spouses aboard were more likely to survive.
pd.crosstab(train["SibSp"],train["Survived"])
Survived 0 1
SibSp
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
pd.crosstab(train["SibSp"],train["Survived"]).plot(kind="bar")
### 1.3.4 Parents/Children and Survival

Produce a crosstab and bar chart of the number of parents and children against survival, as shown below. Passengers with 1 or 2 parents or children aboard were more likely to survive.
pd.crosstab(train["Parch"],train["Survived"])
Survived 0 1
Parch
0 445 233
1 53 65
2 40 40
3 2 3
4 4 0
5 4 1
6 1 0
pd.crosstab(train["Parch"],train["Survived"]).plot(kind="bar")
### 1.3.5 Port of Embarkation and Survival

Produce a crosstab and bar chart of the port of embarkation against survival, as shown below. Passengers who boarded at Cherbourg survived at the highest rate.
pd.crosstab(train["Embarked"],train["Survived"])
Survived 0 1
Embarked
C 75 93
Q 47 30
S 427 217
pd.crosstab(train["Embarked"],train["Survived"]).plot(kind="bar")
### 1.3.6 Cabin and Survival

Produce a crosstab and bar chart of the deck letter against survival, as shown below. Passengers with a known cabin value survived at a noticeably higher rate.
pd.crosstab(train["Cabin"],train["Survived"])
Survived 0 1
Cabin
A 8 7
B 12 35
C 24 35
D 8 25
E 8 24
F 5 8
G 2 2
T 1 0
U 481 206
pd.crosstab(train["Cabin"],train["Survived"]).plot(kind="bar")
### 1.3.7 Age and Survival

Plot boxplots of age by survival status, as shown below. No clear relationship is visible in the boxplots.
fig,ay = plt.subplots()
Age1 = train.Age[train.Survived == 1].dropna()
Age0 = train.Age[train.Survived == 0].dropna()
plt.boxplot((Age1,Age0),labels=('Survived','Not Survived'))
ay.set_ylim([-5,70])
ay.set_title("Boxplot of Age")
### 1.3.8 Fare and Survival

Plot boxplots of fare by survival status, as shown below. Overall, survivors paid higher fares.
fig,ay = plt.subplots()
Fare1 = train.Fare[train.Survived == 1]
Fare0 = train.Fare[train.Survived == 0]
plt.boxplot((Fare1,Fare0),labels=('Survived','Not Survived'))
ay.set_ylim([-10,150])
ay.set_title("Boxplot of Fare")
### 1.3.9 Fare and Ticket Class

Plot boxplots of fare by ticket class, as shown below. Fares clearly rise with ticket class, so the two variables are strongly correlated.
fig,ay = plt.subplots()
Farec1 = train.Fare[train.Pclass == 1]
Farec2 = train.Fare[train.Pclass == 2]
Farec3 = train.Fare[train.Pclass == 3]
plt.boxplot((Farec1,Farec2,Farec3),labels=("Pclass1","Pclass2","Pclass3"))
ay.set_ylim([-10,180])
ay.set_title("Boxplot of Fare and Pclass")
## 1.4 Data Processing

### 1.4.1 Missing Values

Fill the missing ages with the mean age, and the missing embarkation ports with the mode.
train["Age"] = train["Age"].fillna(train["Age"].mean())  # mean age ≈ 29.70
train["Embarked"] = train["Embarked"].fillna("S")        # "S" is the mode
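The same imputation pattern on a toy frame (assigning back to the column, rather than `inplace=True`, sidesteps chained-assignment warnings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Embarked": ["S", None, "S"]})
# Numeric gap -> column mean; categorical gap -> column mode.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df["Age"].tolist(), df["Embarked"].tolist())
# [20.0, 30.0, 40.0] ['S', 'S', 'S']
```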
### 1.4.2 Binning

Based on the analysis above and the relationships between variables, bin Age into 7 segments: 0-5, 5-15, 15-20, 20-35, 35-50, 50-60, and 60-100. Recode Parch into three levels: 0, 1-2, and more than 2; recode SibSp the same way. Recode Cabin into two levels: missing ("U") and known ("K").
train["age"] = pd.cut(train["Age"], [0, 5, 15, 20, 35, 50, 60, 100])
pd.crosstab(train["age"], train["Survived"]).plot(kind="bar")
# Assign through .loc to avoid the SettingWithCopyWarning raised by
# chained indexing such as train.Parch[...] = 1.
train.loc[(train["Parch"] > 0) & (train["Parch"] <= 2), "Parch"] = 1
train.loc[train["Parch"] > 2, "Parch"] = 2
train.loc[(train["SibSp"] > 0) & (train["SibSp"] <= 2), "SibSp"] = 1
train.loc[train["SibSp"] > 2, "SibSp"] = 2
train.loc[train["Cabin"] != "U", "Cabin"] = "K"
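A sketch of the `pd.cut` binning used above: each age is mapped to an interval that is open on the left and closed on the right by default.

```python
import pandas as pd

ages = pd.Series([4, 10, 18, 30, 40, 55, 70])
# Same bin edges as in the report.
bins = pd.cut(ages, [0, 5, 15, 20, 35, 50, 60, 100])
print(str(bins[0]), str(bins[3]))
# (0, 5] (20, 35]
```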
### 1.4.3 Creating Dummy Variables

Create dummy variables for the Pclass, Sex, Embarked, Parch, SibSp, Age, and Cabin variables.
dummy_Pclass = pd.get_dummies(train.Pclass, prefix='Pclass')
dummy_Sex = pd.get_dummies(train.Sex, prefix='Sex')
dummy_Embarked = pd.get_dummies(train.Embarked, prefix='Embarked')
dummy_Parch = pd.get_dummies(train.Parch, prefix='Parch')
dummy_SibSp = pd.get_dummies(train.SibSp, prefix='SibSp')
dummy_Age = pd.get_dummies(train.age, prefix='Age')
dummy_Cabin = pd.get_dummies(train.Cabin, prefix='Cabin')
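What `get_dummies` produces, sketched on toy data: one indicator column per category, named `<prefix>_<level>`.

```python
import pandas as pd

sex = pd.Series(["male", "female", "male"], name="Sex")
dummies = pd.get_dummies(sex, prefix="Sex")
print(list(dummies.columns))
# ['Sex_female', 'Sex_male']
print([int(v) for v in dummies["Sex_male"]])
# [1, 0, 1]
```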
## 1.5 Model Building

### 1.5.1 Building the Training Set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report 
Split the data: the 623 passengers with indices 0-622 form the training set. Drop the PassengerId and Name variables and add a constant column `intercept`. The dependent variable is Survived; the independent variables are Fare, Sex, Embarked, Parch, SibSp, Age, and Cabin, all of which are dummy variables except Fare. Because Fare and Pclass are strongly correlated, Pclass is dropped.
train_y = train[:623]["Survived"]
cols_to_keep = ["Fare"]
# .loc replaces the deprecated .ix indexer for label-based column slicing.
train_x = (train[:623][cols_to_keep]
           .join(dummy_Sex.loc[:, "Sex_male":])
           .join(dummy_Embarked.loc[:, "Embarked_Q":])
           .join(dummy_Parch.loc[:, "Parch_1":])
           .join(dummy_SibSp.loc[:, "SibSp_1":])
           .join(dummy_Age.loc[:, "Age_(5, 15]":])
           .join(dummy_Cabin.loc[:, "Cabin_U":]))
train_x['intercept'] = 1.0
train_x.tail()
Fare Sex_male Embarked_Q Embarked_S Parch_1 Parch_2 SibSp_1 SibSp_2 Age_(5, 15] Age_(15, 20] Age_(20, 35] Age_(35, 50] Age_(50, 60] Age_(60, 100] Cabin_U intercept
618 39.0000 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1
619 10.5000 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
620 14.4542 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1
621 52.5542 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1
622 15.7417 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1
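As an aside, the fixed index split used here (rows 0-622 versus 623-890) could also be drawn at random with scikit-learn's `train_test_split`; a sketch on synthetic data (the column names are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = pd.DataFrame({"Fare": rng.rand(100), "Sex_male": rng.randint(0, 2, 100)})
y = pd.Series(rng.randint(0, 2, 100), name="Survived")
# 70/30 randomized split; random_state makes it reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_tr), len(X_te))
# 70 30
```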
### 1.5.2 Model Fitting

Fit a logistic regression model to the training set.
clf = LogisticRegression()
clf.fit(train_x,train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

### 1.5.3 Model Evaluation

Hold out the passengers with indices 623-890 as the test set.
test_y = train[623:]["Survived"]
cols_to_keep = ["Fare"]
test_x = (train[623:][cols_to_keep]
          .join(dummy_Sex.loc[:, "Sex_male":])
          .join(dummy_Embarked.loc[:, "Embarked_Q":])
          .join(dummy_Parch.loc[:, "Parch_1":])
          .join(dummy_SibSp.loc[:, "SibSp_1":])
          .join(dummy_Age.loc[:, "Age_(5, 15]":])
          .join(dummy_Cabin.loc[:, "Cabin_U":]))
test_x['intercept'] = 1.0
test_x.head()
Fare Sex_male Embarked_Q Embarked_S Parch_1 Parch_2 SibSp_1 SibSp_2 Age_(5, 15] Age_(15, 20] Age_(20, 35] Age_(35, 50] Age_(50, 60] Age_(60, 100] Cabin_U intercept
623 7.8542 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
624 16.1000 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
625 32.3208 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1
626 12.3500 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1
627 77.9583 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1

Evaluate the model on the test set.

clf.predict(test_x)
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
clf.predict_proba(test_x)
array([[ 0.86039834,  0.13960166],
       [ 0.85711962,  0.14288038],
       [ 0.74830885,  0.25169115],
       ...,
       [ 0.26239326,  0.73760674],
       [ 0.60648726,  0.39351274],
       [ 0.8149159 ,  0.1850841 ]])
preds = clf.predict(test_x)

The confusion matrix of the model on the test set is shown below.

confusion_matrix(test_y,preds)
array([[157,  15],
       [ 35,  61]])
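These counts can be turned into the headline metrics by hand; scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]], so:

```python
import numpy as np

cm = np.array([[157, 15],
               [35, 61]])
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()      # 218/268 ≈ 0.81
precision = tp / (tp + fp)           # 61/76  ≈ 0.80
recall = tp / (tp + fn)              # 61/96  ≈ 0.64
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

These values agree with the class-1 row of the classification report further below.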

Compute the model's ROC AUC and plot the ROC curve. The AUC is about 0.88: the model scores a randomly chosen survivor above a randomly chosen non-survivor roughly 88% of the time, so its discrimination is good.

pre = clf.predict_proba(test_x)
roc_auc_score(test_y,pre[:,1])
0.88114704457364346
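A sketch of what the AUC measures, on toy labels and scores: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative pairs are ranked correctly -> AUC 0.75
print(roc_auc_score(y_true, scores))
# 0.75
```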
fpr,tpr,thresholds = roc_curve(test_y,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Regression")
(Figure: ROC curve of the logistic regression model)

The classification report for the test-set predictions is shown below.

print(classification_report(test_y,preds))
             precision    recall  f1-score   support

          0       0.82      0.91      0.86       172
          1       0.80      0.64      0.71        96

avg / total       0.81      0.81      0.81       268

Overall, the model fits the data reasonably well.