
# Titanic Data Analysis Report

## 1.1 Data Loading and Descriptive Statistics

Load the data and the required Python libraries.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns
train = pd.read_csv("D:/學習/資料探勘與機器學習/Titanic/train.csv")
The dataset has 12 fields: PassengerId (passenger ID), Survived (whether the passenger survived), Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Ticket (ticket number), Fare, Cabin (cabin number), and Embarked (port of embarkation). There are 891 passengers in total; 177 are missing Age, 2 are missing Embarked, and 687 are missing Cabin.
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
train.info()
train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
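As a cross-check, the missing-value counts quoted above (`count` is 714 for Age, so 177 ages are missing) can be computed with `isnull().sum()`. A minimal sketch on a tiny synthetic frame standing in for `train.csv`:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for train.csv; only the missing-value pattern matters.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})
# isnull().sum() counts missing entries per column.
missing = df.isnull().sum()
print(missing["Age"], missing["Embarked"], missing["Cabin"])
# 2 1 3
```

On the real training set this reports 177 for Age, 2 for Embarked, and 687 for Cabin.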
## 1.2 Univariate Exploration

### 1.2.1 Age and Fare

Plot histograms of passenger age and fare for the training set, as shown below. Most passengers are between 20 and 40 years old, and age is roughly normally distributed. Most fares are low, between 0 and 100; a small number of passengers paid considerably more.
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
train["Age"].hist(ax=ax[0])
ax[0].set_title("Hist plot of Age")
train["Fare"].hist(ax=ax[1])
ax[1].set_title("Hist plot of Fare")
### 1.2.2 Survival

Plot a bar chart of survived versus not survived, as shown below. Most passengers did not survive.
fig,ax = plt.subplots(figsize=(7,5))
train["Survived"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Not Survived","Survived"),  rotation= "horizontal" )
ax.set_title("Bar plot of Survived ")
### 1.2.3 Sex

Plot a bar chart of passenger sex, as shown below. Most passengers were male.
fig,ax = plt.subplots(figsize=(7,5))
train["Sex"].value_counts().plot(kind="bar")
ax.set_xticklabels(("male","female"),rotation= "horizontal"  )
ax.set_title("Bar plot of Sex ")
### 1.2.4 Ticket Class

Plot a bar chart of Pclass, as shown below. Most passengers travelled in third class; first and second class each held roughly 200 passengers.
fig,ax = plt.subplots(figsize=(7,5))
train["Pclass"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Class3","Class1","Class2"),rotation= "horizontal"  )
ax.set_title("Bar plot of Pclass ")
### 1.2.5 Cabin

Fill the missing Cabin values with "Unknown". The first letter of each cabin value appears to encode the deck, so extract that character and keep it as the Cabin value.
train["Cabin"] = train["Cabin"].fillna("Unknown")
# Keep only the first character, the deck letter; the vectorized .str[0]
# replaces the element-wise loop and avoids SettingWithCopyWarning.
train["Cabin"] = train["Cabin"].str[0]
Plot a bar chart of the resulting deck letters, as shown below. For most passengers the deck is unknown.
fig,ax = plt.subplots(figsize=(7,5))
train.Cabin.value_counts().plot(kind="bar")
ax.set_title("Bar plot of Cabin ")
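The deck-letter extraction above can be sketched on toy data; note that the fill value "Unknown" collapses to "U", which is why U appears as a category:

```python
import pandas as pd

# .str[0] takes the first character of each string, here the deck letter.
cabins = pd.Series(["C85", "Unknown", "B28", "Unknown"])
decks = cabins.str[0]
print(decks.tolist())
# ['C', 'U', 'B', 'U']
```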
### 1.2.6 Siblings and Spouses

Plot a bar chart of the number of siblings and spouses aboard, as shown below. Most passengers had none; about 200 had exactly one.
fig,ax = plt.subplots(figsize=(7,5))
train["SibSp"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of SibSp ")
### 1.2.7 Parents and Children

Plot a bar chart of the number of parents and children aboard, as shown below. Most passengers had none, over 100 had one parent or child aboard, and about 90 had two.
fig,ax = plt.subplots(figsize=(7,5))
train["Parch"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of Parch ")
### 1.2.8 Port of Embarkation

Plot a bar chart of the port of embarkation, as shown below. Most passengers boarded at Southampton, fewer than 200 at Cherbourg, and fewer than 100 at Queenstown.
fig,ax = plt.subplots(figsize=(7,5))
train["Embarked"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Southampton","Cherbourg","Queenstown"),rotation= "horizontal"  )
ax.set_title("Bar plot of Embarked ")
## 1.3 Multivariate Exploration

### 1.3.1 Sex and Survival

Produce a crosstab and bar chart of sex against survival, as shown below. Women were much more likely to survive, while the survival rate for men was low.
pd.crosstab(train["Sex"],train["Survived"])
Survived 0 1
Sex
female 81 233
male 468 109
pd.crosstab(train["Sex"],train["Survived"]).plot(kind="bar")
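A sketch of what `pd.crosstab` computes, on toy data: it tabulates the joint counts of two categorical series, exactly the Sex-by-Survived table above.

```python
import pandas as pd

# Joint counts of two series; rows are sex, columns are survival status.
sex = pd.Series(["male", "female", "male", "male"], name="Sex")
survived = pd.Series([0, 1, 0, 1], name="Survived")
ct = pd.crosstab(sex, survived)
print(ct.loc["male", 0], ct.loc["female", 1])
# 2 1
```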
### 1.3.2 Ticket Class and Survival

Produce a crosstab and bar chart of ticket class against survival, as shown below. First-class passengers survived at the highest rate, above 50%; second-class passengers survived at close to 50%; third-class passengers survived at a much lower rate.
pd.crosstab(train["Pclass"],train["Survived"])
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119
pd.crosstab(train["Pclass"],train["Survived"]).plot(kind="bar")
### 1.3.3 Siblings/Spouses and Survival

Produce a crosstab and bar chart of the number of siblings and spouses against survival, as shown below. Passengers with 1 or 2 siblings or spouses aboard were more likely to survive.
pd.crosstab(train["SibSp"],train["Survived"])
Survived 0 1
SibSp
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
pd.crosstab(train["SibSp"],train["Survived"]).plot(kind="bar")
### 1.3.4 Parents/Children and Survival

Produce a crosstab and bar chart of the number of parents and children against survival, as shown below. Passengers with 1 or 2 parents or children aboard were more likely to survive.
pd.crosstab(train["Parch"],train["Survived"])
Survived 0 1
Parch
0 445 233
1 53 65
2 40 40
3 2 3
4 4 0
5 4 1
6 1 0
pd.crosstab(train["Parch"],train["Survived"]).plot(kind="bar")
### 1.3.5 Port of Embarkation and Survival

Produce a crosstab and bar chart of the port of embarkation against survival, as shown below. Passengers who boarded at Cherbourg survived at the highest rate.
pd.crosstab(train["Embarked"],train["Survived"])
Survived 0 1
Embarked
C 75 93
Q 47 30
S 427 217
pd.crosstab(train["Embarked"],train["Survived"]).plot(kind="bar")
### 1.3.6 Cabin and Survival

Produce a crosstab and bar chart of the deck letter against survival, as shown below. Passengers with a known cabin value survived at a noticeably higher rate.
pd.crosstab(train["Cabin"],train["Survived"])
Survived 0 1
Cabin
A 8 7
B 12 35
C 24 35
D 8 25
E 8 24
F 5 8
G 2 2
T 1 0
U 481 206
pd.crosstab(train["Cabin"],train["Survived"]).plot(kind="bar")
### 1.3.7 Age and Survival

Plot boxplots of age by survival status, as shown below. No clear relationship is visible in the boxplots.
fig,ay = plt.subplots()
Age1 = train.Age[train.Survived == 1].dropna()
Age0 = train.Age[train.Survived == 0].dropna()
plt.boxplot((Age1,Age0),labels=('Survived','Not Survived'))
ay.set_ylim([-5,70])
ay.set_title("Boxplot of Age")
### 1.3.8 Fare and Survival

Plot boxplots of fare by survival status, as shown below. Overall, survivors paid higher fares.
fig,ay = plt.subplots()
Fare1 = train.Fare[train.Survived == 1]
Fare0 = train.Fare[train.Survived == 0]
plt.boxplot((Fare1,Fare0),labels=('Survived','Not Survived'))
ay.set_ylim([-10,150])
ay.set_title("Boxplot of Fare")
### 1.3.9 Fare and Ticket Class

Plot boxplots of fare by ticket class, as shown below. Fares clearly rise with ticket class, so the two variables are strongly correlated.
fig,ay = plt.subplots()
Farec1 = train.Fare[train.Pclass == 1]
Farec2 = train.Fare[train.Pclass == 2]
Farec3 = train.Fare[train.Pclass == 3]
plt.boxplot((Farec1,Farec2,Farec3),labels=("Pclass1","Pclass2","Pclass3"))
ay.set_ylim([-10,180])
ay.set_title("Boxplot of Fare and Pclass")
## 1.4 Data Processing

### 1.4.1 Missing Values

Fill the missing ages with the mean age, and the missing embarkation ports with the mode.
train["Age"] = train["Age"].fillna(train["Age"].mean())  # mean age ≈ 29.70
train["Embarked"] = train["Embarked"].fillna("S")        # "S" is the mode
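The same imputation pattern on a toy frame (assigning back to the column, rather than `inplace=True`, sidesteps chained-assignment warnings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Embarked": ["S", None, "S"]})
# Numeric gap -> column mean; categorical gap -> column mode.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df["Age"].tolist(), df["Embarked"].tolist())
# [20.0, 30.0, 40.0] ['S', 'S', 'S']
```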
### 1.4.2 Binning

Based on the analysis above and the relationships between variables, bin Age into 7 segments: 0-5, 5-15, 15-20, 20-35, 35-50, 50-60, and 60-100. Recode Parch into three levels: 0, 1-2, and more than 2; recode SibSp the same way. Recode Cabin into two levels: missing ("U") and known ("K").
train["age"] = pd.cut(train["Age"], [0, 5, 15, 20, 35, 50, 60, 100])
pd.crosstab(train["age"], train["Survived"]).plot(kind="bar")
# Assign through .loc to avoid the SettingWithCopyWarning raised by
# chained indexing such as train.Parch[...] = 1.
train.loc[(train["Parch"] > 0) & (train["Parch"] <= 2), "Parch"] = 1
train.loc[train["Parch"] > 2, "Parch"] = 2
train.loc[(train["SibSp"] > 0) & (train["SibSp"] <= 2), "SibSp"] = 1
train.loc[train["SibSp"] > 2, "SibSp"] = 2
train.loc[train["Cabin"] != "U", "Cabin"] = "K"
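A sketch of the `pd.cut` binning used above: each age is mapped to an interval that is open on the left and closed on the right by default.

```python
import pandas as pd

ages = pd.Series([4, 10, 18, 30, 40, 55, 70])
# Same bin edges as in the report.
bins = pd.cut(ages, [0, 5, 15, 20, 35, 50, 60, 100])
print(str(bins[0]), str(bins[3]))
# (0, 5] (20, 35]
```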
### 1.4.3 Creating Dummy Variables

Create dummy variables for the Pclass, Sex, Embarked, Parch, SibSp, Age, and Cabin variables.
dummy_Pclass = pd.get_dummies(train.Pclass, prefix='Pclass')
dummy_Sex = pd.get_dummies(train.Sex, prefix='Sex')
dummy_Embarked = pd.get_dummies(train.Embarked, prefix='Embarked')
dummy_Parch = pd.get_dummies(train.Parch, prefix='Parch')
dummy_SibSp = pd.get_dummies(train.SibSp, prefix='SibSp')
dummy_Age = pd.get_dummies(train.age, prefix='Age')
dummy_Cabin = pd.get_dummies(train.Cabin, prefix='Cabin')
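What `get_dummies` produces, sketched on toy data: one indicator column per category, named `<prefix>_<level>`.

```python
import pandas as pd

sex = pd.Series(["male", "female", "male"], name="Sex")
dummies = pd.get_dummies(sex, prefix="Sex")
print(list(dummies.columns))
# ['Sex_female', 'Sex_male']
print([int(v) for v in dummies["Sex_male"]])
# [1, 0, 1]
```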
## 1.5 Model Building

### 1.5.1 Building the Training Set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report 
Split the data: the 623 passengers with indices 0-622 form the training set. Drop the PassengerId and Name variables and add a constant column `intercept`. The dependent variable is Survived; the independent variables are Fare, Sex, Embarked, Parch, SibSp, Age, and Cabin, all of which are dummy variables except Fare. Because Fare and Pclass are strongly correlated, Pclass is dropped.
train_y = train[:623]["Survived"]
cols_to_keep = ["Fare"]
# .loc replaces the deprecated .ix indexer for label-based column slicing.
train_x = (train[:623][cols_to_keep]
           .join(dummy_Sex.loc[:, "Sex_male":])
           .join(dummy_Embarked.loc[:, "Embarked_Q":])
           .join(dummy_Parch.loc[:, "Parch_1":])
           .join(dummy_SibSp.loc[:, "SibSp_1":])
           .join(dummy_Age.loc[:, "Age_(5, 15]":])
           .join(dummy_Cabin.loc[:, "Cabin_U":]))
train_x['intercept'] = 1.0
train_x.tail()
Fare Sex_male Embarked_Q Embarked_S Parch_1 Parch_2 SibSp_1 SibSp_2 Age_(5, 15] Age_(15, 20] Age_(20, 35] Age_(35, 50] Age_(50, 60] Age_(60, 100] Cabin_U intercept
618 39.0000 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1
619 10.5000 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
620 14.4542 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1
621 52.5542 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1
622 15.7417 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1
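As an aside, the fixed index split used here (rows 0-622 versus 623-890) could also be drawn at random with scikit-learn's `train_test_split`; a sketch on synthetic data (the column names are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = pd.DataFrame({"Fare": rng.rand(100), "Sex_male": rng.randint(0, 2, 100)})
y = pd.Series(rng.randint(0, 2, 100), name="Survived")
# 70/30 randomized split; random_state makes it reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_tr), len(X_te))
# 70 30
```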
### 1.5.2 Model Fitting

Fit a logistic regression model to the training set.
clf = LogisticRegression()
clf.fit(train_x,train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

### 1.5.3 Model Evaluation

Hold out the passengers with indices 623-890 as the test set.
test_y = train[623:]["Survived"]
cols_to_keep = ["Fare"]
test_x = (train[623:][cols_to_keep]
          .join(dummy_Sex.loc[:, "Sex_male":])
          .join(dummy_Embarked.loc[:, "Embarked_Q":])
          .join(dummy_Parch.loc[:, "Parch_1":])
          .join(dummy_SibSp.loc[:, "SibSp_1":])
          .join(dummy_Age.loc[:, "Age_(5, 15]":])
          .join(dummy_Cabin.loc[:, "Cabin_U":]))
test_x['intercept'] = 1.0
test_x.head()
Fare Sex_male Embarked_Q Embarked_S Parch_1 Parch_2 SibSp_1 SibSp_2 Age_(5, 15] Age_(15, 20] Age_(20, 35] Age_(35, 50] Age_(50, 60] Age_(60, 100] Cabin_U intercept
623 7.8542 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
624 16.1000 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1
625 32.3208 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1
626 12.3500 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1
627 77.9583 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1

Evaluate the model on the test set.

clf.predict(test_x)
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
clf.predict_proba(test_x)
array([[ 0.86039834,  0.13960166],
       [ 0.85711962,  0.14288038],
       [ 0.74830885,  0.25169115],
       ...,
       [ 0.26239326,  0.73760674],
       [ 0.60648726,  0.39351274],
       [ 0.8149159 ,  0.1850841 ]])
preds = clf.predict(test_x)

The confusion matrix of the model on the test set is shown below.

confusion_matrix(test_y,preds)
array([[157,  15],
       [ 35,  61]])
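These counts can be turned into the headline metrics by hand; scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]], so:

```python
import numpy as np

cm = np.array([[157, 15],
               [35, 61]])
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()      # 218/268 ≈ 0.81
precision = tp / (tp + fp)           # 61/76  ≈ 0.80
recall = tp / (tp + fn)              # 61/96  ≈ 0.64
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

These values agree with the class-1 row of the classification report further below.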

Compute the model's ROC AUC and plot the ROC curve. The AUC is about 0.88: the model scores a randomly chosen survivor above a randomly chosen non-survivor roughly 88% of the time, so its discrimination is good.

pre = clf.predict_proba(test_x)
roc_auc_score(test_y,pre[:,1])
0.88114704457364346
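A sketch of what the AUC measures, on toy labels and scores: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative pairs are ranked correctly -> AUC 0.75
print(roc_auc_score(y_true, scores))
# 0.75
```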
fpr,tpr,thresholds = roc_curve(test_y,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Regression")
(Figure: ROC curve of the logistic regression model)

The classification report for the test-set predictions is shown below.

print(classification_report(test_y,preds))
             precision    recall  f1-score   support

          0       0.82      0.91      0.86       172
          1       0.80      0.64      0.71        96

avg / total       0.81      0.81      0.81       268

Overall, the model fits the data reasonably well.