By 阿新 · Published 2019-01-10
# Titanic Data Analysis Report (Python)
## 1.1 Data Loading and Descriptive Statistics
Load the required data and Python libraries.
## 1.2 Univariate Exploration
### 1.2.1 Age and Fare
Plot histograms of passenger age and fare in the training set, shown below. Most passengers are between 20 and 40 years old, and age is roughly normally distributed. Most fares are low, between 0 and 100, with a small number of passengers paying much more.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns
train = pd.read_csv("D:/學習/資料探勘與機器學習/Titanic/train.csv" )
The data set has 12 fields: PassengerId (passenger ID), Survived (whether the passenger survived), Pclass (passenger class), Name, Sex, Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Ticket (ticket number), Fare (ticket price), Cabin (cabin number), and Embarked (port of embarkation).
There are 891 passengers in total. Age is missing for 177 passengers, Embarked for 2, and Cabin for 687.
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.info()
train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
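The missing-value counts quoted above come from `isnull().sum()`. A minimal sketch on a small synthetic frame (synthetic values, not the real data) shows the pattern:

```python
import numpy as np
import pandas as pd

# Synthetic frame mimicking three Titanic columns with gaps
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})
missing = df.isnull().sum()  # number of missing entries per column
print(missing.to_dict())  # {'Age': 2, 'Cabin': 3, 'Embarked': 1}
```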
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
train["Age"].hist(ax=ax[0])
ax[0].set_title("Hist plot of Age")
train["Fare"].hist(ax=ax[1])
ax[1].set_title("Hist plot of Fare")
### 1.2.2 Survival
Plot a bar chart of survivors versus non-survivors, shown below. Most passengers did not survive.
fig,ax = plt.subplots(figsize=(7,5))
train["Survived"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Not Survived","Survived"), rotation= "horizontal" )
ax.set_title("Bar plot of Survived ")
### 1.2.3 Sex
Plot a bar chart of passenger sex, shown below. Most passengers were male.
fig,ax = plt.subplots(figsize=(7,5))
train["Sex"].value_counts().plot(kind="bar")
ax.set_xticklabels(("male","female"),rotation= "horizontal" )
ax.set_title("Bar plot of Sex ")
### 1.2.4 Passenger Class
Plot a bar chart of Pclass, shown below. Most passengers were in third class; first and second class each had roughly 200 passengers.
fig,ax = plt.subplots(figsize=(7,5))
train["Pclass"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Class3","Class1","Class2"),rotation= "horizontal" )
ax.set_title("Bar plot of Pclass ")
### 1.2.5 Cabin
Handle the Cabin field by filling missing values with Unknown. The first letter of the cabin number appears to encode the deck, so extract that character and assign it back to Cabin as a deck label.
# Fill missing cabins, then keep only the first character (the deck letter);
# missing values become "U" for Unknown
train["Cabin"] = train["Cabin"].fillna("Unknown").str[0]
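The fill-then-take-first-letter step can be checked on a tiny synthetic Series (synthetic values, not the real data):

```python
import pandas as pd

cabins = pd.Series(["C85", None, "B28", None])
# Fill missing values, then keep the first character (the deck letter);
# missing cabins become "U" for Unknown
deck = cabins.fillna("Unknown").str[0]
print(deck.tolist())  # ['C', 'U', 'B', 'U']
```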
Plot a bar chart of the deck label, shown below. For most passengers the deck is unknown.
fig,ax = plt.subplots(figsize=(7,5))
train.Cabin.value_counts().plot(kind="bar")
ax.set_title("Bar plot of Cabin ")
### 1.2.6 Siblings and Spouses
Plot a bar chart of SibSp, shown below. Most passengers had no siblings or spouse aboard; roughly 200 had one.
fig,ax = plt.subplots(figsize=(7,5))
train["SibSp"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of SibSp ")
### 1.2.7 Parents and Children
Plot a bar chart of Parch, shown below. Most passengers had no parents or children aboard; just over 100 had one parent or child aboard, and about 80 had two.
fig,ax = plt.subplots(figsize=(7,5))
train["Parch"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of Parch ")
### 1.2.8 Port of Embarkation
Plot a bar chart of Embarked, shown below. Most passengers boarded at Southampton, just under 200 at Cherbourg, and fewer than 100 at Queenstown.
fig,ax = plt.subplots(figsize=(7,5))
train["Embarked"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Southampton","Cherbourg","Queenstown"),rotation= "horizontal" )
ax.set_title("Bar plot of Embarked ")
## 1.3 Multivariate Exploration
### 1.3.1 Sex and Survival
Cross-tabulate and plot sex against survival, shown below. Women were far more likely to survive, while the survival rate among men was low.
pd.crosstab(train["Sex"],train["Survived"])
| Sex \ Survived | 0 | 1 |
|---|---|---|
| female | 81 | 233 |
| male | 468 | 109 |
pd.crosstab(train["Sex"],train["Survived"]).plot(kind="bar")
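The survival-rate comparison can be made explicit with `normalize="index"`, which turns the counts into per-row proportions (a sketch on synthetic rows, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male", "male"],
    "Survived": [1, 0, 0, 1, 1, 0],
})
# Each row of the result sums to 1, giving the survival rate per sex
rates = pd.crosstab(df["Sex"], df["Survived"], normalize="index")
print(rates.loc["male", 1])  # 0.25
```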
### 1.3.2 Passenger Class and Survival
Cross-tabulate and plot Pclass against survival, shown below. First-class passengers were the most likely to survive (over 50%), second-class passengers survived at close to 50%, and third-class passengers had a low survival rate.
pd.crosstab(train["Pclass"],train["Survived"])
| Pclass \ Survived | 0 | 1 |
|---|---|---|
| 1 | 80 | 136 |
| 2 | 97 | 87 |
| 3 | 372 | 119 |
pd.crosstab(train["Pclass"],train["Survived"]).plot(kind="bar")
### 1.3.3 SibSp and Survival
Cross-tabulate and plot SibSp against survival, shown below. Passengers with one or two siblings/spouses aboard were more likely to survive.
pd.crosstab(train["SibSp"],train["Survived"])
| SibSp \ Survived | 0 | 1 |
|---|---|---|
| 0 | 398 | 210 |
| 1 | 97 | 112 |
| 2 | 15 | 13 |
| 3 | 12 | 4 |
| 4 | 15 | 3 |
| 5 | 5 | 0 |
| 8 | 7 | 0 |
pd.crosstab(train["SibSp"],train["Survived"]).plot(kind="bar")
### 1.3.4 Parch and Survival
Cross-tabulate and plot Parch against survival, shown below. Passengers with one or two parents/children aboard were more likely to survive.
pd.crosstab(train["Parch"],train["Survived"])
| Parch \ Survived | 0 | 1 |
|---|---|---|
| 0 | 445 | 233 |
| 1 | 53 | 65 |
| 2 | 40 | 40 |
| 3 | 2 | 3 |
| 4 | 4 | 0 |
| 5 | 4 | 1 |
| 6 | 1 | 0 |
pd.crosstab(train["Parch"],train["Survived"]).plot(kind="bar")
### 1.3.5 Port of Embarkation and Survival
Cross-tabulate and plot Embarked against survival, shown below. Passengers who boarded at Cherbourg survived at a higher rate.
pd.crosstab(train["Embarked"],train["Survived"])
| Embarked \ Survived | 0 | 1 |
|---|---|---|
| C | 75 | 93 |
| Q | 47 | 30 |
| S | 427 | 217 |
pd.crosstab(train["Embarked"],train["Survived"]).plot(kind="bar")
### 1.3.6 Deck and Survival
Cross-tabulate and plot the deck label against survival, shown below. Passengers whose cabin value was not missing survived at a higher rate.
pd.crosstab(train["Cabin"],train["Survived"])
| Cabin \ Survived | 0 | 1 |
|---|---|---|
| A | 8 | 7 |
| B | 12 | 35 |
| C | 24 | 35 |
| D | 8 | 25 |
| E | 8 | 24 |
| F | 5 | 8 |
| G | 2 | 2 |
| T | 1 | 0 |
| U | 481 | 206 |
pd.crosstab(train["Cabin"],train["Survived"]).plot(kind="bar")
### 1.3.7 Age and Survival
Plot box plots of age for survivors and non-survivors, shown below. No clear relationship is visible.
fig,ay = plt.subplots()
Age1 = train.Age[train.Survived == 1].dropna()
Age0 = train.Age[train.Survived == 0].dropna()
plt.boxplot((Age1,Age0),labels=('Survived','Not Survived'))
ay.set_ylim([-5,70])
ay.set_title("Boxplot of Age")
### 1.3.8 Fare and Survival
Plot box plots of fare for survivors and non-survivors, shown below. Overall, survivors paid higher fares.
fig,ay = plt.subplots()
Fare1 = train.Fare[train.Survived == 1]
Fare0 = train.Fare[train.Survived == 0]
plt.boxplot((Fare1,Fare0),labels=('Survived','Not Survived'))
ay.set_ylim([-10,150])
ay.set_title("Boxplot of Fare")
### 1.3.9 Fare and Passenger Class
Plot box plots of fare by passenger class, shown below. The higher the class, the higher the fare: the two variables are clearly strongly related.
fig,ay = plt.subplots()
Farec1 = train.Fare[train.Pclass == 1]
Farec2 = train.Fare[train.Pclass == 2]
Farec3 = train.Fare[train.Pclass == 3]
plt.boxplot((Farec1,Farec2,Farec3),labels=("Pclass1","Pclass2","Pclass3"))
ay.set_ylim([-10,180])
ay.set_title("Boxplot of Fare and Pclass")
## 1.4 Data Processing
### 1.4.1 Missing Values
Fill missing ages with the mean age (about 29.7) and missing embarkation ports with the mode, S.
train.Age.mean()
train.Age.fillna(29.7,inplace=True)
train.Embarked.fillna("S",inplace=True)
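Filling with the mode can also be done without hard-coding "S" (a sketch on a synthetic Series):

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", np.nan, "S", "Q"])
# mode() returns a Series of the most frequent values; take its first entry
filled = embarked.fillna(embarked.mode()[0])
print(filled.tolist())  # ['S', 'C', 'S', 'S', 'Q']
```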
### 1.4.2 Binning
Based on the analysis above and the relationships between variables, bin Age into seven intervals: 0–5, 5–15, 15–20, 20–35, 35–50, 50–60, and 60–100. Recode Parch into three levels: 0, 1–2, and more than 2. Recode SibSp the same way. Recode Cabin into two levels: missing (U) and not missing.
train["age"] = pd.cut(train.Age, [0, 5, 15, 20, 35, 50, 60, 100])
pd.crosstab(train.age, train.Survived).plot(kind="bar")
# Use .loc assignment to avoid SettingWithCopyWarning
train.loc[(train.Parch > 0) & (train.Parch <= 2), "Parch"] = 1
train.loc[train.Parch > 2, "Parch"] = 2
train.loc[(train.SibSp > 0) & (train.SibSp <= 2), "SibSp"] = 1
train.loc[train.SibSp > 2, "SibSp"] = 2
train.loc[train.Cabin != "U", "Cabin"] = "K"
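The `pd.cut` binning can be verified on a few sample ages (synthetic values). Note that the bins are right-closed by default:

```python
import pandas as pd

ages = pd.Series([4, 16, 30, 62])
# With right=True (the default), 4 falls in (0, 5], 16 in (15, 20], etc.
binned = pd.cut(ages, [0, 5, 15, 20, 35, 50, 60, 100])
print([str(b) for b in binned])  # ['(0, 5]', '(15, 20]', '(20, 35]', '(60, 100]']
```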
### 1.4.3 Dummy Variables
Create dummy variables for Pclass, Sex, Embarked, Parch, SibSp, Age, and Cabin.
dummy_Pclass = pd.get_dummies(train.Pclass, prefix='Pclass')
dummy_Sex = pd.get_dummies(train.Sex, prefix='Sex')
dummy_Embarked = pd.get_dummies(train.Embarked, prefix='Embarked')
dummy_Parch = pd.get_dummies(train.Parch, prefix='Parch')
dummy_SibSp = pd.get_dummies(train.SibSp, prefix='SibSp')
dummy_Age = pd.get_dummies(train.age, prefix='Age')
dummy_Cabin = pd.get_dummies(train.Cabin, prefix='Cabin')
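`get_dummies` expands a categorical column into one indicator column per level, named `prefix_level` (a sketch on synthetic values):

```python
import pandas as pd

sex = pd.Series(["male", "female", "male"])
dummies = pd.get_dummies(sex, prefix="Sex")
# Columns are created in sorted level order
print(dummies.columns.tolist())  # ['Sex_female', 'Sex_male']
print(dummies["Sex_male"].astype(int).tolist())  # [1, 0, 1]
```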
## 1.5 Modeling
### 1.5.1 Building the Training Set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report
Use the first 623 passengers (index 0–622) as the training set. Drop the PassengerId and Name variables and add a constant term, intercept.
The response is whether the passenger survived; the predictors are fare, sex, port of embarkation, Parch, SibSp, age group, and deck. All predictors except fare are dummy variables. Because Fare and Pclass are strongly related, Pclass is dropped.
train_y = train[:623]["Survived"]
cols_to_keep = ["Fare"]
train_x = (train[:623][cols_to_keep]
           .join(dummy_Sex.loc[:, "Sex_male":])
           .join(dummy_Embarked.loc[:, "Embarked_Q":])
           .join(dummy_Parch.loc[:, "Parch_1":])
           .join(dummy_SibSp.loc[:, "SibSp_1":])
           .join(dummy_Age.loc[:, "Age_(5, 15]":])
           .join(dummy_Cabin.loc[:, "Cabin_U":]))
train_x['intercept'] = 1.0
train_x.tail()
| | Fare | Sex_male | Embarked_Q | Embarked_S | Parch_1 | Parch_2 | SibSp_1 | SibSp_2 | Age_(5, 15] | Age_(15, 20] | Age_(20, 35] | Age_(35, 50] | Age_(50, 60] | Age_(60, 100] | Cabin_U | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | 39.0000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 619 | 10.5000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 620 | 14.4542 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 621 | 52.5542 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 622 | 15.7417 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
### 1.5.2 Model Fitting
Fit a logistic regression model on the training set.
clf = LogisticRegression()
clf.fit(train_x,train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
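The same scikit-learn fit/score pattern, self-contained on synthetic data (not the Titanic features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The label depends only on the first feature, so the signal is learnable
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; near 1 on this separable data
```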
### 1.5.3 Model Evaluation
Use the remaining passengers (index 623–890) as the test set.
test_y = train[623:]["Survived"]
cols_to_keep = ["Fare"]
test_x = (train[623:][cols_to_keep]
          .join(dummy_Sex.loc[:, "Sex_male":])
          .join(dummy_Embarked.loc[:, "Embarked_Q":])
          .join(dummy_Parch.loc[:, "Parch_1":])
          .join(dummy_SibSp.loc[:, "SibSp_1":])
          .join(dummy_Age.loc[:, "Age_(5, 15]":])
          .join(dummy_Cabin.loc[:, "Cabin_U":]))
test_x['intercept'] = 1.0
test_x.head()
| | Fare | Sex_male | Embarked_Q | Embarked_S | Parch_1 | Parch_2 | SibSp_1 | SibSp_2 | Age_(5, 15] | Age_(15, 20] | Age_(20, 35] | Age_(35, 50] | Age_(50, 60] | Age_(60, 100] | Cabin_U | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 623 | 7.8542 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 624 | 16.1000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 625 | 32.3208 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 626 | 12.3500 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 627 | 77.9583 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
Evaluate the model on the test set.
clf.predict(test_x)
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
clf.predict_proba(test_x)
array([[ 0.86039834, 0.13960166],
[ 0.85711962, 0.14288038],
[ 0.74830885, 0.25169115],
[ 0.84502153, 0.15497847],
[ 0.12886629, 0.87113371],
[ 0.86038196, 0.13961804],
[ 0.81492416, 0.18507584],
[ 0.74973913, 0.25026087],
[ 0.88595841, 0.11404159],
[ 0.60609598, 0.39390402],
[ 0.86346251, 0.13653749],
[ 0.61197924, 0.38802076],
[ 0.35569986, 0.64430014],
[ 0.86037046, 0.13962954],
[ 0.79463014, 0.20536986],
[ 0.57570079, 0.42429921],
[ 0.83467975, 0.16532025],
[ 0.85798051, 0.14201949],
[ 0.10986907, 0.89013093],
[ 0.56445448, 0.43554552],
[ 0.84012224, 0.15987776],
[ 0.15820833, 0.84179167],
[ 0.56790934, 0.43209066],
[ 0.85796389, 0.14203611],
[ 0.65552637, 0.34447363],
[ 0.86051808, 0.13948192],
[ 0.35980506, 0.64019494],
[ 0.86038196, 0.13961804],
[ 0.29324608, 0.70675392],
[ 0.86017015, 0.13982985],
[ 0.28622388, 0.71377612],
[ 0.28287555, 0.71712445],
[ 0.8070549 , 0.1929451 ],
[ 0.86038196, 0.13961804],
[ 0.20682467, 0.79317533],
[ 0.85835972, 0.14164028],
[ 0.53882734, 0.46117266],
[ 0.80220769, 0.19779231],
[ 0.85539757, 0.14460243],
[ 0.69483827, 0.30516173],
[ 0.87933359, 0.12066641],
[ 0.83561804, 0.16438196],
[ 0.8070549 , 0.1929451 ],
[ 0.85835972, 0.14164028],
[ 0.86042952, 0.13957048],
[ 0.87914067, 0.12085933],
[ 0.11937883, 0.88062117],
[ 0.2853273 , 0.7146727 ],
[ 0.59808311, 0.40191689],
[ 0.90594814, 0.09405186],
[ 0.85835972, 0.14164028],
[ 0.86346251, 0.13653749],
[ 0.85801214, 0.14198786],
[ 0.86032122, 0.13967878],
[ 0.35349566, 0.64650434],
[ 0.52724747, 0.47275253],
[ 0.22879217, 0.77120783],
[ 0.28601744, 0.71398256],
[ 0.56939306, 0.43060694],
[ 0.85743204, 0.14256796],
[ 0.94208727, 0.05791273],
[ 0.82348607, 0.17651393],
[ 0.74901495, 0.25098505],
[ 0.94336392, 0.05663608],
[ 0.85705258, 0.14294742],
[ 0.85800384, 0.14199616],
[ 0.05587819, 0.94412181],
[ 0.5941366 , 0.4058634 ],
[ 0.18541791, 0.81458209],
[ 0.84012224, 0.15987776],
[ 0.83358085, 0.16641915],
[ 0.87933975, 0.12066025],
[ 0.88380587, 0.11619413],
[ 0.87914067, 0.12085933],
[ 0.28628812, 0.71371188],
[ 0.48214083, 0.51785917],
[ 0.70716252, 0.29283748],
[ 0.05715212, 0.94284788],
[ 0.65795297, 0.34204703],
[ 0.25709963, 0.74290037],
[ 0.81492001, 0.18507999],
[ 0.83837631, 0.16162369],
[ 0.8727473 , 0.1272527 ],
[ 0.3942793 , 0.6057207 ],
[ 0.69435145, 0.30564855],
[ 0.25955289, 0.74044711],
[ 0.76489311, 0.23510689],
[ 0.11637845, 0.88362155],
[ 0.65775927, 0.34224073],
[ 0.63734093, 0.36265907],
[ 0.85975561, 0.14024439],
[ 0.8839741 , 0.1160259 ],
[ 0.66714496, 0.33285504],
[ 0.07984609, 0.92015391],
[ 0.15579323, 0.84420677],
[ 0.81105308, 0.18894692],
[ 0.86042952, 0.13957048],
[ 0.24260674, 0.75739326],
[ 0.8360098 , 0.1639902 ],
[ 0.85835972, 0.14164028],
[ 0.87740579, 0.12259421],
[ 0.59721595, 0.40278405],
[ 0.85765731, 0.14234269],
[ 0.72253208, 0.27746792],
[ 0.2862853 , 0.7137147 ],
[ 0.83015243, 0.16984757],
[ 0.3208549 , 0.6791451 ],
[ 0.08720221, 0.91279779],
[ 0.79040078, 0.20959922],
[ 0.86346251, 0.13653749],
[ 0.85835972, 0.14164028],
[ 0.85835972, 0.14164028],
[ 0.85711962, 0.14288038],
[ 0.53746947, 0.46253053],
[ 0.24073084, 0.75926916],
[ 0.86038196, 0.13961804],
[ 0.86038196, 0.13961804],
[ 0.65520865, 0.34479135],
[ 0.61675979, 0.38324021],
[ 0.04187545, 0.95812455],
[ 0.83467975, 0.16532025],
[ 0.86037046, 0.13962954],
[ 0.63589211, 0.36410789],
[ 0.7945787 , 0.2054213 ],
[ 0.35569986, 0.64430014],
[ 0.5923993 , 0.4076007 ],
[ 0.8149159 , 0.1850841 ],
[ 0.18626833, 0.81373167],
[ 0.55494501, 0.44505499],
[ 0.85974901, 0.14025099],
[ 0.86038196, 0.13961804],
[ 0.26826856, 0.73173144],
[ 0.72096103, 0.27903897],
[ 0.86042134, 0.13957866],
[ 0.85651789, 0.14348211],
[ 0.86032122, 0.13967878],
[ 0.12575524, 0.87424476],
[ 0.8577608 , 0.1422392 ],
[ 0.87946251, 0.12053749],
[ 0.83078796, 0.16921204],
[ 0.09214503, 0.90785497],
[ 0.85801214, 0.14198786],
[ 0.1353402 , 0.8646598 ],
[ 0.81833156, 0.18166844],
[ 0.28627693, 0.71372307],
[ 0.77835462, 0.22164538],
[ 0.86019807, 0.13980193],
[ 0.85974901, 0.14025099],
[ 0.87920886, 0.12079114],
[ 0.18831656, 0.81168344],
[ 0.83358085, 0.16641915],
[ 0.56217041, 0.43782959],
[ 0.85802213, 0.14197787],
[ 0.59346021, 0.40653979],
[ 0.26217247, 0.73782753],
[ 0.81492209, 0.18507791],
[ 0.0820547 , 0.9179453 ],
[ 0.26297266, 0.73702734],
[ 0.11560723, 0.88439277],
[ 0.65520865, 0.34479135],
[ 0.7961241 , 0.2038759 ],
[ 0.86071471, 0.13928529],
[ 0.86063609, 0.13936391],
[ 0.35525524, 0.64474476],
[ 0.92489299, 0.07510701],
[ 0.71693682, 0.28306318],
[ 0.6076947 , 0.3923053 ],
[ 0.8149159 , 0.1850841 ],
[ 0.85057636, 0.14942364],
[ 0.6376244 , 0.3623756 ],
[ 0.822631 , 0.177369 ],
[ 0.86038196, 0.13961804],
[ 0.87740579, 0.12259421],
[ 0.17163368, 0.82836632],
[ 0.35894969, 0.64105031],
[ 0.83357894, 0.16642106],
[ 0.26194935, 0.73805065],
[ 0.85835972, 0.14164028],
[ 0.26062053, 0.73937947],
[ 0.42452126, 0.57547874],
[ 0.71744256, 0.28255744],
[ 0.86074419, 0.13925581],
[ 0.86042952, 0.13957048],
[ 0.71232894, 0.28767106],
[ 0.35504562, 0.64495438],
[ 0.87740579, 0.12259421],
[ 0.11900024, 0.88099976],
[ 0.86038523, 0.13961477],
[ 0.87341934, 0.12658066],
[ 0.85935323, 0.14064677],
[ 0.60934863, 0.39065137],
[ 0.86032122, 0.13967878],
[ 0.67707567, 0.32292433],
[ 0.35952192, 0.64047808],
[ 0.751824 , 0.248176 ],
[ 0.8796969 , 0.1203031 ],
[ 0.94539356, 0.05460644],
[ 0.10542774, 0.89457226],
[ 0.86007975, 0.13992025],
[ 0.88191682, 0.11808318],
[ 0.12684285, 0.87315715],
[ 0.93191133, 0.06808867],
[ 0.81531115, 0.18468885],
[ 0.84012224, 0.15987776],
[ 0.69813211, 0.30186789],
[ 0.8149159 , 0.1850841 ],
[ 0.18808428, 0.81191572],
[ 0.22676529, 0.77323471],
[ 0.71814943, 0.28185057],
[ 0.83357894, 0.16642106],
[ 0.86039834, 0.13960166],
[ 0.85780233, 0.14219767],
[ 0.0849913 , 0.9150087 ],
[ 0.86007975, 0.13992025],
[ 0.86032122, 0.13967878],
[ 0.84012224, 0.15987776],
[ 0.60672195, 0.39327805],
[ 0.85795222, 0.14204778],
[ 0.85692031, 0.14307969],
[ 0.29681064, 0.70318936],
[ 0.83393869, 0.16606131],
[ 0.85765731, 0.14234269],
[ 0.87931473, 0.12068527],
[ 0.95077511, 0.04922489],
[ 0.83327556, 0.16672444],
[ 0.8180723 , 0.1819277 ],
[ 0.08871631, 0.91128369],
[ 0.93364059, 0.06635941],
[ 0.90670657, 0.09329343],
[ 0.18814843, 0.81185157],
[ 0.11532953, 0.88467047],
[ 0.3446263 , 0.6553737 ],
[ 0.30260559, 0.69739441],
[ 0.20902316, 0.79097684],
[ 0.70727922, 0.29272078],
[ 0.49908055, 0.50091945],
[ 0.83357894, 0.16642106],
[ 0.85717847, 0.14282153],
[ 0.83675021, 0.16324979],
[ 0.17163368, 0.82836632],
[ 0.6376244 , 0.3623756 ],
[ 0.85835972, 0.14164028],
[ 0.39467085, 0.60532915],
[ 0.2731416 , 0.7268584 ],
[ 0.63987502, 0.36012498],
[ 0.85974901, 0.14025099],
[ 0.72317603, 0.27682397],
[ 0.86038196, 0.13961804],
[ 0.11238481, 0.88761519],
[ 0.67348135, 0.32651865],
[ 0.87880937, 0.12119063],
[ 0.2665907 , 0.7334093 ],
[ 0.26297533, 0.73702467],
[ 0.85718307, 0.14281693],
[ 0.85796389, 0.14203611],
[ 0.86038196, 0.13961804],
[ 0.10513251, 0.89486749],
[ 0.29535416, 0.70464584],
[ 0.86038196, 0.13961804],
[ 0.35756781, 0.64243219],
[ 0.85935323, 0.14064677],
[ 0.86071471, 0.13928529],
[ 0.50077687, 0.49922313],
[ 0.85835972, 0.14164028],
[ 0.14507275, 0.85492725],
[ 0.26239326, 0.73760674],
[ 0.60648726, 0.39351274],
[ 0.8149159 , 0.1850841 ]])
preds = clf.predict(test_x)
The confusion matrix of the model is shown below.
confusion_matrix(test_y,preds)
array([[157, 15],
[ 35, 61]])
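From this matrix, the overall accuracy is (157 + 61) / 268 ≈ 0.813:

```python
import numpy as np

cm = np.array([[157, 15],
               [35, 61]])  # rows: actual class, columns: predicted class
# Diagonal entries are the correct predictions
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # 0.813
```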
Compute the ROC AUC and plot the ROC curve. The AUC is about 0.88, meaning that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor about 88% of the time. The model discriminates well.
pre = clf.predict_proba(test_x)
roc_auc_score(test_y,pre[:,1])
0.88114704457364346
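`roc_auc_score` measures ranking quality rather than raw accuracy. In a tiny synthetic example, three of the four positive–negative pairs are ranked correctly, giving an AUC of 0.75:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # one positive (0.35) scores below a negative (0.4)
print(roc_auc_score(y_true, scores))  # 0.75
```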
fpr,tpr,thresholds = roc_curve(test_y,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Regression")
The classification report is shown below.
print(classification_report(test_y,preds))
precision recall f1-score support
0 0.82 0.91 0.86 172
1 0.80 0.64 0.71 96
avg / total 0.81 0.81 0.81 268
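The class-1 row can be reproduced by hand from the confusion matrix (61 true positives, 15 false positives, 35 false negatives):

```python
tp, fp, fn = 61, 15, 35
precision = tp / (tp + fp)  # 61 / 76
recall = tp / (tp + fn)     # 61 / 96
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.64 0.71
```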
Overall, the model fits the data well.