1. 程式人生 > >機器學習實戰之Titanic(Kaggle)

機器學習實戰之Titanic(Kaggle)

一、乘客資料分析


  1. PassengerId :每一個乘客的標誌符
  2. Survived:Lable值,代表是否獲救
  3. Pclass:乘客艙位等級(1=頭等、2=二等、3=三等)
  4. Name:姓名
  5. Sex:性別
  6. Age:年齡 
  7. SibSp:同船兄弟姐妹及配偶的數量
  8. Parch:同船父母與子女的數量
  9. Ticket:船票的編號
  10. Fare:船票價格
  11. Cabin:船艙位置,此列出現大量缺失,可以不要
  12. Embarked:上船地點

二、資料預處理

1.匯入需要的包

import pandas as pa
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold

2.觀察資料的前幾行

filename = "train.csv"
titanic = pa.read_csv(filename)
titanic.head()
結果:

3.觀察資料的簡單資料特徵

print titanic.describe()

結果:

    PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000         NaN    0.000000   
50%     446.000000    0.000000    3.000000         NaN    0.000000   
75%     668.500000    1.000000    3.000000         NaN    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
  • 可以看到Age列資料只有714個,其餘列均有891個,因此此列需要對缺失值進行填充
    titanic["Age"]=titanic["Age"].fillna(titanic["Age"].median())
    print titanic.describe()

    結果:

     PassengerId    Survived      Pclass         Age       SibSp  \
    count   891.000000  891.000000  891.000000  891.000000  891.000000   
    mean    446.000000    0.383838    2.308642   29.361582    0.523008   
    std     257.353842    0.486592    0.836071   13.019697    1.102743   
    min       1.000000    0.000000    1.000000    0.420000    0.000000   
    25%     223.500000    0.000000    2.000000   22.000000    0.000000   
    50%     446.000000    0.000000    3.000000   28.000000    0.000000   
    75%     668.500000    1.000000    3.000000   35.000000    1.000000   
    max     891.000000    1.000000    3.000000   80.000000    8.000000   
    
                Parch        Fare  
    count  891.000000  891.000000  
    mean     0.381594   32.204208  
    std      0.806057   49.693429  
    min      0.000000    0.000000  
    25%      0.000000    7.910400  
    50%      0.000000   14.454200  
    75%      0.000000   31.000000  
    max      6.000000  512.329200 
  • 將string值轉為int/float值
            1) 首先,觀察相應列有幾種字串
    • print titanic["Sex"].unique()
      print titanic["Embarked"].unique()

      結果:

      ['male' 'female']
      ['S' 'C' 'Q' nan]
      2) 然後,將相應字串的位置附上對應的Int/float值

      titanic.loc[titanic["Sex"]=="male","Sex"] = 0; 
      titanic.loc[titanic["Sex"]=="female","Sex"] = 1;
      titanic.loc[titanic["Embarked"]=="S","Embarked"] = 0; 
      titanic.loc[titanic["Embarked"]=="C","Embarked"] = 1;
      titanic.loc[titanic["Embarked"]=="Q","Embarked"] = 2;
      titanic.head()

      結果:


      替換成功

三、分類

def data_proprocess(filename="train.csv"):
    """Load the Titanic CSV at *filename* and return a cleaned DataFrame.

    Cleaning steps:
      - missing ``Age`` values are filled with the column median
      - missing ``Embarked`` values are filled with 'S' (the most common port)
      - ``Sex`` is encoded as int: male -> 0, female -> 1
      - ``Embarked`` is encoded as int: S -> 0, C -> 1, Q -> 2

    The ``filename`` parameter (default ``"train.csv"``, matching the old
    hard-coded behavior) lets the same routine preprocess both the train
    and the test split, as the driver code below requires.
    """
    import pandas as pa

    titanic = pa.read_csv(filename)

    # Age has missing values (714 of 891 in the train split); fill with
    # the median so every row has a numeric age.
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    # 'S' (Southampton) is the most frequent embarkation port.
    titanic["Embarked"] = titanic["Embarked"].fillna("S")

    # Encode the categorical string columns as integers so sklearn
    # estimators can consume them directly.
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    return titanic

def classify_LinearRegression(titanic):
    """3-fold cross-validated linear-regression baseline.

    Fits LinearRegression on each training fold, thresholds the predicted
    scores at 0.5 into 0/1 survival labels, and returns the overall
    accuracy against the true ``Survived`` column.
    """
    import numpy as np
    from sklearn.cross_validation import KFold
    from sklearn.linear_model import LinearRegression

    predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]  # feature columns

    alg = LinearRegression()
    kf = KFold(titanic.shape[0], n_folds=3, random_state=1)  # 3-fold CV splits
    predictions = []
    for train, test in kf:
        train_predictors = titanic[predictors].iloc[train, :]
        train_target = titanic["Survived"].iloc[train]
        alg.fit(train_predictors, train_target)
        predictions.append(alg.predict(titanic[predictors].iloc[test, :]))

    predictions = np.concatenate(predictions, axis=0)
    # Threshold the regression scores into hard 0/1 class labels.
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0

    # BUG FIX: the original summed the *values* of the matching predictions
    # (sum(predictions[predictions == titanic['Survived']])), which counts
    # only correctly-predicted survivors and grossly under-reports accuracy
    # (hence the article's 0.26 figure).  Accuracy is the fraction of
    # predictions equal to the true label.
    accuracy = np.mean(predictions == titanic["Survived"])
    return accuracy

def classify_LogisticRegression(titanic):
    """Return the mean 3-fold cross-validation accuracy of a logistic-regression model."""
    from sklearn import cross_validation
    from sklearn.linear_model import LogisticRegression

    # Feature columns fed to the model.
    feature_cols = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
    model = LogisticRegression(random_state=1)
    fold_scores = cross_validation.cross_val_score(
        model, titanic[feature_cols], titanic["Survived"], cv=3)
    return fold_scores.mean()
# Run both baseline classifiers on the preprocessed training data and
# report their cross-validated accuracy.
# (Python 2 print statements — this article predates Python 3 syntax.)
print "LinearRegression Classification result is :"
print classify_LinearRegression(data_proprocess())
print "LogisticRegression Classification result is :"
print classify_LogisticRegression(data_proprocess())
結果:
LinearRegression Classification result is :
0.261503928171
LogisticRegression Classification result is :
0.787878787879

從結果可以看出,還是用邏輯迴歸做分類問題精度更高。

四、使用隨機森林提高分類精度並將結果傳到kaggle

def classify_RandomForestClassifier(train_data, test_data):
    """Train a random forest on *train_data* and predict *test_data*.

    Returns the mean 3-fold cross-validation accuracy on the training set.
    Side effect: writes a Kaggle submission CSV (PassengerId, Survived)
    for the test set to the working directory.
    """
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation
    import pandas as pa
    import numpy as np

    predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
    clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
    # Cross-validate first to estimate generalization accuracy...
    scores = cross_validation.cross_val_score(
        clf, train_data[predictors], train_data["Survived"], cv=3)
    # ...then refit on the full training set for the final submission.
    clf.fit(train_data[predictors], train_data["Survived"])
    predict_result = clf.predict(test_data[predictors])
    # BUG FIX: DataFrame.as_matrix() was deprecated in pandas 0.23 and
    # removed in 1.0; ``.values`` is the long-supported equivalent.
    result = pa.DataFrame({"PassengerId": test_data["PassengerId"].values,
                           "Survived": predict_result.astype(np.int32)})
    # NOTE(review): the file name says "logistic_regression" but these are
    # random-forest predictions — kept byte-identical for compatibility
    # with any workflow that expects this name.
    result.to_csv("logistic_regression_predictions.csv", index=False)
    return scores.mean()
# Preprocess both splits and produce the Kaggle submission file.
# NOTE(review): these calls require data_proprocess to accept a filename
# argument — verify its signature matches.
print "train"
titanic_train=data_proprocess("train.csv")
print "test"
titanic_test=data_proprocess("test.csv")

classify_RandomForestClassifier(titanic_train,titanic_test)