Chapter 8: SVC / LinearSVC (Breast Cancer Detection)
Data Preprocessing
Instead of importing scikit-learn's built-in breast cancer dataset, we use pandas to read a downloaded copy of the data (the built-in version is loaded first for comparison):
# Load the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
print('data shape: {0}; no. positive: {1}; no. negative: {2}'.format(
    X.shape, y[y==1].shape[0], y[y==0].shape[0]))
data shape: (569, 30); no. positive: 357; no. negative: 212
Note: the diagnosis labels here are already int data coded as 0/1.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv(r'D:\machinelearningDatasets\BreastCancerLR\Breast cancer.csv')
X = data.iloc[:, 2:31]        # feature columns (adjust the slice to your CSV's layout)
y = data.iloc[:, 1:2]         # the diagnosis column, still a 2-D DataFrame here
y.diagnosis.value_counts()    # check the class balance
y = y.values.ravel()          # flatten to a 1-D array (see the note at the end)
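If your downloaded copy of the CSV still stores diagnosis as the original 'M'/'B' strings (as the raw WDBC file does), a conversion step like the following sketch could be added first; the column name and mapping here are assumptions about the file layout, not part of the original workflow.
# Sketch: only needed when diagnosis is still stored as 'M'/'B' strings.
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0}).astype(int)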
Feature scaling:
sklearn.preprocessing.MinMaxScaler
Transforms features by scaling each feature to a given range.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)   # scale every feature to the [0, 1] range
Splitting the dataset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Gaussian Kernel (rbf)
from sklearn.svm import SVC
clf = SVC(C=1.0, kernel='rbf')
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))
train score: 0.9538461538461539; test score: 0.9649122807017544
The fit is very good!
Without scaling the data:
train score: 1.0; test score: 0.631578947368421
The training score is nearly perfect while the test score is very low, a classic sign of overfitting. Even in this case, tuning gamma can raise the score to about 0.950.
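For reference, a minimal sketch of that check: refit on the unscaled features, once with the default gamma and once with a much smaller value. The gamma of 1e-5 is an illustrative assumption, not the author's tuned setting, and the scores will vary by run.
# Sketch: reproduce the unscaled-data run, then shrink gamma to tame the overfitting.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X_raw = data.iloc[:, 2:31].values          # unscaled feature values
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_raw, y, test_size=0.2)

clf = SVC(C=1.0, kernel='rbf')             # default gamma: near-perfect train score, poor test score
clf.fit(Xr_train, yr_train)
print('default gamma:', clf.score(Xr_train, yr_train), clf.score(Xr_test, yr_test))

clf = SVC(C=1.0, kernel='rbf', gamma=1e-5) # much smaller gamma (illustrative value)
clf.fit(Xr_train, yr_train)
print('small gamma:  ', clf.score(Xr_train, yr_train), clf.score(Xr_test, yr_test))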
Model optimization:
import sys
sys.path.append(r'C:\Users\Qiuyi\Desktop\scikit-learn code\code\common')
from utils import plot_param_curve
from sklearn.model_selection import GridSearchCV

gammas = np.linspace(0, 0.001, 50)   # note: gamma=0 may be rejected by newer scikit-learn versions
C = [1, 10, 100, 1000]
param_grid = {'gamma': gammas, 'C': C}
clf = GridSearchCV(SVC(), param_grid, cv=5, return_train_score=True)
clf.fit(X, y)
print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))
best param: {'C': 1000, 'gamma': 0.0008979591836734694}
best score: 0.9789103690685413
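The plot_param_curve helper imported above is the author's utility and is not shown here; a matplotlib-only sketch of the same idea, reading the train/test score means for the best C directly from cv_results_, might look like this (it assumes the param_grid defined above and that return_train_score=True was set).
# Sketch: plot cross-validation scores against gamma for the C=1000 rows of cv_results_.
import numpy as np
import matplotlib.pyplot as plt

results = clf.cv_results_
mask = np.array(results['param_C'], dtype=float) == 1000   # keep only the C=1000 rows
g = np.array(results['param_gamma'], dtype=float)[mask]

plt.figure(figsize=(8, 4), dpi=100)
plt.plot(g, results['mean_train_score'][mask], label='mean_train_score')
plt.plot(g, results['mean_test_score'][mask], label='mean_test_score')
plt.xlabel('gamma')
plt.ylabel('score')
plt.legend()
plt.show()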
Plotting the learning curve:
import time
from utils import plot_learning_curve
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
title = 'Learning Curves for Gaussian Kernel'

start = time.clock()   # time.clock() was removed in Python 3.8+; use time.perf_counter() there
plt.figure(figsize=(10, 4), dpi=144)
plot_learning_curve(plt, SVC(C=1000, kernel='rbf', gamma=0.0008979591836734694),
                    title, X, y, ylim=(0.5, 1.01), cv=cv)
print('elapse: {0:.6f}'.format(time.clock()-start))
elapse: 0.340826
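plot_learning_curve itself comes from the author's utils module (see Chapter 3, referenced at the end); under the assumption that it simply plots the mean train and cross-validation scores, a minimal stand-in built on sklearn.model_selection.learning_curve could look like this. The function name simple_learning_curve is hypothetical.
# Sketch: a minimal stand-in for utils.plot_learning_curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

def simple_learning_curve(estimator, title, X, y, ylim=None, cv=None):
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, np.mean(test_scores, axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')

simple_learning_curve(SVC(C=1000, kernel='rbf', gamma=0.0008979591836734694),
                      'Learning Curves for Gaussian Kernel', X, y,
                      ylim=(0.5, 1.01), cv=cv)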
Polynomial Kernel (poly)
A quick test; it runs noticeably slower than the Gaussian kernel.
from sklearn.svm import SVC
clf = SVC(C=1.0, kernel='poly', degree=2)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))
train score: 0.967032967032967; test score: 0.9473684210526315
The fit is fairly good!
Plotting the learning curves:
import time
from utils import plot_learning_curve
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
title = 'Learning Curves with degree={0}'
degrees = [1, 2]

start = time.clock()
plt.figure(figsize=(12, 4), dpi=144)
for i in range(len(degrees)):
    plt.subplot(1, len(degrees), i + 1)
    plot_learning_curve(plt, SVC(C=1.0, kernel='poly', degree=degrees[i]),
                        title.format(degrees[i]), X, y, ylim=(0.8, 1.01), cv=cv, n_jobs=-1)
print('elapse: {0:.6f}'.format(time.clock()-start))
elapse: 431.939271
The computational cost is very high!
The degree-1 polynomial kernel scores somewhat higher than degree 2, but still falls short of the Gaussian kernel.
LinearSVC with Polynomial Features
Differences between LinearSVC() and SVC(kernel='linear'):
- LinearSVC() minimizes the squared hinge loss, while SVC(kernel='linear') minimizes the hinge loss;
- LinearSVC() handles multi-class problems with a one-vs-rest scheme, while SVC(kernel='linear') uses one-vs-one;
- LinearSVC() is implemented on top of liblinear, while SVC(kernel='linear') uses libsvm;
- LinearSVC() lets you choose the penalty and loss function, while SVC(kernel='linear') uses the defaults.
In short, LinearSVC is faster on large-scale, linearly separable problems.
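A quick hedged comparison of the two estimators on the scaled split from above; the scores and fit times below are whatever your run produces, not the author's figures, and the loss/dual settings shown are simply the LinearSVC defaults made explicit.
# Sketch: fit SVC(kernel='linear') and LinearSVC on the same data and compare.
import time
from sklearn.svm import SVC, LinearSVC

for model in (SVC(kernel='linear', C=1.0),
              LinearSVC(C=1.0, loss='squared_hinge', dual=False)):
    start = time.time()
    model.fit(X_train, y_train)
    print(type(model).__name__,
          'train:', model.score(X_train, y_train),
          'test:', model.score(X_test, y_test),
          'fit time: {0:.3f}s'.format(time.time() - start))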
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

def create_model(degree=2, **kwarg):
    # expand to polynomial features, rescale, then fit a linear SVM
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    scaler = MinMaxScaler()
    linear_svc = LinearSVC(**kwarg)
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("scaler", scaler),
                         ("linear_svc", linear_svc)])
    return pipeline
clf = create_model(penalty='l1', dual=False)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))
train score: 0.9824175824175824; test score: 0.9649122807017544
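Because the pipeline uses penalty='l1', many of the degree-2 polynomial feature weights are driven to zero; a sketch of how that sparsity could be inspected (the step name matches the pipeline defined above, and the exact counts depend on the run):
# Sketch: count non-zero weights learned by the l1-penalized LinearSVC step.
import numpy as np

linear_svc = clf.named_steps['linear_svc']
coef = linear_svc.coef_.ravel()
print('total features: {0}; non-zero weights: {1}'.format(
    coef.shape[0], np.count_nonzero(coef)))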
import time
from utils import plot_learning_curve
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
title = 'Learning Curves for LinearSVC with Degree={0}'
degrees = [1, 2]

start = time.clock()
plt.figure(figsize=(12, 4), dpi=144)
for i in range(len(degrees)):
    plt.subplot(1, len(degrees), i + 1)
    plot_learning_curve(plt, create_model(penalty='l1', dual=False, degree=degrees[i]),
                        title.format(degrees[i]), X, y, ylim=(0.8, 1.01), cv=cv)
print('elapse: {0:.6f}'.format(time.clock()-start))
The fit is slightly worse than the Gaussian kernel but better than the polynomial kernel, and, crucially, it is much faster!
References:
common\utils
Chapter 3: plot_learning_curve for plotting learning curves
Notes:
You must call values.ravel(), otherwise:
C:\Python36\lib\site-packages\sklearn\utils\validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
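A tiny illustration of the shape difference behind that warning (the row count assumes the 569-sample dataset above):
# Shape check: iloc[:, 1:2] keeps y as a 2-D column vector; ravel() flattens it to 1-D.
y_col = data.iloc[:, 1:2]
print(y_col.values.shape)          # (569, 1) -> triggers the DataConversionWarning
print(y_col.values.ravel().shape)  # (569,)   -> what the estimators expect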
When reading mean_train_score and related fields from GridSearchCV's cv_results_, you must pass return_train_score=True, otherwise an error is raised.
plot_param_curve()
train_scores_mean = cv_results['mean_train_score']
train_scores_std = cv_results['std_train_score']
test_scores_mean = cv_results['mean_test_score']
test_scores_std = cv_results['std_test_score']