Andrew Ng's Machine Learning Assignments in Python (6): SVM (Support Vector Machines)

阿新 • Published: 2019-01-29

1 Support Vector Machines

1.1 Example Dataset 1

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm

Most SVM libraries automatically add the extra feature $x_0$ and the parameter $\theta_0$ for you, so there is no need to add them manually.

mat = loadmat('./data/ex6data1.mat')
print(mat.keys())
# dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])
X = mat['X']
y = mat['y']

def plotData(X, y):
    """Scatter-plot the samples, coloured by class label."""
    plt.figure(figsize=(8,5))
    plt.scatter(X[:,0], X[:,1], c=y.flatten(), cmap='rainbow')
    plt.xlabel('X1')
    plt.ylabel('X2')

plotData(X, y)

def plotBoundary(clf, X):
    '''plot decision boundary'''
    x_min, x_max = X[:,0].min()*1.2, X[:,0].max()*1.1
    y_min, y_max = X[:,1].min()*1.1, X[:,1].max()*1.1
    # evaluate the classifier on a dense grid and draw the contour between the classes
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
clfs = [model.fit(X, y.ravel()) for model in models]

titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
for model, title in zip(clfs, titles):
    plotData(X, y)  # plotData opens its own figure
    plotBoundary(model, X)
    plt.title(title)

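As the figures above show, when C is large the model penalizes misclassification more heavily: it is stricter, makes fewer misclassifications, and the margin is narrower.

When C is small the penalty on misclassification is weaker: the model is more lenient, tolerates some misclassified points, and the margin is wider.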
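The relationship between C and the margin can also be checked numerically. The following is a minimal sketch, not part of the assignment: `X_toy`, `y_toy` and `margin_width` are hypothetical names, and the seven points are made up for illustration. For a linear SVM the margin width is 2 / ||w||, so it can be read off `coef_`.

```python
import numpy as np
from sklearn import svm

# Made-up, linearly separable toy data: three negatives, four positives.
# The positive point (2, 2) sits close to the negative cluster.
X_toy = np.array([[1, 1], [1, 2], [2, 1],
                  [3, 3], [3, 4], [4, 3], [2, 2]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])

def margin_width(C):
    """Fit a linear SVM with the given C and return its margin width 2 / ||w||."""
    clf = svm.SVC(C=C, kernel='linear').fit(X_toy, y_toy)
    return 2.0 / np.linalg.norm(clf.coef_[0])

margins = {C: margin_width(C) for C in (0.01, 100)}
for C, width in margins.items():
    print('C = {:>6}: margin width = {:.3f}'.format(C, width))
```

On this toy data the small C should give a much wider margin than the large C, which matches the behaviour described above.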

1.2 SVM with Gaussian Kernels

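In this part we use an SVM for non-linear classification, with a Gaussian kernel.

To find a non-linear decision boundary with an SVM, we first implement the Gaussian kernel. It can be thought of as a similarity function that measures how close a pair of samples $(x^{(i)}, x^{(j)})$ is.

For the actual training, the kernel built into sklearn's svm is sufficient.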

1.2.1 Gaussian Kernel

def gaussKernel(x1, x2, sigma):
    return np.exp(- ((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

gaussKernel(np.array([1, 2, 1]),np.array([0, 4, -1]), 2.)  # 0.32465246735834974

1.2.2 Example Dataset 2

mat = loadmat('./data/ex6data2.mat')
X2 = mat['X']
y2 = mat['y']

plotData(X2, y2)

sigma = 0.1
gamma = np.power(sigma,-2.)/2
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(X2, y2.flatten())
plotData(X2, y2)
plotBoundary(model, X2)
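The conversion gamma = 1 / (2 * sigma^2) used above follows from comparing the Gaussian kernel exp(-||x1 - x2||^2 / (2 * sigma^2)) with sklearn's RBF kernel exp(-gamma * ||x1 - x2||^2). A small sketch to confirm that the two agree; `gauss_kernel` simply restates the function from section 1.2.1 so that the snippet is self-contained, and the sample vectors are the ones used there.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gauss_kernel(x1, x2, sigma):
    """Gaussian kernel as defined in section 1.2.1."""
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

sigma = 2.0
gamma = 1.0 / (2 * sigma ** 2)  # same conversion as np.power(sigma, -2.) / 2

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])

ours = gauss_kernel(x1, x2, sigma)
# rbf_kernel expects 2-D inputs of shape (n_samples, n_features)
theirs = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]
print(ours, theirs)
```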

1.2.3 Example Dataset 3

mat3 = loadmat('data/ex6data3.mat')
X3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plotData(X3, y3)

Cvalues = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
sigmavalues = Cvalues
best_pair, best_score = (0, 0), 0

for C in Cvalues:
    for sigma in sigmavalues:
        gamma = np.power(sigma,-2.)/2
        model = svm.SVC(C=C,kernel='rbf',gamma=gamma)
        model.fit(X3, y3.flatten())
        this_score = model.score(Xval, yval.ravel())
        if this_score > best_score:
            best_score = this_score
            best_pair = (C, sigma)
print('best_pair={}, best_score={}'.format(best_pair, best_score))
# best_pair=(1.0, 0.1), best_score=0.965

model = svm.SVC(C=1., kernel='rbf', gamma = np.power(.1, -2.)/2)
model.fit(X3, y3.flatten())
plotData(X3, y3)
plotBoundary(model, X3)
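As an alternative to the hand-written double loop, sklearn's `GridSearchCV` can run the same search. Note the difference: the sketch below scores each (C, gamma) pair by 5-fold cross-validation rather than on the fixed validation set `Xval`, `yval` used in the assignment, and it runs on synthetic `make_moons` data (an assumption made so that the snippet is self-contained), so its result is not comparable to the one above.

```python
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the assignment would use X3, y3 instead.
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

values = [0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.]
param_grid = {
    'C': values,
    # same sigma -> gamma conversion as in the loop above
    'gamma': [1.0 / (2 * sigma ** 2) for sigma in values],
}

search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_demo, y_demo)
print('best params = {}, best CV score = {:.3f}'.format(search.best_params_, search.best_score_))
```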

# A plotting exercise of my own, unrelated to the assignment; included as a plotting reference.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# three linearly separable points
np.random.seed(0)

X = np.array([[3,3],[4,3],[1,1]])
Y = np.array([1,1,-1])

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# plot the line, the points, and the nearest vectors to the plane
plt.figure(figsize=(8,5))
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
# circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=150, facecolors='none', edgecolors='k', linewidths=1.5)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.rainbow)

plt.axis('tight')
plt.show()

print(clf.decision_function(X))

[ 1.   1.5 -1. ]

2 Spam Classification

2.1 Preprocessing Emails

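In this part we build a spam classifier with an SVM. Each email has to be converted into an n-dimensional feature vector; the classifier then decides whether a given email x is spam (y=1) or not spam (y=0).

Take a look at an example from the dataset: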

with open('data/emailSample1.txt', 'r') as f:
    email = f.read()
    print(email)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
[email protected]

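As we can see, the email contains a URL, an email address (at the end), numbers, and dollar amounts. Many emails contain these kinds of elements, but the specific values differ from one email to the next. A common way to handle this is therefore to normalize them, so that all URLs are treated alike, all numbers are treated alike, and so on.

For example, we replace every URL with the single string 'httpaddr' to indicate that the email contains a URL, without caring about the specific URL. This usually improves the performance of a spam classifier, because spammers often randomize URLs, so the probability of seeing any particular URL again in a new spam email is very small.

We can apply the following steps:

  1. Lower-casing: convert the entire email to lower case.
  2. Stripping HTML: remove all HTML tags and keep only the content.
  3. Normalizing URLs: replace every URL with the string "httpaddr".
  4. Normalizing Email Addresses: replace every email address with "emailaddr".
  5. Normalizing Dollars: replace every dollar sign ($) with "dollar".
  6. Normalizing Numbers: replace every number with "number".
  7. Word Stemming: reduce every word to its stem. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount".
  8. Removal of non-words: remove all non-word content and collapse all whitespace (tabs, newlines, spaces) into a single space.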

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn import svm
import re #regular expression for e-mail processing

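# A usable English stemming algorithm (Porter stemmer)
from stemming.porter2 import stem

# This stemmer seems closer to the code used in the assignment; the results are about the same as the one above
import nltk, nltk.stem.porter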

def processEmail(email):
    """Apply every preprocessing step except word stemming and removal of non-words."""
    email = email.lower()
    email = re.sub(r'<[^<>]+>', ' ', email)  # HTML tag: '<', then any run of characters other than '<' and '>', then '>'
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # URL: everything after '//' up to the next whitespace
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)  # email address: non-whitespace on both sides of '@'
    email = re.sub(r'[\$]+', 'dollar', email)
    email = re.sub(r'[\d]+', 'number', email)
    return email
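To see what these substitutions do to a concrete string, here is a small self-contained sketch. `normalize_email` is a hypothetical stand-alone name that mirrors the steps above, the sample sentence is made up, and the email-address pattern `[^\s]+@[^\s]+` is an assumption about the intended regular expression.

```python
import re

def normalize_email(text):
    """Lower-case the text and normalize HTML tags, URLs, addresses, '$' and numbers."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                       # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)   # any URL -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)          # any address -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                      # '$' -> 'dollar'
    text = re.sub(r'\d+', 'number', text)                       # digits -> 'number'
    return text

sample = 'Visit <b>http://example.com/x1</b> or mail Bob@example.org, only $100!'
result = normalize_email(sample)
print(result.split())
# ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'only', 'dollarnumber!']
```

The order of the steps matters: HTML tags are stripped before URLs are replaced, otherwise the URL pattern would swallow the closing `</b>` tag. The address pattern also swallows the trailing comma, which is harmless here because punctuation is discarded later anyway.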

Next comes stemming and the removal of non-word content.

def email2TokenList(email):
    """Preprocess the email and return a clean list of stemmed words."""

    # I'll use the NLTK stemmer because it more accurately duplicates the
    # performance of the OCTAVE implementation in the assignment
    stemmer = nltk.stem.porter.PorterStemmer()

    email = processEmail(email)

    # split the email into individual words; re.split() accepts several delimiters at once
    tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)

    # go through every piece produced by the split
    tokenlist = []
    for token in tokens:
        # remove any non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # skip empty strings
        if not len(token): continue
        # use the Porter stemmer to extract the word stem
        stemmed = stemmer.stem(token)
        tokenlist.append(stemmed)

    return tokenlist

2.1.1 Vocabulary List

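After preprocessing, each email is reduced to a list of words. The next step is to choose which words to use in the classifier and which to leave out.

We have a vocabulary list, vocab.txt, containing 1899 words that occur frequently in practice.

We need to find which words of vocab.txt occur in the processed email and return their indices in vocab.txt; these are the word indices we want for training.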

def email2VocabIndices(email, vocab):
    """Return the indices of the vocabulary words that occur in the email."""
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token ]
    return index

2.2 Extracting Features from Emails

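def email2FeatureVector(email):
    """
    Convert an email into a binary word vector of length n = len(vocab):
    positions of vocabulary words present in the email are 1, all others 0.
    """
    df = pd.read_table('data/vocab.txt', names=['words'])
    vocab = df['words'].values  # 1-D array of words (DataFrame.as_matrix() was removed in pandas 1.0)
    vector = np.zeros(len(vocab))  # init vector
    vocab_indices = email2VocabIndices(email, vocab)  # indices of the vocabulary words found
    # set the entries for the words that are present to 1
    for i in vocab_indices:
        vector[i] = 1
    return vector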

vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

length of vector = 1899
num of non-zero = 45
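The mapping from tokens to vocabulary indices to a binary feature vector is easiest to follow on a tiny example. The sketch below is illustrative only: `toy_vocab`, `tokens`, `vocab_indices` and `feature_vector` are hypothetical names, and the five-word vocabulary stands in for the 1899 words of vocab.txt.

```python
import numpy as np

def vocab_indices(tokens, vocab):
    """Return the positions in vocab of the words that occur in tokens."""
    token_set = set(tokens)  # set membership is O(1) per lookup
    return [i for i, word in enumerate(vocab) if word in token_set]

def feature_vector(tokens, vocab):
    """Binary vector with a 1 at every vocabulary position whose word occurs in tokens."""
    vector = np.zeros(len(vocab))
    vector[vocab_indices(tokens, vocab)] = 1
    return vector

# Made-up vocabulary and an already stemmed token list
toy_vocab = ['anyon', 'cost', 'host', 'know', 'web']
tokens = ['anyon', 'know', 'how', 'much', 'it', 'cost']

indices = vocab_indices(tokens, toy_vocab)
vector = feature_vector(tokens, toy_vocab)
print(indices)  # [0, 1, 3]
print(vector)   # [1. 1. 0. 1. 0.]
```

Words that are not in the vocabulary ('how', 'much', 'it') are simply ignored, and a word that appears several times in an email still contributes a single 1.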

2.3 Training SVM for Spam Classification

Load the already extracted feature vectors and their labels, split into a training set and a test set.

# Training set
mat1 = loadmat('data/spamTrain.mat')
X, y = mat1['X'], mat1['y']

# Test set
mat2 = loadmat('data/spamTest.mat')
Xtest, ytest = mat2['Xtest'], mat2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.ravel())

2.4 Top Predictors for Spam

predTrain = clf.score(X, y)
predTest = clf.score(Xtest, ytest)
predTrain, predTest

(0.99825, 0.989)
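The numbers above are the training and test accuracy. For the top predictors themselves, one possible approach with a linear kernel is to sort the learned weights in `coef_`: the words with the largest positive weights push an email most strongly towards the spam class. The sketch below uses a made-up four-word vocabulary and eight toy emails (`toy_vocab`, `X_toy`, `y_toy`, `toy_clf` are all hypothetical) so that it runs on its own; with the real data one would use the fitted `clf` and the words from vocab.txt instead.

```python
import numpy as np
from sklearn import svm

# Made-up vocabulary; each row of X_toy marks which words occur in one toy email.
toy_vocab = np.array(['click', 'meet', 'free', 'report'])
X_toy = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 1, 0, 0],
                  [0, 0, 0, 1],
                  [0, 1, 0, 1]])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = spam, 0 = not spam

toy_clf = svm.SVC(C=0.1, kernel='linear').fit(X_toy, y_toy)

# coef_ only exists for the linear kernel; row 0 holds one weight per word.
weights = toy_clf.coef_[0]
order = np.argsort(weights)[::-1]  # indices from largest to smallest weight
top_words = [(str(toy_vocab[i]), float(weights[i])) for i in order]
for word, weight in top_words:
    print('{:<8s} {:+.3f}'.format(word, weight))
```

In this toy setup 'click' and 'free' only ever appear in spam, so they should come out on top with positive weights, while 'meet' and 'report' receive negative weights.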
