
Machine Learning in Practice: Logistic Regression

(1) Logistic Regression

Logistic regression is a generalized linear model that can be used for both binary and multi-class classification, and it is widely applied in data mining, automated disease diagnosis, economic forecasting, and similar fields. Informally, logistic regression fits the data to a logistic function in order to estimate the probability that an event occurs, which is where the name comes from. Because the algorithm outputs an event probability, its output always lies between 0 and 1.

Advantages of logistic regression: it is fast, especially for binary classification; it handles both continuous and categorical independent variables; and it outputs probabilities, which are easy to interpret and use. Its main weakness is that it can overfit, in which case the model generalizes poorly to unseen data.

The procedure, in outline:
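A brief sketch of the standard formulation (the UFLDL tutorial linked at the end gives the full derivation): for a feature vector $x$ and weight vector $\beta$, the model assumes

$$P(y = 1 \mid x) = \sigma(\beta^{\top} x), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

and $\beta$ is chosen to maximize the log-likelihood of the training labels by gradient ascent,

$$\beta \leftarrow \beta + \eta \, \nabla_{\beta}\, \ell(\beta),$$

where $\eta$ is the learning rate. The implementation below carries out exactly these steps.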



(2) Python Implementation

1  Import packages and generate the data

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(12)
num_observations = 5000

# Two correlated 2-D Gaussian clusters with different means
x1 = np.random.multivariate_normal([0, 0], [[1, .75], [.75, 1]], num_observations)
x2 = np.random.multivariate_normal([1, 4], [[1, .75], [.75, 1]], num_observations)

# Stack the clusters into one feature matrix; the first cluster gets label 0, the second label 1
simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
simulated_labels = np.hstack((np.zeros(num_observations),
                              np.ones(num_observations)))
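Before fitting anything, it is worth plotting the simulated clusters as a sanity check (an optional step, not needed by the rest of the code):

plt.figure(figsize=(12, 8))
# Color each point by its class label
plt.scatter(simulated_separableish_features[:, 0], simulated_separableish_features[:, 1],
            c=simulated_labels, alpha=.4)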

2  Define the sigmoid function

def sigmoid(scores):
    # Squash raw scores into probabilities in (0, 1)
    return 1 / (1 + np.exp(-scores))
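One caveat: for large negative scores, np.exp(-scores) can overflow and emit runtime warnings (the result is still clamped to 0 or 1). A numerically stable variant, shown here as an optional refinement that is not part of the original post, branches on the sign of the score; scipy.special.expit implements the same idea:

def stable_sigmoid(scores):
    # Same function as sigmoid, but never exponentiates a large positive number.
    # Assumes scores is a NumPy array.
    out = np.empty_like(scores, dtype=np.float64)
    pos = scores >= 0
    out[pos] = 1 / (1 + np.exp(-scores[pos]))
    exp_s = np.exp(scores[~pos])  # scores[~pos] < 0, so np.exp stays bounded
    out[~pos] = exp_s / (1 + exp_s)
    return out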

3  Define the log-likelihood function

def log_likelihood(features, target, weights):
    # Log-likelihood of the labels under the current weights (see the formula below)
    scores = np.dot(features, weights)
    ll = np.sum(target*scores - np.log(1 + np.exp(scores)))
    return ll
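The expression inside np.sum is the log-likelihood rewritten so that the predicted probabilities never need to be formed explicitly:

$$\ell(\beta) = \sum_{i} \left[\, y_i \, \beta^{\top} x_i - \log\left(1 + e^{\beta^{\top} x_i}\right) \right].$$

Its gradient, which drives the weight updates in the next step, is

$$\nabla_{\beta}\, \ell(\beta) = X^{\top} \big( y - \sigma(X\beta) \big).$$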

4  Build the logistic regression function

def logistic_regression(features, target, num_steps, learning_rate, add_intercept = False):
    if add_intercept:
        intercept = np.ones((features.shape[0], 1))
        features = np.hstack((intercept, features))
        
    weights = np.zeros(features.shape[1])
    
    for step in range(num_steps):
        scores = np.dot(features, weights)
        predictions = sigmoid(scores)

        # Gradient ascent: move the weights uphill on the log-likelihood
        output_error_signal = target - predictions
        gradient = np.dot(features.T, output_error_signal)
        weights += learning_rate * gradient
        
        # Print the log-likelihood periodically to monitor convergence
        if step % 10000 == 0:
            print(log_likelihood(features, target, weights))
        
    return weights
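For reuse in the evaluation step, it can help to wrap prediction in a small helper. This predict function is a hypothetical convenience, not part of the original post, and it assumes the weights were fit with add_intercept=True:

def predict(features, weights):
    # Prepend the intercept column, then threshold the predicted probabilities at 0.5
    features = np.hstack((np.ones((features.shape[0], 1)), features))
    return np.round(sigmoid(np.dot(features, weights)))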

5  Compare with sklearn's logistic regression and evaluate the model

# A very large C effectively disables sklearn's L2 regularization,
# so its fit is comparable to our unregularized implementation
clf = LogisticRegression(fit_intercept=True, C=1e15)
clf.fit(simulated_separableish_features, simulated_labels)
weights = logistic_regression(simulated_separableish_features, simulated_labels,
                     num_steps = 300000, learning_rate = 5e-5, add_intercept=True)
data_with_intercept = np.hstack((np.ones((simulated_separableish_features.shape[0], 1)),
                                 simulated_separableish_features))
final_scores = np.dot(data_with_intercept, weights)
preds = np.round(sigmoid(final_scores))

print ('Accuracy from scratch: {0}'.format((preds == simulated_labels).sum().astype(float) / len(preds)))
print ('Accuracy from sk-learn: {0}'.format(clf.score(simulated_separableish_features, simulated_labels)))
plt.figure(figsize=(12, 8))
# Color each point by whether it was classified correctly
plt.scatter(simulated_separableish_features[:, 0], simulated_separableish_features[:, 1],
            c=(preds == simulated_labels), alpha=.8, s=50)
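It is also informative to compare the fitted parameters directly; because the large C effectively disables sklearn's regularization, the two sets of weights should nearly coincide:

print(clf.intercept_, clf.coef_)
print(weights)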

Note that sklearn's logistic regression module must be imported at the start of the program:

from sklearn.linear_model import LogisticRegression

For a detailed derivation of the logistic regression algorithm, see: http://ufldl.stanford.edu/tutorial/supervised/LogisticRegression/

This project follows: https://beckernick.github.io/logistic-regression-from-scratch/