
Implementing Gradient Descent and Logistic Regression in Python (full source code: predicting whether a student is admitted)

This case study builds a logistic regression model to predict whether a student will be admitted to university; the derivation of the algorithm is not covered in detail.

Readers can consult other blog posts for the details of the gradient descent algorithm, e.g.: https://blog.csdn.net/wangliang0633/article/details/79082901
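For quick reference, these are the standard logistic regression formulas that the sigmoid, cost, and gradient functions in the source below implement (notation is mine; $m$ is the number of samples, $\alpha$ the learning rate):

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

$$\theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$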

Data format: the first two columns are exam scores, and the third column is the admission status (0 = not admitted, 1 = admitted).
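Each line of the file is therefore comma-separated with no header row (which is why the code passes header=None to pd.read_csv). The numbers below are only an illustration of the layout, not rows copied from LogiReg_data.txt:

    34.62,78.02,0
    60.18,86.31,1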

Source code:

#!/usr/bin/env python
# encoding: utf-8
"""
@Company:華中科技大學電氣學院聚變與等離子研究所
@version: V1.0
@author: Victor
@contact: [email protected]
or [email protected] 2018--2020 @software: PyCharm @file: LogisticsRegression.py @time: 2018/11/12 15:10 @Desc:建立邏輯迴歸模型預測一個學生是否被大學錄取 """ import numpy as np import pandas as pd import matplotlib.pyplot as plt import os path="E:\PycharmWorks\Files"+os.sep+"LogiReg_data.txt" pdData = pd.read_csv(path,header=None,names=['Exam1','Exam2','Admitted']) #pdData.head(3) ##畫出錄取和未錄取的散點分佈圖 positive = pdData[pdData['Admitted'] == 1] negative = pdData[pdData['Admitted'] == 0] #plt.scatter(positive['Exam1'],positive['Exam2'],s=30,c='b',marker='o',label='Admitted') #plt.scatter(negative['Exam1'],negative['Exam2'],s=30,c='r',marker='x',label='UNAdmitted') #plt.legend() #plt.xlabel("Exam1 Score") #plt.ylabel("Exam2 Score") #plt.show() '''目標:建立分類器 設定閾值:根據閾值判斷錄取結果 要完成的模組: sigmodi:對映到概率的函式 model:返回預測結果值 cost:根據引數計算損失 gradient:計算每個引數的梯度方向 descent:進行引數更新 accuracy:計算精度''' def sigmoid(z): return 1/(1+np.exp(-z)) def model(X,theta): return sigmoid(np.dot(X,theta.T)) pdData.insert(0,'Ones',1) #print(pdData) orig_data = pdData.as_matrix() ##變為矩陣 ##print(orig_data) cols = orig_data.shape[1] X = orig_data[:,0:cols-1] #print(X[:5]) ##前5行 y = orig_data[:,cols-1:cols] #print(y[:4]) ##構建引數矩陣 theta = np.zeros([1,3]) #print(theta) ####損失函式(實現似然函式), def cost(X,y,theta): left = np.multiply(-y,np.log(model(X,theta))) right = np.multiply(1 - y,np.log(1 - model(X,theta))) return np.sum(left-right)/(len(X))/n #print(cost(X,y,theta)) ####計算梯度,計算每個引數的梯度 def gradient(X,y,theta): grad = np.zeros(theta.shape) ##佔位 error = (model(X,theta)-y)[:,1] for j in range(len(theta.ravel())): term = np.multiply(error,X[:,j])###X的行表示樣本,列表示特徵 grad[0,j] = np.sum(term) / len(X) return grad #print(gradient(X,y,theta)) ###比較三種不同的梯度下降方法 STOP_ITER = 0 STOP_COST = 1 STOP_GRAD = 2 def stopCriterion(type,value,threshod): if type == STOP_ITER: return value > threshod elif type == STOP_COST: return abs(value[-1]-value[-2] < threshod) elif type == STOP_GRAD: return np.linalg.norm(value) < threshod ###洗牌,避免資料收集過程中有規律,打亂資料,可以得到更好的模型 import numpy.random def shuffleData(data): np.random.shuffle(data) cols = data.shape[1] X = data[:,0:cols-1] y = data[:,cols-1] return X,y ####梯度下降求解 import time def descent(data,theta,batchSize,stopType,thresh,alpha): init_time = time.time() i = 0 #迭代次數 k = 0 #batch X,y = shuffleData(data) grad = np.zeros(theta.shape) costs = [cost(X,y,theta)] while True: grad = gradient(X[k:k+batchSize],y[k:k+batchSize],theta) k += batchSize if k >= 100: k = 0 X,y = shuffleData(data) theta = theta -alpha*grad ##引數更新 costs.append(cost(X,y,theta)) ##計算新的損失 i += 1 if stopType == STOP_ITER: value = i elif stopType == STOP_COST: value = costs elif stopType == STOP_GRAD: value = grad if stopCriterion(stopType,value,thresh):break return theta,i-1,costs,grad,time.time()-init_time def RunExp(data,theta,batchSize,stopType,thresh,alpha): theta,iter,costs,grad,dur = descent(data,theta,batchSize,stopType,thresh,alpha) name = "Original" if (data[:,1]>2).sum() > 1 else "Scaled" name += "data- learning rate:{}-".format(alpha) print("***{}\nTheta:{}-Iter:{}-Last cost:{:03.2f} - Duration:{:03.2f}s".format(name,theta,iter,costs[-1],dur)) plt.plot(np.arange(len(costs)),costs,'r') plt.xlabel("Iterations") plt.ylabel("Cost") plt.title("Error vs Itetarion") plt.show() return theta n=100 RunExp(orig_data,theta,n,STOP_ITER,thresh=12000,alpha=0.00000012) ###計算模型精度 ##設定閾值 def predict(X,theta): return [1 if x >= 0.5 else 0 for x in model(X,theta)] scaled_X = orig_data[:,:3] y = orig_data[:,3] predicts = predict(scaled_X,theta) correct = [1 if ((a == 1 and b == 1) or (a == 0 and 
b == 0)) else 0 for (a,b) in zip(predicts,y)] accuracy = (correct.count(1) % len(correct)) print("accuracy = {0}%".format(accuracy))
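The script above only exercises the iteration-count criterion (STOP_ITER). The other two criteria defined in stopCriterion can be tried the same way; the thresholds below are illustrative guesses, not tuned values from this post:

    # Stop once the cost changes by less than 1e-6 between iterations
    # (STOP_COST), still using full-batch gradient descent.
    theta = np.zeros([1, 3])
    RunExp(orig_data, theta, n, STOP_COST, thresh=0.000001, alpha=0.00000012)

    # Stop once the gradient norm drops below 0.05 (STOP_GRAD), using
    # mini-batches of 16 samples; on unscaled data this can take a
    # very large number of iterations.
    theta = np.zeros([1, 3])
    RunExp(orig_data, theta, 16, STOP_GRAD, thresh=0.05, alpha=0.00000012)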

Results:

(the cost-vs-iteration plot produced by RunExp is not reproduced here)

Prediction results:

As the output shows, the accuracy is not high; the parameters still need tuning, and more training samples would help.
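One concrete adjustment worth trying, hinted at by the "Scaled" branch inside RunExp, is standardizing the two score columns before training. A minimal sketch, assuming scikit-learn is available (the scaling step is my addition, not part of the original script):

    from sklearn import preprocessing as pp

    # Standardize the two feature columns to mean 0 and variance 1;
    # column 0 is the bias term and column 3 the label, so only
    # columns 1-2 are transformed.
    scaled_data = orig_data.copy()
    scaled_data[:, 1:3] = pp.scale(orig_data[:, 1:3])

    # With scaled features a much larger learning rate converges
    # quickly (values here are illustrative, not tuned).
    theta = np.zeros([1, 3])
    theta = RunExp(scaled_data, theta, n, STOP_ITER, thresh=5000, alpha=0.001)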