
Basic Algorithms for Data Regression, Classification, and Prediction, with Python Implementations



This post is about regression, classification, and predictive analysis of data. It discusses several fairly basic algorithms, which can also be counted as relatively simple machine-learning algorithms.

1. The KNN algorithm

The K-nearest-neighbours algorithm can be used for both regression and classification. Its main idea is to take the K nearest neighbours of a point in the space of the independent variables and average their response values, giving a regression estimate or a classification. In general, the larger K is, the smaller the variance of the output, but the bias grows correspondingly; conversely, a very small K can cause overfitting. Choosing a sensible value of K is therefore an important step in the KNN algorithm.

Advantages

First, it is simple and effective. Second, the cost of retraining is low (changes to the category system and to the training set are common in web and e-commerce applications). Third, its time and space costs scale linearly with the size of the training set (which in some applications is not too large). Fourth, since KNN relies mainly on a limited number of nearby samples rather than on discriminating whole class regions, it is better suited than many other methods to sample sets whose class regions overlap or cross. Fifth, the algorithm works best for automatic classification of classes with large sample sizes; classes with small sample sizes are more prone to misclassification.

Disadvantages

- The estimate of the regression function can be highly unstable, as it is an average of only a few points. This is the price we pay for flexibility.

- The curse of dimensionality.

- Generating predictions is computationally expensive.

In summary, the nearest-neighbour algorithm is simple and easy to understand, but when the data set is large or has many dimensions the amount of computation grows considerably, so it is not recommended in those situations.

Python implementation of KNN:

(Selecting the data)

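The original code is shown only as screenshots, so here is a minimal sketch of what the data-selection step might look like, assuming a CSV file `data.csv` with a single predictor column `x` and a target column `y` (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (file name and column names are hypothetical)
data = pd.read_csv("data.csv")
X = data[["x"]]   # single predictor
y = data["y"]     # response variable

# Hold out a test set for evaluating the fitted model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```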

Or with two predictor variables:

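Continuing the sketch above, with two predictors the only change is in the column selection (column names again hypothetical):

```python
# Two predictor variables instead of one
X = data[["x1", "x2"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```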

(Taking k = 2 and k = 50 as examples)

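A sketch of fitting KNN regressors with the two example values of k, using scikit-learn's `KNeighborsRegressor` and the train/test split defined above:

```python
from sklearn.neighbors import KNeighborsRegressor

# Small k: low bias, high variance (risk of overfitting)
knn2 = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)

# Large k: higher bias, lower variance (smoother predictions)
knn50 = KNeighborsRegressor(n_neighbors=50).fit(X_train, y_train)
```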

(Finally, evaluating the model)

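Evaluation could, for instance, compare the test-set RMSE of the two fitted models:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

for name, model in [("k=2", knn2), ("k=50", knn50)]:
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(name, "test RMSE:", rmse)
```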

(Or alternatively)

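The original screenshot is not recoverable; as one possible alternative metric, the R² score (the default `score` method of scikit-learn regressors) can be used:

```python
# R^2 on the held-out test set; closer to 1 is better
print("k=2  R^2:", knn2.score(X_test, y_test))
print("k=50 R^2:", knn50.score(X_test, y_test))
```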

It is worth mentioning that the value of K has to be chosen carefully.


One can enumerate candidate values of K and pick the one that minimizes the RMSE on the test set (the RMSE on the training set increases as K increases).
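Continuing the example, a simple enumeration over K might look like this:

```python
# Try a range of K values and keep the one with the smallest test RMSE
rmse_by_k = {}
for k in range(1, 51):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse_by_k[k] = np.sqrt(mean_squared_error(y_test, pred))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print("best k:", best_k, "with test RMSE:", rmse_by_k[best_k])
```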

2. Regularized regression

When fitting a regression, the two main things to balance are variance and bias. Ordinary least squares (OLS) only addresses the bias side: as more predictors are added and the model grows more complex, the bias keeps shrinking, but the variance grows correspondingly, which can seriously hurt our predictions. How to balance the two is the key consideration.

This is where the concept of regularization comes in.

2.1 Ridge regression

Ridge regression adds an L2 penalty on the coefficients to the least-squares objective:

minimize  RSS + λ Σj βj²

The second (penalty) term is also known as L2 regularisation.

Advantages

Solving multicollinearity is one of the advantages of ridge regression, and using a ridge model can improve predictive performance. Another advantage is that, by introducing a penalty term, the ridge model substantially alleviates overfitting: the coefficients of unimportant features are shrunk very close to zero, which reduces the variance and improves the performance of the prediction model.

Disadvantages

Since the penalty shrinks coefficients towards zero but never makes them exactly zero, many features remain in the model, so the model cannot be explained in terms of a small subset of features.

In summary, ridge regression can handle multicollinearity (multicollinearity means that the explanatory variables in a linear regression model are exactly or highly correlated, which distorts the model estimates or makes them hard to estimate accurately). It also penalizes overfitting; in practice it is worth trying it and checking how large the resulting error is.

Python implementation:

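A minimal sketch with scikit-learn, reusing a train/test split like the one above; `RidgeCV` picks the penalty strength (`alpha` in scikit-learn, i.e. the λ in the formula) by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Candidate penalty strengths; alpha corresponds to lambda in the formula
alphas = np.logspace(-3, 3, 100)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)

print("chosen alpha:", ridge.alpha_)
print("test R^2:", ridge.score(X_test, y_test))
```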

2.2 Lasso regression

Lasso regression instead penalises the absolute values of the coefficients:

minimize  RSS + λ Σj |βj|

The second (penalty) term is also known as L1 regularisation.

Its advantages and disadvantages are similar to those of ridge regression.

Python implementation:

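A sketch with `LassoCV`, which, given the number of cross-validation folds, searches a path of λ (alpha) values automatically:

```python
from sklearn.linear_model import LassoCV

# LassoCV tries a path of alphas and keeps the one with the best CV score
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)

print("chosen alpha:", lasso.alpha_)
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
print("test R^2:", lasso.score(X_test, y_test))
```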

It is worth mentioning that a nice feature of Python here is that, once the number of cross-validation folds is supplied, it will automatically select the best lambda.

3. XGBoost

XGBoost is, so far, the most effective of these methods for data classification and prediction. Its accuracy leads the methods discussed here, which gives it great practical significance.

XGBoost is based on decision trees: the outputs of many trees are summed to give the final classification. An ordinary decision tree is first grown as large as possible and then pruned greedily. XGBoost differs in that every newly added tree is chosen to be the best possible addition, so the final result approaches an optimal solution. In addition, XGBoost adds a complexity penalty, i.e. a regularization term, which includes the number of leaf nodes of a tree and the squared L2 norm of the scores output at the leaves. (My understanding of the precise details is still incomplete.)

Algorithm

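The original post shows the algorithm only as a screenshot. For reference, the regularized objective that XGBoost minimizes, matching the description above, can be written as

Obj = Σi l(yi, ŷi) + Σk Ω(fk),   where   Ω(f) = γT + (1/2) λ Σj wj²

with T the number of leaves of a tree and wj its leaf scores; each new tree is added greedily so as to reduce this objective as much as possible.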

Advantages

1. Compared with ordinary gradient boosting, XGBoost is faster: its leaf weights come from a Newton step, so no line search is needed and the step length is naturally taken to be 1.

2. It has an advantage in feature sorting: XGBoost sorts the data before training and stores the result in block structures, and these blocks can be reused repeatedly in the subsequent boosting rounds.

3. XGBoost handles the bias-variance tradeoff: the regularization term controls the model complexity and helps avoid overfitting.

In summary, XGBoost is simply a very useful algorithm.

Python implementation:

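A minimal sketch with the `xgboost` scikit-learn wrapper, reusing the train/test split from before (the hyperparameter values are illustrative only):

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,   # number of boosted trees
    max_depth=4,        # depth of each tree
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    reg_lambda=1.0,     # L2 regularization on leaf weights
)
xgb.fit(X_train, y_train)
print("test R^2:", xgb.score(X_test, y_test))
```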

(Choosing the parameters)

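Parameter selection can be done, for example, with `GridSearchCV`, which tries every combination in the given ranges and keeps the best one (the grid below is only an example):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [3, 4, 6],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 300, 500],
}
search = GridSearchCV(XGBRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
```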

How to tune the parameters is a crucial step; given a range of candidate values, Python will automatically select the best combination among them.

A related blog post can be consulted here (https://blog.csdn.net/sb19931201/article/details/52557382).


Source: (https://blog.csdn.net/sb19931201/article/details/52557382)

4. LightGBM

LightGBM is an algorithm released by Microsoft in 2016 that improves on XGBoost. It mainly speeds up training; the corresponding price is some loss of accuracy.

Algorithm

The algorithm is similar to XGBoost except for the direction in which trees are grown: when the data set is small, LightGBM grows trees leaf-wise, whereas traditional algorithms grow trees depth-wise. Its feature-parallel scheme, which most distinguishes it from the others, works as shown below (Sphinx):

1. Workers find the local best split point {feature, threshold} on their local feature set.

2. Workers communicate their local best splits with each other and pick the overall best one.

3. Workers perform the optimal split.

Advantages

1. It is optimized for speed and reduced memory usage, especially when training on large amounts of data.

2. It is optimized for accuracy: unlike most tree-learning algorithms, LightGBM grows trees leaf-wise rather than depth-wise.

3. It finds optimal splits for categorical features: LightGBM sorts the histogram of a categorical feature by its accumulated values and then finds the best split on the sorted histogram.

Python implementation:

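A sketch with the `lightgbm` scikit-learn wrapper; `num_leaves` bounds the leaf-wise growth mentioned above (values are illustrative, and the same train/test split as before is assumed):

```python
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=300,
    num_leaves=31,      # leaf-wise growth is bounded by the leaf count
    learning_rate=0.1,
)
lgbm.fit(X_train, y_train)
print("test R^2:", lgbm.score(X_test, y_test))
```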

(All of the English passages in this article are excerpted from a report written jointly with my group members during the course.)
