ML course: the XGBoost and LightGBM libraries, with example code
Below are my study notes and summary; if you spot any mistakes, please don't hesitate to point them out.
This article covers the basic usage of XGBoost and LightGBM, two libraries that dominate Kaggle competitions, along with example code.
First, a quick refresher on the boosting principle and the algorithms derived from it: AdaBoost, GBDT, and the stronger XGBoost that followed. If you need a reminder, see my earlier article: ML course: decision trees, random forests, GBDT, and XGBoost (with code implementations), as well as the related tree-ensemble material in: ML course: model ensembling and tuning, with example code. Refresher over; on to the main content.
XGBoost:
Short for eXtreme Gradient Boosting; the source code is on GitHub (see the link in the references below).
XGBoost computes faster for several reasons:
- Parallelization: construction of a single tree can be parallelized across all CPU cores (see the sketch after this list).
- Distributed Computing: very large models can be trained on a cluster of machines.
- Out-of-Core Computing: datasets too large to fit in memory can still be processed.
- Cache Optimization of data structures and algorithms: makes better use of the hardware.
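As a concrete illustration of the parallelization point, here is a minimal sketch using the scikit-learn wrapper; the synthetic dataset and parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# n_jobs (nthread in the native API) controls how many CPU cores are used
# to build each tree in parallel; -1 means use all available cores.
model = XGBClassifier(n_estimators=100, max_depth=3, n_jobs=-1)
model.fit(X, y)
print(model.score(X, y))
```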
Figure: XGBoost's performance compared with other gradient boosting and bagged decision tree implementations.
Another advantage of XGBoost is its strong predictive performance; see these links from competition winners:
- Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
- Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition.
- Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.
The most commonly used parts of XGBoost:
Like sklearn, this library has several commonly used components:
- XGBoost Tutorials: worked examples showing how to use the library.
- XGBoost Parameters: the parameters to tune, grouped into general parameters, booster parameters, and task parameters (a concrete example follows the code below).
- Python API Reference: the various API interfaces.
- Advanced usage: fetch the source from GitHub and modify it as needed; for example, we can define a custom loss function and evaluation metric, as in the following code.
```python
#!/usr/bin/python
# Note: the raw data must first be converted into .train and .test files and loaded as DMatrix
import numpy as np
import xgboost as xgb

###
# advanced: customized loss function
#
print('start running example to use customized objective function')

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

# note: for a customized objective function, we leave objective as default
# note: what we are getting is the margin value in prediction
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user defined objective function: given predictions, return the gradient
# and second-order gradient; this is the log-likelihood (logistic) loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels          # first-order derivative
    hess = preds * (1.0 - preds)   # second-order derivative
    return grad, hess

# user defined evaluation function, returns a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the
# margin, which may make builtin evaluation metrics misbehave: for logistic
# loss the prediction is the score *before* the logistic transformation, while
# the builtin error metric assumes it comes *after*. Keep this in mind and
# write a customized evaluation function when needed.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # the metric name must not contain a colon (:) or a space;
    # preds are margins (before logistic transformation, cutoff at 0)
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with a customized objective; we can also do step-by-step training --
# simply look at the implementation of train in xgboost's source
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
```
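To make the three parameter categories listed above concrete, here is a minimal sketch of a native-API configuration; the specific values are illustrative assumptions, not tuned recommendations.

```python
import xgboost as xgb

params = {
    # general parameters: which booster to use and how it runs
    'booster': 'gbtree',
    'nthread': 4,
    # booster parameters: control the shape of the individual trees
    'max_depth': 6,
    'eta': 0.1,          # learning rate
    'subsample': 0.8,
    # task parameters: define the learning objective and evaluation metric
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
}

dtrain = xgb.DMatrix('../data/agaricus.txt.train')  # same data as above
bst = xgb.train(params, dtrain, num_boost_round=100)
```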
Related links:
Complete XGBoost walkthrough project: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
XGBoost sklearn API: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
XGBoost API docs: https://xgboost.readthedocs.io/en/latest/
GitHub source: https://github.com/dmlc/xgboost
LightGBM:
Like XGBoost, LightGBM is a gradient boosting library; it was open-sourced by Microsoft, and compared with XGBoost it trains faster, especially on large datasets, and supports more algorithm variants.
The most commonly used parts of LightGBM:
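LightGBM's documentation is organized much like XGBoost's (tutorials, parameters, Python API). As a starting point, here is a minimal training sketch; the synthetic data and parameter values are illustrative assumptions.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# LightGBM wraps data in its own Dataset object (analogous to xgb.DMatrix).
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,      # trees grow leaf-wise, so limit leaves, not depth
    'learning_rate': 0.1,
}

bst = lgb.train(params, train_set, num_boost_round=100, valid_sets=[valid_set])
print(bst.predict(X_valid[:5]))  # predicted probabilities for a few rows
```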
Finally, back to the example code: you're welcome to follow my GitHub.
To be continued......