XGBoost Notes
1. Principles
//TODO
2. Python Package Scikit-Learn API
2.1 Input
Features fall into two kinds: continuous ones, such as weight, and categorical ones, such as gender.
The Glossary of Common Terms and API Elements in scikit-learn contains the following passage:
Categorical Feature
A categorical or nominal feature is one that has a finite set of discrete values across the population of data. These are commonly represented as columns of integers or strings. Strings will be rejected by most scikit-learn estimators, and integers will be treated as ordinal or count-valued. For the use with most estimators, categorical variables should be one-hot encoded. Notable exceptions include tree-based models such as random forests and gradient boosting models that often work better and faster with integer-coded categorical variables. OrdinalEncoder helps encoding string-valued categorical features as ordinal integers, and OneHotEncoder can be used to one-hot encode categorical features. See also Encoding categorical features and the http://contrib.scikit-learn.org/categorical-encoding package for tools related to encoding categorical features.
In short: when training tree-based models, integer coding is recommended over one-hot encoding.
Details: https://scikit-learn.org/stable/glossary.html#glossary
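As a minimal sketch (made-up values) of the two encoders the glossary mentions:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

X = np.array([["male"], ["female"], ["female"], ["male"]])

# Integer coding -- the form the glossary recommends for tree-based models.
print(OrdinalEncoder().fit_transform(X))
# [[1.] [0.] [0.] [1.]]  (categories are sorted, so female=0, male=1)

# One-hot encoding -- the form recommended for most other estimators.
print(OneHotEncoder().fit_transform(X).toarray())
# [[0. 1.] [1. 0.] [1. 0.] [0. 1.]]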
2.2 Output
Only two objectives are covered here: multi:softmax and multi:softprob. The official documentation describes them as follows:
multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class (number of classes)
multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
There is a small gotcha here: whichever of the two you pass when constructing the model, printing the model after fit shows the parameter as multi:softprob, and the result of predict does not match the quoted description either: it is the multi:softmax-style result, containing only the predicted labels and no probability distribution.
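Below is a minimal sketch of this behaviour (made-up toy data; the exact printed parameters can vary across xgboost versions):

import numpy as np
from xgboost import XGBClassifier

# Toy data: 90 samples, 4 features, 3 classes (values are arbitrary).
X = np.random.rand(90, 4)
y = np.repeat([0, 1, 2], 30)

model = XGBClassifier(objective="multi:softmax")
model.fit(X, y)

# Depending on the xgboost version, the printed parameters may already
# show objective='multi:softprob' even though 'multi:softmax' was passed.
print(model)

# predict() returns only the winning label per sample, no probabilities.
print(model.predict(X[:5]))    # e.g. [0 2 1 0 2]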
The relevant official code is as follows. As you can see, num_class does not need to be set either; objective is forcibly replaced with multi:softprob. If you want the probability distribution, use the predict_proba function for prediction.
self.classes_ = np.unique(y)
self.n_classes_ = len(self.classes_)
if self.n_classes_ > 2:
    # Switch to using a multiclass objective in the underlying XGB instance
    xgb_options["objective"] = "multi:softprob"
    xgb_options['num_class'] = self.n_classes_
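For example, a self-contained usage sketch of predict_proba (toy data again; the point is the shape of the output):

import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(90, 4)
y = np.repeat([0, 1, 2], 30)

model = XGBClassifier()        # objective becomes multi:softprob internally
model.fit(X, y)

proba = model.predict_proba(X[:2])
print(proba.shape)             # (2, 3): one row per sample, one column per class
print(proba)                   # each row is a probability distribution summing to 1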
3. Demo
//TODO