
A Detailed Walkthrough of the Official gcForest Code

1. Introduction

gcForest v1.1.1 is the official version of gcForest hosted on GitHub, maintained and developed by Ji Feng (one of the authors of the Deep Forest paper). It supports Python 3.5 and offers a Scikit-Learn-like API, and the project ships with several usage examples. The base classifiers currently supported are RandomForestClassifier, XGBClassifier, ExtraTreesClassifier, LogisticRegression, and SGDClassifier; when the XGBoost base classifier is used, a GPU can also be employed.

This article is based on v1.1.1; the GitHub repository is at https://github.com/kingfengji/gcForest

To add other base classifiers, register them in lib/gcforest/estimators/__init__.py within the package.

The package depends on the following modules:

  • argparse
  • joblib
  • keras
  • psutil
  • scikit-learn>=0.18.1
  • scipy
  • simplejson
  • tensorflow
  • xgboost

2. API Usage Examples

 

First, the API methods provided by gcForest:

  • fit_transform(X_train,y_train) returns the concatenated result of the probabilities predicted by each estimator in the last layer of the gcForest model

  • fit_transform(X_train,y_train,X_test=x_test,y_test=y_test): the accuracy on the test data is also logged during training

  • set_keep_model_in_mem(False): if memory is tight, set this to False (the default is True). If set to False, you must use fit_transform(X_train,y_train,X_test=x_test,y_test=y_test) to evaluate your model

  • predict(X_test) # model prediction

  • transform(X_test) # obtain the encoded (concatenated probability) features for new data


The code is split into two parts: the examples folder holds the main .py scripts and the .json configuration files, while the lib folder holds the library code they depend on.

Implementation of the main code

The simplest way to use gcForest is as follows:


# import the necessary modules
from gcforest.gcforest import GCForest

# initialize a gcForest object
gc = GCForest(config) # config is a dictionary

# the concatenated result of the probabilities predicted by each
# estimator in the last layer of the gcForest model
X_train_enc = gc.fit_transform(X_train,y_train)

# prediction on the test set
y_pred = gc.predict(X_test)
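
For the snippet above to run, config must be a valid configuration dictionary. Here is a minimal sketch, assuming a single random forest in the cascade (the full get_toy_config() in section 3 shows the canonical pattern):

def get_minimal_config(n_classes):
    # A minimal one-estimator cascade config following the same
    # get_toy_config() pattern shown later; a sketch, not an official recipe.
    return {
        "cascade": {
            "random_state": 0,
            "max_layers": 100,
            "early_stopping_rounds": 3,
            "n_classes": n_classes,
            "estimators": [
                {"n_folds": 5, "type": "RandomForestClassifier",
                 "n_estimators": 10, "max_depth": None, "n_jobs": -1}
            ]
        }
    }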

 

The lib library in detail

gcforest.py: implementation of the overall framework
fgnet.py: the multi-grained scanning part (the FineGrained implementation)
cascade/cascade_classifier: implementation of the cascade classifier
datasets/....: definitions of a range of datasets
estimator/...: wrappers around the base classifiers used for evaluation (several classifier types)
layer/...: the different layer operations, such as concatenation, pooling, and sliding windows
utils/..: utility functions such as accuracy computation, win_vote, win_avg, get_windows, etc. (a conceptual sketch of the sliding-window idea follows)
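
As a conceptual illustration of what get_windows does, here is a minimal numpy sketch of sliding-window patch extraction (an illustration of the idea only, not the library's actual implementation):

import numpy as np

def sliding_windows(img, win, stride):
    # Extract win x win patches from a 2-D array at the given stride.
    # Conceptual sketch only; the library's get_windows handles the
    # general multi-channel case used by FGWinLayer.
    h, w = img.shape
    patches = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patches.append(img[y:y + win, x:x + win].ravel())
    return np.array(patches)

patches = sliding_windows(np.arange(28 * 28).reshape(28, 28), win=7, stride=2)
print(patches.shape)  # (121, 49): 11 x 11 window positions, 7*7 pixels each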

 

The JSON configuration file in detail

Parameters

  • max_depth: the maximum depth of the decision trees. The default is None, in which case subtree depth is not limited while the tree is grown, so splitting continues until every leaf node contains a single class or fewer than min_samples_split samples. With little data or few features this value can usually be left alone; with many samples and many features it is advisable to limit the maximum depth, the exact value depending on the data distribution. Values between 10 and 100 are common.
  • estimators: the chosen base classifiers
  • n_estimators: the number of trees in each forest
  • n_jobs: int (default=1)
    The number of jobs to run in parallel for any Random Forest fit and predict.
    If -1, then the number of jobs is set to the number of cores.

The training configuration falls into three cases:

  1. Using the default model
def get_toy_config():
    config = {}
    ca_config = {}
    ca_config["random_state"] = 0  # 0 or 1
    ca_config["max_layers"] = 100  #最大的層數,layer對應論文中的level
    ca_config["early_stopping_rounds"] = 3  #如果出現某層的三層以內的準確率都沒有提升,層中止
    ca_config["n_classes"] = 3      #判別的類別數量
    ca_config["estimators"] = []  
    ca_config["estimators"].append(
            {"n_folds": 5, "type": "XGBClassifier", "n_estimators": 10, "max_depth": 5,
             "objective": "multi:softprob", "silent": True, "nthread": -1, "learning_rate": 0.1} )
    ca_config["estimators"].append({"n_folds": 5, "type": "RandomForestClassifier", "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "ExtraTreesClassifier", "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "LogisticRegression"})
    config["cascade"] = ca_config    #共使用了四個基學習器
    return config

Supported base classifiers:
RandomForestClassifier
XGBClassifier
ExtraTreesClassifier
LogisticRegression
SGDClassifier
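
SGDClassifier appears in none of the configuration examples in this article; a hedged estimator entry might look like this (assuming, as with the other estimator types, that the extra keys are forwarded to the scikit-learn constructor):

# Hypothetical SGDClassifier entry; "loss" and "penalty" are assumed to be
# passed through to sklearn's SGDClassifier constructor, like the keyword
# arguments of the other estimator types.
sgd_estimator = {"n_folds": 5, "type": "SGDClassifier",
                 "loss": "log", "penalty": "l2"}
ca_config["estimators"].append(sgd_estimator)  # ca_config as in get_toy_config()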

You can manually add any classifier by editing:

lib/gcforest/estimators/__init__.py
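
That file maps the "type" string in the config to an estimator class. A rough, hypothetical sketch of what adding a new branch could look like (the wrapper classes and function names in the real file may differ; check it before editing):

# Hypothetical sketch of registering a new base classifier in
# lib/gcforest/estimators/__init__.py; names below are illustrative only.
from sklearn.naive_bayes import GaussianNB

def get_estimator_class(est_type):
    if est_type == "GaussianNB":   # new branch for the added classifier
        return GaussianNB
    # ... existing branches for RandomForestClassifier, XGBClassifier, etc.
    raise ValueError("Unknown estimator type: {}".format(est_type))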
  2. Cascade only
{
"cascade": {
    "random_state": 0,
    "max_layers": 100,
    "early_stopping_rounds": 3,
    "n_classes": 10,
    "estimators": [
        {"n_folds":5,"type":"XGBClassifier","n_estimators":10,"max_depth":5,"objective":"multi:softprob", "silent":true, "nthread":-1, "learning_rate":0.1},
        {"n_folds":5,"type":"RandomForestClassifier","n_estimators":10,"max_depth":null,"n_jobs":-1},
        {"n_folds":5,"type":"ExtraTreesClassifier","n_estimators":10,"max_depth":null,"n_jobs":-1},
        {"n_folds":5,"type":"LogisticRegression"}
    ]
}
}
  3. "multi fine-grained + cascade": both parts
    Sliding window sizes: {⌊d/16⌋, ⌊d/8⌋, ⌊d/4⌋}, where d is the number of input features;
    "look_indexs_cycle": [
    [0, 1],
    [2, 3],
    [4, 5]]
    specifies how the multi-grained outputs are fed to the cascade: the first layer concatenates the outputs of forests 0 and 1, the second layer those of forests 2 and 3, and the third layer those of forests 4 and 5 (see the sketch after the configuration below)
{
"net":{
"outputs": ["pool1/7x7/ets", "pool1/7x7/rf", "pool1/10x10/ets", "pool1/10x10/rf", "pool1/13x13/ets", "pool1/13x13/rf"],
"layers":[
// win1/7x7
    {
        "type":"FGWinLayer",
        "name":"win1/7x7",
        "bottoms": ["X","y"],
        "tops":["win1/7x7/ets", "win1/7x7/rf"],
        "n_classes": 10,
        "estimators": [
            {"n_folds":3,"type":"ExtraTreesClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10},
            {"n_folds":3,"type":"RandomForestClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10}
        ],
        "stride_x": 2,
        "stride_y": 2,
        "win_x":7,
        "win_y":7
    },
// win1/10x10
    {
        "type":"FGWinLayer",
        "name":"win1/10x10",
        "bottoms": ["X","y"],
        "tops":["win1/10x10/ets", "win1/10x10/rf"],
        "n_classes": 10,
        "estimators": [
            {"n_folds":3,"type":"ExtraTreesClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10},
            {"n_folds":3,"type":"RandomForestClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10}
        ],
        "stride_x": 2,
        "stride_y": 2,
        "win_x":10,
        "win_y":10
    },
// win1/13x13
    {
        "type":"FGWinLayer",
        "name":"win1/13x13",
        "bottoms": ["X","y"],
        "tops":["win1/13x13/ets", "win1/13x13/rf"],
        "n_classes": 10,
        "estimators": [
            {"n_folds":3,"type":"ExtraTreesClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10},
            {"n_folds":3,"type":"RandomForestClassifier","n_estimators":20,"max_depth":10,"n_jobs":-1,"min_samples_leaf":10}
        ],
        "stride_x": 2,
        "stride_y": 2,
        "win_x":13,
        "win_y":13
    },
// pool1
    {
        "type":"FGPoolLayer",
        "name":"pool1",
        "bottoms": ["win1/7x7/ets", "win1/7x7/rf", "win1/10x10/ets", "win1/10x10/rf", "win1/13x13/ets", "win1/13x13/rf"],
        "tops": ["pool1/7x7/ets", "pool1/7x7/rf", "pool1/10x10/ets", "pool1/10x10/rf", "pool1/13x13/ets", "pool1/13x13/rf"],
        "pool_method": "avg",
        "win_x":2,
        "win_y":2
    }
]

},

"cascade": {
    "random_state": 0,
    "max_layers": 100,
    "early_stopping_rounds": 3,
    "look_indexs_cycle": [
        [0, 1],
        [2, 3],
        [4, 5]
    ],
    "n_classes": 10,
    "estimators": [
        {"n_folds":5,"type":"ExtraTreesClassifier","n_estimators":1000,"max_depth":null,"n_jobs":-1},
        {"n_folds":5,"type":"RandomForestClassifier","n_estimators":1000,"max_depth":null,"n_jobs":-1}
    ]
}
}
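
To make look_indexs_cycle concrete, here is a small sketch of how the cascade layers cycle through the pooled output groups, based on a plain reading of the config above (illustrative code, not part of the library):

# Which fine-grained output group each cascade layer consumes,
# per the "look_indexs_cycle" above (illustrative, not library code).
outputs = ["pool1/7x7/ets", "pool1/7x7/rf",
           "pool1/10x10/ets", "pool1/10x10/rf",
           "pool1/13x13/ets", "pool1/13x13/rf"]
look_indexs_cycle = [[0, 1], [2, 3], [4, 5]]

for layer in range(6):
    group = look_indexs_cycle[layer % len(look_indexs_cycle)]
    print("layer {}: {}".format(layer, [outputs[i] for i in group]))
# layer 0: pool1/7x7 outputs; layer 1: pool1/10x10; layer 2: pool1/13x13;
# layer 3 starts the cycle again with pool1/7x7, and so on.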

3. MNIST Example

Below we use the MNIST dataset to demonstrate gcForest, with a detailed walkthrough of the code:

# import the necessary modules

import argparse # command-line argument parsing
import numpy as np 
import sys
from keras.datasets import mnist # the MNIST dataset
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
sys.path.insert(0, "lib")

from gcforest.gcforest import GCForest
from gcforest.utils.config_utils import load_json


def parse_args():
    '''
    Parse the command-line arguments (model)
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", dest="model", type=str, default=None,
                        help="gcforest Net Model File")
    args = parser.parse_args()
    return args


def get_toy_config():
    '''
    Build the configuration of the cascade structure
    '''
    config = {}
    ca_config = {}
    ca_config["random_state"] = 0
    ca_config["max_layers"] = 100
    ca_config["early_stopping_rounds"] = 3
    ca_config["n_classes"] = 10
    ca_config["estimators"] = []
    ca_config["estimators"].append(
            {"n_folds": 5, "type": "XGBClassifier", "n_estimators": 10,
             "max_depth": 5, "objective": "multi:softprob",
             "silent": True, "nthread": -1, "learning_rate": 0.1})
    ca_config["estimators"].append({"n_folds": 5, "type": "RandomForestClassifier",
                                    "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "ExtraTreesClassifier",
                                    "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "LogisticRegression"})
    config["cascade"] = ca_config
    return config

# The structure produced by get_toy_config() is shown below:

'''
{
"cascade": {
    "random_state": 0,
    "max_layers": 100,
    "early_stopping_rounds": 3,
    "n_classes": 10,
    "estimators": [
        {"n_folds":5,"type":"XGBClassifier","n_estimators":10,"max_depth":5,
		"objective":"multi:softprob", "silent":true, 
		"nthread":-1, "learning_rate":0.1},
        {"n_folds":5,"type":"RandomForestClassifier","n_estimators":10,
		"max_depth":null,"n_jobs":-1},
        {"n_folds":5,"type":"ExtraTreesClassifier","n_estimators":10,
		"max_depth":null,"n_jobs":-1},
        {"n_folds":5,"type":"LogisticRegression"}
    ]
}
}
'''

if __name__ == "__main__":
    args = parse_args()
    if args.model is None:
        config = get_toy_config()
    else:
        config = load_json(args.model)

    gc = GCForest(config)
    # If the model consumes too much memory, the following call tells
    # gcforest not to keep the models in memory:
    # gc.set_keep_model_in_mem(False); the default is True.

    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    # X_train, y_train = X_train[:2000], y_train[:2000]
    # np.newaxis adds a dimension (here, the channel axis)
    X_train = X_train[:, np.newaxis, :, :]
    X_test = X_test[:, np.newaxis, :, :]


    X_train_enc = gc.fit_transform(X_train, y_train)
    # X_train_enc is the concatenated result of the probabilities predicted
    # by each estimator in the last layer of the gcForest model
    # X_train_enc.shape =
    #   (n_datas, n_estimators * n_classes): if the cascade structure is used
    #   (n_datas, n_estimators * n_classes, dimX, dimY): if only multi-grained scanning is used

    # You can also pass X_test and y_test to fit_transform; the accuracy on the
    # test data will then be logged during training:
    # X_train_enc, X_test_enc = gc.fit_transform(X_train, y_train, X_test=X_test, y_test=y_test)

    # NOTE: if you set gc.set_keep_model_in_mem(False), you must use
    # gc.fit_transform(X_train, y_train, X_test=X_test, y_test=y_test)
    # to evaluate the model

    # prediction and evaluation on the test set
    y_pred = gc.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Test Accuracy of GcForest = {:.2f} %".format(acc * 100))

    # The X_enc features produced by gcForest can be used to train other
    # models, e.g. xgboost or a random forest.
    # Concatenate the original features with the encoded ones:
    X_test_enc = gc.transform(X_test)
    X_train_enc = X_train_enc.reshape((X_train_enc.shape[0], -1))
    X_test_enc = X_test_enc.reshape((X_test_enc.shape[0], -1))
    X_train_origin = X_train.reshape((X_train.shape[0], -1))
    X_test_origin = X_test.reshape((X_test.shape[0], -1))
    X_train_enc = np.hstack((X_train_origin, X_train_enc))
    X_test_enc = np.hstack((X_test_origin, X_test_enc))

    print("X_train_enc.shape={}, X_test_enc.shape={}".format(X_train_enc.shape,
	 X_test_enc.shape))

    # train a random forest
    clf = RandomForestClassifier(n_estimators=1000, max_depth=None, n_jobs=-1)
    clf.fit(X_train_enc, y_train)
    y_pred = clf.predict(X_test_enc)
    acc = accuracy_score(y_test, y_pred)
    print("Test Accuracy of Other classifier using 
	gcforest's X_encode = {:.2f} %".format(acc * 100))

    # write the model to a pickle file
    with open("test.pkl", "wb") as f:
        pickle.dump(gc, f, pickle.HIGHEST_PROTOCOL)

    # load the trained model
    with open("test.pkl", "rb") as f:
        gc = pickle.load(f)
    y_pred = gc.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Test Accuracy of GcForest (save and load) = {:.2f} %".format(acc * 100))

Note that gcForest can model not only traditional structured 2-D data, but also unstructured data such as images, sequential text, and audio; just pay attention to how the data dimensions are set:

  • If only the cascade structure is used, X_train and X_test are 2-D arrays of shape (n_samples,n_features); 3-D or 4-D arrays are automatically reshaped to 2-D, e.g. MNIST data of shape (60000,28,28) is reshaped to (60000,784), and (60000,3,28,28) to (60000,2352).

  • If multi-grained scanning is used, X_train and X_test must be 4-D arrays: for image data the shape is (n_samples,n_channels,n_height,n_width); for sequence data it is (n_samples,n_features,seq_len,1). For IMDB data, for example, n_features is 1; for audio MFCC features, n_features can be 13, 26, and so on. (A shape sketch follows this list.)
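
A short sketch of the expected input shapes for the two modes, using random placeholder arrays (the shapes follow the rules above):

import numpy as np

# Cascade only: 2-D (n_samples, n_features); higher-dimensional inputs
# are flattened automatically, e.g. (60000, 28, 28) -> (60000, 784).
X_cascade = np.random.rand(100, 784)

# Multi-grained scanning, images: 4-D (n_samples, n_channels, n_height, n_width)
X_image = np.random.rand(100, 1, 28, 28)

# Multi-grained scanning, sequences: (n_samples, n_features, seq_len, 1),
# e.g. n_features=1 for IMDB text, 13 or 26 for audio MFCC features.
X_seq = np.random.rand(100, 13, 200, 1)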

The code above can be run in two ways:

  • One way is to define the model structure in a JSON file. For a cascade forest structure, simply write a JSON file with the structure shown in the code, then run python examples/demo_mnist.py --model examples/demo_mnist-gc.json from the command line to train. If you use both multi-grained scanning and the cascade, the multi-grained scanning structure must be defined in the file as well.
  • The finished JSON can be loaded with the package's load_json() method and then passed in to initialize the model, as follows:
config = load_json(your_json_file)
gc = GCForest(config) 
  • The other way is to define the model structure directly in Python code; the structure is really just a dictionary, which is exactly what the get_toy_config() function above builds.