Code Framework for Traditional Machine Learning & Data Mining Competitions

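The code for a traditional data mining competition follows this framework:

1. Import libraries
2. Read the data files
3. Define the feature-construction functions
4. Call the functions to build the features
5. Split the dataset into features and labels
6. Cross-validate the model
7. Train the model and predict
8. Write the result file

To try a new feature in the hope of improving the score, you only need to add code to parts 3 and 4 of the framework.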
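
As an illustration of parts 3 and 4, the sketch below shows what one feature-construction function might look like. The function name `get_alter_feature` and the `EID` key appear in the framework, but the body, the `ALTERNO` column, and the toy data are assumptions made only for this example, not the original implementation.

```python
import pandas as pd


def get_alter_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a per-record table into one row of features per EID.

    ALTERNO is a hypothetical categorical column used only for illustration.
    """
    feat = df.groupby('EID').agg(
        alter_count=('ALTERNO', 'size'),            # number of records per entity
        alter_type_nunique=('ALTERNO', 'nunique'),  # distinct categories per entity
    )
    # Turn EID back into a regular column so the result can be merged on it
    return feat.reset_index()


# Toy input: three records for entity 1 and one record for entity 2
alter = pd.DataFrame({'EID': [1, 1, 1, 2], 'ALTERNO': ['A', 'B', 'A', 'C']})
alter_feat = get_alter_feature(alter)
```

Because every such function returns one row per `EID`, adding a feature amounts to writing one more function (part 3) and one more call (part 4); the merge in part 5 can then pick up the new table with one additional `pd.merge`.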

# coding:utf-8


# 1. Import libraries
import numpy as np
import pandas as pd
...

# 2. Read the data files
train = pd.read_csv('../data/input/train.csv')
test = pd.read_csv('../data/input/evaluation_public.csv')
...

# 3. Define the feature-construction functions
def get_entbase_feature(df):
	...
def get_alter_feature(df):
	...
...

# 4. Call the functions to build the features
entbase_feat = get_entbase_feature(entbase)
alter_feat = get_alter_feature(alter)
...

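# 5. Split the dataset into features and labels
# Join the per-table features into one table keyed by EID
dataset = pd.merge(entbase_feat, alter_feat, on='EID', how='left')
...
# Attach the features to the training and test rows
trainset = pd.merge(train, dataset, on='EID', how='left')
testset = pd.merge(test, dataset, on='EID', how='left')
# Drop the label (TARGET) and ENDDATE from the training features
train_feature = trainset.drop(['TARGET', 'ENDDATE'], axis=1)
train_label = trainset.TARGET.values
test_feature = testset
test_index = testset.EID.values  # ids used when writing the result file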

# 6. Cross-validate the model
...
iterations, best_score = xgb_cv(train_feature, train_label, params, config['folds'], config['rounds'])
...

# 7. Train the model and predict
...
model, pred = xgb_predict(train_feature, train_label, test_feature, iterations, params)
...

# 8. Write the result file
res = store_result(test_index, pred, 0.18, '1207-xgb-%f(r%d)' % (best_score, iterations))