
Kaggle: Grupo Bimbo Inventory Demand

In this competition we need to predict the weekly demand for a product at a given sales outlet. The data covers 9 weeks of sales in Mexico. Each week, delivery trucks ship products to the outlets; every transaction records a sales quantity and a returns quantity, where returns consist mainly of unsold and expired product. A product's demand is defined as this week's sales minus next week's returns.

A few things to note:

  1. The test data may contain products that never appear in the training data. This is very common in real life, so the model must cope with it well.

  2. The same client ID may appear under different client names, because client names are not normalized.

  3. Demand is always greater than or equal to zero; Venta_uni_hoy - Dev_uni_proxima is sometimes negative because returns can accumulate over several weeks (see the sketch below).
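To make the target definition concrete, here is a minimal sketch with made-up numbers (my own, not competition data); the clamp at zero mirrors note 3:

import pandas as pd

# Toy rows illustrating the adjusted-demand target (values are made up).
raw = pd.DataFrame({
    'Venta_uni_hoy':   [5, 0, 3],  # units sold this week
    'Dev_uni_proxima': [1, 2, 4],  # units returned next week
})
# Sales minus returns, clamped at zero: returns can exceed sales when they
# accumulate over several weeks, but demand itself is never negative.
raw['Demanda_uni_equil'] = (raw['Venta_uni_hoy'] - raw['Dev_uni_proxima']).clip(lower=0)
print(raw)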

Semana — Week number (from Thursday to Wednesday)
Agencia_ID — Sales depot ID
Canal_ID — Sales channel ID
Ruta_SAK — Route ID (several routes = one sales depot)
Cliente_ID — Client ID
NombreCliente — Client name
Producto_ID — Product ID
NombreProducto — Product name
Venta_uni_hoy — Sales units this week (integer)
Venta_hoy — Sales this week (unit: pesos)
Dev_uni_proxima — Return units next week (integer)
Dev_proxima — Returns next week (unit: pesos)
Demanda_uni_equil — Adjusted demand (integer); this is the target to predict

Sales forecasting is a genuinely useful problem with many practical applications, such as predicting courier shipment volumes or planning transportation. At first glance this looks like a time-series forecasting problem, but there are far too many combinations of depot, client and product, and too little history, to support per-series forecasting, so we treat it as a regression problem instead. Unfortunately, the data is huge: the uncompressed training set has 70M+ rows, more than my poor laptop can handle for a machine learning job. On Kaggle I found a solution based on XGBoost; below is a brief walkthrough of the idea and the code, with comments added to the key steps.
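Incidentally, if you do want to load all 70M+ rows on a small machine, one standard trick is to give pandas explicit narrow dtypes. A minimal sketch (the dtype choices are my assumption based on the value ranges in the data dictionary, not part of the original script):

import pandas as pd

# Assumed narrow dtypes: week numbers fit in int8, the IDs and target in int32.
train_dtypes = {
    'Semana': 'int8',
    'Agencia_ID': 'int32',
    'Ruta_SAK': 'int32',
    'Cliente_ID': 'int32',
    'Producto_ID': 'int32',
    'Demanda_uni_equil': 'int32',
}
train = pd.read_csv('train.csv', usecols=list(train_dtypes), dtype=train_dtypes)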

The basic idea of the feature engineering:

compute statistics over week 3 to predict week 4's demand, statistics over week 4 to predict week 5's demand, and so on.
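To see what that means mechanically, here is a toy version (made-up numbers) of the shift-then-merge trick that create_lagged_feats in the script below implements:

import pandas as pd

# One (client, product) pair observed over three weeks.
hist = pd.DataFrame({
    'Semana':      [3, 4, 5],
    'Cliente_ID':  [1, 1, 1],
    'Producto_ID': [7, 7, 7],
    'target':      [10, 12, 8],
})

# Shift the week forward by one so that week w's demand becomes a feature
# ('target_l1') available when predicting week w + 1.
lag = hist.copy()
lag['Semana'] += 1
lag = lag.rename(columns={'target': 'target_l1'})
lag = lag.groupby(['Semana', 'Cliente_ID', 'Producto_ID'], as_index=False).mean()

# Week 4 now carries week 3's demand (10) as target_l1, week 5 carries 12, etc.
print(pd.merge(hist, lag, how='left', on=['Semana', 'Cliente_ID', 'Producto_ID']))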

# A Python implementation of the awesome script by Bohdan Pavlyshenko: http://tinyurl.com/jd6k2kr
# and with inspiration from Rodolfo Lomascolo's  http://tinyurl.com/z6qmxfk
#
# Author: willgraf
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18

## -------------------------- Global variables ---------------------------------
#
LAG_WEEK_VAL = 3 # weeks <= this value serve only as history for lagged features
BIG_LAG = True # set to True to use more than 1 lagged feature
## -------------------------- Functions ----------------------------------------

def get_dataframe():
    '''
    Load the training and test sets. For the training set the target is the
    'Demanda_uni_equil' column; the test set's target is set to 0 for now.
    The global LAG_WEEK_VAL controls which weeks are kept: only rows with week
    greater than LAG_WEEK_VAL are used for training (the training data covers
    weeks 3-9; the goal is to predict weeks 10 and 11).
    '''
    print('Loading training data')
    train = pd.read_csv('train.csv',
                        usecols=['Semana','Agencia_ID','Ruta_SAK','Cliente_ID','Producto_ID','Demanda_uni_equil'])

    print('Loading test data')
    test = pd.read_csv('test.csv',
                       usecols=['Semana','Agencia_ID','Ruta_SAK','Cliente_ID','Producto_ID','id'])

    print('Merging train & test data')
    # drop the earliest weeks; they only serve as history for the lagged features
    train = train.loc[train['Semana'] > LAG_WEEK_VAL, :]
    train['id'] = 0
    test['Demanda_uni_equil'] = None  # a null target marks test rows (see the mask in build_model)
    train['target'] = train['Demanda_uni_equil']
    test['target'] = 0
    # concatenate the training and test sets
    return pd.concat([train, test])

def create_lagged_feats(data, num_lag):
    '''
    Build lagged demand features: statistics from weeks 3-8 become features
    for predicting weeks 9, 10 and 11. For example, week 3's average demand
    per (client, product) becomes a predictor of week 4's demand.
    Feature columns are named: target_lNUMBER
    '''
    # the lagged feature is keyed on these three columns
    keys = ['Semana', 'Cliente_ID', 'Producto_ID']
    lag_feat = 'target_l' + str(num_lag)
    print('Creating lagged feature: %s' % lag_feat)
    data1 = data.loc[:, ['Cliente_ID', 'Producto_ID']]
    # shift the week number forward by num_lag, so that week w is joined
    # against week w - num_lag's demand below
    data1['Semana'] = data.loc[:, 'Semana'].apply(lambda x: x + num_lag)
    # copy the target column under the name target_lNUMBER so it can be averaged
    data1[lag_feat] = data.loc[:, 'target']
    # group by 'Semana', 'Cliente_ID', 'Producto_ID' and average the target
    data1 = data1.groupby(keys).mean().reset_index()
    # JOIN: attach the averaged statistic back onto data
    return pd.merge(data, data1, how='left', on=keys, left_index=False, right_index=False,
                    suffixes=('', '_lag' + str(num_lag)), copy=False)

def create_freq_feats(data, column_name):
    '''
    Build a frequency feature for a column: count the records per week for
    each value of the column, then average those counts across weeks.
    '''
    freq_feat = column_name + '_freq'
    print('Creating frequency feature: %s' % freq_feat)
    # count records per (column value, week)
    freq_frame = data.groupby([column_name, 'Semana'])['target'].count().reset_index()
    freq_frame.rename(columns={'target': freq_feat}, inplace=True)
    # average the weekly counts
    freq_frame = freq_frame.groupby([column_name])[freq_feat].mean().reset_index()

    # join the averages back onto data
    return pd.merge(data, freq_frame, how='left', on=[column_name], left_index=False,
                    right_index=False, suffixes=('', '_freq'), copy=False)

def build_model(df, features, model_params):
    '''
    Train the model and predict on the test weeks.
      df: dataframe to be modeled - both training and test rows
      features: column names of df that should be used in the model
      model_params: {xgb_param_key: 'xgb_param_value'} - XGBoost parameters
    '''
    # test rows are the ones whose target was set to null in get_dataframe
    mask = df['Demanda_uni_equil'].isnull()
    test = df[mask]
    train = df[~mask].copy()
    # train on log(1 + demand): RMSE in this space is the competition's RMSLE
    train.loc[:, 'target'] = train.loc[:, 'target'].apply(lambda x: np.log(x + 1))

    xlf = xgb.XGBRegressor(**model_params)

    x_train, x_test, y_train, y_test = train_test_split(train[features], train['target'], test_size=0.01, random_state=1)

    xlf.fit(x_train, y_train, eval_metric='rmse', verbose=1, eval_set=[(x_test, y_test)], early_stopping_rounds=100)
    preds = xlf.predict(x_test)
    print('RMSE of log(Demanda_uni_equil[x_test]): %s' % str(mean_squared_error(y_test, preds) ** 0.5))

    # predict week 10's demand
    print('Predicting Demanda_uni_equil for Semana 10')
    data_test_10 = test.loc[test['Semana'] == 10, features]
    preds = xlf.predict(data_test_10)
    data_test_10['Demanda_uni_equil'] = np.exp(preds) - 1
    data_test_10['id'] = test.loc[test['Semana'] == 10, 'id'].astype(int).tolist()

    # feed week 10's predicted demand back in as a lag feature for week 11
    print('Creating lagged demand feature for Semana 11')
    data_test_lag = data_test_10[['Cliente_ID', 'Producto_ID']].copy()
    data_test_lag['target_l1'] = data_test_10['Demanda_uni_equil']
    data_test_lag = data_test_lag.groupby(['Cliente_ID', 'Producto_ID']).mean().reset_index()

    # predict week 11's demand
    print('Predicting Demanda_uni_equil for Semana 11')
    data_test_11 = test.loc[test['Semana'] == 11, features.difference(['target_l1'])]
    data_test_11 = pd.merge(data_test_11, data_test_lag, how='left', on=['Cliente_ID', 'Producto_ID'],
                            left_index=False, right_index=False, sort=True, copy=False)

    data_test_11 = data_test_11.loc[:, features]
    preds = xlf.predict(data_test_11)
    data_test_11['Demanda_uni_equil'] = np.exp(preds) - 1
    data_test_11['id'] = test.loc[test['Semana'] == 11, 'id'].astype(int).tolist()

    return pd.concat([data_test_10.loc[: , ['id', 'Demanda_uni_equil']],
                      data_test_11.loc[: , ['id', 'Demanda_uni_equil']]],
                      axis=0, copy=True)

## -------------------------- Run ----------------------------------------------

## load train & test and do the basic preprocessing
df = get_dataframe()

## build the lagged demand features
for i in range(1, 6):
    if not BIG_LAG and i > 1:
        break
    df = create_lagged_feats(df, num_lag=i)

# keep weeks 9-11: week 9 trains the model, weeks 10-11 are the test weeks
df = df[df['Semana'] > 8]

## build frequency features for the ID columns
for col_name in ['Agencia_ID', 'Ruta_SAK', 'Cliente_ID', 'Producto_ID']:
    df = create_freq_feats(df, col_name)

## select the feature columns for the model
feature_names = df.columns.difference(['id', 'target', 'Demanda_uni_equil'])
print('Data Cleaned.  Using features: %s' % feature_names)

## model parameters
xgb_params = {
    'max_depth': 10,
    'learning_rate': 0.01,
    'n_estimators': 75,
    'subsample': 0.85,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',  # renamed 'reg:squarederror' in newer XGBoost
    'silent': True,
    'nthread': -1,
    'gamma': 0,
    'min_child_weight': 1,
    'max_delta_step': 0,
    'colsample_bylevel': 1,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'scale_pos_weight': 1,
    'missing': None,
    'seed': 1
}

## train the model and write out the submission
submission = build_model(df, feature_names, xgb_params)
submission.to_csv('submission.csv', index=False)
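A closing note on the log transform in build_model: the competition is scored with RMSLE, and RMSE computed on log(1 + x) values is exactly RMSLE on the original values, so eval_metric='rmse' on the transformed target optimizes the right metric directly. A quick numeric check (my own, not from the script):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 10.0, 0.0])
y_pred = np.array([2.5, 12.0, 1.0])

# RMSLE in the original space ...
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
# ... equals plain RMSE in log1p space, which is what the script trains on.
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))
assert np.isclose(rmsle, rmse_log)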