
CTR Learning Notes & Code Implementation 2 - Deep CTR Models: MLP -> Wide&Deep

## Background

This post starts from the basic deep CTR models. I really like the Wide&Deep framework; it feels like many later improvements can be folded into it. The Wide part is responsible for mining the frequent patterns that do appear in the samples, while the Deep part is responsible for generalizing to feature combinations that do not. Subsequent improvements either use a different interaction function (IFC) to let Deep extract feature-interaction information more effectively, or let Wide memorize sample information better.

## Embedding + MLP

The earliest deep-learning attempt at click-through-rate models was a simple MLP: embed the high-dimensional, sparse discrete features, concatenate the embeddings as the MLP input, and pass them through the non-linear transformations of several fully connected layers to obtain the click-through-rate prediction.
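As a rough sketch of the forward pass (the notation is introduced here for illustration only, not taken from a specific paper): with embeddings $e_1, \dots, e_F$ for the $F$ discrete fields,

$$
\begin{aligned}
a^{(0)} &= [e_1; e_2; \dots; e_F] && \text{concatenated field embeddings (plus any dense features)} \\
a^{(l)} &= \mathrm{ReLU}(W_l\, a^{(l-1)} + b_l), \quad l = 1, \dots, L && \text{fully connected hidden layers} \\
\hat{y} &= \mathrm{softmax}(W_o\, a^{(L)} + b_o) && \text{click / no-click prediction}
\end{aligned}
$$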

I wonder whether you have also been puzzled, like I was: what information does this Embedding+MLP actually learn? Does the embedding in an MLP learn the same feature-interaction information as the embedding in FM? I recently heard a fairly convincing view from an expert. Of course, keep skeptical, and discussion is welcome~

> **An MLP can learn low-order and high-order representations of all features, but it relies on an enormous search space. With limited samples and limited parameters it usually learns only limited information. That is why we rely on feature engineering, grounded in business understanding, to help the MLP learn more effective feature-interaction information within that limited space. FM's vector inner product is just one method of second-order feature engineering. Many of the later improvements to the Deep part likewise explore how to apply feature-engineering experience to better extract feature-interaction information.**

### Code Implementation

```python
import tensorflow as tf

# EMB_CONFIGS, BUCKET_CONFIGS and add_layer_summary are configs/helpers defined
# elsewhere in the repo (https://github.com/DSXiangLi/CTR).

def build_features(numeric_handle):
    f_sparse = []
    f_dense = []

    # hash + one-hot encode the categorical features
    for col, config in EMB_CONFIGS.items():
        ind = tf.feature_column.categorical_column_with_hash_bucket(col, hash_bucket_size=config['hash_size'])
        one_hot = tf.feature_column.indicator_column(ind)
        f_sparse.append(one_hot)

    if numeric_handle == 'bucketize':
        # Method 1 'bucketize': bucket numeric features into one-hot columns
        for col, config in BUCKET_CONFIGS.items():
            num = tf.feature_column.numeric_column(col)
            bucket = tf.feature_column.bucketized_column(num, boundaries=config)
            f_sparse.append(bucket)
    else:
        # Method 2 'dense': keep numeric features as-is and concatenate with the embeddings
        for col, config in BUCKET_CONFIGS.items():
            num = tf.feature_column.numeric_column(col)
            f_dense.append(num)

    return f_sparse, f_dense


def model_fn(features, labels, mode, params):
    sparse_columns, dense_columns = build_features(params['numeric_handle'])

    with tf.variable_scope('EmbeddingInput'):
        # one embedding matrix per sparse column: one_hot * W is the embedding lookup
        embedding_input = []
        for f_sparse in sparse_columns:
            sparse_input = tf.feature_column.input_layer(features, f_sparse)

            input_dim = sparse_input.get_shape().as_list()[-1]
            init = tf.random_normal(shape=[input_dim, params['embedding_dim']])
            weight = tf.get_variable('w_{}'.format(f_sparse.name), dtype=tf.float32, initializer=init)

            embedding_input.append(tf.matmul(sparse_input, weight))

        dense = tf.concat(embedding_input, axis=1, name='embedding_concat')

        if params['numeric_handle'] == 'dense':
            # batch-normalize the raw numeric features before concatenating with the embeddings
            numeric_input = tf.feature_column.input_layer(features, dense_columns)
            numeric_input = tf.layers.batch_normalization(numeric_input, center=True, scale=True, trainable=True,
                                                          training=(mode == tf.estimator.ModeKeys.TRAIN))
            dense = tf.concat([dense, numeric_input], axis=1, name='numeric_concat')

    with tf.variable_scope('MLP'):
        for i, unit in enumerate(params['hidden_units']):
            dense = tf.layers.dense(dense, units=unit, activation='relu', name='Dense_{}'.format(i))
            if mode == tf.estimator.ModeKeys.TRAIN:
                add_layer_summary(dense.name, dense)
                # inside the TRAIN branch, so enable dropout explicitly
                dense = tf.layers.dropout(dense, rate=params['dropout_rate'], training=True)

    with tf.variable_scope('output'):
        y = tf.layers.dense(dense, units=2, activation='relu', name='output')

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'predict_class': tf.argmax(tf.nn.softmax(y), axis=1),
            'prediction_prob': tf.nn.softmax(y)
        }
        return tf.estimator.EstimatorSpec(mode=tf.estimator.ModeKeys.PREDICT, predictions=predictions)

    cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=y))

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=params['learning_rate'])
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(cross_entropy, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=cross_entropy, train_op=train_op)
    else:
        eval_metric_ops = {
            'accuracy': tf.metrics.accuracy(labels=labels, predictions=tf.argmax(tf.nn.softmax(y), axis=1)),
            'auc': tf.metrics.auc(labels=labels, predictions=tf.nn.softmax(y)[:, 1]),
            'pr': tf.metrics.auc(labels=labels, predictions=tf.nn.softmax(y)[:, 1], curve='PR')
        }
        return tf.estimator.EstimatorSpec(mode, loss=cross_entropy, eval_metric_ops=eval_metric_ops)
```
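For completeness, here is a minimal, hedged sketch of how this `model_fn` could be wired into an Estimator and trained. `census_input_fn` and the concrete parameter values are illustrative placeholders, not the repo's actual settings:

```python
import tensorflow as tf

# census_input_fn is a hypothetical input_fn returning (features, labels);
# the real input pipeline lives in the repo.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir='./checkpoint/mlp',
    params={
        'numeric_handle': 'dense',    # or 'bucketize'
        'embedding_dim': 8,
        'hidden_units': [48, 32, 16],
        'dropout_rate': 0.1,
        'learning_rate': 0.001
    })

estimator.train(input_fn=lambda: census_input_fn('train.csv', batch_size=512), max_steps=10000)
print(estimator.evaluate(input_fn=lambda: census_input_fn('valid.csv', batch_size=512)))
```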
## Wide&Deep

Wide&Deep adds a Wide part on top of the MLP above. The authors argue that the Deep part is responsible for generalization, i.e. generalizing to and fuzzily matching patterns not seen in the samples; this is exactly the Embedding+MLP above. The Wide part is responsible for memorization, i.e. remembering patterns already present in the samples, and is a logistic regression over discrete features and feature crosses. Deep and Wide are trained jointly.

That description may not be entirely accurate. **The authors also mention in the paper that the Wide part is only icing on the cake: it helps the Deep part sharpen the discriminative power, with respect to the prediction target, of patterns that occur frequently in the samples.** So Wide does not need to be a full-size model; what it needs are the core features and crossed features identified through business judgment.
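For reference, the joint prediction in the original Wide&Deep paper combines the two parts in a single logistic output, where $\phi(\mathbf{x})$ denotes the cross-product transformations of the raw features and $a^{(l_f)}$ is the final hidden activation of the Deep part:

$$
P(Y=1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}_{wide}^{T}\,[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T}\, a^{(l_f)} + b\right)
$$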

### Handling Continuous Features

Most CTR models focus on how to handle sparse discrete features, so how should continuous features be handled? There are several options:

1. Discretize the continuous feature; afterwards it can be used for embedding/onehot/cross.
2. Leave the continuous feature untouched and concatenate it directly with the embedded vectors of the other discrete features as input. Here you should consider normalizing the continuous features, otherwise convergence will be very slow. The MLP above tried BatchNorm, while Wide&Deep does the normalization directly inside the feature_column.
3. Feed it in as a continuous feature and, at the same time, discretize it so it can interact with the other discrete features.

Pros and cons of discretizing continuous features:

Cons
1. Information loss: how much is lost depends on how well the buckets are chosen.
2. Reduced smoothness: a small change in a feature near a bucket boundary can cause a fairly large swing in the prediction.

Pros
1. Adds non-linearity: in most cases the relationship between a continuous feature and the target is not linear; rather, reaching some threshold has a 0/1-style effect on the user.
2. More robust: effectively avoids extreme-value / long-tail problems in continuous features.
3. Feature interaction: once turned into a discrete feature, it is easy to build further cross features.
4. Less hassle...: no more worrying about whether the distribution is normal or whether normalization is needed, and so on.

### Code Implementation

```python
import tensorflow as tf
from itertools import combinations

# EMB_CONFIGS, BUCKET_CONFIGS and NORM_CONFIGS are configs defined elsewhere in the repo.

def znorm(mean, std):
    # z-score normalization used as normalizer_fn inside numeric_column
    def znorm_helper(col):
        return (col - mean) / std
    return znorm_helper


def build_features():
    f_onehot = []
    f_embedding = []
    f_numeric = []

    # categorical features
    for col, config in EMB_CONFIGS.items():
        ind = tf.feature_column.categorical_column_with_hash_bucket(col, hash_bucket_size=config['hash_size'])
        f_onehot.append(tf.feature_column.indicator_column(ind))
        f_embedding.append(tf.feature_column.embedding_column(ind, dimension=config['emb_size']))

    # numeric features: used both as normalized numeric columns and bucketized into discrete columns
    for col, config in BUCKET_CONFIGS.items():
        num = tf.feature_column.numeric_column(col,
                                               normalizer_fn=znorm(NORM_CONFIGS[col]['mean'], NORM_CONFIGS[col]['std']))
        f_numeric.append(num)
        bucket = tf.feature_column.bucketized_column(num, boundaries=config)
        f_onehot.append(bucket)

    # crossed features
    for col1, col2 in combinations(f_onehot, 2):
        # if col is an indicator of a hashed bucket, use the raw feature name directly
        if col1.parents[0].name in EMB_CONFIGS.keys():
            col1 = col1.parents[0].name
        if col2.parents[0].name in EMB_CONFIGS.keys():
            col2 = col2.parents[0].name
        crossed = tf.feature_column.crossed_column([col1, col2], hash_bucket_size=20)
        f_onehot.append(tf.feature_column.indicator_column(crossed))

    f_dense = f_embedding + f_numeric
    # f_dense = f_embedding + f_numeric + f_onehot
    f_sparse = f_onehot
    # f_sparse = f_onehot + f_numeric

    return f_sparse, f_dense


def build_estimator(model_dir):
    sparse_feature, dense_feature = build_features()

    run_config = tf.estimator.RunConfig(
        save_summary_steps=50,
        log_step_count_steps=50,
        keep_checkpoint_max=3,
        save_checkpoints_steps=50
    )

    dnn_optimizer = tf.train.ProximalAdagradOptimizer(
        learning_rate=0.001,
        l1_regularization_strength=0.001,
        l2_regularization_strength=0.001
    )

    estimator = tf.estimator.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=sparse_feature,   # wide part: one-hot, bucketized and crossed columns
        dnn_feature_columns=dense_feature,       # deep part: embeddings and normalized numeric columns
        dnn_optimizer=dnn_optimizer,
        dnn_dropout=0.1,
        batch_norm=False,
        dnn_hidden_units=[48, 32, 16],
        config=run_config
    )

    return estimator
```

The full code is here: https://github.com/DSXiangLi/CTR

CTR Learning Notes & Code Implementation series
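As a closing sketch, this is one hedged way the Wide&Deep estimator built above might be trained and evaluated; `wide_deep_input_fn` and the step counts are hypothetical placeholders, the real input pipeline lives in the repo:

```python
import tensorflow as tf

# wide_deep_input_fn is a placeholder input_fn returning (features, labels)
estimator = build_estimator(model_dir='./checkpoint/wide_deep')

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: wide_deep_input_fn('train.csv', shuffle=True, batch_size=512),
    max_steps=20000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: wide_deep_input_fn('valid.csv', shuffle=False, batch_size=512),
    steps=None, throttle_secs=60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```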