
Wide & Deep Learning for Recommender Systems: Model Implementation

All of the code for this post has been uploaded to GitHub; follows and stars are welcome!

1. Dataset

The dataset is shown below. The last column is the label: a binary classification task predicting whether income exceeds $50K.

2. Wide Linear Model

There are two cases when handling categorical features:

  • All distinct values are known, and there are few of them: tf.feature_column.categorical_column_with_vocabulary_list
  • Not all values are known, or there are very many of them: tf.feature_column.categorical_column_with_hash_bucket
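Conceptually, a hash-bucket column just maps any string to a fixed range of integer ids. A stand-in sketch of that idea (TensorFlow actually uses its own fingerprint hash, not CRC32, and `hash_bucket` is a hypothetical helper name):

```python
import zlib

def hash_bucket(value, hash_bucket_size=1000):
    """Deterministically map a string to a bucket id in [0, hash_bucket_size)."""
    # Python's built-in hash() is randomized per process, so use a
    # stable stdlib hash instead.
    return zlib.crc32(value.encode('utf-8')) % hash_bucket_size

# Unseen values never fail; they simply land in some bucket. Distinct
# values may collide, which the model tolerates.
ids = {v: hash_bucket(v) for v in ['Tech-support', 'Craft-repair', 'Sales']}
```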

## 3.1 Base Categorical Feature Columns
# If we know all the values and there are not many of them
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative'
    ]
)

# If we do not know how many distinct values there are
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

Raw continuous features: tf.feature_column.numeric_column

# 3.2 Base Continuous Feature Columns
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

Discretizing continuous features: tf.feature_column.bucketized_column

# 3.2.1 Discretizing continuous features
# We do this because the relationship between a continuous feature and the label
# is not always linear. It may start out positive and later turn negative; such a
# piecewise relationship is no longer linear overall.
# bucketization
# 10 boundaries -> 11 buckets

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
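The bucket lookup itself can be sketched in pure Python. With TF's left-inclusive boundaries, `bisect_right` reproduces the same bucket ids (a conceptual sketch, not TF's implementation):

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize(value, boundaries):
    """Return the bucket index for a value: values below the first boundary
    map to bucket 0, values at or above the last boundary to bucket 10."""
    return bisect.bisect_right(boundaries, value)
```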

Combined / crossed features: tf.feature_column.crossed_column

# 3.3 Combined / crossed features
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000
)
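A crossed column conceptually hashes the combination of values into a single bucket id, so each (education, occupation) pair gets its own id distinct from either value alone. A stand-in sketch (TF uses its own fingerprint hash; the joining scheme below is made up for illustration):

```python
import zlib

def crossed_bucket(values, hash_bucket_size=1000):
    """Hash a tuple of feature values into one bucket id."""
    key = '_X_'.join(values)  # hypothetical joining scheme
    return zlib.crc32(key.encode('utf-8')) % hash_bucket_size

a = crossed_bucket(('Bachelors', 'Exec-managerial'))
b = crossed_bucket(('Bachelors', 'Craft-repair'))
```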

Assembling the model: here we mainly use categorical features + crossed features.

# 4. The Model
"""
The feature columns defined so far:
1. CategoricalColumn
2. NumericColumn
3. BucketizedColumn
4. CrossedColumn
All of these are subclasses of FeatureColumn, so they can be used together.
"""
base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000
    ),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000
    )
]

model_dir = "./model/wide_component"
model = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns
)

Training & Evaluation

# 5. Train & Evaluate & Predict
model.train(input_fn=lambda: input_fn(data_file=train_file, num_epochs=1, shuffle=True, batch_size=512))
results = model.evaluate(input_fn=lambda: input_fn(val_file, 1, False, 512))
for key in sorted(results):
    print("{0:20}: {1:.4f}".format(key, results[key]))

Results

Parsing ./data/adult.data
2018-12-21 15:39:37.182512: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Parsing ./data/adult.data
accuracy            : 0.8436
accuracy_baseline   : 0.7592
auc                 : 0.8944
auc_precision_recall: 0.7239
average_loss        : 0.3395
global_step         : 256.0000
label/mean          : 0.2408
loss                : 172.7150
prediction/mean     : 0.2416
Parsing ./data/adult.test

3. Wide & Deep Model

Features used by the Deep part: raw continuous features + embeddings of the categorical features.

Building on the Wide model, we add the Deep part:
the categorical features are embedded and then concatenated with the continuous features.
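That concatenation step can be sketched in plain Python (the bucket id and feature values below are hypothetical; in the real model the estimator performs this lookup and the embedding table is trained, not fixed):

```python
import random

random.seed(0)
num_buckets, dim = 1000, 8

# Randomly initialized embedding table: one dim-length vector per bucket
embedding_table = [[random.uniform(-0.1, 0.1) for _ in range(dim)]
                   for _ in range(num_buckets)]

occupation_id = 417                           # hypothetical hashed bucket id
continuous = [39.0, 13.0, 2174.0, 0.0, 40.0]  # age, education_num, gains, ...

# The deep input is the embedding vector concatenated with the raw
# continuous features: length 8 + 5 = 13.
deep_input = embedding_table[occupation_id] + continuous
```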

# 3. The Deep Model: Neural Network with Embeddings
"""
1. Sparse features -> embedding vectors -> concatenate(embedding vectors, continuous features) -> feed into hidden layers
2. Embedding values are randomly initialized
3. Another way to handle categorical features is a one-hot or multi-hot
   representation, but that only suits low-dimensional features; embeddings
   are the more general approach.
4. embedding_column (embedding); indicator_column (multi-hot)
"""

deep_columns = [
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,

    # One-hot encode the categorical columns that have few categories
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(marital_status),
    tf.feature_column.indicator_column(relationship),

    # Shown here as an embedding example; in practice the embedding
    # dimension is often set by the rule of thumb: categories ** 0.25
    tf.feature_column.embedding_column(occupation, dimension=8)
]
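The rule of thumb mentioned in the comment above (embedding dimension roughly categories ** 0.25, i.e. the fourth root of the number of distinct categories) can be computed like this; rounding up is one common convention, not a fixed rule:

```python
import math

def embedding_dim(num_categories):
    """Rule-of-thumb embedding size: fourth root of the category count,
    rounded up."""
    return math.ceil(num_categories ** 0.25)

# e.g. a hash bucket of 1000 occupations suggests a dimension of about 6,
# close to the dimension=8 used above.
dim_for_occupation = embedding_dim(1000)
```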

Combining Wide & Deep: DNNLinearCombinedClassifier

# 4. Combine Wide & Deep
model = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=base_columns + crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50]
)

Training & Evaluation

for n in range(train_epochs // epochs_per_eval):
    model.train(input_fn=lambda: input_fn(train_file, epochs_per_eval, True, batch_size))
    results = model.evaluate(input_fn=lambda: input_fn(
        test_file, 1, False, batch_size
    ))

    # Display Eval results
    print("Results at epoch {0}".format((n+1) * epochs_per_eval))
    print('-'*30)

    for key in sorted(results):
        print("{0:20}: {1:.4f}".format(key, results[key]))

Results

Parsing ./data/adult.data
2018-12-21 15:35:49.183730: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Parsing ./data/adult.test
Results at epoch 2
------------------------------
accuracy            : 0.8439
accuracy_baseline   : 0.7638
auc                 : 0.8916
auc_precision_recall: 0.7433
average_loss        : 0.3431
global_step         : 6516.0000
label/mean          : 0.2362
loss                : 13.6899
prediction/mean     : 0.2274
Parsing ./data/adult.data
Parsing ./data/adult.test
Results at epoch 4
------------------------------
accuracy            : 0.8529
accuracy_baseline   : 0.7638
auc                 : 0.8970
auc_precision_recall: 0.7583
average_loss        : 0.3335
global_step         : 8145.0000
label/mean          : 0.2362
loss                : 13.3099
prediction/mean     : 0.2345
Parsing ./data/adult.data
Parsing ./data/adult.test
Results at epoch 6
------------------------------
accuracy            : 0.8540
accuracy_baseline   : 0.7638
auc                 : 0.8994
auc_precision_recall: 0.7623
average_loss        : 0.3297
global_step         : 9774.0000
label/mean          : 0.2362
loss                : 13.1567
prediction/mean     : 0.2398

Process finished with exit code 0

References

  1. Wide & Deep Learning for Recommender Systems (the original paper)
  2. Google AI Blog, Wide & Deep Learning: Better Together with TensorFlow: https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html
  3. TensorFlow Linear Model Tutorial: https://www.tensorflow.org/tutorials/wide
  4. TensorFlow Wide & Deep Learning Tutorial: https://www.tensorflow.org/tutorials/wide_and_deep
  5. Introduction to TensorFlow Datasets and Estimators: http://developers.googleblog.cn/2017/09/tensorflow.html
  6. absl: https://github.com/abseil/abseil-py/blob/master/smoke_tests/sample_app.py
  7. Wide & Deep: Theory and Practice
  8. Wide & Deep Learning for Recommender Systems: Paper Reading Notes