Summary of Recommender System Algorithms (3): FM, DNN, and DeepFM
Source: https://blog.csdn.net/qq_23269761/article/details/81366939 — if anything here is inappropriate, please feel free to get in touch. Thanks~
0. Enthusiastically recommending a blog
1. How FM relates to DNNs and embeddings
First, a quick review of FM.
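For reference, the second-order FM model being reviewed here is

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

where each feature $x_i$ is paired with a latent vector $v_i \in \mathbb{R}^k$.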
After solving the FM model, every feature x_i gets a corresponding latent vector v_i. What exactly is this v_i?
Think of word2vec, proposed by Google. word2vec is one kind of word embedding method. Word embedding means: given a document, i.e. a sequence of words such as "A B A C B F G", we want a corresponding vector (usually low-dimensional) for each distinct word in it. For the sequence "A B A C B F G", for example, we might end up with A mapped to the vector [0.1 0.6 -0.5] and B mapped to [-0.2 0.9 0.7].
So the conclusion is:
FM is a tool for both feature combination and dimensionality reduction: it takes the sparse features produced by one-hot encoding, combines them pairwise, and reduces the dimensionality at the same time!! Reduced to how many dimensions? To k, the number of latent factors in FM.
2. FNN
FNN uses FM as pre-training to obtain the embeddings, then trains a DNN on top of them.
Such a model captures high-order features, but at the final sigmoid output it ignores the low-order features themselves.
3. DeepFM
Given the above, many recent deep-learning-based CTR models consider both the wide and deep sides (i.e. low-order and high-order features) at the same time to further improve generalization; DeepFM is one of them.
Reference blog: https://blog.csdn.net/zynash2/article/details/79348540
As you can see, the whole model splits roughly into two parts: FM and DNN. Briefly, the flow is as follows: borrowing the idea of FNN, FM is used for the embedding, and the wide and deep parts then share the embedded result. The input to the DNN is exactly the same as in FNN (except that no pre-training is used here; the embedding layer is treated directly as one layer of the NN), and after being combined in a certain way, the wide side of the model exactly reproduces the effect of FM (the paper does not derive this in detail; the derivation is given later in this post). Finally the DNN and FM outputs are combined and passed through an activation to produce the output.
What deserves special emphasis is the FM part of the model: exactly how the network is built to compute the second-order features.
**Key point:** to the DNN, the embedding layer is extracting features; to FM, it is precisely its second-order part!!!! FM and the DNN simply share the embedding layer.
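In the DeepFM paper this shared-embedding design is summarized by combining the two parts' outputs: $\hat{y} = \operatorname{sigmoid}(y_{FM} + y_{DNN})$.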
4. DeepFM code walkthrough
Code link:
https://github.com/ChenglongChen/tensorflow-DeepFM
Data download:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
4.0 Project layout
data: stores the training and test data
output/fig: holds the output results and the training curves
config: parameter settings for data loading and feature engineering
DataReader: feature engineering; builds the feature set actually used for training
main: the entry point of the program
metrics: defines the normalized Gini coefficient (gini_norm) as the evaluation metric
DeepFM: the model definition
4.1 Overall flow
Here is a recommended EDA of this dataset; reading it gives a good feel for the data as a whole:
https://blog.csdn.net/qq_37195507/article/details/78553581
- 1. `_load_data()`

```python
# from main.py; assumes: import numpy as np, import pandas as pd, import config
def _load_data():
    dfTrain = pd.read_csv(config.TRAIN_FILE)
    dfTest = pd.read_csv(config.TEST_FILE)

    def preprocess(df):
        cols = [c for c in df.columns if c not in ["id", "target"]]
        # count of missing entries (encoded as -1) in each row
        df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
        # interaction feature: product of two existing features
        df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
        return df

    dfTrain = preprocess(dfTrain)
    dfTest = preprocess(dfTest)

    cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
    cols = [c for c in cols if (not c in config.IGNORE_COLS)]

    X_train = dfTrain[cols].values
    y_train = dfTrain["target"].values
    X_test = dfTest[cols].values
    ids_test = dfTest["id"].values
    cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

    return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices
```
First, the raw data files TRAIN_FILE and TEST_FILE are read.
preprocess(df) adds two features: missing_feat [the number of missing values in the row] and ps_car_13_x_ps_reg_03 [the product of two existing features].
It returns:
dfTrain, dfTest: DataFrames that still contain all the features
X_train, X_test: ndarrays with the IGNORE_COLS dropped [note that X_test is never actually used later]
y_train: the labels
ids_test: the ids of the test set, as an ndarray
cat_features_indices: the column indices of the categorical features
- Use X_train and y_train to split the dataset with stratified K-fold cross-validation (see the sketch below)
- Set the DeepFM parameters (also sketched below)
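A minimal sketch of these two steps, assuming sklearn's StratifiedKFold and the NUM_SPLITS / RANDOM_SEED constants from config; the dfm_params values shown are illustrative and abridged (see main.py in the repo for the full set):

```python
from sklearn.model_selection import StratifiedKFold
import tensorflow as tf

# stratified K-fold split over the training data
folds = list(StratifiedKFold(n_splits=config.NUM_SPLITS, shuffle=True,
                             random_state=config.RANDOM_SEED).split(X_train, y_train))

# abridged DeepFM hyper-parameters (illustrative values)
dfm_params = {
    "use_fm": True, "use_deep": True,   # DeepFM = FM part + Deep part
    "embedding_size": 8,                # k, the latent-factor dimension
    "deep_layers": [32, 32],
    "dropout_fm": [1.0, 1.0],
    "dropout_deep": [0.5, 0.5, 0.5],
    "deep_layers_activation": tf.nn.relu,
    "epoch": 30, "batch_size": 1024,
    "learning_rate": 0.001, "optimizer_type": "adam",
}
```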
- 2. `_run_base_model_dfm`

```python
def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
    fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                           numeric_cols=config.NUMERIC_COLS,
                           ignore_cols=config.IGNORE_COLS)
    data_parser = DataParser(feat_dict=fd)
    Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
    Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

    dfm_params["feature_size"] = fd.feat_dim
    dfm_params["field_size"] = len(Xi_train[0])

    # out-of-fold predictions on train; test predictions are averaged over the folds
    y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)
    y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)
    _get = lambda x, l: [x[i] for i in l]
    gini_results_cv = np.zeros(len(folds), dtype=float)
    gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)

    for i, (train_idx, valid_idx) in enumerate(folds):
        Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

        dfm = DeepFM(**dfm_params)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

        gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
        gini_results_epoch_train[i] = dfm.train_result
        gini_results_epoch_valid[i] = dfm.valid_result

    y_test_meta /= float(len(folds))

    # save result
    if dfm_params["use_fm"] and dfm_params["use_deep"]:
        clf_str = "DeepFM"
    elif dfm_params["use_fm"]:
        clf_str = "FM"
    elif dfm_params["use_deep"]:
        clf_str = "DNN"
    print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
    filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
    _make_submission(ids_test, y_test_meta, filename)

    _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

    return y_train_meta, y_test_meta
```
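main.py then drives these two functions roughly as follows (a sketch; the folds come from the stratified split shown earlier):

```python
# load data, then run cross-validated DeepFM (toggle use_fm/use_deep for FM- or DNN-only)
dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices = _load_data()
y_train_dfm, y_test_dfm = _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params)
```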
The data first passes through FeatureDictionary in DataReader. This object has a self.feat_dict attribute that looks like this (excerpt):

```python
{'missing_feat': 0, 'ps_ind_18_bin': {0: 254, 1: 255}, 'ps_reg_01': 256, 'ps_reg_02': 257, 'ps_reg_03': 258}
```

Each numeric feature is assigned a single global index, while each categorical feature maps every one of its values to its own index.
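A minimal sketch of how such a dictionary can be built (paraphrasing FeatureDictionary.gen_feat_dict in DataReader.py; the details in the repo may differ slightly):

```python
import pandas as pd

def gen_feat_dict(dfTrain, dfTest, numeric_cols, ignore_cols):
    df = pd.concat([dfTrain, dfTest])
    feat_dict, tc = {}, 0                # tc: running global feature index
    for col in df.columns:
        if col in ignore_cols:
            continue
        if col in numeric_cols:
            feat_dict[col] = tc          # one index for the whole numeric column
            tc += 1
        else:
            us = df[col].unique()        # one index per category value
            feat_dict[col] = dict(zip(us, range(tc, len(us) + tc)))
            tc += len(us)
    return feat_dict, tc                 # tc == feat_dim == total feature_size
```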
Next comes DataParser in DataReader:

```python
class DataParser(object):
    def __init__(self, feat_dict):
        self.feat_dict = feat_dict  # a FeatureDictionary instance

    def parse(self, infile=None, df=None, has_label=False):
        assert not ((infile is None) and (df is None)), "infile or df at least one is set"
        assert not ((infile is not None) and (df is not None)), "only one can be set"
        if infile is None:
            dfi = df.copy()
        else:
            dfi = pd.read_csv(infile)
        if has_label:
            y = dfi["target"].values.tolist()
            dfi.drop(["id", "target"], axis=1, inplace=True)
        else:
            ids = dfi["id"].values.tolist()
            dfi.drop(["id"], axis=1, inplace=True)
        # dfi for feature index
        # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
        dfv = dfi.copy()
        for col in dfi.columns:
            if col in self.feat_dict.ignore_cols:
                dfi.drop(col, axis=1, inplace=True)
                dfv.drop(col, axis=1, inplace=True)
                continue
            if col in self.feat_dict.numeric_cols:
                dfi[col] = self.feat_dict.feat_dict[col]
            else:
                dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                dfv[col] = 1.
        # dfi.to_csv('dfi.csv')
        # dfv.to_csv('dfv.csv')

        # list of list of feature indices of each sample in the dataset
        Xi = dfi.values.tolist()
        # list of list of feature values of each sample in the dataset
        Xv = dfv.values.tolist()
        if has_label:
            return Xi, Xv, y
        else:
            return Xi, Xv, ids
```
Here Xi and Xv are both 2-D arrays. You can save dfi and dfv to csv files to see what they look like; they look rather odd at first [presumably because this is the format the model needs later]:
dfi: the values are feature indices, i.e. the values stored in the feat_dict attribute shown above
dfv: numeric variables keep their original value, while categorical variables get the value 1
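A tiny worked example under the feat_dict excerpt above: a sample with ps_ind_18_bin = 1 and ps_reg_01 = 0.5 contributes index 255 (the index of that category value) with value 1.0, and index 256 (the index of the numeric column) with value 0.5, i.e. Xi = [..., 255, 256, ...] and Xv = [..., 1.0, 0.5, ...].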
4.2 Model architecture
```python
def _init_graph(self):
    self.graph = tf.Graph()
    with self.graph.as_default():
        tf.set_random_seed(self.random_seed)

        self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                         name="feat_index")  # None * F
        self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                         name="feat_value")  # None * F
        self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
        self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
        self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
        self.train_phase = tf.placeholder(tf.bool, name="train_phase")

        self.weights = self._initialize_weights()

        # model
        self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                 self.feat_index)  # None * F * K
        # print(self.weights["feature_embeddings"])  shape=[259, 8]: n*K latent vectors
        # print(self.embeddings)  shape=[?, 39, 8]: F*K, one latent vector looked up per
        # field [this is not FFM's per-field lookup; it just picks the non-zero entries,
        # which cuts down the computation]
        feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
        # print(feat_value)  shape=[?, 39, 1]: the 39 feature values of one sample
        self.embeddings = tf.multiply(self.embeddings, feat_value)
        # multiply broadcasts: when one dimension differs, the smaller one is expanded
        # automatically
        # print(self.embeddings)  shape=[?, 39, 8]
        # After this multiply, the tensor holds v_i * x_i, which makes the later
        # computation <v_i, v_j> * x_i * x_j = <v_i*x_i, v_j*x_j> convenient; FM is then
        # simplified into the "sum_square part - square_sum part" form, for which the
        # multiply form above is exactly what is needed!

        # ---------- first order term ----------
        self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
        self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
        self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

        # ---------- second order term ----------
        # sum_square part
        self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K
        # square_sum part
        self.squared_features_emb = tf.square(self.embeddings)
        self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K
        # second order
        self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_sum_features_emb)  # None * K
        self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

        # ---------- Deep component ----------
        self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
        self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
        for i in range(0, len(self.deep_layers)):
            self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i]
            if self.batch_norm:
                self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)
            self.y_deep = self.deep_layers_activation(self.y_deep)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

        # ---------- DeepFM ----------
        if self.use_fm and self.use_deep:
            concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
        elif self.use_fm:
            concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
        elif self.use_deep:
            concat_input = self.y_deep
        self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])
```
At first it is puzzling why this code makes FM look so complicated. The complexity is there for a reason!! It avoids the enormous matrix that one-hot encoding would otherwise produce.
In essence, the embedding layer is just the latent-vector matrix [feature_size * k] shared by the Deep part and FM.
So the heart of this implementation is the embedding layer, which works with the two small matrices Xi and Xv [n * field]. Note that field here is not the F of FFM, but the number of features before one-hot encoding. A small NumPy sketch of this equivalence follows.
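A minimal NumPy sketch (not from the repo) of why the lookup over Xi and Xv is equivalent to multiplying the full one-hot/sparse feature vector into the latent matrix:

```python
import numpy as np

W = np.random.randn(259, 8)     # feature_size x k latent matrix ("feature_embeddings")
Xi = np.array([255, 256, 0])    # the index of the active feature in each field
Xv = np.array([1.0, 0.5, 3.0])  # 1.0 for categorical fields, the raw value for numeric

emb = W[Xi] * Xv[:, None]       # field_size x k, i.e. the rows v_i * x_i

# equivalent dense computation with an explicit sparse vector x of length feature_size
x = np.zeros(259)
x[Xi] = Xv
assert np.allclose(emb.sum(axis=0), W.T @ x)
```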
From the inner-product formula we obtain:
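$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right]$$

which is exactly the "sum_square part - square_sum part" computed over the v_i * x_i embeddings in the code above.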