
A Summary of Recommendation System Algorithms (3): FM, DNN, and DeepFM

Source: https://blog.csdn.net/qq_23269761/article/details/81366939. If anything here is inappropriate, please feel free to get in touch. Thanks!

0. A blog post I can't recommend enough

The past and present of FM:
https://tracholar.github.io/machine-learning/2017/03/10/factorization-machine.html#%E7%BB%BC%E8%BF%B0

1. How FM relates to DNNs and embeddings

First, a quick review of FM.
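The two figures here are missing from this repost; what they showed is presumably the standard second-order FM model, reproduced for reference:

$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j ,
$$

where each feature $x_i$ is associated with a latent vector $v_i \in \mathbb{R}^k$ and $\langle\cdot,\cdot\rangle$ is the inner product.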
After fitting the FM model, every feature x_i gets a corresponding latent vector v_i. So what exactly is this v_i?

Think of word2vec, proposed by Google. word2vec is one kind of word embedding method. Word embedding means: given a document, i.e. a sequence of words such as "A B A C B F G", we want every distinct word in it to get a corresponding (usually low-dimensional) vector representation. For the sequence "A B A C B F G", for example, we might end up with A mapped to the vector [0.1 0.6 -0.5] and B mapped to [-0.2 0.9 0.7].

So the conclusion is:
FM is a tool for both feature combination and dimensionality reduction: it takes the sparse features produced by one-hot encoding, combines them pairwise, and reduces their dimensionality at the same time. Down to how many dimensions? Exactly k, the number of latent factors in FM.
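To spell out why this is a dimensionality reduction (my own gloss, not in the original post): instead of learning an independent weight for every feature pair, FM factorizes the interaction-weight matrix through the latent vectors,

$$
W \approx V V^{\top},\qquad V \in \mathbb{R}^{n \times k},\qquad w_{ij} \approx \langle v_i, v_j \rangle ,
$$

so the $O(n^2)$ pairwise weights are replaced by $n \cdot k$ parameters, and every sparse one-hot feature $i$ is represented by the dense $k$-dimensional vector $v_i$, exactly analogous to a word embedding.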

2.FNN

Use FM as pre-training to obtain the embeddings, then train a DNN on top.
[Figure: FNN architecture]
A model like this captures high-order features, but the low-order features themselves are ignored at the final sigmoid output.

3.DeepFM

Given the above, many recent deep-learning-based CTR models consider both the wide and the deep side (i.e. low-order and high-order features) at the same time to further improve generalization; DeepFM is one of them.
Reference blog post: https://blog.csdn.net/zynash2/article/details/79348540

 
[Figure: DeepFM architecture]
As the figure shows, the model splits into two main parts: FM and DNN. The flow, briefly: following the idea of FNN, FM-style embedding is used, and the wide and deep parts share the result of that embedding. The DNN's input is exactly the same as in FNN (except that no pre-training is used here; the embedding layer is simply treated as one layer of the network). After the embeddings are combined in a particular way, the wide part reproduces exactly the effect of FM (the paper does not derive this in detail; the derivation is given later in this post). Finally the DNN and FM outputs are combined and passed through the activation.
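In symbols, the combination at the output is the one from the DeepFM paper:

$$
\hat{y} = \operatorname{sigmoid}\bigl(y_{FM} + y_{DNN}\bigr),
$$

where $y_{FM}$ contains the first-order term plus the pairwise interactions computed from the shared embeddings, and $y_{DNN}$ is the output of the deep component. (In the code walked through below, the parts are actually concatenated and passed through one final projection, which amounts to a learned weighted sum of the same pieces.)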

What deserves the most attention is the FM part of the model: how exactly is the network built to compute the second-order features?
**Key point:** for the DNN the embedding layer is extracting features, while for FM it is precisely its second-order term. FM and the DNN simply share the embedding layer.

4. Reading the DeepFM code

Code:
https://github.com/ChenglongChen/tensorflow-DeepFM
Data download:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

4.0 Project layout

data: stores the training and test data
output/fig: stores the output results and the training curves
config: parameter settings for data loading and feature engineering
DataReader: feature engineering; builds the feature set actually used for training
main: the main program entry point
metrics: defines the normalized Gini coefficient used as the evaluation metric
DeepFM: the model definition

4.1 Overall flow

A recommended EDA of this dataset; reading it gives a good overview of the data:
https://blog.csdn.net/qq_37195507/article/details/78553581

  • 1. _load_data()

```python
def _load_data():

    dfTrain = pd.read_csv(config.TRAIN_FILE)
    dfTest = pd.read_csv(config.TEST_FILE)

    def preprocess(df):
        cols = [c for c in df.columns if c not in ["id", "target"]]
        df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
        df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
        return df

    dfTrain = preprocess(dfTrain)
    dfTest = preprocess(dfTest)

    cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
    cols = [c for c in cols if (not c in config.IGNORE_COLS)]

    X_train = dfTrain[cols].values
    y_train = dfTrain["target"].values
    X_test = dfTest[cols].values
    ids_test = dfTest["id"].values
    cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

    return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices
```

First the raw data files TRAIN_FILE and TEST_FILE are read.
preprocess(df) adds two features: missing_feat (the number of missing values, encoded as -1, in the row) and ps_car_13_x_ps_reg_03 (the product of two existing features).
It returns:
dfTrain, dfTest: DataFrames that still contain every feature
X_train, X_test: ndarrays with the IGNORE_COLS dropped (note that X_test is never used afterwards)
y_train: the labels
ids_test: the test-set ids, as an ndarray
cat_features_indices: the column indices of the categorical features
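For orientation, this is roughly what main.py expects config.py to provide. Only the names that appear in the code above are taken from the repository; every value below is an illustrative placeholder, not a copy of the repo's actual lists:

```python
# config.py: illustrative sketch, values are placeholders
TRAIN_FILE = "data/train.csv"
TEST_FILE = "data/test.csv"

CATEGORICAL_COLS = ["ps_car_01_cat", "ps_car_02_cat"]        # placeholder examples
NUMERIC_COLS = ["ps_reg_01", "ps_reg_02", "ps_reg_03",
                "ps_car_13", "missing_feat",
                "ps_car_13_x_ps_reg_03"]                      # placeholder examples
IGNORE_COLS = ["ps_calc_01", "ps_calc_02"]                    # placeholder: columns excluded from training
```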

  • X_train and y_train are split into folds with stratified K-fold cross-validation (see the sketch after this list)
  • The DeepFM hyper-parameters (dfm_params) are set
  • 2. _run_base_model_dfm
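A minimal sketch of that stratified split, assuming scikit-learn's StratifiedKFold; the fold count and seed here are placeholders rather than the repo's config values:

```python
from sklearn.model_selection import StratifiedKFold

# One (train_idx, valid_idx) pair per fold; stratification keeps the
# positive/negative ratio of the target roughly equal across folds.
folds = list(StratifiedKFold(n_splits=3, shuffle=True,
                             random_state=2017).split(X_train, y_train))
```

With the folds and the dfm_params dictionary in hand, _run_base_model_dfm below trains one DeepFM per fold and averages the test-set predictions: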
 
```python
def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
    fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                           numeric_cols=config.NUMERIC_COLS,
                           ignore_cols=config.IGNORE_COLS)
    data_parser = DataParser(feat_dict=fd)
    Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
    Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

    dfm_params["feature_size"] = fd.feat_dim
    dfm_params["field_size"] = len(Xi_train[0])

    y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)
    y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)
    _get = lambda x, l: [x[i] for i in l]
    gini_results_cv = np.zeros(len(folds), dtype=float)
    gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    for i, (train_idx, valid_idx) in enumerate(folds):
        Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

        dfm = DeepFM(**dfm_params)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

        gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
        gini_results_epoch_train[i] = dfm.train_result
        gini_results_epoch_valid[i] = dfm.valid_result

    y_test_meta /= float(len(folds))

    # save result
    if dfm_params["use_fm"] and dfm_params["use_deep"]:
        clf_str = "DeepFM"
    elif dfm_params["use_fm"]:
        clf_str = "FM"
    elif dfm_params["use_deep"]:
        clf_str = "DNN"
    print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
    filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
    _make_submission(ids_test, y_test_meta, filename)

    _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

    return y_train_meta, y_test_meta
```

The data then passes through FeatureDictionary in DataReader. This object has a self.feat_dict attribute that looks like the following (excerpt):

```python
{'missing_feat': 0, 'ps_ind_18_bin': {0: 254, 1: 255}, 'ps_reg_01': 256, 'ps_reg_02': 257, 'ps_reg_03': 258}
```
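The dictionary is built by giving each numeric column a single global index and each distinct value of a categorical column its own index. A simplified sketch of that logic (illustrative, not the repo's exact code):

```python
def gen_feat_dict(df, numeric_cols, ignore_cols):
    """Map every column (or every categorical value) to a global feature index."""
    feat_dict, total = {}, 0
    for col in df.columns:
        if col in ignore_cols:
            continue
        if col in numeric_cols:
            feat_dict[col] = total                 # one index for the whole numeric column
            total += 1
        else:
            values = df[col].unique()              # one index per distinct categorical value
            feat_dict[col] = dict(zip(values, range(total, total + len(values))))
            total += len(values)
    return feat_dict, total                        # total is feat_dim, i.e. feature_size
```

feat_dim then becomes feature_size, the number of rows of the embedding matrix.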

Next, DataParser in DataReader:

 
```python
class DataParser(object):
    def __init__(self, feat_dict):
        self.feat_dict = feat_dict  # this feat_dict is a FeatureDictionary instance

    def parse(self, infile=None, df=None, has_label=False):
        assert not ((infile is None) and (df is None)), "infile or df at least one is set"
        assert not ((infile is not None) and (df is not None)), "only one can be set"
        if infile is None:
            dfi = df.copy()
        else:
            dfi = pd.read_csv(infile)
        if has_label:
            y = dfi["target"].values.tolist()
            dfi.drop(["id", "target"], axis=1, inplace=True)
        else:
            ids = dfi["id"].values.tolist()
            dfi.drop(["id"], axis=1, inplace=True)
        # dfi for feature index
        # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
        dfv = dfi.copy()
        for col in dfi.columns:
            if col in self.feat_dict.ignore_cols:
                dfi.drop(col, axis=1, inplace=True)
                dfv.drop(col, axis=1, inplace=True)
                continue
            if col in self.feat_dict.numeric_cols:
                dfi[col] = self.feat_dict.feat_dict[col]
            else:
                dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                dfv[col] = 1.
        # dfi.to_csv('dfi.csv')
        # dfv.to_csv('dfv.csv')

        # list of list of feature indices of each sample in the dataset
        Xi = dfi.values.tolist()
        # list of list of feature values of each sample in the dataset
        Xv = dfv.values.tolist()
        if has_label:
            return Xi, Xv, y
        else:
            return Xi, Xv, ids
```

Here Xi and Xv are both two-dimensional arrays. You can dump dfi and dfv to CSV to inspect them; they look odd at first sight, but this is exactly the shape the model needs later.
dfi: each value is a feature index, i.e. one of the values stored in the feat_dict attribute shown above.
dfv: numeric variables keep their original value, while categorical variables get the value 1.

4.2 Model architecture

 
```python
def _init_graph(self):
    self.graph = tf.Graph()
    with self.graph.as_default():

        tf.set_random_seed(self.random_seed)

        self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                         name="feat_index")  # None * F
        self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                         name="feat_value")  # None * F
        self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
        self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
        self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
        self.train_phase = tf.placeholder(tf.bool, name="train_phase")

        self.weights = self._initialize_weights()

        # model
        self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                 self.feat_index)  # None * F * K

        # print(self.weights["feature_embeddings"])  -> shape=[259, 8]: n * K latent vectors
        # print(self.embeddings)  -> shape=[?, 39, 8]: one latent vector per field
        #   (this is not FFM's per-field latent vectors: indexing per field simply picks the
        #    single non-zero feature of that field, which saves computation)
        feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
        # print(feat_value)  -> shape=[?, 39, 1]: the 39 feature values of one sample
        self.embeddings = tf.multiply(self.embeddings, feat_value)
        # multiply broadcasts along the dimension in which the two tensors differ
        # print(self.embeddings)  -> shape=[?, 39, 8]
        # After this multiply every row stores v_i * x_i, which makes the later computation
        # <v_i, v_j> * x_i * x_j = <v_i*x_i, v_j*x_j> convenient: the FM pairwise term below is
        # reduced to the "sum_square part - square_sum part" form, and having v_i*x_i at hand
        # makes that form easy to evaluate.

        # ---------- first order term ----------
        self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
        self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
        self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

        # ---------- second order term ---------------
        # sum_square part
        self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K

        # square_sum part
        self.squared_features_emb = tf.square(self.embeddings)
        self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K

        # second order
        self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_sum_features_emb)  # None * K
        self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

        # ---------- Deep component ----------
        self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
        self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
        for i in range(0, len(self.deep_layers)):
            self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i]
            if self.batch_norm:
                self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)  # None * layer[i]
            self.y_deep = self.deep_layers_activation(self.y_deep)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

        # ---------- DeepFM ----------
        if self.use_fm and self.use_deep:
            concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
        elif self.use_fm:
            concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
        elif self.use_deep:
            concat_input = self.y_deep
        self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])
```

At first I couldn't see why this code makes FM look so complicated, but the complexity is there for a reason: it avoids ever materializing the huge matrix that one-hot encoding would produce.
In essence, the Deep part and the FM part share the latent-vector matrix [feature_size * k] at the embedding layer.

So the heart of this implementation is the embedding layer. It works through the two much smaller matrices Xi and Xv [n * field]; note that field here is not the F of FFM but the number of raw features before one-hot encoding.
From the definition of the inner product we can then derive the following.
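The derivation the original post promised (its image is missing here) is the standard rewriting of the FM pairwise term, which is exactly what the sum_square / square_sum code above evaluates:

$$
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j\rangle\, x_i x_j
= \frac{1}{2}\sum_{f=1}^{k}\left[\Bigl(\sum_{i=1}^{n} v_{i,f}\, x_i\Bigr)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right].
$$

Because the multiply step already stores $v_i x_i$ per field, the first bracket is summed_features_emb_square and the second is squared_sum_features_emb; y_second_order = 0.5 * (sum_square - square_sum) is computed per latent dimension f, kept as a length-K vector, and concatenated with the first-order term and the deep output before the final projection.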