Summary of Recommender System Algorithms (3): FM, DNN, and DeepFM
Source: https://blog.csdn.net/qq_23269761/article/details/81366939 — if anything here is inappropriate, please feel free to get in touch. Thanks~
0. Enthusiastically recommending a blog
1. How FM relates to DNNs and embeddings
First, a quick review of FM.
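For reference, the second-order FM model being reviewed here is

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

where each feature $x_i$ is paired with a latent vector $v_i \in \mathbb{R}^k$.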
After solving the FM model, every feature x_i gets a corresponding latent vector v_i. What exactly is this v_i?
Think of word2vec, proposed by Google. word2vec is one kind of word embedding method. Word embedding means: given a document, i.e. a sequence of words such as "A B A C B F G", we want a corresponding vector (usually low-dimensional) for each distinct word in it. For the sequence "A B A C B F G", for example, we might end up with A mapped to the vector [0.1 0.6 -0.5] and B mapped to [-0.2 0.9 0.7].
So the conclusion is:
FM is a tool for both feature combination and dimensionality reduction: it takes the sparse features produced by one-hot encoding, combines them pairwise, and reduces the dimensionality at the same time!! Reduced to how many dimensions? To k, the number of latent factors in FM.
2. FNN
FNN uses FM as pre-training to obtain the embeddings, then trains a DNN on top of them.
Such a model captures high-order features, but at the final sigmoid output it ignores the low-order features themselves.
3. DeepFM
Given the above, many recent deep-learning-based CTR models consider both the wide and deep sides (i.e. low-order and high-order features) at the same time to further improve generalization; DeepFM is one of them.
Reference blog: https://blog.csdn.net/zynash2/article/details/79348540
As you can see, the whole model splits roughly into two parts: FM and DNN. Briefly, the flow is as follows: borrowing the idea of FNN, FM is used for the embedding, and the wide and deep parts then share the embedded result. The input to the DNN is exactly the same as in FNN (except that no pre-training is used here; the embedding layer is treated directly as one layer of the NN), and after being combined in a certain way, the wide side of the model exactly reproduces the effect of FM (the paper does not derive this in detail; the derivation is given later in this post). Finally the DNN and FM outputs are combined and passed through an activation to produce the output.
What deserves special emphasis is the FM part of the model: exactly how the network is built to compute the second-order features.
**Key point:** to the DNN, the embedding layer is extracting features; to FM, it is precisely its second-order part!!!! FM and the DNN simply share the embedding layer.
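In the DeepFM paper this shared-embedding design is summarized by combining the two parts' outputs: $\hat{y} = \operatorname{sigmoid}(y_{FM} + y_{DNN})$.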
4. DeepFM code walkthrough
Code link:
https://github.com/ChenglongChen/tensorflow-DeepFM
Data download:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
4.0 Project layout
data: stores the training and test data
output/fig: holds the output results and the training curves
config: parameter settings for data loading and feature engineering
DataReader: feature engineering; builds the feature set actually used for training
main: the entry point of the program
metrics: defines the normalized Gini coefficient (gini_norm) as the evaluation metric
DeepFM: the model definition
4.1 Overall flow
Here is a recommended EDA of this dataset; reading it gives a good feel for the data as a whole:
https://blog.csdn.net/qq_37195507/article/details/78553581
- 1. `_load_data()`

```python
# from main.py; assumes: import numpy as np, import pandas as pd, import config
def _load_data():
    dfTrain = pd.read_csv(config.TRAIN_FILE)
    dfTest = pd.read_csv(config.TEST_FILE)

    def preprocess(df):
        cols = [c for c in df.columns if c not in ["id", "target"]]
        # count of missing entries (encoded as -1) in each row
        df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
        # interaction feature: product of two existing features
        df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
        return df

    dfTrain = preprocess(dfTrain)
    dfTest = preprocess(dfTest)

    cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
    cols = [c for c in cols if (not c in config.IGNORE_COLS)]

    X_train = dfTrain[cols].values
    y_train = dfTrain["target"].values
    X_test = dfTest[cols].values
    ids_test = dfTest["id"].values
    cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

    return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices
```
First, the raw data files TRAIN_FILE and TEST_FILE are read.
preprocess(df) adds two features: missing_feat [the number of missing values in the row] and ps_car_13_x_ps_reg_03 [the product of two existing features].
It returns:
dfTrain, dfTest: DataFrames that still contain all the features
X_train, X_test: ndarrays with the IGNORE_COLS dropped [note that X_test is never actually used later]
y_train: the labels
ids_test: the ids of the test set, as an ndarray
cat_features_indices: the column indices of the categorical features
- Use X_train and y_train to split the dataset with stratified K-fold cross-validation (see the sketch below)
- Set the DeepFM parameters (also sketched below)
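A minimal sketch of these two steps, assuming sklearn's StratifiedKFold and the NUM_SPLITS / RANDOM_SEED constants from config; the dfm_params values shown are illustrative and abridged (see main.py in the repo for the full set):

```python
from sklearn.model_selection import StratifiedKFold
import tensorflow as tf

# stratified K-fold split over the training data
folds = list(StratifiedKFold(n_splits=config.NUM_SPLITS, shuffle=True,
                             random_state=config.RANDOM_SEED).split(X_train, y_train))

# abridged DeepFM hyper-parameters (illustrative values)
dfm_params = {
    "use_fm": True, "use_deep": True,   # DeepFM = FM part + Deep part
    "embedding_size": 8,                # k, the latent-factor dimension
    "deep_layers": [32, 32],
    "dropout_fm": [1.0, 1.0],
    "dropout_deep": [0.5, 0.5, 0.5],
    "deep_layers_activation": tf.nn.relu,
    "epoch": 30, "batch_size": 1024,
    "learning_rate": 0.001, "optimizer_type": "adam",
}
```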
- 2. `_run_base_model_dfm`

```python
def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
    fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                           numeric_cols=config.NUMERIC_COLS,
                           ignore_cols=config.IGNORE_COLS)
    data_parser = DataParser(feat_dict=fd)
    Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
    Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

    dfm_params["feature_size"] = fd.feat_dim
    dfm_params["field_size"] = len(Xi_train[0])

    # out-of-fold predictions on train; test predictions are averaged over the folds
    y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)
    y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)
    _get = lambda x, l: [x[i] for i in l]
    gini_results_cv = np.zeros(len(folds), dtype=float)
    gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)

    for i, (train_idx, valid_idx) in enumerate(folds):
        Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

        dfm = DeepFM(**dfm_params)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

        gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
        gini_results_epoch_train[i] = dfm.train_result
        gini_results_epoch_valid[i] = dfm.valid_result

    y_test_meta /= float(len(folds))

    # save result
    if dfm_params["use_fm"] and dfm_params["use_deep"]:
        clf_str = "DeepFM"
    elif dfm_params["use_fm"]:
        clf_str = "FM"
    elif dfm_params["use_deep"]:
        clf_str = "DNN"
    print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
    filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
    _make_submission(ids_test, y_test_meta, filename)

    _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

    return y_train_meta, y_test_meta
```
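main.py then drives these two functions roughly as follows (a sketch; the folds come from the stratified split shown earlier):

```python
# load data, then run cross-validated DeepFM (toggle use_fm/use_deep for FM- or DNN-only)
dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices = _load_data()
y_train_dfm, y_test_dfm = _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params)
```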
The data first passes through FeatureDictionary in DataReader. This object has a self.feat_dict attribute that looks like this (excerpt):

```python
{'missing_feat': 0, 'ps_ind_18_bin': {0: 254, 1: 255}, 'ps_reg_01': 256, 'ps_reg_02': 257, 'ps_reg_03': 258}
```

Each numeric feature is assigned a single global index, while each categorical feature maps every one of its values to its own index.
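A minimal sketch of how such a dictionary can be built (paraphrasing FeatureDictionary.gen_feat_dict in DataReader.py; the details in the repo may differ slightly):

```python
import pandas as pd

def gen_feat_dict(dfTrain, dfTest, numeric_cols, ignore_cols):
    df = pd.concat([dfTrain, dfTest])
    feat_dict, tc = {}, 0                # tc: running global feature index
    for col in df.columns:
        if col in ignore_cols:
            continue
        if col in numeric_cols:
            feat_dict[col] = tc          # one index for the whole numeric column
            tc += 1
        else:
            us = df[col].unique()        # one index per category value
            feat_dict[col] = dict(zip(us, range(tc, len(us) + tc)))
            tc += len(us)
    return feat_dict, tc                 # tc == feat_dim == total feature_size
```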
Next comes DataParser in DataReader:

```python
class DataParser(object):
    def __init__(self, feat_dict):
        self.feat_dict = feat_dict  # a FeatureDictionary instance

    def parse(self, infile=None, df=None, has_label=False):
        assert not ((infile is None) and (df is None)), "infile or df at least one is set"
        assert not ((infile is not None) and (df is not None)), "only one can be set"
        if infile is None:
            dfi = df.copy()
        else:
            dfi = pd.read_csv(infile)
        if has_label:
            y = dfi["target"].values.tolist()
            dfi.drop(["id", "target"], axis=1, inplace=True)
        else:
            ids = dfi["id"].values.tolist()
            dfi.drop(["id"], axis=1, inplace=True)
        # dfi for feature index
        # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
        dfv = dfi.copy()
        for col in dfi.columns:
            if col in self.feat_dict.ignore_cols:
                dfi.drop(col, axis=1, inplace=True)
                dfv.drop(col, axis=1, inplace=True)
                continue
            if col in self.feat_dict.numeric_cols:
                dfi[col] = self.feat_dict.feat_dict[col]
            else:
                dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                dfv[col] = 1.
        # dfi.to_csv('dfi.csv')
        # dfv.to_csv('dfv.csv')

        # list of list of feature indices of each sample in the dataset
        Xi = dfi.values.tolist()
        # list of list of feature values of each sample in the dataset
        Xv = dfv.values.tolist()
        if has_label:
            return Xi, Xv, y
        else:
            return Xi, Xv, ids
```
Here Xi and Xv are both 2-D arrays. You can save dfi and dfv to csv files to see what they look like; they look rather odd at first [presumably because this is the format the model needs later]:
dfi: the values are feature indices, i.e. the values stored in the feat_dict attribute shown above
dfv: numeric variables keep their original value, while categorical variables get the value 1
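A tiny worked example under the feat_dict excerpt above: a sample with ps_ind_18_bin = 1 and ps_reg_01 = 0.5 contributes index 255 (the index of that category value) with value 1.0, and index 256 (the index of the numeric column) with value 0.5, i.e. Xi = [..., 255, 256, ...] and Xv = [..., 1.0, 0.5, ...].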
4.2 Model architecture
```python
def _init_graph(self):
    self.graph = tf.Graph()
    with self.graph.as_default():
        tf.set_random_seed(self.random_seed)

        self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                         name="feat_index")  # None * F
        self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                         name="feat_value")  # None * F
        self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
        self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
        self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
        self.train_phase = tf.placeholder(tf.bool, name="train_phase")

        self.weights = self._initialize_weights()

        # model
        self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                 self.feat_index)  # None * F * K
        # print(self.weights["feature_embeddings"])  shape=[259, 8]: n*K latent vectors
        # print(self.embeddings)  shape=[?, 39, 8]: F*K, one latent vector looked up per
        # field [this is not FFM's per-field lookup; it just picks the non-zero entries,
        # which cuts down the computation]
        feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
        # print(feat_value)  shape=[?, 39, 1]: the 39 feature values of one sample
        self.embeddings = tf.multiply(self.embeddings, feat_value)
        # multiply broadcasts: when one dimension differs, the smaller one is expanded
        # automatically
        # print(self.embeddings)  shape=[?, 39, 8]
        # After this multiply, the tensor holds v_i * x_i, which makes the later
        # computation <v_i, v_j> * x_i * x_j = <v_i*x_i, v_j*x_j> convenient; FM is then
        # simplified into the "sum_square part - square_sum part" form, for which the
        # multiply form above is exactly what is needed!

        # ---------- first order term ----------
        self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
        self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
        self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

        # ---------- second order term ----------
        # sum_square part
        self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K
        # square_sum part
        self.squared_features_emb = tf.square(self.embeddings)
        self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K
        # second order
        self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_sum_features_emb)  # None * K
        self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

        # ---------- Deep component ----------
        self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
        self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
        for i in range(0, len(self.deep_layers)):
            self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i]
            if self.batch_norm:
                self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)
            self.y_deep = self.deep_layers_activation(self.y_deep)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

        # ---------- DeepFM ----------
        if self.use_fm and self.use_deep:
            concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
        elif self.use_fm:
            concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
        elif self.use_deep:
            concat_input = self.y_deep
        self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])
```
At first it is puzzling why this code makes FM look so complicated. The complexity is there for a reason!! It avoids the enormous matrix that one-hot encoding would otherwise produce.
In essence, the embedding layer is just the latent-vector matrix [feature_size * k] shared by the Deep part and FM.
So the heart of this implementation is the embedding layer, which works with the two small matrices Xi and Xv [n * field]. Note that field here is not the F of FFM, but the number of features before one-hot encoding. A small NumPy sketch of this equivalence follows.
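A minimal NumPy sketch (not from the repo) of why the lookup over Xi and Xv is equivalent to multiplying the full one-hot/sparse feature vector into the latent matrix:

```python
import numpy as np

W = np.random.randn(259, 8)     # feature_size x k latent matrix ("feature_embeddings")
Xi = np.array([255, 256, 0])    # the index of the active feature in each field
Xv = np.array([1.0, 0.5, 3.0])  # 1.0 for categorical fields, the raw value for numeric

emb = W[Xi] * Xv[:, None]       # field_size x k, i.e. the rows v_i * x_i

# equivalent dense computation with an explicit sparse vector x of length feature_size
x = np.zeros(259)
x[Xi] = Xv
assert np.allclose(emb.sum(axis=0), W.T @ x)
```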
From the inner-product formula we obtain:
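$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right]$$

which is exactly the "sum_square part - square_sum part" computed over the v_i * x_i embeddings in the code above.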