
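[Summary] Recommender Systems Study: LibMF

Introduction

LibMF was developed by the well-known machine learning group at National Taiwan University. The group has a strong reputation in machine learning: it has placed well in several recent KDD Cup competitions and won first place in multiple consecutive years. The widely used LIBSVM and LIBLINEAR were also developed by this group, and their open-source code is both efficient and of high quality.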

  In the words of its authors: "LIBMF is an open source tool for approximating an incomplete matrix using the product of two matrices in a latent space."
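To make "approximating an incomplete matrix using the product of two matrices" concrete, here is a minimal sketch in Python. It uses plain single-threaded stochastic gradient descent on the squared error, which is a textbook stand-in and not LibMF's parallel solver. The function names (train_mf, rmse), the toy entries, and all hyperparameter values are assumptions made for illustration.

```python
import math
import random

def train_mf(entries, m, n, k=2, epochs=2000, lr=0.02, seed=0):
    """Fit one length-k vector per row (P) and per column (Q) with plain SGD.

    entries: list of (row, col, value) for the observed cells only.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    for _ in range(epochs):
        for u, v, r in entries:
            # Error of the current approximation for this observed cell.
            err = r - sum(P[u][f] * Q[v][f] for f in range(k))
            for f in range(k):
                pu, qv = P[u][f], Q[v][f]
                P[u][f] += lr * err * qv
                Q[v][f] += lr * err * pu
    return P, Q

def rmse(entries, P, Q):
    """Root mean square error over the observed cells."""
    total = 0.0
    for u, v, r in entries:
        pred = sum(a * b for a, b in zip(P[u], Q[v]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(entries))

# Toy 3-by-3 matrix with 6 of its 9 cells observed (made-up values).
observed = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0),
            (1, 2, 2.0), (2, 1, 6.0), (2, 2, 3.0)]
P, Q = train_mf(observed, m=3, n=3)
final_rmse = rmse(observed, P, Q)
```

After training, the dot product of a row vector and a column vector gives an estimate for any cell, including the cells that were never observed.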

  Matrix factorization (MF) is commonly used in recommender systems. LibMF has the following features:

  1. Solvers for real-valued matrix factorization, binary matrix factorization, and one-class matrix factorization
  2. Parallel computation on a multi-core machine
  3. Use of CPU instructions (e.g., SSE) to accelerate vector operations
  4. Convergence to a reasonable level in less than 20 minutes on a data set of 1.7 billion ratings
  5. Cross validation for parameter selection
  6. Disk-level training, which largely reduces memory usage

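Building

  The build was done on Ubuntu 14.04. It requires g++ 4.6 or later.

  Upload the downloaded archive to the Ubuntu machine and extract it.

  Enter the directory and run "make" to build.

  After the build completes, the directory contains the compiled files, including the mf-train and mf-predict programs used below.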

 

Data Format

   <row_idx> <col_idx> <value>

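   Each line of a data file describes one known entry of the matrix. In the demo directory, the files `real_matrix.tr.txt' and `real_matrix.te.txt' are the training and test sets for the real-valued matrix factorization (RVMF) demonstration. For binary matrix factorization (BMF), every <value> in `binary_matrix.tr.txt' and `binary_matrix.te.txt' is in {-1, 1}. For one-class MF, all <value>s are positive.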
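A short Python sketch of reading this format follows. The helper name parse_ratings and the sample lines are made up for illustration; they are not part of LibMF or its demo data.

```python
def parse_ratings(text):
    """Parse lines of the form '<row_idx> <col_idx> <value>' into tuples."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        entries.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return entries

def is_binary(entries):
    """True if every value is -1 or 1, as required for BMF data."""
    return all(value in (-1.0, 1.0) for _, _, value in entries)

# Hypothetical sample data, not taken from the demo directory.
real_sample = "0 0 4.5\n0 2 3\n1 1 2.5\n"
binary_sample = "0 0 1\n0 1 -1\n2 1 1\n"

real_entries = parse_ratings(real_sample)
binary_entries = parse_ratings(binary_sample)
```

A check such as is_binary can catch a real-valued file passed to a BMF run by mistake before any training starts.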

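Model Format

 LibMF factorizes a training matrix R into a k-by-m matrix `P' and a k-by-n matrix `Q' such that R is approximated by P'Q, where P' denotes the transpose of P. After training finishes, the two factor matrices P and Q are stored in a model file. The file starts with the following header:

     `f': the loss function of the solved MF problem,
     `m': the number of rows in the training matrix,
     `n': the number of columns in the training matrix,
     `k': the number of latent factors,
     `b': the average of all elements in the training matrix.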

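After the header, the columns of P and Q are stored one per line. Each line begins with two leading tokens, followed by the values of one column. The first token is the name of the stored column, and the second token indicates the type of its values. If the second token is `T', the column is real-valued; otherwise, all values of the column are NaN. For example, if

          [1 NaN 2]          [-1 -2]
      P = |3 NaN 4|,     Q = |-3 -4|,
          [5 NaN 6]          [-5 -6]

      and b = 0.5, then the content of the model file is: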

      --------model file--------
       m 3
       n 2
       k 3
       b 0.5
       p0 T 1 3 5
       p1 F 0 0 0
       p2 T 2 4 6
       q0 T -1 -3 -5
       q1 T -2 -4 -6
       --------------------------
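The layout above can be read back with a few lines of Python. This is only a sketch: parse_model and predict are hypothetical helpers, not LibMF functions, and returning b when a column is flagged as NaN is an assumption about how such columns might be handled. An `f' header line is accepted if present.

```python
import math

def parse_model(text):
    """Read the header and the stored columns of P and Q from model text."""
    header, P, Q = {}, {}, {}
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        name = tokens[0]
        if name in ("f", "m", "n", "k", "b"):
            header[name] = float(tokens[1]) if name == "b" else int(tokens[1])
            continue
        flag = tokens[1]
        values = [float(v) for v in tokens[2:]]
        if flag != "T":
            # Any flag other than T marks a column whose values are all NaN.
            values = [math.nan] * len(values)
        target = P if name[0] == "p" else Q
        target[int(name[1:])] = values
    return header, P, Q

def predict(header, P, Q, row, col):
    """Dot product of column `row` of P and column `col` of Q.

    Assumption: fall back to the training average b for NaN columns.
    """
    p, q = P[row], Q[col]
    if any(math.isnan(v) for v in p + q):
        return header["b"]
    return sum(a * b for a, b in zip(p, q))

model_text = """m 3
n 2
k 3
b 0.5
p0 T 1 3 5
p1 F 0 0 0
p2 T 2 4 6
q0 T -1 -3 -5
q1 T -2 -4 -6
"""
header, P, Q = parse_model(model_text)
```

With this model, predict(header, P, Q, 0, 0) multiplies p0 = (1, 3, 5) by q0 = (-1, -3, -5), while any request involving p1 falls back to b.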

Usage

'mf-train'

Usage: mf-train [options] training_set_file [model_file]

   "mf-train" is the main training command of LibMF. At each iteration, the following information is printed:

    - iter: the index of iteration
    - tr_xxxx: xxxx is the evaluation criterion on the training set
    - va_xxxx: the same criterion on the validation set if `-p' is set
    - obj: objective function value

    Here `tr_xxxx' and `obj' are estimates, because computing the exact values is too time-consuming.
    The evaluation criterion for each loss is:

        <loss>: <evaluation criterion>
        -       0: root mean square error (RMSE)
        -       1: mean absolute error (MAE)
        -       2: generalized KL-divergence (KL)
        -       5: logarithmic loss
        -   6 & 7: accuracy
        - 10 & 11: pair-wise logarithmic loss (BprLoss)
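For reference, several of these criteria can be written down directly. The Python functions below follow the standard textbook definitions; LibMF's own code may differ in details such as averaging or clipping, so treat this as a sketch rather than a description of its implementation. The function names are made up, and the pair-wise loss is omitted.

```python
import math

def rmse(truth, pred):
    """Root mean square error (criterion listed for loss 0)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mae(truth, pred):
    """Mean absolute error (criterion listed for loss 1)."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def generalized_kl(truth, pred):
    """Generalized KL-divergence (loss 2); assumes all values are positive."""
    return sum(t * math.log(t / p) - t + p for t, p in zip(truth, pred)) / len(truth)

def log_loss(labels, scores):
    """Logarithmic loss for labels in {-1, 1} (loss 5)."""
    return sum(math.log(1.0 + math.exp(-y * s)) for y, s in zip(labels, scores)) / len(labels)

def accuracy(labels, scores):
    """Fraction of scores whose sign matches the label (losses 6 and 7)."""
    hits = sum(1 for y, s in zip(labels, scores) if (s > 0) == (y > 0))
    return hits / len(labels)

# Made-up values for illustration.
truth, pred = [3.0, 5.0], [1.0, 5.0]
labels, scores = [1, -1, 1], [0.3, -2.0, -0.1]
```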

'mf-predict'

Usage: mf-predict [options] test_file model_file output_file

    

Examples

In the demo directory there is a shell script, "demo.sh", which can be run as a demonstration.

    Some example commands follow:

mf-train real_matrix.tr.txt model

     train a model using the default parameters

    

mf-train -l1 0.05 -l2 0.01 real_matrix.tr.txt model

   train a model with the following regularization coefficients:

coefficient of L1-norm regularization on P = 0.05
coefficient of L1-norm regularization on Q = 0.05
coefficient of L2-norm regularization on P = 0.01
coefficient of L2-norm regularization on Q = 0.01
 

mf-train -l1 0.015,0 -l2 0.01,0.005 real_matrix.tr.txt model

 train a model with the following regularization coefficients:

coefficient of L1-norm regularization on P = 0.015
coefficient of L1-norm regularization on Q = 0
coefficient of L2-norm regularization on P = 0.01
coefficient of L2-norm regularization on Q = 0.005
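The `-l1' and `-l2' options appear to follow one convention: a single number applies to both P and Q, while a comma-separated pair gives the coefficient for P first and for Q second. The Python sketch below mirrors that reading. The helper split_coefficient is hypothetical and is not LibMF's own argument parser, and the pair "0.1,0.2" is a made-up value.

```python
def split_coefficient(argument):
    """Return (coefficient_for_P, coefficient_for_Q) from an option value.

    "0.05"    -> the same value for P and Q
    "0.1,0.2" -> 0.1 for P and 0.2 for Q
    """
    parts = argument.split(",")
    if len(parts) == 1:
        value = float(parts[0])
        return value, value
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    raise ValueError(f"expected one or two numbers, got {argument!r}")

shared = split_coefficient("0.05")
separate = split_coefficient("0.1,0.2")
```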
 

mf-train -f 5 -l1 0,0.02 -k 100 -t 30 -r 0.02 -s 4 binary_matrix.tr.txt model

  train a BMF model using logarithmic loss and the following parameters:

coefficient of L1-norm regularization on P = 0
coefficient of L1-norm regularization on Q = 0.02
latent factors = 100
iterations = 30
learning rate = 0.02
threads = 4
 

mf-train -p real_matrix.te.txt real_matrix.tr.txt model

use real_matrix.te.txt for hold-out validation

 

mf-train -v 5 real_matrix.tr.txt

 do five fold cross validation
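As a reminder of what five-fold cross validation involves, the Python sketch below splits a list of entries into five folds and yields each training and validation pair in turn. It illustrates the general technique only: how mf-train assigns entries to folds is not described here, and the name k_fold_splits and the toy entries are assumptions.

```python
import random

def k_fold_splits(entries, k=5, seed=0):
    """Yield (training, validation) lists, one pair per fold."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled entries into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training, validation

# Ten made-up (row, col, value) entries.
entries = [(i, i % 3, float(i)) for i in range(10)]
splits = list(k_fold_splits(entries, k=5))
```

Each entry lands in exactly one validation fold, so every known rating is used for validation once and for training four times.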

 

mf-train -f 2 --nmf real_matrix.tr.txt

 do non-negative matrix factorization with generalized KL-divergence

 

mf-train --quiet real_matrix.tr.txt

 do not print message to screen

mf-train --disk real_matrix.tr.txt

 do disk-level training

 

mf-predict real_matrix.te.txt model output

  do prediction

 

mf-predict -e 1 real_matrix.te.txt model output

  do prediction and output MAE

 
 [Figure: the directory after running the commands above]

 

  I would welcome advice from anyone who has used LibMF, because when I ran "./demo.sh" in the demo directory, the first step, real_matrix, ran successfully:

 

 

 Original article: http://blog.csdn.net/chenkfkevin/article/details/51064292