
A User-Based Recommender System, with Mahout as an Example

User-based collaborative filtering is the oldest algorithm in recommender systems, and its idea is very direct: find users similar to the target user and recommend based on what those similar users like.

The concrete flow is as follows. Let u denote the target user: for every item u has not yet rated, look at every other user who has rated that item, compute the similarity between u and that user, and fold that user's rating, weighted by the similarity, into a running average; the items with the highest estimated preference are then recommended. This is the most naive user-based recommendation flow, but it is far too inefficient in practice, so real systems use a slightly different flow.


The main difference is that a set of similar users is found first, and only the items associated with that neighborhood of similar users are treated as candidates; a sketch of the naive version follows.
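
Here is a self-contained sketch of the naive flow, in plain Java with made-up ratings and a placeholder similarity function (none of the names below are Mahout API); the practical variant would simply restrict the inner loop to a neighborhood of the top-N most similar users and score only the items they have rated:

import java.util.HashMap;
import java.util.Map;

// Naive user-based recommendation: for each item the target user has not rated,
// accumulate every other user's rating weighted by that user's similarity to the target.
public class NaiveUserBasedSketch {

	// placeholder similarity; a real system would plug in Pearson correlation etc.
	static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
		return 1.0;
	}

	public static void main(String[] args) {
		Map<Long, Map<Long, Double>> ratings = new HashMap<>();
		ratings.put(1L, Map.of(101L, 5.0, 102L, 3.0));
		ratings.put(2L, Map.of(101L, 2.0, 103L, 5.0));
		ratings.put(3L, Map.of(102L, 4.0, 103L, 4.5));

		long u = 1L;                       // target user
		Map<Long, Double> weightedSum = new HashMap<>();
		Map<Long, Double> weightTotal = new HashMap<>();

		for (Map.Entry<Long, Map<Long, Double>> other : ratings.entrySet()) {
			if (other.getKey() == u) {
				continue;
			}
			double s = similarity(ratings.get(u), other.getValue());
			for (Map.Entry<Long, Double> pref : other.getValue().entrySet()) {
				if (ratings.get(u).containsKey(pref.getKey())) {
					continue;              // skip items u has already rated
				}
				weightedSum.merge(pref.getKey(), s * pref.getValue(), Double::sum);
				weightTotal.merge(pref.getKey(), s, Double::sum);
			}
		}

		// estimated preference per candidate item; the top-N of these are recommended
		weightedSum.forEach((item, sum) ->
				System.out.println(item + " -> " + sum / weightTotal.get(item)));
	}
}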

The classic way to wire this up in Mahout:

package recommender;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {

	private RecommenderIntro() {
	}

	public static void main(String[] args) throws Exception {
		// Use the data file passed on the command line, or fall back to a default path.
		File modelFile = null;
		if (args.length > 0)
			modelFile = new File(args[0]);
		if (modelFile == null || !modelFile.exists())
			modelFile = new File("E:\\hello.txt");
		if (!modelFile.exists()) {
			System.err
					.println("Please specify the preference data file, or put it at the default path!");
			System.exit(1);
		}
		// DataModel backed by a text file of userID,itemID,preference lines
		DataModel model = new FileDataModel(modelFile);

		// Pearson correlation as the user-user similarity measure
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		// Neighborhood made up of the 2 users most similar to the target user
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(2,
				similarity, model);

		// Classic user-based recommender: data model + neighborhood + similarity
		Recommender recommender = new GenericUserBasedRecommender(model,
				neighborhood, similarity);

		// Ask for a refresh of any cached state (harmless right after construction)
		recommender.refresh(null);

		// Top 3 recommendations for the user with ID 1
		List<RecommendedItem> recommendations = recommender.recommend(1, 3);

		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation);
		}

	}

}
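
For reference, FileDataModel expects a plain text file in which each line has the form userID,itemID,preferenceValue (an optional timestamp field may follow). A hypothetical hello.txt could look like this (illustrative values only):

1,101,5.0
1,102,3.0
2,101,2.0
2,103,5.0
3,102,4.0
3,103,4.5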

DataModel was covered in detail in an earlier post, so it is not repeated here; the focus below is the similarity measure.

The measure used here is PearsonCorrelationSimilarity, so let us look at the Pearson correlation coefficient in some detail.

Suppose we have two variables X and Y; their Pearson correlation coefficient can be computed with any of the following equivalent formulas.

Formula 1:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

Formula 2:

$$\rho_{X,Y} = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\;\sqrt{E(Y^2)-E^2(Y)}}$$

Formula 3:

$$r = \frac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i-\bar{X})^2}\;\sqrt{\sum_{i=1}^{N}(Y_i-\bar{Y})^2}}$$

Formula 4:

$$r = \frac{\sum X_iY_i - \frac{\sum X_i \sum Y_i}{N}}{\sqrt{\left(\sum X_i^2 - \frac{(\sum X_i)^2}{N}\right)\left(\sum Y_i^2 - \frac{(\sum Y_i)^2}{N}\right)}}$$

The four formulas above are equivalent; E denotes the mathematical expectation, cov the covariance, and N the number of values the variables take.

A Pearson-based scoring algorithm first finds the items both reviewers have rated, then computes each reviewer's rating sum and sum of squares over those items, together with the sum of the products of the paired ratings, and finally plugs these into formula 4 above to obtain the Pearson correlation coefficient.

Different tools pick different (but equivalent) formulas to implement; the underlying principle is the same. Below is a Python version:

from math import sqrt

# toy preference data: two users and their item ratings
critics = {
    'bob':   {'A': 5.0, 'B': 3.0, 'C': 2.5},
    'alice': {'A': 5.0, 'C': 3.0}}

def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # debug: show the mutually rated items
    print(si)
    # if there are no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Sum calculations
    n = len(si)
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
if __name__ == "__main__":
    print(critics['bob'])
    print(sim_pearson(critics, 'bob', 'alice'))
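
Running the script, bob and alice have the two items A and C in common, so n = 2, sum1 = 7.5, sum2 = 8.0, sum1Sq = 31.25, sum2Sq = 34.0 and pSum = 32.5, and formula 4 gives

$$r = \frac{32.5 - \frac{7.5 \times 8.0}{2}}{\sqrt{\left(31.25 - \frac{7.5^2}{2}\right)\left(34.0 - \frac{8.0^2}{2}\right)}} = \frac{2.5}{\sqrt{3.125 \times 2}} = 1.0$$

so it should print a similarity of 1.0. Incidentally, with only two common items the Pearson correlation is always +1 or -1 (or undefined), however different the ratings are in absolute terms, which already previews the weaknesses discussed at the end of this post.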
Mahout has a similar implementation:
package org.apache.mahout.cf.taste.impl.similarity;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.model.DataModel;

import com.google.common.base.Preconditions;

/**
 * <p>
 * An implementation of the Pearson correlation. For users X and Y, the following values are calculated:
 * </p>
 *
 * <ul>
 * <li>sumX2: sum of the square of all X's preference values</li>
 * <li>sumY2: sum of the square of all Y's preference values</li>
 * <li>sumXY: sum of the product of X and Y's preference value for all items for which both X and Y express a
 * preference</li>
 * </ul>
 *
 * <p>
 * The correlation is then:
 *
 * <p>
 * {@code sumXY / sqrt(sumX2 * sumY2)}
 * </p>
 *
 * <p>
 * Note that this correlation "centers" its data, shifts the user's preference values so that each of their
 * means is 0. This is necessary to achieve expected behavior on all data sets.
 * </p>
 *
 * <p>
 * This correlation implementation is equivalent to the cosine similarity since the data it receives
 * is assumed to be centered -- mean is 0. The correlation may be interpreted as the cosine of the angle
 * between the two vectors defined by the users' preference values.
 * </p>
 *
 * <p>
 * For cosine similarity on uncentered data, see {@link UncenteredCosineSimilarity}.
 * </p> 
 */
public final class PearsonCorrelationSimilarity extends AbstractSimilarity {

  /**
   * @throws IllegalArgumentException if {@link DataModel} does not have preference values
   */
  public PearsonCorrelationSimilarity(DataModel dataModel) throws TasteException {
    this(dataModel, Weighting.UNWEIGHTED);
  }

  /**
   * @throws IllegalArgumentException if {@link DataModel} does not have preference values
   */
  public PearsonCorrelationSimilarity(DataModel dataModel, Weighting weighting) throws TasteException {
    super(dataModel, weighting, true);
    Preconditions.checkArgument(dataModel.hasPreferenceValues(), "DataModel doesn't have preference values");
  }
  
  @Override
  double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
      return Double.NaN;
    }
    // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
    // the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both parties has -all- the same ratings;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;
  }
  
}

Having looked at two implementations of the Pearson correlation coefficient, let us now discuss the problems with using this similarity measure in a recommender system.

First, a concrete example: take a small data set of a handful of users (user1 through user5) rating a few items, and compute each user's Pearson similarity to user1; the code sketch further below builds such a data set in memory.

At first glance the Pearson correlation coefficient looks fine, but a closer analysis reveals several defects of this similarity measure.

First, the Pearson correlation coefficient takes no account of how many preferences two users have in common, which can be a weakness when it is used in a recommendation engine. In the example, user1 and user5 express similar preferences for three items, yet user1 and user4, who overlap on fewer items, come out with a higher similarity, which is rather counter-intuitive.

Second, if two users have expressed a preference for only one item in common, the Pearson correlation coefficient between them cannot be computed at all, as with user1 and user3 in the example.

Finally, if user5 had given the same preference value of 3.0 to every item, the similarity would again be undefined (looking at formula 4, the denominator becomes 0).
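
These failure modes are easy to reproduce with Mahout's in-memory model classes. Below is a minimal sketch; the user IDs, item IDs and rating values are made up for illustration (user 6 plays the role of the "all ratings identical" user), and the expected outputs noted in the comments follow from the formulas above.

package recommender;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class PearsonPitfalls {

	// build one user's in-memory preference array from parallel item/value arrays
	private static PreferenceArray prefs(long userID, long[] items, float[] values) {
		PreferenceArray array = new GenericUserPreferenceArray(items.length);
		for (int i = 0; i < items.length; i++) {
			array.setUserID(i, userID);
			array.setItemID(i, items[i]);
			array.setValue(i, values[i]);
		}
		return array;
	}

	public static void main(String[] args) throws Exception {
		FastByIDMap<PreferenceArray> data = new FastByIDMap<PreferenceArray>();
		data.put(1L, prefs(1, new long[] {101, 102, 103}, new float[] {5.0f, 3.0f, 2.5f}));
		data.put(3L, prefs(3, new long[] {101, 104}, new float[] {2.5f, 4.0f}));
		data.put(4L, prefs(4, new long[] {101, 103, 104}, new float[] {5.0f, 3.0f, 4.5f}));
		data.put(5L, prefs(5, new long[] {101, 102, 103, 104}, new float[] {4.0f, 3.0f, 2.0f, 4.0f}));
		data.put(6L, prefs(6, new long[] {101, 102, 103}, new float[] {3.0f, 3.0f, 3.0f}));

		DataModel model = new GenericDataModel(data);
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

		// weakness 1: user 4 (2 items in common) beats user 5 (3 items in common)
		System.out.println("1-4: " + similarity.userSimilarity(1, 4)); // expected 1.0
		System.out.println("1-5: " + similarity.userSimilarity(1, 5)); // expected ~0.94
		// weakness 2: only one item in common -> undefined (NaN)
		System.out.println("1-3: " + similarity.userSimilarity(1, 3));
		// weakness 3: a user whose ratings are all identical -> undefined (NaN)
		System.out.println("1-6: " + similarity.userSimilarity(1, 6));
	}
}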

So although many papers choose this similarity measure, in practice we need to weigh several factors, based on the business scenario, when selecting a similarity measure.
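
In Mahout itself, trying a different measure is a one-line change in the calling code above; for example (these classes live in org.apache.mahout.cf.taste.impl.similarity in the Mahout versions this code targets):

		UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
		// or: new LogLikelihoodSimilarity(model);
		// or: new UncenteredCosineSimilarity(model);  // cosine on uncentered data, see the Javadoc above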