Slope One: the simplest collaborative filtering algorithm for personalized recommendation

阿新 • Published: 2019-01-08

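Slope One is a family of algorithms used for collaborative filtering, introduced in a 2005 paper by Daniel Lemire and Anna Maclachlan. [1] Arguably, it is the simplest form of non-trivial item-based collaborative filtering based on ratings. Their simplicity makes the algorithms easy to implement efficiently, while their accuracy is often on par with that of more complicated and computationally expensive algorithms. [2] They have also been used as building blocks to improve other algorithms. [3][4]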

Introduction to collaborative filtering and its main strengths and weaknesses

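Collaborative filtering recommendation is rapidly becoming a popular technique in information filtering and information systems. Unlike traditional content-based filtering, which analyzes the content directly to make recommendations, collaborative filtering analyzes user interests: it finds users whose interests are similar to those of a given user, then combines those similar users' ratings of an item to predict how much the given user will like that item.
Compared with traditional text filtering, collaborative filtering has the following advantages:
1 It can filter information that is hard to analyze automatically by content, such as artwork and music;
2 It can filter on complex concepts that are hard to express, such as information quality and taste;
3 Its recommendations are novel.
Although collaborative filtering has been very successful in personalized recommender systems, some of its weaknesses have become apparent as site structure, content complexity, and the number of users keep growing.
The three main ones are:
1 Sparsity: in many recommender systems each user touches only a small amount of the available information. In large systems such as Amazon, a user has rated at most 1% to 2% of the millions of books on offer. The rating matrix is therefore very sparse, which makes it hard to find a set of similar users and greatly degrades recommendation quality.
2 Scalability: the cost of "nearest neighbor" algorithms grows sharply with the number of users and items; with millions of each, typical algorithms run into serious scalability problems.
3 Accuracy: recommendations are produced by finding similar users, and when the numbers are large, the reliability of the recommendations drops accordingly.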

Item-based collaborative filtering and overfitting

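When items can be rated, for example when people can give something 1 to 5 stars, collaborative filtering attempts to predict how a user would rate an unrated item, based on that individual's past ratings and on a (large) database of ratings contributed by other users. For example: if someone gave the Beatles 5 out of 5, can we predict how they would rate the new Celine Dion album?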

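In this setting, item-based collaborative filtering [5][6] predicts the rating of one item from the ratings of another item, typically using linear regression ( $f(x)=ax+b$ ). With n items, there are therefore up to n^2 linear regressions to learn and up to 2n^2 regressors. For example, with 1,000 items there can be up to 1,000,000 linear regressions and up to 2,000,000 regressors. Unless we restrict ourselves to item pairs that several users have both rated, this approach suffers from overfitting [2].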

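A better alternative may be to learn a simpler predictor such as $f(x)=x+b$ : experiments show that this predictor (called Slope One) sometimes outperforms [2] linear regression while using half as many regressors. The simplified approach also reduces storage requirements and latency.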

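Item-based collaborative filtering is only one form of collaborative filtering. Others, such as user-based collaborative filtering, study the relationships between users instead. However, because the number of other users is typically very large, item-based collaborative filtering is the more practical of the two.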

Item-based collaborative filtering in e-commerce

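People do not always provide ratings. When users supply only binary data (whether or not an item was purchased), Slope One and other rating-based algorithms do not apply. One example of binary item-based collaborative filtering is Amazon's patented item-to-item algorithm [7], which represents the user-item purchase matrix as binary vectors and computes the cosine between them.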

Some consider the item-to-item algorithm even simpler than Slope One. For example:

Sample purchase statistics
Customer	Item 1	Item 2	Item 3
John	Bought	Did not buy	Bought
Mark	Did not buy	Bought	Bought
Lucy	Did not buy	Bought	Did not buy

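In this example, each item is a vector of purchases by (John, Mark, Lucy), and the cosine between item 1 and item 2 is:

${\frac {(1,0,0)\cdot (0,1,1)}{\Vert (1,0,0)\Vert \Vert (0,1,1)\Vert }}=0$ ,

the cosine between item 1 and item 3 is:

${\frac {(1,0,0)\cdot (1,1,0)}{\Vert (1,0,0)\Vert \Vert (1,1,0)\Vert }}={\frac {1}{{\sqrt {2}}}}$ ,

and the cosine between item 2 and item 3 is:

${\frac {(0,1,1)\cdot (1,1,0)}{\Vert (0,1,1)\Vert \Vert (1,1,0)\Vert }}={\frac {1}{2}}$ .

Hence, a customer viewing item 1 is recommended item 3 (the item most similar to it), a customer viewing item 2 is also recommended item 3, and a customer viewing item 3 is recommended item 1 first and then item 2, because the cosine between items 2 and 3 is smaller than that between items 1 and 3. The model uses a single parameter per pair of items (the cosine) to make recommendations. With n items, up to n(n-1)/2 cosines therefore need to be computed and stored.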
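The three cosines above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name `CosineSketch`, the method `cosine`, and the choice to return 0 for an item nobody bought are assumptions made for illustration, not part of Amazon's algorithm.

```java
public class CosineSketch {

    // Cosine between two binary purchase vectors (1 = bought, 0 = not bought).
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an item that nobody bought gets similarity 0.
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // One entry per customer, in the order John, Mark, Lucy.
        int[] item1 = {1, 0, 0};
        int[] item2 = {0, 1, 1};
        int[] item3 = {1, 1, 0};
        System.out.println("item 1 vs item 2: " + cosine(item1, item2));
        System.out.println("item 1 vs item 3: " + cosine(item1, item3));
        System.out.println("item 2 vs item 3: " + cosine(item2, item3));
    }
}
```

Up to floating-point rounding, the printed values match 0, 1/√2 and 1/2 from the worked example.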

Slope One collaborative filtering

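To drastically reduce overfitting, improve performance, and ease implementation, the Slope One family of easily implemented item-based collaborative filtering algorithms was proposed. Essentially, instead of a linear regression from one item's ratings to another item's ratings ( $f(x)=ax+b$ ), it uses a simpler form of regression ( $f(x)=x+b$ ) with a single free parameter. That free parameter is simply the average difference between the two items' ratings. It has been shown to be more accurate than linear regression in some instances [2], and it requires half the storage or less.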
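One way to see why the free parameter is an average difference is to fit $b$ by least squares. The derivation below is a sketch under that assumption; the symbols $x_i$ and $y_i$ (the two ratings given by the $i$-th user who rated both items) and $k$ (the number of such users) are introduced here only for illustration.

```latex
\frac{d}{db}\sum_{i=1}^{k}\left(x_i + b - y_i\right)^2
  = 2\sum_{i=1}^{k}\left(x_i + b - y_i\right) = 0
\quad\Longrightarrow\quad
b = \frac{1}{k}\sum_{i=1}^{k}\left(y_i - x_i\right)
```

So the best-fitting $b$ for $f(x)=x+b$ is the mean of the observed rating differences.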

Example:

User A gave Item I a rating of 1 and Item J a rating of 1.5.
User B gave Item I a rating of 2.
What rating do you think User B would give Item J?
The Slope One answer is 2.5 (1.5 - 1 + 2 = 2.5).

For a more realistic example, consider the following table:

Sample rating database
Customer	Item 1	Item 2	Item 3
John	5	3	2
Mark	3	4	Not rated
Lucy	Not rated	2	5

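In this case, the average difference in ratings between item 1 and item 2 is (2+(-1))/2=0.5. Hence, on average, item 1 is rated 0.5 above item 2. Similarly, the average difference between item 1 and item 3 is 3 (only John rated both: 5-2=3). Hence, if we try to predict Lucy's rating for item 1 from her rating for item 2, we get 2+0.5 = 2.5. Similarly, predicting her rating for item 1 from her rating for item 3 gives 5+3=8.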

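If a user has rated several items, the individual predictions are combined using a weighted average, where a good choice for each weight is the number of users who rated both items. In the example above, 2 users rated both item 1 and item 2, and 1 user rated both item 1 and item 3, so the weights are 2 and 1 respectively. We would predict Lucy's rating for item 1 as: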

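${\frac {2\times 2.5+1\times 8}{2+1}}={\frac {13}{3}}\approx 4.33$

Hence, given n items, implementing Slope One only requires computing and storing the average rating difference and the number of common ratings for each pair of items (on the order of n² pairs).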
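The weighted prediction for Lucy can be checked with a small, self-contained Java sketch. It recomputes the pairwise differences on the fly rather than storing them, which is fine for a three-row table but is not how a real system would do it; the class name `SlopeOneTableSketch`, the `predict` signature, and the use of `null` for "not rated" are assumptions made for this illustration.

```java
public class SlopeOneTableSketch {

    // Weighted Slope One prediction of item `target` for user `u`.
    // ratings[user][item]; null marks "not rated" (an assumption of this sketch).
    static double predict(Double[][] ratings, int u, int target) {
        double weightedSum = 0;
        int totalWeight = 0;
        for (int j = 0; j < ratings[u].length; j++) {
            if (j == target || ratings[u][j] == null) {
                continue; // only items the user has rated can contribute
            }
            double diffSum = 0;
            int count = 0; // number of users who rated both target and j
            for (Double[] row : ratings) {
                if (row[target] != null && row[j] != null) {
                    diffSum += row[target] - row[j];
                    count++;
                }
            }
            if (count == 0) {
                continue; // no co-ratings, so no evidence from item j
            }
            double avgDiff = diffSum / count;
            weightedSum += count * (ratings[u][j] + avgDiff);
            totalWeight += count;
        }
        // NaN signals that no prediction is possible.
        return totalWeight == 0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Double[][] ratings = {
            {5.0, 3.0, 2.0},   // John
            {3.0, 4.0, null},  // Mark
            {null, 2.0, 5.0}   // Lucy
        };
        // Lucy is row 2, item 1 is column 0.
        System.out.println("Lucy, item 1: " + predict(ratings, 2, 0));
    }
}
```

The result agrees with 13/3 from the formula above, up to floating-point rounding.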

Java and C# implementations of Slope One

Java implementation:

package test;

import java.util.*;

/**
 * Daniel Lemire A simple implementation of the weighted slope one algorithm in
 * Java for item-based collaborative filtering. Assumes Java 1.5.
 *
 * See main function for example.
 *
 * June 1st 2006. Revised by Marco Ponzi on March 29th 2007
 */

public class SlopeOne {

 public static void main(String args[]) {
   // this is my data base
   Map<UserId, Map<ItemId, Float>> data = new HashMap<UserId, Map<ItemId, Float>>();
   // items
   ItemId item1 = new ItemId("       candy");
   ItemId item2 = new ItemId("         dog");
   ItemId item3 = new ItemId("         cat");
   ItemId item4 = new ItemId("         war");
   ItemId item5 = new ItemId("strange food");

   mAllItems = new ItemId[] { item1, item2, item3, item4, item5 };

   // I'm going to fill it in
   HashMap<ItemId, Float> user1 = new HashMap<ItemId, Float>();
   HashMap<ItemId, Float> user2 = new HashMap<ItemId, Float>();
   HashMap<ItemId, Float> user3 = new HashMap<ItemId, Float>();
   HashMap<ItemId, Float> user4 = new HashMap<ItemId, Float>();
   user1.put(item1, 1.0f);
   user1.put(item2, 0.5f);
   user1.put(item4, 0.1f);
   data.put(new UserId("Bob"), user1);
   user2.put(item1, 1.0f);
   user2.put(item3, 0.5f);
   user2.put(item4, 0.2f);
   data.put(new UserId("Jane"), user2);
   user3.put(item1, 0.9f);
   user3.put(item2, 0.4f);
   user3.put(item3, 0.5f);
   user3.put(item4, 0.1f);
   data.put(new UserId("Jo"), user3);
   user4.put(item1, 0.1f);
   // user4.put(item2,0.4f);
   // user4.put(item3,0.5f);
   user4.put(item4, 1.0f);
   user4.put(item5, 0.4f);
   data.put(new UserId("StrangeJo"), user4);
   // next, I create my predictor engine
   SlopeOne so = new SlopeOne(data);
   System.out.println("Here's the data I have accumulated...");
   so.printData();
   // then, I'm going to test it out...
   HashMap<ItemId, Float> user = new HashMap<ItemId, Float>();
   System.out.println("Ok, now we predict...");
   user.put(item5, 0.4f);
   System.out.println("Inputting...");
   SlopeOne.print(user);
   System.out.println("Getting...");
   SlopeOne.print(so.predict(user));
   //
   user.put(item4, 0.2f);
   System.out.println("Inputting...");
   SlopeOne.print(user);
   System.out.println("Getting...");
   SlopeOne.print(so.predict(user));
 }

 Map<UserId, Map<ItemId, Float>> mData;
 Map<ItemId, Map<ItemId, Float>> mDiffMatrix;
 Map<ItemId, Map<ItemId, Integer>> mFreqMatrix;

 static ItemId[] mAllItems;

 public SlopeOne(Map<UserId, Map<ItemId, Float>> data) {
   mData = data;
   buildDiffMatrix();
 }

 /**
  * Based on existing data, and using weights, try to predict all missing
  * ratings. One way to make this more scalable would be to consider only
  * mDiffMatrix entries having a large (>1) mFreqMatrix entry; this method
  * does not apply that filter.
  * 
  * Items for which no prediction is possible are left out of the result.
  */
 public Map<ItemId, Float> predict(Map<ItemId, Float> user) {
   HashMap<ItemId, Float> predictions = new HashMap<ItemId, Float>();
   HashMap<ItemId, Integer> frequencies = new HashMap<ItemId, Integer>();
   for (ItemId j : mDiffMatrix.keySet()) {
     frequencies.put(j, 0);
     predictions.put(j, 0.0f);
   }
   for (ItemId j : user.keySet()) {
     for (ItemId k : mDiffMatrix.keySet()) {
       // skip item pairs that no user has rated together
       Float diff = mDiffMatrix.get(k).get(j);
       Integer freq = mFreqMatrix.get(k).get(j);
       if (diff == null || freq == null) {
         continue;
       }
       float newval = (diff + user.get(j)) * freq;
       predictions.put(k, predictions.get(k) + newval);
       frequencies.put(k, frequencies.get(k) + freq);
     }
   }
   HashMap<ItemId, Float> cleanpredictions = new HashMap<ItemId, Float>();
   for (ItemId j : predictions.keySet()) {
     if (frequencies.get(j) > 0) {
       cleanpredictions.put(j, predictions.get(j).floatValue()
           / frequencies.get(j).intValue());
     }
   }
   for (ItemId j : user.keySet()) {
     cleanpredictions.put(j, user.get(j));
   }
   return cleanpredictions;
 }

 /**
  * Based on existing data, and not using weights, try to predict all missing
  * ratings. The trick to make this more scalable is to consider only
  * mDiffMatrix entries having a large (>1) mFreqMatrix entry.
  */
 public Map<ItemId, Float> weightlesspredict(Map<ItemId, Float> user) {
   HashMap<ItemId, Float> predictions = new HashMap<ItemId, Float>();
   HashMap<ItemId, Integer> frequencies = new HashMap<ItemId, Integer>();
   for (ItemId j : mDiffMatrix.keySet()) {
     predictions.put(j, 0.0f);
     frequencies.put(j, 0);
   }
   for (ItemId j : user.keySet()) {
     for (ItemId k : mDiffMatrix.keySet()) {
       // skip item pairs that no user has rated together
       Float diff = mDiffMatrix.get(k).get(j);
       if (diff == null) {
         continue;
       }
       predictions.put(k, predictions.get(k) + diff + user.get(j));
       frequencies.put(k, frequencies.get(k) + 1);
     }
   }
   for (ItemId j : mDiffMatrix.keySet()) {
     if (frequencies.get(j) > 0) {
       // plain average over the items that actually contributed
       predictions.put(j, predictions.get(j) / frequencies.get(j));
     } else {
       // no co-rated item: no prediction is possible
       predictions.remove(j);
     }
   }
   for (ItemId j : user.keySet()) {
     predictions.put(j, user.get(j));
   }
   return predictions;
 }

 public void printData() {
   for (UserId user : mData.keySet()) {
     System.out.println(user);
     print(mData.get(user));
   }
   for (int i = 0; i < mAllItems.length; i++) {
     System.out.print("\n" + mAllItems[i] + ":");
     printMatrixes(mDiffMatrix.get(mAllItems[i]),
         mFreqMatrix.get(mAllItems[i]));
   }
 }

 private void printMatrixes(Map<ItemId, Float> ratings,
     Map<ItemId, Integer> frequencies) {
   for (int j = 0; j < mAllItems.length; j++) {
     System.out.format("%10.3f", ratings.get(mAllItems[j]));
     System.out.print(" ");
     System.out.format("%10d", frequencies.get(mAllItems[j]));
   }
   System.out.println();
 }

 public static void print(Map<ItemId, Float> user) {
   for (ItemId j : user.keySet()) {
     System.out.println(" " + j + " --> " + user.get(j).floatValue());
   }
 }

 public void buildDiffMatrix() {
   mDiffMatrix = new HashMap<ItemId, Map<ItemId, Float>>();
   mFreqMatrix = new HashMap<ItemId, Map<ItemId, Integer>>();
   // first iterate through users
   for (Map<ItemId, Float> user : mData.values()) {
     // then iterate through user data
     for (Map.Entry<ItemId, Float> entry : user.entrySet()) {
       if (!mDiffMatrix.containsKey(entry.getKey())) {
         mDiffMatrix.put(entry.getKey(), new HashMap<ItemId, Float>());
         mFreqMatrix.put(entry.getKey(), new HashMap<ItemId, Integer>());
       }
       for (Map.Entry<ItemId, Float> entry2 : user.entrySet()) {
         int oldcount = 0;
         if (mFreqMatrix.get(entry.getKey()).containsKey(entry2.getKey()))
           oldcount = mFreqMatrix.get(entry.getKey()).get(entry2.getKey())
               .intValue();
         float olddiff = 0.0f;
         if (mDiffMatrix.get(entry.getKey()).containsKey(entry2.getKey()))
           olddiff = mDiffMatrix.get(entry.getKey()).get(entry2.getKey())
               .floatValue();
         float observeddiff = entry.getValue() - entry2.getValue();
         mFreqMatrix.get(entry.getKey()).put(entry2.getKey(), oldcount + 1);
         mDiffMatrix.get(entry.getKey()).put(entry2.getKey(),
             olddiff + observeddiff);
       }
     }
   }
   for (ItemId j : mDiffMatrix.keySet()) {
     for (ItemId i : mDiffMatrix.get(j).keySet()) {
       float oldvalue = mDiffMatrix.get(j).get(i).floatValue();
       int count = mFreqMatrix.get(j).get(i).intValue();
       mDiffMatrix.get(j).put(i, oldvalue / count);
     }
   }
 }

}

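class UserId {

 String content;

 public UserId(String s) {
   content = s;
 }

 // value equality, so that equal ids map to the same HashMap entry
 @Override
 public boolean equals(Object o) {
   return o instanceof UserId && content.equals(((UserId) o).content);
 }

 @Override
 public int hashCode() {
   return content.hashCode();
 }

 @Override
 public String toString() {
   return content;
 }

}

class ItemId {

 String content;

 public ItemId(String s) {
   content = s;
 }

 // value equality, so that equal ids map to the same HashMap entry
 @Override
 public boolean equals(Object o) {
   return o instanceof ItemId && content.equals(((ItemId) o).content);
 }

 @Override
 public int hashCode() {
   return content.hashCode();
 }

 @Override
 public String toString() {
   return content;
 }

}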

C# implementation:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SlopeOne
{

   public class Rating
   {
       public float Value { get; set; }
       public int Freq { get; set; }
       public float AverageValue
       {
           get { return Value / Freq; }
       }
   }
   public class RatingDifferenceCollection : Dictionary<string, Rating>
   {
       private string GetKey(int Item1Id, int Item2Id)
       {
            // Order the ids so that (a, b) and (b, a) share one key
            return (Item1Id < Item2Id) ? Item1Id + "/" + Item2Id : Item2Id + "/" + Item1Id;
       }
       public bool Contains(int Item1Id, int Item2Id)
       {
            return this.ContainsKey(GetKey(Item1Id, Item2Id));
       }
       public Rating this[int Item1Id, int Item2Id]
       {
           get {
                   return this[this.GetKey(Item1Id, Item2Id)];
           }
           set { this[this.GetKey(Item1Id, Item2Id)] = value; }
       }
   }
    public class SlopeOne
   {       
       public RatingDifferenceCollection _DiffMarix = new RatingDifferenceCollection();  // The dictionary to keep the diff matrix
       public HashSet<int> _Items = new HashSet<int>();  // Tracking how many items totally
       public void AddUserRatings(IDictionary<int, float> userRatings)
       {
           foreach (var item1 in userRatings)
           {
               int item1Id = item1.Key;
               float item1Rating = item1.Value;
               _Items.Add(item1.Key);
               foreach (var item2 in userRatings)
               {
                   if (item2.Key <= item1Id) continue; // Eliminate redundancy
                   int item2Id = item2.Key;
                   float item2Rating = item2.Value;
                   Rating ratingDiff;
                   if (_DiffMarix.Contains(item1Id, item2Id))
                   {
                       ratingDiff = _DiffMarix[item1Id, item2Id];
                   }
                   else
                   {
                       ratingDiff = new Rating();
                       _DiffMarix[item1Id, item2Id] = ratingDiff;
                   }
                    // Accumulate the difference and the number of co-ratings
                    ratingDiff.Value += item1Rating - item2Rating;
                    ratingDiff.Freq += 1;
               }
           }
       }
       // Input ratings of all users
       public void AddUerRatings(IList<IDictionary<int, float>> Ratings)
       {
           foreach(var userRatings in Ratings)
           {
               AddUserRatings(userRatings);
           }
       }
       public IDictionary<int, float> Predict(IDictionary<int, float> userRatings)
       {
           Dictionary<int, float> Predictions = new Dictionary<int, float>();
           foreach (var itemId in this._Items)
           {
               if (userRatings.Keys.Contains(itemId))    continue; // User has rated this item, just skip it
               Rating itemRating = new Rating();
               foreach (var userRating in userRatings)
               {
                   if (userRating.Key == itemId) continue;
                   int inputItemId = userRating.Key;
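                    if (_DiffMarix.Contains(itemId, inputItemId))
                    {
                        Rating diff = _DiffMarix[itemId, inputItemId];
                        // Stored differences are (lower id - higher id), so flip the sign when needed
                        itemRating.Value += diff.Freq * (userRating.Value + diff.AverageValue * ((itemId < inputItemId) ? 1 : -1));
                        itemRating.Freq += diff.Freq;
                    }
                }
                // Skip items with no co-rated pair; dividing by a zero Freq would yield NaN
                if (itemRating.Freq > 0) Predictions.Add(itemId, itemRating.AverageValue);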
           }
           return Predictions;
       }
       public static void Test()
       {
           SlopeOne test = new SlopeOne();
           Dictionary<int, float> userRating = new Dictionary<int, float>();
           userRating.Add(1, 5);
           userRating.Add(2, 4);
           userRating.Add(3, 4);
           test.AddUserRatings(userRating);
           userRating = new Dictionary<int, float>();
           userRating.Add(1, 4);
           userRating.Add(2, 5);
           userRating.Add(3, 3);
           userRating.Add(4, 5);
           test.AddUserRatings(userRating);
           userRating = new Dictionary<int, float>();
           userRating.Add(1, 4);
           userRating.Add(2, 4);
           userRating.Add(4, 5);
           test.AddUserRatings(userRating);
           userRating = new Dictionary<int, float>();
           userRating.Add(1, 5);
           userRating.Add(3, 4);
           IDictionary<int, float> Predictions = test.Predict(userRating);
           foreach (var rating in Predictions)
           {
                Console.WriteLine("Item " + rating.Key + " Rating: " + rating.Value);
           }
       }
   }

}

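Algorithmic complexity of Slope One

Suppose there are n items, m users, and N ratings. Computing the average rating difference for each pair of items requires up to n(n-1)/2 units of storage and up to m n² time steps. This bound may be pessimistic: if each user has rated at most y items, the differences can be computed in no more than n² + m y² steps. If a user has rated x items, predicting a single rating takes x steps, and predicting all of that user's missing ratings takes up to (n-x)x steps. When a user who has already rated x items adds a new rating, updating the database takes x steps.

Storage requirements can be reduced by partitioning the data or by using sparse storage: pairs of items that no user has rated in common can be omitted.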
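The x-step update cost can be illustrated with a minimal Java sketch of an incremental update. It is not taken from either implementation above: the class `IncrementalUpdateSketch`, the `"a/b"` string key, the `double[]` accumulator, and the sample rating of 4.0 are all hypothetical, and the code assumes Java 8 or later and that the item being added has not been rated by this user before.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalUpdateSketch {

    // Per item pair "a/b" with a < b: [0] = sum of (rating of a - rating of b),
    // [1] = number of users who rated both. Pairs never co-rated are simply absent
    // (sparse storage).
    final Map<String, double[]> pairs = new HashMap<>();

    // Registers a new rating by a user whose earlier ratings are in `existing`.
    // Returns the number of pairs touched: one per earlier rating.
    int addRating(Map<Integer, Double> existing, int item, double rating) {
        int touched = 0;
        for (Map.Entry<Integer, Double> e : existing.entrySet()) {
            int other = e.getKey();
            if (other == item) {
                continue; // assumption: re-rating an item is out of scope here
            }
            int a = Math.min(item, other);
            int b = Math.max(item, other);
            double diff = (a == item) ? rating - e.getValue() : e.getValue() - rating;
            double[] acc = pairs.computeIfAbsent(a + "/" + b, key -> new double[2]);
            acc[0] += diff;
            acc[1] += 1;
            touched++;
        }
        existing.put(item, rating);
        return touched;
    }

    public static void main(String[] args) {
        IncrementalUpdateSketch db = new IncrementalUpdateSketch();
        Map<Integer, Double> lucy = new HashMap<>();
        lucy.put(2, 2.0);
        lucy.put(3, 5.0);
        // Hypothetical new rating: Lucy gives item 1 a 4.0.
        int steps = db.addRating(lucy, 1, 4.0);
        System.out.println("pairs updated: " + steps);
    }
}
```

With two earlier ratings, the new rating touches two pairs, which is the x-step update described above.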

Recommender systems that use Slope One

AllTheBests shopping recommendation engine

Footnotes

[2] Daniel Lemire, Anna Maclachlan, Slope One Predictors for Online Rating-Based Collaborative Filtering, In SIAM Data Mining (SDM'05), Newport Beach, California, April 21-23, 2005.
[5] Slobodan Vucetic, Zoran Obradovic: Collaborative Filtering Using a Regression-Based Approach. Knowl. Inf. Syst. 7(1): 1-22 (2005)
[6] Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Item-based collaborative filtering recommendation algorithms. WWW 2001: 285-295
[7] Greg Linden, Brent Smith, Jeremy York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 07, no. 1, pp. 76-80, Jan/Feb, 2003