[Gandalf] Running the Distributed ItemCF Recommendation Algorithm with Mahout 0.9 + CDH 5.2

阿新 • Published: 2019-01-12

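Environment:
hadoop-2.5.0-cdh5.2.0
mahout-0.9-cdh5.2.0

Introduction

Although Mahout has announced that it will stop developing on MapReduce and migrate to Spark, the reality is that our company's cluster does not have enough memory to support Spark, a beast that eats memory for breakfast. Add the schedule pressure on the project and the current skill set of the developers, and we have no choice but to keep using Mahout for a while longer. This post records how to run ItemCF on Hadoop from the command line.

History

I had previously read a number of articles by others about programming Mahout ItemCF on Hadoop. They all describe how to implement ItemCF on Hadoop by writing code against Mahout. Since I had no time to look into it myself, I simply followed that programmatic approach, for example with the following code, which appears again and again across many blogs: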

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class ItemCFHadoop {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ItemCFHadoop.class);
            // Let Hadoop consume its generic options; what remains are this program's own arguments.
            GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
            String[] remainingArgs = optionParser.getRemainingArgs();
            if (remainingArgs.length != 5) {
                System.out.println("args length: " + remainingArgs.length);
                System.err.println("Usage: hadoop jar <jarname> <package>.ItemCFHadoop <inputpath> <outputpath> <tmppath> <booleanData> <similarityClassname>");
                System.exit(2);
            }
            System.out.println("input : " + remainingArgs[0]);
            System.out.println("output : " + remainingArgs[1]);
            System.out.println("tempdir : " + remainingArgs[2]);
            System.out.println("booleanData : " + remainingArgs[3]);
            System.out.println("similarityClassname : " + remainingArgs[4]);
            // Rebuild the same option string that RecommenderJob accepts on the command line.
            StringBuilder sb = new StringBuilder();
            sb.append("--input ").append(remainingArgs[0]);
            sb.append(" --output ").append(remainingArgs[1]);
            sb.append(" --tempDir ").append(remainingArgs[2]);
            sb.append(" --booleanData ").append(remainingArgs[3]);
            sb.append(" --similarityClassname ").append(remainingArgs[4]);
            conf.setJobName("ItemCFHadoop");
            RecommenderJob job = new RecommenderJob();
            job.setConf(conf);
            // Splitting on spaces means none of the arguments may contain a space.
            job.run(sb.toString().split(" "));
        }
    }

The code above does run: as long as the correct arguments are passed on the command line, it completes the ItemCF on Hadoop job. But by this logic the Java code is merely doing the work of the command line, so why not run the job from the command line directly?

Official documentation

Those who came before pointed the way: the ItemCF on Hadoop job is implemented by the class org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. The official site (https://builds.apache.org/job/Mahout-Quality/javadoc/) describes org.apache.mahout.cf.taste.hadoop.item.RecommenderJob as follows:

Runs a completely distributed recommender job as a series of mapreduces. Preferences in the input file should look like userID, itemID[, preferencevalue]

Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).

The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.

Command line arguments specific to this class are:

--input (path): Directory containing one or more text files with the preference data
--output (path): output path where recommender output should go
--tempDir (path): Specifies a directory where the job may place temp files (default "temp")
--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure
--usersFile (path): only compute recommendations for user IDs contained in this file (optional)
--itemsFile (path): only include item IDs from this file in the recommendations (optional)
--filterFile (path): file containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)
--numRecommendations (integer): Number of recommendations to compute per user (10)
--booleanData (boolean): Treat input data as having no pref values (false)
--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)
--minPrefsPerUser (integer): ignore users with less preferences than this in the similarity computation (1)
--maxPrefsPerUserInItemSimilarity (integer): max number of preferences to consider per user in the item similarity computation phase, users with more preferences will be sampled down (1000)
--threshold (double): discard item pairs with a similarity value below this

The original text is kept above for reference. In my own words: the class runs a fully distributed recommendation job, implemented as a series of MapReduce jobs. Preference data in the input files has the format userID, itemID[, preferencevalue], where preferencevalue is optional. userID and itemID are parsed as long, and preferencevalue is parsed as double. The class accepts the following command-line arguments:

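--input (path): Directory holding the user preference data; it may contain one or more text files of preference data.
--output (path): Output directory for the computed results.
--tempDir (path): Directory for temporary files.
--similarityClassname (classname): Vector similarity class. The available similarity implementations include CityBlockSimilarity, CooccurrenceCountSimilarity, CosineSimilarity, EuclideanDistanceSimilarity, LoglikelihoodSimilarity, PearsonCorrelationSimilarity and TanimotoCoefficientSimilarity (the same package also contains CountbasedMeasure, which is an abstract base class rather than a measure to select directly). When passing a class name, include the package name.
--usersFile (path): Path containing one or more files of userIDs; recommendations are computed only for the userIDs found in those files (optional).
--itemsFile (path): Path containing one or more files of itemIDs; only the itemIDs found in those files are included in the recommendations (optional).
--filterFile (path): Path whose files contain userID,itemID pairs, separated by a comma. The results will not recommend to a user any item paired with that user in these files (optional).
--numRecommendations (integer): Number of items to recommend per user; default 10.
--booleanData (boolean): Set to true if the input data contains no preference values; default false.
--maxPrefsPerUser (integer): Maximum number of preferences considered per user in the final recommendation phase; default 10.
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item; default 100.
--minPrefsPerUser (integer): In the similarity computation, ignore users with fewer preferences than this value; default 1.
--maxPrefsPerUserInItemSimilarity (integer): Maximum number of preferences considered per user in the item similarity computation phase (users with more are sampled down); default 1000.
--threshold (double): Discard item pairs whose similarity is below this threshold.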
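To make the input format concrete (userID, itemID[, preferencevalue], with IDs parsed as long and the optional value as double), here is a minimal Java sketch of parsing one preference line. The class PreferenceLine and its parse method are hypothetical names invented for illustration and are not part of Mahout; tolerating spaces around the commas is a choice of this sketch, not something verified against Mahout's own parser.

```java
import java.util.OptionalDouble;

/** Hypothetical helper (not part of Mahout): one parsed line of preference data. */
final class PreferenceLine {
    final long userId;
    final long itemId;
    final OptionalDouble value; // empty when the line has no preference value (boolean-style data)

    private PreferenceLine(long userId, long itemId, OptionalDouble value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }

    /** Parses "userID,itemID[,preferencevalue]"; spaces around commas are tolerated (sketch assumption). */
    static PreferenceLine parse(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("Expected userID,itemID[,preferencevalue]: " + line);
        }
        long userId = Long.parseLong(fields[0]);   // IDs are parsed as longs
        long itemId = Long.parseLong(fields[1]);
        OptionalDouble value = fields.length == 3
                ? OptionalDouble.of(Double.parseDouble(fields[2])) // value is parsed as a double
                : OptionalDouble.empty();
        return new PreferenceLine(userId, itemId, value);
    }

    public static void main(String[] args) {
        PreferenceLine withValue = parse("1,101,2");
        PreferenceLine withoutValue = parse("3,107");
        System.out.println(withValue.userId + " " + withValue.itemId + " " + withValue.value);
        System.out.println(withoutValue.userId + " " + withoutValue.itemId + " " + withoutValue.value);
    }
}
```

A line without a third field corresponds to the case where --booleanData would be set to true.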

Running from the command line

User preference data used for testing [userID, itemID, preferencevalue]:

    1,101,2
    1,102,5
    1,103,1
    2,101,1
    2,102,3
    2,103,2
    2,104,6
    3,101,1
    3,104,1
    3,105,1
    3,107,2
    4,101,2
    4,103,2
    4,104,5
    4,106,3
    5,101,3
    5,102,5
    5,103,6
    5,104,8
    5,105,1
    5,106,1

Once the underlying environment is fully configured, running the following command performs the ItemCF on Hadoop recommendation computation:

    hadoop jar $MAHOUT_HOME/mahout-core-0.9-cdh5.2.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /UserPreference --output /CFOutput --tempDir /tmp --similarityClassname org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity

Note: only the most important arguments are used here. Tuning the remaining arguments requires testing against the actual project.

Computed results [userID [itemID1:score1,itemID2:score2......]]:

    1 [104:3.4706533,106:1.7326527,105:1.5989419]
    2 [106:3.8991857,105:3.691359]
    3 [106:1.0,103:1.0,102:1.0]
    4 [105:3.2909648,102:3.2909648]
    5 [107:3.2898135]
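For downstream use, the result lines shown above can be read back into a structure. The following is a minimal Java sketch under stated assumptions: RecommendationLine is a hypothetical class invented here (not a Mahout API), and the whitespace between the userID and the bracketed list is treated as optional, since the exact separator in the output files is not confirmed by the listing above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): one parsed line of recommender output. */
final class RecommendationLine {
    final long userId;
    final Map<Long, Double> scores; // itemID -> score, kept in the order listed

    private RecommendationLine(long userId, Map<Long, Double> scores) {
        this.userId = userId;
        this.scores = scores;
    }

    /** Parses "userID [itemID1:score1,itemID2:score2,...]"; whitespace before '[' is optional. */
    static RecommendationLine parse(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Expected userID [item:score,...]: " + line);
        }
        long userId = Long.parseLong(line.substring(0, open).trim());
        Map<Long, Double> scores = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (!body.isEmpty()) {
            for (String pair : body.split(",")) {
                String[] kv = pair.trim().split(":");
                if (kv.length != 2) {
                    throw new IllegalArgumentException("Expected itemID:score but got: " + pair);
                }
                scores.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
            }
        }
        return new RecommendationLine(userId, scores);
    }

    public static void main(String[] args) {
        RecommendationLine rec = parse("2 [106:3.8991857,105:3.691359]");
        System.out.println("user " + rec.userId + " -> " + rec.scores);
    }
}
```

A LinkedHashMap is used so that the items stay in the order in which they appear on the line.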
