Recommendations with Neo4j (7): Content-Based Similarity Metrics
阿新 • Published: 2018-11-16
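Similarity metrics are an important component of personalized recommendations: they let us quantify how similar two items are (or, as we will see later, how similar two users' preferences are).
The Jaccard index is a number between 0 and 1 that indicates how similar two sets are.
- The Jaccard index of two identical sets is 1.
- If two sets have no elements in common, the Jaccard index is 0.
- The Jaccard index is computed by dividing the size of the intersection of the two sets by the size of their union.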
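The three properties above can be checked with a few lines of Python. This is only a minimal sketch: the function name `jaccard` and the genre sets are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here, not something the definition prescribes.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets are empty, so the ratio is undefined; 0.0 is an assumed convention.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical genre sets, for illustration only.
print(jaccard({"Action", "Sci-Fi"}, {"Action", "Sci-Fi"}))             # identical sets
print(jaccard({"Action", "Sci-Fi"}, {"Comedy", "Romance"}))            # no common element
print(jaccard({"Action", "Sci-Fi", "Thriller"}, {"Action", "Drama"}))  # 1 shared out of 4
```

The last call shares one genre out of four distinct genres overall, so it yields 0.25.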
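We can compute the Jaccard index of the genre sets of two movies to determine how similar the movies are.
Which movies are most similar to Inception according to the Jaccard index?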
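// Movies that share at least one genre with Inception
MATCH (m:Movie {title: "Inception"})-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(other:Movie)
WITH m, other, COUNT(g) AS intersection
// s1: all genres of Inception
MATCH (m)-[:IN_GENRE]->(mg:Genre)
WITH m, other, intersection, COLLECT(mg.name) AS s1
// s2: all genres of the other movie
MATCH (other)-[:IN_GENRE]->(og:Genre)
WITH m, other, intersection, s1, COLLECT(og.name) AS s2
WITH m, other, intersection, s1 + [x IN s2 WHERE NOT x IN s1] AS union, s1, s2
RETURN m.title, other.title, s1, s2, ((1.0 * intersection) / SIZE(union)) AS jaccard
ORDER BY jaccard DESC
LIMIT 100
Analysis:
1. First, the query finds Inception and the set of movies (other) that share at least one genre with it.
2. COUNT(g) is the number of genres that Inception and a given movie in other have in common, that is, the size of the intersection.
3. s1 + [x IN s2 WHERE NOT x IN s1] AS union builds the union of s1 and s2: all of s1 plus the elements of s2 that are not already in s1.
4. ((1.0 * intersection) / SIZE(union)) AS jaccard computes the index with the Jaccard formula given above. Multiplying by 1.0 forces floating-point division; dividing two integers in Cypher would truncate the result.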
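To make steps 2 to 4 concrete, here is a small Python sketch that mirrors them on plain lists. The two genre lists are hypothetical stand-ins for what the query collects into s1 and s2; they are not taken from the actual dataset.

```python
# Hypothetical genre lists standing in for the collected s1 and s2.
s1 = ["Action", "Sci-Fi", "Thriller"]  # genres of the reference movie
s2 = ["Action", "Thriller", "Drama"]   # genres of one candidate movie

# Step 2: number of shared genres, the role COUNT(g) plays in the query.
intersection = len([x for x in s1 if x in s2])

# Step 3: s1 plus the elements of s2 that are not already in s1.
union = s1 + [x for x in s2 if x not in s1]

# Step 4: the 1.0 factor mirrors the query, where it prevents integer division.
# (Python 3's / is already floating-point division.)
jaccard = (1.0 * intersection) / len(union)

print(intersection)  # 2
print(union)         # ['Action', 'Sci-Fi', 'Thriller', 'Drama']
print(jaccard)       # 0.5
```

Building the union by appending only the missing elements works because each list holds a genre at most once; with duplicates inside s1, the size of the union would be overstated.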
Running the query returns up to 100 movies, ordered from the highest Jaccard index to the lowest.
We can apply the same approach to all of a movie's features (genres, actors, directors, and so on):
// Both hops are undirected: IN_GENRE points away from the movie, while
// ACTED_IN and DIRECTED are assumed to point toward it.
MATCH (m:Movie {title: "Inception"})-[:IN_GENRE|ACTED_IN|DIRECTED]-(t)-[:IN_GENRE|ACTED_IN|DIRECTED]-(other:Movie)
WHERE other <> m
// DISTINCT guards against counting a person twice (e.g. someone who both acted and directed)
WITH m, other, COUNT(DISTINCT t) AS intersection
MATCH (m)-[:IN_GENRE|ACTED_IN|DIRECTED]-(mt)
WITH m, other, intersection, COLLECT(DISTINCT mt.name) AS s1
MATCH (other)-[:IN_GENRE|ACTED_IN|DIRECTED]-(ot)
WITH m, other, intersection, s1, COLLECT(DISTINCT ot.name) AS s2
WITH m, other, intersection, s1 + [x IN s2 WHERE NOT x IN s1] AS union, s1, s2
RETURN m.title, other.title, s1, s2, ((1.0 * intersection) / SIZE(union)) AS jaccard
ORDER BY jaccard DESC
LIMIT 100
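The same idea can be sketched outside the database to show what the combined comparison does. Everything below is an assumption made for illustration: the movie titles other than Inception, the actor and director names, and the helper names `feature_set` and `most_similar` are placeholders, not data or API from the graph.

```python
def jaccard(a, b):
    """Jaccard index of two sets; 0.0 for two empty sets (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical feature data: all names and values are placeholders.
features = {
    "Inception": {"genres": {"Action", "Sci-Fi"},
                  "actors": {"Actor A", "Actor B"},
                  "directors": {"Director X"}},
    "Movie 1": {"genres": {"Action", "Sci-Fi"},
                "actors": {"Actor A"},
                "directors": {"Director X"}},
    "Movie 2": {"genres": {"Action"},
                "actors": {"Actor C"},
                "directors": {"Director Y"}},
    "Movie 3": {"genres": {"Comedy"},
                "actors": {"Actor D"},
                "directors": {"Director Z"}},
}


def feature_set(movie):
    """Merge all feature kinds into one set, tagging each value with its kind."""
    return {(kind, value) for kind, values in movie.items() for value in values}


def most_similar(title, features, limit=100):
    """Rank every other movie by Jaccard index over the merged feature sets."""
    target = feature_set(features[title])
    scores = [(other, jaccard(target, feature_set(f)))
              for other, f in features.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]


for other, score in most_similar("Inception", features):
    print(other, round(score, 3))
```

Tagging each value with its kind keeps a genre and a person with the same name from being treated as one feature; the query above compares names only, so it does not make that distinction.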