Recommendations with Neo4j (7): Content-Based Similarity Metrics

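Similarity metrics are an important component of generating personalized recommendations. They let us quantify how similar two items are (or, as we will see later, how similar two users' preferences are).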

http://guides.neo4j.com/sandbox/recommendations/img/jaccard.png

The Jaccard index is a number between 0 and 1 that indicates how similar two sets are.

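  1. The Jaccard index of two identical sets is 1.
  2. If two sets have no elements in common, the Jaccard index is 0.
  3. The Jaccard index is computed by dividing the size of the intersection of the two sets by the size of their union.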
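Written out as a formula, the three points above look as follows. The symbols A and B for the two sets, and the example genre sets, are notation introduced here for illustration and do not come from the original guide.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1

% Worked example with two hypothetical genre sets:
% A = \{\text{Action}, \text{Sci-Fi}, \text{Thriller}\}, \quad
% B = \{\text{Action}, \text{Thriller}, \text{Drama}\}
% |A \cap B| = 2, \quad |A \cup B| = 4
J(A, B) = \frac{2}{4} = 0.5
```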

We can compute the Jaccard index of the genre sets of two movies to determine how similar the movies are.
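Before turning to Cypher, the idea can be sketched in a few lines of Python. This is only a minimal illustration: the function name `jaccard` and the three genre sets are hypothetical placeholders, not data from the Neo4j sandbox.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: the index is undefined, so this sketch
        # returns 0.0 by convention (an assumption, not part of the guide).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical genre sets, for illustration only.
movie_a = {"Action", "Sci-Fi", "Thriller"}
movie_b = {"Action", "Thriller", "Drama"}
movie_c = {"Comedy", "Romance"}

print(jaccard(movie_a, movie_a))  # identical sets -> 1.0
print(jaccard(movie_a, movie_b))  # 2 shared out of 4 distinct -> 0.5
print(jaccard(movie_a, movie_c))  # nothing shared -> 0.0
```

The two boundary cases match the properties listed above: identical sets score 1 and disjoint sets score 0.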

Which movies are most similar to Inception, based on the Jaccard index of their genres?

// Movies that share at least one genre with Inception; COUNT(g) is the number of shared genres
MATCH (m:Movie {title: "Inception"})-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(other:Movie)
WITH m, other, COUNT(g) AS intersection
// s1: all genres of Inception
MATCH (m)-[:IN_GENRE]->(mg:Genre)
WITH m, other, intersection, COLLECT(mg.name) AS s1
// s2: all genres of the candidate movie
MATCH (other)-[:IN_GENRE]->(og:Genre)
WITH m, other, intersection, s1, COLLECT(og.name) AS s2
// union = s1 plus the elements of s2 that are not already in s1
// (a list comprehension replaces filter(), which was removed in Neo4j 4.0)
WITH m, other, intersection, s1 + [x IN s2 WHERE NOT x IN s1] AS union, s1, s2
RETURN m.title, other.title, s1, s2, ((1.0 * intersection) / SIZE(union)) AS jaccard
ORDER BY jaccard DESC LIMIT 100

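Analysis:

1. The first MATCH finds Inception and every movie (other) that shares at least one genre with it.

2. COUNT(g) is the size of the genre intersection, that is, the number of genres Inception has in common with each movie in other.

3. The expression bound to union takes s1 and appends the elements of s2 that are not already in s1, which yields the union of s1 and s2.

4. ((1.0*intersection)/SIZE(union)) AS jaccard applies the Jaccard formula above. Multiplying by 1.0 forces floating-point division; dividing two integers in Cypher would truncate the result to an integer.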
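The four steps can also be traced outside the database. The Python sketch below mirrors the query step by step on a small in-memory table. The titles, the genre lists and the function name `rank_by_genre_jaccard` are all hypothetical and exist only to make the arithmetic visible.

```python
# Hypothetical movie -> genre lists; titles and genres are placeholders.
genres = {
    "Target": ["Action", "Sci-Fi", "Thriller"],
    "Movie B": ["Action", "Thriller", "Drama"],
    "Movie C": ["Action", "Sci-Fi", "Thriller"],
    "Movie D": ["Comedy", "Action"],
    "Movie E": ["Romance"],
}


def rank_by_genre_jaccard(target, genres, limit=100):
    s1 = genres[target]
    rows = []
    for other, s2 in genres.items():
        if other == target:
            continue
        # Step 2: number of shared genres (COUNT(g) in the query).
        intersection = sum(1 for g in s1 if g in s2)
        if intersection == 0:
            # Step 1: the MATCH only returns movies that share a genre.
            continue
        # Step 3: union = s1 plus the elements of s2 not already in s1.
        union = s1 + [x for x in s2 if x not in s1]
        # Step 4: the Jaccard index (Python's / is already float division).
        rows.append((other, intersection / len(union)))
    # ORDER BY jaccard DESC LIMIT ...
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


ranking = rank_by_genre_jaccard("Target", genres)
for title, score in ranking:
    print(title, score)
```

With this toy data, "Movie C" has exactly the same genres as the target and ranks first with a score of 1.0, while "Movie E" shares no genre and never appears in the result, just as it would not be matched by the query.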

The query returns the following results:

[Figure: query results]

We can apply the same approach to all of a movie's features (genres, actors, directors, and so on):

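// Shared traits t: genres, actors and directors connected to both movies.
// The second hop is undirected, assuming the sandbox model in which ACTED_IN and DIRECTED
// point from a person to a movie; a left-pointing arrow there would only ever match genres.
MATCH (m:Movie {title: "Inception"})-[:IN_GENRE|ACTED_IN|DIRECTED]-(t)-[:IN_GENRE|ACTED_IN|DIRECTED]-(other:Movie)
WHERE other <> m
// DISTINCT avoids counting a person twice when they both acted in and directed a movie
WITH m, other, COUNT(DISTINCT t) AS intersection
MATCH (m)-[:IN_GENRE|ACTED_IN|DIRECTED]-(mt)
WITH m, other, intersection, COLLECT(DISTINCT mt.name) AS s1
MATCH (other)-[:IN_GENRE|ACTED_IN|DIRECTED]-(ot)
WITH m, other, intersection, s1, COLLECT(DISTINCT ot.name) AS s2
// union = s1 plus the elements of s2 that are not already in s1
// (a list comprehension replaces filter(), which was removed in Neo4j 4.0)
WITH m, other, intersection, s1 + [x IN s2 WHERE NOT x IN s1] AS union, s1, s2
RETURN m.title, other.title, s1, s2, ((1.0 * intersection) / SIZE(union)) AS jaccard
ORDER BY jaccard DESC LIMIT 100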
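To see how pooling several kinds of features changes the score, here is a small Python sketch of the same idea. Everything in it is a placeholder: the movie records, the field names `genres`, `actors` and `directors`, and the helper names `traits` and `trait_jaccard` are assumptions made for illustration. Unlike the query, which compares plain names, this sketch tags each value with its kind so that a genre and a person who happen to share a name stay distinct.

```python
# Hypothetical feature data; every name below is a placeholder.
movies = {
    "Target": {
        "genres": ["Sci-Fi", "Thriller"],
        "actors": ["Actor 1", "Actor 2"],
        "directors": ["Director 1"],
    },
    "Movie B": {
        "genres": ["Sci-Fi"],
        "actors": ["Actor 2", "Actor 3"],
        "directors": ["Director 1"],
    },
    "Movie C": {
        "genres": ["Thriller"],
        "actors": ["Actor 4"],
        "directors": ["Director 2"],
    },
}


def traits(movie):
    # Pool all features into one set of (kind, value) pairs.
    return {(kind, value) for kind, values in movie.items() for value in values}


def trait_jaccard(a, b):
    ta, tb = traits(a), traits(b)
    union = ta | tb
    # Empty union: undefined, returned as 0.0 by convention in this sketch.
    return len(ta & tb) / len(union) if union else 0.0


scores = {
    title: trait_jaccard(movies["Target"], movie)
    for title, movie in movies.items()
    if title != "Target"
}
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(title, round(score, 3))
```

In this toy data "Movie B" shares one genre, one actor and the director with the target (3 of 6 distinct traits), so it scores 0.5, whereas "Movie C" shares only a genre (1 of 7 distinct traits).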