Elasticsearch Best Practices for Beginners (54): Relevance Scoring, the TF & IDF Algorithm Demystified
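1. Algorithm overview
Put simply, the relevance score algorithm calculates how closely the text in an index matches the search text.
Elasticsearch uses the term frequency / inverse document frequency algorithm, TF/IDF for short. (Since version 5.0 the default similarity is BM25, a refinement of TF/IDF built on the same three factors described here; the explain output in section 2 comes from BM25.)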
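Term frequency: how many times each term of the search text occurs in the field text. The more occurrences, the more relevant the document.
Search request: hello world
doc1: hello you, and world is very good
doc2: hello, how are you
doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.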
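The term frequency example can be checked with a few lines of Python. This is only a sketch: the helper name term_frequencies and the lowercase, alphanumeric tokenizer are assumptions made for illustration, not how an Elasticsearch analyzer actually works.

```python
import re
from collections import Counter

def term_frequencies(query: str, field_text: str) -> dict:
    """Count how often each query term occurs in the field text."""
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    tokenize = lambda text: re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokenize(field_text))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

print(term_frequencies("hello world", doc1))  # {'hello': 1, 'world': 1}
print(term_frequencies("hello world", doc2))  # {'hello': 1, 'world': 0}
```

doc1 has a non-zero count for both query terms, doc2 for only one of them, which is why doc1 ranks higher on this factor.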
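Inverse document frequency: how many times each term of the search text occurs across all documents in the index. The more occurrences, the less weight a match on that term carries.
Search request: hello world
doc1: hello, today is very good
doc2: hi world, how are you
Suppose the index holds 10,000 documents, hello occurs 1,000 times in total across them, and world occurs 100 times in total.
doc2 is more relevant: it matches world, the rarer term.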
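To put numbers on this example, here is a rough sketch using the classic Lucene-style formula idf = 1 + ln(docCount / (docFreq + 1)). Two assumptions are made: the occurrence counts are treated as document frequencies (each occurrence in a different document), and the function name classic_idf is made up for illustration.

```python
import math

def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic TF/IDF-style inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))

DOC_COUNT = 10_000  # documents in the index

idf_hello = classic_idf(doc_freq=1_000, doc_count=DOC_COUNT)
idf_world = classic_idf(doc_freq=100, doc_count=DOC_COUNT)

print(f"hello: {idf_hello:.2f}")  # hello: 3.30
print(f"world: {idf_world:.2f}")  # world: 5.60
```

Because world is ten times rarer than hello, a match on world contributes more to the score, so doc2 comes out ahead.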
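Field-length norm: the length of the field. The longer the field, the weaker the relevance.
Search request: hello world
doc1: { "title": "hello article", "content": "babaaba 10,000 words" }
doc2: { "title": "my article", "content": "blablabala 10,000 words, hi world" }
Assume hello and world occur equally often across the whole index.
doc1 is more relevant: its match is in the title field, which is much shorter than the content field.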
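The effect of field length can be sketched with the classic Lucene-style norm, 1 / sqrt(number of terms in the field). The function name classic_length_norm and the round figure of 10,000 terms for the content field are assumptions taken from the example, not measured values.

```python
import math

def classic_length_norm(num_terms: int) -> float:
    """Classic length norm: a match in a short field counts for more."""
    return 1.0 / math.sqrt(num_terms)

title_norm = classic_length_norm(2)         # "hello article": 2 terms
content_norm = classic_length_norm(10_000)  # content field: about 10,000 terms

print(f"{title_norm:.4f}")    # 0.7071
print(f"{content_norm:.4f}")  # 0.0100
```

A hit in the two-word title is weighted roughly seventy times more than a hit buried in the long content field, which matches the conclusion that doc1 is more relevant.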
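2. How _score is calculated
Add the explain parameter to a search request to see in detail how each hit's _score was calculated. The response looks like this: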
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 0.51623213, "hits": [ { "_shard": "[website][3]", "_node": "iYthfUSYTs-1R27gsnmCpA", "_index": "website", "_type": "article", "_id": "1", "_score": 0.51623213, "_source": { "title": "one article", "content": "this is my one article", "post_date": "2017-01-01", "author_id": 111 }, "_explanation": { "value": 0.51623213, "description": "sum of:", "details": [ { "value": 0.51623213, "description": "sum of:", "details": [ { "value": 0.25811607, "description": "weight(title:one in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.25811607, "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 1, "description": "docFreq", "details": [] }, { "value": 1, "description": "docCount", "details": [] } ] }, { "value": 0.89722675, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 2, "description": "avgFieldLength", "details": [] }, { "value": 2.56, "description": "fieldLength", "details": [] } ] } ] } ] }, { "value": 0.25811607, "description": "weight(title:article in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.25811607, "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 1, "description": "docFreq", "details": [] }, { "value": 1, "description": "docCount", "details": [] } ] }, { "value": 0.89722675, 
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 2, "description": "avgFieldLength", "details": [] }, { "value": 2.56, "description": "fieldLength", "details": [] } ] } ] } ] } ] }, { "value": 0, "description": "match on required clause, product of:", "details": [ { "value": 0, "description": "# clause", "details": [] }, { "value": 1, "description": "*:*, product of:", "details": [ { "value": 1, "description": "boost", "details": [] }, { "value": 1, "description": "queryNorm", "details": [] } ] } ] } ] } }, { "_shard": "[website][2]", "_node": "iYthfUSYTs-1R27gsnmCpA", "_index": "website", "_type": "article", "_id": "2", "_score": 0.25811607, "_source": { "title": "two article", "content": "this is my two article", "post_date": "2017-02-02", "author_id": 222 }, "_explanation": { "value": 0.25811607, "description": "sum of:", "details": [ { "value": 0.25811607, "description": "sum of:", "details": [ { "value": 0.25811607, "description": "weight(title:article in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.25811607, "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 1, "description": "docFreq", "details": [] }, { "value": 1, "description": "docCount", "details": [] } ] }, { "value": 0.89722675, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", 
"details": [] }, { "value": 2, "description": "avgFieldLength", "details": [] }, { "value": 2.56, "description": "fieldLength", "details": [] } ] } ] } ] } ] }, { "value": 0, "description": "match on required clause, product of:", "details": [ { "value": 0, "description": "# clause", "details": [] }, { "value": 1, "description": "*:*, product of:", "details": [ { "value": 1, "description": "boost", "details": [] }, { "value": 1, "description": "queryNorm", "details": [] } ] } ] } ] } }, { "_shard": "[website][4]", "_node": "iYthfUSYTs-1R27gsnmCpA", "_index": "website", "_type": "article", "_id": "3", "_score": 0.25811607, "_source": { "title": "three article", "content": "this is my three article", "post_date": "2017-03-03", "author_id": 333 }, "_explanation": { "value": 0.25811607, "description": "sum of:", "details": [ { "value": 0.25811607, "description": "sum of:", "details": [ { "value": 0.25811607, "description": "weight(title:article in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.25811607, "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 1, "description": "docFreq", "details": [] }, { "value": 1, "description": "docCount", "details": [] } ] }, { "value": 0.89722675, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 2, "description": "avgFieldLength", "details": [] }, { "value": 2.56, "description": "fieldLength", "details": [] } ] } ] } ] } ] }, { "value": 0, "description": "match on required clause, product of:", "details": [ { "value": 0, "description": "# clause", "details": [] }, { 
"value": 1, "description": "*:*, product of:", "details": [ { "value": 1, "description": "boost", "details": [] }, { "value": 1, "description": "queryNorm", "details": [] } ] } ] } ] } } ] } }
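The numbers in this response can be reproduced by hand. The sketch below plugs the leaf values (docFreq, docCount, termFreq, k1, b, avgFieldLength, fieldLength) into the two formulas quoted in the description fields; the function names bm25_idf and bm25_tf_norm are made up for illustration.

```python
import math

def bm25_idf(doc_freq: float, doc_count: float) -> float:
    """idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq: float, k1: float, b: float,
                 field_length: float, avg_field_length: float) -> float:
    """tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

# Leaf values copied from the explanation of document 1.
idf = bm25_idf(doc_freq=1, doc_count=1)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=2.56, avg_field_length=2.0)

term_score = idf * tf_norm    # "product of:" for a single term
doc1_score = 2 * term_score   # "sum of:" title:one and title:article

print(round(idf, 4), round(tf_norm, 4), round(term_score, 4))  # 0.2877 0.8972 0.2581
print(round(doc1_score, 4))                                    # 0.5162
```

Document 1 matches both one and article in its title, so its score is twice the single-term score; documents 2 and 3 match only article and get 0.25811607 each. Note that docCount is 1 because each of the five shards computes its statistics locally and these documents sit on different shards.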
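3. Analyzing how a document was matched
The _explain API for a single document reports whether that document matched a query ("matched") and how its score was built up: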
{
  "_index": "website",
  "_type": "article",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 0.51623213,
    "description": "sum of:",
    "details": [
      {
        "value": 0.51623213,
        "description": "sum of:",
        "details": [
          {
            "value": 0.25811607,
            "description": "weight(title:one in 0) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 0.25811607,
                "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                  {
                    "value": 0.2876821,
                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "docFreq",
                        "details": []
                      },
                      {
                        "value": 1,
                        "description": "docCount",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.89722675,
                    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "termFreq=1.0",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "parameter k1",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "parameter b",
                        "details": []
                      },
                      {
                        "value": 2,
                        "description": "avgFieldLength",
                        "details": []
                      },
                      {
                        "value": 2.56,
                        "description": "fieldLength",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 0.25811607,
            "description": "weight(title:article in 0) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 0.25811607,
                "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                  {
                    "value": 0.2876821,
                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "docFreq",
                        "details": []
                      },
                      {
                        "value": 1,
                        "description": "docCount",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.89722675,
                    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "termFreq=1.0",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "parameter k1",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "parameter b",
                        "details": []
                      },
                      {
                        "value": 2,
                        "description": "avgFieldLength",
                        "details": []
                      },
                      {
                        "value": 2.56,
                        "description": "fieldLength",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value": 0,
        "description": "match on required clause, product of:",
        "details": [
          {
            "value": 0,
            "description": "# clause",
            "details": []
          },
          {
            "value": 1,
            "description": "*:*, product of:",
            "details": [
              {
                "value": 1,
                "description": "boost",
                "details": []
              },
              {
                "value": 1,
                "description": "queryNorm",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}
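An explanation like this is a tree: every node has a value, a description, and a list of details. To read it more comfortably, it can be flattened into an indented outline. The helper below is a sketch; the name format_explanation and the output layout are arbitrary choices, and the sample node is a trimmed copy of the idf subtree from the response.

```python
def format_explanation(node: dict, depth: int = 0) -> list:
    """Return one indented line per node of an explanation tree."""
    # Collapse the embedded newline that some descriptions contain.
    description = " ".join(node["description"].split())
    lines = ["  " * depth + f"{node['value']} <- {description}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Trimmed copy of the idf subtree from the explanation.
idf_node = {
    "value": 0.2876821,
    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
    "details": [
        {"value": 1, "description": "docFreq", "details": []},
        {"value": 1, "description": "docCount", "details": []},
    ],
}

outline = format_explanation(idf_node)
print("\n".join(outline))
```

Passing the whole "explanation" object from the response instead of idf_node prints the full chain: the top-level sum, the per-term weights, and the idf and tfNorm inputs underneath each one.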