ES Study Notes: The Origins of fielddata

阿新 • Published: 2018-10-21


The official ES documentation describes the relationship between search and sorting particularly well:

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".

Put differently: search maps a term to the documents that contain it, while sorting and aggregations map a document to the value of one of its fields.

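As usual, start from the requirement: "sort the search results by time" is extremely common in product search and log analysis systems. It is well known that Lucene solves the search problem with an inverted index, so how does it handle the sorting problem?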

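Originally, Lucene met this need with FieldCache, which builds a docId -> value mapping. FieldCache has two fatal problems, though: heap memory consumption and slow first-time loading. If the index is updated frequently, the GC pressure and timeouts caused by these two problems make the system unstable, which is a nightmare for programmers.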
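To make the docId -> value mapping concrete, here is a minimal sketch of "un-inverting" an inverted index into a per-document array, which is the general idea behind a FieldCache-style structure. It is plain Java, not Lucene code; the class name `UninvertSketch`, its methods, and the sample dates are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Lucene's FieldCache implementation.
class UninvertSketch {

    // Inverted index: term -> IDs of the documents that contain it.
    // It answers "which documents contain this term?".
    static Map<String, List<Integer>> invertedIndex() {
        Map<String, List<Integer>> index = new TreeMap<>();
        index.put("2018-10-19", Arrays.asList(1));
        index.put("2018-10-20", Arrays.asList(0, 3));
        index.put("2018-10-21", Arrays.asList(2));
        return index;
    }

    // Un-invert: walk every term and record it against each document ID.
    // The resulting array answers "what is the value of this field for doc N?".
    // The whole array lives on the heap and costs a full pass over the terms,
    // which mirrors the two problems described above.
    static String[] uninvert(Map<String, List<Integer>> index, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        String[] valueByDoc = uninvert(invertedIndex(), 4);
        System.out.println(Arrays.toString(valueByDoc));
    }
}
```

The sketch assumes one value per document; a multi-valued field would need a list per slot.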

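Lucene 4.0 introduced a new component, IndexDocValues, which is what we usually call doc_values.

It has two highlights:

1. The doc -> value mapping is built at index time. Note: an inverted index builds the value -> doc mapping.

2. Columnar storage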

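This is essentially a textbook application of "trading space for time" and "loading on demand". Columnar storage is also close to standard equipment for high-performance NoSQL systems; it shows up in both HBase and Hive.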
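As a rough sketch of why a column of per-document values suits sorting, the snippet below sorts a set of search hits by a timestamp column indexed by doc ID. It is plain Java under simplifying assumptions (an in-memory `long[]` stands in for the on-disk column); the class name `ColumnSortSketch` and the sample data are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of sorting hits by a columnar field; not ES or Lucene code.
class ColumnSortSketch {

    // column[docId] is the field value of that document.
    // Only this one column is touched; other fields of the documents are never read.
    static Integer[] sortHitsByColumn(int[] hits, final long[] column) {
        Integer[] sorted = new Integer[hits.length];
        for (int i = 0; i < hits.length; i++) {
            sorted[i] = hits[i];
        }
        // Compare two doc IDs by looking up their values in the column.
        Arrays.sort(sorted, (Integer a, Integer b) -> Long.compare(column[a], column[b]));
        return sorted;
    }

    public static void main(String[] args) {
        // Timestamps (epoch seconds) for docs 0..3, made up for this example.
        long[] timestamps = {1540080000L, 1539900000L, 1540100000L, 1539950000L};
        int[] hits = {0, 2, 3}; // doc IDs matched by some query
        System.out.println(Arrays.toString(sortHitsByColumn(hits, timestamps)));
    }
}
```

Each comparison is a plain array lookup by doc ID, which is the "what is the value of this field for this document?" access pattern from the quoted documentation.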

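Like FieldCache, IndexDocValues solves the problem of "looking up a value by doc_id", and it also fixes FieldCache's two problems.

ES builds fielddata on top of doc_values and uses it for two major features: sorting and aggregations. So it is fair to say that doc_values are the cornerstone of ES aggregations.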

So how is fielddata used in ES? Taking the binary type as an example, see: org.elasticsearch.index.fielddata.BinaryDVFieldDataTests

s1: The mapping needs special handling when it is created

        // Declare a binary field whose fielddata is backed by doc values.
        String mapping = XContentFactory.jsonBuilder().startObject().startObject("test")
                .startObject("properties")
                .startObject("field")
                .field("type", "binary")
                // The special handling: set the fielddata format to doc_values.
                .startObject("fielddata").field("format", "doc_values").endObject()
                .endObject()
                .endObject()
                .endObject().endObject().string();

s2: Build the doc values through the leaf reader

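        // Get the leaf (segment-level) reader context.
        LeafReaderContext reader = refreshReader();
        // Look up the fielddata for "field" and load it for this leaf reader.
        IndexFieldData<?> indexFieldData = getForField("field");
        AtomicFieldData fieldData = indexFieldData.load(reader);

        // Per-document binary values for the field.
        SortedBinaryDocValues bytesValues = fieldData.getBytesValues();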

s3: Position on the target document with the setDocument() method.

/**
 * A list of per-document binary values, sorted
 * according to {@link BytesRef#getUTF8SortedAsUnicodeComparator()}.
 * There might be dups however.
 */
public abstract class SortedBinaryDocValues {

    /**
     * Positions to the specified document
     */
    public abstract void setDocument(int docId);

    /**
     * Return the number of values of the current document.
     */
    public abstract int count();

    /**
     * Retrieve the value for the current document at the specified index.
     * An index ranges from {@code 0} to {@code count()-1}.
     * Note that the returned {@link BytesRef} might be reused across invocations.
     */
    public abstract BytesRef valueAt(int index);

}

Note that if the reader is composite, that is, made up of several readers, you need to work with docBase + reader.docId. This is an easy trap to fall into.
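The docBase arithmetic can be sketched as follows. This is plain Java with an `int[]` of segment start offsets standing in for the docBase of each leaf reader; the class `DocBaseSketch` and its method names are hypothetical, and the sketch assumes the offsets are sorted in ascending order and start at 0.

```java
// Hypothetical illustration of global vs. per-segment doc IDs; not Lucene code.
class DocBaseSketch {

    // docBases[i] is the global doc ID of the first document in segment i.
    // Returns the index of the segment that contains globalDocId (binary search
    // for the last docBase that is <= globalDocId).
    static int segmentOf(int globalDocId, int[] docBases) {
        int lo = 0;
        int hi = docBases.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (docBases[mid] <= globalDocId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    // Global ID -> ID local to its segment.
    static int toLocal(int globalDocId, int[] docBases) {
        return globalDocId - docBases[segmentOf(globalDocId, docBases)];
    }

    // Local ID within a segment -> global ID: docBase + local docId.
    static int toGlobal(int segment, int localDocId, int[] docBases) {
        return docBases[segment] + localDocId;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 100, 250}; // three segments, sizes made up
        System.out.println(segmentOf(120, docBases));   // segment index
        System.out.println(toLocal(120, docBases));     // ID inside that segment
        System.out.println(toGlobal(1, 20, docBases));  // back to the global ID
    }
}
```

The point of the sketch is only that the two ID spaces differ by the segment's docBase, so mixing them up silently reads the wrong document.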

s4: Read the value of the document's field with the valueAt() method.
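Steps s3 and s4 together form a "position, count, read" loop. The sketch below shows that loop against a small in-memory stand-in so it can run without ES on the classpath. `BinaryValues` is a hypothetical class that only mirrors the three methods of SortedBinaryDocValues shown above, with `byte[]` in place of BytesRef; `inMemory` and `readAll` are likewise made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the setDocument()/count()/valueAt() access pattern.
class DocValuesReadSketch {

    // Mirrors the shape of SortedBinaryDocValues; not the real ES class.
    static abstract class BinaryValues {
        abstract void setDocument(int docId);
        abstract int count();
        abstract byte[] valueAt(int index);
    }

    // In-memory implementation: values[docId] holds that document's values.
    static BinaryValues inMemory(final byte[][][] values) {
        return new BinaryValues() {
            private byte[][] current = new byte[0][];

            @Override
            void setDocument(int docId) {
                current = values[docId];
            }

            @Override
            int count() {
                return current.length;
            }

            @Override
            byte[] valueAt(int index) {
                return current[index];
            }
        };
    }

    // s3: position on the document, then s4: read each of its values.
    static List<String> readAll(BinaryValues dv, int docId) {
        dv.setDocument(docId);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < dv.count(); i++) {
            // Convert to a String right away: per the Javadoc above, the real API
            // may reuse the returned BytesRef across invocations.
            out.add(new String(dv.valueAt(i), StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][][] values = {
            {"x".getBytes(StandardCharsets.UTF_8)},
            {"a".getBytes(StandardCharsets.UTF_8), "b".getBytes(StandardCharsets.UTF_8)},
            {}
        };
        System.out.println(readAll(inMemory(values), 1));
    }
}
```

With the real class, the docId passed to setDocument() would be the one local to the leaf reader that the values were loaded from.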

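To sum up, this post outlined the relationship between Lucene's doc_values and ES's fielddata and sketched the basic idea behind doc_values. It then showed the basic way to use fielddata in ES, which is useful when developing your own plugin.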
