Lucene in Depth (7): The Lucene Indexing Process

阿新 • Published: 2019-01-05

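Abstract: Indexing is the most important process in Lucene. Any kind of Document can be added through IndexWriter's addDocument() method. This post takes addDocument as the entry point and traces the Lucene indexing process. The code samples are based on Lucene 6.2.1.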

Indexing call chain

IndexWriter's addDocument

public long addDocument(Iterable<? extends IndexableField> doc) {
    return updateDocument(null, doc);
}

This method has no real logic of its own. Note that it returns a sequence number.

IndexWriter's updateDocument

public long updateDocument(Term term, Iterable<? extends IndexableField> doc) {
    // Abridged: open checks and error handling omitted.
    long seqNo = docWriter.updateDocument(doc, analyzer, term);
    return seqNo;
}

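When updating, this method first deletes the document(s) containing term and then adds the new document. The delete and the add are atomic as seen by a reader on the same index.
Here doc is the document passed in, and analyzer is the one configured in IndexWriterConfig. If none is set, the default is StandardAnalyzer.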
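To make the delete-then-add semantics concrete, here is a minimal in-memory sketch. It is a toy model, not Lucene code: the class TinyWriter, its doc helper, and the map-based documents are all hypothetical names invented for illustration. It only shows the two ideas described above: adding is an update with no delete term, and the delete and the add happen under one lock so a reader never observes the gap between them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of delete-then-add; this is NOT Lucene's IndexWriter.
final class TinyWriter {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private long nextSeqNo = 1;

    // Helper (hypothetical): builds a document from key/value pairs.
    static Map<String, String> doc(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Adding is just an update with no delete term.
    synchronized long addDocument(Map<String, String> doc) {
        return updateDocument(null, null, doc);
    }

    // Removes every document whose field equals value, then adds the new one.
    // Both steps run under the same lock, so snapshot() never sees the gap.
    synchronized long updateDocument(String field, String value, Map<String, String> doc) {
        if (field != null) {
            docs.removeIf(d -> value.equals(d.get(field)));
        }
        docs.add(doc);
        return nextSeqNo++; // each operation gets a larger sequence number
    }

    // A reader's point-in-time view of the documents.
    synchronized List<Map<String, String>> snapshot() {
        return new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        TinyWriter w = new TinyWriter();
        long s1 = w.addDocument(doc("id", "1", "body", "old text"));
        long s2 = w.updateDocument("id", "1", doc("id", "1", "body", "new text"));
        System.out.println("seqNos: " + s1 + ", " + s2);
        System.out.println("docs after update: " + w.snapshot().size());
    }
}
```

In this sketch the sequence number is simply a counter; it only illustrates that a later operation receives a larger number than an earlier one.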

DocumentsWriter's updateDocument

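long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer, final Term delTerm) {
    // Abridged: perThread is the per-thread state obtained and locked earlier in this method.
    final DocumentsWriterPerThread dwpt = perThread.dwpt;
    long seqNo = dwpt.updateDocument(doc, analyzer, delTerm);
    // ... (flush handling and unlocking omitted)
    return seqNo;
}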

This method takes care of the locking; the actual work of adding the document is delegated to the next method in the chain.

DocumentsWriterPerThread's updateDocument

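public long updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) {
    // Hand the document and the analyzer to the consumer through the shared docState.
    docState.doc = doc;
    docState.analyzer = analyzer;
    consumer.processDocument();
    // ... (abridged: the method then finishes the document and returns its sequence number)
}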

The static nested class DocState is used here to pass values along, and the job of processing the document is handed to DocConsumer.

DocConsumer's processDocument

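public void processDocument() {
    // Abridged to the loop over the document's fields.
    int fieldCount = 0;               // indexed fields seen so far in this document
    long fieldGen = nextFieldGen++;   // a new generation for every document
    for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
    }
}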

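DocConsumer is an abstract class; by default its implementation DefaultIndexingChain is used. Here fieldCount counts the distinct fields of the current document that are indexed (inverted), and fieldGen is a generation counter that is incremented by one on every call to this method, that is, once per document.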
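The bookkeeping around fieldGen and fieldCount is easier to follow in isolation. The sketch below is a simplified model and not Lucene code: FieldGenDemo, its PerField class, and the valuesInDoc counter are assumptions made for illustration. Per-field state is kept across documents, and comparing the stored generation with the current one tells whether a field name is appearing for the first time in the current document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the fieldGen / fieldCount bookkeeping.
final class FieldGenDemo {
    // Per-field state that is reused from one document to the next.
    static final class PerField {
        final String name;
        long fieldGen = -1;  // generation of the last document that used this field
        int valuesInDoc = 0; // values seen for this field in the current document

        PerField(String name) {
            this.name = name;
        }
    }

    private final Map<String, PerField> perFields = new HashMap<>();
    private long nextFieldGen = 0;

    // Returns the distinct fields of one document, in first-seen order.
    List<PerField> processDocument(List<String> fieldNames) {
        long fieldGen = nextFieldGen++; // one new generation per document
        List<PerField> fields = new ArrayList<>();
        for (String name : fieldNames) {
            PerField fp = perFields.computeIfAbsent(name, PerField::new);
            boolean first = fp.fieldGen != fieldGen;
            if (first) {
                // First occurrence of this name in the current document:
                // reset the per-document state and remember the field.
                fp.valuesInDoc = 0;
                fp.fieldGen = fieldGen;
                fields.add(fp); // fields.size() plays the role of fieldCount
            }
            fp.valuesInDoc++;
        }
        return fields;
    }

    public static void main(String[] args) {
        FieldGenDemo chain = new FieldGenDemo();
        // "tag" occurs twice but is counted as one distinct field.
        for (PerField fp : chain.processDocument(Arrays.asList("title", "tag", "tag"))) {
            System.out.println(fp.name + " values=" + fp.valuesInDoc);
        }
    }
}
```

Using a generation counter avoids clearing every per-field entry between documents: stale state is detected lazily, the first time a field shows up in a new document.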

DefaultIndexingChain's processField

private int processField(IndexableField field, long fieldGen, int fieldCount){
    String fieldName = field.name();
    IndexableFieldType fieldType = field.fieldType();
    PerField fp = null;
    if (fieldType.indexOptions() == null) {
      throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
    }
    // Invert indexed fields:
    if (fieldType.indexOptions() != IndexOptions.NONE) {
      // if the field omits norms, the boost cannot be indexed.
      if (fieldType.omitNorms() && field.boost() != 1.0f) {
        throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '" + field.name() + "'");
      }
      fp = getOrAddField(fieldName, fieldType, true);
      boolean first = fp.fieldGen != fieldGen;
      fp.invert(field, first);
      if (first) {
        fields[fieldCount++] = fp;
        fp.fieldGen = fieldGen;
      }
    } else {
      verifyUnIndexedFieldType(fieldName, fieldType);
    }
    // Add stored fields:
    if (fieldType.stored()) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      if (fieldType.stored()) {
        try {
          storedFieldsWriter.writeField(fp.fieldInfo, field);
        } catch (Throwable th) {
          throw AbortingException.wrap(th);
        }
      }
    }
    DocValuesType dvType = fieldType.docValuesType();
    if (dvType == null) {
      throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
    }
    if (dvType != DocValuesType.NONE) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      indexDocValue(fp, dvType, field);
    }
    if (fieldType.pointDimensionCount() != 0) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      indexPoint(fp, field);
    }
    return fieldCount;
}

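IndexableField represents a single field at indexing time. Inside IndexWriter, you can think of a document as an Iterable of IndexableField. IndexableField is an interface with a few important properties: field name, field type, and field value.
processField is not long and contains the core indexing logic, so I have not trimmed the code. It shows how the key parameters fieldGen and fieldCount are used.
For stored fields, the final write is a call to writeField.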

StoredFieldsWriter's writeField

public void writeField(FieldInfo info, IndexableField field){
    // Pseudocode: the write call depends on the type of the value being written.
    if(long)    bufferedDocs.writeVLong(infoAndBits);
    if(int)     bufferedDocs.writeVInt(bytes.length);
    if(String)  bufferedDocs.writeString(string);
    ....
}

The write mainly checks the type of the field's value and then hands it to the concrete implementation, GrowableByteArrayDataOutput.
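The writeVInt and writeVLong calls above use a variable-length encoding: small values take fewer bytes. As a self-contained sketch of that idea (the class VIntDemo and its encode and readVInt helpers are hypothetical names, written against java.io rather than Lucene's DataOutput), each byte carries 7 data bits, low-order group first, and the high bit signals that more bytes follow. To my understanding this is the same layout that Lucene documents for its VInt format.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-alone sketch of a 7-bits-per-byte variable-length int.
final class VIntDemo {
    // Writes i with 7 data bits per byte, low-order group first.
    // The high bit of a byte is set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Convenience wrapper: returns the encoded bytes of a single value.
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, i);
        return out.toByteArray();
    }

    // Decodes one value from the start of the array.
    static int readVInt(byte[] in) {
        int value = 0;
        int shift = 0;
        for (byte b : in) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(300);
        // 300 needs two bytes: 0xAC (low 7 bits plus continuation bit), then 0x02.
        System.out.printf("%d bytes: %02X %02X%n", encoded.length, encoded[0] & 0xFF, encoded[1] & 0xFF);
        System.out.println("decoded: " + readVInt(encoded));
    }
}
```

Values below 128 fit in a single byte, which is why lengths and small identifiers are cheap to store this way. Negative values always take five bytes in this scheme.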

DataOutput's writeXXX

public void writeByte(byte b) {
    // Grow the backing array when it is full, then append the byte.
    if (length >= bytes.length) {
      bytes = ArrayUtil.grow(bytes);
    }
    bytes[length++] = b;
}

Shown here is the simplest method, writeByte(); the other methods are all built on top of it.

public void writeLong(long i) throws IOException {
    writeInt((int) (i >> 32));
    writeInt((int) i);
}
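To see how the larger writes reduce to writeByte, here is a small stand-alone sketch. GrowableBufferDemo is a hypothetical class, not Lucene's GrowableByteArrayDataOutput: Arrays.copyOf stands in for ArrayUtil.grow, and the high-byte-first order in writeInt is my assumption, chosen to match the high-half-first order of the writeLong shown above.

```java
import java.util.Arrays;

// Hypothetical stand-alone model of a growable byte buffer.
final class GrowableBufferDemo {
    private byte[] bytes = new byte[4]; // deliberately small to force growth
    private int length = 0;

    // The only method that touches the array; everything else builds on it.
    void writeByte(byte b) {
        if (length >= bytes.length) {
            // Stand-in for ArrayUtil.grow: double the capacity.
            bytes = Arrays.copyOf(bytes, bytes.length * 2);
        }
        bytes[length++] = b;
    }

    // Four bytes, most significant first (assumed order).
    void writeInt(int i) {
        writeByte((byte) (i >> 24));
        writeByte((byte) (i >> 16));
        writeByte((byte) (i >> 8));
        writeByte((byte) i);
    }

    // Two ints: the high 32 bits first, then the low 32 bits.
    void writeLong(long i) {
        writeInt((int) (i >> 32));
        writeInt((int) i);
    }

    // Copy of the bytes written so far, without the unused capacity.
    byte[] toByteArray() {
        return Arrays.copyOf(bytes, length);
    }

    public static void main(String[] args) {
        GrowableBufferDemo out = new GrowableBufferDemo();
        out.writeLong(0x0102030405060708L);
        System.out.println(Arrays.toString(out.toByteArray())); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

Because every write funnels through writeByte, the growth check lives in exactly one place.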

With that, the whole indexing process is complete.
