Lucene Source Code Analysis: The Inverted Index Write Path

阿新 • Published: 2018-12-22

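Lucene writes the inverted index to the .tim and .tip files, and this code is among the most central parts of Lucene. The write path for the inverted index starts at the write function of BlockTreeTermsWriter.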

BlockTreeTermsWriter::write

  public void write(Fields fields) throws IOException {

    String lastField = null;
    for(String field : fields) {
      lastField = field;

      Terms terms = fields.terms(field);
      if (terms == null) {
        continue;
      }
      List<PrefixTerm> prefixTerms = null;

      TermsEnum termsEnum = terms.iterator();
      TermsWriter termsWriter = new TermsWriter(fieldInfos.fieldInfo(field));
      int prefixTermUpto = 0;

      while (true) {
        BytesRef term = termsEnum.next();
        if (term == null) {
          // next() returns null once every term has been consumed
          break;
        }
        termsWriter.write(term, termsEnum, null);
      }
      termsWriter.finish();
    }
  }

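The function iterates over every field. It first calls terms with the field name, which returns a FreqProxTerms holding all Terms of that field. Next, fieldInfo returns the field's metadata by name, and a TermsWriter is created from it; TermsWriter is the main class for writing the inverted index. The loop then takes each term out of the FreqProxTerms in turn and calls TermsWriter's write function, which writes it to the .tim file and builds the corresponding index information. Finally, TermsWriter's finish function writes the index information to the .tip file. Each step is examined below.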

BlockTreeTermsWriter::write->TermsWriter::write

    public void write(BytesRef text, TermsEnum termsEnum, PrefixTerm prefixTerm) throws IOException {

      BlockTermState state = postingsWriter.writeTerm(text, termsEnum, docsSeen);
      if (state != null) {

        pushTerm(text);

        PendingTerm term = new PendingTerm(text, state, prefixTerm);
        pending.add(term);

        if (prefixTerm == null) {
          sumDocFreq += state.docFreq;
          sumTotalTermFreq += state.totalTermFreq;
          numTerms++;
          if (firstPendingTerm == null) {
            firstPendingTerm = term;
          }
          lastPendingTerm = term;
        }
      }
    }

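TermsWriter's write function handles one Term per call. postingsWriter is a Lucene50PostingsWriter. write first calls Lucene50PostingsWriter's writeTerm function to record the Term and the information about the documents that contain it.
The member variable pending is a list of PendingEntry. A PendingEntry holds either a Term or a Block, so pending stores the entries that are still waiting to be processed.
pushTerm is the core function inside write; it does the actual processing of a Term and is examined in detail later. At the end, write accumulates the document frequency and term frequency into the member variables sumDocFreq and sumTotalTermFreq.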

BlockTreeTermsWriter::write->TermsWriter::write->Lucene50PostingsWriter::writeTerm

  public final BlockTermState writeTerm(BytesRef term, TermsEnum termsEnum, FixedBitSet docsSeen) throws IOException {
    startTerm();
    postingsEnum = termsEnum.postings(postingsEnum, enumFlags);

    int docFreq = 0;
    long totalTermFreq = 0;
    while (true) {
      int docID = postingsEnum.nextDoc();
      if (docID == PostingsEnum.NO_MORE_DOCS) {
        break;
      }
      docFreq++;
      docsSeen.set(docID);
      int freq;
      if (writeFreqs) {
        freq = postingsEnum.freq();
        totalTermFreq += freq;
      } else {
        freq = -1;
      }
      startDoc(docID, freq);

      if (writePositions) {
        for(int i=0;i<freq;i++) {
          int pos = postingsEnum.nextPosition();
          BytesRef payload = writePayloads ? postingsEnum.getPayload() : null;
          int startOffset;
          int endOffset;
          if (writeOffsets) {
            startOffset = postingsEnum.startOffset();
            endOffset = postingsEnum.endOffset();
          } else {
            startOffset = -1;
            endOffset = -1;
          }
          addPosition(pos, payload, startOffset, endOffset);
        }
      }

      finishDoc();
    }

    if (docFreq == 0) {
      return null;
    } else {
      BlockTermState state = newTermState();
      state.docFreq = docFreq;
      state.totalTermFreq = writeFreqs ? totalTermFreq : -1;
      finishTerm(state);
      return state;
    }
  }

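startTerm sets up the file pointers for the .doc, .pos, and .pay files. The postings function creates a FreqProxPostingsEnum or a FreqProxDocsEnum, which wraps a FreqProxTermsWriterPerField, that is, the termsHashPerField member of each PerField described in chapter 5. termsHashPerField stores all the Terms information for its Field.
writeTerm then obtains the next document ID through nextDoc, reads the term frequency freq, and adds it to totalTermFreq (the total term frequency). It then calls startDoc to record the document's information. addPosition records the term's position, offsets, and payload, writing them to file when necessary. finishDoc records file pointers and related state. After the loop, a BlockTermState is created, populated with the term frequency and document frequency, and eventually returned.
Finally, writeTerm calls finishTerm, which writes the document information to the .doc file and the position information to the .pos file.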
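The statistics gathered by writeTerm can be illustrated with a minimal, self-contained sketch. It does not use Lucene; the class name TermStatsSketch and the {docID, freq} array layout are assumptions made only for this illustration.

```java
final class TermStatsSketch {
    /**
     * Walks a postings list given as {docID, freq} pairs and returns
     * {docFreq, totalTermFreq}. totalTermFreq is -1 when frequencies
     * are not indexed, mirroring the writeFreqs check in writeTerm.
     */
    static long[] accumulate(int[][] postings, boolean writeFreqs) {
        long docFreq = 0;
        long totalTermFreq = 0;
        for (int[] posting : postings) {
            docFreq++;                       // one more document contains the term
            if (writeFreqs) {
                totalTermFreq += posting[1]; // occurrences inside this document
            }
        }
        return new long[] { docFreq, writeFreqs ? totalTermFreq : -1 };
    }

    public static void main(String[] args) {
        int[][] postings = { {0, 2}, {3, 1}, {7, 4} }; // hypothetical {docID, freq} data
        long[] stats = accumulate(postings, true);
        System.out.println("docFreq=" + stats[0] + " totalTermFreq=" + stats[1]);
    }
}
```

With the sample data, the term appears in 3 documents and 7 times overall, which are the two values that end up in the BlockTermState.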

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm

    private void pushTerm(BytesRef text) throws IOException {

      int limit = Math.min(lastTerm.length(), text.length);
      int pos = 0;

      while (pos < limit && lastTerm.byteAt(pos) == text.bytes[text.offset+pos]) {
        pos++;
      }

      for(int i=lastTerm.length()-1;i>=pos;i--) {

        int prefixTopSize = pending.size() - prefixStarts[i];
        if (prefixTopSize >= minItemsInBlock) {
          writeBlocks(i+1, prefixTopSize);
          prefixStarts[i] -= prefixTopSize-1;
        }
      }

      if (prefixStarts.length < text.length) {
        prefixStarts = ArrayUtil.grow(prefixStarts, text.length);
      }

      for(int i=pos;i<text.length;i++) {
        prefixStarts[i] = pending.size();
      }

      lastTerm.copyBytes(text);
    }

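lastTerm holds the Term processed by the previous call. pushTerm's core job is to compare the new Term with lastTerm: for each prefix of lastTerm that the new Term no longer shares, it checks whether at least minItemsInBlock entries in the pending list fall under that prefix. When they do, those Terms (or Blocks) must be stored as a block, and writeBlocks is called to do so.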
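The prefix bookkeeping is easier to follow in a small simulation. The sketch below is not Lucene code: terms are plain ASCII strings, the pending list is reduced to a counter, and writeBlocks is replaced by recording a "prefixLength:count" string. The class name PushTermSketch and the threshold of 3 are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PushTermSketch {
    final int minItemsInBlock;
    int pendingSize = 0;                  // stands in for pending.size()
    int[] prefixStarts = new int[8];
    String lastTerm = "";
    // One "prefixLength:count" string per simulated writeBlocks call.
    final List<String> blocks = new ArrayList<>();

    PushTermSketch(int minItemsInBlock) {
        this.minItemsInBlock = minItemsInBlock;
    }

    void pushTerm(String text) {
        // Length of the prefix shared with the previous term.
        int limit = Math.min(lastTerm.length(), text.length());
        int pos = 0;
        while (pos < limit && lastTerm.charAt(pos) == text.charAt(pos)) {
            pos++;
        }
        // Close every prefix of the previous term that the new term no longer shares.
        for (int i = lastTerm.length() - 1; i >= pos; i--) {
            int prefixTopSize = pendingSize - prefixStarts[i];
            if (prefixTopSize >= minItemsInBlock) {
                blocks.add((i + 1) + ":" + prefixTopSize);
                pendingSize -= prefixTopSize - 1; // the entries collapse into one block entry
                prefixStarts[i] -= prefixTopSize - 1;
            }
        }
        if (prefixStarts.length < text.length()) {
            prefixStarts = Arrays.copyOf(prefixStarts, text.length());
        }
        // Every new prefix of this term starts at the current end of pending.
        for (int i = pos; i < text.length(); i++) {
            prefixStarts[i] = pendingSize;
        }
        lastTerm = text;
    }

    // Mirrors TermsWriter.write: pushTerm first, then append to pending.
    void addTerm(String text) {
        pushTerm(text);
        pendingSize++;
    }

    public static void main(String[] args) {
        PushTermSketch sketch = new PushTermSketch(3);
        for (String term : new String[] {"aaa", "aab", "aac", "b"}) {
            sketch.addTerm(term);
        }
        System.out.println(sketch.blocks + " pending=" + sketch.pendingSize);
        // prints "[2:3] pending=2"
    }
}
```

When "b" arrives, the prefix "aa" (length 2) is closed with three entries under it, so one block is recorded. The pending list then holds that block plus the new term.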

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks

    void writeBlocks(int prefixLength, int count) throws IOException {
      int lastSuffixLeadLabel = -1;

      boolean hasTerms = false;
      boolean hasPrefixTerms = false;
      boolean hasSubBlocks = false;

      int start = pending.size()-count;
      int end = pending.size();
      int nextBlockStart = start;
      int nextFloorLeadLabel = -1;

      for (int i=start; i<end; i++) {

        PendingEntry ent = pending.get(i);
        int suffixLeadLabel;

        if (ent.isTerm) {
          PendingTerm term = (PendingTerm) ent;
          if (term.termBytes.length == prefixLength) {
            suffixLeadLabel = -1;
          } else {
            suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
          }
        } else {
          PendingBlock block = (PendingBlock) ent;
          suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
        }

        if (suffixLeadLabel != lastSuffixLeadLabel) {
          int itemsInBlock = i - nextBlockStart;
          if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
            boolean isFloor = itemsInBlock < count;
            newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasPrefixTerms, hasSubBlocks));

            hasTerms = false;
            hasSubBlocks = false;
            hasPrefixTerms = false;
            nextFloorLeadLabel = suffixLeadLabel;
            nextBlockStart = i;
          }

          lastSuffixLeadLabel = suffixLeadLabel;
        }

        if (ent.isTerm) {
          hasTerms = true;
          hasPrefixTerms |= ((PendingTerm) ent).prefixTerm != null;
        } else {
          hasSubBlocks = true;
        }
      }

      if (nextBlockStart < end) {
        int itemsInBlock = end - nextBlockStart;
        boolean isFloor = itemsInBlock < count;
        newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasPrefixTerms, hasSubBlocks));
      }
      PendingBlock firstBlock = newBlocks.get(0);
      firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);

      pending.subList(pending.size()-count, pending.size()).clear();
      pending.add(firstBlock);
      newBlocks.clear();
    }

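hasTerms indicates whether the entries being merged include any Term (in special cases the merged entries consist only of sub-blocks).
hasPrefixTerms indicates whether there are prefix terms; assume it is always false here.
hasSubBlocks is the counterpart of hasTerms and indicates whether the entries being merged include any sub-block.
start and end delimit the range of Terms or Blocks in the pending list that are to be merged.
writeBlocks then iterates over each pending Term or Block in that range. suffixLeadLabel is the first byte after the shared prefix for the current entry (or -1 when a Term is exactly the prefix), and lastSuffixLeadLabel is that value for the previous entry, so a difference between the two marks a boundary between groups. Along the way the function checks whether the entries include Terms and sub-blocks and sets hasTerms and hasSubBlocks accordingly. When a boundary is reached, the current group already holds at least minItemsInBlock entries, and more than maxItemsInBlock entries remain between the start of the group and end, writeBlock is called to write the group as a block. One final writeBlock call handles whatever remains.
writeBlocks next calls compileIndex to store the block's information in an FST (kept in the block's index member). FST stands for finite state transducer; in effect it stores a tree in its own structure, and that tree is formed from the bytes of all the Terms, as shown later.
Finally, writeBlocks removes the entries that were just written from the pending list and adds the newly created block to it.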
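The splitting rule can be tried in isolation. The following sketch keeps only the boundary logic of writeBlocks: it takes the lead label of each pending entry and returns the [start, end) ranges that would be handed to writeBlock. The class name FloorSplitSketch and the small min/max values are assumptions for illustration; nothing is written to disk.

```java
import java.util.ArrayList;
import java.util.List;

final class FloorSplitSketch {
    /** Returns the {start, end} ranges (end exclusive) that would each become one block. */
    static List<int[]> split(int[] leadLabels, int minItemsInBlock, int maxItemsInBlock) {
        List<int[]> ranges = new ArrayList<>();
        int end = leadLabels.length;
        int nextBlockStart = 0;
        int lastLabel = -1;
        for (int i = 0; i < end; i++) {
            int label = leadLabels[i];
            if (label != lastLabel) {
                // A new lead byte begins here, so this is a legal place to cut.
                int itemsInBlock = i - nextBlockStart;
                if (itemsInBlock >= minItemsInBlock && end - nextBlockStart > maxItemsInBlock) {
                    ranges.add(new int[] {nextBlockStart, i});
                    nextBlockStart = i;
                }
                lastLabel = label;
            }
        }
        if (nextBlockStart < end) {
            ranges.add(new int[] {nextBlockStart, end}); // the final block takes the rest
        }
        return ranges;
    }

    public static void main(String[] args) {
        int[] labels = {'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'};
        for (int[] r : split(labels, 2, 4)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

For the nine sample entries with a minimum of 2 and a maximum of 4, the cuts fall at [0, 3), [3, 5), and [5, 9). The last group is not cut before 'd' because only four entries remain, which does not exceed the maximum.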

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->writeBlock
Case 1

    private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end, boolean hasTerms, boolean hasPrefixTerms, boolean hasSubBlocks) throws IOException {

      long startFP = termsOut.getFilePointer();

      boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;

      final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
      System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
      prefix.length = prefixLength;

      int numEntries = end - start;
      int code = numEntries << 1;
      if (end == pending.size()) {
        code |= 1;
      }
      termsOut.writeVInt(code);

      boolean isLeafBlock = hasSubBlocks == false && hasPrefixTerms == false;
      final List<FST<BytesRef>> subIndices;
      boolean absolute = true;

      if (isLeafBlock) {
        subIndices = null;
        for (int i=start;i<end;i++) {
          PendingEntry ent = pending.get(i);

          PendingTerm term = (PendingTerm) ent;
          BlockTermState state = term.state;
          final int suffix = term.termBytes.length - prefixLength;

          suffixWriter.writeVInt(suffix);
          suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
          statsWriter.writeVInt(state.docFreq);
          if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
            statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
          }

          postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
          for (int pos = 0; pos < longsSize; pos++) {
            metaWriter.writeVLong(longs[pos]);
          }
          bytesWriter.writeTo(metaWriter);
          bytesWriter.reset();
          absolute = false;
        }
      } else {
        ...
      }

      termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
      suffixWriter.writeTo(termsOut);
      suffixWriter.reset();

      termsOut.writeVInt((int) statsWriter.getFilePointer());
      statsWriter.writeTo(termsOut);
      statsWriter.reset();

      termsOut.writeVInt((int) metaWriter.getFilePointer());
      metaWriter.writeTo(termsOut);
      metaWriter.reset();

      if (hasFloorLeadLabel) {
        prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
      }

      return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
    }

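termsOut wraps the output stream of the .tim file (in practice an FSIndexOutput). The startFP returned by getFilePointer is the position in that file where the next write will go.
writeBlock first extracts the shared prefix. For example, if the Terms to be written as one block are aaa, aab, and aac, the shared prefix is aa. It is stored in prefix, a BytesRef, which wraps a byte array.
numEntries is the number of Terms or Blocks written this time. code packs numEntries into its upper bits and uses the lowest bit to flag whether this block reaches the end of the pending list (end == pending.size()), that is, whether it is the last block of its group. code is then written to the .tim file.
isLeafBlock indicates whether this is a leaf node. bytesWriter, suffixWriter, statsWriter, and metaWriter are in-memory buffers that behave like files.
writeBlock then iterates over the Terms to be written. suffix is the length of the part of the Term after the shared prefix; for aaa, aab, and aac, suffix is 1. The length suffix is written first, followed by the suffix bytes, so suffixWriter ends up receiving a, b, and c. Below that, the document frequency is written to statsWriter, followed by the difference between the total term frequency and the document frequency unless the field indexes documents only.
Further down, postingsWriter is the Lucene50PostingsWriter. Its encodeTerm function stores the file pointer offsets into .doc, .pos, and .pay in longs, and stores values such as singletonDocID, lastPosBlockOffset, and skipOffset in bytesWriter. The values in longs are then written to metaWriter, and the contents of bytesWriter are appended to metaWriter as well.
After the loop, the writeTo functions of suffixWriter, statsWriter, and metaWriter copy the buffered data into the .tim file.
Finally, writeBlock creates and returns a PendingBlock, which wraps the information about the Terms or sub-Blocks just written.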
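The block header, the suffix extraction, and the variable-length integer encoding can be sketched without Lucene. In the example below the class name LeafBlockSketch and its helper names are hypothetical; vInt reimplements the usual 7-bits-per-byte scheme rather than calling Lucene's writeVInt.

```java
import java.io.ByteArrayOutputStream;

final class LeafBlockSketch {
    /** Packs the entry count into the upper bits and the "last block" flag into bit 0. */
    static int headerCode(int numEntries, boolean isLastBlock) {
        return (numEntries << 1) | (isLastBlock ? 1 : 0);
    }

    /** Drops the shared prefix from each term, leaving only what a leaf block stores. */
    static String[] suffixes(String[] terms, int prefixLength) {
        String[] out = new String[terms.length];
        for (int i = 0; i < terms.length; i++) {
            out[i] = terms[i].substring(prefixLength);
        }
        return out;
    }

    /** Encodes a non-negative int 7 bits at a time, low bits first; bit 7 means "more bytes follow". */
    static byte[] vInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(headerCode(3, true));                        // 7
        System.out.println(String.join(",", suffixes(
            new String[] {"aaa", "aab", "aac"}, 2)));                   // a,b,c
        System.out.println(vInt(300).length);                           // 2
    }
}
```

A block of three entries that is the last of its group gets the header code 7, and the three terms contribute only the suffixes a, b, and c.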

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->writeBlock
Case 2

    private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end, boolean hasTerms, boolean hasPrefixTerms, boolean hasSubBlocks) throws IOException {

      long startFP = termsOut.getFilePointer();

      boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;

      final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
      System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
      prefix.length = prefixLength;

      int numEntries = end - start;
      int code = numEntries << 1;
      if (end == pending.size()) {
        code |= 1;
      }
      termsOut.writeVInt(code);

      boolean isLeafBlock = hasSubBlocks == false && hasPrefixTerms == false;
      final List<FST<BytesRef>> subIndices;
      boolean absolute = true;

      if (isLeafBlock) {
        ...
      } else {
        subIndices = new ArrayList<>();
        boolean sawAutoPrefixTerm = false;
        for (int i=start;i<end;i++) {
          PendingEntry ent = pending.get(i);
          if (ent.isTerm) {
            PendingTerm term = (PendingTerm) ent;
            BlockTermState state = term.state;
            final int suffix = term.termBytes.length - prefixLength;
            if (minItemsInAutoPrefix == 0) {
              suffixWriter.writeVInt(suffix << 1);
              suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
            } else {
              code = suffix<<2;
              int floorLeadEnd = -1;
              if (term.prefixTerm != null) {
                sawAutoPrefixTerm = true;
                PrefixTerm prefixTerm = term.prefixTerm;
                floorLeadEnd = prefixTerm.floorLeadEnd;

                if (prefixTerm.floorLeadStart == -2) {
                  code |= 2;
                } else {
                  code |= 3;
                }
              }
              suffixWriter.writeVInt(code);
              suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
              if (floorLeadEnd != -1) {
                suffixWriter.writeByte((byte) floorLeadEnd);
              }
            }

            statsWriter.writeVInt(state.docFreq);
            if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
              statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
            }
            postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
            for (int pos = 0; pos < longsSize; pos++) {
              metaWriter.writeVLong(longs[pos]);
            }
            bytesWriter.writeTo(metaWriter);
            bytesWriter.reset();
            absolute = false;
          } else {
            PendingBlock block = (PendingBlock) ent;
            final int suffix = block.prefix.length - prefixLength;
            if (minItemsInAutoPrefix == 0) {
              suffixWriter.writeVInt((suffix<<1)|1);
            } else {
              suffixWriter.writeVInt((suffix<<2)|1);
            }
            suffixWriter.writeBytes(block.prefix.bytes, prefixLength, suffix);
            suffixWriter.writeVLong(startFP - block.fp);
            subIndices.add(block.index);
          }
        }
      }

      termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
      suffixWriter.writeTo(termsOut);
      suffixWriter.reset();

      termsOut.writeVInt((int) statsWriter.getFilePointer());
      statsWriter.writeTo(termsOut);
      statsWriter.reset();

      termsOut.writeVInt((int) metaWriter.getFilePointer());
      metaWriter.writeTo(termsOut);
      metaWriter.reset();

      if (hasFloorLeadLabel) {
        prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
      }

      return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
    }

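Case 2 means the block being written is not a leaf node. Entries that are Terms are handled as in case 1, except that the suffix length is shifted left to leave room for a flag bit. For an entry that is a sub-block, the sub-block's information is written instead: its suffix length with the lowest bit set, its suffix bytes, and the distance from this block's startFP back to the sub-block's file pointer. The PendingBlock created at the end must carry the FST of each sub-Block, that is, subIndices.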
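The flag bit in the suffix code can be shown with a few lines of arithmetic. This sketch covers only the branch where minItemsInAutoPrefix is 0; the class name SuffixCodeSketch and its method names are hypothetical.

```java
final class SuffixCodeSketch {
    /** Suffix length in the upper bits; bit 0 is set for a sub-block and clear for a term. */
    static int encode(int suffixLength, boolean isSubBlock) {
        return (suffixLength << 1) | (isSubBlock ? 1 : 0);
    }

    static int suffixLength(int code) {
        return code >>> 1;
    }

    static boolean isSubBlock(int code) {
        return (code & 1) != 0;
    }

    /** Sub-blocks are written before their parent, so the parent stores a backward distance. */
    static long fpDelta(long parentStartFP, long subBlockFP) {
        return parentStartFP - subBlockFP;
    }

    public static void main(String[] args) {
        int code = encode(3, true);
        System.out.println(code + " -> length " + suffixLength(code)
            + ", subBlock " + isSubBlock(code));
        System.out.println(fpDelta(500, 420));
    }
}
```

A reader can therefore tell terms and sub-blocks apart from the lowest bit alone, and can locate a sub-block by subtracting the stored distance from the parent block's own file pointer.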

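Once writeBlocks has called writeBlock, the Terms or Blocks from the pending list have been written to the .tim file. The next step is PendingBlock's compileIndex function, which builds index information for the Terms just written to .tim. This information is eventually written to the .tip file and used for lookups.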

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex

    public void compileIndex(List<PendingBlock> blocks, RAMOutputStream scratchBytes, IntsRefBuilder scratchIntsRef) throws IOException {

      scratchBytes.writeVLong(encodeOutput(fp, hasTerms, isFloor));
      if (isFloor) {
        scratchBytes.writeVInt(blocks.size()-1);
        for (int i=1;i<blocks.size();i++) {
          PendingBlock sub = blocks.get(i);
          scratchBytes.writeByte((byte) sub.floorLeadByte);
          scratchBytes.writeVLong((sub.fp - fp) << 1 | (sub.hasTerms ? 1 : 0));
        }
      }

      final ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
      final Builder<BytesRef> indexBuilder = new Builder<>(FST.INPUT_TYPE.BYTE1,
                                                           0, 0, true, false, Integer.MAX_VALUE,
                                                           outputs, false,
                                                           PackedInts.COMPACT, true, 15);

      final byte[] bytes = new byte[(int) scratchBytes.getFilePointer()];
      scratchBytes.writeTo(bytes, 0);
      indexBuilder.add(Util.toIntsRef(prefix, scratchIntsRef), new BytesRef(bytes, 0, bytes.length));
      scratchBytes.reset();

      for(PendingBlock block : blocks) {
        if (block.subIndices != null) {
          for(FST<BytesRef> subIndex : block.subIndices) {
            append(indexBuilder, subIndex, scratchIntsRef);
          }
          block.subIndices = null;
        }
      }
      index = indexBuilder.finish();
    }

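fp is the block's pointer into the .tim file. encodeOutput packs fp, hasTerms, and isFloor into a single long, which is then stored in scratchBytes. compileIndex next creates a Builder, which is used to construct the index tree, and then copies the data in scratchBytes into the byte array bytes.
The core of compileIndex is Builder's add function, which adds a Term, or part of a Term's prefix, to a tree maintained by the frontier array and from there to the FST. Finally, compileIndex calls Builder's finish function, which writes the FST built up by add into an in-memory buffer; it is later written to the .tip file.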
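As an illustration of how a file pointer and two booleans fit into one long, here is a minimal sketch. The listing above does not show encodeOutput itself, so the layout used here (two flag bits, 0x1 for floor and 0x2 for has-terms) is an assumption, and the class name EncodeOutputSketch is hypothetical. The floorCode helper follows the expression visible in compileIndex.

```java
final class EncodeOutputSketch {
    // Assumed layout for this sketch: two low flag bits, file pointer above them.
    static final int FLAG_BITS = 2;
    static final int IS_FLOOR = 0x1;
    static final int HAS_TERMS = 0x2;

    static long encode(long fp, boolean hasTerms, boolean isFloor) {
        return (fp << FLAG_BITS) | (hasTerms ? HAS_TERMS : 0) | (isFloor ? IS_FLOOR : 0);
    }

    static long fp(long code) {
        return code >>> FLAG_BITS;
    }

    static boolean hasTerms(long code) {
        return (code & HAS_TERMS) != 0;
    }

    static boolean isFloor(long code) {
        return (code & IS_FLOOR) != 0;
    }

    /** Per-floor-block entry: distance from the first block, with hasTerms in bit 0. */
    static long floorCode(long subFp, long firstFp, boolean subHasTerms) {
        return ((subFp - firstFp) << 1) | (subHasTerms ? 1 : 0);
    }

    public static void main(String[] args) {
        long code = encode(100, true, false);
        System.out.println(code + " fp=" + fp(code) + " hasTerms=" + hasTerms(code)
            + " isFloor=" + isFloor(code));
        System.out.println(floorCode(150, 100, true));
    }
}
```

Packing the flags into the low bits keeps the value small for nearby file pointers, which suits the variable-length writeVLong encoding used right after.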

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::Builder

  public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits) {

    this.minSuffixCount1 = minSuffixCount1;
    this.minSuffixCount2 = minSuffixCount2;
    this.doShareNonSingletonNodes = doShareNonSingletonNodes;
    this.shareMaxTailLength = shareMaxTailLength;
    this.doPackFST = doPackFST;
    this.acceptableOverheadRatio = acceptableOverheadRatio;
    this.allowArrayArcs = allowArrayArcs;
    fst = new FST<>(inputType, outputs, doPackFST, acceptableOverheadRatio, bytesPageBits);
    bytes = fst.bytes;
    if (doShareSuffix) {
      dedupHash = new NodeHash<>(fst, bytes.getReverseReader(false));
    } else {
      dedupHash = null;
    }
    NO_OUTPUT = outputs.getNoOutput();

    final UnCompiledNode<T>[] f = (UnCompiledNode<T>[]) new UnCompiledNode[10];
    frontier = f;
    for(int idx=0;idx<frontier.length;idx++) {
      frontier[idx] = new UnCompiledNode<>(this, idx);
    }
  }

Builder's constructor mainly creates an FST and initializes the frontier array; each UnCompiledNode element in frontier represents one node in the tree.

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add

  public void add(IntsRef input, T output) throws IOException {

    ...

    int pos1 = 0;
    int pos2 = input.offset;
    final int pos1Stop = Math.min(lastInput.length(), input.length);
    while(true) {
      frontier[pos1].inputCount++;
      if (pos1 >= pos1Stop || lastInput.intAt(pos1) != input.ints[pos2]) {
        break;
      }
      pos1++;
      pos2++;
    }
    final int prefixLenPlus1 = pos1+1;

    if (frontier.length < input.length+1) {
      final UnCompiledNode<T>[] next = ArrayUtil.grow(frontier, input.length+1);
      for(int idx=frontier.length;idx<next.length;idx++) {
        next[idx] = new UnCompiledNode<>(this, idx);
      }
      frontier = next;
    }

    freezeTail(prefixLenPlus1);

    for(int idx=prefixLenPlus1;idx<=input.length;idx++) {
      frontier[idx-1].addArc(input.ints[input.offset + idx - 1],
                             frontier[idx]);
      frontier[idx].inputCount++;
    }

    final UnCompiledNode<T> lastNode = frontier[input.length];
    if (lastInput.length() != input.length || prefixLenPlus1 != input.length + 1) {
      lastNode.isFinal = true;
      lastNode.output = NO_OUTPUT;
    }

    for(int idx=1;idx<prefixLenPlus1;idx++) {
      final UnCompiledNode<T> node = frontier[idx];
      final UnCompiledNode<T> parentNode = frontier[idx-1];

      final T lastOutput = parentNode.getLastOutput(input.ints[input.offset + idx - 1]);

      final T commonOutputPrefix;
      final T wordSuffix;

      if (lastOutput != NO_OUTPUT) {
        commonOutputPrefix = fst.outputs.common(output, lastOutput);
        wordSuffix = fst.outputs.subtract(lastOutput, commonOutputPrefix);
        parentNode.setLastOutput(input.ints[input.offset + idx - 1], commonOutputPrefix);
        node.prependOutput(wordSuffix);
      } else {
        commonOutputPrefix = wordSuffix = NO_OUTPUT;
      }

      output = fst.outputs.subtract(output, commonOutputPrefix);
    }

    if (lastInput.length() == input.length && prefixLenPlus1 == 1+input.length) {
      lastNode.output = fst.outputs.merge(lastNode.output, output);
    } else {
      frontier[prefixLenPlus1-1].setLastOutput(input.ints[input.offset + prefixLenPlus1-1], output);
    }
    lastInput.copyInts(input);
  }

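add first computes the prefix shared with the previous input; prefixLenPlus1 is the length of that shared prefix in the FST tree plus one. If a shared prefix exists, the outputs along it have to be merged later in the function. Next, the for loop calls addArc to add each remaining byte of input (the Term) to frontier, forming the FST tree that the frontier array maintains, and then marks the last UnCompiledNode for this input by setting its isFinal flag to true. Finally, add stores the remaining data in output (file pointers and similar information) on the first arc after the shared prefix, that is, the arc leaving frontier[prefixLenPlus1-1], and sets lastInput to the current input.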
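The output merging along a shared prefix relies on two operations, common and subtract. The sketch below implements them from scratch for byte sequences, assuming that common means the longest common prefix and subtract strips that prefix. It is an illustration of the idea, not Lucene's ByteSequenceOutputs, and the class name OutputSharingSketch is hypothetical.

```java
import java.util.Arrays;

final class OutputSharingSketch {
    /** Longest common prefix of two outputs; this part can stay on the shared arc. */
    static byte[] common(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return Arrays.copyOf(a, i);
    }

    /** Removes a known prefix from an output; the remainder is pushed further down the tree. */
    static byte[] subtract(byte[] output, byte[] prefix) {
        return Arrays.copyOfRange(output, prefix.length, output.length);
    }

    public static void main(String[] args) {
        byte[] lastOutput = {1, 2, 3}; // output already on the shared arc
        byte[] output = {1, 2, 9};     // output of the term being added

        byte[] shared = common(output, lastOutput);
        byte[] wordSuffix = subtract(lastOutput, shared); // prepended to the child node
        byte[] remaining = subtract(output, shared);      // continues with the new term

        System.out.println(Arrays.toString(shared));
        System.out.println(Arrays.toString(wordSuffix));
        System.out.println(Arrays.toString(remaining));
    }
}
```

In the example the shared arc keeps [1, 2], the earlier term's leftover [3] moves down to its child node, and the new term carries [9] onward to its own arc.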

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail

  private void freezeTail(int prefixLenPlus1) throws IOException {
    final int downTo = Math.max(1, prefixLenPlus1);
    for(int idx=lastInput.length(); idx >= downTo; idx--) {

      boolean doPrune = false;
      boolean doCompile = false;

      final UnCompiledNode<T> node = frontier[idx];
      final UnCompiledNode<T> parent = frontier[idx-1];

      if (node.inputCount < minSuffixCount1) {
        doPrune = true;
        doCompile = true;
      } else if (idx > prefixLenPlus1) {
        if (parent.inputCount < minSuffixCount2 || (minSuffixCount2 == 1 && parent.inputCount == 1 && idx > 1)) {
          doPrune = true;
        } else {
          doPrune = false;
        }
        doCompile = true;
      } else {
        doCompile = minSuffixCount2 == 0;
      }

      if (node.inputCount < minSuffixCount2 || (minSuffixCount2 == 1 && node.inputCount == 1 && idx > 1)) {
        for(int arcIdx=0;arcIdx<node.numArcs;arcIdx++) {
          final UnCompiledNode<T> target = (UnCompiledNode<T>) node.arcs[arcIdx].target;
          target.clear();
        }
        node.numArcs = 0;
      }

      if (doPrune) {
        node.clear();
        parent.deleteLast(lastInput.intAt(idx-1), node);
      } else {

        if (minSuffixCount2 != 0) {
          compileAllTargets(node, lastInput.length()-idx);
        }
        final T nextFinalOutput = node.output;

        final boolean isFinal = node.isFinal || node.numArcs == 0;

        if (doCompile) {
          parent.replaceLast(lastInput.intAt(idx-1),
                             compileNode(node, 1+lastInput.length()-idx),
                             nextFinalOutput,
                             isFinal);
        } else {
          parent.replaceLast(lastInput.intAt(idx-1),
                             node,
                             nextFinalOutput,
                             isFinal);
          frontier[idx] = new UnCompiledNode<>(this, idx);
        }
      }
    }
  }

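freezeTail's core job is to take nodes that can no longer change and add them to the FST structure through compileNode.
replaceLast updates the parent node's last arc, for example target, the child node's position in bytes, and isFinal, which marks whether an input ends at that child.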
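Which nodes count as "no longer changing" follows directly from the shared prefix between the previous and the next input. The sketch below computes the frontier indices that the loop in freezeTail visits, assuming minSuffixCount1 and minSuffixCount2 are both 0 as in compileIndex, so nothing is pruned. The class name FreezeTailSketch is hypothetical and inputs are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

final class FreezeTailSketch {
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Frontier indices frozen when `next` arrives after `last`, deepest first. */
    static List<Integer> frozenDepths(String last, String next) {
        int prefixLenPlus1 = commonPrefix(last, next) + 1;
        int downTo = Math.max(1, prefixLenPlus1);
        List<Integer> depths = new ArrayList<>();
        // Inputs arrive in sorted order, so nodes below the shared prefix cannot gain new arcs.
        for (int idx = last.length(); idx >= downTo; idx--) {
            depths.add(idx);
        }
        return depths;
    }

    public static void main(String[] args) {
        System.out.println(frozenDepths("cat", "car")); // [3]
        System.out.println(frozenDepths("cat", "dog")); // [3, 2, 1]
    }
}
```

After "cat", the input "car" still shares "ca", so only the node reached through "t" is frozen. The input "dog" shares nothing, so the whole tail of "cat" is frozen from the deepest node upward.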

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode

  private CompiledNode compileNode(UnCompiledNode<T> nodeIn, int tailLength) throws IOException {
    final long node;
    long bytesPosStart = bytes.getPosition();
    if (dedupHash != null && (doShareNonSingletonNodes || nodeIn.numArcs <= 1) && tailLength <= shareMaxTailLength) {
      if (nodeIn.numArcs == 0) {
        node = fst.addNode(this, nodeIn);
        lastFrozenNode = node;
      } else {
        node = dedupHash.add(this, nodeIn);
      }
    } else {
      node = fst.addNode(this, nodeIn);
    }

    long bytesPosEnd = bytes.getPosition();
    if (bytesPosEnd != bytesPosStart) {
      lastFrozenNode = node;
    }

    nodeIn.clear();

    final CompiledNode fn = new CompiledNode();
    fn.node = node;
    return fn;
  }

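The core of compileNode is the call to FST's addNode function to add the node. dedupHash is a hash cache used to share identical nodes; it is set aside for now. If bytesPosEnd differs from bytesPosStart, a node has been written to bytes, so lastFrozenNode is set to the current node (in fact the position of the node in the bytes buffer). Finally, compileNode creates a CompiledNode, sets its node field, and returns it.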

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode->FST::addNode

  long addNode(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn) throws IOException {
    T NO_OUTPUT = outputs.getNoOutput();

    if (nodeIn.numArcs == 0) {
      if (nodeIn.isFinal) {
        return FINAL_END_NODE;
      } else {
        return NON_FINAL_END_NODE;
      }
    }

    final long startAddress = builder.bytes.getPosition();

    final boolean doFixedArray = shouldExpand(builder, nodeIn);
    if (doFixedArray) {
      if (builder.reusedBytesPerArc.length < nodeIn.numArcs) {
        builder.reusedBytesPerArc = new int[ArrayUtil.oversize(nodeIn.numArcs, 1)];
      }
    }

    builder.arcCount += nodeIn.numArcs;

    final int lastArc = nodeIn.numArcs-1;

    long lastArcStart = builder.bytes.getPosition();
    int maxBytesPerArc = 0;
    for(int arcIdx=0;arcIdx<nodeIn.numArcs;arcIdx++) {
      final Builder.Arc<T> arc = nodeIn.arcs[arcIdx];
      final Builder.CompiledNode target = (Builder.CompiledNode) arc.target;
      int flags = 0;

      if (arcIdx == lastArc) {
        flags += BIT_LAST_ARC;
      }

      if (builder.lastFrozenNode == target.node && !doFixedArray) {
        flags += BIT_TARGET_NEXT;
      }

      if (arc.isFinal) {
        flags += BIT_FINAL_ARC;
        if (arc.nextFinalOutput != NO_OUTPUT) {
          flags += BIT_ARC_HAS_FINAL_OUTPUT;
        }
      } else {

      }

      boolean targetHasArcs = target.node > 0;

      if (!targetHasArcs) {
        flags += BIT_STOP_NODE;
      } else if (inCounts != null) {
        inCounts.set((int) target.node, inCounts.get((int) target.node) + 1);
      }

      if (arc.output != NO_OUTPUT) {
        flags += BIT_ARC_HAS_OUTPUT;
      }

      builder.bytes.writeByte((byte) flags);
      writeLabel(builder.bytes, arc.label);

      if (arc.output != NO_OUTPUT) {
        outputs.write(arc.output, builder.bytes);
      }

      if (arc.nextFinalOutput != NO_OUTPUT) {
        outputs.writeFinalOutput(arc.nextFinalOutput, builder.bytes);
      }

      if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
        builder.bytes.writeVLong(target.node);
      }

    }

    final long thisNodeAddress = builder.bytes.getPosition()-1;
    builder.bytes.reverse(startAddress, thisNodeAddress);

    builder.nodeCount++;
    final long node;
    node = thisNodeAddress;

    return node;
  }

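addNode first checks whether this is an end node with no arcs, and if so returns immediately. Next it adds numArcs to arcCount, which tracks the total number of arcs. addNode then computes the flags for each arc and writes flags and label to bytes, followed by the arc's output, final output, and target address where present; label is one letter or byte of a Term. Finally, addNode returns the node's position in bytes, which is a BytesStore.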
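The flag byte is a plain bit set, which a short sketch can make concrete. The listing above does not show the constant values, so the bit positions below are assumptions for illustration, and the class name ArcFlagsSketch and the flags helper are hypothetical. The decision logic follows the branches in addNode.

```java
final class ArcFlagsSketch {
    // Bit positions assumed for this sketch.
    static final int BIT_FINAL_ARC = 1 << 0;
    static final int BIT_LAST_ARC = 1 << 1;
    static final int BIT_TARGET_NEXT = 1 << 2;
    static final int BIT_STOP_NODE = 1 << 3;
    static final int BIT_ARC_HAS_OUTPUT = 1 << 4;
    static final int BIT_ARC_HAS_FINAL_OUTPUT = 1 << 5;

    static int flags(boolean isLastArc, boolean targetIsNext, boolean isFinal,
                     boolean hasFinalOutput, boolean targetHasArcs, boolean hasOutput) {
        int flags = 0;
        if (isLastArc) {
            flags |= BIT_LAST_ARC;            // no more arcs follow for this node
        }
        if (targetIsNext) {
            flags |= BIT_TARGET_NEXT;         // target was the node frozen just before
        }
        if (isFinal) {
            flags |= BIT_FINAL_ARC;           // an input ends on this arc
            if (hasFinalOutput) {
                flags |= BIT_ARC_HAS_FINAL_OUTPUT;
            }
        }
        if (!targetHasArcs) {
            flags |= BIT_STOP_NODE;           // target is an end node, no address is written
        }
        if (hasOutput) {
            flags |= BIT_ARC_HAS_OUTPUT;
        }
        return flags;
    }

    public static void main(String[] args) {
        int f = flags(true, false, true, false, false, true);
        System.out.println(Integer.toBinaryString(f));
    }
}
```

Because each property owns one bit, a reader can test them independently, for example skipping the target address whenever the stop-node or target-next bit is set.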

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode->NodeHash::add

  public long add(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn) throws IOException {
    final long h = hash(nodeIn);
    long pos = h & mask;
    int c = 0;
    while(true) {
      final long v = table.get(pos);
      if (v == 0) {
        final long node = fst.addNode(builder, nodeIn);
        count++;
        table.set(pos, node);
        if (count > 2*table.size()/3) {
          rehash();
        }
        return node;
      } else if (nodesEqual(nodeIn, v)) {
        return v;
      }
      pos = (pos + (++c)) & mask;
    }
  }

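dedupHash's add function first calls hash to obtain the node's hash value; hash walks every arc in the node to compute it. If an equal node is already in the table, its address is returned and nothing new is written.
Otherwise the function also uses FST's addNode to add the node, and when necessary grows the hash array through rehash.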
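The probing scheme in add can be reproduced with a small deduplicating table. In this sketch nodes are represented by strings and addresses by sequential ids, which is an assumption made to keep the example self-contained; the class name ProbingDedupSketch is hypothetical. The probe step and the two-thirds load limit follow the listing above.

```java
import java.util.ArrayList;
import java.util.List;

final class ProbingDedupSketch {
    private long[] table = new long[16];                    // 0 marks an empty slot
    private long mask = 15;
    private int count = 0;
    private final List<String> stored = new ArrayList<>();  // id - 1 -> node content

    /** Returns the id of an equal node if one exists, otherwise stores the node under a new id. */
    long add(String node) {
        long pos = node.hashCode() & mask;
        int c = 0;
        while (true) {
            long v = table[(int) pos];
            if (v == 0) {
                stored.add(node);                 // stands in for fst.addNode
                long id = stored.size();
                count++;
                table[(int) pos] = id;
                if (count > 2 * table.length / 3) {
                    rehash();
                }
                return id;
            } else if (stored.get((int) v - 1).equals(node)) {
                return v;                         // identical node already stored, share it
            }
            pos = (pos + (++c)) & mask;           // probe with a growing step
        }
    }

    private void rehash() {
        long[] old = table;
        table = new long[old.length * 2];
        mask = table.length - 1;
        for (long v : old) {
            if (v != 0) {
                long pos = stored.get((int) v - 1).hashCode() & mask;
                int c = 0;
                while (table[(int) pos] != 0) {
                    pos = (pos + (++c)) & mask;
                }
                table[(int) pos] = v;
            }
        }
    }

    public static void main(String[] args) {
        ProbingDedupSketch dedup = new ProbingDedupSketch();
        System.out.println(dedup.add("suffix-s"));   // 1
        System.out.println(dedup.add("suffix-ed"));  // 2
        System.out.println(dedup.add("suffix-s"));   // 1
    }
}
```

Adding the same content twice returns the first id again, which is how identical suffix nodes end up shared in the FST instead of being written twice.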

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->UnCompiledNode::addArc

    public void addArc(int label, Node target) {
      if (numArcs == arcs.length) {
        final Arc<T>[] newArcs = ArrayUtil.grow(arcs, numArcs+1);
        for(int arcIdx=numArcs;arcIdx<newArcs.length;arcIdx++) {
          newArcs[arcIdx] = new Arc<>();
        }
        arcs = newArcs;
      }
      final Arc<T> arc = arcs[numArcs++];
      arc.label = label;
      arc.target = target;
      arc.output = arc.nextFinalOutput = owner.NO_OUTPUT;
      arc.isFinal = false;
    }

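addArc adds one letter or byte of a Term to the arcs array of this UnCompiledNode. The if statement at the top grows the arcs array when it is full. The function then takes the next Arc from arcs in order and stores label in it; the target argument points to the next UnCompiledNode.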

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::finish

  public FST<T> finish() throws IOException {

    final UnCompiledNode<T> root = frontier[0];

    freezeTail(0);

    if (root.inputCount < minSuffixCount1 || root.inputCount < minSuffixCount2 || root.numArcs == 0) {
      if (fst.emptyOutput == null) {
        return null;
      } else if (minSuffixCount1 > 0 || minSuffixCount2 > 0) {
        return null;
      }
    } else {
      if (minSuffixCount2 != 0) {
        compileAllTargets(root, lastInput.length());
      }
    }
    fst.finish(compileNode(root, lastInput.length()).node);

    if (doPackFST) {
      return fst.pack(this, 3, Math.max(10, (int) (getNodeCount()/4)), acceptableOverheadRatio);
    } else {
      return fst;
    }
  }

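The freezeTail call at the start of finish is passed 0, which means every node maintained by the frontier array is processed. compileNode then writes the root node to bytes. The final call, FST's finish, in turn calls finish on bytes, which stores the last buffer of FST data in the member variable blocks, a list of byte arrays.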

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::finish->FST::finish

  void finish(long newStartNode) throws IOException {
    startNode = newStartNode;
    bytes.finish();
    cacheRootArcs();
  }

  public void finish() {
    if (current != null) {
      byte[] lastBuffer = new byte[nextWrite];
      System.arraycopy(current, 0, lastBuffer, 0, nextWrite);
      blocks.set(blocks.size()-1, lastBuffer);
      current = null;
    }
  }

Back in BlockTreeTermsWriter's write function, the next step is TermsWriter's finish function, which writes the information in the FST to the .tip file.

BlockTreeTermsWriter::write->TermsWriter::finish

    public void finish() throws IOException {
      if (numTerms > 0) {
        pushTerm(new BytesRef());
        pushTerm(new BytesRef());
        writeBlocks(0, pending.size());

        final PendingBlock root = (PendingBlock) pending.get(0);
        indexStartFP = indexOut.getFilePointer();
        root.index.save(indexOut);

        BytesRef minTerm = new BytesRef(firstPendingTerm.termBytes);
        BytesRef maxTerm = new BytesRef(lastPendingTerm.termBytes);

        fields.add(new FieldMetaData(fieldInfo,
                                     ((PendingBlock) pending.get(0)).index.getEmptyOutput(),
                                     numTerms,
                                     indexStartFP,
                                     sumTotalTermFreq,
                                     sumDocFreq,
                                     docsSeen.cardinality(),
                                     longsSize,
                                     minTerm, maxTerm));
      } else {

      }
    }

root.index.save(indexOut) is the call that writes the information to the .tip file.

Summary

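To summarize the overall flow of this chapter:
BlockTreeTermsWriter calls TermsWriter's write function to process every Term in every field, then calls finish to write the index information to the .tip file.
For each Term, TermsWriter's write function calls pushTerm, which writes Term information to the .tim file and the FST when needed, and then adds the Term to the pending list.
pushTerm works out when to call writeBlocks, which writes multiple Terms from pending as one Block.
writeBlocks selects the relevant Terms or sub-Blocks in the pending list, calls writeBlock to write their information, calls compileIndex to build the index, and finally removes the processed Terms or Blocks from pending.
writeBlock writes the information for the corresponding Terms or Blocks to the .tim file; the .doc, .pos, and .pay files are written earlier by writeTerm.
compileIndex uses Builder's add function to add nodes (each letter or byte of each Term) to the frontier array. The frontier array maintains UnCompiledNode nodes that form a tree, and inside add, freezeTail passes nodes that will no longer change to compileNode, which writes them into the FST structure.
Finally, BlockTreeTermsWriter writes the information in the FST to the .tip file in TermsWriter's finish function.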
