The In-Memory Structure of Lucene's Inverted Index

阿新 • Published: 2018-12-22

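Introduction

Lucene's index format is a well-worn topic and there is some material online, but most of it is dated (largely based on 3.x or 4.x) and hard to match against current code. This article walks through it again based on Lucene 6.6, partly to help my own understanding and memory.

Basic Concepts

These ideas are easy to grasp and show up clearly when reading the code.

To speed up indexing, Lucene indexes with multiple threads concurrently; whatever each thread flushes is a complete index segment.

Each indexing thread is further divided into fields. Every field has its own independent in-memory structure that records all terms appearing in that field.

Each term belongs to exactly one field (terms with the same literal value in different fields are different terms). A term is an indivisible unit: it is the result of tokenization and is what a query matches against at search time. Complete inverted-index information has to be recorded for every term.

Background

Variable-length integers (vInt): Lucene stores variable-length integers using a continuation-bit rule. In each byte, the low 7 bits hold data and the high bit says whether another byte follows. For example, 127 is stored as the single byte 0x7f, while 128 is stored as the two bytes 0x80, 0x01: the high bit of 0x80 signals that another byte follows, and 0x01 marks itself as the last byte. Together they represent 128.
Slice linked list: a slice is the basic unit of allocation in bytePool. Every slice starts out 5 bytes long. When more bytes are needed than the slice can hold, the last 4 of those 5 bytes are turned into a pointer to the next-level slice, which is allocated in bytePool. The write-up on bytePool memory allocation covers this in detail.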
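
The vInt rule above can be sketched in a few lines of Java. This is a minimal illustration written for this article, not Lucene's own code; the class and method names (VIntSketch, encode, decode) are hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    /** Encodes a non-negative int: 7 data bits per byte, high bit set when another byte follows. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus the continuation bit
            value >>>= 7;
        }
        out.write(value); // last byte: high bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // no continuation bit: this was the last byte
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {127, 128}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(v)) {
                hex.append(String.format("0x%02x ", b & 0xFF));
            }
            System.out.println(v + " -> " + hex.toString().trim());
        }
    }
}
```

Running main prints 0x7f for 127 and 0x80 0x01 for 128, matching the example in the text.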

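What the Inverted Index Has to Store

Only the core information is discussed here; the rest follows the same pattern.
- The term value itself.
- The docIds the term appears in.
- The number of times the term occurs in a document (freq, used for scoring).
- The term's positions in the tokenized document (pos, used for phrase search).
- Other data (handled like pos).

The logical structure looks roughly like this: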

|+ field(name,type)
    |+ term
        |+ docId & termFreq 
            |+ [position,offset,payload]
        |+ docId & termFreq 
            |+ [position,offset,payload]
    |+ term
    |+... 

|+ field2(name,type)
|+ ...

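How Terms Are Stored

Tokenization is ignored here; assume all tokens are already available.

Storing terms raises two questions:
1. What structure the terms are stored in.
2. How duplicate terms are handled.

To address both, Lucene uses the following storage structure: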

public int add(BytesRef bytes) {
    assert bytesStart != null : "Bytesstart is null - not initialized";
    final int length = bytes.length;
    // Find the hash slot for this term; the hash algorithm is not covered here.
    final int hashPos = findHash(bytes);
    // ids maps hashPos to the termId stored in that slot.
    int e = ids[hashPos];

    // -1 means this is a new term.
    if (e == -1) {
      // In ByteBlockPool a term is stored as: length + term bytes.
      // The length is written as a vInt of at most 2 bytes, so the space to reserve is 2 + bytes.length.
      final int len2 = 2 + bytes.length;
      if (len2 + pool.byteUpto > BYTE_BLOCK_SIZE) {
        if (len2 > BYTE_BLOCK_SIZE) {
          throw new MaxBytesLengthExceededException("bytes can be at most "
              + (BYTE_BLOCK_SIZE - 2) + " in length; got " + bytes.length);
        }
        // Pool growth is not covered here.
        pool.nextBuffer();
      }
      final byte[] buffer = pool.buffer;
      // Current write position in the pool buffer.
      final int bufferUpto = pool.byteUpto;
      // bytesStart records where each termId starts in the pool; count is the total number of terms.
      if (count >= bytesStart.length) {
        bytesStart = bytesStartArray.grow();
        assert count < bytesStart.length + 1 : "count: " + count + " len: "
            + bytesStart.length;
      }
      // Allocate the termId.
      e = count++;

      // Record where this termId starts in ByteBlockPool.
      bytesStart[e] = bufferUpto + pool.byteOffset;

      // A length below 128 fits in a one-byte vInt.
      if (length < 128) {
        // 1 byte to store length
        buffer[bufferUpto] = (byte) length;
        pool.byteUpto += length + 1;
        assert length >= 0: "Length must be positive: " + length;
        System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 1,
            length);
      } else {
        // 2 byte to store length
        buffer[bufferUpto] = (byte) (0x80 | (length & 0x7f));
        buffer[bufferUpto + 1] = (byte) ((length >> 7) & 0xff);
        pool.byteUpto += length + 2;
        System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 2,
            length);
      }
      assert ids[hashPos] == -1;
      // Record that hashPos maps to termId e.
      ids[hashPos] = e;
      // Rehash; not covered here.
      if (count == hashHalfSize) {
        rehash(2 * hashSize, true);
      }
      return e;
    }
    // Not a new term: return the existing termId encoded as -(e + 1).
    return -(e + 1);
  }
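
To make the two ideas in add() easier to see (length-prefixed term bytes in one pool, and a negative return value for duplicates), here is a stripped-down sketch. It is not Lucene's implementation: a HashMap stands in for the open-addressing hash table, a single fixed-size array stands in for ByteBlockPool, and the names TermIdSketch, add and get are hypothetical. Capacity checks are left out for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TermIdSketch {
    private final Map<String, Integer> ids = new HashMap<>(); // stand-in for the hash table
    private final byte[] pool = new byte[1 << 15];            // stand-in for one pool buffer
    private final int[] bytesStart = new int[1024];           // termId -> offset of the length prefix
    private int byteUpto = 0;
    private int count = 0;

    /** Returns the new termId, or -(termId + 1) if the term was already present. */
    int add(String term) {
        Integer existing = ids.get(term);
        if (existing != null) {
            return -(existing + 1);
        }
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int termId = count++;
        bytesStart[termId] = byteUpto;
        if (bytes.length < 128) {
            pool[byteUpto++] = (byte) bytes.length;                   // one-byte length
        } else {
            pool[byteUpto++] = (byte) (0x80 | (bytes.length & 0x7F)); // two-byte vInt length
            pool[byteUpto++] = (byte) ((bytes.length >> 7) & 0xFF);
        }
        System.arraycopy(bytes, 0, pool, byteUpto, bytes.length);
        byteUpto += bytes.length;
        ids.put(term, termId);
        return termId;
    }

    /** Reads the term text back from the pool using the length prefix. */
    String get(int termId) {
        int pos = bytesStart[termId];
        int length = pool[pos++] & 0xFF;
        if ((length & 0x80) != 0) {
            length = (length & 0x7F) | ((pool[pos++] & 0xFF) << 7);
        }
        return new String(pool, pos, length, StandardCharsets.UTF_8);
    }
}
```

A caller recovers the id of an existing term the same way the Lucene code does: if add() returns a negative value r, the termId is (-r) - 1.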

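At this point the term has been recorded. Next we need to consider how to associate the term with docIds.

How docIds Are Stored

Throughout indexing, all terms of a field share the memory pools. When storing docIds, keep in mind that one term can appear in many documents and therefore map to many docIds.

The whole term-processing flow lives in TermsHashPerField. Its add() method shows that storing the term is only the first step of indexing it.

Data Structures

With the term stored, a search request can easily find its termId. Getting from a termId to docIds is a separate mapping, and for it Lucene keeps several data structures in TermsHashPerField that play an important role while a term is being indexed.

postingsArray

This structure contains three important arrays, each recording different information:
- textStarts: intended to record where the term text itself starts in ByteBlockPool; it is not used while building the index.
- intStarts: records where the remaining information for a termId is stored in intPool; what intPool holds is explained below.
- byteStarts: records where the termId's [docId,freq] entries start in ByteBlockPool. Note that these are [docId,freq] pairs, laid out in bytePool as [docId,freq][docId,freq][docId,freq]... This start position plus the initial slice length is the start position of the position data.

BlockPool

TermsHashPerField has three block pools: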
- IntBlockPool intPool;
- ByteBlockPool bytePool;
- ByteBlockPool termBytePool;

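intPool records where the information for a termId lives in bytePool. It holds two kinds of entries:
- The end position + 1 of the [docId,freq] list.
- If position data and the like are recorded, the end position + 1 of that data.

Why are these two kinds of information written to separate places? The [docId,freq] entry is only known once a document has been fully processed, and only then is it actually written to bytePool. Position data, on the other hand, is known as each term occurrence is processed and can be written to bytePool immediately. Hence the two separate write locations.

bytePool and termBytePool hold the actual inverted-index data; the code makes it easy to see that both references point to the same object.

The Process in Detail

I will first describe in words what is about to happen, then go through the code:

New term
1. Allocate slices (memory) in bytePool for the term's [docId,freq] data and for its position data, and write each slice's start position into intPool as the end position of that data (nothing has been stored yet, so the start position serves as the end position). The two kinds of data live in separate slices in bytePool.
2. Call newTerm in FreqProxTermsWriterPerField. It sets the term's lastDocId to the current docId, sets freq to 1, and sets docCodes to the current docId << 1. The left shift leaves the lowest bit at 0, meaning a freq value follows; addTerm handles the other case. This optimization exists because most terms occur only once in a document, so spending a separate int on freq would be wasteful.
3. Write the position data to bytePool and advance the position stream's end index in intPool.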
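
As a rough check on step 1, the arithmetic below works out where a new term's text and its two slices land when everything is appended to one shared pool. It is only a sketch under simplifying assumptions: terms are ASCII, the pool never crosses a block boundary, and no slice has been grown in between. NewTermLayoutSketch and layout are hypothetical names.

```java
final class NewTermLayoutSketch {
    static final int FIRST_LEVEL_SIZE = 5; // initial slice length, as described earlier
    static final int STREAM_COUNT = 2;     // [docId,freq] stream + position stream

    /** Returns {textStart, docFreqSliceStart, positionSliceStart, nextFreeByte}. */
    static int[] layout(int byteUpto, int termLengthInBytes) {
        int prefix = termLengthInBytes < 128 ? 1 : 2; // vInt length prefix
        int textStart = byteUpto;
        int docFreqStart = textStart + prefix + termLengthInBytes;
        int positionStart = docFreqStart + FIRST_LEVEL_SIZE;
        int nextFree = docFreqStart + STREAM_COUNT * FIRST_LEVEL_SIZE;
        return new int[] {textStart, docFreqStart, positionStart, nextFree};
    }

    public static void main(String[] args) {
        int upto = 0;
        for (String term : new String[] {"lucene1", "lucene2"}) {
            int[] l = layout(upto, term.length()); // ASCII: chars == bytes
            System.out.println(term + ": text=" + l[0] + " docFreq=" + l[1] + " pos=" + l[2]);
            upto = l[3];
        }
    }
}
```

For lucene1 followed by lucene2 this gives text starts 0 and 18 and [docId,freq] slice starts 8 and 26, which are the textStarts and byteStarts values that appear in the breakpoint tables later in the article.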

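Existing term
1. Call addTerm in FreqProxTermsWriterPerField. It first checks whether the docId being processed equals the docId this term last saw. If it does, the same term has been tokenized again from the same document: freq is incremented and docId is left alone. If not, the previous document is finished and everything buffered for it should be flushed to the memory pool. The steps below follow the second case.
2. A different docId means the previous document has just finished, so everything currently buffered belongs to it. If the term frequency is 1, there is no need to write freq: set the lowest bit of docCodes to 1 and write docCodes alone. Otherwise write docCodes as is (its lowest bit is 0, as set in newTerm) followed by freq.
3. Once that is written, the previous document is done and processing of the current one begins. termFreq is set to 1, since this is the term's first occurrence in the current document, and docCodes is set to the docId delta shifted left by one bit so that the lowest bit is 0, as in newTerm.
4. Write the position data, as in newTerm.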
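
The docCode and freq encoding in these steps can be tried out in isolation. The sketch below keeps the values as plain ints instead of writing vInts into a byte pool, and it encodes every document at once, whereas the real code leaves the most recent document buffered until the next one arrives. DocFreqCodeSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class DocFreqCodeSketch {
    /** Encodes ascending docIds and their term frequencies into the int sequence described above. */
    static List<Integer> encode(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int lastDocId = 0;
        for (int i = 0; i < docIds.length; i++) {
            int docCode = (docIds[i] - lastDocId) << 1; // delta, shifted to free the lowest bit
            if (freqs[i] == 1) {
                out.add(docCode | 1); // lowest bit 1: freq is implicitly 1
            } else {
                out.add(docCode);     // lowest bit 0: freq follows
                out.add(freqs[i]);
            }
            lastDocId = docIds[i];
        }
        return out;
    }

    /** Reverses encode(); each element of the result is {docId, freq}. */
    static List<int[]> decode(List<Integer> codes) {
        List<int[]> postings = new ArrayList<>();
        int docId = 0;
        for (int i = 0; i < codes.size(); i++) {
            int code = codes.get(i);
            docId += code >>> 1;
            int freq = (code & 1) != 0 ? 1 : codes.get(++i);
            postings.add(new int[] {docId, freq});
        }
        return postings;
    }
}
```

For docIds 0, 1, 2 with frequencies 1, 2, 4, encode() yields 1, 2, 2, 2, 4: a single int for the first document and a docCode plus freq for the other two.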

That gives a rough picture of how terms are tied to docIds and how the data is stored. Talk only goes so far, so let's look at how the code actually does it:

The add() method in TermsHashPerField:

// Add the term and get its termId.
int termID = bytesHash.add(termAtt.getBytesRef());

// A non-negative termId means this is a new term.
if (termID >= 0) {// New posting

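      // This call appears to have no effect here.
      bytesHash.byteStart(termID);
      // numPostingInt is the number of ints needed in intPool to record the term; grow intPool if there is not enough room.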
      if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
        intPool.nextBuffer();
      }

      // Likewise, check whether bytePool needs to grow: the term needs numPostingInt slices in bytePool, each FIRST_LEVEL_SIZE bytes initially.
      if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt*ByteBlockPool.FIRST_LEVEL_SIZE) {
        bytePool.nextBuffer();
      }

      intUptos = intPool.buffer;
      intUptoStart = intPool.intUpto;
      intPool.intUpto += streamCount;

      // intStarts records where the term's entries sit in intPool.
      postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset;

      // Allocate one slice per stream and record its end position; streamCount presumably corresponds to numPostingInt.
      for(int i=0;i<streamCount;i++) {
        final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
        intUptos[intUptoStart+i] = upto + bytePool.byteOffset;
      }
      // Record where the [docId,freq] list starts. intPool is meant to hold the end position, but nothing has been written yet, so start and end coincide.
      postingsArray.byteStarts[termID] = intUptos[intUptoStart];

      // Call newTerm, implemented by FreqProxTermsWriterPerField.
      newTerm(termID);

    } else {
      termID = (-termID)-1;
      int intStart = postingsArray.intStarts[termID];
      // Set up the pool cursors for this term.
      intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
      intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
      // Call addTerm, implemented by FreqProxTermsWriterPerField.
      addTerm(termID);
    }
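
The else branch above turns the global address in intStarts back into a buffer index and an offset within that buffer. A small sketch of that arithmetic follows. The block size is an assumption made for illustration (8192 ints, shift 13); check IntBlockPool in your Lucene version for the real constants. IntPoolAddressSketch is a hypothetical name.

```java
final class IntPoolAddressSketch {
    // Assumed for this sketch: blocks of 8192 ints. Not taken from the code shown above.
    static final int INT_BLOCK_SHIFT = 13;
    static final int INT_BLOCK_SIZE = 1 << INT_BLOCK_SHIFT;
    static final int INT_BLOCK_MASK = INT_BLOCK_SIZE - 1;

    /** Global address as stored in intStarts: offset of the current buffer plus position inside it. */
    static int globalAddress(int intOffset, int intUpto) {
        return intOffset + intUpto;
    }

    /** Which buffer a global address falls in. */
    static int bufferIndex(int intStart) {
        return intStart >> INT_BLOCK_SHIFT;
    }

    /** Position inside that buffer. */
    static int offsetInBuffer(int intStart) {
        return intStart & INT_BLOCK_MASK;
    }

    public static void main(String[] args) {
        int address = globalAddress(INT_BLOCK_SIZE, 3); // 4th int of the second buffer
        System.out.println(bufferIndex(address) + " / " + offsetInBuffer(address));
    }
}
```

Because the block size is a power of two, the shift and mask are just a cheap way of dividing by the block size and taking the remainder.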

The newTerm() method in FreqProxTermsWriterPerField:

void newTerm(final int termID) {
    final FreqProxPostingsArray postings = freqProxPostingsArray;

    // The docId this term last saw is the current docId.
    postings.lastDocIDs[termID] = docState.docID;
    // freq is not recorded; only the docId chain needs to be maintained.
    if (!hasFreq) {
      assert postings.termFreqs == null;
      postings.lastDocCodes[termID] = docState.docID;
    } else {
      // Record the docId chain, shifted left one bit; the lowest bit indicates that freq follows.
      postings.lastDocCodes[termID] = docState.docID << 1;
      postings.termFreqs[termID] = 1;
      // Write position data and the like.
      if (hasProx) {
        writeProx(termID, fieldState.position);
        if (hasOffsets) {
          writeOffsets(termID, fieldState.offset);
        }
      } else {
        assert !hasOffsets;
      }
    }
    fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
    fieldState.uniqueTermCount++;
  }

The addTerm() method in FreqProxTermsWriterPerField:

void addTerm(final int termID) {
    final FreqProxPostingsArray postings = freqProxPostingsArray;

    assert !hasFreq || postings.termFreqs[termID] > 0;

    // The no-freq case is simple and not covered here.
    if (!hasFreq) {
      assert postings.termFreqs == null;
      if (docState.docID != postings.lastDocIDs[termID]) {
        // New document; now encode docCode for previous doc:
        assert docState.docID > postings.lastDocIDs[termID];
        writeVInt(0, postings.lastDocCodes[termID]);
        postings.lastDocCodes[termID] = docState.docID - postings.lastDocIDs[termID];
        postings.lastDocIDs[termID] = docState.docID;
        fieldState.uniqueTermCount++;
      }
    } else if (docState.docID != postings.lastDocIDs[termID]) {
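      // The current docId differs from the last one, so the previous document is finished and its data must be written.
      // If freq is 1, set the lowest bit of lastDocCodes to 1, meaning no freq follows; this saves the byte that would hold freq.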
      if (1 == postings.termFreqs[termID]) {
        writeVInt(0, postings.lastDocCodes[termID]|1);
      } else {
        // Otherwise write both docCodes and freq; the lowest bit of docCodes is 0 here.
        writeVInt(0, postings.lastDocCodes[termID]);
        writeVInt(0, postings.termFreqs[termID]);
      }
      // The previous document is done; start recording the new one, much as newTerm() does.
      postings.termFreqs[termID] = 1;
      fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
      // The docId chain is stored as deltas, again to save memory.
      postings.lastDocCodes[termID] = (docState.docID - postings.lastDocIDs[termID]) << 1;
      postings.lastDocIDs[termID] = docState.docID;
      if (hasProx) {
        writeProx(termID, fieldState.position);
        if (hasOffsets) {
          postings.lastOffsets[termID] = 0;
          writeOffsets(termID, fieldState.offset);
        }
      } else {
        assert !hasOffsets;
      }
      fieldState.uniqueTermCount++;
    } else {
      // Reaching here means the same term occurred more than once in the same field of the same document; only the position data needs to be written in addition.
      fieldState.maxTermFrequency = Math.max(fieldState.maxTermFrequency, ++postings.termFreqs[termID]);
      if (hasProx) {
        writeProx(termID, fieldState.position-postings.lastPositions[termID]);
        if (hasOffsets) {
          writeOffsets(termID, fieldState.offset);
        }
      }
    }
  }

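At this point all of a document's information has been linked together and written to memory. What remains is flushing it to disk files at the right time, which this article does not cover. To aid understanding, let's take a small index and look at the memory pools described above.

In Practice

We use the following simple index as an example and look at what its in-memory structure actually is.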

    private Document getDocument(String value) throws Exception {
        Document doc = new Document();
        FieldType fieldType = new FieldType();
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        fieldType.setTokenized(true);
        Field pathField = new Field("name", value, fieldType);
        // Add the field to the document
        doc.add(pathField);
        return doc;
    }

    // Build the index
    public void writeToIndex() throws Exception {
        // The data to be indexed
        Document document = getDocument("lucene1");
        writer.addDocument(document);
        // breakpoint1
        document = getDocument("lucene2 lucene2");
        writer.addDocument(document);
        // breakpoint2
        document = getDocument("lucene2 lucene2 test lucene2 lucene2");
        writer.addDocument(document);
        // breakpoint3
    }

breakpoint1

Index	postingsArray.byteStarts	intPool	bytePool
0	8	8	7
1	0	14	108
2	0	0	117
3	0	0	99
4	0	0	101
5	0	0	110
6	0	0	101
7	0	0	49
8	0	0	0
9	0	0	0
10	0	0	0
11	0	0	0
12	0	0	16
13	0	0	0
14	0	0	0
15	0	0	0
16	0	0	0
17	0	0	16

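At this breakpoint only one term has appeared: lucene1, with termId 0.

textStarts[0] = 0 means the term text starts at index 0 of bytePool. bytePool[0] = 7 is the term length, and bytePool[1~7] holds the term text.

Indices 8~12 are the first slice, used for [docId,freq]; the final byte, 16, marks the end of the slice and shows it has not been extended.

Indices 13~17 are the second slice, used for position data; the final byte, 16, again shows it has not been extended.

Next, intStarts[0] = 0 means the term's entries start at index 0 of intPool. Because position data is recorded, the term takes two entries in intPool: intPool[0] and intPool[1] give the end position + 1 in bytePool of the term's [docId,freq] data and position data respectively.

byteStarts[0] = 8 means the term's [docId,freq] data starts at byte 8 of bytePool.

intPool[0] = 8 is the end position + 1 of the [docId,freq] data in bytePool. There is clearly one document, so why does intPool[0] equal byteStarts[0], as if nothing were stored? Although the first document has been processed, no other document has yet contained lucene1, so its entry has not been written to bytePool and intPool[0] has not moved; the data still sits in the term's docCodes and freq arrays.

intPool[1] = 14 means the position data ends at 14, which is the next write position. The number of entries in this data can be computed from the [docId,freq] entries: every occurrence of the term stores one, so there are sum(freq) of them. The value stored here, bytePool[13], is 0. It has two parts: the lowest bit is 0, meaning nothing further follows it, and the remaining 7 bits are 0, meaning the term is the first token of the field's value.

That covers everything at breakpoint1.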

breakpoint2

Index	postingsArray.textStarts	postingsArray.intStarts	postingsArray.byteStarts	intPool	bytePool
0	0	0	8	8	7
1	18	2	26	14	108
2	0	0	0	26	117
3	0	0	0	33	99
4	0	0	0	0	101
5	0	0	0	0	110
6	0	0	0	0	101
7	0	0	0	0	49
8	0	0	0	0	0
9	0	0	0	0	0
10	0	0	0	0	0
11	0	0	0	0	0
12	0	0	0	0	16
13	0	0	0	0	0
14	0	0	0	0	0
15	0	0	0	0	0
16	0	0	0	0	0
17	0	0	0	0	16
18	0	0	0	0	7
19	0	0	0	0	108
20	0	0	0	0	117
21	0	0	0	0	99
22	0	0	0	0	101
23	0	0	0	0	110
24	0	0	0	0	101
25	0	0	0	0	50
26	0	0	0	0	0
27	0	0	0	0	0
28	0	0	0	0	0
29	0	0	0	0	0
30	0	0	0	0	16
31	0	0	0	0	0
32	0	0	0	0	2
33	0	0	0	0	0
34	0	0	0	0	0
35	0	0	0	0	16

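At this breakpoint lucene2 has termId 1.

textStarts[1] = 18 means the term text starts at index 18 of bytePool. bytePool[18] = 7 is the term length, and bytePool[19~25] holds the term text.

Indices 26~30 are the first slice, used for [docId,freq]; the final byte, 16, shows it has not been extended.

Indices 31~35 are the second slice, used for position data; the final byte, 16, shows it has not been extended.

Next, intStarts[1] = 2 means the term's entries start at index 2 of intPool. Because position data is recorded, the term takes two entries in intPool: intPool[2] and intPool[3] give the end position + 1 in bytePool of the term's [docId,freq] data and position data respectively.

byteStarts[1] = 26 means the term's [docId,freq] data starts at byte 26 of bytePool.

intPool[2] = 26 is the end position + 1 of the [docId,freq] data in bytePool. It equals byteStarts[1] for the same reason as with lucene1.

intPool[3] = 33 means the position data ends at 33. bytePool[31] = 0 says the first occurrence is at position 0 in the token list, with nothing following. bytePool[32] = 2 says the second occurrence is at position 1 (a delta of 1 shifted left by one bit), again with nothing following.

That covers everything at breakpoint2.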

breakpoint3

Index	postingsArray.textStarts	postingsArray.intStarts	postingsArray.byteStarts	intPool	bytePool
0	0	0	8	8	7
1	18	2	26	14	108
2	36	4	41	28	117
3	0	0	0	56	99
4	0	0	0	41	101
5	0	0	0	47	110
6	0	0	0	0	101
7	0	0	0	0	49
8	0	0	0	0	0
9	0	0	0	0	0
10	0	0	0	0	0
11	0	0	0	0	0
12	0	0	0	0	16
13	0	0	0	0	0
14	0	0	0	0	0
15	0	0	0	0	0
16	0	0	0	0	0
17	0	0	0	0	16
18	0	0	0	0	7
19	0	0	0	0	108
20	0	0	0	0	117
21	0	0	0	0	99
22	0	0	0	0	101
23	0	0	0	0	110
24	0	0	0	0	101
25	0	0	0	0	50
26	0	0	0	0	2
27	0	0	0	0	2
28	0	0	0	0	0
29	0	0	0	0	0
30	0	0	0	0	16
31	0	0	0	0	0
32	0	0	0	0	0
33	0	0	0	0	0
34	0	0	0	0	0
35	0	0	0	0	51
36	0	0	0	0	4
37	0	0	0	0	116
38	0	0	0	0	101
39	0	0	0	0	115
40	0	0	0	0	116
41	0	0	0	0	0
42	0	0	0	0	0
43	0	0	0	0	0
44	0	0	0	0	0
45	0	0	0	0	16
46	0	0	0	0	4
47	0	0	0	0	0
48	0	0	0	0	0
49	0	0	0	0	0
50	0	0	0	0	16
51	0	0	0	0	2
52	0	0	0	0	0
53	0	0	0	0	2
54	0	0	0	0	4
55	0	0	0	0	2
56	0	0	0	0	0
57	0	0	0	0	0
58	0	0	0	0	0
59	0	0	0	0	0
60	0	0	0	0	0
61	0	0	0	0	0
62	0	0	0	0	0
63	0	0	0	0	0
64	0	0	0	0	17
65	0	0	0	0	0
66	0	0	0	0	0
67	0	0	0	0	0

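At this breakpoint lucene2 is a term that has appeared before, so the data for docId 1 is flushed to bytePool. test is a new term; it is stored separately and gets its own slices.

The field yields 5 tokens: lucene2, lucene2, test, lucene2, lucene2. Let's go through how each one is written to bytePool.

First lucene2

This is an existing term with termId = 1. addTerm finds that the last docId was 1 and the current one is 2, so it first flushes the previous document's data to bytePool.
The previous docId is 1 and termFreq = 2, so freq must follow. The docId shifted left by one bit is written to bytePool as is, followed by freq. freq is written as a vInt, but since freq = 2 it needs only one byte, so the value written is 2.
The current write position comes from intPool: intPool[2] = 26. Byte 26 therefore gets 2, meaning docId 1 with freq following; byte 27 gets 2, meaning freq = 2; and the end of the [docId,freq] data is set to 28.
Then lastDocId and related fields are updated, and the position data for the new occurrence is written.

Second lucene2

Nothing special here: a regular addTerm that increments freq and writes position data. The position list now occupies indices 31~34 with values 0, 2, 0, 2.

test

A new term appears and is handled like earlier new terms: the term text is written (bytePool indices 36~40), a slice is allocated for [docId,freq] (41~45), and a slice is allocated for position data and written to (46~50). The value written is 4: the lowest bit is 0, meaning nothing further follows, and shifting right by one bit gives 2, i.e. position 2 in the token stream. The position data therefore ends at 47, while the [docId,freq] data has not been flushed to bytePool yet and still ends at 41.

Third lucene2

addTerm runs as usual, but the position data would have to be written at index 35, whose value 16 marks the end of the slice, so nothing can be written there. The slice has to grow. A new slice is allocated starting at index 51, and the four bytes 32~35 are combined into a pointer to 51, so 32~34 become 0 and 35 holds 51. The values previously at 32~34 are copied to 51~53, which therefore hold 2, 0, 2. The new token is at position 3 in the token list and the previous lucene2 was at position 1, so the delta to write is 2; shifted left by one bit with the lowest bit 0 (nothing further follows), the value written at index 54 is 4.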
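
The slice growth just described can be replayed on a small zero-filled array. The sketch below is my own reduction of the mechanism, not Lucene's ByteBlockPool: it only knows the two slice sizes visible in the table (5 and 14 bytes), uses a single buffer, and the names SliceGrowthSketch, newSlice and writeByte are hypothetical.

```java
final class SliceGrowthSketch {
    // Sizes of the first two slice levels, read off the table above; larger levels are omitted.
    static final int[] LEVEL_SIZE = {5, 14};

    final byte[] buf = new byte[256]; // zero-filled, like a fresh pool buffer
    int upto = 0;                      // next unallocated byte

    /** Allocates a level-0 slice and returns its start. */
    int newSlice() {
        int start = upto;
        upto += LEVEL_SIZE[0];
        buf[upto - 1] = 16; // end marker: 16 | level 0
        return start;
    }

    /** Writes one byte at pos, growing the slice if pos sits on an end marker. Returns the next write position. */
    int writeByte(int pos, int value) {
        if (buf[pos] != 0) { // unwritten data bytes are 0, so non-zero means end marker
            pos = grow(pos);
        }
        buf[pos] = (byte) value;
        return pos + 1;
    }

    private int grow(int end) {
        int level = buf[end] & 15;
        int newLevel = Math.min(level + 1, LEVEL_SIZE.length - 1);
        int newStart = upto;
        upto += LEVEL_SIZE[newLevel];
        // Move the three data bytes in front of the marker into the new slice.
        System.arraycopy(buf, end - 3, buf, newStart, 3);
        // Overwrite those three bytes and the marker with the new slice's address.
        buf[end - 3] = (byte) (newStart >>> 24);
        buf[end - 2] = (byte) (newStart >>> 16);
        buf[end - 1] = (byte) (newStart >>> 8);
        buf[end] = (byte) newStart;
        buf[upto - 1] = (byte) (16 | newLevel); // end marker of the new slice
        return newStart + 3;
    }

    public static void main(String[] args) {
        SliceGrowthSketch pool = new SliceGrowthSketch();
        int pos = pool.newSlice();
        for (int code : new int[] {0, 2, 0, 2, 4, 2}) { // lucene2's position codes
            pos = pool.writeByte(pos, code);
        }
        System.out.println(java.util.Arrays.toString(java.util.Arrays.copyOf(pool.buf, pool.upto)));
    }
}
```

Writing lucene2's six position codes into a slice that starts at index 0 reproduces the pattern in the table, shifted down by 31: index 4 holds the forwarding address 5, indices 5~9 hold 2, 0, 2, 4, 2, and the new slice ends with the marker 17.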

Fourth lucene2

Same as the second lucene2: 2 is written at index 55 and the end of the position data becomes 56.

That covers everything at breakpoint3.
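
To double-check the position bytes read off the tables, the following sketch decodes a position stream back into absolute positions. It assumes what the walkthrough above shows: each code is a delta shifted left by one bit, deltas restart at 0 for every document, and the lowest bit is always 0 because no payloads are stored. PositionStreamSketch and decode are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionStreamSketch {
    /**
     * Decodes position codes into absolute positions, one list per document.
     * codes: position stream values in write order; freqs: term frequency of each document, in order.
     */
    static List<List<Integer>> decode(int[] codes, int[] freqs) {
        List<List<Integer>> perDoc = new ArrayList<>();
        int next = 0;
        for (int freq : freqs) {
            List<Integer> positions = new ArrayList<>();
            int position = 0; // deltas restart for every document
            for (int i = 0; i < freq; i++) {
                int code = codes[next++];
                position += code >>> 1; // drop the flag bit, add the delta
                positions.add(position);
            }
            perDoc.add(positions);
        }
        return perDoc;
    }

    public static void main(String[] args) {
        // lucene2: bytes 31 and 51~55 of bytePool at breakpoint3, with freq 2 and 4.
        System.out.println(decode(new int[] {0, 2, 0, 2, 4, 2}, new int[] {2, 4}));
    }
}
```

For lucene2 this prints [[0, 1], [0, 1, 3, 4]]: positions 0 and 1 in "lucene2 lucene2", and positions 0, 1, 3, 4 in "lucene2 lucene2 test lucene2 lucene2".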

The End

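That covers how Lucene builds its inverted index and what its in-memory structure looks like. Complex structures are complex for a reason: Lucene's design aims to save memory, using every byte as far as possible so that more fits in memory.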
