King of Speed: The LZ4 Compression Algorithm (Part 2)

LZ4 (Extremely Fast Compression algorithm)

LZ4 author: Yann Collet

Article author: zhangskd @ csdn blog

LZ4 format

The compressed block is composed of sequences.

Each data block is compressed into a number of sequences, in the following format:

(1) literals

length of literals. If it is 0, there are no literals. If it is 15, more bytes follow to indicate the full length. Each additional byte represents a value from 0 to 255, which is added to the previous value to produce the total length. When the byte value is 255, another byte follows.

literals are uncompressed bytes, to be copied as-is.
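The length rule above can be sketched as a pair of small helpers. This is only an illustration of the encoding: `encode_length` and `decode_length` are hypothetical names that do not appear in the LZ4 source, and the caller is assumed to provide an output buffer of at least `length / 255 + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the 4-bit field value for `length` and writes
 * the extension bytes (if any) to out. *extra receives how many were written. */
static unsigned encode_length(size_t length, uint8_t *out, size_t *extra)
{
    size_t n = 0;
    assert(out != NULL && extra != NULL);
    if (length < 15) { *extra = 0; return (unsigned)length; }
    length -= 15;                       /* the field itself already carries 15 */
    while (length >= 255) { out[n++] = 255; length -= 255; }
    out[n++] = (uint8_t)length;         /* final byte is < 255 and may be 0 */
    *extra = n;
    return 15;
}

/* Hypothetical inverse: rebuilds the length from the field and extension bytes. */
static size_t decode_length(unsigned field, const uint8_t *in, size_t *used)
{
    size_t length = field, n = 0;
    assert(in != NULL && used != NULL);
    if (field == 15) {
        uint8_t b;
        do { b = in[n++]; length += b; } while (b == 255);
    }
    *used = n;
    return length;
}
```

Note that a length of exactly 15 still needs one extension byte with value 0, because the field value 15 always means "more bytes follow".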

(2) match

offset. It represents the position of the match to be copied from.

Note that 0 is an invalid value, never used. 1 means "current position - 1 byte".

The maximum offset value is 65535. The value is stored in 2 bytes, in little-endian format.

matchlength. A base length is added to the stored value: the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19 or more bytes, with the extra bytes encoded in the same way as for the literal length.
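Putting the literal and match fields together, a block can be decoded with a short loop. The following is a minimal sketch written from the format description above, not the decoder shipped with LZ4; the function name `lz4_block_decode_sketch` is hypothetical, and it favors bounds checking over speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decodes one block from src into dst. Returns the number of bytes produced,
 * or -1 on malformed input or insufficient output space. */
static long lz4_block_decode_sketch(const uint8_t *src, size_t srcLen,
                                    uint8_t *dst, size_t dstCap)
{
    size_t ip = 0, op = 0;
    assert(src != NULL && dst != NULL);
    while (ip < srcLen) {
        unsigned token = src[ip++];

        /* literal length: high 4 bits, plus extension bytes when it is 15 */
        size_t litLen = token >> 4;
        if (litLen == 15) {
            uint8_t b;
            do {
                if (ip >= srcLen) return -1;
                b = src[ip++];
                litLen += b;
            } while (b == 255);
        }
        if (litLen > srcLen - ip || litLen > dstCap - op) return -1;
        memcpy(dst + op, src + ip, litLen);   /* literals are copied as-is */
        op += litLen; ip += litLen;

        if (ip == srcLen) break;              /* the last sequence has literals only */

        /* offset: 2 bytes, little endian; 0 is invalid */
        if (srcLen - ip < 2) return -1;
        size_t offset = (size_t)src[ip] | ((size_t)src[ip + 1] << 8);
        ip += 2;
        if (offset == 0 || offset > op) return -1;

        /* match length: low 4 bits + minmatch (4), same extension scheme */
        size_t matchLen = token & 15;
        if (matchLen == 15) {
            uint8_t b;
            do {
                if (ip >= srcLen) return -1;
                b = src[ip++];
                matchLen += b;
            } while (b == 255);
        }
        matchLen += 4;
        if (matchLen > dstCap - op) return -1;
        /* byte-by-byte copy, so overlapping matches (offset < length) work */
        for (size_t i = 0; i < matchLen; i++, op++) dst[op] = dst[op - offset];
    }
    return (long)op;
}
```

For example, the 12-byte block `35 'a' 'b' 'c' 03 00 50 'x' 'y' 'z' 'z' 'y'` holds two sequences: three literals followed by a 9-byte match at offset 3, then five trailing literals. It decodes to the 17 bytes `abcabcabcabcxyzzy`.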

(3) rules

1. The last 5 bytes are always literals.

2. The last match cannot start within the last 12 bytes.

So a file with fewer than 13 bytes can only be represented as literals.

(4) scan strategy

a single-cell wide hash table.

Each position in the input data block gets "hashed", using its first 4 bytes (minmatch). The position is then stored at the hashed index. Obviously, the smaller the table, the more collisions we get, reducing compression effectiveness. The decoder does not care which method was used to find matches, and requires no additional memory.

(5) Streaming format

Implementation

(1) Hash table

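LZ4 uses a hash table to find matching strings. The table maps (key, value) as follows:

The key is the binary value of 4 bytes, hashed down to a table index.

The value is the position of those 4 bytes within the block.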

/* Default value is 14, for 16KB, which nicely fits into Intel x86 L1 cache.
 * Increasing memory usage improves compression ratio
 * Reduced memory usage can improve speed, due to cache effect
 */

#define MEMORY_USAGE 14
#define LZ4_HASHLOG (MEMORY_USAGE - 2) /* hash index width: 12 bits */
#define HASHTABLESIZE (1 << MEMORY_USAGE) /* hash table size: 2^14 bytes = 16KB */
#define HASHNBCELLS4 (1 << LZ4_HASHLOG) /* number of 4-byte cells: 2^12 = 4K */

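Choosing the hash table size involves a trade-off:

1. To favor compression ratio, the hash table can be larger.

2. To favor compression speed, the hash table should be moderate in size, so that it fits in the L1 cache.

The default hash table uses 16KB of memory and fits in the L1 cache, which is one reason LZ4 compresses so quickly.

The L1 data cache of current mainstream Intel x86 processors is 32KB, so the hash table should not exceed that size.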

typedef enum { byPtr, byU32, byU16} tableType_t;

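The hash table stores "positions", in one of three ways:

1. When inputSize is below the 64KB limit (LZ4_64KLIMIT), byU16 is used: a 16-bit offset is enough.

2. When inputSize is at or above that limit:

    2.1 If pointers are 8 bytes wide, byU32 is used and the table stores 32-bit offsets, because storing full pointers would waste space.

    2.2 If pointers are 4 bytes wide, byPtr is used and the table stores 32-bit pointers.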
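The selection rule can be written out as a small function. This is a sketch under assumptions: `choose_table_type` and `cells_for` are hypothetical helpers, the pointer size is passed in as a parameter so that both cases can be exercised on one machine, and the constants are copied from the definitions quoted in this article.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { byPtr, byU32, byU16 } tableType_t;

#define MFLIMIT 12
#define LZ4_64KLIMIT ((1 << 16) + (MFLIMIT - 1))   /* 64K + 11 */

/* Hypothetical helper: picks how positions are stored in the hash table. */
static tableType_t choose_table_type(int inputSize, size_t pointerSize)
{
    assert(pointerSize == 4 || pointerSize == 8);
    if (inputSize < LZ4_64KLIMIT) return byU16;     /* 16-bit offsets suffice */
    return (pointerSize == 8) ? byU32 : byPtr;      /* keep every cell at 4 bytes */
}

/* Hypothetical helper: number of cells that fit in tableBytes bytes. */
static size_t cells_for(tableType_t t, size_t tableBytes, size_t pointerSize)
{
    size_t cell = (t == byU16) ? 2 : (t == byU32) ? 4 : pointerSize;
    assert(pointerSize == 4 || pointerSize == 8);
    return tableBytes / cell;
}
```

With a 16KB table, byU16 yields 8192 cells while the other two yield 4096, which is why the 16-bit variant can use one extra bit of hash.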

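An integer (multiplicative) hash is used.

2654435761U is a prime close to 2^32 divided by the golden ratio: 2654435761 / 4294967296 = 0.618033987.

The hash takes 4 bytes as input. The width of the result depends on the table type: 13 bits when the table stores 2-byte values, 12 bits when it stores 4-byte values.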

FORCE_INLINE int LZ4_hashSequence(U32 sequence, tableType_t tableType)
{
    if (tableType == byU16)
        /* 16KB table with 16-bit values => 8K cells, so the hash index is 13 bits */
        return (((sequence) * 2654435761U) >> ((MINMATCH * 8) - (LZ4_HASHLOG + 1)));
    else
        /* 16KB table with 32-bit values => 4K cells, so the hash index is 12 bits */
        return (((sequence) * 2654435761U) >> ((MINMATCH * 8) - LZ4_HASHLOG));
}

FORCE_INLINE int LZ4_hashPosition(const BYTE *p, tableType_t tableType)
    { return LZ4_hashSequence(A32(p), tableType); }
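To make the bit arithmetic concrete, here is a standalone version of the same multiplicative hash using only standard types. It is a sketch: `hash_sequence` is a hypothetical name, and the boolean `tableIsU16` stands in for the tableType_t argument of the original.

```c
#include <assert.h>
#include <stdint.h>

#define MINMATCH 4
#define LZ4_HASHLOG 12

/* Keeps the top 12 bits of the 32-bit product (13 bits for 16-bit tables). */
static uint32_t hash_sequence(uint32_t sequence, int tableIsU16)
{
    unsigned bits = LZ4_HASHLOG + (tableIsU16 ? 1u : 0u);
    uint32_t h = (uint32_t)(sequence * 2654435761U) >> (MINMATCH * 8 - bits);
    assert(h < (1u << bits));   /* the index always fits the table */
    return h;
}
```

For the input value 1 the product is the constant itself (0x9E3779B1), so the 12-bit hash is its top 12 bits, 0x9E3 = 2531, and the 13-bit hash is 5062. The 13-bit hash is always the 12-bit hash with one more low-order bit appended.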

Store an address in the hash table.

FORCE_INLINE void LZ4_putPositionOnHash(const BYTE *p, U32 h, void *tableBase, tableType_t tableType,
       const BYTE *srcBase)
{
    switch(tableType)
    {
    case byPtr: { const BYTE **hashTable = (const BYTE **) tableBase; hashTable[h] = p; break; }
    case byU32: { U32 *hashTable = (U32 *) tableBase; hashTable[h] = (U32) (p - srcBase); break; }
    case byU16: { U16 *hashTable = (U16 *) tableBase; hashTable[h] = (U16) (p - srcBase); break; }
    }
}

Compute the hash of the 4 bytes at pointer p, then store p's position in the hash table.

FORCE_INLINE void LZ4_putPosition(const BYTE *p, void *tableBase, tableType_t tableType, const BYTE *srcBase)
{
    U32 h = LZ4_hashPosition(p, tableType); /* compute the hash */
    LZ4_putPositionOnHash(p, h, tableBase, tableType, srcBase); /* store the address in the hash table */
}

Look up the address stored for a given hash value.

FORCE_INLINE const BYTE *LZ4_getPositionOnHash(U32 h, void *tableBase, tableType_t tableType,
    const BYTE *srcBase)
{
    if (tableType == byPtr) { const BYTE **hashTable = (const BYTE **) tableBase; return hashTable[h]; }
    if (tableType == byU32) { U32 *hashTable = (U32 *) tableBase; return hashTable[h] + srcBase; }
    { U16 *hashTable = (U16 *) tableBase; return hashTable[h] + srcBase; } /* default, to ensure a return */
}

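Compute the hash of the 4 bytes at pointer p and look up whether that hash bucket has already been assigned.

If it has, the current 4 bytes are very likely the same as the 4 bytes stored earlier (in the case of a collision, they differ).

FORCE_INLINE const BYTE *LZ4_getPosition(const BYTE *p, void *tableBase, tableType_t tableType, const BYTE *srcBase)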
{
    U32 h = LZ4_hashPosition(p, tableType);
    return LZ4_getPositionOnHash(h, tableBase, tableType, srcBase);
}

(2) Compression

LZ4_compress() is the entry point for compression: it allocates the hash table, then calls LZ4_compress_generic().

/* LZ4_compress():
 * Compress inputSize bytes from source into dest.
 * Destination buffer must be already allocated, and must be sized to handle worst-case situations
 * (input data not compressible)
 * Worst case size evaluation is provided by function LZ4_compressBound()
 * inputSize: Max supported value is LZ4_MAX_INPUT_SIZE
 * return: the number of bytes written in buffer dest or 0 if the compression fails.
 */

int LZ4_compress(const char *source, char *dest, int inputSize)
{
#if (HEAPMODE) /* allocate the hash table on the heap */
    void *ctx = ALLOCATOR(HASHNBCELLS4, 4); /* Aligned on 4-bytes boundaries */
#else /* allocate the hash table on the stack: faster, and the default */
    U32 ctx[1U << (MEMORY_USAGE - 2) ] = {0}; /* Ensure data is aligned on 4-bytes boundaries */
#endif
    int result;

    /* Inputs smaller than 64K+11 use 16-bit positions in the hash table; larger inputs use 32-bit ones */
    if (inputSize < (int) LZ4_64KLIMIT)
        result = LZ4_compress_generic((void *)ctx, source, dest, inputSize, 0, notLimited, byU16, noPrefix);
    else
        result = LZ4_compress_generic((void *)ctx, source, dest, inputSize, 0, notLimited,
                            (sizeof(void *) == 8) ? byU32 : byPtr, noPrefix);

#if (HEAPMODE)
    FREE(ctx);
#endif
    return result;
}
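The header comment of LZ4_compress() requires dest to be sized for the worst case. A sketch of that bound follows, assuming the formula used by the LZ4_COMPRESSBOUND macro in lz4.h (input size + input size / 255 + 16); verify it against the lz4.h you actually build with. `worst_case_size` and `worst_case_examples` are hypothetical names.

```c
#include <assert.h>

#define LZ4_MAX_INPUT_SIZE 0x7E000000

/* Assumed formula: incompressible data is stored as literals, and each run of
 * 255 literal bytes costs one extra length byte, plus a small constant. */
static int worst_case_size(int inputSize)
{
    if ((unsigned)inputSize > (unsigned)LZ4_MAX_INPUT_SIZE) return 0;
    return inputSize + inputSize / 255 + 16;
}

/* Worked examples, checked with assert(). */
static void worst_case_examples(void)
{
    assert(worst_case_size(0) == 16);
    assert(worst_case_size(255) == 272);
    assert(worst_case_size(1000) == 1019);
    assert(worst_case_size(-1) == 0);          /* negative sizes are rejected */
}
```

A caller would allocate `worst_case_size(inputSize)` bytes for dest before calling the compressor, and treat a return value of 0 as "input too large".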
#define MINMATCH 4 /* the hash table is probed with 4-byte sequences */
#define COPYLENGTH 8
#define LASTLITERALS 5
#define MFLIMIT (COPYLENGTH + MINMATCH) /* no match search is done in the last 12 bytes */
const int LZ4_minLength = (MFLIMIT + 1); /* a block needs at least 13 bytes before matches are searched */
#define LZ4_64KLIMIT ((1<<16) + (MFLIMIT - 1)) /* 64K + 11 */
/* Increasing this value will make the compression run slower on incompressible data.
 * It controls how fast the search step grows: while no match is found, the step increases.
 */
#define SKIPSTRENGTH 6
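The effect of SKIPSTRENGTH is easiest to see by replaying the step computation on its own. The sketch below reuses the expression `findMatchAttempts++ >> skipStrength` with its starting value of (1 << 6) + 3 = 67; the helper names `step_at` and `distance_after` are hypothetical.

```c
#include <assert.h>

#define SKIPSTRENGTH 6

/* Step taken on the given failed attempt (0-based) within one search. */
static unsigned step_at(unsigned attempt)
{
    return ((1U << SKIPSTRENGTH) + 3 + attempt) >> SKIPSTRENGTH;
}

/* Total distance the scan pointer has moved after `attempts` failed lookups. */
static unsigned long distance_after(unsigned attempts)
{
    unsigned findMatchAttempts = (1U << SKIPSTRENGTH) + 3;
    unsigned long distance = 0;
    for (unsigned i = 0; i < attempts; i++)
        distance += findMatchAttempts++ >> SKIPSTRENGTH;
    assert(attempts == 0 || distance >= attempts);   /* every step is at least 1 */
    return distance;
}
```

The first 61 failed lookups advance one byte each, the next 64 advance two bytes each, then three, and so on. On incompressible data the scan therefore speeds up steadily instead of hashing every position.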

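LZ4_compress_generic() is the main compression function. Depending on the parameters passed in, it implements several compression variants.

Matching algorithm

1. The current address is ip, and its hash value is h.

2. The next address is forwardIp, and its hash value is forwardH (assigned to ip and h on the next iteration).

3. Use the hash value h to fetch the value ref from the hash table.

    3.1 ref is still the initial value: there is no match, A32(ip) != A32(ref), so continue.

    3.2 ref is not the initial value: there is a candidate match.

          3.2.1 ref is outside the sliding window: give up and continue.

          3.2.2 The U32 at ref differs from the U32 at ip: it is a collision, so continue.

          3.2.3 ref is inside the sliding window and the U32 values are equal: a match is found, exit the loop.

4. Store the mapping from h to ip in the hash table.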
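The steps above can be sketched as a simplified search loop. This is not the code of LZ4_compress_generic(): it advances one byte at a time, handles only inputs shorter than 64KB with 16-bit table entries, and uses hypothetical names (`read32`, `hash4`, `find_first_match`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MINMATCH 4
#define HASHLOG 13            /* 8K cells of 2 bytes each = 16KB */
#define MAX_DISTANCE 65535

static uint32_t read32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }

static uint32_t hash4(const uint8_t *p)
{
    return (uint32_t)(read32(p) * 2654435761U) >> (32 - HASHLOG);
}

/* Returns the first position with a verified 4-byte match earlier in src and
 * stores the earlier position in *matchPos, or returns -1 if there is none.
 * table must be zero-initialised and hold 1 << HASHLOG cells. */
static long find_first_match(const uint8_t *src, size_t len,
                             uint16_t *table, size_t *matchPos)
{
    assert(len < 65536);                       /* positions must fit in 16 bits */
    for (size_t ip = 0; ip + MINMATCH <= len; ip++) {
        uint32_t h = hash4(src + ip);          /* steps 1-2: hash the current position */
        size_t ref = table[h];                 /* step 3: candidate (0 if the cell is empty) */
        table[h] = (uint16_t)ip;               /* step 4: remember the current position */
        if (ref < ip                           /* candidate lies behind ip */
            && ip - ref <= MAX_DISTANCE        /* 3.2.1: inside the sliding window */
            && read32(src + ref) == read32(src + ip)) {   /* 3.2.2: not a collision */
            *matchPos = ref;                   /* 3.2.3: match found */
            return (long)ip;
        }
    }
    return -1;
}
```

An empty cell reads as position 0, so no separate "unset" marker is needed: the 4-byte comparison rejects false candidates, which mirrors the way the original compares A32(ref) with A32(ip) even when ref is still the initial value.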

FORCE_INLINE int LZ4_compress_generic(void *ctx, const char *source, char *dest, int inputSize,
        int maxOutputSize, limitedOutput_directive limitedOutput, tableType_t tableType, 
        prefix64k_directive prefix)
{
    const BYTE *ip = (const BYTE *) source;
    /* Used as srcBase for the hash table */
    const BYTE *const base = ((prefix == withPrefix) ? ((LZ4_Data_Structure *)ctx)->base : (const BYTE *)source);
    /* Lowest address a match may reference */
    const BYTE *const lowLimit = ((prefix == withPrefix) ? ((LZ4_Data_Structure *)ctx)->bufferStart : (const BYTE *)source);
    const BYTE *anchor = (const BYTE *)source;
    const BYTE *const iend = ip + inputSize; /* end of the input */
    const BYTE *const mflimit = iend - MFLIMIT; /* iend - 12: no new match may start beyond this point */
    const BYTE *const matchlimit = iend - LASTLITERALS; /* iend - 5: the last 5 bytes are never matched */

    BYTE *op = (BYTE *) dest; /* write cursor in the output buffer */
    BYTE *const oend = op + maxOutputSize; /* end of the output buffer, when a limit applies */

    int length;
    const int skipStrength = SKIPSTRENGTH; /* 6 */
    U32 forwardH;

    /* Init conditions */
    if ((U32) inputSize > (U32) LZ4_MAX_INPUT_SIZE) return 0; /* input too large */
    /* must continue from end of previous block */
    if ((prefix == withPrefix) && (ip != ((LZ4_Data_Structure *)ctx)->nextBlock)) return 0;
    /* do it now, due to potential early exit: record where the next block starts */
    if (prefix == withPrefix) ((LZ4_Data_Structure *)ctx)->nextBlock = iend;
    if ((tableType == byU16) && (inputSize >= LZ4_64KLIMIT)) return 0; /* Size too large (not within 64K limit) */
    if (inputSize < LZ4_minLength) goto _last_literals; /* inputs shorter than 13 bytes are not searched for matches */
    
    /* First Byte */
    LZ4_putPosition(ip, ctx, tableType, base); /* hash the U32 starting at the first byte and store its position */
    ip++; forwardH = LZ4_hashPosition(ip, tableType); /* hash the U32 starting at the second byte */

    /* Main loop: each iteration finds one match and emits one sequence */
    for ( ; ; )
    {
        int findMatchAttempts = (1U << skipStrength) + 3;
        const BYTE *forwardIp = ip;
        const BYTE *ref;
        BYTE *token;

        /* Find a match, or stop when mflimit is reached */
        do {
            U32 h = forwardH; /* hash of the current ip */
            int step = findMatchAttempts++ >> skipStrength; /* how far forwardIp advances, usually 1 */
            ip = forwardIp;
            forwardIp = ip + step; /* next address to be checked */

            if unlikely(forwardIp > mflimit) { goto _last_literals; } /* fewer than 12 bytes remain: stop searching */
            forwardH = LZ4_hashPosition(forwardIp, tableType); /* hash of forwardIp */

            /* The heart of the search: fetch the address ref stored under hash h.
             * 1. No previous entry: ref equals srcBase.
             * 2. A previous entry exists.
             *     2.1 It is outside the sliding window: continue.
             *     2.2 The U32 values differ, so it is a collision: continue.
             *     2.3 It is inside the sliding window and the U32 values are equal: match found, exit.
             */
            ref = LZ4_getPositionOnHash(h, ctx, tableType, base);
            LZ4_putPositionOnHash(ip, h, ctx, tableType, base); /* record the mapping h -> ip */
        } while ((ref + MAX_DISTANCE < ip) || (A32(ref) != A32(ip)));

        /* After finding a match, try to extend it backward */
        while((ip > anchor) && (ref > lowLimit) && unlikely(ip[-1] == ref[-1])) { ip--; ref--; }

        /* Encode Literal length */
        length = (int) (ip - anchor);
        token = op++;

        /* Check output limit */
        if ((limitedOutput) && unlikely(op + length + 2 + (1 + LASTLITERALS) + (length>>8) > oend)) return 0;

        if (length >= (int) RUN_MASK) {
            int len = length - RUN_MASK;
            *token = (RUN_MASK << ML_BITS);
            for(; len >= 255; len -= 255) *op++ = 255;
            *op++ = (BYTE) len;
        } else
            *token = (BYTE) (length << ML_BITS);

        /* Copy Literals: the bytes that could not be matched */
        { BYTE * end = (op) + (length); LZ4_WILDCOPY(op, anchor, end); op = end; }

_next_match: /* extend the match forward */
        /* Encode Offset: writes 2 bytes and advances op by 2 */
        LZ4_WRITE_LITTLEENDIAN_16(op, (U16) (ip - ref));

        /* Start Counting */
        ip += MINMATCH; ref += MINMATCH; /* MinMatch already verified */
        anchor = ip;

        while likely(ip < matchlimit - (STEPSIZE - 1)) {
            size_t diff = AARCH(ref) ^ AARCH(ip); /* XOR: zero means the two words are identical */
            if (! diff) { ip += STEPSIZE; ref += STEPSIZE; continue; }
            ip += LZ4_NbCommonBytes(diff); /* the words differ: count how many leading bytes still match */
            goto _endCount;
        }

        if (LZ4_ARCH64)
            if ((ip < (matchlimit - 3)) && (A32(ref) == A32(ip))) { ip += 4; ref += 4; }
        if ((ip < (matchlimit - 1)) && (A16(ref) == A16(ip))) { ip += 2; ref += 2; }
        if ((ip < matchlimit) && (*ref == *ip)) ip++;

_endCount:
        /* Encode MatchLength */
        length = (int) (ip - anchor);
        /* Check output limit */
        if ((limitedOutput) && unlikely(op + (1 + LASTLITERALS) + (length >> 8) > oend)) return 0;

        if (length >= (int) ML_MASK) {
            *token += ML_MASK;
            length -= ML_MASK;
            for (; length > 509; length -= 510) { *op++ = 255; *op++ = 255; }
            if (length >= 255) { length -= 255; *op++ = 255; }
            *op++ = (BYTE) (length);
        } else
            *token += (BYTE) (length);

        /* Test end of chunk */
        if (ip > mflimit) { anchor = ip; break; } /* no further matching */
        /* Fill table: also record an earlier position along the way */
        LZ4_putPosition(ip - 2, ctx, tableType, base);

        /* Test next position: try to find a match right away */
        ref = LZ4_getPosition(ip, ctx, tableType, base);
        LZ4_putPosition(ip, ctx, tableType, base);
        /* If a match is found here there are no literals: skip the search and emit a zero literal length */
        if ((ref + MAX_DISTANCE >= ip) && (A32(ref) == A32(ip))) { token = op++; *token = 0; goto _next_match; }

        /* Prepare next loop */
        anchor = ip++;
        forwardH = LZ4_hashPosition(ip, tableType);
    }

_last_literals:
    /* Encode Last Literals */
    {
        int lastRun = (int) (iend - anchor); /* length of the trailing literals */

        if ((limitedOutput) && (((char *)op - dest) + lastRun + 1 + ((lastRun + 255 - RUN_MASK) / 255) > 
                (U32) maxOutputSize))
            return 0; /* check output limit */

        if (lastRun >= (int) RUN_MASK) { 
            *op ++ = (RUN_MASK << ML_BITS); 
            lastRun -= RUN_MASK;
            for (; lastRun >= 255; lastRun -=255) *op++ = 255;
            *op++ = (BYTE) lastRun;
        } else
            *op++ = (BYTE) (lastRun << ML_BITS);

        memcpy(op, anchor, iend - anchor); /* copy the literals */
        op += iend - anchor;
    }
 
    /* End */
    return (int) (((char *)op) - dest); /* return the compressed length */
}
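The match-counting loop relies on LZ4_NbCommonBytes(), which this article does not show. The idea can be sketched portably: XOR two 8-byte words and count how many low-order bytes of the result are zero. The helpers below (`load_le64`, `nb_common_bytes`, `count_common`) are hypothetical, and they use explicit little-endian loads plus a plain byte loop where a tuned implementation would presumably use a bit-scan instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loads 8 bytes so that p[0] ends up in the lowest-order byte. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Number of equal leading bytes, given the XOR of two little-endian words. */
static unsigned nb_common_bytes(uint64_t diff)
{
    unsigned n = 0;
    assert(diff != 0);                 /* identical words are handled by the caller */
    while ((diff & 0xFF) == 0) { diff >>= 8; n++; }
    return n;
}

/* Length of the common prefix of a and b, looking at no more than limit bytes. */
static size_t count_common(const uint8_t *a, const uint8_t *b, size_t limit)
{
    size_t n = 0;
    while (n + 8 <= limit) {
        uint64_t diff = load_le64(a + n) ^ load_le64(b + n);
        if (diff == 0) { n += 8; continue; }   /* whole word matches */
        return n + nb_common_bytes(diff);      /* mismatch inside this word */
    }
    while (n < limit && a[n] == b[n]) n++;     /* tail, byte by byte */
    return n;
}
```

Comparing a word at a time is what lets long matches be measured quickly; the byte-wise tail plays the same role as the 4-, 2- and 1-byte checks that follow the loop in the original.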

#define LZ4_MAX_INPUT_SIZE 0x7E000000 /* 2 113 929 216 bytes */
#define ML_BITS 4 /* Token: 4-low-bits, match length */
#define ML_MASK ((1U << ML_BITS) - 1)
#define RUN_BITS (8 - ML_BITS) /* Token: 4-high-bits, literal length */
#define RUN_MASK ((1U << RUN_BITS) - 1)

#define MAXD_LOG 16 /* number of bits in a sliding-window offset */
#define MAX_DISTANCE ((1 << MAXD_LOG) - 1) /* maximum sliding-window distance: 65535 */
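These masks determine how the one-byte token at the start of each sequence is packed. A minimal sketch, reusing the definitions above; `make_token` and `token_examples` are hypothetical names, and the extension bytes that follow a saturated field are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINMATCH 4
#define ML_BITS 4
#define ML_MASK ((1U << ML_BITS) - 1)
#define RUN_BITS (8 - ML_BITS)
#define RUN_MASK ((1U << RUN_BITS) - 1)

/* Packs a literal length and a match length (>= MINMATCH) into one token.
 * Either field saturates at 15, meaning "extension bytes follow". */
static uint8_t make_token(size_t litLen, size_t matchLen)
{
    assert(matchLen >= MINMATCH);
    size_t ml = matchLen - MINMATCH;            /* stored relative to minmatch */
    unsigned hi = (litLen >= RUN_MASK) ? RUN_MASK : (unsigned)litLen;
    unsigned lo = (ml >= ML_MASK) ? ML_MASK : (unsigned)ml;
    return (uint8_t)((hi << ML_BITS) | lo);
}

/* Worked examples, checked with assert(). */
static void token_examples(void)
{
    assert(make_token(3, 9) == 0x35);    /* 3 literals, match of 4 + 5 bytes */
    assert(make_token(20, 4) == 0xF0);   /* literal field saturated */
    assert(make_token(0, 100) == 0x0F);  /* match field saturated */
}
```

The high 4 bits carry the literal length and the low 4 bits carry the match length minus 4, matching the comments on RUN_BITS and ML_BITS.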