Huffman Compression, Implemented After Hitting Countless Pitfalls (Java)

阿新 • Published: 2018-12-10

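I apparently had too much free time again recently and wanted to build something. I had worked on algorithms before but had never done data structures in Java, so I decided to try writing Huffman compression.

I expected to finish in half a day. Instead I hit one pitfall after another and it took two full days!! orz... Still, I learned quite a lot from those pitfalls, and I am more convinced than ever that learning while building is the fastest way to learn.

Enough talk, on to the topic.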

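First, a word about Huffman trees

A Huffman tree is a binary tree, meaning each node has at most two children. Give every leaf a weight; the weighted path length of the tree is the sum, over all leaves, of the leaf's weight times its distance from the root. A binary tree whose weighted path length is minimal for the given weights is called an optimal binary tree, also known as a Huffman tree. Because it minimizes the weighted path length, nodes with larger weights sit closer to the root.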

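Constructing a Huffman tree

Suppose there are n weights; the resulting Huffman tree then has n leaf nodes. Let the weights be w1, w2, …, wn. The construction rules are:

(1) Treat w1, w2, …, wn as a forest of n trees (each tree has only one node);

(2) Pick the two trees in the forest whose roots have the smallest weights and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the root weights of its left and right subtrees;

(3) Remove the two chosen trees from the forest and add the new tree to the forest;

(4) Repeat steps (2) and (3) until only one tree remains in the forest; that tree is the Huffman tree.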
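The four steps above can be sketched with a priority queue acting as the forest. This is only a minimal sketch, not the implementation used later in this post: the class `HuffmanTreeSketch` and its `Node` type are hypothetical names made up for illustration, and the sample weights are the letter counts 1, 1, 1, 3, 2, 1, 1 from the "HELLO WORLD" example in this article.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class HuffmanTreeSketch {
    // Hypothetical node type for this sketch (not the HufTree class from the article).
    static final class Node {
        final int weight;
        final Node left, right;

        Node(int weight, Node left, Node right) {
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Builds the tree by repeatedly merging the two lightest roots in the forest.
    static Node build(int[] weights) {
        if (weights.length == 0) {
            throw new IllegalArgumentException("need at least one weight");
        }
        PriorityQueue<Node> forest =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (int w : weights) {
            forest.add(new Node(w, null, null));          // step (1): n single-node trees
        }
        while (forest.size() > 1) {                       // step (4): until one tree is left
            Node a = forest.poll();                       // step (2): the two smallest roots
            Node b = forest.poll();
            forest.add(new Node(a.weight + b.weight, a, b)); // step (3): put the merged tree back
        }
        return forest.poll();
    }

    // Weighted path length: sum of (leaf weight * leaf depth) over all leaves.
    static int weightedPathLength(Node node, int depth) {
        if (node.isLeaf()) {
            return node.weight * depth;
        }
        return weightedPathLength(node.left, depth + 1)
                + weightedPathLength(node.right, depth + 1);
    }

    public static void main(String[] args) {
        Node root = build(new int[] {1, 1, 1, 3, 2, 1, 1});
        System.out.println("root weight = " + root.weight);
        System.out.println("weighted path length = " + weightedPathLength(root, 0));
    }
}
```

For these weights the root weight is 10 (the total number of letters) and the weighted path length is 27. Ties between equal weights may be broken differently from run to run or implementation to implementation, which changes the shape of the tree but not the weighted path length.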

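Huffman coding

In data communication, the text to be sent has to be converted into a binary string, with different arrangements of 0s and 1s representing the characters. For example, suppose the message is "HELLO WORLD". Ignoring the space, the character set is "D,E,H,L,O,R,W" and the letters occur {1,1,1,3,2,1,1} times respectively. We now need to design codes for these letters. To distinguish 7 letters, the simplest binary scheme is a fixed-length code with 3 bits per letter: 000, 001, 010, 011, 100, 101, 110 can encode "D,E,H,L,O,R,W", and the receiver decodes by splitting the message into groups of three bits. The code length clearly depends on how many different characters the message may contain; if 26 different characters can appear, the fixed code length is 5. When sending a message, however, we always want the total length to be as short as possible. In practice, characters do not all occur with the same frequency (A, B and C are used far more often than X, Y and Z), so the natural idea is to give frequently used characters short codes and rarely used characters long codes, which shortens the encoded message as a whole.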

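With Huffman coding the codes become: D->0000 E->0001 W->001 H->110 R->111 L->01 O->10

With fixed 3-bit codes the encoded length is 30 bits, while with Huffman coding it is 27 bits, so the message is clearly shorter.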
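The 30 versus 27 figures can be checked with a few lines of Java. This is a rough sketch rather than part of the compressor: the class `CodeLengthSketch` and its methods are hypothetical names, the space in the message is dropped (the character set above has no code for it), and the code for O is taken to be 10, the value that makes the total come out to 27.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class CodeLengthSketch {
    // Code table from the example; O -> 10 is the assumption stated in the text.
    static Map<Character, String> table() {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('D', "0000");
        codes.put('E', "0001");
        codes.put('W', "001");
        codes.put('H', "110");
        codes.put('R', "111");
        codes.put('L', "01");
        codes.put('O', "10");
        return codes;
    }

    // Concatenates the code of every character in the text.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("no code for '" + c + "'");
            }
            bits.append(code);
        }
        return bits.toString();
    }

    // Decodes by growing a candidate until it matches a code.
    // This only works because no code is a prefix of another one.
    static String decode(String bits, Map<Character, String> codes) {
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character symbol = reverse.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        if (current.length() != 0) {
            throw new IllegalArgumentException("trailing bits do not form a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "HELLOWORLD"; // the message without the space
        String bits = encode(text, table());
        System.out.println("fixed 3-bit length: " + text.length() * 3);
        System.out.println("Huffman length:     " + bits.length());
        System.out.println("round trip:         " + decode(bits, table()));
    }
}
```

The fixed-length figure is simply 10 letters times 3 bits. The decode step shows why the receiver needs no separators between codes: since no code is a prefix of another, the first match while reading left to right is always the right one.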

Here is the implementation.

HuffmanCompress.java

package 哈夫曼;

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.PriorityQueue;

 
public class HuffmanCompress {
    private PriorityQueue<HufTree> queue = null;

    public void compress(File inputFile, File outputFile) {
        Compare cmp = new Compare();
        queue = new PriorityQueue<HufTree>(12, cmp);

        // Maps each byte to its Huffman code
        HashMap<Byte, String> map = new HashMap<Byte, String>();

        int i, char_kinds = 0;
        int char_tmp, file_len = 0;
        FileInputStream fis = null;
        FileOutputStream fos = null;
        DataOutputStream oos = null;

        HufTree root = new HufTree();
        String code_buf = null;

        // Temporary array holding the frequency of each byte value
        TmpNode[] tmp_nodes = new TmpNode[256];

        for (i = 0; i < 256; i++) {
            tmp_nodes[i] = new TmpNode();
            tmp_nodes[i].weight = 0;
            tmp_nodes[i].Byte = (byte) i;
        }

        try {
            fis = new FileInputStream(inputFile);
            fos = new FileOutputStream(outputFile);
            oos = new DataOutputStream(fos);

            /*
             * Count byte frequencies and compute the file length
             */
            while ((char_tmp = fis.read()) != -1) {
                tmp_nodes[char_tmp].weight++;
                file_len++;
            }
            fis.close();
            // Sort so that bytes with frequency 0 come last, and count how many distinct byte values occur
            Arrays.sort(tmp_nodes);
            for (i = 0; i < 256; i++) {
                if (tmp_nodes[i].weight == 0) {
                    break;
                }
                HufTree tmp = new HufTree();
                tmp.Byte = tmp_nodes[i].Byte;
                tmp.weight = tmp_nodes[i].weight;
                queue.add(tmp);
            }
            char_kinds = i;

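            if (char_kinds == 1) {
                oos.writeInt(char_kinds);
                oos.writeByte(tmp_nodes[0].Byte);
                oos.writeInt(tmp_nodes[0].weight);
                // Close the output here as well; the close further down is only reached in the else branch
                oos.close();
            } else {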
                // Build the tree
                createTree(queue);
                root = queue.peek();
                // Generate the Huffman codes
                hufCode(root, "", map);
                // Write the number of distinct byte values
                oos.writeInt(char_kinds);
                for (i = 0; i < char_kinds; i++) {
                    oos.writeByte(tmp_nodes[i].Byte);
                    oos.writeInt(tmp_nodes[i].weight);
                }
                oos.writeInt(file_len);
                fis = new FileInputStream(inputFile);
                code_buf = "";
                while ((char_tmp = fis.read()) != -1) {
                    code_buf += map.get((byte) char_tmp);
                    while (code_buf.length() >= 8) {
                        char_tmp = 0;
                        for (i = 0; i < 8; i++) {
                            char_tmp <<= 1;
                            if (code_buf.charAt(i) == '1')
                                char_tmp |= 1;
                        }
                        oos.writeByte((byte) char_tmp);
                        code_buf = code_buf.substring(8);
                    }
                }
                // If the remaining code is shorter than 8 bits, pad it with 0s
                if (code_buf.length() > 0) {
                    char_tmp = 0;
                    for (i = 0; i < code_buf.length(); ++i) {
                        char_tmp <<= 1;
                        if (code_buf.charAt(i) == '1')
                            char_tmp |= 1;
                    }
                    char_tmp <<= (8 - code_buf.length());
                    oos.writeByte((byte) char_tmp);
                }
                oos.close();
                fis.close();
            }

        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public void extract(File inputFile, File outputFile) {
        Compare cmp = new Compare();
        queue = new PriorityQueue<HufTree>(12, cmp);

        int i;
        int file_len = 0;
        int writen_len = 0;
        FileInputStream fis = null;
        FileOutputStream fos = null;
        DataInputStream ois = null;

        int char_kinds = 0;
        HufTree root=new HufTree();
        byte code_tmp;
        try {
            fis = new FileInputStream(inputFile);
            ois = new DataInputStream(fis);
            fos = new FileOutputStream(outputFile);

            char_kinds = ois.readInt();
            // Only one distinct byte value
            if (char_kinds == 1) {
                code_tmp = ois.readByte();
                file_len = ois.readInt();
                while ((file_len--) != 0) {
                    fos.write(code_tmp);
                }
            } else {
                for (i = 0; i < char_kinds; i++) {
                    HufTree tmp = new HufTree();
                    tmp.Byte = ois.readByte();
                    tmp.weight = ois.readInt();
                    System.out.println("Byte: "+tmp.Byte+" weight: "+tmp.weight);
                    queue.add(tmp);
                }

                createTree(queue);

                file_len = ois.readInt();
                root = queue.peek();
                while (true) {
                    code_tmp = ois.readByte();
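                    for (i = 0; i < 8; i++) {
                        // Why & 128? We walk the tree in code order: 1 goes right, 0 goes left.
                        // A byte holds 8 bits, so its highest bit is 2^7, i.e. 128,
                        // and a bitwise AND tells us whether that bit is 0 or 1.
                        // I got this wrong at first and read from the low end, which produced garbage;
                        // the compression side had me stuck on the same point for a long time too orz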
                        if ((code_tmp&128)==128) {
                            root = root.rchild;
                        } else {
                            root = root.lchild;
                        }
                        if (root.lchild == null && root.rchild == null) {
                            fos.write(root.Byte);
                            ++writen_len;
                            if (writen_len == file_len)
                                break;
                            root = queue.peek();
                        }
                        code_tmp <<= 1;
                    }
                    if (writen_len == file_len)
                        break;
                }
            }
            fis.close();
            fos.close();

        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

    public void createTree(PriorityQueue<HufTree> queue) {
        while (queue.size() > 1) {
            HufTree min1 = queue.poll();
            HufTree min2 = queue.poll();
            System.out.print(min1.weight + " " + min2.weight + " ");

            HufTree NodeParent = new HufTree();
            NodeParent.weight = min1.weight + min2.weight;
            NodeParent.lchild = min1;
            NodeParent.rchild = min2;

            queue.add(NodeParent);
        }
    }

    public void hufCode(HufTree root, String s, HashMap<Byte, String> map) {
        if (root.lchild == null && root.rchild == null) {
            root.code = s;
            System.out.println("Node " + root.Byte + " code " + s);
            map.put(root.Byte, root.code);

            return;
        }
        if (root.lchild != null) {
            hufCode(root.lchild, s + '0', map);
        }
        if (root.rchild != null) {
            hufCode(root.rchild, s + '1', map);
        }

    }

}
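The bit order is the part that is easiest to get wrong, so here is a small standalone sketch of it. `compress` fills each byte from the high end (`char_tmp <<= 1` before adding a bit) and `extract` reads it back by testing `code_tmp & 128` and shifting left. The class `BitOrderSketch` and its two methods are hypothetical helpers written only for this illustration; they are not called by the code above.

```java
class BitOrderSketch {
    // Packs up to 8 '0'/'1' characters into one byte.
    // The first character ends up in the highest bit; missing bits are padded with 0 on the right.
    static byte pack(String bits) {
        if (bits.isEmpty() || bits.length() > 8) {
            throw new IllegalArgumentException("expected 1 to 8 bits");
        }
        int value = 0;
        for (int i = 0; i < bits.length(); i++) {
            value <<= 1;
            if (bits.charAt(i) == '1') {
                value |= 1;
            }
        }
        value <<= (8 - bits.length());
        return (byte) value;
    }

    // Reads the 8 bits of a byte back, most significant bit first.
    static String unpack(byte b) {
        StringBuilder bits = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            bits.append((b & 0x80) != 0 ? '1' : '0'); // 0x80 == 128, the highest bit
            b <<= 1;                                   // move the next bit into the top position
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        byte packed = pack("110");
        System.out.println(unpack(packed)); // the three code bits followed by five padding zeros
    }
}
```

Because the padding zeros look like real code bits, the decoder cannot tell where the data ends from the bytes alone. That is why the compressed file stores the original length and `extract` stops once it has written that many bytes.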

Compare.java

package 哈夫曼;

import java.util.Comparator;

public class Compare implements Comparator<HufTree>{

    @Override
    public int compare(HufTree o1, HufTree o2) {
        if(o1.weight < o2.weight)
            return -1;
        else if(o1.weight > o2.weight)
            return 1;
        return 0;
    }

}

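This part involves custom ordering for Java's priority queue. I had been writing the comparison the way I would overload it in C++, and the compressed file came out 3 times the size of the original!!!! I kept assuming the compression pass was at fault and went over it again and again looking for the mistake; only after printing the code of each character did I find the real problem. That cost me a whole day TAT.. Here is a link explaining custom ordering for priority queues: https://blog.csdn.net/u013066244/article/details/78997869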
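To make the ordering rule concrete, here is a small sketch of how the sign of a `Comparator` decides what `java.util.PriorityQueue` hands out first. The class `QueueOrderSketch` and its `drain` helper are hypothetical names used only for this illustration. One plausible reading of the bug described above (I am guessing, since the broken version is not shown) is that a comparison written for C++ `std::priority_queue`, where a less-than comparison puts the largest element on top, ends up reversed when carried over to Java, where the head of the queue is the smallest element under the comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class QueueOrderSketch {
    // Adds the values to a PriorityQueue with the given ordering and polls them all out again.
    static List<Integer> drain(Comparator<Integer> order, int... values) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(order);
        for (int v : values) {
            queue.add(v);
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            out.add(queue.poll()); // poll() returns the smallest element under 'order'
        }
        return out;
    }

    public static void main(String[] args) {
        Comparator<Integer> ascending = Integer::compare;      // negative when a < b
        Comparator<Integer> descending = ascending.reversed(); // the sign is flipped

        System.out.println(drain(ascending, 3, 1, 2));  // smallest first: [1, 2, 3]
        System.out.println(drain(descending, 3, 1, 2)); // largest first:  [3, 2, 1]
    }
}
```

Huffman construction needs the smallest-first behavior. With the order reversed, the two heaviest trees would be merged first, so the most frequent bytes would get the longest codes, which fits the symptom of the output being larger than the input.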

HufTree.java

package 哈夫曼;

public class HufTree{
    public byte Byte; // the byte value (one 8-bit unit)
    public int weight; // number of times this byte occurs in the file
    public String code; // the corresponding Huffman code
    public HufTree lchild,rchild;

   
}

// Temporary node used while counting byte frequencies
class TmpNode implements Comparable<TmpNode>{
    public byte Byte;
    public int weight;

    @Override
    public int compareTo(TmpNode arg0) {
        if(this.weight < arg0.weight)
            return 1;
        else if(this.weight > arg0.weight)
            return -1;
        return 0;
    }
}

test.java

package 哈夫曼;

import java.io.File;

public class test {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        HuffmanCompress sample = new HuffmanCompress();
        // File inputFile = new File("C:\\Users\\long452a\\Desktop\\opencv連結文件.txt");
        // File outputFile = new File("C:\\Users\\long452a\\Desktop\\opencv連結文件.rar");
        // sample.compress(inputFile, outputFile);
        File inputFile = new File("C:\\Users\\long452a\\Desktop\\opencv連結文件.rar");
        File outputFile = new File("C:\\Users\\long452a\\Desktop\\opencv連結文件1.txt");
        sample.extract(inputFile, outputFile);
    }

}
