Java I/O series 14: DataInputStream (data input stream): overview, source code, and example

阿新 • Published: 2019-01-08

This chapter introduces DataInputStream. We start with an overview of the class, then walk through its source code, and finish with an example that shows it in use.

When reposting, please credit the source: http://www.cnblogs.com/skywang12345/p/io_14.html

About DataInputStream

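DataInputStream is a data input stream. It extends FilterInputStream.
DataInputStream decorates another input stream: it “lets an application read primitive Java data types from an underlying input stream in a machine-independent way”. An application can use a DataOutputStream (data output stream) to write data that a DataInputStream later reads back.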
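
To make "machine-independent" concrete, here is a minimal sketch that writes an int through DataOutputStream and reads it back through DataInputStream, using in-memory streams. The class name RoundTripDemo and its helper methods are made up for this illustration; they are not part of the JDK or of the original example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RoundTripDemo {
    // Writes one int through DataOutputStream and returns the raw bytes produced.
    static byte[] encodeInt(int value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
        return buf.toByteArray();
    }

    // Reads the int back through DataInputStream.
    static int decodeInt(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInt(0x12345678);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        // The byte order is fixed (most significant byte first) on every platform.
        System.out.println(hex.toString().trim());      // prints "12 34 56 78"
        System.out.printf("0x%x%n", decodeInt(bytes));  // prints "0x12345678"
    }
}
```

The four bytes always come out most significant first, whatever the byte order of the host CPU, which is why data written on one machine can be read on another.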

DataInputStream method list

DataInputStream(InputStream in)
final int     read(byte[] buffer, int offset, int length)
final int     read(byte[] buffer)
final boolean     readBoolean()
final byte     readByte()
final char     readChar()
final double     readDouble()
final float     readFloat()
final void     readFully(byte[] dst)
final void     readFully(byte[] dst, int offset, int byteCount)
final int     readInt()
final String     readLine()
final long     readLong()
final short     readShort()
final static String     readUTF(DataInput in)
final String     readUTF()
final int     readUnsignedByte()
final int     readUnsignedShort()
final int     skipBytes(int count)
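
Several of these methods come in signed and unsigned pairs. As a small sketch (SignedUnsignedDemo is a hypothetical class name used only here), the same raw byte 0x8F is read once with readByte() and once with readUnsignedByte():

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class SignedUnsignedDemo {
    // Reads the same raw byte twice: first as a signed byte, then as an unsigned one.
    static int[] readBoth(byte raw) {
        byte[] data = { raw, raw };
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int signed = in.readByte();            // range -128..127
            int unsigned = in.readUnsignedByte();  // range 0..255
            return new int[] { signed, unsigned };
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
    }

    public static void main(String[] args) {
        int[] values = readBoth((byte) 0x8F);
        System.out.println(values[0]); // prints -113
        System.out.println(values[1]); // prints 143
    }
}
```

readUnsignedShort() relates to readShort() in the same way, which is why readUTF() uses it to read a length that can go up to 65535.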

DataInputStream.java source analysis (based on JDK 1.7.40)

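Notes:
The purpose of DataInputStream is to “let an application read primitive Java data types from an underlying input stream in a machine-independent way. An application uses a data output stream to write data that can later be read by a data input stream.”
The only method in DataInputStream that is hard to follow is readUTF(DataInput in). It is explained in detail below; for the other methods, refer to the comments in the source.

The source of readUTF(DataInput in) is as follows: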

public final static String readUTF(DataInput in) throws IOException {
    // Read an unsigned short from the data input.
    // Note: the first 2 bytes of the encoded string hold the length, in bytes, of the data that follows.
    int utflen = in.readUnsignedShort();
    byte[] bytearr = null;
    char[] chararr = null;

    // If in is itself a DataInputStream, reuse its member arrays bytearr and chararr
    // (replacing them with larger ones first if they are shorter than utflen);
    // otherwise, allocate new bytearr and chararr arrays.
    if (in instanceof DataInputStream) {
        DataInputStream dis = (DataInputStream)in;
        if (dis.bytearr.length < utflen){
            dis.bytearr = new byte[utflen*2];
            dis.chararr = new char[utflen*2];
        }
        chararr = dis.chararr;
        bytearr = dis.bytearr;
    } else {
        bytearr = new byte[utflen];
        chararr = new char[utflen];
    }

    int c, char2, char3;
    int count = 0;
    int chararr_count=0;

    // Read utflen bytes from the data input into bytearr, starting at index 0.
    // Note that the data goes into the byte array, and that all of it is read at once.
    in.readFully(bytearr, 0, utflen);

    // Copy the leading single-byte characters from bytearr into chararr.
    // This is a fast path for ASCII; the encoding is variable-length, so later bytes may need decoding.
    while (count < utflen) {
        // Convert the byte to an int in the range 0-255
        c = (int) bytearr[count] & 0xff;
        // A single-byte character never exceeds 127, so stop at the first byte above 127.
        if (c > 127) break;
        count++;
        // Store c in chararr
        chararr[chararr_count++]=(char)c;
    }

    // After the leading single-byte characters, decode the rest of the data.
    while (count < utflen) {
        // The next two statements do two things.
        // (01) Convert the byte to an int, keeping only the low 8 bits.
        //      For example, “11001010” becomes “00000000 00000000 00000000 11001010”.
        // (02) Shift the int right by 4 bits.
        //      For example, “00000000 00000000 00000000 11001010” shifted right by 4 bits becomes “00000000 00000000 00000000 00001100”.
        c = (int) bytearr[count] & 0xff;
        switch (c >> 4) {
            // Single-byte character: bytearr[count] has the form “0xxxxxxx”,
            // so c >> 4 is in the range 0-7.
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
                /* 0xxxxxxx*/
                count++;
                chararr[chararr_count++]=(char)c;
                break;

            // Two-byte character: bytearr[count] is the first byte, “110xxxxx”, of “110xxxxx  10xxxxxx”,
            // so c >> 4 is in the range 12-13.
            case 12: case 13:
                /* 110x xxxx   10xx xxxx*/
                count += 2;
                if (count > utflen)
                    throw new UTFDataFormatException(
                        "malformed input: partial character at end");
                char2 = (int) bytearr[count-1];
                if ((char2 & 0xC0) != 0x80)
                    throw new UTFDataFormatException(
                        "malformed input around byte " + count);
                chararr[chararr_count++]=(char)(((c & 0x1F) << 6) |
                                                (char2 & 0x3F));
                break;

            // Three-byte character: bytearr[count] is the first byte, “1110xxxx”, of “1110xxxx  10xxxxxx  10xxxxxx”,
            // so c >> 4 is 14.
            case 14:
                /* 1110 xxxx  10xx xxxx  10xx xxxx */
                count += 3;
                if (count > utflen)
                    throw new UTFDataFormatException(
                        "malformed input: partial character at end");
                char2 = (int) bytearr[count-2];
                char3 = (int) bytearr[count-1];
                if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                    throw new UTFDataFormatException(
                        "malformed input around byte " + (count-1));
                chararr[chararr_count++]=(char)(((c     & 0x0F) << 12) |
                                                ((char2 & 0x3F) << 6)  |
                                                ((char3 & 0x3F) << 0));
                break;

            // Every other first byte is rejected:
            // “10xxxxxx” (c >> 4 is 8-11) is a continuation byte with no lead byte before it, and
            // “1111xxxx” (c >> 4 is 15) would start a four-byte sequence, which readUTF() does not accept.
            default:
                /* 10xx xxxx,  1111 xxxx */
                throw new UTFDataFormatException(
                    "malformed input around byte " + count);
        }
    }
    // The number of chars produced may be less than utflen
    return new String(chararr, 0, chararr_count);
}

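Notes:

(01) readUTF() reads UTF-8 encoded data from the input stream (strictly speaking, Java's modified UTF-8) and returns it as a String.
(02) With that in mind, here is how readUTF() works, step by step:

Step 1: read the length of the UTF-8 data from the input stream. The code is:

int utflen = in.readUnsignedShort();

The length of the UTF-8 data is stored in its first two bytes; readUnsignedShort() reads those two bytes as a non-negative integer, which is the length of the UTF-8 data in bytes.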
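
The length prefix is easy to observe directly. The sketch below (UtfPrefixDemo and prefixOf are hypothetical names for this illustration) writes a string with writeUTF() and then reads back only the first two bytes with readUnsignedShort():

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfPrefixDemo {
    // Writes s with writeUTF and returns the 2-byte length prefix that precedes the data.
    static int prefixOf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for short in-memory strings
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("abc")); // prints 3
        System.out.println(prefixOf(""));    // prints 0
    }
}
```

Because the prefix is an unsigned 16-bit value, the encoded data can be at most 65535 bytes long.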

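Step 2: set up two arrays, the byte array bytearr and the char array chararr. The code is:

if (in instanceof DataInputStream) {
    DataInputStream dis = (DataInputStream)in;
    if (dis.bytearr.length < utflen){
        dis.bytearr = new byte[utflen*2];
        dis.chararr = new char[utflen*2];
    }
    chararr = dis.chararr;
    bytearr = dis.bytearr;
} else {
    bytearr = new byte[utflen];
    chararr = new char[utflen];
}

First, check whether the input is itself a DataInputStream (a data input stream). If it is:
set the byte array bytearr = the stream's member bytearr
set the char array chararr = the stream's member chararr
(if the member arrays are shorter than utflen, they are first replaced by arrays of length utflen*2).
Otherwise, create new bytearr and chararr arrays of length utflen.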

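Step 3: read all of the UTF-8 data into the byte array bytearr. The code is:

in.readFully(bytearr, 0, utflen);

Note: the data is stored in the byte array, not the char array, and all utflen bytes are read at once.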
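
One consequence of using readFully() is that a stream holding fewer bytes than the prefix promises fails with EOFException instead of returning a shorter string. A minimal sketch, assuming hand-built byte arrays as input (TruncatedUtfDemo and tryRead are hypothetical names):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class TruncatedUtfDemo {
    // Returns the decoded string, or the name of the exception category that readUTF raised.
    static String tryRead(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (EOFException e) {
            return "EOFException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    public static void main(String[] args) {
        byte[] complete  = {0, 3, 'a', 'b', 'c'}; // prefix 3, followed by 3 bytes
        byte[] truncated = {0, 5, 'a', 'b', 'c'}; // prefix 5, but only 3 bytes follow
        System.out.println(tryRead(complete));  // prints "abc"
        System.out.println(tryRead(truncated)); // prints "EOFException"
    }
}
```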

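Step 4: pre-process the single-byte characters at the start of the UTF-8 data. The code is:

while (count < utflen) {
    // Convert the byte to an int in the range 0-255
    c = (int) bytearr[count] & 0xff;
    // A single-byte UTF-8 character never exceeds 127, so stop at the first byte above 127.
    if (c > 127) break;
    count++;
    // Store c in chararr
    chararr[chararr_count++]=(char)c;
}

UTF-8 data is variable-length: standard UTF-8 uses 1 to 4 bytes per character, while the variant read by readUTF() uses 1 to 3 bytes per Java char. readUTF() ultimately stores all of the decoded data in the char array (not the byte array) and then converts it to a String.
Because single-byte UTF-8 is identical to ASCII, the leading run of single-byte characters is pre-processed here and copied straight into chararr. The loop stops at the first byte above 127; everything from that point on is handled in the next step.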

Step 5: continue with the data left over after the pre-processing in step 4. The code is:

// After the leading single-byte characters, decode the rest of the data.
while (count < utflen) {
    // The next two statements do two things.
    // (01) Convert the byte to an int, keeping only the low 8 bits.
    //      For example, “11001010” becomes “00000000 00000000 00000000 11001010”.
    // (02) Shift the int right by 4 bits.
    //      For example, “00000000 00000000 00000000 11001010” shifted right by 4 bits becomes “00000000 00000000 00000000 00001100”.
    c = (int) bytearr[count] & 0xff;
    switch (c >> 4) {
        // Single-byte character: bytearr[count] has the form “0xxxxxxx”,
        // so c >> 4 is in the range 0-7.
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            /* 0xxxxxxx*/
            count++;
            chararr[chararr_count++]=(char)c;
            break;

        // Two-byte character: bytearr[count] is the first byte, “110xxxxx”, of “110xxxxx  10xxxxxx”,
        // so c >> 4 is in the range 12-13.
        case 12: case 13:
            /* 110x xxxx   10xx xxxx*/
            count += 2;
            if (count > utflen)
                throw new UTFDataFormatException(
                    "malformed input: partial character at end");
            char2 = (int) bytearr[count-1];
            if ((char2 & 0xC0) != 0x80)
                throw new UTFDataFormatException(
                    "malformed input around byte " + count);
            chararr[chararr_count++]=(char)(((c & 0x1F) << 6) |
                                            (char2 & 0x3F));
            break;

        // Three-byte character: bytearr[count] is the first byte, “1110xxxx”, of “1110xxxx  10xxxxxx  10xxxxxx”,
        // so c >> 4 is 14.
        case 14:
            /* 1110 xxxx  10xx xxxx  10xx xxxx */
            count += 3;
            if (count > utflen)
                throw new UTFDataFormatException(
                    "malformed input: partial character at end");
            char2 = (int) bytearr[count-2];
            char3 = (int) bytearr[count-1];
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                throw new UTFDataFormatException(
                    "malformed input around byte " + (count-1));
            chararr[chararr_count++]=(char)(((c     & 0x0F) << 12) |
                                            ((char2 & 0x3F) << 6)  |
                                            ((char3 & 0x3F) << 0));
            break;

        // Every other first byte is rejected:
        // “10xxxxxx” (c >> 4 is 8-11) is a continuation byte with no lead byte before it, and
        // “1111xxxx” (c >> 4 is 15) would start a four-byte sequence, which readUTF() does not accept.
        default:
            /* 10xx xxxx,  1111 xxxx */
            throw new UTFDataFormatException(
                "malformed input around byte " + count);
    }
}

(a) The following two statements are explained together:

c = (int) bytearr[count] & 0xff;
switch (c >> 4) { ... }

First, why are these two statements needed?
The reason is simple: they determine how many bytes the current UTF-8 character occupies, since a UTF-8 character can take anywhere from 1 to 4 bytes.

Let's first look at the UTF-8 formats for 1 to 4 bytes.

Length   | General UTF-8 format
---------+------------------------------------
1 byte   | 0xxxxxxx
2 bytes  | 110xxxxx 10xxxxxx
3 bytes  | 1110xxxx 10xxxxxx 10xxxxxx
4 bytes  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

After c = (int) bytearr[count] & 0xff; and c >> 4 are applied to the first byte of each format, the values become:

Length   | int value of c >> 4 for the first byte | Range
---------+----------------------------------------+------------------------------
1 byte   | 00000000 00000000 00000000 00000xxx    | 0-7
2 bytes  | 00000000 00000000 00000000 0000110x    | 12-13
3 bytes  | 00000000 00000000 00000000 00001110    | 14
4 bytes  | 00000000 00000000 00000000 00001111    | 15 (rejected by readUTF())
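
The four-byte row deserves a closer look, because readUTF() never decodes it: the default branch of the switch throws UTFDataFormatException. The format that writeUTF() and readUTF() share is Java's modified UTF-8, in which a character outside the Basic Multilingual Plane is written as two surrogate chars of three bytes each. The sketch below illustrates both facts; ModifiedUtf8Demo and its helpers are hypothetical names, and U+1F600 is just a convenient supplementary code point.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Total number of bytes writeUTF emits for s, including the 2-byte length prefix.
    static int encodedSize(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for short in-memory strings
        }
        return buf.size();
    }

    // True if readUTF rejects the given bytes (length prefix included) as malformed.
    static boolean isRejected(byte[] prefixAndData) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(prefixAndData))) {
            in.readUTF();
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a String as a surrogate pair (two chars).
        String supplementary = new String(Character.toChars(0x1F600));
        // 2 prefix bytes + 3 bytes per surrogate char = 8 bytes.
        System.out.println(encodedSize(supplementary)); // prints 8

        // The same code point in standard four-byte UTF-8, behind a length prefix of 4.
        byte[] standard = {0, 4, (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        System.out.println(isRejected(standard)); // prints true
    }
}
```

In practice this means readUTF() should only be pointed at data produced by writeUTF(), not at arbitrary UTF-8 text.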

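Why is that?
Take the two-byte UTF-8 format as an example.
Its general format is “110xxxxx 10xxxxxx”. Only the first byte, “110xxxxx”, is examined at this point.
(a.1) In c = (int) bytearr[count] & 0xff;, bytearr[count] is first converted to an int.

“110xxxxx”

becomes, after conversion to int,

“11111111 11111111 11111111 110xxxxx”

Because “110xxxxx” is negative as a byte (its first bit is 1), the extra high bits are filled with 1 during the conversion.

(a.2) Next, still within c = (int) bytearr[count] & 0xff;, the converted value is combined with 0xff using bitwise AND (&). The result is:

“00000000 00000000 00000000 110xxxxx”

(a.3) c >> 4 then shifts this result right by 4 bits, giving:

“00000000 00000000 00000000 0000110x”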
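
These three sub-steps can be replayed on a concrete byte. The sketch below uses 0xCA (binary 11001010), the example value from the source comments; LeadByteDemo and highNibble are hypothetical names for this illustration.

```java
class LeadByteDemo {
    // Mirrors the two statements from readUTF: mask to 0-255, then shift right by 4.
    static int highNibble(byte b) {
        int c = (int) b & 0xff;
        return c >> 4;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xCA; // binary 11001010, a first byte of the form 110xxxxx

        // (a.1) Plain conversion to int sign-extends the byte.
        System.out.println(Integer.toBinaryString(b));        // prints 11111111111111111111111111001010
        // (a.2) Masking with 0xff clears the 24 extra high bits.
        System.out.println(Integer.toBinaryString(b & 0xff)); // prints 11001010
        // (a.3) Shifting right by 4 leaves the high nibble.
        System.out.println(highNibble(b));                    // prints 12

        // Without the mask, the shift would operate on a negative number.
        System.out.println(((int) b) >> 4);                   // prints -4
    }
}
```

The last line shows why the mask matters: without it, no first byte above 127 could ever match case 12, 13 or 14.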

(b) With that understood, the body of switch (c >> 4) { ... } is easy to follow.
Again take the two-byte UTF-8 format as an example.
It is handled by case 12 and case 13; the source is:

count += 2;
if (count > utflen)
    throw new UTFDataFormatException(
        "malformed input: partial character at end");
char2 = (int) bytearr[count-1];
if ((char2 & 0xC0) != 0x80)
    throw new UTFDataFormatException(
        "malformed input around byte " + count);
chararr[chararr_count++]=(char)(((c & 0x1F) << 6) | (char2 & 0x3F));

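(b.1) Because this case corresponds to a two-byte UTF-8 character, count += 2 is executed, which moves past both bytes.
(b.2) The expression ((c & 0x1F) << 6) | (char2 & 0x3F) joins the 5 payload bits of the first byte with the 6 payload bits of the second byte. The elements of chararr are chars, and a char occupies exactly 2 bytes, so the result is cast to char and stored in chararr.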
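
As a worked instance of the two-byte case, the sketch below decodes the byte pair 0xC3 0xA9. TwoByteDecodeDemo and decodeTwoBytes are hypothetical names; the method reuses the masks from the JDK source but, unlike readUTF(), reports bad input with IllegalArgumentException to stay self-contained.

```java
class TwoByteDecodeDemo {
    // Decodes one two-byte sequence "110xxxxx 10xxxxxx" into a char.
    static char decodeTwoBytes(byte first, byte second) {
        int c = first & 0xff;
        int char2 = second;
        if ((c >> 5) != 0b110) {
            throw new IllegalArgumentException("first byte is not of the form 110xxxxx");
        }
        if ((char2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("second byte is not of the form 10xxxxxx");
        }
        // 5 payload bits from the first byte, 6 payload bits from the second.
        return (char) (((c & 0x1F) << 6) | (char2 & 0x3F));
    }

    public static void main(String[] args) {
        // 0xC3 = 11000011 -> payload 00011; 0xA9 = 10101001 -> payload 101001
        // Joined: 00011 101001 = 0xE9
        char decoded = decodeTwoBytes((byte) 0xC3, (byte) 0xA9);
        System.out.println(Integer.toHexString(decoded)); // prints e9
    }
}
```

0xE9 is the code point U+00E9, whose standard UTF-8 encoding is exactly the pair C3 A9.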

Step 6: convert the char array to a String and return it. The code is:

return new String(chararr, 0, chararr_count);

Example code

For detailed usage of the DataInputStream API, see the example code (DataInputStreamTest.java):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.lang.SecurityException;

/**
 * Test program for DataInputStream and DataOutputStream
 *
 * @author skywang
 */
public class DataInputStreamTest {

    private static final int LEN = 5;

    public static void main(String[] args) {
        // Test DataOutputStream: write data to the output stream.
        testDataOutputStream() ;
        // Test DataInputStream: read back the data written above.
        testDataInputStream() ;
    }

    /**
     * Test method for the DataOutputStream API
     */
    private static void testDataOutputStream() {

        try {
            File file = new File("file.txt");
            DataOutputStream out =
                  new DataOutputStream(
                      new FileOutputStream(file));

            out.writeBoolean(true);
            out.writeByte((byte)0x41);
            out.writeChar((char)0x4243);
            out.writeShort((short)0x4445);
            out.writeInt(0x12345678);
            out.writeLong(0x0FEDCBA987654321L);

            out.writeUTF("abcdefghijklmnopqrstuvwxyz嚴12");

            out.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Test method for the DataInputStream API
     */
    private static void testDataInputStream() {

        try {
            File file = new File("file.txt");
            DataInputStream in =
                  new DataInputStream(
                      new FileInputStream(file));

            System.out.printf("byteToHexString(0x8F):0x%s\n", byteToHexString((byte)0x8F));
            System.out.printf("charToHexString(0x8FCF):0x%s\n", charToHexString((char)0x8FCF));

            System.out.printf("readBoolean():%s\n", in.readBoolean());
            System.out.printf("readByte():0x%s\n", byteToHexString(in.readByte()));
            System.out.printf("readChar():0x%s\n", charToHexString(in.readChar()));
            System.out.printf("readShort():0x%s\n", shortToHexString(in.readShort()));
            System.out.printf("readInt():0x%s\n", Integer.toHexString(in.readInt()));
            System.out.printf("readLong():0x%s\n", Long.toHexString(in.readLong()));
            System.out.printf("readUTF():%s\n", in.readUTF());

            in.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Returns the hexadecimal string for a byte
    private static String byteToHexString(byte val) {
        return Integer.toHexString(val & 0xff);
    }

    // Returns the hexadecimal string for a char
    private static String charToHexString(char val) {
        return Integer.toHexString(val);
    }

    // Returns the hexadecimal string for a short
    private static String shortToHexString(short val) {
        return Integer.toHexString(val & 0xffff);
    }
}

Output:

byteToHexString(0x8F):0x8f
charToHexString(0x8FCF):0x8fcf
readBoolean():true
readByte():0x41
readChar():0x4243
readShort():0x4445
readInt():0x12345678
readLong():0xfedcba987654321
readUTF():abcdefghijklmnopqrstuvwxyz嚴12

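Notes on the output:
(01) Inspect file.txt in hexadecimal. The two bytes immediately before the string data are 001f.

001f corresponds to the int value 31, which is the length in bytes of the UTF-8 data that follows. In the string “abcdefghijklmnopqrstuvwxyz嚴12”, the letters “ab...xyz” take 26 bytes, “嚴” takes 3 bytes in UTF-8, and “12” takes 2 bytes. Total length = 26 + 3 + 2 = 31.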
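
The arithmetic can be cross-checked in code. In the sketch below, modifiedUtf8Length is a hypothetical helper that applies the per-char sizes of the encoding written by writeUTF() (1 byte for U+0001 to U+007F, 2 bytes up to U+07FF and for U+0000, 3 bytes otherwise), and actualPrefix reads the real two-byte prefix. The CJK character is written as the escape \u56b4 to keep the source file ASCII-only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfLengthDemo {
    // Hypothetical helper: number of bytes writeUTF needs for s, excluding the prefix.
    static int modifiedUtf8Length(String s) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 0x0001 && ch <= 0x007F) {
                len += 1;
            } else if (ch <= 0x07FF) {
                len += 2; // also covers U+0000
            } else {
                len += 3;
            }
        }
        return len;
    }

    // The length prefix that writeUTF actually stores in front of the data.
    static int actualPrefix(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for short in-memory strings
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "abcdefghijklmnopqrstuvwxyz\u56b412";
        System.out.println(modifiedUtf8Length(s));                // prints 31
        System.out.println(Integer.toHexString(actualPrefix(s))); // prints 1f
    }
}
```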

(02) Returning the hexadecimal string for a byte
The source is:

private static String byteToHexString(byte val) {
    return Integer.toHexString(val & 0xff);
}

Consider why the code is:

return Integer.toHexString(val & 0xff);

rather than

return Integer.toHexString(val);

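Let's first look at what byteToHexString((byte)0x8F); outputs in the two cases.
return Integer.toHexString(val & 0xff); outputs “0x8f”
return Integer.toHexString(val); outputs “0xffffff8f”
Why is that?
The reason is simple: it is caused by the byte-to-int conversion.
The byte 0x8F is a negative number whose binary form is 10001111. Converting a negative byte to int is a sign-extending conversion (the new high bits are filled with the sign bit). The sign bit of 0x8F is 1, so the new bits are filled with 1; the result in binary is 11111111 11111111 11111111 10001111, which is 0xffffff8f in hexadecimal.
That is why Integer.toHexString(val); returns 0xffffff8f.
Integer.toHexString(val & 0xff) is equivalent to 0xffffff8f & 0xff, which gives 0x8f.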
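
The two variants can be compared side by side. In this small sketch, ByteHexDemo, withMask and withoutMask are hypothetical names; note that Integer.toHexString() itself returns no “0x” prefix, which the example program adds in its printf format string.

```java
class ByteHexDemo {
    // The variant used by byteToHexString: mask away the sign extension first.
    static String withMask(byte val) {
        return Integer.toHexString(val & 0xff);
    }

    // The naive variant: val is sign-extended to int before formatting.
    static String withoutMask(byte val) {
        return Integer.toHexString(val);
    }

    public static void main(String[] args) {
        byte val = (byte) 0x8F;
        System.out.println(withMask(val));    // prints 8f
        System.out.println(withoutMask(val)); // prints ffffff8f
    }
}
```

For non-negative bytes such as 0x41 both variants agree, so the bug only shows up once a byte has its high bit set.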

(03) Returning the hexadecimal string for a char and for a short
The source for the char version is:

private static String charToHexString(char val) {
    return Integer.toHexString(val);
}

The source for the short version is:

private static String shortToHexString(short val) {
    return Integer.toHexString(val & 0xffff);
}

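Comparing the two functions, why does one use “val” while the other uses “val & 0xffff”?
By the analysis in (02), the short version needs “val & 0xffff” for the same reason: short is a signed type, so a negative value is sign-extended when it is converted to int.
For the char version, plain “val” is enough. The reason is equally simple: in Java, char is an unsigned two-byte type, so converting a char to int is a zero-extending conversion, and the new high bits are all filled with 0.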
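
A quick way to see the difference is to push the same 16 bits, 0x8FCF, through both types. CharShortHexDemo and its three helpers are hypothetical names for this sketch:

```java
class CharShortHexDemo {
    // char is unsigned, so widening to int fills the high bits with 0.
    static String charHex(char val) {
        return Integer.toHexString(val);
    }

    // short is signed, so widening to int copies the sign bit into the high bits.
    static String shortHexNoMask(short val) {
        return Integer.toHexString(val);
    }

    // Masking with 0xffff removes the sign extension again.
    static String shortHex(short val) {
        return Integer.toHexString(val & 0xffff);
    }

    public static void main(String[] args) {
        System.out.println(charHex((char) 0x8FCF));         // prints 8fcf
        System.out.println(shortHexNoMask((short) 0x8FCF)); // prints ffff8fcf
        System.out.println(shortHex((short) 0x8FCF));       // prints 8fcf
    }
}
```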
