
HashMap Source Code Analysis (JDK 1.8): Everything You Need to Know

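       HashMap is bread-and-butter knowledge for Java and Android developers, and JDK 1.8 optimized it. Do you really understand it?

Consider the following questions:

1. What are the basic principles of hashing? (Answer: hash table, hash collisions, linked lists, red-black trees.)

2. What is the lookup time complexity of a HashMap, and what determines it? (Answer: O(1) at best and O(n) at worst; O(log n) once a bucket has been converted to a red-black tree.)

3. How is resize implemented? Remember that entries are no longer rehashed one by one! (Answer: each entry in a chain is routed by one high bit of its hash to either its current index i or index i + oldCap, where oldCap is the table length before the resize.)

4. Why is the index computed with bitwise AND (&) rather than modulo (%)? (Answer: it is not only that & is faster. If you can answer this one, I'd say you truly understand HashMap.)

5. When is resize executed?

Answer: when the number of entries in the HashMap instance exceeds threshold, resize runs (the number of buckets doubles and the existing entries are redistributed). PS: threshold = number of buckets * load factor.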

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
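        n = (tab = resize()).length;   // lazily allocate the table; 16 buckets by default
    if ((p = tab[i = (n - 1) & hash]) == null)   // bucket i is empty: create a new Node
        tab[i] = newNode(hash, key, value, null);
    else { // bucket i is already occupied, i.e. (n - 1) & hash matches an existing entry's index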
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;   // same key: its value is overwritten further down
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);  // the bucket is a red-black tree: insert into the tree
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
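                    p.next = newNode(hash, key, value, null);  // append the new node to the tail of the list
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);  // the list already held 8 nodes before this append: convert it to a red-black tree (treeifyBin only resizes instead while the table has fewer than 64 buckets)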
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;            // existing key: the value is replaced and we return early, so ++size is never reached
        }
    }
    ++modCount;
    if (++size > threshold)    // only reached for a new key: size grows by 1, and the table is resized once size exceeds threshold
        resize();
    afterNodeInsertion(evict);
    return null;
}

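6. Why does the load factor default to 0.75f? Could it be 0.1, 0.9, 2, 3 and so on?

Answer: 0.75 balances time against space. The smaller the load factor, the more buckets there are per entry and the cheaper reads and writes become (approaching O(1), with fewer hash collisions). The larger the load factor, the fewer buckets there are and the more expensive reads and writes become (approaching O(n) in the extreme, with more hash collisions). 0.1, 0.9, 2 and 3 are all legal values.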
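
A quick way to check which load factors are accepted is to try the two-argument constructor. This is only a sketch: LoadFactorDemo and accepts are hypothetical names, and it relies on the constructor throwing IllegalArgumentException for values that are not positive or are NaN.

```java
import java.util.HashMap;

class LoadFactorDemo {
    // Hypothetical helper: returns true if the HashMap constructor accepts the load factor.
    static boolean accepts(float loadFactor) {
        try {
            new HashMap<String, Integer>(16, loadFactor);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (float lf : new float[] {0.1f, 0.75f, 0.9f, 2f, 3f, 0f, -1f, Float.NaN}) {
            System.out.println(lf + " -> " + accepts(lf));
        }
    }
}
```

Values above 1 are accepted; they simply let the chains grow longer before the table is resized.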

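7. What affects HashMap performance?

1. The load factor.

2. The hash values. Ideally they spread the entries evenly across the buckets. HashMap keys are very often Strings, and String overrides hashCode. HashMap additionally folds the high 16 bits of the hash code into the low 16 bits: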

   static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
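
To see why the high bits are folded in, compare two hash codes that differ only above bit 15. The sketch below reuses the expression from hash() and the (n - 1) & hash indexing; HashSpreadDemo, spread and indexFor are hypothetical names used for illustration.

```java
class HashSpreadDemo {
    // Same spreading step as the hash() method quoted above.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length n is a power of two.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int a = 0x10000; // a and b differ only in bits above bit 15
        int b = 0x20000;
        System.out.println(indexFor(a, n) + " " + indexFor(b, n));                 // 0 0
        System.out.println(indexFor(spread(a), n) + " " + indexFor(spread(b), n)); // 1 2
    }
}
```

Without the fold, a small table only ever looks at the low bits, so both values collide at index 0.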

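8. What must a HashMap key satisfy?

Answer: it must override both hashCode and equals, consistently with each other (equal keys must return equal hash codes). The commonly used String class implements both.

Sample code: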

    private static class KeyClass {
        int age;
        String name;

        public boolean equals(Object anyObject) {
            if (anyObject == this) {
                return true;
            }

            if (anyObject instanceof KeyClass) {
                KeyClass obj = (KeyClass) anyObject;
                if (obj.age == this.age
                        && java.util.Objects.equals(obj.name, this.name)) { // compare String contents, not references
                    return true;
                }
            }
            return false;
        }

        public int hashCode() {
            return name==null? age : age|name.hashCode();
        }
    }

    public static void main(String[] args) {
        HashMap<KeyClass, String> map = new HashMap<>();
        KeyClass obj1 = new KeyClass();
        KeyClass obj2 = new KeyClass();
        obj1.age = 1;
        obj1.name = "Tom";
        obj2.age = 2;
        obj2.name ="Jack";
        map.put(obj1, "aaa");
        map.put(obj2, "bbb");
        map.put(obj1, "ccc");
        map.put(obj2, "ddd");

        map.forEach((key, value) -> {
            System.out.println(key.name + "---" + value);
        });
    }
Output:
Tom---ccc
Jack---ddd
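
For contrast, here is a sketch of what happens when a key class does not override the two methods. RawKey and GoodKey are hypothetical classes invented for this illustration; RawKey inherits the identity-based equals and hashCode from Object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class KeyContractDemo {
    // Hypothetical key that relies on Object's identity-based equals/hashCode.
    static final class RawKey {
        final String name;
        RawKey(String name) { this.name = name; }
    }

    // Hypothetical key with value-based equals/hashCode.
    static final class GoodKey {
        final String name;
        GoodKey(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && Objects.equals(((GoodKey) o).name, name);
        }
        @Override public int hashCode() { return Objects.hashCode(name); }
    }

    public static void main(String[] args) {
        Map<RawKey, String> raw = new HashMap<>();
        raw.put(new RawKey("Tom"), "aaa");
        System.out.println(raw.get(new RawKey("Tom")));   // null: a different instance is a different key

        Map<GoodKey, String> good = new HashMap<>();
        good.put(new GoodKey("Tom"), "aaa");
        System.out.println(good.get(new GoodKey("Tom"))); // aaa
    }
}
```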

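9. HashMap allows a null key (at most one) and null values (any number). Why at most one null key?

Answer: hash(null) is 0, so a null key always lands in the first bucket (index 0). Keys are unique, and every null key matches the existing null-key entry, so a second put with a null key overwrites the value instead of adding another entry.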
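
A small sketch of that behaviour (NullKeyDemo and build are hypothetical names):

```java
import java.util.HashMap;
import java.util.Map;

class NullKeyDemo {
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put(null, "first");
        map.put(null, "second"); // same key: overwrites, size stays 1
        map.put("a", null);      // null values are not limited
        map.put("b", null);
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = build();
        System.out.println(map.size());           // 3
        System.out.println(map.get(null));        // second
        System.out.println(map.containsKey("a")); // true, even though the value is null
    }
}
```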

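JDK 1.8 HashMap source: http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java#HashMap

        My habit is to read the comments first, then read and debug the source, so let me start by paraphrasing the class comments. Corrections are welcome.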

Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

    In short: HashMap is a hash-table-based implementation of Map. It supports every optional Map operation and accepts null values and a null key. It is roughly the same as Hashtable, except that it is not synchronized and permits nulls. It promises nothing about iteration order, and the order may change over time.

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.

    If the hash function spreads the elements properly across the buckets, get and put run in constant time.

Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.

    Iterating over the collection views takes time proportional to the number of buckets (the capacity) plus the number of mappings (the size). If iteration speed matters, do not set the initial capacity too high or the load factor too low.

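An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.

    Two parameters of a HashMap instance affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the moment the table is created.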

The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

    The load factor measures how full the table may get before its capacity is automatically increased.

When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

    When the number of entries exceeds load factor * current capacity, the table is rehashed (its internal data structures are rebuilt) and ends up with roughly twice as many buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put).

    The default load factor of 0.75 is a good tradeoff between time and space. A higher load factor reduces the space overhead but increases the lookup cost, which shows up in most HashMap operations, including get and put.

The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

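    When choosing the initial capacity, take the expected number of entries and the load factor into account so that rehashing happens as rarely as possible. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash will ever occur.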
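
The rule in the last sentence can be turned into a small sizing helper. This is a minimal sketch: PresizeDemo and initialCapacityFor are hypothetical names, and the helper simply picks the smallest int that is strictly greater than expectedEntries / loadFactor (HashMap then rounds it up to a power of two internally).

```java
import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    // Hypothetical helper: smallest capacity strictly greater than expectedEntries / loadFactor.
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) (expectedEntries / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int expected = 1000;
        Map<Integer, String> map = new HashMap<>(initialCapacityFor(expected, 0.75f), 0.75f);
        for (int i = 0; i < expected; i++) {
            map.put(i, "v" + i); // per the Javadoc rule above, the table should not need to grow here
        }
        System.out.println(initialCapacityFor(expected, 0.75f)); // 1334
        System.out.println(map.size());                          // 1000
    }
}
```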

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same hashCode() is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties.

    If many mappings are going to be stored, creating the HashMap with a large enough capacity is more efficient than letting it grow through repeated automatic rehashing. Many keys with the same hashCode() will slow down any hash table. To soften the impact, when the keys are Comparable, HashMap may use their comparison order to help break ties.
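
The worst case mentioned here is easy to construct. In the sketch below, BadKey is a hypothetical key whose hashCode is constant, so every entry lands in one bucket; because it is Comparable, HashMap can still organise that bucket by comparison order. The example only checks correctness and makes no timing claims.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    // Hypothetical worst-case key: every instance reports the same hash code.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
    }

    static Map<BadKey, Integer> fill(int count) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < count; i++) {
            map.put(new BadKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = fill(1000);
        System.out.println(map.size());               // 1000
        System.out.println(map.get(new BadKey(777))); // 777
    }
}
```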

Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map.

    HashMap is not thread-safe. If several threads access a map concurrently and at least one of them modifies it structurally, access must be synchronized externally. (A structural modification adds or removes one or more mappings; changing the value of an existing key does not count.) This is usually done by synchronizing on an object that naturally encapsulates the map.

If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map:

    If no such object exists, wrap the map with Collections.synchronizedMap. Do this at creation time so that no unsynchronized reference to the underlying map can be used by accident:

   Map m = Collections.synchronizedMap(new HashMap(...));
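
A minimal sketch of the wrapper in use, assuming two writer threads with disjoint key ranges (SynchronizedMapDemo and fillFromTwoThreads are hypothetical names). Individual calls such as put are synchronized by the wrapper, but iteration still has to hold the wrapper's lock manually.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SynchronizedMapDemo {
    // Two threads insert disjoint key ranges; returns the number of keys seen afterwards.
    static int fillFromTwoThreads(int perThread) {
        Map<Integer, Integer> map = Collections.synchronizedMap(new HashMap<>());
        Thread t1 = new Thread(() -> { for (int i = 0; i < perThread; i++) map.put(i, i); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < perThread; i++) map.put(perThread + i, i); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int count = 0;
        synchronized (map) { // iterating a synchronized wrapper requires holding its lock
            for (Integer ignored : map.keySet()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fillFromTwoThreads(1000)); // 2000
    }
}
```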

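   JDK 1.8 optimized the chaining used after hash collisions: when a chain holds too many entries (more than 8), the linked list is restructured into a red-black tree. Here is the relevant comment from the source: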

   * This map usually acts as a binned (bucketed) hash table, but
      * when bins get too large, they are transformed into bins of
      * TreeNodes, each structured similarly to those in
      * java.util.TreeMap. Most methods try to use normal bins, but
      * relay to TreeNode methods when applicable (simply by checking
      * instanceof a node).  Bins of TreeNodes may be traversed and
      * used like any others, but additionally support faster lookup
      * when overpopulated. However, since the vast majority of bins in
      * normal use are not overpopulated, checking for existence of
      * tree bins may be delayed in the course of table methods.

Now look at a few important HashMap fields:

 //The default initial capacity - MUST be a power of two.

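static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // i.e. 16. Why not just write 16? The shift spells out that the capacity must be a power of two

 The default initial capacity of a HashMap is 16, and the capacity must always be a power of two. Every resize doubles it.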
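
Because the capacity must be a power of two, a requested capacity such as 17 has to be rounded up. The sketch below is modeled on the JDK 8 helper of the same name as I remember it, so treat it as an illustration rather than a verbatim copy; PowerOfTwoDemo is a hypothetical wrapper class.

```java
class PowerOfTwoDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Rounds cap up to the next power of two by smearing the highest set bit to the right.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16
        System.out.println(tableSizeFor(17)); // 32
    }
}
```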

static final float DEFAULT_LOAD_FACTOR = 0.75f;

The default load factor is 0.75f, and 16 * 0.75 = 12. So a default HashMap grows to 32 buckets when the 13th entry is inserted.
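
The arithmetic can be replayed without touching HashMap internals. The sketch below only simulates the "++size > threshold" check for default settings (ResizeScheduleDemo and capacityAfter are hypothetical names); it ignores lazy allocation of the table and resizes triggered from treeifyBin.

```java
class ResizeScheduleDemo {
    // Simulated table capacity after inserting the given number of distinct keys
    // into a map with the default capacity (16) and load factor (0.75).
    static int capacityAfter(int insertions) {
        int capacity = 16;
        float loadFactor = 0.75f;
        int threshold = (int) (capacity * loadFactor);
        for (int size = 1; size <= insertions; size++) {
            if (size > threshold) {          // mirrors "if (++size > threshold) resize();"
                capacity <<= 1;
                threshold = (int) (capacity * loadFactor);
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(capacityAfter(12)); // 16
        System.out.println(capacityAfter(13)); // 32
        System.out.println(capacityAfter(25)); // 64
    }
}
```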

The bin count threshold for using a tree rather than list for a bin. Bins are converted to trees when adding an element to a bin with at least this many nodes. The value must be greater than 2 and should be at least 8 to mesh with assumptions in tree removal about conversion back to plain bins upon shrinkage.
static final int TREEIFY_THRESHOLD = 8;

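Note: this is the JDK 1.8 optimization. When an entry is added to a bucket whose list already holds 8 nodes, the list is restructured into a red-black tree (provided the table has at least 64 buckets; below that, the table is resized instead), and lookups in that bucket become O(log N).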

The table, initialized on first use, and resized as necessary. When allocated, length is always a power of two. (We also tolerate length zero in some operations to allow bootstrapping mechanics that are currently not needed.)
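transient Node<K,V>[] table;  // the buckets of the HashMap

If there were no hash collisions (and that is a big if), a HashMap would simply be an array. Array lookup is O(1), so the ideal lookup cost of a HashMap is O(1). If all N entries land at the same index and form one linked list, lookup is O(n), so the worst case is O(n). Once a list has been restructured into a red-black tree, lookup within that bucket is O(log N).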

Holds cached entrySet(). Note that AbstractMap fields are used for keySet() and values().
transient Set<Map.Entry<K,V>> entrySet; // cached Set view of all entries in the map; as a Set it never holds two entries with the same key and value (keys are unique anyway)

                               

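Structure of the hash table

1. Cause of hash collisions and how they are handled:

     A hash collision occurs when different keys map to the same index, that is, when their hash codes are equal modulo the capacity.

Line 629 of the source is if ((p = tab[i = (n - 1) & hash]) == null), where n is the capacity. The storage position is the hash ANDed with (capacity - 1). If (n - 1) & hash is the same for different keys, they must be stored at the same array index; this is a hash collision.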

        final V putVal(int hash, K key, V value, boolean onlyIfAbsent,boolean evict) {

          ....

         if ((p = tab[i = (n - 1) & hash]) == null)     // no entry at this index yet: store the new node here
             tab[i] = newNode(hash, key, value, null);     
         else {
             Node<K,V> e; K k;
             if (p.hash == hash &&
                 ((k = p.key) == key || (key != null && key.equals(k))))
                 e = p;      // same hash and same key: the value will be updated
             else if (p instanceof TreeNode)
                 e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);  // the bucket holds TreeNodes: add the new entry to the red-black tree
             else {
                 for (int binCount = 0; ; ++binCount) {
                     if ((e = p.next) == null) {
                         p.next = newNode(hash, key, value, null);   // append the new Node to the tail of the list
                         if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                             treeifyBin(tab, hash);    // the list already held 8 nodes before this append: restructure it as a red-black tree
                         break;
                     }    

            .....

           }
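
To make the collision path concrete, here is a stripped-down chained table in the spirit of putVal. It is a sketch, not the JDK code: TinyChainedMap and its members are hypothetical names, the table is fixed at 16 buckets, and resizing and tree bins are left out. The keys "Aa" and "BB" are used because their String hash codes are equal (2112), so they are guaranteed to collide.

```java
import java.util.Objects;

class TinyChainedMap {
    static final class Node {
        final int hash;
        final String key;
        String value;
        Node next;
        Node(int hash, String key, String value) {
            this.hash = hash;
            this.key = key;
            this.value = value;
        }
    }

    private final Node[] table = new Node[16]; // fixed power-of-two length, never resized
    private int size;

    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Returns the previous value, or null if the key was absent.
    String put(String key, String value) {
        int h = hash(key);
        int i = (table.length - 1) & h;
        for (Node e = table[i]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) { // existing key: overwrite
                String old = e.value;
                e.value = value;
                return old;
            }
            if (e.next == null) {                            // end of chain: append at the tail
                e.next = new Node(h, key, value);
                size++;
                return null;
            }
        }
        table[i] = new Node(h, key, value);                  // empty bucket
        size++;
        return null;
    }

    String get(String key) {
        int h = hash(key);
        for (Node e = table[(table.length - 1) & h]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyChainedMap map = new TinyChainedMap();
        map.put("Aa", "1");
        map.put("BB", "2");                // same hash as "Aa": chained in the same bucket
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.size());    // 2
    }
}
```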

2. The biggest JDK 1.8 optimization of HashMap is the resize function: growing the table no longer needs a per-entry rehash. Let's see how the authors did it.

Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.

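Initialize the table or double its size. If the table is null, allocate it according to the initial capacity target held in the threshold field. Otherwise, because capacities are powers of two, the elements of each bin either stay at the same index or move by a power-of-two offset in the new table. (Example: with an initial table of 16, two entries stored at index 1 end up at index 1 or at index 16 + 1 after the resize.)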

Returns:
    the table
final Node<K,V>[] resize() {

         ....

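         newThr = oldThr << 1; // double threshold along with the capacity; written as a shift rather than * 2, in keeping with the power-of-two sizing

   if (e.next == null)
      newTab[e.hash & (newCap - 1)] = e;  // single node in this bucket: it goes to the same index or to index + oldCap (on the first resize, an entry at index 4 ends up at 4 or at 4 + 16)
  else if (e instanceof TreeNode)
     ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);  // tree bin: split it into a low bin and a high bin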

   else {

     do {
        next = e.next;
        if ((e.hash & oldCap) == 0) {
            if (loTail == null)
                loHead = e;
            else
                loTail.next = e;
            loTail = e;
        } else {
            if (hiTail == null)
                hiHead = e;
            else
                hiTail.next = e;
            hiTail = e;
        }
      } while ((e = next) != null);
      if (loTail != null) {
          loTail.next = null;
          newTab[j] = loHead;   // index unchanged
      }
      if (hiTail != null) {
          hiTail.next = null;
          newTab[j + oldCap] = hiHead; // index shifted by the old capacity
      }

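   (e.hash & oldCap) == 0 is a really neat trick! It splits the entries of the old list across two indexes: each entry has about a 50% chance of staying at the current index and a 50% chance of moving to the high index. If that sounds confusing, here is an example. In the figure above, index 0 holds 496 and 896. Suppose their hash codes (int, 4 bytes) are: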

Before resizing (table length 16):

hash of 496:     00000000  00000000  00000000  00000000

hash of 896:     01010000  01100000  10000000  00100000

n - 1 is 15:     00000000  00000000  00000000  00001111

    hash & (n - 1) is 0 for both 496 and 896, so both sit at index 0.

First resize, 16 to 32 (oldCap is 16):

oldCap is 16:    00000000  00000000  00000000  00010000

    e.hash & oldCap is 0 for both, so both stay at index 0.

Second resize, 32 to 64 (oldCap is 32):

oldCap is 32:    00000000  00000000  00000000  00100000

    e.hash & oldCap is 0 for 496 and 32 (non-zero) for 896, so 496 stays at index 0 and 896 moves to index 0 + 32 = 32.

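   Got it? Because bit n of a hash code is equally likely to be 0 or 1, the entries of a list should in theory be spread evenly between the current index and the corresponding high index.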
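
The split can be replayed on plain integers. The sketch below (SplitDemo and split are hypothetical names) takes the hashes of one bucket and partitions them by the single bit selected by oldCap; entries in the "lo" list keep index j and entries in the "hi" list move to j + oldCap, which matches what hash & (newCap - 1) would give.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Returns [lo, hi]: hashes that stay at the old index and hashes that move by oldCap.
    static List<List<Integer>> split(List<Integer> bucketHashes, int oldCap) {
        List<Integer> lo = new ArrayList<>();
        List<Integer> hi = new ArrayList<>();
        for (int h : bucketHashes) {
            if ((h & oldCap) == 0) {
                lo.add(h);
            } else {
                hi.add(h);
            }
        }
        List<List<Integer>> result = new ArrayList<>();
        result.add(lo);
        result.add(hi);
        return result;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        int newCap = 32;
        List<Integer> bucket5 = Arrays.asList(5, 21, 37, 53); // all map to index 5 when the length is 16
        List<List<Integer>> parts = split(bucket5, oldCap);
        System.out.println(parts.get(0)); // [5, 37]  -> stay at index 5
        System.out.println(parts.get(1)); // [21, 53] -> move to index 5 + 16 = 21
        for (int h : bucket5) {
            System.out.println(h + " -> " + (h & (newCap - 1))); // 5, 21, 5, 21
        }
    }
}
```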

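       Looking back at the JDK 1.7 HashMap: a resize rehashed the table, recomputing the position of every entry one by one. That was slow, which is why JDK 1.8 made this optimization.

      Back to the question at the start of the article: why does HashMap use & rather than % to compute the index? If % were used, every list element would need its position recomputed when the capacity doubles. With & and a power-of-two capacity, doubling the table only exposes one more bit of the hash, which is what makes the split above possible.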
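
A small comparison of the two operators, as a sketch (MaskVsModDemo, byMask and byMod are hypothetical names). Besides the one-extra-bit property, note that Java's % can return a negative result for a negative hash, which would not be a valid array index.

```java
class MaskVsModDemo {
    static int byMask(int hash, int n) {
        return (n - 1) & hash; // n must be a power of two
    }

    static int byMod(int hash, int n) {
        return hash % n;
    }

    public static void main(String[] args) {
        System.out.println(byMask(-7, 16)); // 9
        System.out.println(byMod(-7, 16));  // -7, not usable as an index
        // Doubling the table exposes exactly one more bit: the index stays or grows by 16.
        System.out.println(byMask(21, 16)); // 5
        System.out.println(byMask(21, 32)); // 21 = 5 + 16
    }
}
```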

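     Hats off to the HashMap authors: even their choice of operators is this deliberate!

PS: while we are at it, ArrayList has a default initial capacity of 10 and grows to 1.5 times its previous capacity on each expansion.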
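
As a rough sketch of that growth rule, assuming the step is old + old / 2 with integer arithmetic and ignoring the minimum-capacity and overflow handling of the real class (ArrayListGrowthDemo and grow are hypothetical names):

```java
class ArrayListGrowthDemo {
    // Assumed growth step: new capacity = old + old / 2, computed with a shift.
    static int grow(int oldCapacity) {
        return oldCapacity + (oldCapacity >> 1);
    }

    public static void main(String[] args) {
        int capacity = 10;
        for (int i = 0; i < 4; i++) {
            capacity = grow(capacity);
            System.out.println(capacity); // 15, 22, 33, 49
        }
    }
}
```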