JVM Source Code Analysis: Safepoints

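Last week I had the good fortune to attend a small JVM sharing session. After listening to R大 explain the VM's C2 compiler I was left in awe, although I could remember only a little of it and absorbed less than I would have liked. What I did take away:
1. When C2 compilation happens and how it is carried out (this part is really too complex)
2. C2 compiles the whole method body, not just a fragment of a method
3. Safepoints in the JVM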

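I have long known that when a GC occurs, every thread executing Java code must stop before garbage collection can proceed; this is the familiar STW (stop the world). What puzzled me was the mechanism behind STW: how are these threads paused, and how are they resumed?

All of this revolves around one concept, the safepoint. The OpenJDK implementation lives in openjdk/hotspot/src/share/vm/runtime/safepoint.cpp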

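What is a safepoint

Safepoints serve several purposes, such as GC and deoptimization. In the HotSpot VM the GC safepoint is the most common kind. It requires a data structure that records which locations in each thread's call stack, registers, and other important data areas hold GC-managed pointers.

From a thread's point of view, a safepoint is a special position in the executing code. When a thread reaches such a position, the VM's state is known to be safe, so the thread can be paused there if necessary. For example, when a GC occurs, all active threads must be paused. A thread that has not yet reached a safepoint at that moment keeps running until it reaches the next one, pauses there, and waits for the GC to finish.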

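Where safepoints can be placed

Taking HotSpot as the example, here is a brief description of where safepoints are placed:
1. In theory, a safepoint could be placed at every bytecode boundary in the interpreter. However, the debug symbol information attached to each safepoint takes up memory, and putting a safepoint after every instruction would require storing a large amount of runtime data, so safepoints are placed as sparingly as possible. At each safepoint, polling code is generated that asks the VM whether it needs to “enter the safepoint”. Polling has a cost of its own; the polling operation is explained later.

2. In JIT-compiled code, a safepoint is placed before every method return and before the back-edge of every loop that is not a counted loop (an unbounded loop). This prevents a thread from running indefinitely without pausing when a GC needs to stop the world. In addition, while generating machine code the JIT compiler emits some “debug symbol information” for each safepoint. The information generated for the GC is the OopMap, which indicates where on the stack and in the registers GC-managed pointers are located.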
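The placement rule in item 2 can be illustrated with a small sketch. It is only a model: `safepoint_poll`, `g_polls_executed`, `sum_counted`, and `sum_until_zero` are hypothetical names invented here, and a real JIT emits a load from the polling page rather than a function call. The sketch simply counts how many polls each kind of loop would execute.

```cpp
#include <cstdint>

// Hypothetical counter standing in for "a poll instruction was executed".
static std::uint64_t g_polls_executed = 0;

// Hypothetical poll; a real VM would read the polling page here and
// block the thread if a safepoint had been requested.
inline void safepoint_poll() { ++g_polls_executed; }

// Counted loop: int index, fixed stride, loop-invariant bound.
// Modelled without a poll on the back-edge, only one before the return.
long sum_counted(const int* data, int n) {
  long sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += data[i];
  }
  safepoint_poll();  // poll before method return
  return sum;
}

// Non-counted loop: the trip count is not known in advance,
// so a poll is modelled on every back-edge.
long sum_until_zero(const int* data) {
  long sum = 0;
  while (*data != 0) {
    sum += *data++;
    safepoint_poll();  // poll before jumping back to the loop head
  }
  safepoint_poll();  // poll before method return
  return sum;
}
```

In this model, a counted loop over a very large `n` executes a single poll no matter how long it runs, which is why such a loop can delay a stop-the-world request.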

How threads are suspended

When a GC is triggered, the VM thread calls SafepointSynchronize::begin() from VMThread::loop(), which eventually brings every thread to a safepoint.

// Roll all threads forward to a safepoint and suspend them all
void SafepointSynchronize::begin() {
  Thread* myThread = Thread::current();
  assert(myThread->is_VM_thread(), "Only VM thread may execute a safepoint");

  if (PrintSafepointStatistics || PrintSafepointStatisticsTimeout > 0) {
    _safepoint_begin_time = os::javaTimeNanos();
    _ts_of_current_safepoint = tty->time_stamp().seconds();
  }

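The safepoint implementation contains a comment explaining that Java threads can be in several different states, each suspended by a different mechanism. It lists five cases:

1. Running interpreted Java code

Safepoint state is checked as bytecodes are executed, because begin() calls Interpreter::notice_safepoints(), which tells the interpreter to switch its dispatch table. The implementation is: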

void TemplateInterpreter::notice_safepoints() {
  if (!_notice_safepoints) {
    // switch to safepoint dispatch table
    _notice_safepoints = true;
    copy_table((address*)&_safept_table, (address*)&_active_table, sizeof(_active_table) / sizeof(address));
  }
}

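2. Running native code

If the VM thread finds a Java thread running native code, it does not wait for that Java thread to block. However, when the Java thread returns from native code it must check the safepoint state to see whether it needs to block.

Two pieces of state are involved here: the Java thread state and the safepoint state. Reads and writes of the two must be strictly ordered. This could generally be achieved with memory barriers, but those are relatively expensive. HotSpot takes a different approach: thread state changes are paired with a write to one shared memory page (the memory serialize page), and the VM thread calls os::serialize_thread_states() to toggle that page's protection. The implementation is: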

// Serialize all thread state variables
void os::serialize_thread_states() {
  // On some platforms such as Solaris & Linux, the time duration of the page
  // permission restoration is observed to be much longer than expected  due to
  // scheduler starvation problem etc. To avoid the long synchronization
  // time and expensive page trap spinning, 'SerializePageLock' is used to block
  // the mutator thread if such case is encountered. See bug 6546278 for details.
  Thread::muxAcquire(&SerializePageLock, "serialize_thread_states");
  os::protect_memory((char *)os::get_memory_serialize_page(),
                     os::vm_page_size(), MEM_PROT_READ);
  os::protect_memory((char *)os::get_memory_serialize_page(),
                     os::vm_page_size(), MEM_PROT_RW);
  Thread::muxRelease(&SerializePageLock);
}

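Having the VM thread issue a pair of mprotect system calls ensures that all earlier thread-state writes have been serialized before it proceeds, which is cheaper than a memory barrier on every state transition.

3. Running compiled code

To bring such threads to a safepoint, the VM makes the polling page unreadable. When a Java thread finds that it cannot read the page, it ends up blocked and suspended. In SafepointSynchronize::begin(), the polling page is made unreadable through os::make_polling_page_unreadable().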

if (UseCompilerSafepoints && DeferPollingPageLoopCount < 0) {
    // Make polling safepoint aware
    guarantee (PageArmed == 0, "invariant") ;
    PageArmed = 1 ;
    os::make_polling_page_unreadable();
}

make_polling_page_unreadable() is implemented differently on each platform.

Linux implementation:
// Mark the polling page as unreadable
void os::make_polling_page_unreadable(void) {
  if( !guard_memory((char*)_polling_page, Linux::page_size()) )
    fatal("Could not disable polling page");
};

Solaris implementation:
// Mark the polling page as unreadable
void os::make_polling_page_unreadable(void) {
  if( mprotect((char *)_polling_page, page_size, PROT_NONE) != 0 )
    fatal("Could not disable polling page");
};

During JIT compilation, the compiler inserts the safepoint check into the generated machine code, as in the following instructions:

0x01b6d627: call   0x01b2b210         ; OopMap{[60]=Oop off=460}      
                                       ;*invokeinterface size      
                                       ; - Client1::[email protected] (line 23)      
                                       ;   {virtual_call}      
 0x01b6d62c: nop                       ; OopMap{[60]=Oop off=461}      
                                       ;*if_icmplt      
                                       ; - Client1::[email protected] (line 23)      
 0x01b6d62d: test   %eax,0x160100      ;   {poll}      
 0x01b6d633: mov    0x50(%esp),%esi      
 0x01b6d637: cmp    %eax,%esi

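test %eax,0x160100 is the operation that checks whether the polling page is readable. If it is not, the thread is suspended and waits.

4. Thread is blocked

Even if the thread's block condition has already been satisfied, it cannot return from the blocked state until the safepoint operation (for example a GC) has completed.

5. Thread is transitioning between states

The thread checks the safepoint state and suspends itself if it needs to block.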
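The dispatch-table switch described in case 1 can be sketched with a toy interpreter. Everything here is hypothetical (the `MiniInterpreter` type, its two opcodes, and the `safepoint_checks` counter are invented for illustration); only the idea of copying a safepoint-aware table over the active table comes from TemplateInterpreter::notice_safepoints().

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical two-instruction bytecode set.
enum Opcode : std::uint8_t { OP_INC = 0, OP_DEC = 1, OP_COUNT = 2 };

struct MiniInterpreter {
  using Handler = void (*)(MiniInterpreter&);
  using Table = std::array<Handler, OP_COUNT>;

  // Normal handlers: just execute the instruction.
  static void do_inc(MiniInterpreter& vm) { ++vm.acc; }
  static void do_dec(MiniInterpreter& vm) { --vm.acc; }
  // Safepoint-aware handlers: check first, then execute.
  static void safept_inc(MiniInterpreter& vm) { vm.check_safepoint(); do_inc(vm); }
  static void safept_dec(MiniInterpreter& vm) { vm.check_safepoint(); do_dec(vm); }

  // A real VM would block the thread here; this sketch only counts the checks.
  void check_safepoint() { ++safepoint_checks; }

  // Switch the active table to the safepoint-aware handlers.
  void notice_safepoints() {
    if (!notice) { notice = true; active_table = safept_table; }
  }
  // Switch back to the normal handlers.
  void ignore_safepoints() {
    if (notice) { notice = false; active_table = normal_table; }
  }

  // Dispatch every instruction through whichever table is currently active.
  void run(const std::vector<Opcode>& code) {
    for (Opcode op : code) active_table[op](*this);
  }

  long acc = 0;
  int safepoint_checks = 0;
  bool notice = false;
  Table normal_table{{&do_inc, &do_dec}};
  Table safept_table{{&safept_inc, &safept_dec}};
  Table active_table = normal_table;
};
```

The point of this design is that the interpreter pays nothing for safepoint checks in the common case: the check only exists in the handlers that are swapped in while a safepoint is pending.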

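Final implementation

When a thread accesses the protected memory address, a SIGSEGV signal is raised, which in turn invokes the JVM's signal handler to block the thread. As one description puts it: “The GC thread can protect some memory to which all threads in the process can write (using the mprotect system call) so they no longer can. Upon accessing this temporarily forbidden memory, a signal handler kicks in.”

Now let's see how this SIGSEGV signal is handled at the lower level. The implementation is in hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp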

// Check to see if we caught the safepoint code in the
// process of write protecting the memory serialization page.
// It write enables the page immediately after protecting it
// so we can just return to retry the write.
if ((sig == SIGSEGV) &&
    os::is_memory_serialize_page(thread, (address) info->si_addr)) {
  // Block current thread until the memory serialize page permission restored.
  os::block_on_serialize_page_trap();
  return true;
}

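os::block_on_serialize_page_trap() blocks and suspends the current thread until the page's permissions are restored.

How threads are resumed

Since there is a begin method, there is naturally a matching end method. SafepointSynchronize::end() eventually wakes every suspended, waiting thread. Roughly, it does the following:
1. Make the polling page readable again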

  if (PageArmed) {
    // Make polling safepoint aware
    os::make_polling_page_readable();
    PageArmed = 0 ;
  }

2. Switch the interpreter to ignore_safepoints. The implementation is:

// switch from the dispatch table which notices safepoints back to the
// normal dispatch table.  So that we can notice single stepping points,
// keep the safepoint dispatch table if we are single stepping in JVMTI.
// Note that the should_post_single_step test is exactly as fast as the
// JvmtiExport::_enabled test and covers both cases.
void TemplateInterpreter::ignore_safepoints() {
  if (_notice_safepoints) {
    if (!JvmtiExport::should_post_single_step()) {
      // switch to normal dispatch table
      _notice_safepoints = false;
      copy_table((address*)&_normal_table, (address*)&_active_table, sizeof(_active_table) / sizeof(address));
    }
  }
}

3. Wake up all suspended, waiting threads

// Start suspended threads
    for(JavaThread *current = Threads::first(); current; current = current->next()) {
      // A problem occurring on Solaris is when attempting to restart threads
      // the first #cpus - 1 go well, but then the VMThread is preempted when we get
      // to the next one (since it has been running the longest).  We then have
      // to wait for a cpu to become available before we can continue restarting
      // threads.
      // FIXME: This causes the performance of the VM to degrade when active and with
      // large numbers of threads.  Apparently this is due to the synchronous nature
      // of suspending threads.
      //
      // TODO-FIXME: the comments above are vestigial and no longer apply.
      // Furthermore, using solaris' schedctl in this particular context confers no benefit
      if (VMThreadHintNoPreempt) {
        os::hint_no_preempt();
      }
      ThreadSafepointState* cur_state = current->safepoint_state();
      assert(cur_state->type() != ThreadSafepointState::_running, "Thread not suspended at safepoint");
      cur_state->restart();
      assert(cur_state->is_running(), "safepoint state has not been reset");
    }
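The overall begin/end cycle can be modelled in a few dozen lines. This is a simplified sketch and not HotSpot's mechanism: it uses an atomic flag, a mutex, and a condition variable instead of a protected polling page and signal handler, and the names `SafepointSketch` and `run_safepoint_demo` are hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class SafepointSketch {
 public:
  explicit SafepointSketch(int mutators) : attached_(mutators) {}

  // Called by mutator threads at their poll locations.
  void poll() {
    if (!requested_.load(std::memory_order_acquire)) return;  // fast path
    std::unique_lock<std::mutex> lock(mu_);
    if (!requested_.load()) return;  // the safepoint already ended
    ++stopped_;
    cv_.notify_all();  // let the coordinator re-check its condition
    cv_.wait(lock, [this] { return !requested_.load(); });
    --stopped_;
  }

  // Called by a mutator when it finishes, so begin() stops waiting for it.
  void detach() {
    std::lock_guard<std::mutex> lock(mu_);
    --attached_;
    cv_.notify_all();
  }

  // Coordinator: request a safepoint, then wait until every attached
  // mutator is parked inside poll().
  void begin() {
    std::unique_lock<std::mutex> lock(mu_);
    requested_.store(true, std::memory_order_release);
    cv_.wait(lock, [this] { return stopped_ == attached_; });
  }

  // Coordinator: clear the request and release all parked mutators.
  void end() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      requested_.store(false, std::memory_order_release);
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<bool> requested_{false};
  int attached_;
  int stopped_ = 0;
};

// Runs `threads` workers that poll on every loop back-edge, stops the world
// once, and reports whether all progress was frozen during the pause.
bool run_safepoint_demo(int threads, long iterations) {
  SafepointSketch sp(threads);
  std::atomic<long> progress{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&sp, &progress, iterations] {
      for (long i = 0; i < iterations; ++i) {
        progress.fetch_add(1, std::memory_order_relaxed);
        sp.poll();
      }
      sp.detach();
    });
  }
  sp.begin();
  const long before = progress.load();
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  const long after = progress.load();
  sp.end();
  for (auto& w : workers) w.join();
  return before == after &&
         progress.load() == static_cast<long>(threads) * iterations;
}
```

If a worker removed its `sp.poll()` call while staying attached, `begin()` would wait until that worker finished its whole loop, which is the same effect as a thread that is slow to reach a safepoint.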

What impact does this have on JVM performance

Setting the JVM option -XX:+PrintGCApplicationStoppedTime prints how long the application was stopped, roughly as follows:

Total time for which application threads were stopped: 0.0051000 seconds  
Total time for which application threads were stopped: 0.0041930 seconds  
Total time for which application threads were stopped: 0.0051210 seconds  
Total time for which application threads were stopped: 0.0050940 seconds  
Total time for which application threads were stopped: 0.0058720 seconds  
Total time for which application threads were stopped: 5.1298200 seconds
Total time for which application threads were stopped: 0.0197290 seconds  
Total time for which application threads were stopped: 0.0087590 seconds

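The data above shows one pause that was exceptionally long, more than 5 seconds, which is certainly unacceptable in a production environment. What caused it?

A likely cause is that when the GC was requested, some thread took a long time to reach a safepoint and block. The threads that had already stopped kept waiting, and the VM thread also had to wait for every Java thread to be suspended before it could start the GC. In this situation, check whether the business code contains large bounded loops: during JIT optimization, safepoint checks may not have been inserted into such loops.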
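To spot outliers such as the 5-second pause in a longer log, the stopped-time lines can be filtered mechanically. The sketch below assumes the line format shown above; the function name `long_pauses` and the threshold parameter are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns every pause (in seconds) that exceeds threshold_seconds.
// Lines that do not contain the stopped-time message are ignored.
std::vector<double> long_pauses(const std::string& log, double threshold_seconds) {
  const std::string prefix =
      "Total time for which application threads were stopped: ";
  std::vector<double> result;
  std::istringstream lines(log);
  std::string line;
  while (std::getline(lines, line)) {
    const std::size_t pos = line.find(prefix);
    if (pos == std::string::npos) continue;
    // Parse the number that follows the prefix.
    std::istringstream rest(line.substr(pos + prefix.size()));
    double seconds = 0.0;
    if (rest >> seconds && seconds > threshold_seconds) {
      result.push_back(seconds);
    }
  }
  return result;
}
```

With a threshold of one second, only the 5.1298200-second entry from the sample output above would be reported.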