應用與系統穩定性第五篇---Watchdog原理和問題分析
前面已經這個系列已經更新了4篇,宕機重啟問題分析中,Watchdog問題最為常見,今天接著寫一寫Watchdog問題的分析套路以及工作原理。
ofollow,noindex">應用與系統穩定性第一篇---ANR問題分析的一般套路
應用與系統穩定性第四篇---單執行緒導致的空指標問題分析一、Watchdog基本認識
1、什麼是watchdog?
Watchdog又名看門狗,如果不按時給“喂狗”,超過一分鐘,就會咬人。Android系統中,服務有上百種,為了防止SystemServer的一些核心服務hang住而發生凍屏,引入了Watchdog機制,當出現故障時,Watchdog就會呼叫Process.killProcess(Process.myPid())殺死SystemServer程序system_server程序是zygote的大弟子,是zygote程序fork的第一個程序,zygote和system_server這兩個程序可以說是Java世界的半邊天,任何一個程序的死亡,都會導致Java世界的崩潰。所以如果子程序SystemServer掛了,Zygote就會自殺,這樣Zygote孵化的所有子程序都會重啟一遍,相當於手機被軟重啟了,使用者不會因為手機凍屏而不能使用。
上面說的是防止Watchdog問題,系統的處理策略,而我們程式員關注的是,具體是哪裡發生了Watchdog,和ANR類似,Watchdog發生過程中,需要dump trace,最終定位並解決問題。所以得研究一套機制能確定超時問題。
watchdog程式碼位於/frameworks/base/services/core/java/com/android/server/Watchdog.java
常見Log有下面兩種,一種是 Blocked in handler 、另外一種是: Blocked in monitor ,區別在下文分析。
11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui) 11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace: 11-15 06:56:39.696 24203 24902 W Watchdog:at android.os.MessageQueue.nativePollOnce(Native Method) 11-15 06:56:39.696 24203 24902 W Watchdog:at android.os.MessageQueue.next(MessageQueue.java:323) 11-15 06:56:39.696 24203 24902 W Watchdog:at android.os.Looper.loop(Looper.java:142) 11-15 06:56:39.696 24203 24902 W Watchdog:at com.android.server.SystemServer.run(SystemServer.java:377) 11-15 06:56:39.696 24203 24902 W Watchdog:at com.android.server.SystemServer.main(SystemServer.java:239) 11-15 06:56:39.696 24203 24902 W Watchdog:at java.lang.reflect.Method.invoke(Native Method) 11-15 06:56:39.696 24203 24902 W Watchdog:at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901) 11-15 06:56:39.696 24203 24902 W Watchdog:at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791) 11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace: ......
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg) 10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace: 10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method) 10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381) 10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353) 10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873) 10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99) 10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193) 10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65) 10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44) 10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
2、初始化

watchdog初始化.png
Watchdog本身繼承Thread,初始化是在SystemServer啟動過程中
public final class SystemServer { ... ... /** * Starts a miscellaneous grab bag of stuff that has yet to be refactored * and organized. */ private void startOtherServices() { ...... try { ...... traceBeginAndSlog("InitWatchdog"); final Watchdog watchdog = Watchdog.getInstance(); // 獲取Watchdog物件初始化 watchdog.init(context, mActivityManagerService); // 註冊receiver以接收系統重啟廣播 Trace.traceEnd(Trace.TRACE_TAG_SYSTEM_SERVER); ...... } ...... mActivityManagerService.systemReady(new Runnable() { @Override public void run() { ...... Watchdog.getInstance().start(); ...... } }); }
241public static Watchdog getInstance() { 242if (sWatchdog == null) { 243sWatchdog = new Watchdog(); 244} 245 246return sWatchdog; 247}
為了搞一套超時判斷的方案,在Watchdog在建構函式中,會構建很多HandlerChecker,可以分為兩類:
- Monitor Checker,用於檢查是Monitor物件可能發生的死鎖, AMS, PKMS, WMS等核心的系統服務都是Monitor物件。
- Looper Checker,用於檢查執行緒的訊息佇列是否長時間處於工作狀態。Watchdog自身的訊息佇列,ui, Io, display這些全域性的訊息佇列都是被檢查的物件。此外,一些重要的執行緒的訊息佇列,也會加入到Looper Checker中,譬如AMS, PKMS,這些是在對應的物件初始化時加入的。
/* This handler will be used to post message back onto the main thread */ 107final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>(); 249private Watchdog() { //實質呼叫的是父類Thread的構造方法,設定執行緒名稱 250super("watchdog"); 251// Initialize handler checkers for each common thread we want to check.Note 252// that we are not currently checking the background thread, since it can 253// potentially hold longer running operations with no guarantees about the timeliness 254// of operations there. 255 256// The shared foreground thread is the main checker.It is where we 257// will also dispatch monitor checks and do other work. 258mMonitorChecker = new HandlerChecker(FgThread.getHandler(), 259"foreground thread", DEFAULT_TIMEOUT); 260mHandlerCheckers.add(mMonitorChecker); 261// Add checker for main thread.We only do a quick check since there 262// can be UI running on the thread. 263mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()), 264"main thread", DEFAULT_TIMEOUT)); 265// Add checker for shared UI thread. 266mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(), 267"ui thread", DEFAULT_TIMEOUT)); 268// And also check IO thread. 269mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(), 270"i/o thread", DEFAULT_TIMEOUT)); 271// And the display thread. 272mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(), 273"display thread", DEFAULT_TIMEOUT)); 274 275// Initialize monitor for Binder threads. 276addMonitor(new BinderThreadMonitor()); 277//O上新增對FD洩露的監控 278mOpenFdMonitor = OpenFdMonitor.create(); ...... 283}
其中DEFAULT_TIMEOUT一般是一分鐘,對於installd是10分鐘。
兩類HandlerChecker的側重點不同,
Monitor Checker預警我們不能長時間持有核心繫統服務的物件鎖,否則會阻塞很多函式的執行;
Looper Checker預警我們不能長時間的霸佔訊息佇列,否則其他訊息將得不到處理。
所以Watchdog就靠這兩個Checker來搞搞事情了。
3、基本原理
3.1如何新增Checker物件
拿AMS舉例,是既添加了Monitor Checker物件,也添加了Looper Checker物件,也實現了Watchdog.Monitor介面,重寫了monitor方法。
public class ActivityManagerService extends IActivityManager.Stub implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback { ...... public ActivityManagerService(Context systemContext) { ...... Watchdog.getInstance().addMonitor(this); Watchdog.getInstance().addThread(mHandler); ...... } ...... /** In this method we try to acquire our lock to make sure that we have not deadlocked */ public void monitor() { synchronized (this) { } } ...... }
在AMS構造的時候,會呼叫Watchdog的addMonitor和addThread把自己和MainHandler的物件mHander加進去
323public void addThread(Handler thread) { 324addThread(thread, DEFAULT_TIMEOUT); 325} 326 327public void addThread(Handler thread, long timeoutMillis) { 328synchronized (this) { 329if (isAlive()) { 330throw new RuntimeException("Threads can't be added once the Watchdog is running"); 331} 332final String name = thread.getLooper().getThread().getName(); 333mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis)); 334} 335} 336 314public void addMonitor(Monitor monitor) { 315synchronized (this) { 316if (isAlive()) { 317throw new RuntimeException("Monitors can't be added once the Watchdog is running"); 318} 319mMonitorChecker.addMonitor(monitor); 320} 321}
mMonitorChecker是HandlerChecker 物件,實質上是HandlerChecker的addMonitor方法,而mHandlerCheckers是ArrayList物件,就可以直接add。
120public final class HandlerChecker implements Runnable { 121private final Handler mHandler; 122private final String mName; 123private final long mWaitMax; 124private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>(); 125private boolean mCompleted; 126private Monitor mCurrentMonitor; 127private long mStartTime; 128 129HandlerChecker(Handler handler, String name, long waitMaxMillis) { 130mHandler = handler; 131mName = name; 132mWaitMax = waitMaxMillis; 133mCompleted = true; 134} 135 136public void addMonitor(Monitor monitor) { 137mMonitors.add(monitor); 138} ...... }
3.2、核心原理
在新增Checker之後,該如何使用這些Checker呢?因為Watchdog繼承Thread,直接看run方法。
398@Override 399public void run() { 400boolean waitedHalf = false; 401while (true) { 402final ArrayList<HandlerChecker> blockedCheckers; 403final String subject; 404final boolean allowRestart; //是否是在除錯狀態 405int debuggerWasConnected = 0; 406synchronized (this) { //CHECK_INTERVAL時長是DEFAULT_TIMEOUT的一半,一般是30s 407long timeout = CHECK_INTERVAL; 408//1、處理所有的HandlerChecker 410for (int i=0; i<mHandlerCheckers.size(); i++) { 411HandlerChecker hc = mHandlerCheckers.get(i); 412hc.scheduleCheckLocked(); 413} ..... // 2. 開始定期檢查 423long start = SystemClock.uptimeMillis(); 424while (timeout > 0) { 425if (Debug.isDebuggerConnected()) { 426debuggerWasConnected = 2; 427} 428try { 429wait(timeout); 430} catch (InterruptedException e) { 431Log.wtf(TAG, e); 432} 433if (Debug.isDebuggerConnected()) { 434debuggerWasConnected = 2; 435} 436timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start); 437} 438 // 3. 獲取狀態,狀態有如下三種, 439final int waitState = evaluateCheckerCompletionLocked(); 440if (waitState == COMPLETED) { 441// The monitors have returned; reset 442waitedHalf = false; 443continue; 444} else if (waitState == WAITING) { 445// still waiting but within their configured intervals; back off and recheck 446continue; 447} else if (waitState == WAITED_HALF) { 448if (!waitedHalf) { 449//超時一半的時候,開始dumpStackTraces 451ArrayList<Integer> pids = new ArrayList<Integer>(); 452pids.add(Process.myPid()); 453ActivityManagerService.dumpStackTraces(true, pids, null, null, 454getInterestingNativePids()); 455waitedHalf = true; 456} 457continue; 458} 459 460// 走到這裡,說明存在超時的HandlerChecker 461blockedCheckers = getBlockedCheckersLocked(); 462subject = describeCheckersLocked(blockedCheckers); 463allowRestart = mAllowRestart; 464} 465 466// If we got here, that means that the system is most likely hung. 467// First collect stack traces from all threads of the system process. 468// Then kill this process so that the system will restart. //eventlog打印發生了watchdog 469EventLog.writeEvent(EventLogTags.WATCHDOG, subject); 470 // 471ArrayList<Integer> pids = new ArrayList<>(); 472pids.add(Process.myPid()); 473if (mPhonePid > 0) pids.add(mPhonePid); 474// Pass !waitedHalf so that just in case we somehow wind up here without having 475//開始dumpStackTraces,包含pids中的程序和getInterestingNativePids中的程序 476final File stack = ActivityManagerService.dumpStackTraces( 477!waitedHalf, pids, null, null, getInterestingNativePids()); 478 479// Give some extra time to make sure the stack traces get written. 480// The system's been hanging for a minute, another second or two won't hurt much. 481SystemClock.sleep(2000); 482 483// Pull our own kernel thread stacks as well if we're configured for that //開始dumpKernelStackTraces 484if (RECORD_KERNEL_THREADS) { 485dumpKernelStackTraces(); 486} 487 488// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log 489doSysRq('w'); 490doSysRq('l'); 491 492// Try to add the error to the dropbox, but assuming that the ActivityManager 493// itself may be deadlocked.(which has happened, causing this statement to 494// deadlock and the watchdog as a whole to be ineffective) 495Thread dropboxThread = new Thread("watchdogWriteToDropbox") { 496public void run() { //將Error加入到DropBox檔案中 497mActivity.addErrorToDropBox( 498"watchdog", null, "system_server", null, null, 499subject, null, stack, null); 500} 501}; 502dropboxThread.start(); ...... 525 526// Only kill the process if the debugger is not attached. 527if (Debug.isDebuggerConnected()) { 528debuggerWasConnected = 2; 529} 530if (debuggerWasConnected >= 2) { 531Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process"); 532} else if (debuggerWasConnected > 0) { 533Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process"); 534} else if (!allowRestart) { 535Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process"); 536} else { 537Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject); 538for (int i=0; i<blockedCheckers.size(); i++) { 539Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:"); 540StackTraceElement[] stackTrace 541= blockedCheckers.get(i).getThread().getStackTrace(); 542for (StackTraceElement element: stackTrace) { 543Slog.w(TAG, "at " + element); 544} 545} 546Slog.w(TAG, "*** GOODBYE!"); //最終殺死System程序 547Process.killProcess(Process.myPid()); 548System.exit(10); 549} 550 551waitedHalf = false; 552} 553}
原理總結:
- 1、系統中所有需要監控的服務都呼叫Watchdog的addMonitor新增Monitor Checker到mMonitors這個List中或者addThread方法新增Looper Checker到mHandlerCheckers這個List中。
- 2、當Watchdog執行緒啟動後,便開始無限迴圈,它的run方法就開始執行
- 第一步呼叫HandlerChecker#scheduleCheckLocked處理所有的mHandlerCheckers
- 第二步定期檢查是否超時,每一次檢查的間隔時間由CHECK_INTERVAL常量設定,為30秒,每一次檢查都會呼叫evaluateCheckerCompletionLocked()方法來評估一下HandlerChecker的完成狀態:
COMPLETED表示已經完成
WAITING和WAITED_HALF表示還在等待,但未超時,WAITED_HALF時候會dump一次trace.
OVERDUE表示已經超時。預設情況下,timeout是1分鐘。
- 3、如果超時時間到了,還有HandlerChecker處於未完成的狀態(OVERDUE),則通過getBlockedCheckersLocked()方法,獲取阻塞的HandlerChecker,生成一些描述資訊,儲存日誌,包括一些執行時的堆疊資訊。
-
4、最後殺死SystemServer程序
Watchdog原理.png
上面就是大概的原理總結,還需要看幾個細節問題
3.2.1、HandlerChecker#scheduleCheckLocked的處理?
127public void scheduleCheckLocked() { //mMonitors.size為0或者,訊息佇列處於空閒,說明沒有阻塞,設定mCompleted = true後直接返回 128if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) { 129// If the target looper has recently been polling, then 130// there is no reason to enqueue our checker on it since that 131// is as good as it not being deadlocked.This avoid having 132// to do a context switch to check the thread.Note that we 133// only do this if mCheckReboot is false and we have no 134// monitors, since those would need to be executed at this point. 135mCompleted = true; 136return; 137} ...... 144mCompleted = false; 145mCurrentMonitor = null; 146mStartTime = SystemClock.uptimeMillis(); //post一個訊息到當前mHandler所在訊息佇列的最前面 147mHandler.postAtFrontOfQueue(this); 148}
如果上面訊息能夠執行,下面的run方法就會走進去,嘗試呼叫monitor申請鎖。
public final class HandlerChecker implements Runnable { ....... @Override public void run() { final int size = mMonitors.size(); for (int i = 0 ; i < size ; i++) { synchronized (Watchdog.this) { mCurrentMonitor = mMonitors.get(i); } mCurrentMonitor.monitor(); } synchronized (Watchdog.this) { mCompleted = true; mCurrentMonitor = null; } } }
對於Looper Checker而言,會判斷執行緒的訊息佇列是否處於空閒狀態。 如果被監測的訊息佇列一直閒不下來,則說明可能已經阻塞等待了很長時間
如果scheduleCheckLocked中post的訊息 能夠 被執行到,對於Monitor Checker而言,會呼叫實現類的monitor方法,上文中提到的AMS.monitor()方法, 方法實現一般很簡單,就是獲取當前類的物件鎖,如果當前物件鎖已經被持有,則monitor()會一直處於wait狀態,直到超時。
如果scheduleCheckLocked中post的訊息 不能夠 被執行到,那麼說明訊息佇列中前一個訊息一直在執行,沒有執行完成,也會超時。不得不佩服這種巧妙的設計啊,postAtFrontOfQueue可謂是一箭雙鵰,既檢測了是否鎖有耗時,也檢查了訊息佇列中某個Message是否耗時。
二、案例分析
對於Watchdog問題分析,首先需要確定trace是否有效,通過前面的分析,Watchdog在30s和1分鐘的時候都會dump一次trace,比如看到下面的trace。
09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on ActivityManager (ActivityManager) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: ActivityManager stack trace: 09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:325) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.Looper.loop(Looper.java:148) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46) 09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** GOODBYE!
然後我們看ActivityManager的trace.
"ActivityManager" prio=5 tid=12 Blocked group="main" sCount=1 dsCount=0 flags=1 obj=0x13180c38 self=0x73bb923600 sysTid=1579 nice=-2 cgrp=default sched=0/0 handle=0x73adbcf4f0 state=S schedstat=( 3039883125048 14149853235996 6778200 ) utm=112965 stm=191023 core=6 HZ=100 stack=0x73adacd000-0x73adacf000 stackSize=1037KB held mutexes= at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3486) waiting to lock <0x0748826a> (a com.android.server.am.ActivityManagerService) held by thread 10 at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2032) at android.os.Handler.dispatchMessage(Handler.java:106) at android.os.Looper.loop(Looper.java:173) at android.os.HandlerThread.run(HandlerThread.java:65) at com.android.server.ServiceThread.run(ServiceThread.java:46)
因為ActivityManager被10號執行緒blocked,繼續看10號執行緒的trace.
"Binder:1540_1C" prio=5 tid=10 Native group="main" sCount=1 dsCount=0 flags=1 obj=0x1318deb8 self=0x73c0817600 sysTid=8946 nice=-4 cgrp=default sched=0/0 handle=0x739db674f0 state=S schedstat=( 2025031009459 6852098325718 5020435 ) utm=136019 stm=66484 core=1 HZ=100 stack=0x739da6d000-0x739da6f000 stackSize=1005KB held mutexes= kernel: __switch_to+0x9c/0xd0 kernel: futex_wait_queue_me+0xc4/0x13c kernel: futex_wait+0xe4/0x204 kernel: do_futex+0x170/0x500 kernel: SyS_futex+0x90/0x1b0 kernel: __sys_trace+0x4c/0x4c native: #00 pc 000000000001db2c /system/lib64/libc.so (syscall+28) native: #01 pc 00000000000e74c8 /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152) native: #02 pc 00000000005227a8 /system/lib64/libart.so (art::GoToRunnable(art::Thread*)+440) native: #03 pc 00000000005225a8 /system/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28) native: #04 pc 0000000000cb8fc0 /system/framework/arm64/boot-framework.oat (Java_android_os_Process_setThreadPriority__II+176) at android.os.Process.setThreadPriority(Native method) at com.android.server.ThreadPriorityBooster.boost(ThreadPriorityBooster.java:49) at com.android.server.wm.WindowManagerThreadPriorityBooster.boost(WindowManagerThreadPriorityBooster.java:58) at com.android.server.wm.WindowManagerService.boostPriorityForLockedSection(WindowManagerService.java:930) at com.android.server.wm.WindowManagerService.containsDismissKeyguardWindow(WindowManagerService.java:3116) locked <0x0b54880e> (a com.android.server.wm.WindowHashMap) at com.android.server.am.ActivityRecord.hasDismissKeyguardWindows(ActivityRecord.java:1364) at com.android.server.am.ActivityStack.checkKeyguardVisibility(ActivityStack.java:2070) at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1924) at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3626) at com.android.server.am.ActivityStackSupervisor.attachApplicationLocked(ActivityStackSupervisor.java:1043) at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:7471) at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:7538) locked <0x0748826a> (a com.android.server.am.ActivityManagerService) at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:292) at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3026) at android.os.Binder.execTransact(Binder.java:704)
難道是setThreadPriority超時??但是缺乏1分鐘的trace,我們不能斷定是這個地方卡住。在 dumptraces 的時候對於處於 Suspended 狀態的執行緒,會修改執行緒的 suspend_count_,使其+1,然後將其新增到suspended_count_modified_threads 的列表中,然後對於 suspended_count_modified_threads 中的執行緒一起 dumptraces ,對於 dump 完成的 thread 會進行 suspend_count_ - 1 的操作。Suspended 執行緒想要由 jni 回到 java 程式碼(Runnable 狀態)在 GoToRunnable 時會檢查 suspend_count_,如果不為0就在這裡等待,直到其變為0。所以這裡只能說明 dumptraces 的時候 tid=10 在執行 setThreadPriority 的 native method,如果要確定是否卡在了這裡還需要對比兩次 traces才能確定。
2.1、案例一
有的手機在Monkey測試過程中發生Watchdog不會重啟,現象可能是凍屏,檢視traces_SystemServer_WDT05_1月_23_50_59.974.txt,發現所有執行緒都被73號執行緒blocked,而且兩次trace完全一致

WDT.png
來看看73號執行緒在幹嘛
"Binder:1300_3" prio=5 tid=73 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x14f89110 self=0x7ee794d600 | sysTid=1774 nice=-10 cgrp=default sched=0/0 handle=0x7ecbc474f0 | state=S schedstat=( 59882636556 104471794509 273786 ) utm=3455 stm=2533 core=6 HZ=100 | stack=0x7ecbb4d000-0x7ecbb4f000 stackSize=1005KB | held mutexes= kernel: __switch_to+0x94/0xa8 kernel: binder_thread_read+0x460/0x10a0 kernel: binder_ioctl_write_read+0x21c/0x360 kernel: binder_ioctl+0x50c/0x798 kernel: do_vfs_ioctl+0xb8/0x800 kernel: SyS_ioctl+0x84/0x98 kernel: el0_svc_naked+0x24/0x28 native: #00 pc 00000000000690a4/system/lib64/libc.so (__ioctl+4) native: #01 pc 0000000000024638/system/lib64/libc.so (ioctl+132) native: #02 pc 0000000000061a10/system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+256) native: #03 pc 00000000000627a8/system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+340) native: #04 pc 00000000000624c8/system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+216) native: #05 pc 0000000000056d98/system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+72) native: #06 pc 000000000008a86c/system/lib64/libgui.so (???) native: #07 pc 000000000009ec88/system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260) native: #08 pc 00000000000fb058/system/lib64/libandroid_runtime.so (???) native: #09 pc 0000000001326998/system/framework/arm64/boot-framework.oat (Java_android_view_SurfaceControl_nativeScreenshot__Landroid_os_IBinder_2Landroid_graphics_Rect_2IIIIZZI+264) at android.view.SurfaceControl.nativeScreenshot(Native method) at android.view.SurfaceControl.screenshot(SurfaceControl.java:877) at com.android.server.wm.DisplayContent.-com_android_server_wm_DisplayContent-mthref-0(DisplayContent.java:2863) at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.$m$0(unavailable:-1) at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.screenshot(unavailable:-1) at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:3125) - locked <0x036ec628> (a com.android.server.wm.WindowHashMap) at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:2862) at com.android.server.wm.AppWindowContainerController.screenshotApplications(AppWindowContainerController.java:749) at com.android.server.am.ActivityRecord.screenshotActivityLocked(ActivityRecord.java:1650) at com.android.server.am.ActivityRecord.setVisible(ActivityRecord.java:1675) at com.android.server.am.ActivityStack.makeInvisible(ActivityStack.java:2078) at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1896) at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3575) at com.android.server.am.ActivityManagerService.ensureConfigAndVisibilityAfterUpdate(ActivityManagerService.java:20965) at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20897) at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20867) at com.android.server.am.ActivityStack.resumeTopActivityInnerLocked(ActivityStack.java:2608) at com.android.server.am.ActivityStack.resumeTopActivityUncheckedLocked(ActivityStack.java:2246) at com.android.server.am.ActivityStackSupervisor.resumeFocusedStackTopActivityLocked(ActivityStackSupervisor.java:2148) at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1480) at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1406) at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:7542) - locked <0x08abeada> (a com.android.server.am.ActivityManagerService) at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:317) at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3018) at android.os.Binder.execTransact(Binder.java:677)
最後是停在下面兩行
native: #06 pc 000000000008a86c/system/lib64/libgui.so (???) native: #07 pc 000000000009ec88/system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
使用addr2line -Cfe ./system/lib64/libgui.so 000000000009ec88
_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj frameworks/native/libs/gui/SurfaceComposer/">ComposerClient.cpp:1018 (discriminator 1)
1003 status_t ScreenshotClient::captureToBuffer(const sp<IBinder>& display, 1004Rect sourceCrop, uint32_t reqWidth, uint32_t reqHeight, 1005int32_t minLayerZ, int32_t maxLayerZ, bool useIdentityTransform, 1006uint32_t rotation, 1007sp<GraphicBuffer>* outBuffer) { 1008sp<ISurfaceComposer> s(ComposerService::getComposerService()); 1009if (s == NULL) return NO_INIT; 1010 1011sp<IGraphicBufferConsumer> gbpConsumer; 1012sp<IGraphicBufferProducer> producer; 1013BufferQueue::createBufferQueue(&producer, &gbpConsumer); 1014sp<BufferItemConsumer> consumer(new BufferItemConsumer(gbpConsumer, 1015GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER, 10161, true)); 1017 1018status_t ret = s->captureScreen(display, producer, sourceCrop, reqWidth, reqHeight, 1019minLayerZ, maxLayerZ, useIdentityTransform, 1020static_cast<ISurfaceComposer::Rotation>(rotation)); 1021if (ret != NO_ERROR) { 1022return ret; 1023} 1024BufferItem b; 1025consumer->acquireBuffer(&b, 0, true); 1026*outBuffer = b.mGraphicBuffer; 1027return ret; 1028}
1018行captureScreen函式是在做截圖,看來是截圖時候發生了Watchdog,根據captureScreen,那麼對應的surfaceflinger的trace如下:
"Binder:820_3" sysTid=1331 #00 pc 0000000000068fb8/system/lib64/libc.so (__epoll_pwait+8) #01 pc 000000000001fc68/system/lib64/libc.so (epoll_pwait+48) #02 pc 0000000000015c84/system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144) #03 pc 0000000000015b6c/system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108) #04 pc 00000000000b921c/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger13captureScreenERKNS_2spINS_7IBinderEEERKNS1_INS_22IGraphicBufferProducerEEENS_4RectEjjiibNS_16ISurfaceComposer8RotationE+672) #05 pc 0000000000088660/system/lib64/libgui.so (_ZN7android17BnSurfaceComposer10onTransactEjRKNS_6ParcelEPS1_j+1788) #06 pc 00000000000b8828/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger10onTransactEjRKNS_6ParcelEPS1_j+144) #07 pc 00000000000559ac/system/lib64/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+136) #08 pc 0000000000061ecc/system/lib64/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+536) #09 pc 0000000000061c04/system/lib64/libbinder.so (_ZN7android14IPCThreadState20getAndExecuteCommandEv+156) #10 pc 0000000000062250/system/lib64/libbinder.so (_ZN7android14IPCThreadState14joinThreadPoolEb+60) #11 pc 0000000000082bcc/system/lib64/libbinder.so (_ZN7android10PoolThread10threadLoopEv+24) #12 pc 0000000000011674/system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+280) #13 pc 0000000000066970/system/lib64/libc.so (_ZL15__pthread_startPv+36) #14 pc 000000000001f474/system/lib64/libc.so (__start_thread+68)
surfaceflinger的主執行緒trace如下
----- pid 820 at 2018-01-05 23:49:16 ----- Cmd line: /system/bin/surfaceflinger ABI: 'arm64' "surfaceflinger" sysTid=820 #00 pc 00000000000690a4/system/lib64/libc.so (__ioctl+4) #01 pc 0000000000024638/system/lib64/libc.so (ioctl+132) #02 pc 0000000000015210/system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState14talkWithDriverEb+256) #03 pc 0000000000015f58/system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState15waitForResponseEPNS0_6ParcelEPi+60) #04 pc 0000000000015d84/system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState8transactEijRKNS0_6ParcelEPS2_j+216) #05 pc 00000000000128d4/system/lib64/libhwbinder.so (_ZN7android8hardware10BpHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+72) #06 pc 0000000000038bc4/system/lib64/[email protected] (_ZN7android8hardware8graphics8composer4V2_118BpHwComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+240) #07 pc 0000000000091cc0/system/lib64/libsurfaceflinger.so (_ZN7android4Hwc28Composer11createLayerEmPm+100) #08 pc 000000000009a930/system/lib64/libsurfaceflinger.so (_ZN4HWC27Display11createLayerEPNSt3__110shared_ptrINS_5LayerEEE+72) #09 pc 00000000000c5304/system/lib64/libsurfaceflinger.so (_ZN7android10HWComposer11createLayerEi+152) #10 pc 00000000000b1524/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger15setUpHWComposerEv+1560) #11 pc 00000000000b096c/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger20handleMessageRefreshEv+108) #12 pc 00000000000aa660/system/lib64/libsurfaceflinger.so (_ZN7android16ExSurfaceFlinger20handleMessageRefreshEv+16) #13 pc 00000000000b03e4/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger17onMessageReceivedEi+260) #14 pc 0000000000015d40/system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+332) #15 pc 0000000000015b6c/system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108) #16 pc 000000000008b944/system/lib64/libsurfaceflinger.so (_ZN7android12MessageQueue11waitMessageEv+84) #17 pc 00000000000af338/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger3runEv+20) #18 pc 0000000000002cfc/system/bin/surfaceflinger (main+948) #19 pc 000000000001b8b0/system/lib64/libc.so (__libc_init+88) #20 pc 00000000000028a8/system/bin/surfaceflinger (do_arm64_start+80)
看到主執行緒正在createLayer,又通過binder從surfaceflinger程序call到了/vendor/bin/hw/ [email protected] ,我們在去看看graphics.composer的對應執行緒的trace.
"HwBinder:738_1" sysTid=1273 #00 pc 000000000001dc2c/system/lib64/libc.so (syscall+28) #01 pc 0000000000066014/system/lib64/libc.so (pthread_cond_wait+96) #02 pc 000000000001fc3c/vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120) #03 pc 00000000000140e0/vendor/lib64/hw/[email protected] (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84) #04 pc 0000000000044840/system/lib64/[email protected] (_ZN7android8hardware8graphics8composer4V2_116BsComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+160) #05 pc 000000000003fd10/system/lib64/[email protected] (_ZN7android8hardware8graphics8composer4V2_118BnHwComposerClient10onTransactEjRKNS0_6ParcelEPS5_jNSt3__18functionIFvRS5_EEE+2224) #06 pc 0000000000011be0/system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware9BHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+132) #07 pc 00000000000156fc/system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14executeCommandEi+584) #08 pc 0000000000015404/system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState20getAndExecuteCommandEv+156) #09 pc 0000000000015b0c/system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14joinThreadPoolEb+60) #10 pc 000000000001f5c8/system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware10PoolThread10threadLoopEv+24) #11 pc 0000000000011674/system/lib64/vndk-sp/libutils.so (_ZN7android6Thread11_threadLoopEPv+280) #12 pc 0000000000066970/system/lib64/libc.so (_ZL15__pthread_startPv+36) #13 pc 000000000001f474/system/lib64/libc.so (__start_thread+68)
終於找到了最終blocked的地方
#02 pc 000000000001fc3c/vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120) #03 pc 00000000000140e0/vendor/lib64/hw/[email protected] (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
再次使用addr2line
addr2line -f -e hwcomposer.sdm845.so 1fc3c
_ZN3sdm6Locker4WaitEv
hardware/qcom/display/include/../sdm/include/utils/locker.h:141
addr2line -f -e [email protected] 140e0
ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3_18functionIFvNS3_5ErrorEmEEE
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp:299
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp#299 295Return<void> ComposerClient::createLayer(Display display, 296uint32_t bufferSlotCount, createLayer_cb hidl_cb) 297{ 298Layer layer = 0; 299Error err = mHal.createLayer(display, &layer); 300if (err == Error::NONE) { 301std::lock_guard<std::mutex> lock(mDisplayDataMutex); 302 303auto dpy = mDisplayData.find(display); 304if (dpy != mDisplayData.end()) { 305auto ly = dpy->second.Layers.emplace(layer, LayerBuffers()).first; 306ly->second.Buffers.resize(bufferSlotCount); 307} else { 308layer = 0; 309err = Error::BAD_DISPLAY; 310} 311} 312 313hidl_cb(err, layer); 314return Void(); 315}
看樣子是createLayer出了問題,最後將問題轉給底層顯示模組的同學繼續分析。最後關於Watchdog還是有一些問題可以思考的,比如Watchdog各個版本有哪些變化,Watchdog執行緒被blocked了怎麼辦?而且Watchdog問題紛繁複雜,各個模組的業務都不一樣,由於篇幅原因,讀者自己調查。