An Analysis of the Android Watchdog Mechanism

阿新 • Published: 2019-02-03

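As we know, when an application stops responding for longer than a certain time, the system shows an "Application Not Responding" (ANR) dialog so that the app does not stay unusable indefinitely. The user can choose to force-close it, which kills the process.

The ANR mechanism targets applications. For the system process, Android provides the Watchdog mechanism instead: if the system process stays unresponsive beyond the timeout, the Watchdog makes it kill itself.

When analyzing logs from crash-and-reboot problems, we often see this line: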

*** WATCHDOG KILLING SYSTEM PROCESS:

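Literally: the watchdog is killing the system process. This is the Watchdog mechanism analyzed in this article. It registers two kinds of checkers and then checks every 30s for deadlocks and blocked message queues; if a problem persists past the timeout, it kills the system process. The two kinds of checkers are:

MonitorChecker: checks whether a core system service has held its lock for too long.
HandlerChecker: checks whether the message queue managed by the Looper of a core system thread is blocked. In fact, MonitorChecker is itself a HandlerChecker whose thread is FgThread.

For ease of description, HandlerChecker in this article does not include the MonitorChecker running on FgThread.

We will analyze the Watchdog mechanism from the following angles:

Initialization of Watchdog, MonitorChecker, and HandlerChecker.
How the Watchdog mechanism works.
How to analyze Watchdog crash-and-reboot problems.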

Starting Watchdog

Watchdog is a thread (it extends Thread). SystemServer.java obtains the Watchdog object through getInstance.

@SystemServer.java

            final Watchdog watchdog = Watchdog.getInstance();
            watchdog.init(context, mActivityManagerService);

The init method registers a broadcast receiver for ACTION_REBOOT.

    public void init(Context context, ActivityManagerService activity) {

        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

Initializing HandlerChecker

While Watchdog is being initialized, it creates its HandlerCheckers. The constructor is:

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

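The parameters are:

handler: the Handler to watch.
name: a name for the thread behind the Handler, so that the thread can be identified in the log after a failure.
waitMaxMillis: the longest the message queue may stay blocked. Beyond this the system process is killed. The default is 60s.

The Watchdog constructor creates HandlerCheckers named foreground thread, main thread, ui thread, i/o thread, and display thread (the FgThread one is the special case that doubles as the MonitorChecker). The default DEFAULT_TIMEOUT is 60s, meaning the message queue of each of these threads' Loopers must not stay blocked for more than 60s.

The code: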

    private Watchdog() {
        super("watchdog");

        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }

Initializing MonitorChecker

As the code above shows, Watchdog adds a BinderThreadMonitor during initialization.

    /** Blocks until a Binder thread is free to handle incoming IPCs. */
    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

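Adding checkers externally

Besides the fixed checkers that Watchdog adds itself, it provides two methods, addMonitor and addThread, for other code to add monitors to the MonitorChecker and to add HandlerCheckers. The code: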

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }
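The rule enforced by both methods (registration is only allowed before the thread starts) can be tried out in plain Java. The following is a minimal sketch, not the Android implementation: the class RegistrationSketch, its use of Runnable as a stand-in for Watchdog.Monitor, and the monitorCount helper are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RegistrationSketch extends Thread {
    // Runnable stands in for Watchdog.Monitor (assumption for this sketch).
    private final List<Runnable> monitors = new ArrayList<>();

    RegistrationSketch() {
        super("watchdog-sketch");
        setDaemon(true);
    }

    /** Registers a monitor; rejected once the thread has been started. */
    void addMonitor(Runnable monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            monitors.add(monitor);
        }
    }

    synchronized int monitorCount() {
        return monitors.size();
    }

    @Override
    public void run() {
        // Stand-in for the real check loop: just stay alive for a while.
        try {
            Thread.sleep(1_000);
        } catch (InterruptedException ignored) {
            // Interrupted by the demo; fall through and exit.
        }
    }

    public static void main(String[] args) {
        RegistrationSketch watchdog = new RegistrationSketch();
        watchdog.addMonitor(() -> { });      // accepted: not started yet
        watchdog.start();
        try {
            watchdog.addMonitor(() -> { });  // rejected: the thread is alive
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        watchdog.interrupt();                // let the stand-in loop end
    }
}
```

In practice this means every service has to register during system startup, before SystemServer starts the Watchdog thread.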

For example, ActivityManagerService adds both a monitor and a handler (the one-argument addThread call uses the default timeout):

public class ActivityManagerService extends IActivityManager.Stub implements Watchdog.Monitor
{
        // ...
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);

}

How the Watchdog mechanism works

Once the core system services are running, SystemServer.java calls Watchdog.getInstance().start(), which starts the run method of the Watchdog thread. The code:

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // [a] 
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                // [b] wait 30s 
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout); // wait time 
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // [c] evaluate checker state
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        // [d] 
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }
                // [e]
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            // [f] print all stack info
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            // [g] report to controller
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }
            // [h] kill the process
            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

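Let us go through this function step by step. The Watchdog thread is an infinite loop, so it keeps running for as long as the process lives. The markers [a]-[h] added to the code above correspond to steps a-h below.

a. First, iterate over all HandlerCheckers and call scheduleCheckLocked on each to start a check. Code fragment: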

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
               // ...
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

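HandlerChecker: for a HandlerChecker whose name is not foreground thread, mMonitors.size() is 0. If mHandler.getLooper().getQueue().isPolling() returns true, the looper is idle and waiting for messages, so the queue is healthy and the check completes immediately. Otherwise the thread is busy with a message, and the checker posts itself to the front of the queue with mHandler.postAtFrontOfQueue(this). The post itself does not block; if the thread is stuck, the checker's run method never gets to execute and mCompleted stays false.
MonitorChecker: the MonitorChecker has monitors, so it always posts itself with mHandler.postAtFrontOfQueue(this). Because the message goes to the front of the queue, run executes as soon as FgThread takes its next message. Code fragment: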

        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
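The message-queue check can be imitated without Android classes. The sketch below is only an analogy under stated assumptions: a single-threaded ExecutorService plays the role of the Looper thread, the class QueueCheckSketch and its methods are hypothetical, and the executor appends tasks to the back of its queue, whereas the original uses postAtFrontOfQueue. For a thread that is stuck the difference does not matter, because nothing in the queue gets to run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueCheckSketch {
    private final ExecutorService looper;        // stand-in for a Handler/Looper thread
    private volatile boolean completed = true;   // same role as mCompleted

    QueueCheckSketch(ExecutorService looper) {
        this.looper = looper;
    }

    /** Posts a check unless one is already in flight. */
    void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        looper.execute(() -> completed = true);  // runs only if the thread is free
    }

    boolean isCompleted() {
        return completed;
    }

    /** Returns {completed while the thread was stuck, completed after it was released}. */
    static boolean[] runDemo() {
        ExecutorService looper = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Occupy the only thread, like a handler stuck inside one message.
        looper.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // Nothing to clean up in this demo.
            }
        });
        QueueCheckSketch checker = new QueueCheckSketch(looper);
        checker.scheduleCheck();
        boolean whileBlocked = checker.isCompleted();
        release.countDown();
        looper.shutdown();
        try {
            looper.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] { whileBlocked, checker.isCompleted() };
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("completed while blocked: " + result[0]);
        System.out.println("completed after release: " + result[1]);
    }
}
```

The check never inspects the stuck thread directly. It only observes whether a small task queued on that thread managed to run.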

The MonitorChecker calls each monitor's monitor() method. Take the monitor method in ActivityManagerService.java: it only enters a synchronized block, so if this is held elsewhere, the call waits here.

    public void monitor() {
        synchronized (this) { }
    }
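Why an empty synchronized block works as a probe can be shown with a small plain-Java sketch. The names LockProbeSketch, Service, and probe are hypothetical, and the helper thread with a latch is this sketch's own way of measuring the delay; Watchdog itself measures it through mCompleted and mStartTime.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeSketch {
    /** Hypothetical service whose monitor() only acquires and releases its own lock. */
    static final class Service {
        void monitor() {
            synchronized (this) { }
        }
    }

    /** Returns true if monitor() returned within timeoutMillis. */
    static boolean probe(Service service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread prober = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        prober.setDaemon(true);
        prober.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns {probe result with the lock free, probe result with the lock held}. */
    static boolean[] runDemo() {
        Service service = new Service();
        boolean whenFree = probe(service, 1_000);
        boolean whenHeld;
        synchronized (service) {
            // The calling thread holds the lock, so the probe thread must wait.
            whenHeld = probe(service, 200);
        }
        return new boolean[] { whenFree, whenHeld };
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("lock free, probe returned in time: " + result[0]);
        System.out.println("lock held, probe returned in time: " + result[1]);
    }
}
```

A monitor() that returns quickly therefore shows that nobody is sitting on the service's lock at that moment.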

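BinderThreadMonitor is a special case. The actual check lives in frameworks/native/libs/binder/IPCThreadState.cpp: the call blocks while the number of threads currently handling Binder calls has reached mMaxThreads. For SystemServer this maximum is 31, defined in SystemServer.java. Code fragment: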

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
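The same wait-on-a-counter pattern can be written in Java with wait and notifyAll. This is a rough sketch of the idea only: the class ThreadPoolGateSketch and the beginCall/endCall methods are hypothetical and do not exist in Binder, and the demo uses a pool size of 2 instead of 31 to keep it short.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ThreadPoolGateSketch {
    private final int maxThreads;
    private int executing;   // plays the role of mExecutingThreadsCount

    ThreadPoolGateSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    synchronized void beginCall() {
        executing++;
    }

    synchronized void endCall() {
        executing--;
        notifyAll();         // like signalling mThreadCountDecrement
    }

    /** Blocks while every pool thread is busy. */
    synchronized void blockUntilThreadAvailable() throws InterruptedException {
        while (executing >= maxThreads) {
            wait();
        }
    }

    /** Returns {passed while saturated, passed after one call ended}. */
    static boolean[] runDemo() {
        ThreadPoolGateSketch gate = new ThreadPoolGateSketch(2);
        gate.beginCall();
        gate.beginCall();    // the pool is now saturated
        CountDownLatch passed = new CountDownLatch(1);
        Thread monitor = new Thread(() -> {
            try {
                gate.blockUntilThreadAvailable();
                passed.countDown();
            } catch (InterruptedException ignored) {
                // Demo thread; nothing to clean up.
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        try {
            boolean whileSaturated = passed.await(100, TimeUnit.MILLISECONDS);
            gate.endCall();  // one thread becomes free
            boolean afterRelease = passed.await(2, TimeUnit.SECONDS);
            return new boolean[] { whileSaturated, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[] { false, false };
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("passed while saturated: " + result[0]);
        System.out.println("passed after release: " + result[1]);
    }
}
```

If all Binder threads stay busy, this monitor never returns, and the MonitorChecker that called it eventually becomes overdue.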

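Back in HandlerChecker's run method: once every mCurrentMonitor.monitor() call has returned without getting stuck, it sets mCompleted = true and mCurrentMonitor = null.

Step [c] below uses this result.

b. The check interval is 30s, so after starting the checks the Watchdog thread waits for 30s.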
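The loop at [b] recomputes the remaining time after every wakeup, so an early notify or an interruption cannot shorten the interval. Here is a minimal plain-Java sketch of that pattern. The class IntervalWaitSketch is hypothetical, and System.nanoTime() is used as a stand-in for SystemClock.uptimeMillis(), which is an assumption: the Android clock additionally stops counting while the device is in deep sleep.

```java
final class IntervalWaitSketch {
    /**
     * Waits on lock until intervalMillis (must be positive) have elapsed,
     * going back to waiting after any early wakeup. Returns the elapsed
     * time in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The original logs the interruption and keeps waiting;
                    // this sketch simply keeps waiting.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();
        // A second thread notifies early; the loop should still wait out the interval.
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                // Demo thread; nothing to clean up.
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        notifier.start();
        long waited = waitFullInterval(lock, 200);
        notifier.join();
        System.out.println("waited the full interval: " + (waited >= 200));
    }
}
```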

c. Call evaluateCheckerCompletionLocked to compute the current check result; it calls getCompletionStateLocked on each checker to get its completion state. Code fragment:

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

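The four states and their conditions (with the default 60s timeout):

COMPLETED: the watched message queue is not blocked and the watched monitors can acquire their locks. As described in step [a], mCompleted is true in this case.
WAITING: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 0 and 30s.
WAITED_HALF: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 30 and 60s.
OVERDUE: the message queue has been blocked, or a monitor has been unable to acquire its lock, for longer than the default 60s timeout.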
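The four states can be checked against concrete numbers with a small standalone version of the state function. This is a sketch: the class CompletionStateSketch is hypothetical, the numeric values of the constants are chosen here so that a larger number means a more severe state, and worstOf assumes that the overall result is simply the most severe state among all checkers.

```java
final class CompletionStateSketch {
    // Ordered by severity (the numeric values are an assumption of this sketch).
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of one checker, given its flag, elapsed time, and timeout in milliseconds. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state: the most severe state among all checkers (assumed aggregation). */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        System.out.println(stateOf(false, 10_000, waitMax)); // blocked for 10s
        System.out.println(stateOf(false, 45_000, waitMax)); // blocked for 45s
        System.out.println(stateOf(false, 61_000, waitMax)); // blocked for 61s
        System.out.println(worstOf(COMPLETED, WAITED_HALF, WAITING));
    }
}
```

Under that aggregation, one overdue checker is enough to make the whole result OVERDUE, no matter how many other checkers completed.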

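d. COMPLETED and WAITING are within the acceptable range. If the state is WAITED_HALF, however, Watchdog calls ActivityManagerService.dumpStackTraces(true, pids, null, null, getInterestingNativePids()) to dump the stack traces of the current process, along with those of the native processes of interest. These are mainly: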

    public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
        "/system/bin/audioserver",
        "/system/bin/cameraserver",
        "/system/bin/drmserver",
        "/system/bin/mediadrmserver",
        "/system/bin/mediaserver",
        "/system/bin/sdcard",
        "/system/bin/surfaceflinger",
        "media.extractor", // system/bin/mediaextractor
        "media.codec", // vendor/bin/hw/[email protected]
        "com.android.bluetooth",  // Bluetooth service
    };

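e. If the state is OVERDUE, the timeout has expired. Watchdog gets the checkers that are overdue through getBlockedCheckersLocked and builds a description of what is blocked through describeCheckersLocked.

MonitorChecker overdue: prints "Blocked in monitor" + the monitor's name + "on" + the thread name.
HandlerChecker overdue: prints "Blocked in handler on" + the checker name (for example ui thread) + the thread name.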
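To make the two message shapes concrete, the following sketch assembles them from their parts. The class BlockedDescriptionSketch is hypothetical, the exact punctuation (the parentheses around the thread name) is an assumption rather than a copy of the Android source, and the names used in main are illustrative values only.

```java
final class BlockedDescriptionSketch {
    /**
     * Builds the description of one blocked checker.
     * monitorClass is null for a plain HandlerChecker.
     */
    static String describe(String monitorClass, String checkerName, String threadName) {
        if (monitorClass == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClass
                + " on " + checkerName + " (" + threadName + ")";
    }

    public static void main(String[] args) {
        // Illustrative values, not taken from a real log.
        System.out.println(describe(null, "ui thread", "android.ui"));
        System.out.println(describe("com.android.server.am.ActivityManagerService",
                "foreground thread", "android.fg"));
    }
}
```

When reading a real log, the text after WATCHDOG KILLING SYSTEM PROCESS: is this description, so it names the checker and thread to look for in the stack traces.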

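f. Call ActivityManagerService.dumpStackTraces again to dump the call stacks of the current process and the native processes of interest. If RECORD_KERNEL_THREADS is set, call dumpKernelStackTraces to dump the kernel stacks of this process's threads. Then run doSysRq to dump the current kernel and CPU state.

doSysRq('w'): dumps tasks that are in uninterruptible (blocked) state.
doSysRq('l'): shows a stack backtrace for all active CPUs.

The error is also written to the DropBox.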
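The DropBox write in the code above runs on its own thread and is joined with a 2s limit, because the call might itself deadlock. The pattern in isolation looks roughly like this. It is a sketch: the class BoundedReportSketch and the method runBounded are hypothetical names.

```java
final class BoundedReportSketch {
    /**
     * Runs task on a helper thread and waits at most timeoutMillis
     * (must be positive) for it. Returns true if the task finished in time.
     */
    static boolean runBounded(Runnable task, long timeoutMillis) {
        Thread helper = new Thread(task, "bounded-report-sketch");
        helper.setDaemon(true);   // a stuck helper must not keep the JVM alive
        helper.start();
        try {
            helper.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }

    public static void main(String[] args) {
        boolean quick = runBounded(() -> { }, 1_000);
        boolean stuck = runBounded(() -> {
            try {
                Thread.sleep(5_000);   // simulates a call that hangs
            } catch (InterruptedException ignored) {
                // Demo thread; nothing to clean up.
            }
        }, 100);
        System.out.println("quick task finished in time: " + quick);
        System.out.println("stuck task finished in time: " + stuck);
    }
}
```

The caller moves on after the timeout either way, so a hung reporting call cannot stop the watchdog from reaching the kill step.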

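g. If an ActivityController is set, the current information is passed to it. If it answers with a non-negative value, Watchdog keeps waiting instead of killing the process.

h. Watchdog makes the system process kill itself: it writes the WATCHDOG KILLING SYSTEM PROCESS message to the log and calls Process.killProcess(Process.myPid()) to kill system_server. This step is skipped if a debugger is or was recently attached, or if restarts are not allowed.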
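Steps g and h amount to a short decision chain, which can be written as a pure function and tested. This is a simplified sketch: the class KillDecisionSketch and the Outcome enum are hypothetical, and the two debugger branches of the original (which differ only in their log message) are merged into one.

```java
final class KillDecisionSketch {
    enum Outcome { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_NOT_ALLOWED, KILL }

    /**
     * controllerResult is null when no controller is set; otherwise it is the
     * controller's reply (per the comment in the code above, 1 = keep waiting,
     * -1 = kill system).
     */
    static Outcome decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;
        }
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_DEBUGGER;
        }
        if (!allowRestart) {
            return Outcome.SKIP_RESTART_NOT_ALLOWED;
        }
        return Outcome.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));   // no controller, no debugger
        System.out.println(decide(1, 0, true));      // controller asks to wait
        System.out.println(decide(-1, 2, true));     // debugger attached
        System.out.println(decide(null, 0, false));  // restart not allowed
    }
}
```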

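Summary

1. Watchdog uses HandlerChecker to monitor whether a message queue is blocked, and MonitorChecker to monitor whether a core system service holds a lock for too long.
2. HandlerChecker treats a queue as healthy if mHandler.getLooper().getQueue().isPolling() is true; otherwise it posts itself to the queue and measures how long it takes to run. BinderThreadMonitor blocks while the number of busy Binder threads has reached the process maximum. The other monitors block on synchronized (this).
3. After a timeout, the system dumps a series of information: stack traces of the current process and the core native processes, kernel thread stack traces, the threads blocked in the kernel, and backtraces of all CPUs.
4. After a timeout, Watchdog kills its own process (system_server), which causes zygote to restart.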
