應用與系統穩定性第二篇---ANR的監測與資訊採集
第一篇文章-ofollow,noindex">應用與系統穩定性第一篇---ANR問題分析的一般套路 主要梳理了幾個ANR案例,講了分析ANR問題的一般思路。本文基於Android O,深入理解ANR發生時系統的處理邏輯,加深對ANR問題的實戰處理。實際處理ANR問題中,對於很多的資訊還是比較陌生。比如ANR的trace是怎麼輸出來的?有時候為什麼ANR的trace是無效的?一個程序發生了ANR的過程中又發生了ANR,系統是怎麼處理的?什麼是後臺ANR?等等小問題如果你都能回答,那麼可以不用看了。
一、ANR的trace是如何生成的
1、appNotResponding方法解讀
發生ANR的時候,一般先搜尋ANR in 獲取最直觀的資訊,如下:
06-16 16:16:28.59018532073 E ActivityManager: ANR in com.android.camera (com.android.camera/.Camera) 06-16 16:16:28.59018532073 E ActivityManager: PID: 27661 06-16 16:16:28.59018532073 E ActivityManager: Reason: Input dispatching timed out (com.android.camera/com.android.camera.Camera, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.Wait queue length: 24.Wait queue head age: 5511.1ms.) 06-16 16:16:28.59018532073 E ActivityManager: Load: 16.25 / 29.48 / 38.33 06-16 16:16:28.59018532073 E ActivityManager: CPU usage from 0ms to 8058ms later: 06-16 16:16:28.59018532073 E ActivityManager:58% 291/mediaserver: 51% user + 6.7% kernel / faults: 2457 minor 4 major 06-16 16:16:28.59018532073 E ActivityManager:27% 317/mm-qcamera-daemon: 21% user + 5.8% kernel / faults: 15965 minor 06-16 16:16:28.59018532073 E ActivityManager:0.4% 288/debuggerd: 0% user + 0.3% kernel / faults: 21615 minor 87 major 06-16 16:16:28.59018532073 E ActivityManager:17% 27661/com.android.camera: 10% user + 6.8% kernel / faults: 2412 minor 34 major 06-16 16:16:28.59018532073 E ActivityManager:16% 1853/system_server: 10% user + 6.4% kernel / faults: 1754 minor 87 major 06-16 16:16:28.59018532073 E ActivityManager:10% 539/sensors.qcom: 7.8% user + 2.6% kernel / faults: 16 minor 06-16 16:16:28.59018532073 E ActivityManager:4.4% 277/surfaceflinger: 1.8% user + 2.6% kernel / faults: 14 minor 06-16 16:16:28.59018532073 E ActivityManager:4% 203/mmcqd/0: 0% user + 4% kernel 06-16 16:16:28.59018532073 E ActivityManager:2.6% 3510/com.android.phone: 1.9% user + 0.6% kernel / faults: 1148 minor 8 major 06-16 16:16:28.59018532073 E ActivityManager:2.1% 2902/com.android.systemui: 1.6% user + 0.4% kernel / faults: 1272 minor 32 major 06-16 16:16:28.59018532073 E ActivityManager:1.6% 3110/com.miui.whetstone: 1.6% user + 0% kernel / faults: 2614 minor 22 major 06-16 16:16:28.59018532073 E ActivityManager:0.8% 99/kswapd0: 0% user + 0.8% kernel 06-16 16:16:28.59018532073 E ActivityManager:1.4% 217/jbd2/mmcblk0p25: 0% user + 1.4% kernel 06-16 16:16:28.59018532073 E ActivityManager:1.4% 223/logd: 0.7% user + 0.7% kernel / faults: 4 minor 06-16 16:16:28.59018532073 E ActivityManager:0.9% 12808/kworker/0:1: 0% user + 0.9% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.8% 35/kworker/u:2: 0% user + 0.8% kernel 06-16 16:16:28.59018532073 E ActivityManager:0% 3222/com.miui.sysbase: 0% user + 0% kernel / faults: 1314 minor 12 major 06-16 16:16:28.59018532073 E ActivityManager:0.8% 3446/com.android.nfc: 0.4% user + 0.3% kernel / faults: 1223 minor 9 major 06-16 16:16:28.59018532073 E ActivityManager:0.7% 10866/kworker/u:1: 0% user + 0.7% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.6% 642/mdss_fb0: 0% user + 0.6% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.6% 29336/kworker/u:7: 0% user + 0.6% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.4% 6/kworker/u:0: 0% user + 0.4% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.4% 22924/kworker/u:6: 0% user + 0.4% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.3% 4421/mpdecision: 0% user + 0.3% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.2% 276/servicemanager: 0.1% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.2% 289/rild: 0.2% user + 0% kernel / faults: 20 minor 06-16 16:16:28.59018532073 E ActivityManager:0.1% 4161/mcd: 0% user + 0% kernel / faults: 9 minor 1 major 06-16 16:16:28.59018532073 E ActivityManager:0.1% 3/ksoftirqd/0: 0% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.1% 5/kworker/0:0H: 0% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.1% 7/kworker/u:0H: 0% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager:0% 215/flush-179:0: 0% user + 0% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.1% 321/displayfeature: 0.1% user + 0% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.1% 368/irq/33-cpubw_hw: 0% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager:0.1% 403/qmuxd: 0% user + 0.1% kernel / faults: 60 minor 06-16 16:16:28.59018532073 E ActivityManager:0% 3491/com.xiaomi.finddevice: 0% user + 0% kernel / faults: 706 minor 06-16 16:16:28.59018532073 E ActivityManager:0.1% 29330/ksoftirqd/1: 0% user + 0.1% kernel 06-16 16:16:28.59018532073 E ActivityManager: 96% TOTAL: 56% user + 29% kernel + 6.3% iowait + 4.1% softirq
應用在發生ANR的時候,不論是Provider,input還是BroadcastReceiver和Service處理超時等,系統都會呼叫appNotResponding方法,進行trace的收集,主要記錄當時的案發現場,各個相關執行緒的狀態一目瞭然。下面從appNotResponding入手一起看看原始碼是怎麼寫的。
這個方法的引數有5個,如下:
- app 當前發生ANR的程序
- activity 發生ANR的介面
- parent 發生ANR的介面的上一級介面
- aboveSystem 這個引數系統中用的全部都是False,還沒有看到有true的地方
- annotation發生ANR的原因
776final void appNotResponding(ProcessRecord app, ActivityRecord activity, 777ActivityRecord parent, boolean aboveSystem, final String annotation) { // 填充firstPids和lastPids陣列,從最近執行程序中挑選 // firstPids: 用於儲存ANR程序及其父程序,system_server程序和persistent的程序 // lastPids: 用於儲存除firstPids外的其他程序 778ArrayList<Integer> firstPids = new ArrayList<Integer>(5); 779SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20); 780 781if (mService.mController != null) { 782try { 783// 0 == continue, -1 = kill process immediately 784int res = mService.mController.appEarlyNotResponding( 785app.processName, app.pid, annotation); 786if (res < 0 && app.pid != MY_PID) { 787app.kill("anr", true); 788} 789} catch (RemoteException e) { 790mService.mController = null; 791Watchdog.getInstance().setActivityController(null); 792} 793} 794 795long anrTime = SystemClock.uptimeMillis(); //更新Cpu的資訊 796if (ActivityManagerService.MONITOR_CPU_USAGE) { 797mService.updateCpuStatsNow(); 798} 799 800// Unless configured otherwise, swallow ANRs in background processes & kill the process. //showBackground意味在後臺的時候是否讓ANR有感知,如果showBackground為fasle就不會彈出ANR的dialog了 801boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(), 802Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0; 803 804boolean isSilentANR; 805 806synchronized (mService) { //這些場景的ANR直接做忽略處理(關機,app已經ANR了,APP處在crash流程中,被AMS或者其他人殺死的) 807// PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down. 808if (mService.mShuttingDown) { 809Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation); 810return; 811} else if (app.notResponding) { 812Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation); 813return; 814} else if (app.crashing) { 815Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation); 816return; 817} else if (app.killedByAm) { 818Slog.i(TAG, "App already killed by AM skipping ANR: " + app + " " + annotation); 819return; 820} else if (app.killed) { 821Slog.i(TAG, "Skipping died app ANR: " + app + " " + annotation); 822return; 823} 824 825// In case we come through here for the same app before completing 826// this one, mark as anring now so we will bail out. 827app.notResponding = true; 828 829// Log the ANR to the event log.常常搜尋的am_anr關鍵字,記錄ANR資訊到eventlog中 830EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid, 831app.processName, app.info.flags, annotation); 832 833// Dump thread traces as quickly as we can, starting with "interesting" processes. //ANR程序放到firstPids集合中 834firstPids.add(app.pid); 835 836// Don't dump other PIDs if it's a background ANR //如果是後臺的ANR,就不dump了 837isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app); 838if (!isSilentANR) { 839int parentPid = app.pid; 840if (parent != null && parent.app != null && parent.app.pid > 0) { 841parentPid = parent.app.pid; 842} //把ANR的父程序也記錄firstPids中 843if (parentPid != app.pid) firstPids.add(parentPid); 844 //MY_PID是system_server的pid,如果發生的程序不是system,需要把system程序加到firstPids中 845if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID); 846 //最近執行的程序在mLruProcesses中,遍歷這個List 847for (int i = mService.mLruProcesses.size() - 1; i >= 0; i--) { 848ProcessRecord r = mService.mLruProcesses.get(i); 849if (r != null && r.thread != null) { 850int pid = r.pid; 851if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) { //如果persistent程序,這類是“不死程序”, //這類程序在啟動的時候會針對目標使用者程序的IApplicationThread介面,註冊一個binder訃告監聽器, //一旦日後使用者程序意外掛掉,AMS就能在第一時間感知到, //並採取相應的措施。如果AMS發現意外掛掉的應用是persistent的,它會嘗試重新啟動這個應用 852if (r.persistent) { 853firstPids.add(pid); 854if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r); 855} else if (r.treatLikeActivity) { 856firstPids.add(pid); 857if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r); 858} else { //其他的都加到lastPids中 859lastPids.put(pid, Boolean.TRUE); 860if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r); 861} 862} 863} 864} 865} 866} 867 868// Log the ANR to the main log.輸出到main log裡面 869StringBuilder info = new StringBuilder(); 870info.setLength(0); 871info.append("ANR in ").append(app.processName); 872if (activity != null && activity.shortComponentName != null) { 873info.append(" (").append(activity.shortComponentName).append(")"); 874} 875info.append("\n"); 876info.append("PID: ").append(app.pid).append("\n"); 877if (annotation != null) { 878info.append("Reason: ").append(annotation).append("\n"); 879} 880if (parent != null && parent != activity) { 881info.append("Parent: ").append(parent.shortComponentName).append("\n"); 882} 883 884ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true); 885 886// don't dump native PIDs for background ANRs unless it is the process of interest 887String[] nativeProcs = null; 888if (isSilentANR) { //NATIVE_STACKS_OF_INTEREST是Watchog中定義的幾個關鍵的程序 889for (int i = 0; i < NATIVE_STACKS_OF_INTEREST.length; i++) { 890if (NATIVE_STACKS_OF_INTEREST[i].equals(app.processName)) { 891nativeProcs = new String[] { app.processName }; 892break; 893} 894} 895} else { 896nativeProcs = NATIVE_STACKS_OF_INTEREST; 897} 898 //獲取nativeProcs這些程序的pid 899int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs); 900ArrayList<Integer> nativePids = null; 901 902if (pids != null) { 903nativePids = new ArrayList<Integer>(pids.length); 904for (int i : pids) { 905nativePids.add(i); 906} 907} 908 909// For background ANRs, don't pass the ProcessCpuTracker to 910// avoid spending 1/2 second collecting stats to rank lastPids. // 911File tracesFile = mService.dumpStackTraces(true, firstPids, 912(isSilentANR) ? null : processCpuTracker, 913(isSilentANR) ? null : lastPids, 914nativePids); 915 916String cpuInfo = null; 917if (ActivityManagerService.MONITOR_CPU_USAGE) { //更新Cpu的資訊 918mService.updateCpuStatsNow(); 919synchronized (mService.mProcessCpuTracker) { // 記錄ANR之前的cpu使用情況(CPU usage from xxxms to 0ms ago) 920cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime); 921} // 記錄從anr時間開始的Cpu負載 922info.append(processCpuTracker.printCurrentLoad()); 923info.append(cpuInfo); 924} 925// 記錄從anr時間開始的cpu使用情況(CPU usage from xxms to xxms later) 926info.append(processCpuTracker.printCurrentState(anrTime)); 927 928Slog.e(TAG, info.toString()); 929if (tracesFile == null) { 930// 如果trace為空,則傳送singal 3到傳送ANR的程序,相當於執行adb shell kill -3 pid命令 931Process.sendSignal(app.pid, Process.SIGNAL_QUIT); 932} 933 //寫到DropBox資料夾裡面(data/system/dropbox) 934mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation, 935cpuInfo, tracesFile, null); 936 937if (mService.mController != null) { 938try { 939// 0 == show dialog, 1 = keep waiting, -1 = kill process immediately 940int res = mService.mController.appNotResponding( 941app.processName, app.pid, info.toString()); 942if (res != 0) { 943if (res < 0 && app.pid != MY_PID) { 944app.kill("anr", true); 945} else { 946synchronized (mService) { 947mService.mServices.scheduleServiceTimeoutLocked(app); 948} 949} 950return; 951} 952} catch (RemoteException e) { 953mService.mController = null; 954Watchdog.getInstance().setActivityController(null); 955} 956} 957 958synchronized (mService) { 959mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid); 960 //如果是後臺ANR,直接kill 961if (isSilentANR) { 962app.kill("bg anr", true); 963return; 964} 965 966// Set the app's notResponding state, and look up the errorReportReceiver 967 吧 (app, 968activity != null ? activity.shortComponentName : null, 969annotation != null ? "ANR " + annotation : "ANR", 970info.toString()); 971 972// 彈出無響應對話方塊 973Message msg = Message.obtain(); 974HashMap<String, Object> map = new HashMap<String, Object>(); 975msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG; 976msg.obj = map; 977msg.arg1 = aboveSystem ? 1 : 0; 978map.put("app", app); 979if (activity != null) { 980map.put("activity", activity); 981} 982 給android.ui執行緒發生一個訊息,彈出對話方塊 983mService.mUiHandler.sendMessage(msg); 984} 985} 986
上面的程式碼擼了一遍,總結一下appNotResponding的業務邏輯:
- 1、輸出ANR Reason資訊到Events log中,關鍵字是am_anr,一般都是用這個資訊來確定Anr的發生時間
- 2、如果是後臺的anr,即isSilentANR值為true,那麼不顯示ANR對話方塊,直接kill掉ANR所在程序
- 3、輸出ANR 資訊到main log中,關鍵字是ANR in,含有Cpu負載資訊,一般都是用這個看ANR發生時候的其他程序的資源佔用情況
-
4、呼叫dumpStackTraces去生成trace檔案,這個trace檔案有哪些程序呢?
a.firstPids佇列:第一個是ANR程序,第二個是system_server,剩餘是所有persistent程序;
b.Native佇列:是指/system/bin/目錄的mediaserver,sdcard 以及surfaceflinger程序;
c.lastPids佇列: 是指mLruProcesses中的不屬於firstPids的所有程序。 - 5、將traces檔案和 CPU使用情況資訊儲存到dropbox,即data/system/dropbox目錄
- 6、彈窗告知使用者
除了上面的業務總結,我們發現有幾個小知識點
- 1、關機過程雖然比較耗時,但是不會出現ANR,不必多說,你說人家都關機了,還有提示發生ANR的必要了嗎?
- 2、ANR有兩種,前臺ANR和後臺ANR,我在除錯中發現,後臺ANR非常容易發生,系統直接kill掉這個應用程序,但是使用者沒有感知。
- 3、am_anr關鍵字比ANR in要先列印,所以更接近ANR的實際發生時間
2、dumpStackTraces(1/2)方法解讀
繼續看appNotResponding中的核心邏輯dumpStackTraces
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5492 5483/** 5484* If a stack trace dump file is configured, dump process stack traces. 5485* @param clearTraces causes the dump file to be erased prior to the new 5486*traces being written, if true; when false, the new traces will be 5487*appended to any existing file content. 5488* @param firstPids of dalvik VM processes to dump stack traces for first 5489* @param lastPids of dalvik VM processes to dump stack traces for last 5490* @param nativePids optional list of native pids to dump stack crawls 5491*/ 5492public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids, 5493ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, 5494ArrayList<Integer> nativePids) { 5495ArrayList<Integer> extraPids = null; 5496...... 5524boolean useTombstonedForJavaTraces = false; 5525File tracesFile; 5526//如果dalvik.vm.stack-trace-dir沒有配置,trace就寫到全域性檔案data/anr/traces.txt中 5527final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", ""); 5528if (tracesDirProp.isEmpty()) { 5529// When dalvik.vm.stack-trace-dir is not set, we are using the "old" trace 5530// dumping scheme. All traces are written to a global trace file (usually 5531// "/data/anr/traces.txt") so the code below must take care to unlink and recreate 5532// the file if requested. 5533// 5534// This mode of operation will be removed in the near future. 5535 5536 5537String globalTracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null); 5538if (globalTracesPath.isEmpty()) { // 沒有配置dalvik.vm.stack-trace-file,則返回null 5539Slog.w(TAG, "dumpStackTraces: no trace path configured"); 5540return null; 5541} 5542 5543tracesFile = new File(globalTracesPath); 5544try { //刪除舊的,建立新的 5545if (clearTraces && tracesFile.exists()) { 5546tracesFile.delete(); 5547} 5548 5549tracesFile.createNewFile(); 5550FileUtils.setPermissions(globalTracesPath, 0666, -1, -1); // -rw-rw-rw- 5551} catch (IOException e) { 5552Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesFile, e); 5553return null; 5554} 5555} else { 5556File tracesDir = new File(tracesDirProp); 5557// When dalvik.vm.stack-trace-dir is set, we use the "new" trace dumping scheme. 5558// Each set of ANR traces is written to a separate file and dumpstate will process 5559// all such files and add them to a captured bug report if they're recent enough. 5560maybePruneOldTraces(tracesDir); 5561 5562// NOTE: We should consider creating the file in native code atomically once we've 5563// gotten rid of the old scheme of dumping and lot of the code that deals with paths 5564// can be removed. // 建立trace檔案,格式為anr_yyyy-MM-dd-HH-mm-ss-SSS 5565tracesFile = createAnrDumpFile(tracesDir); 5566if (tracesFile == null) { 5567return null; 5568} 5569 5570useTombstonedForJavaTraces = true; 5571} 5572 5573dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids, 5574useTombstonedForJavaTraces); 5575return tracesFile; 5576}
這段程式碼並沒有真正的執行dump操作,只是確定了trace檔案路徑,如果”dalvik.vm.stack-trace-dir”這個屬性沒有配置,就使用舊的dump策略,trace資訊寫入到全域性trace檔案/data/anr/traces.txt中,刪除舊的traces檔案,建立新的traces檔案;如果”dalvik.vm.stack-trace-dir”屬性被配置了,就建立格式為anr_yyyy-MM-dd-HH-mm-ss-SSS的檔案。
dalvik.vm.stack-trace-file特性
dalvik.vm.stack-trace-file特性允許你指定要將執行緒堆疊追蹤寫入的檔名,如果不存在,將建立,新的資訊將追加到檔案尾,檔名通過-Xstacktracefile引數寫入虛擬機器。例如:
adb shell setprop dalvik.vm.stack-trace-file /tmp/stack-traces.txt
如果這個特性沒有被定義,虛擬機器會在收到這個訊號時將堆疊追蹤資訊寫入android log。
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5581 5581private static synchronized File createAnrDumpFile(File tracesDir) { 5582if (sAnrFileDateFormat == null) { 5583sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS"); 5584} 5585 5586final String formattedDate = sAnrFileDateFormat.format(new Date()); 5587final File anrFile = new File(tracesDir, "anr_" + formattedDate); 5588 5589try { 5590if (anrFile.createNewFile()) { 5591FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw------- 5592return anrFile; 5593} else { 5594Slog.w(TAG, "Unable to create ANR dump file: createNewFile failed"); 5595} 5596} catch (IOException ioe) { 5597Slog.w(TAG, "Exception creating ANR dump file:", ioe); 5598} 5599 5600return null; 5601} 5602
3、dumpStackTraces(2/2)方法解讀
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5704 5704private static void dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids, 5705ArrayList<Integer> nativePids, ArrayList<Integer> extraPids, 5706boolean useTombstonedForJavaTraces) { 5707 5708// We don't need any sort of inotify based monitoring when we're dumping traces via 5709// tombstoned. Data is piped to an "intercept" FD installed in tombstoned so we're in full 5710// control of all writes to the file in question. 5711final DumpStackFileObserver observer; 5712if (useTombstonedForJavaTraces) { 5713observer = null; 5714} else { 5715// Use a FileObserver to detect when traces finish writing. 5716// The order of traces is considered important to maintain for legibility. 5717observer = new DumpStackFileObserver(tracesFile); 5718} 5719 5720// 所有的dump操作需要在20秒之內完成 5721long remainingTime = 20 * 1000; 5722try { 5723if (observer != null) { 5724observer.startWatching(); 5725} 5726 5727// 第一步收集firstPids中程序的trace,超時會自動放棄 5728if (firstPids != null) { 5729int num = firstPids.size(); 5730for (int i = 0; i < num; i++) { 5731if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for pid " 5732+ firstPids.get(i)); 5733final long timeTaken; 5734if (useTombstonedForJavaTraces) { 5735timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime); 5736} else { 5737timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime); 5738} 5739 5740remainingTime -= timeTaken; 5741if (remainingTime <= 0) { 5742Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + firstPids.get(i) + 5743"); deadline exceeded."); 5744return; 5745} 5746 5747if (DEBUG_ANR) { 5748Slog.d(TAG, "Done with pid " + firstPids.get(i) + " in " + timeTaken + "ms"); 5749} 5750} 5751} 5752 5753// 第二步收集nativePids中程序的trace,超時會自動放棄 5754if (nativePids != null) { 5755for (int pid : nativePids) { 5756if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for native pid " + pid); 5757final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime); 5758 5759final long start = SystemClock.elapsedRealtime(); 5760Debug.dumpNativeBacktraceToFileTimeout( 5761pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000)); 5762final long timeTaken = SystemClock.elapsedRealtime() - start; 5763 5764remainingTime -= timeTaken; 5765if (remainingTime <= 0) { 5766Slog.e(TAG, "Aborting stack trace dump (current native pid=" + pid + 5767"); deadline exceeded."); 5768return; 5769} 5770 5771if (DEBUG_ANR) { 5772Slog.d(TAG, "Done with native pid " + pid + " in " + timeTaken + "ms"); 5773} 5774} 5775} 5776 5777//第三步收集extraPids中程序的trace,超時會自動放棄,這個程式碼和第一步一樣的 5778if (extraPids != null) { 5779for (int pid : extraPids) { 5780if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + pid); 5781 5782final long timeTaken; 5783if (useTombstonedForJavaTraces) { 5784timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime); 5785} else { 5786timeTaken = observer.dumpWithTimeout(pid, remainingTime); 5787} 5788 5789remainingTime -= timeTaken; 5790if (remainingTime <= 0) { 5791Slog.e(TAG, "Aborting stack trace dump (current extra pid=" + pid + 5792"); deadline exceeded."); 5793return; 5794} 5795 5796if (DEBUG_ANR) { 5797Slog.d(TAG, "Done with extra pid " + pid + " in " + timeTaken + "ms"); 5798} 5799} 5800} 5801} finally { 5802if (observer != null) { 5803observer.stopWatching(); 5804} 5805} 5806}
firstPids和extraPids中的程序dump trace有兩種方式,看useTombstonedForJavaTraces的取值;
- 如果useTombstonedForJavaTraces為真,採用dumpJavaTracesTombstoned獲取trace
- 如果useTombstonedForJavaTraces為假,採用DumpStackFileObserver#dumpWithTimeout獲取trace
對於nativePids中的程序,只能通過 Debug.dumpNativeBacktraceToFileTimeout來獲取trace。
除了上面的總結還需要思考幾個小問題
-
問題1: 利用am_anr來確定ANR的發生時間真的準確嗎?
在上一篇文章就已經說了,不準確,可能由於系統資源比較緊張造成一定程度的滯後,十多秒都可能,此時需要結合eventlog進一步分析。
問題2:ANR中經常出現沒有trace,或者無效trace是怎麼回事?
- 從appNotResponding方法中,我們看到有不少return的地方,在dumpStackTraces方法,也會有return出現,比如這麼多程序的trace需要在20秒之內完成,在系統資源緊張的時候,能不能完成,需要打一個問號了
- 有一些ANR問題需要binder對端的trace,通過上面的分析看,大概率是沒有對端的trace的,除非binder call到system_server或者比較重要的幾個native程序裡面
- 另外發生ANR的時候,執行緒並沒有暫停掉,有可能等到我們去抓這個程序的trace的時候,這個程序的主執行緒都恢復正常了,比如經常性出現下面的nativePollOnce狀態的堆疊。
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 obj=0x7685ba50 self=0x7fb0840a00 | sysTid=1460 nice=-10 cgrp=default sched=0/0 handle=0x7fb4824aa0 | state=S schedstat=( 213445473008 99617457476 724519 ) utm=14518 stm=6826 core=0 HZ=100 | stack=0x7fe82c6000-0x7fe82c8000 stackSize=8MB | held mutexes= kernel: __switch_to+0x70/0x7c kernel: SyS_epoll_wait+0x2d4/0x394 kernel: SyS_epoll_pwait+0xc4/0x150 kernel: el0_svc_naked+0x24/0x28 native: #00 pc 000000000001bf2c/system/lib64/libc.so (syscall+28) native: #01 pc 00000000000e813c/system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160) native: #02 pc 000000000054b6f8/system/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE+328) native: #03 pc 000000000054b56c/system/lib64/libart.so (_ZN3art12JniMethodEndEjPNS_6ThreadE+28) native: #04 pc 00000000008ccc50/system/framework/arm64/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+156) at android.os.MessageQueue.nativePollOnce(Native method) at android.os.MessageQueue.next(MessageQueue.java:323) at android.os.Looper.loop(Looper.java:142) at com.android.server.SystemServer.run(SystemServer.java:387) at com.android.server.SystemServer.main(SystemServer.java:245) at java.lang.reflect.Method.invoke!(Native method) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:904) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:794)
-
問題3: ANR中的dumpStackTraces是耗時操作,會不會引起系統的ANR?
我覺得是有可能的,而是在系統穩定性很差的情況下,大量應用都ANR了,都需要dumpStackTraces,那麼都有引起系統watchdog的可能。AOSP在P上有多個changelist對此進行優化,有興趣可以自己檢視。
3.1、DumpStackFileObserver方式獲取trace
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5631 5625/** 5626* Legacy code, do not use. Existing users will be deleted. 5627* 5628* @deprecated 5629*/ 5630@Deprecated 5631public static class DumpStackFileObserver extends FileObserver { 5632// Keep in sync with frameworks/native/cmds/dumpstate/utils.cpp 5633private static final int TRACE_DUMP_TIMEOUT_MS = 10000; // 10 seconds 5634 5635private final String mTracesPath; 5636private boolean mClosed; 5637 5638public DumpStackFileObserver(String tracesPath) { 5639super(tracesPath, FileObserver.CLOSE_WRITE); 5640mTracesPath = tracesPath; 5641} 5642 5643@Override 5644public synchronized void onEvent(int event, String path) { 5645mClosed = true; 5646notify(); 5647} 5648 5649public long dumpWithTimeout(int pid, long timeout) { //發生singal開輸出trace,相當於adb shell kill -3 5650sendSignal(pid, SIGNAL_QUIT); 5651final long start = SystemClock.elapsedRealtime(); 5652// timeout為20s,TRACE_DUMP_TIMEOUT_MS為10s,取小則為10s 5653final long waitTime = Math.min(timeout, TRACE_DUMP_TIMEOUT_MS); 5654synchronized (this) { 5655try { 5656wait(waitTime); // Wait for traces file to be closed. 5657} catch (InterruptedException e) { 5658Slog.wtf(TAG, e); 5659} 5660} 5661 5662// This avoids a corner case of passing a negative time to the native 5663// trace in case we've already hit the overall timeout. 5664final long timeWaited = SystemClock.elapsedRealtime() - start; 5665if (timeWaited >= timeout) { 5666return timeWaited; 5667} 5668 ....... 5683return (end - start); 5684} 5685}
這個方法將會被廢棄了,原理上就是用adb shell kill -3抓trace
3.2、dumpJavaTracesTombstoned方式獲取trace
5687/** 5688* Dump java traces for process {@code pid} to the specified file. If java trace dumping 5689* fails, a native backtrace is attempted. Note that the timeout {@code timeoutMs} only applies 5690* to the java section of the trace, a further {@code NATIVE_DUMP_TIMEOUT_MS} might be spent 5691* attempting to obtain native traces in the case of a failure. Returns the total time spent 5692* capturing traces. 5693*/ 5694private static long dumpJavaTracesTombstoned(int pid, String fileName, long timeoutMs) { 5695final long timeStart = SystemClock.elapsedRealtime(); 5696if (!Debug.dumpJavaBacktraceToFileTimeout(pid, fileName, (int) (timeoutMs / 1000))) { 5697Debug.dumpNativeBacktraceToFileTimeout(pid, fileName, 5698(NATIVE_DUMP_TIMEOUT_MS / 1000)); 5699} 5700 5701return SystemClock.elapsedRealtime() - timeStart; 5702}
實質上還是由dumpJavaBacktraceToFileTimeout來獲取trace。dumpNativeBacktraceToFileTimeout是一個native方法,下面來看一看
1131static jboolean android_os_Debug_dumpNativeBacktraceToFileTimeout(JNIEnv* env, jobject clazz, 1132jint pid, jstring fileName, jint timeoutSecs) { 1133const bool ret = dumpTraces(env, pid, fileName, timeoutSecs, kDebuggerdNativeBacktrace); 1134return ret ? JNI_TRUE : JNI_FALSE; 1135}
1107static bool dumpTraces(JNIEnv* env, jint pid, jstring fileName, jint timeoutSecs, 1108DebuggerdDumpType dumpType) { 1109const ScopedUtfChars fileNameChars(env, fileName); 1110if (fileNameChars.c_str() == nullptr) { 1111return false; 1112} 1113 //開啟 /data/anr/anr_yyyy-MM-dd-HH-mm-ss-SSS檔案 1114android::base::unique_fd fd(open(fileNameChars.c_str(), 1115O_CREAT | O_WRONLY | O_NOFOLLOW | O_CLOEXEC | O_APPEND, 11160666)); 1117if (fd < 0) { 1118fprintf(stderr, "Can't open %s: %s\n", fileNameChars.c_str(), strerror(errno)); 1119return false; 1120} 1121 1122return (dump_backtrace_to_file_timeout(pid, dumpType, timeoutSecs, fd) == 0); 1123}
int dump_backtrace_to_file_timeout(pid_t tid, DebuggerdDumpType dump_type, int timeout_secs,int fd) { android::base::unique_fd copy(dup(fd)); if (copy == -1) { return -1; } int timeout_ms = timeout_secs > 0 ? timeout_secs * 1000 : 0; return debuggerd_trigger_dump(tid, dump_type, timeout_ms, std::move(copy)) ? 0 : -1; }
可以看到實質是呼叫debuggerd_trigger_dump,可以看debuggerd_client.cpp,可向debuggerd傳送命令,利用debuggerd抓trace。
總結:在dumpStackTraces方法中,主要根據useTombstonedForJavaTraces這個屬性來設定不同的抓取trace的方式,java的程序實質上是使用adb shell kill -3來抓trace,對於native的程序實質上是使用debuggerd(debuggerd -b [pid])抓trace。而useTombstonedForJavaTraces的值是false還是true取決與"dalvik.vm.stack-trace-dir"這個屬性在開機的時候是否設定了。