1. 程式人生 > >一個AMS、PMS、WMS競爭鎖引起死鎖無法開啟問題的分析過程

一個AMS、PMS、WMS競爭鎖引起死鎖無法開啟問題的分析過程

問題:在工廠段出現,一直提示“android 正在啟動”,長按開機鍵恢復。由於產線生產機器都是按K計算,所以概率問題會放大,此問題大約1000臺機器會出現10臺左右的卡在android正在啟動,由於到了量產階段,問題緊急,無奈我還被緊急派去生產車間解決問題,最快的航班經濟艙已滿,生平第一次坐了一次頭等艙(- -!)
首先就是要復現問題,抓到問題log,編譯ENG版本,開啟相關log,開始復現:

PackageManagerService中的:
static boolean DEBUG_SETTINGS = true;
static boolean DEBUG_PREFERRED = true;
static boolean DEBUG_UPGRADE = true;
private static boolean DEBUG_INSTALL = true;
private static boolean DEBUG_REMOVE = true;
private static boolean DEBUG_BROADCASTS = true;
private static boolean DEBUG_SHOW_INFO = true;
private static boolean DEBUG_PACKAGE_INFO = true;
private static boolean DEBUG_INTENT_MATCHING = true;
private static boolean DEBUG_PACKAGE_SCANNING = true;
private static boolean DEBUG_VERIFY = true;
private static boolean DEBUG_DEXOPT = true;
private static boolean DEBUG_ABI_SELECTION = true;
private static boolean DEBUG_PERMISSION = true;
ActivityManagerService中:
static boolean DEBUG_SERVICE = true;
static boolean DEBUG_SERVICE_EXECUTING = true;

由於問題的表現是“Android正在啟動”異常,在log中能夠看到開機時某些apk又去提取odex,列舉如下:
1、//其中開機重新提取odex僅為arm64的apk有(無lib庫):

05-06 16:43:22.870 1076 1076 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/app/E163_WindPowerSaver/E163_WindPowerSaver.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/[email protected]@[email protected]
[email protected] --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13 05-06 16:43:24.415 1081 1081 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/plugin/UsbChecker/UsbChecker.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/
[email protected]
@[email protected]@classes.dex --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13 05-06 16:43:25.875 1098 1098 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/app/AtciService/AtciService.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/[email protected]@[email protected]@classes.dex --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13 05-06 16:43:29.491 1154 1154 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/priv-app/E163_ZteDeviceRegister/E163_ZteDeviceRegister.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/[email protected]@[email protected][email protected] --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13 05-06 16:44:55.465 1249 1249 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/vendor/operator/app/E163_PoemChina/E163_PoemChina.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/[email protected]@[email protected]@[email protected][email protected] --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13

Solution有兩種:
1、刪除LOCAL_MULTILIB :=32或LOCAL_32_BIT_ONLY=true
2、對於這個apk的解決辦法為預編譯時跳過提起其的odex檔案:
\build\core\dex_preopt_odex_install.mk中新增:
ifeq ($(LOCAL_MODULE),MODULE_NAME)
LOCAL_DEX_PREOPT:=
endif

built_odex:=
installed_odex:=

MODULE_NAME寫為需要替換的apk的module name(注意新增位置、避免編譯報錯)

2、//其中開機重新提取odex僅為arm的apk(僅支援32位):

05-06 16:43:30.057 1163 1163 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/priv-app/E163_share/E163_share.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm/[email protected]@[email protected][email protected] --instruction-set=arm --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13
05-06 16:43:52.769 1232 1232 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/vendor/operator/app/E163_CBox_mm_ZTE/E163_CBox_mm_ZTE.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm/[email protected]@[email protected]@[email protected][email protected] --instruction-set=arm --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13
05-06 16:44:09.802 1235 1235 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/vendor/operator/app/E163_mCloud/E163_mCloud.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm/[email protected]@[email protected]@[email protected][email protected] --instruction-set=arm --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13
05-06 16:44:44.288 1239 1239 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/vendor/operator/app/E163_MMPlug/E163_MMPlug.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm/[email protected]@[email protected]@[email protected][email protected] --instruction-set=arm --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13

Solution:在Android.mk中設定LOCAL_MULTILIB :=32,則預編譯時會提取該apk的32bit的odex

3、開機重新提取odex既有arm又有arm64的apk:

05-06 16:43:41.732 1200 1200 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/app/webview/webview.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm/[email protected]@[email protected]@classes.dex --instruction-set=arm --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13
05-06 16:43:44.385 1203 1203 I dex2oat : /system/bin/dex2oat --zip-fd=11 --zip-location=/system/app/webview/webview.apk --oat-fd=12 --oat-location=/data/dalvik-cache/arm64/[email protected]@[email protected]@classes.dex --instruction-set=arm64 --instruction-set-features=default --runtime-arg -Xms64m --runtime-arg -Xmx512m --swap-fd=13

Solution:在Android.mk中設定LOCAL_MULTILIB :=both,則預編譯時會提取該apk的32bit和64bit的odex

問題分析到這的時候,我們發現一個現象,成功被帶偏了,就是如果強制斷電(拔電池),就會比較容易復現問題,實際上對於低概率的問題,還是儘量不要只往一個方向思考,期間研究了許多google對於斷電系統的一些保護推測是開機時異常斷電,有機率造成file corruption (如果是某app的oat file corruption,會造成app無法啟動)
瞭解到Google設計了一套方法,只要上一次是異常斷電,下一次開機就”全部重建”ART optimization files(*.oat),因此
1. 會顯示Application is starting…
2. 全部重建oat file,所以開機時間會延長
Emmmm,看起來好像找到了問題的方向。當然,後面這個推測被否了,因為發現不異常斷電也能復現出問題。再仔細分析log,發現如下可疑點:
檢查log發現systemserver都沒有初始化好,下面log沒走:

SystemServer: PerfMgr state notifier
SystemServer: HDMI Manager Service
SystemServer: Making services ready

//發現這裡This kernel does not have HDMI audio support,
05-14 09:00:11.812 805 805 D SettingsProvider: lookupValue table system cache.fullyMatchesDisk() volume_music_hdmi
05-14 09:00:11.812 805 805 V SettingsProvider: call(system:volume_music_hdmi) for 0
05-14 09:00:11.921 805 805 W WiredAccessoryManager: This kernel does not have HDMI audio support

//即需要確認客戶HDMI feature是否disable掉了
1196 if (!disableNonCoreServices && SystemProperties.get("ro.mtk_hdmi_support").equals("1")) {
1197 try {
1198 Slog.i(TAG, "HDMI Manager Service");
1199 hdmiManager = new MtkHdmiManagerService(context);
1200 ServiceManager.addService(Context.HDMI_SERVICE,
1201 hdmiManager.asBinder());
1202 } catch (Throwable e) {
1203 Slog.e(TAG, "Failure starting MtkHdmiManager", e);
1204 }
1205 }

然後這個和開不了機沒有特別的關係。
從log看:

05-14 09:00:09.691 484 484 I sn : Retry 2
05-14 09:00:12.167 805 805 I SystemServer: PerfMgr state notifier
05-14 09:04:39.720 484 484 I sn : Retry 29
05-14 09:16:19.786 484 484 I sn : Retry 99

這之後就沒有sn的log了。
但是一直到下邊的時間點,還沒看到launcher起來:

05-14 10:15:33.784

開機log中,system server跑到9點多一點就沒log了,只有surface這邊的log在跑。判斷是卡到SystemServer初始化某個地方從而導致後面一直在跑“Android 正在啟動”而得不到釋放,因為跑到perfService這裡還沒有走到systemserver後面的watchdog,也不會走到重啟。
//目前可以看到PerfMgr state notifier後面,HDMI Manager Service之前,確定與disable HDMI無關,
檢視code:

1179 perfServiceMgr = new PerfServiceManager(context);
//1179這行函式裡面的log是有打出來的
     //05-14 09:00:12.171 805 805 D PerfServiceManager: [PerfService] Created and started PerfService thread 
     
1181 IPerfService perfService = null;
1182 perfService = new PerfServiceImpl(context, perfServiceMgr);
1183
1184 Slog.d("perfservice", "perfService=" + perfService);
//1184這行log沒有打出來,那麼只有卡到PerfServiceImpl這個函式裡面,而這個函式只是在處理廣播 
     //mContext.registerReceiver(new PerfServiceBroadcastReceiver(), broadcastFilter);
     //那麼有可能是ActivityManagerService.java中的registerReceiver()函式中沒有拿到這個鎖: synchronized(this)

所以接下來是需要復現去抓SystemServer的backtrace來分析了,按照以下步驟打systemServer的backtrace:
1.先清除data/anr下檔案
2.使用adb shell ps檢視system_server的pid
>>adb shell
>>ps | grep system_server //過濾查詢 system_server資訊
3.使用adb shell kill -16 $(上面獲得的pid)
>>kill -16 pid
4.重複2-3此上面步驟,多獲取幾次backtrace
5.以上backtrace會預設儲存在data/anr下
找到問題原因和抓到trace,問題漸漸明朗:
1.總結:
通過抓取的call back分析肯定是上層system_server死鎖導致。
2.原因:
具體原因就是AMS中的鎖和POMS中的鎖相互被佔用導致,但是正常應該不會卡在這裡。
3.懷疑點:
根據之前解這類開機卡住的經驗,比較大的可能是哪裡code修改或底層sensor配置不對導致,需要同步檢查相關修改和配置。
4.發現一個必現路徑:

“將手機時間修改為第一次出現正在啟動訊息框之後的3天,開機必現正在啟動,更容易復現卡住問題”

原因:修改時間,必走這裡所以會走正在啟動:

4884 if (timeSinceLast > interval) { //interval定義的變數是3天:DEFAULT_MANDATORY_FSTRIM_INTERVAL = 3 * DateUtils.DAY_IN_MILLIS;
4885 doTrim = true; //時間改成超過3天,這裡doTrim就置為true,就會走到下面 if (doTrim) showBootMessage() 從而必現“Android 正在啟動“
4886 Slog.w(TAG, "No disk maintenance in " + timeSinceLast
4887 + "; running immediately");
4888 }

由於AMS、PMS和WMS在鎖上存在競爭,導致死鎖發生。想到的解決方法是需要解開迴圈等待的問題,需要將PowerManagerService中setScreenBrightnessOverrideFromWindowManagerInternal方法裡面的synchronized(mLock)修改成非阻塞型的,這樣需要將PowerManagerService中的mLock替換成顯式鎖,如java.util.Lock,然後用mLock.tryLock()和unLock()來對程式碼進行加鎖和解鎖。使用synchronized(obj)是內建鎖,這樣無法精細的控制鎖。顯式鎖就是使用java.util.Lock或者其他的鎖物件,然後採用非阻塞式鎖.

Lock lock=new ReentrantLock(); 
if(lock.tryLock()) {
//獲得鎖,執行具體任務 
lock.unlock(); 
}
//其他地方可以使用阻塞鎖:
lock.lock(); 
try{ 
//具體任務 
}finally{ 
lock.unlock(); 
}

總結:
Main thread拿PowerManagerService的mLock這把鎖去搶AMS的鎖,而:
android.display執行緒拿AMS的鎖,去搶PowerManagerService的鎖,兩個執行緒發生死鎖。懷疑是google的程式碼原生就存在這種可能性。
關鍵callStack如下

DALVIK THREADS (60):
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 obj=0x740befb8 self=0x7fa0caf800
| sysTid=798 nice=-2 cgrp=default sched=0/0 handle=0x7fa4e05eb0
| state=S schedstat=( 8332874739 193417372 11682 ) utm=589 stm=244 core=0 HZ=100
| stack=0x7fef078000-0x7fef07a000 stackSize=8MB
| held mutexes=
at com.android.server.am.ActivityManagerService.registerReceiver(ActivityManagerService.java:16635)
- waiting to lock <0x39099865> (a com.android.server.am.ActivityManagerService) held by thread 22 //競爭AMS的鎖
at android.app.ContextImpl.registerReceiverInternal(ContextImpl.java:1776)
at android.app.ContextImpl.registerReceiver(ContextImpl.java:1744)
at android.app.ContextImpl.registerReceiver(ContextImpl.java:1738)
at com.android.server.display.DisplayPowerController.(DisplayPowerController.java:416)
at com.android.server.display.DisplayManagerService$LocalService.initPowerManagement(DisplayManagerService.java:1888)
- locked <0x38a2d93a> (a com.android.server.display.DisplayManagerService$SyncRoot)
at com.android.server.power.PowerManagerService.systemReady(PowerManagerService.java:605)
- locked <0x386259eb> (a java.lang.Object) //PowerMs的鎖
at com.android.server.SystemServer.startOtherServices(SystemServer.java:1281)
at com.android.server.SystemServer.run(SystemServer.java:333)
at com.android.server.SystemServer.main(SystemServer.java:234)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:959)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:754)

另一個執行緒:
"android.display" prio=5 tid=22 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12e91f20 self=0x7fa0cc2800
| sysTid=825 nice=-4 cgrp=default sched=0/0 handle=0x7f8d888000
| state=S schedstat=( 87739463 31037617 214 ) utm=7 stm=1 core=1 HZ=100
| stack=0x7f8d786000-0x7f8d788000 stackSize=1036KB
| held mutexes=
at com.android.server.power.PowerManagerService.setScreenBrightnessOverrideFromWindowManagerInternal(PowerManagerService.java:2873)
- waiting to lock <0x386259eb> (a java.lang.Object) held by thread 1 //競爭PowerMs的鎖
at com.android.server.power.PowerManagerService.access$7300(PowerManagerService.java:100)
at com.android.server.power.PowerManagerService$LocalService.setScreenBrightnessOverrideFromWindowManager(PowerManagerService.java:3987)
at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLockedInner(WindowManagerService.java:10961)
at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLockedLoop(WindowManagerService.java:9573)
at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLocked(WindowManagerService.java:9515)
at com.android.server.wm.WindowManagerService.setNewConfiguration(WindowManagerService.java:4441)
- locked <0x22675592> (a java.util.HashMap)
at com.android.server.am.ActivityManagerService.updateConfigurationLocked(ActivityManagerService.java:17941)
at com.android.server.am.ActivityManagerService.updateConfiguration(ActivityManagerService.java:17801)
- locked <0x39099865> (a com.android.server.am.ActivityManagerService) //持有AMS的鎖
at com.android.server.wm.WindowManagerService.sendNewConfiguration(WindowManagerService.java:7762)
at com.android.server.wm.WindowManagerService.updateRotationUnchecked(WindowManagerService.java:7120)
at com.android.server.wm.WindowManagerService.updateRotation(WindowManagerService.java:7070)
at com.android.internal.policy.impl.PhoneWindowManager.updateRotation(PhoneWindowManager.java:6383)
at com.android.internal.policy.impl.PhoneWindowManager.enableScreenAfterBoot(PhoneWindowManager.java:6354)
at com.android.server.wm.WindowManagerService.performEnableScreen(WindowManagerService.java:6515)
at com.android.server.wm.WindowManagerService$H.handleMessage(WindowManagerService.java:8663)
at android.os.Handler.dispatchMessage(Handler.java:111)
at android.os.Looper.loop(Looper.java:194)
at android.os.HandlerThread.run(HandlerThread.java:61)
at com.android.server.ServiceThread.run(ServiceThread.java:46)

tid=1和tid=22兩個死鎖執行緒
詳細請看抓取的call back如下tid=1和tid=22兩個執行緒,從中可以看出:

"main" prio=5 tid=1 Blocked

Void systemReady() { // PowerManagerService, tid = 1
Synchronized(mLock) { // java.lang.Object, may be held by this thread
……
registerReceiver(…) {
ams.registerReceiver(…) { // ActivityManagerService
……
synchronized(this) // com.android.server.am.ActivityManagerService, waiting!!!!, 
……
}
}
…
}
}

"android.display" prio=5 tid=22 Blocked

Void updateConfiguration(…) { //ActivityManagerService, tid = 22
Synchronized(this) { // com.android.server.am.ActivityManagerService, may be held by this thread
……
pms. setScreenBrightnessOverrideFromWindowManagerInternal() {
synchronized(mLock) { // java.lang.Object, may be held by this thread, waiting!!!!
……
}
}
...
}
}

自此,問題分析完成,考慮到修改鎖型別的方案作用在設定螢幕亮度的地方,防止導致螢幕在某種情形下無法點亮等更嚴重的情況。在量產階段暫時先採用了保守的解法:去掉3天更新doTrim的功能。
實際提供的解鎖方案應當是可行,是解決問題出現的根本原因。