一:背景

1. 講故事

七月中旬有一位朋友加wx求助,他的程式在生產上跑著跑著記憶體就飆起來了,貌似沒有回頭的趨勢,詢問如何解決,截圖如下:

和這位朋友聊下來,感覺像是自己在小縣城當了個小老闆,規律的生活,有當地資源,各種小關係,有一股財務自由的味道,這也是我一直嚮往的生活方式 。

既然朋友找到我了,我得想辦法給他解決問題,既然是記憶體暴漲,我就賭一把在託管層面吧,嘿嘿,上windbg說話。

二: windbg 分析

1. 託管還是非託管

一直在追這個系列的朋友應該知道,我無數次的用 !address -summary!eeheap -gc 這兩個命令來判斷當前記憶體屬於託管層還是非託管層。


  1. 0:000> !address -summary
  2. --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
  3. MEM_FREE 393 7dfe`f2105000 ( 125.996 TB) 98.43%
  4. MEM_RESERVE 1691 200`0f1e4000 ( 2.000 TB) 99.81% 1.56%
  5. MEM_COMMIT 6191 0`fed07000 ( 3.981 GB) 0.19% 0.00%
  6. 0:000> !eeheap -gc
  7. Number of GC Heaps: 1
  8. generation 0 starts at 0x000001D2E572BBC8
  9. generation 1 starts at 0x000001D2E54F70E0
  10. generation 2 starts at 0x000001D252051000
  11. ephemeral segment allocation context: none
  12. segment begin allocated size
  13. 000001D252050000 000001D252051000 000001D26204FFE0 0xfffefe0(268431328)
  14. Large object heap starts at 0x000001D262051000
  15. segment begin allocated size
  16. 000001D262050000 000001D262051000 000001D2655F3F80 0x35a2f80(56242048)
  17. Total Size: Size: 0xbf4dbf80 (3209543552) bytes.
  18. ------------------------------
  19. GC Heap Size: Size: 0xbf4dbf80 (3209543552) bytes.

卦象上程序指標為 3.98G ,GC堆指標為 3209543552 = 3G ,很顯然,本次事故屬於 託管層面

2. 尋找託管層上的大物件

我們都知道C#是託管語言,所以甭管有用沒用的物件都逃不出GC堆,言外之意就是看GC堆準沒錯,挑幾個大物件看看。


  1. 0:000> !dumpheap -stat
  2. Statistics:
  3. MT Count TotalSize Class Name
  4. 00007ff98a68f090 391475 43869284 System.Int32[]
  5. 00007ff98b6adfa0 1902760 45666240 System.Collections.ObjectModel.ReadOnlyCollection`1[[System.Linq.Expressions.Expression, System.Linq.Expressions]]
  6. 00007ff98b6ac3c0 1951470 46835280 System.Linq.Expressions.ConstantExpression
  7. 00007ff98bc452e0 1681178 53797696 System.Linq.Expressions.TypedConstantExpression
  8. 00007ff98eacb6b8 1902708 60886656 System.Dynamic.Utils.ListArgumentProvider
  9. 00007ff98f236518 1774982 70999280 Microsoft.EntityFrameworkCore.Query.Expressions.ColumnExpression
  10. 00007ff98c650c58 1681142 80694816 System.Linq.Expressions.MethodCallExpression3
  11. 00007ff98a82bc38 3414094 81938256 System.RuntimeMethodHandle
  12. 00007ff98fd96fc0 17750 83936016 System.Collections.Generic.Dictionary`2+Entry[[System.Reflection.MemberInfo, System.Private.CoreLib],[System.Linq.Expressions.Expression, System.Linq.Expressions]][]
  13. 00007ff98e5ed5d8 35493 101740504 System.Collections.Generic.Dictionary`2+Entry[[Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceCacheKey, Microsoft.Extensions.DependencyInjection],[System.Object, System.Private.CoreLib]][]
  14. 00007ff98bcff6a8 3639389 116460448 System.Linq.Expressions.PropertyExpression
  15. 00007ff98b85cf00 5028347 160907104 System.Reflection.Emit.GenericFieldInfo
  16. 00007ff98a671e18 2178117 168395994 System.String
  17. 00007ff98a5b6610 160565 171498416 System.Object[]
  18. 00007ff98eaa8ab0 4981589 199263560 System.Linq.Expressions.MemberAssignment
  19. 00007ff98a672360 398740 391928469 System.Byte[]
  20. 00007ff98a746d68 181886 486150592 System.Char[]

從託管堆上看,System.Linq.Expressions.MemberAssignment 物件高達 498w ,很明顯有問題,從類名看可能和 ExpressionTree 有關,那就抽幾個物件看看它的引用鏈上是否有過大的物件。


  1. 0:000> !gcroot 000001d25399f690
  2. HandleTable:
  3. 000001D251B715A8 (pinned handle)
  4. -> 000001D262068CF0 System.Object[]
  5. -> 000001D2531C3B78 Microsoft.EntityFrameworkCore.Internal.ServiceProviderCache
  6. -> 000001D25399E3D0 Remotion.Linq.QueryModel
  7. -> 000001D25399E3B8 Remotion.Linq.Clauses.SelectClause
  8. -> 000001D25442C068 System.Linq.Expressions.MemberInitExpression
  9. -> 000001D25442C050 System.Runtime.CompilerServices.TrueReadOnlyCollection`1[[System.Linq.Expressions.MemberBinding, System.Linq.Expressions]]
  10. -> 000001D2539A0290 System.Linq.Expressions.MemberBinding[]
  11. -> 000001D25399F690 System.Linq.Expressions.MemberAssignment

引用鏈特別長,這裡我就擷取一下,經過一頓排查,我發現大物件居然是 Remotion.Linq.Clauses.SelectClause,objsize 這個物件直接爆掉了,真的很奇葩,如下程式碼所示:


  1. 0:000> !objsize 000001D25399E3B8
  2. sizeof(000001D25399E3B8) = -1187378032 (0xb93a0c90) bytes (Remotion.Linq.Clauses.SelectClause)

有點懵,這個物件居然是罪魁禍首,從引用鏈看它是 EF 下的一個構建表示式樹的小部件,可以肯定的是,朋友在用EF的時候出了什麼問題,不過還得硬著頭皮繼續挖 SelectClause,經過深挖,我發現這個類有大量這樣的 char[] 陣列,匯出來後大概是下面這樣。


  1. Logistics.Text30),
  2. | Text31 = string TryReadValue(t1.Outer.Outer, 42, WmsOutboundConfirmLogistics.Text31),
  3. | Text32 = string TryReadValue(t1.Outer.Outer, 43, WmsOutboundConfirmLogistics.Text32),
  4. | Text33 = string TryReadValue(t1.Outer.Outer, 44, WmsOutboundConfirmLogistics.Text33),
  5. | Text34 = string TryReadValue(t1.Outer.Outer, 45, WmsOutboundConfirmLogistics.Text34),
  6. | Text35 = string TryReadValue(t1.Outer.Outer, 46, WmsOutboundConfirmLogistics.Text35),
  7. | IsQueue = Nullable<bool> TryReadValue(t1.Outer.Outer, 47, WmsOutboundConfirmLogistics.IsQueue),
  8. | IsStop = Nullable<bool> TryReadValue(t1.Outer.Outer, 48, WmsOutboundConfirmLogistics.IsStop),
  9. | CheckCode = string TryReadValue(t1.Outer.Outer, 49, WmsOutboundConfirmLogistics.CheckCode),
  10. | ClientCode = string TryReadValue(t1.Outer.Inner, 50, WmsOutboundOrder.ClientCode),
  11. | WarehouseCode = string TryReadValue(t1.Outer.Inner, 51, WmsOutboundOrder.WarehouseCode),
  12. | ErpNumber = string TryReadValue(t1.Outer.Inner, 52, WmsOutboundOrder.ErpNumber),
  13. | OrderCategory = string TryReadValue(t1.Outer.Inner, 53, WmsOutboundOrder.OrderCategory),
  14. | OrderStatus = string TryReadValue(t1.Outer.Inner, 54, WmsOutboundOrder.OrderStatus),
  15. | OrderType = string TryReadValue(t1.Outer.Inner, 55, WmsOutboundOrder.OrderType),
  16. | SendCompany = string TryReadValue(t1.Outer.Inner, 56, WmsOutboundOrder.SendCompany),
  17. | SendName = string TryReadValue(t1.Outer.Inner, 57, WmsOutboundOrder.SendName),
  18. | SendTel = string TryReadValue(t1.Outer.Inner, 58, WmsOutboundOrder.SendTel),
  19. | SendMobile = string TryReadValue(t1.Outer.Inner, 59, WmsOutboundOrder.SendMobile),
  20. | SendProvince = string TryReadValue(t1.Outer.Inner, 60, WmsOutboundOrder.SendProvince),
  21. | SendCity = string TryReadValue(t1.Outer.Inner, 61, WmsOutboundOrder.SendCity),
  22. | SendArea = string TryReadValue(t1.Outer.Inner, 62, WmsOutboundOrder.SendArea),
  23. | ...
  24. | CategoryName = string TryReadValue(t1.Outer.Inner, 88, WmsOutboundOrder.CategoryName),
  25. | SourcePlatformCode = string TryReadValue(t1.Outer.Inner, 89, WmsOutboundOrder.SourcePlatformCode),
  26. | PayMode = (string)string TryReadValue(t1.Outer.Outer, 90, null),
  27. | List = List<WmsOutboundConfirmLogisticsLinesDTO> WmsOutboundConfirmLogisticsBusiness.GetOrderLines(string TryReadValue(t1.Outer.Outer, 5, WmsOutboundConfirmLogistics.OrderNumber)),
  28. | ConfirmTime = DateTime TryReadValue(t1.Inner, 91, WmsOutboundOrderConfirmation.CreateTime),
  29. | ReturnUrl = (string)string TryReadValue(t1.Outer.Outer, 92, null)
  30. | }
  31. |__ ),
  32. |__ contextType: Core.DataRepository.BaseDbContext,
  33. |__ logger: DiagnosticsLogger<Query>,
  34. |__ queryContext: Unhandled parameter: queryContext)

從內容看,應該是 select 語句的ExpressionTree表示,問了下朋友,說大概是報表業務,不過這些資訊給他貌似也沒有多大幫助,說實話到這裡我其實也不知道怎麼繼續往下排查了,陷入了絕望。

3. 從絕望中尋找希望

我在想,既然EF構建了大量這樣的 ExpressionTree,肯定有問題,但也想不出是什麼問題,隔了半天,我突然靈光一現,EF既然構建了樹,有可能sql也出來了,對,我何不直接在 heap 上搜索 select 的sql語句。。。。


  1. 0:000> !strings /m:*select*
  2. Address Gen Length Value
  3. 000001d2e4de64e0 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  4. 000001d2e4e11e78 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  5. 000001d2e4e3d1f0 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  6. 000001d2e4e673c8 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  7. 000001d2e4e91760 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  8. 000001d2e4ebb2e8 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  9. 000001d2e4ee54f8 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  10. 000001d2e4f10758 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  11. 000001d2e4f398d0 2 1964 SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
  12. ---------------------------------------
  13. 18128 matching strings

果然發現了大量重複的 select 語句,而且從最左邊的記憶體地址看都是非常接近的,也就說明他們是在某一個操作中同時生成的,然後我們匯出幾個sql語句。


  1. SELECT a."Id", ....
  2. FROM "WmsOutboundConfirmLogistics" AS a
  3. INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber"
  4. INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber"
  5. WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE)
  6. ORDER BY a."Id"";
  7. SELECT a."Id", ....
  8. FROM "WmsOutboundConfirmLogistics" AS a
  9. INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber"
  10. INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber"
  11. WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE)
  12. ORDER BY a."Id""

拿到這 1.8w 重複的sql 給朋友看,朋友說這是查詢報表的sql。

4. 所有線索整合打通

那這裡就存在著很大問題,既然是查詢報表,為什麼會有 1.8w 相同的sql,唯一不同的就是 a."OrderNumber" = @__pagination_OrderNumber_0 中的訂單號,難道不應該是 a.OrderNumber in (xxxx) 或者是表關聯查詢嗎??? 整理一下就是下面這樣的猜想:


  1. -- 理想
  2. select * from a where a.id in (1,2,3)
  3. -- 現實
  4. select * from a where a.id=1;
  5. select * from a where a.id=2;
  6. select * from a where a.id=3;

加上每個sql記憶體地址相近,再結合爆表的 Remotion.Linq.Clauses.SelectClause 物件,整個流程大概就是:本該表關聯或者in操作,結果變成了無數個單條sql語句查詢,導致EF底層出現記憶體爆炸式增長。

三:總結

看了下朋友查詢ef的寫法,猜測大多都是人肉構建 ExpressionTree 去查詢資料庫,大寫的,比如下面的這張圖:

解決方案就是讓朋友檢查下表示式樹的寫法問題,或者直接灌寫好的sql得了,說實話這個dump還是費了九牛二虎之力,本以為很簡單,實操起來還是碰到了一點小困難,就當歷練成長吧!

更多高質量乾貨:參見我的 GitHub: dotnetfly