

记一次 .NET某施工建模软件 卡死分析
source link: https://www.cnblogs.com/huangxincheng/p/18082543
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

1. 讲故事
前几天有位朋友在微信上找到我,说他的软件卡死了,分析了下也不知道是咋回事,让我帮忙看一下,很多朋友都知道,我分析dump是免费的,当然也不是所有的dump我都能搞定,也只能尽自己最大能力帮助别人缩小问题范围吧,既然dump有了,接下来就开启分析之路。
二:WinDbg分析
1. 为什么会卡死
不同类型的程序卡死的解决思路是不一样的,朋友也说了是窗体程序,那就重点观察下主线程吧,使用 k 命令即可。
0:000> k 25
# Child-SP RetAddr Call Site
00 00000000`007fc8d8 00007ffd`87439b13 ntdll!NtWaitForAlertByThreadId+0x14
01 00000000`007fc8e0 00007ffd`87439a06 ntdll!RtlpWaitOnAddressWithTimeout+0x43
02 00000000`007fc910 00007ffd`8743987d ntdll!RtlpWaitOnAddress+0xae
03 00000000`007fc980 00007ffd`87435fdc ntdll!RtlpWaitOnCriticalSection+0xd9
04 00000000`007fc9f0 00007ffd`87435ef0 ntdll!RtlpEnterCriticalSectionContended+0xdc
05 00000000`007fca20 00007ffd`536839ea ntdll!RtlEnterCriticalSection+0x40
06 00000000`007fca50 00007ffd`5368470a AcLayers!NS_VirtualRegistry::CRegLock::CRegLock+0x1a
07 00000000`007fca90 00007ffd`536726d2 AcLayers!NS_VirtualRegistry::APIHook_RegOpenKeyExW+0x2a
08 00000000`007fcb10 00007ffd`778e550b AcLayers!NS_WRPMitigation::APIHook_RegOpenKeyExW+0x42
09 00000000`007fcb60 00007ffd`778e5437 xxx!GetCodePageForFont+0xa7
0a 00000000`007fcc90 00007ffd`778e5296 xxx!CToolTipsMgr::NewFont+0x113
0b 00000000`007fcda0 00007ffd`778e18f9 xxx!CToolTipsMgr::LoadTheme+0xb2
0c 00000000`007fcdd0 00007ffd`84b9ca66 xxx!CToolTipsMgr::s_ToolTipsWndProc+0x1b9
0d 00000000`007fce10 00007ffd`84b9c34b user32!UserCallWinProcCheckWow+0x266
0e 00000000`007fcf90 00007ffd`4f36b1cc user32!CallWindowProcW+0x8b
0f 00000000`007fcfe0 00007ffd`4f39ccac System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.DefWndProc+0x9c
10 00000000`007fd090 00007ffd`4f39cc05 System_Windows_Forms_ni!System.Windows.Forms.ToolTip.WndProc+0x9c
11 00000000`007fd260 00007ffd`4f36a3a3 System_Windows_Forms_ni!System.Windows.Forms.ToolTip.ToolTipNativeWindow.WndProc+0x15
12 00000000`007fd290 00007ffd`4f9e1161 System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.Callback+0xc3
13 00000000`007fd330 00007ffd`52c8222e System_Windows_Forms_ni+0x8d1161
14 00000000`007fd3a0 00007ffd`84b9ca66 clr!UMThunkStub+0x6e
15 00000000`007fd430 00007ffd`84b9c78c user32!UserCallWinProcCheckWow+0x266
16 00000000`007fd5b0 00007ffd`84bb3b32 user32!DispatchClientMessage+0x9c
17 00000000`007fd610 00007ffd`874c22c4 user32!__fnINLPCREATESTRUCT+0xa2
18 00000000`007fd670 00007ffd`836a1f24 ntdll!KiUserCallbackDispatcherContinue
19 00000000`007fd7e8 00007ffd`84ba15df win32u!NtUserCreateWindowEx+0x14
1a 00000000`007fd7f0 00007ffd`84ba11d4 user32!VerNtUserCreateWindowEx+0x20f
1b 00000000`007fdb80 00007ffd`84ba1012 user32!CreateWindowInternal+0x1b4
1c 00000000`007fdce0 00007ffd`4f3e8098 user32!CreateWindowExW+0x82
1d 00000000`007fdd70 00007ffd`4f3696f0 System_Windows_Forms_ni+0x2d8098
...
从卦象看,很明显主线程卡在 NtWaitForAlertByThreadId 上,这是有问题的,接下来我们仔细解读下线程栈。
- DispatchClientMessage
这个方法表示当前从 queue 中拿到了别的线程通过 Invoke 送过来的信息,正在处理中。
- LoadTheme
这个方法表示正在用主线程更新窗体样式
- APIHook_RegOpenKeyExW
首先说一下 AcLayers.dll,专业名词叫 垫片
,详情可以看一下《软件调试》,它主要用来处理一些系统级兼容性的问题,然后可以看到它在查询注册表时有一个lock操作。
在非托管代码中,lock 一般都用 临界区(CriticalSection) 实现,那到底它等待的临界区被谁持有着呢?
2. 谁持有着临界区锁
要想获取锁的持有信息,可以使用 !cs -l
或者 !locks
,但这里要提醒一下,在真实的dump分析过程中,有时候不准,所以更好的办法就是从线程栈上提取,那怎么提取呢? 其实就是寻找 ntdll!RtlEnterCriticalSection
方法的第一个参数即可,方法签名如下:
VOID RtlEnterCriticalSection(
PRTL_CRITICAL_SECTION CriticalSection
);
接下来反汇编下 00007ffd536839ea
处的代码,看看 rcx 寄存器是怎么传下来的。
0:000> ub 00007ffd`536839ea
AcLayers!NS_VirtualRegistry::OPENKEY::AddEnumEntries<NS_VirtualRegistry::VIRTUALVAL>+0x11a:
00007ffd`536839ce cc int 3
00007ffd`536839cf cc int 3
AcLayers!NS_VirtualRegistry::CRegLock::CRegLock:
00007ffd`536839d0 48895c2408 mov qword ptr [rsp+8],rbx
00007ffd`536839d5 57 push rdi
00007ffd`536839d6 4883ec30 sub rsp,30h
00007ffd`536839da 488bf9 mov rdi,rcx
00007ffd`536839dd 488d0d4c7f0300 lea rcx,[AcLayers!NS_VirtualRegistry::csRegCriticalSection (00007ffd`536bb930)]
00007ffd`536839e4 ff15ae660100 call qword ptr [AcLayers!_imp_EnterCriticalSection (00007ffd`5369a098)]
从卦象上看,很吉利,这个 rcx 原来是一个全局变量 AcLayers!NS_VirtualRegistry::csRegCriticalSection
, 接下来用 !cs 观察下到底被谁持有着。
0:000> !cs AcLayers!NS_VirtualRegistry::csRegCriticalSection
-----------------------------------------
Critical section = 0x00007ffd536bb930 (AcLayers!NS_VirtualRegistry::csRegCriticalSection+0x0)
DebugInfo = 0x000000001c4e58e0
LOCKED
LockCount = 0x2
WaiterWoken = No
OwningThread = 0x0000000000001d20
RecursionCount = 0x1
LockSemaphore = 0xFFFFFFFF
SpinCount = 0x00000000020007ce
这又是一副吉卦,可以看到当前持有线程是 1d20
,那这个线程正在做什么呢?
3. 1d20 线程为什么持锁不释放
案情往前推进了一步,我们切过去观察下这个线程栈。
0:000> ~~[1d20]s
ntdll!NtDelayExecution+0x14:
00007ffd`874bec14 c3 ret
0:028> kL
# Child-SP RetAddr Call Site
00 00000000`33ccd948 00007ffd`83955381 ntdll!NtDelayExecution+0x14
01 00000000`33ccd950 00007ffd`6d4a2361 KERNELBASE!SleepEx+0xa1
02 00000000`33ccd9f0 00007ffd`8520a75c perfts!CloseLagPerfData+0x21
03 00000000`33ccda30 00007ffd`85209ccd advapi32!CloseExtObjectLibrary+0xec
04 00000000`33ccda90 00007ffd`8396dc6a advapi32!PerfRegCloseKey+0x15d
05 00000000`33ccdae0 00007ffd`839715e6 KERNELBASE!BaseRegCloseKeyInternal+0x72
06 00000000`33ccdb10 00007ffd`83935209 KERNELBASE!ClosePredefinedHandle+0x96
07 00000000`33ccdb40 00007ffd`53685d71 KERNELBASE!RegCloseKey+0x149
08 00000000`33ccdba0 00007ffd`53683ae5 AcLayers!NS_VirtualRegistry::CVirtualRegistry::CloseKey+0xbd
09 00000000`33ccdbf0 00007ffd`51c7737e AcLayers!NS_VirtualRegistry::APIHook_RegCloseKey+0x25
0a 00000000`33ccdc30 00007ffd`51bf4be2 mscorlib_ni+0x58737e
0b 00000000`33ccdce0 00007ffd`513c356a mscorlib_ni!Microsoft.Win32.RegistryKey.Dispose+0x72
0c 00000000`33ccdd20 00007ffd`513c34b9 System_ni!System.Diagnostics.PerformanceCounterLib.GetStringTable+0x41a
...
13 00000000`33cce050 00007ffd`513bfe3c System_ni!System.Diagnostics.PerformanceCounter..ctor+0xd7
14 00000000`33cce0a0 00007ffc`f45cb2ce System_ni!System.Diagnostics.PerformanceCounter..ctor+0x1c
15 00000000`33cce0d0 00007ffc`f45cb14c 0x00007ffc`f45cb2ce
16 00000000`33cce120 00007ffc`f45cb023 0x00007ffc`f45cb14c
...
从卦中看,这个线程貌似在用 CloseLagPerfData
方法关闭一些东西时一直在Sleep等待,可以反汇编 00007ffd6d4a2361
处代码看看等待多久。
0:028> ub 00007ffd`6d4a2361
perfts!CloseLagPerfData+0x5:
00007ffd`6d4a2345 55 push rbp
00007ffd`6d4a2346 488bec mov rbp,rsp
00007ffd`6d4a2349 4883ec30 sub rsp,30h
00007ffd`6d4a234d e8720e0000 call perfts!LagCounterManager::Cleanup (00007ffd`6d4a31c4)
00007ffd`6d4a2352 33db xor ebx,ebx
00007ffd`6d4a2354 eb0b jmp perfts!CloseLagPerfData+0x21 (00007ffd`6d4a2361)
00007ffd`6d4a2356 b964000000 mov ecx,64h
00007ffd`6d4a235b ff15c74e0000 call qword ptr [perfts!_imp_Sleep (00007ffd`6d4a7228)]
...
从卦中的 mov ecx,64h
可以看到是 Sleep(100) 毫秒,更多细节也没空继续追究了,但不管怎么样,它是由上层的计数器类 PerformanceCounter
引发的,这里学一下 4S 店的做法,让朋友能不能不要调用 PerformanceCounter
这个类,咱躲开他就可以了,截图如下:

去掉之后,朋友反馈问题消失。
说来也奇怪,最近发现了二起由 PerformanceCounter
引发的程序卡死,把经验留在这里,希望后来人少踩坑吧!

Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK