MSVC profilers attribute most time to std::panicking::try::do_call<AssertUnwindSafe<closure>,()>
While building servo with a locally-built msvc nightly rust with debug symbols (same behaviour with regular nightly binaries, the debug symbols just give a lot more useful stacks/etc. overall since names are resolved to proper demangled ones), profilers end up attributing >70% of CPU time to std::panicking::try::do_call<AssertUnwindSafe<closure>,()>. They both see the call stack from thread_start -> thread::start_thread -> __rust_maybe_catch_panic -> std::panicking:....
If I break in the (Visual Studio) debugger, basically every Rust-created thread has a stack that looks like this:
servo.exe!net::image_cache_thread::Receivers::recv() Line 277 Unknown
servo.exe!net::image_cache_thread::ImageCache::run(core::option::Option<webrender_traits::api::RenderApi> core_resource_thread, ipc_channel::ipc::IpcReceiver<net_traits::image_cache_thread::ImageCacheCommand>) Line 369 Unknown
servo.exe!00007ff7b1b93ae9() Line 352 Unknown
servo.exe!00007ff7b2dfa6e0() Line 97 Unknown
servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595 Unknown
servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21 Unknown
servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50 Unknown
or
win32u.dll!NtUserGetMessage() Unknown
user32.dll!GetMessageW() Unknown
> servo.exe!00007ff7b2bca4b0() Line 352 Unknown
servo.exe!00007ff7b2dfa6e0() Line 97 Unknown
servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595 Unknown
servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21 Unknown
servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50 Unknown
I have a whole bunch of worker threads that are showing
> servo.exe!00007ff7b2009fca() Line 352 Unknown
servo.exe!00007ff7b2dfa6e0() Line 97 Unknown
servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595 Unknown
servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21 Unknown
servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50 Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
as their entire stack. The "Line 352", on each of those unnamed functions, takes me to panicking.rs:352, which is the do_call mentioned above. But they're also very clearly different addresses, all of which are in the servo.exe module (which is at 00007FF7B1A20000-00007FF7B3EEB000).
Manually dumping the pdb for servo.exe and looking up the relevant offset (0x5e9fca) gives me:
00000000005e7bb0 72712 SymTagFunction net::bluetooth_thread::BluetoothManager::start C:\proj\r\servo\components\net\bluetooth_thread.rs(181)
00000000005f9890 254 SymTagFunction net::bluetooth_thread::BluetoothManager::get_or_create_adapter C:\proj\r\servo\components\net\bluetooth_thread.rs(232)
Which is totally reasonable and correct -- that last thread is somewhere inside the BluetoothManager's thread func. The size -- 72712 -- also looks correct, and our address should be in the middle of it. The source file and line number are also correct!
But the tools are merging all of these things down to line 352. I'm wondering if they're not using the PDB info for this, and the compiler/llvm is generating incorrect info in another layer?
dia2dump says:
$ dia2dump.exe -lines 0x5e9fca servo.exe line 352 at [005E9DF0][0001:005E8DF0], len = 0x1C28 C:\proj\r\rustc\src\libstd\panicking.rsSo it has
0x5e9fcapointing to the wrong place. But, if I grab the lines fornet::bluetooth_thread::BluetoothManager::start:$ dia2dump.exe -lines 'net::bluetooth_thread::BluetoothManager::start' servo.exe ** net::bluetooth_thread::BluetoothManager::start line 181 at [001E7BB0][0001:001E6BB0], len = 0x15C C:\proj\r\servo\components\net\bluetooth_thread.rs line 182 at [001E7D0C][0001:001E6D0C], len = 0x9AIt thinks that that function is at a totally different address. An address exactly
0x400000off from what it should be, which happens to be the default base address used bylink.exe. This might not be being taken into account when outputting debug info? Or, more accurately, this is something that might need to be explicitly taken into account for windows, whereas on the gcc toolchain it doesn't?Vladimir Vukicevic at 2016-10-14 21:47:15
Sorry I can't really quite make heads or tails of much of the output here, but it's standard to see a catch_unwind at the bottom of a stack as all threads catch an unwind. In theory that's not perf critical though as it's vastly dwarfed by creation of the thread itself.
Alex Crichton at 2016-10-15 07:45:01
The three call stacks at the start are the entire call stacks, for builds with full debuginfo. Note that call_box is there, but then there are two functions without any debug info for the function name, but with line numbers and source numbers (line 352, panicking.rs). That debug info is incorrect -- there should be a proper function name there (e.g. one of those is actually the
BluetoothManager::startthread start function), and they should have the proper line number and source file.It looks like the MSVC debug info is incorrect, with (some?) debug info being emitted for symbols at the wrong address.
Vladimir Vukicevic at 2016-10-15 21:18:48
Oh! Sorry I misread the issue here as a perf problem related to catch_unwind, but I see now it's basically just bad stack info and such!
Alex Crichton at 2016-10-16 06:41:02
Yep, sorry -- the issue title is the symptom that I originally saw. Here's a simple testcase that reproduces this on Windows:
use std::thread; fn compute_primes() -> i64 { let mut a: i64 = 2; let mut count: i64 = 0; while count < 200000 { let mut b: i64 = 2; let mut prime = true; while (b * b) <= a { if a % b == 0 { prime = false; break; } b += 1; } if prime { count += 1; } a += 1; } count } fn main() { let child1 = thread::spawn(|| { compute_primes() }); let child2 = thread::spawn(|| { compute_primes() }); println!("{}", child1.join().unwrap()); println!("{}", child2.join().unwrap()); }The two children aren't necessary, but varying the contents of their thread start functions sometimes yields slightly different results -- for example, copy-pasting the contents of compute_primes into one of them results in profilers reporting two distinct stacks, but each with only the
std::panicking::try::do_callsymbol reported. I'm going to dig into the reported debuginfo next.Vladimir Vukicevic at 2016-10-17 16:54:16
Ah ha -- I had a hunch that turned out to be correct. Adding
#[inline(never)]to compute_primes caused everything to look proper in debuggers & profilers. So this issue seems to be that debug info for inlined functions is incorrect.Vladimir Vukicevic at 2016-10-17 16:58:52
@vvuk As far as I know, even msvc C++ does not handle inlined functions. It will report the entire body of a function as being that function, even if numerous other functions were inlined into it. So unless you can identify a function that isn't being reported and isn't inlined, I don't think there's any bug here on Rust's side.
Peter Atashian at 2016-10-17 17:00:51
That's true, but in our case here, there are two functions on the call stack that are entirely missing debug info when they shouldn't be. Additionally, you should be able to set a breakpoint on a function that was inlined (it should set a breakpoint everywhere where that code appears; right now it all gets mapped to the single line where the inlining happened).
In test-thread, breaking during the middle of execution reports something like this (I've subtracted the module load base for all the addresses):
thread-test.exe 0x120d() Line 352 thread-test.exe 0xe250() Line 97 thread-test.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595 thread-test.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21The value of
thread_test::mainis0x1730. Usingdia2dump:$ dia2dump.exe -lines 0x1730 thread-test.exe line 19 at [00001730][0001:00000730], len = 0x166 C:\proj\r\thread-test\src\main.rs $ dia2dump.exe -lines 0x120d thread-test.exe line 352 at [000011D0][0001:000001D0], len = 0x63 C:\proj\r\rustc\src\libstd\panicking.rs line 352 at [00001233][0001:00000233], len = 0x14 line 798 at [00001250][0001:00000250], len = 0x34 C:\proj\r\rustc\src\libcore\result.rs line 799 at [00001284][0001:00000284], len = 0xE line 799 at [00001292][0001:00000292], len = 0x3E line 29 at [000012D0][0001:000002D0], len = 0x32 C:\proj\r\rustc\src\libcore\macros.rs line 798 at [00001302][0001:00000302], len = 0x26 C:\proj\r\rustc\src\libcore\result.rsNote that there's no debuginfo at all for
0x120d-- and the closest one is the copy of thepanicking.rswhich is where the function was inlined into.Vladimir Vukicevic at 2016-10-18 20:54:38
From @fitzgen (he's having github issues):
I verified that on Linux, compute_primes is also being inlined into the try::do_call blah blah however, we are generating correct DWARF info describing the inlining so GDB (and I assume perf too, if this program ran long enough to get a meaningful number of samples) can and does display the correct call stack. So this issue is definitely MSVC/windows/PDB specific somehow.
Vladimir Vukicevic at 2016-10-21 18:29:05
I think the issue is that code compiled with
rustcis omitting the base information for the inlined function if it is only ever inlined. When I compile an equivalent C++ program usingcland usellvm-readobj -codeviewon the object file, I see (irrelevant fields ommitted):FuncId (0x1919) { TypeLeafKind: LF_FUNC_ID (0x1601) ParentScope: 0x0 FunctionType: unsigned __int64 () (0x164C) Name: compute_primes } ... CodeViewDebugInfo [ Subsection [ SubSectionType: Symbols (0xF1) ProcStart { Kind: S_LPROC32_ID (0x1146) FunctionType: compute_primes (0x1919) CodeOffset: ?compute_primes@@YA_KXZ+0x0 Flags [ (0x80) HasOptimizedDebugInfo (0x80) ] DisplayName: compute_primes LinkageName: ?compute_primes@@YA_KXZ } ... ] CodeViewDebugInfo [ Subsection [ SubSectionType: Symbols (0xF1) ProcStart { Kind: S_GPROC32_ID (0x1147) FunctionType: thread_some_primes (0x191A) CodeOffset: ?thread_some_primes@@YAXXZ+0x0 Flags [ (0x80) HasOptimizedDebugInfo (0x80) ] DisplayName: thread_some_primes LinkageName: ?thread_some_primes@@YAXXZ } InlineSite { Inlinee: compute_primes (0x1919) BinaryAnnotations [ ChangeCodeOffsetAndLineOffset: {CodeOffset: 0x0, LineOffset: 0} ChangeLineOffset: 1 ...Specifically that the FuncType
0x1919has its ownCodeViewDebugInfo; later when it's inlined, theBinaryAnnotationsthen have something to modify.From
rustc(and, actually,clang-cl-- I'm pretty sure that this is allvmbug):FuncId (0x10B4) { TypeLeafKind: LF_FUNC_ID (0x1601) ParentScope: primes (0x10B2) FunctionType: __int64 () (0x10B3) Name: compute_primes } ... CodeViewDebugInfo [ Subsection [ SubSectionType: InlineeLines (0xF6) InlineeSourceLine { Inlinee: compute_primes (0x10B4) FileID: C:\proj\r\test-msvc\primes.rs (0x30) SourceLineNum: 3 } ... CodeViewDebugInfo [ Subsection [ SubSectionType: Symbols (0xF1) ProcStart { Kind: S_LPROC32_ID (0x1146) FunctionType: do_call (0x11EE) CodeOffset: _ZN3std9panicking3try7do_call17h7aa81a0e201766daE+0x0 DisplayName: std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64> LinkageName: _ZN3std9panicking3try7do_call17h7aa81a0e201766daE } ... InlineSite { > Inlinee: compute_primes (0x10B4) BinaryAnnotations [ ChangeLineOffset: 6 ChangeCodeOffset: 0x20 ChangeCodeLength: 0x6 ChangeLineOffset: 1 ChangeCodeOffset: 0x10 ChangeLineOffset: 1 ChangeCodeOffset: 0x12 ... } Local { Type: __int64 (0x13) Flags [ (0x0) ] VarName: a } ... Local { Type: __int64 (0x13) Flags [ (0x100) IsOptimizedOut (0x100) ] VarName: count } ...I don't know in detail how CV is supposed to work; but here, the function (our
std::panicking::try::do_call) has anInlineSiteforcompute_primeswith appropriate line/code offsets (that reference the baseInlineeSourceLinescompute_primes). But the piece that's missing, which is what I think is causing the issues, is the actualSymbolssubsection forcompute_primes. This is only a guess though; from the above, everything that the debugger needs could be present (it should be able to figure out that it's in thedo_call, and from there look at that function's debug info and see that it's in theInlineeSite), assuming that theChange...annotations are taken to mean offsets from the start of the inline site.Vladimir Vukicevic at 2016-10-25 14:42:14
Yeah, I think that's it -- https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L764 even has:
// Emit a symbol subsection, required by VS2012+ to find function boundaries. OS.AddComment("Symbol subsection for " + Twine(FuncName)); MCSymbol *SymbolsEnd = beginCVSubsection(ModuleSubstreamKind::Symbols);Vladimir Vukicevic at 2016-10-25 14:47:47
So I tired to trick the compiler into emitting a non-inlined version of
compute_primesby adding:println!("{:?}", compute_primes as *const u8);This causes the non-inlined version to shop up in the
.ll, and aSymbolsection for it in the obj file. From dia2dump:Function : static, [00001950][0001:00000950], len = 00000077, primes::compute_primes Function attribute: Function info: FuncDebugStart : static, [00001950][0001:00000950] FuncDebugEnd : static, [00001950][0001:00000950]However, even though the final
.exe/.pdbhasprimes::compute_primes, it doesn't have it as anInlineSiteat all! From dia2dump:Function : static, [000011D0][0001:000001D0], len = 00000077, std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64> FuncDebugStart : static, [000011D0][0001:000001D0] FuncDebugEnd : static, [000011D0][0001:000001D0] InlineSite : (std::panic::{{impl}}::call_once) has type function InlineSite : (core::ptr::write) has type functionIf I remove the
println!, the final PDB file doesn't mentioncompute_primesat all -- there's no mention of that string in the PDB file. The twodo_calls look exactly the same as above. So, it's not enough to just get theSymbolsection to be emitted.. it seems to need to be referenced somehow to make sure that the linker doesn't just nuke it. dia2dump of a cl-built .exe might have a clue:Function : static, [000064C0][0001:000054C0], len = 0000005F, compute_primes Function attribute: Function info: asyncheh wasinlined FuncDebugStart : static, [000064C0][0001:000054C0] FuncDebugEnd : static, [0000651E][0001:0000551E] Function : static, [00006540][0001:00005540], len = 00000066, int __cdecl thread_some_primes(void) Function attribute: Function info: asyncheh FuncDebugStart : static, [00006540][0001:00005540] FuncDebugEnd : static, [000065A5][0001:000055A5] InlineSite : (compute_primes) has type functionNote the
wasinlinedincompute_primes, and that it's missing in the function info from our original pdb dump.Vladimir Vukicevic at 2016-10-25 15:34:11
This comes from a
FrameProcentry that's part of theSymbol, which LLVM doesn't seem to emit at all right now (dump from cl-built file):FrameProc { TotalFrameBytes: 0x28 PaddingFrameBytes: 0x0 OffsetToPadding: 0x0 BytesOfCalleeSavedRegisters: 0x0 OffsetOfExceptionHandler: 0x0 SectionIdOfExceptionHandler: 0x0 Flags [ (0x114A00) AsynchronousExceptionHandling (0x200) Inlined (0x800) OptimizedForSpeed (0x100000) ]So the fix for this is (probably?) going to be inside LLVM, in https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp. We already properly have a
DISubprogramforcompute_primesemitted, so the CodeView debug output needs to realize that it's being inlined and outputS_FRAMEPROCwith the right flags.Vladimir Vukicevic at 2016-10-25 15:56:04
So a quick hack with this patch to llvm: https://gist.github.com/vvuk/0112ab429e306d504f546d6a55a7bb82 and with the
println!hack in the code gets me a resulting pdb file that has the frameproc with the "wasinlined" flag set. However, the InlineSite is just missing from the output. It's in the obj file for one of thedo_calls: (viallvm-dumpobj -codeview)InlineSite { Inlinee: call_once (0x10C3) } InlineSite { Inlinee: {{closure}} (0x10BB) } InlineSite { Inlinee: compute_primes (0x10B4) } InlineSiteEnd { } InlineSiteEnd { } InlineSiteEnd { } InlineSite { Inlinee: write (0x10C7) } InlineSiteEnd { }but after the link, the resulting binary's pdb file just has:
Function : static, [000011D0][0001:000001D0], len = 00000077, std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64> InlineSite : (std::panic::{{impl}}::call_once) has type function InlineSite : (core::ptr::write) has type functionI'm stumped a to why it's being dropped. Do we have any CodeView experts or even MS folks we could ask for suggestions?
Vladimir Vukicevic at 2016-10-26 03:11:54
Just discovered https://github.com/Microsoft/microsoft-pdb -- and the included
cvdump.exethat's a lot more detailed thandia2dump.exe. The InlineSites are present -- they are just nested anddia2dumponly prints the topmost. But I think that the nesting may be incorrect.Vladimir Vukicevic at 2016-10-26 03:27:48
Okay! With this LLVM patch: https://gist.github.com/vvuk/42bc1825dac505cad499edadc582e5b2 I get proper results: https://i.imgur.com/ZwGGlno.png https://i.imgur.com/V4r7dEk.png . The patch makes the InlineSites not nest in the CodeView output. It also does the FrameProc hack, but that may not be needed; going to do a build without it shortly.
This however only works if I include the
println!hack to make sure a non-inline version ofcompute_primesis emitted. So the LLVM fix is not complete; we still need to have rustc emit a non-inlined version of a function if it was only ever inlined. This will cause code bloat, but maybe that's okay for full debugging. If anyone has any thoughts on how to do that, they'd be welcomed.Not sure what the next steps are here -- if the FrameProc hack is not needed, I can submit the un-nesting to LLVM upstream and see what they say?
Vladimir Vukicevic at 2016-10-26 15:50:11
Both FrameProc and the non-nested InlineSites are needed.
Vladimir Vukicevic at 2016-10-26 17:18:25
Is this still an issue?
konstantinua00 at 2021-08-18 14:12:08