MSVC profilers attribute most time to std::panicking::try::do_call<AssertUnwindSafe<closure>,()>

d9c1c15
Opened by Vladimir Vukicevic at 2023-04-05 17:45:47

While building servo with a locally-built msvc nightly rust with debug symbols (same behaviour with regular nightly binaries, the debug symbols just give a lot more useful stacks/etc. overall since names are resolved to proper demangled ones), profilers end up attributing >70% of CPU time to std::panicking::try::do_call<AssertUnwindSafe<closure>,()>. They both see the call stack from thread_start -> thread::start_thread -> __rust_maybe_catch_panic -> std::panicking:....

If I break in the (Visual Studio) debugger, basically every Rust-created thread has a stack that looks like this:

    servo.exe!net::image_cache_thread::Receivers::recv() Line 277   Unknown
    servo.exe!net::image_cache_thread::ImageCache::run(core::option::Option<webrender_traits::api::RenderApi> core_resource_thread, ipc_channel::ipc::IpcReceiver<net_traits::image_cache_thread::ImageCacheCommand>) Line 369  Unknown
    servo.exe!00007ff7b1b93ae9() Line 352   Unknown
    servo.exe!00007ff7b2dfa6e0() Line 97    Unknown
    servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595    Unknown
    servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21    Unknown
    servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50    Unknown

or

    win32u.dll!NtUserGetMessage()  Unknown
    user32.dll!GetMessageW()   Unknown
>   servo.exe!00007ff7b2bca4b0() Line 352   Unknown
    servo.exe!00007ff7b2dfa6e0() Line 97    Unknown
    servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595    Unknown
    servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21    Unknown
    servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50    Unknown

I have a whole bunch of worker threads that are showing

>   servo.exe!00007ff7b2009fca() Line 352   Unknown
    servo.exe!00007ff7b2dfa6e0() Line 97    Unknown
    servo.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595    Unknown
    servo.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21    Unknown
    servo.exe!std::sys::thread::{{impl}}::new::thread_start(libc::c_void * main) Line 50    Unknown
    kernel32.dll!BaseThreadInitThunk() Unknown
    ntdll.dll!RtlUserThreadStart() Unknown

as their entire stack. The "Line 352", on each of those unnamed functions, takes me to panicking.rs:352, which is the do_call mentioned above. But they're also very clearly different addresses, all of which are in the servo.exe module (which is at 00007FF7B1A20000-00007FF7B3EEB000).

Manually dumping the pdb for servo.exe and looking up the relevant offset (0x5e9fca) gives me:

00000000005e7bb0      72712 SymTagFunction        net::bluetooth_thread::BluetoothManager::start C:\proj\r\servo\components\net\bluetooth_thread.rs(181)
00000000005f9890        254 SymTagFunction        net::bluetooth_thread::BluetoothManager::get_or_create_adapter C:\proj\r\servo\components\net\bluetooth_thread.rs(232)

Which is totally reasonable and correct -- that last thread is somewhere inside the BluetoothManager's thread func. The size -- 72712 -- also looks correct, and our address should be in the middle of it. The source file and line number are also correct!

But the tools are merging all of these things down to line 352. I'm wondering if they're not using the PDB info for this, and the compiler/llvm is generating incorrect info in another layer?

  1. dia2dump says:

    $ dia2dump.exe -lines 0x5e9fca servo.exe
            line 352 at [005E9DF0][0001:005E8DF0], len = 0x1C28     C:\proj\r\rustc\src\libstd\panicking.rs
    

    So it has 0x5e9fca pointing to the wrong place. But, if I grab the lines for net::bluetooth_thread::BluetoothManager::start:

    $ dia2dump.exe -lines 'net::bluetooth_thread::BluetoothManager::start' servo.exe
    ** net::bluetooth_thread::BluetoothManager::start
            line 181 at [001E7BB0][0001:001E6BB0], len = 0x15C      C:\proj\r\servo\components\net\bluetooth_thread.rs
            line 182 at [001E7D0C][0001:001E6D0C], len = 0x9A
    

    It thinks that that function is at a totally different address. An address exactly 0x400000 off from what it should be, which happens to be the default base address used by link.exe. This might not be being taken into account when outputting debug info? Or, more accurately, this is something that might need to be explicitly taken into account for windows, whereas on the gcc toolchain it doesn't?

    Vladimir Vukicevic at 2016-10-14 21:47:15

  2. Sorry I can't really quite make heads or tails of much of the output here, but it's standard to see a catch_unwind at the bottom of a stack as all threads catch an unwind. In theory that's not perf critical though as it's vastly dwarfed by creation of the thread itself.

    Alex Crichton at 2016-10-15 07:45:01

  3. The three call stacks at the start are the entire call stacks, for builds with full debuginfo. Note that call_box is there, but then there are two functions without any debug info for the function name, but with line numbers and source numbers (line 352, panicking.rs). That debug info is incorrect -- there should be a proper function name there (e.g. one of those is actually the BluetoothManager::start thread start function), and they should have the proper line number and source file.

    It looks like the MSVC debug info is incorrect, with (some?) debug info being emitted for symbols at the wrong address.

    Vladimir Vukicevic at 2016-10-15 21:18:48

  4. Oh! Sorry I misread the issue here as a perf problem related to catch_unwind, but I see now it's basically just bad stack info and such!

    Alex Crichton at 2016-10-16 06:41:02

  5. Yep, sorry -- the issue title is the symptom that I originally saw. Here's a simple testcase that reproduces this on Windows:

    use std::thread;
    
    fn compute_primes() -> i64 {
            let mut a: i64 = 2;
            let mut count: i64 = 0;
            while count < 200000 {
                let mut b: i64 = 2;
                let mut prime = true;
                while (b * b) <= a {
                    if a % b == 0 { prime = false; break; }
                    b += 1;
                }
                if prime { count += 1; }
                a += 1;
            }
            count
    }
    
    fn main() {
        let child1 = thread::spawn(|| {
            compute_primes()
        });
        let child2 = thread::spawn(|| {
            compute_primes()
        });
    
        println!("{}", child1.join().unwrap());
        println!("{}", child2.join().unwrap());
    }
    

    The two children aren't necessary, but varying the contents of their thread start functions sometimes yields slightly different results -- for example, copy-pasting the contents of compute_primes into one of them results in profilers reporting two distinct stacks, but each with only the std::panicking::try::do_call symbol reported. I'm going to dig into the reported debuginfo next.

    Vladimir Vukicevic at 2016-10-17 16:54:16

  6. Ah ha -- I had a hunch that turned out to be correct. Adding #[inline(never)] to compute_primes caused everything to look proper in debuggers & profilers. So this issue seems to be that debug info for inlined functions is incorrect.

    Vladimir Vukicevic at 2016-10-17 16:58:52

  7. @vvuk As far as I know, even msvc C++ does not handle inlined functions. It will report the entire body of a function as being that function, even if numerous other functions were inlined into it. So unless you can identify a function that isn't being reported and isn't inlined, I don't think there's any bug here on Rust's side.

    Peter Atashian at 2016-10-17 17:00:51

  8. That's true, but in our case here, there are two functions on the call stack that are entirely missing debug info when they shouldn't be. Additionally, you should be able to set a breakpoint on a function that was inlined (it should set a breakpoint everywhere where that code appears; right now it all gets mapped to the single line where the inlining happened).

    In test-thread, breaking during the middle of execution reports something like this (I've subtracted the module load base for all the addresses):

    thread-test.exe  0x120d() Line 352
    thread-test.exe 0xe250() Line 97
    thread-test.exe!alloc::boxed::{{impl}}::call_box<(),closure>(closure * self, ...) Line 595
    thread-test.exe!std::sys_common::thread::start_thread(libc::c_void * main) Line 21
    

    The value of thread_test::main is 0x1730. Using dia2dump:

    $ dia2dump.exe -lines 0x1730 thread-test.exe
            line 19 at [00001730][0001:00000730], len = 0x166       C:\proj\r\thread-test\src\main.rs
    $ dia2dump.exe -lines 0x120d thread-test.exe
            line 352 at [000011D0][0001:000001D0], len = 0x63       C:\proj\r\rustc\src\libstd\panicking.rs
            line 352 at [00001233][0001:00000233], len = 0x14
            line 798 at [00001250][0001:00000250], len = 0x34       C:\proj\r\rustc\src\libcore\result.rs
            line 799 at [00001284][0001:00000284], len = 0xE
            line 799 at [00001292][0001:00000292], len = 0x3E
            line 29 at [000012D0][0001:000002D0], len = 0x32        C:\proj\r\rustc\src\libcore\macros.rs
            line 798 at [00001302][0001:00000302], len = 0x26       C:\proj\r\rustc\src\libcore\result.rs
    

    Note that there's no debuginfo at all for 0x120d -- and the closest one is the copy of the panicking.rs which is where the function was inlined into.

    Vladimir Vukicevic at 2016-10-18 20:54:38

  9. From @fitzgen (he's having github issues):

    I verified that on Linux, compute_primes is also being inlined into the try::do_call blah blah however, we are generating correct DWARF info describing the inlining so GDB (and I assume perf too, if this program ran long enough to get a meaningful number of samples) can and does display the correct call stack. So this issue is definitely MSVC/windows/PDB specific somehow.

    Vladimir Vukicevic at 2016-10-21 18:29:05

  10. I think the issue is that code compiled with rustc is omitting the base information for the inlined function if it is only ever inlined. When I compile an equivalent C++ program using cl and use llvm-readobj -codeview on the object file, I see (irrelevant fields ommitted):

      FuncId (0x1919) {
        TypeLeafKind: LF_FUNC_ID (0x1601)
        ParentScope: 0x0
        FunctionType: unsigned __int64 () (0x164C)
        Name: compute_primes
      }
    ...
    CodeViewDebugInfo [
      Subsection [
        SubSectionType: Symbols (0xF1)
        ProcStart {
          Kind: S_LPROC32_ID (0x1146)
          FunctionType: compute_primes (0x1919)
          CodeOffset: ?compute_primes@@YA_KXZ+0x0
          Flags [ (0x80)
            HasOptimizedDebugInfo (0x80)
          ]
          DisplayName: compute_primes
          LinkageName: ?compute_primes@@YA_KXZ
        }
    ...
    ]
    CodeViewDebugInfo [
      Subsection [
        SubSectionType: Symbols (0xF1)
        ProcStart {
          Kind: S_GPROC32_ID (0x1147)
          FunctionType: thread_some_primes (0x191A)
          CodeOffset: ?thread_some_primes@@YAXXZ+0x0
          Flags [ (0x80)
            HasOptimizedDebugInfo (0x80)
          ]
          DisplayName: thread_some_primes
          LinkageName: ?thread_some_primes@@YAXXZ
        }
        InlineSite {
          Inlinee: compute_primes (0x1919)
          BinaryAnnotations [
            ChangeCodeOffsetAndLineOffset: {CodeOffset: 0x0, LineOffset: 0}
            ChangeLineOffset: 1
    ...
    

    Specifically that the FuncType 0x1919 has its own CodeViewDebugInfo; later when it's inlined, the BinaryAnnotations then have something to modify.

    From rustc (and, actually, clang-cl -- I'm pretty sure that this is a llvm bug):

      FuncId (0x10B4) {
        TypeLeafKind: LF_FUNC_ID (0x1601)
        ParentScope: primes (0x10B2)
        FunctionType: __int64 () (0x10B3)
        Name: compute_primes
      }
    ...
    CodeViewDebugInfo [
      Subsection [
        SubSectionType: InlineeLines (0xF6)
        InlineeSourceLine {
          Inlinee: compute_primes (0x10B4)
          FileID: C:\proj\r\test-msvc\primes.rs (0x30)
          SourceLineNum: 3
        }
    ...
    CodeViewDebugInfo [
      Subsection [
        SubSectionType: Symbols (0xF1)
        ProcStart {
          Kind: S_LPROC32_ID (0x1146)
          FunctionType: do_call (0x11EE)
          CodeOffset: _ZN3std9panicking3try7do_call17h7aa81a0e201766daE+0x0
          DisplayName: std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64>
          LinkageName: _ZN3std9panicking3try7do_call17h7aa81a0e201766daE
        }
    ...
        InlineSite {
    >     Inlinee: compute_primes (0x10B4)
          BinaryAnnotations [
            ChangeLineOffset: 6
            ChangeCodeOffset: 0x20
            ChangeCodeLength: 0x6
            ChangeLineOffset: 1
            ChangeCodeOffset: 0x10
            ChangeLineOffset: 1
            ChangeCodeOffset: 0x12
    ...
        }
        Local {
          Type: __int64 (0x13)
          Flags [ (0x0) ]
          VarName: a
        }
    ...
        Local {
          Type: __int64 (0x13)
          Flags [ (0x100) IsOptimizedOut (0x100)
          ]
          VarName: count
        }
    ...
    

    I don't know in detail how CV is supposed to work; but here, the function (our std::panicking::try::do_call) has an InlineSite for compute_primes with appropriate line/code offsets (that reference the base InlineeSourceLines compute_primes). But the piece that's missing, which is what I think is causing the issues, is the actual Symbols subsection for compute_primes. This is only a guess though; from the above, everything that the debugger needs could be present (it should be able to figure out that it's in the do_call, and from there look at that function's debug info and see that it's in the InlineeSite), assuming that the Change... annotations are taken to mean offsets from the start of the inline site.

    Vladimir Vukicevic at 2016-10-25 14:42:14

  11. Yeah, I think that's it -- https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L764 even has:

      // Emit a symbol subsection, required by VS2012+ to find function boundaries.
      OS.AddComment("Symbol subsection for " + Twine(FuncName));
      MCSymbol *SymbolsEnd = beginCVSubsection(ModuleSubstreamKind::Symbols);
    

    Vladimir Vukicevic at 2016-10-25 14:47:47

  12. So I tired to trick the compiler into emitting a non-inlined version of compute_primes by adding:

        println!("{:?}", compute_primes as *const u8);
    

    This causes the non-inlined version to shop up in the .ll, and a Symbol section for it in the obj file. From dia2dump:

    Function       : static, [00001950][0001:00000950], len = 00000077, primes::compute_primes
                     Function attribute:
                     Function info:
    FuncDebugStart :   static, [00001950][0001:00000950]
    FuncDebugEnd   :   static, [00001950][0001:00000950]
    

    However, even though the final .exe/.pdb has primes::compute_primes, it doesn't have it as an InlineSite at all! From dia2dump:

    Function       : static, [000011D0][0001:000001D0], len = 00000077, std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64>
    FuncDebugStart :   static, [000011D0][0001:000001D0]
    FuncDebugEnd   :   static, [000011D0][0001:000001D0]
    InlineSite     :   (std::panic::{{impl}}::call_once) has type function 
    InlineSite     :   (core::ptr::write) has type function 
    

    If I remove the println!, the final PDB file doesn't mention compute_primes at all -- there's no mention of that string in the PDB file. The two do_calls look exactly the same as above. So, it's not enough to just get the Symbol section to be emitted.. it seems to need to be referenced somehow to make sure that the linker doesn't just nuke it. dia2dump of a cl-built .exe might have a clue:

    Function       : static, [000064C0][0001:000054C0], len = 0000005F, compute_primes
                     Function attribute:
                     Function info: asyncheh wasinlined
    FuncDebugStart :   static, [000064C0][0001:000054C0]
    FuncDebugEnd   :   static, [0000651E][0001:0000551E]
    
    Function       : static, [00006540][0001:00005540], len = 00000066, int __cdecl thread_some_primes(void)
                     Function attribute:
                     Function info: asyncheh
    FuncDebugStart :   static, [00006540][0001:00005540]
    FuncDebugEnd   :   static, [000065A5][0001:000055A5]
    InlineSite     :   (compute_primes) has type function 
    

    Note the wasinlined in compute_primes, and that it's missing in the function info from our original pdb dump.

    Vladimir Vukicevic at 2016-10-25 15:34:11

  13. This comes from a FrameProc entry that's part of the Symbol, which LLVM doesn't seem to emit at all right now (dump from cl-built file):

        FrameProc {
          TotalFrameBytes: 0x28
          PaddingFrameBytes: 0x0
          OffsetToPadding: 0x0
          BytesOfCalleeSavedRegisters: 0x0
          OffsetOfExceptionHandler: 0x0
          SectionIdOfExceptionHandler: 0x0
          Flags [ (0x114A00)
            AsynchronousExceptionHandling (0x200)
            Inlined (0x800)
            OptimizedForSpeed (0x100000)
          ]
    

    So the fix for this is (probably?) going to be inside LLVM, in https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp. We already properly have a DISubprogram for compute_primes emitted, so the CodeView debug output needs to realize that it's being inlined and output S_FRAMEPROC with the right flags.

    Vladimir Vukicevic at 2016-10-25 15:56:04

  14. So a quick hack with this patch to llvm: https://gist.github.com/vvuk/0112ab429e306d504f546d6a55a7bb82 and with the println! hack in the code gets me a resulting pdb file that has the frameproc with the "wasinlined" flag set. However, the InlineSite is just missing from the output. It's in the obj file for one of the do_calls: (via llvm-dumpobj -codeview)

        InlineSite { Inlinee: call_once (0x10C3) }
        InlineSite { Inlinee: {{closure}} (0x10BB) }
        InlineSite { Inlinee: compute_primes (0x10B4) }
        InlineSiteEnd { }
        InlineSiteEnd { }
        InlineSiteEnd { }
        InlineSite { Inlinee: write (0x10C7) }
        InlineSiteEnd { }
    

    but after the link, the resulting binary's pdb file just has:

    Function       : static, [000011D0][0001:000001D0], len = 00000077, std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,i64>
    InlineSite     :   (std::panic::{{impl}}::call_once) has type function 
    InlineSite     :   (core::ptr::write) has type function 
    

    I'm stumped a to why it's being dropped. Do we have any CodeView experts or even MS folks we could ask for suggestions?

    Vladimir Vukicevic at 2016-10-26 03:11:54

  15. Just discovered https://github.com/Microsoft/microsoft-pdb -- and the included cvdump.exe that's a lot more detailed than dia2dump.exe. The InlineSites are present -- they are just nested and dia2dump only prints the topmost. But I think that the nesting may be incorrect.

    Vladimir Vukicevic at 2016-10-26 03:27:48

  16. Okay! With this LLVM patch: https://gist.github.com/vvuk/42bc1825dac505cad499edadc582e5b2 I get proper results: https://i.imgur.com/ZwGGlno.png https://i.imgur.com/V4r7dEk.png . The patch makes the InlineSites not nest in the CodeView output. It also does the FrameProc hack, but that may not be needed; going to do a build without it shortly.

    This however only works if I include the println! hack to make sure a non-inline version of compute_primes is emitted. So the LLVM fix is not complete; we still need to have rustc emit a non-inlined version of a function if it was only ever inlined. This will cause code bloat, but maybe that's okay for full debugging. If anyone has any thoughts on how to do that, they'd be welcomed.

    Not sure what the next steps are here -- if the FrameProc hack is not needed, I can submit the un-nesting to LLVM upstream and see what they say?

    Vladimir Vukicevic at 2016-10-26 15:50:11

  17. Both FrameProc and the non-nested InlineSites are needed.

    Vladimir Vukicevic at 2016-10-26 17:18:25

  18. Is this still an issue?

    konstantinua00 at 2021-08-18 14:12:08