Move oom test out of line with -C panic=abort

7758951
Opened by Jeff Muizelaar at 2020-06-10 20:52:13

The following code:

#[inline(never)]
pub fn foo() -> Vec<u8> {
  vec![0; 900]
}

compiles to

example::foo:
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 40
        mov     rbx, rdi
        lea     rdx, [rbp - 32]
        mov     edi, 900
        mov     esi, 1
        call    __rust_alloc_zeroed@PLT
        test    rax, rax
        je      .LBB0_1
        mov     qword ptr [rbx], rax
        mov     qword ptr [rbx + 8], 900
        mov     qword ptr [rbx + 16], 900
        mov     rax, rbx
        add     rsp, 40
        pop     rbx
        pop     rbp
        ret
.LBB0_1:
        mov     rax, qword ptr [rbp - 32]
        movups  xmm0, xmmword ptr [rbp - 24]
        movaps  xmmword ptr [rbp - 48], xmm0
        mov     qword ptr [rbp - 32], rax
        movaps  xmm0, xmmword ptr [rbp - 48]
        movups  xmmword ptr [rbp - 24], xmm0
        lea     rdi, [rbp - 32]
        call    __rust_oom@PLT
        ud2

Moving the OOM handling into __rust_alloc_zeroed when compiling with panic=abort would be a good code-size improvement. It would also make the branch on null better predicted, because it would live in one spot instead of being duplicated all around the code base.
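A minimal sketch of the proposed shape, assuming panic=abort semantics: the null check lives once in an out-of-line wrapper, so callers get back a pointer they can use unconditionally. The name alloc_zeroed_or_abort is illustrative, not a real liballoc symbol.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

// Hypothetical out-of-line shim: the branch on null is emitted once,
// here, rather than being duplicated into every caller.
#[inline(never)]
fn alloc_zeroed_or_abort(layout: Layout) -> *mut u8 {
    let ptr = unsafe { alloc_zeroed(layout) };
    if ptr.is_null() {
        // Diverges; with panic=abort this is a plain abort.
        handle_alloc_error(layout);
    }
    ptr
}

fn main() {
    let layout = Layout::from_size_align(900, 1).unwrap();
    // The caller has no local OOM branch to codegen.
    let ptr = alloc_zeroed_or_abort(layout);
    assert!(!ptr.is_null());
    unsafe { dealloc(ptr, layout) };
}
```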

  1. This seems like it would apply not only to oom=abort, and it would involve rethinking the #[inline] attributes on functions in liballoc. Overall, though, it seems weird to me that the __rust_oom path has so much code associated with it. I would expect it to be little more than a call to the function, not what it currently is.

    cc @alexcrichton @pnkfelix (since both of you worked on allocators)

    Simonas Kazlauskas at 2017-12-15 12:23:55

  2. After adding #![crate_type = "lib"] and running rustc +nightly -C panic=abort -O a.rs --emit=asm with rustc 1.28.0-nightly (b68432d56 2018-06-12), I get:

            .text
            .file   "a0-8787f43e282added376259c1adb08b80.rs"
            .section        .text._ZN1a3foo17h55c63ef928c74adbE,"ax",@progbits
            .globl  _ZN1a3foo17h55c63ef928c74adbE
            .p2align        4, 0x90
            .type   _ZN1a3foo17h55c63ef928c74adbE,@function
    _ZN1a3foo17h55c63ef928c74adbE:
            pushq   %rbx
            movq    %rdi, %rbx
            movl    $900, %edi
            movl    $1, %esi
            callq   __rust_alloc_zeroed@PLT
            testq   %rax, %rax
            je      .LBB0_1
            movq    %rax, (%rbx)
            movq    $900, 8(%rbx)
            movq    $900, 16(%rbx)
            movq    %rbx, %rax
            popq    %rbx
            retq
    .LBB0_1:
            movl    $900, %edi
            movl    $1, %esi
            callq   _ZN5alloc5alloc3oom17h772d6152ef20ca43E@PLT
            ud2
    .Lfunc_end0:
            .size   _ZN1a3foo17h55c63ef928c74adbE, .Lfunc_end0-_ZN1a3foo17h55c63ef928c74adbE
    
    
            .section        ".note.GNU-stack","",@progbits
    

    This seems smaller, but I’m not good at reading assembly. @jrmuizel, is this still an issue? What is it that should change?

    Simon Sapin at 2018-06-13 17:02:24

  3. Nope. The test for null and the call to alloc::oom should be moved out of line, i.e. allocation should just be a single call that either returns successfully or throws. We don't want the control flow that handles OOM to leak into callers.

    Jeff Muizelaar at 2018-06-14 19:47:56

  4. OOM has been changing a lot recently, but @SimonSapin's post matches what I currently see as well. It's not clear how to easily improve this, in the sense that we can't do it at the base layer (__rust_alloc_zeroed itself) because that is also the foundation for fallible allocation. Instead we may wish to specialize Vec itself with #[inline(never)] functions that allocate and internally check for null, always returning a non-null pointer.

    This is unfortunately collection-specific, however, and would need separate treatment for BTreeMap/HashMap/Box/etc.
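    A rough sketch of that collection-level approach, with a hypothetical helper name (alloc_zeroed_buf is not the real RawVec internals): the #[inline(never)] function checks for null internally and returns NonNull, so the OOM branch never inlines into Vec's callers.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

// Hypothetical RawVec-style helper: allocates and checks for null in
// one out-of-line function, so callers stay branch-free.
#[inline(never)]
fn alloc_zeroed_buf(len: usize) -> NonNull<u8> {
    let layout = Layout::array::<u8>(len).unwrap();
    let ptr = unsafe { alloc_zeroed(layout) };
    match NonNull::new(ptr) {
        // Returning NonNull encodes "allocation cannot observably
        // fail" in the type.
        Some(p) => p,
        None => handle_alloc_error(layout),
    }
}

fn main() {
    let buf = alloc_zeroed_buf(900);
    // The caller sees a NonNull with no local OOM branch.
    unsafe { dealloc(buf.as_ptr(), Layout::array::<u8>(900).unwrap()) };
}
```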

    Alex Crichton at 2018-06-28 18:05:12

  5. There is discussion at https://internals.rust-lang.org/t/pre-rfc-changing-the-alloc-trait/7487 of infallible allocation APIs that could be used in all collections. They could be marked #[inline(never)] to keep the null check out of line, but this would add another level of call nesting rather than replacing __rust_alloc_zeroed.

    Is that function call overhead preferable to duplicating the null check? (__rust_alloc_zeroed is inlined when using LTO, but there would still be an out-of-line infallible_something calling out-of-line calloc.)

    Simon Sapin at 2018-06-28 22:10:52

  6. This now looks like

    .LCPI0_0:
            .quad   900
            .quad   900
    example::foo:
            push    rbx
            mov     rbx, rdi
            mov     edi, 900
            mov     esi, 1
            call    qword ptr [rip + __rust_alloc_zeroed@GOTPCREL]
            test    rax, rax
            je      .LBB0_1
            mov     qword ptr [rbx], rax
            movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
            movups  xmmword ptr [rbx + 8], xmm0
            mov     rax, rbx
            pop     rbx
            ret
    .LBB0_1:
            mov     edi, 900
            mov     esi, 1
            call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
            ud2
    

    Steve Klabnik at 2019-12-12 13:33:08

  7. I played around with #[inline(never)] in raw_vec.rs; it fixed the small test case here but didn't have much effect on a larger program. I think it would be good to get a better feel for how big a problem this is in larger programs. It would be good to have a tool that can compute how much binary code comes from raw_vec.rs etc. to get a true feel for the possible impact.
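
    One way to approximate such a tool: aggregate per-symbol sizes from nm -S --size-sort output and sum the symbols whose (mangled) names mention a module of interest. A self-contained sketch, with illustrative sample input rather than output from a real binary:

```rust
// Sum the sizes of .text symbols whose name contains `needle`,
// given output in `nm -S` format: "<addr> <size> <type> <symbol>".
fn total_text_size(nm_output: &str, needle: &str) -> u64 {
    nm_output
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let _addr = parts.next()?;
            // Sizes are printed in hex without a 0x prefix.
            let size = u64::from_str_radix(parts.next()?, 16).ok()?;
            let kind = parts.next()?;
            let sym = parts.next()?;
            // 't'/'T' mark local/global .text symbols.
            (matches!(kind, "t" | "T") && sym.contains(needle)).then_some(size)
        })
        .sum()
}

fn main() {
    // Illustrative sample; real input would come from `nm -S binary`.
    let sample = "\
0000000000001000 0000000000000040 T _ZN5alloc7raw_vec4grow17h1111111111111111E
0000000000001040 0000000000000020 T _ZN4core3fmt5write17h2222222222222222E";
    // Only the raw_vec symbol matches: 0x40 = 64 bytes.
    println!("{}", total_text_size(sample, "raw_vec")); // prints 64
}
```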

    Jeff Muizelaar at 2020-06-10 19:25:14

  8. > It would be good to have a tool that can compute how much binary code comes from raw_vec.rs etc.

    https://github.com/dtolnay/cargo-llvm-lines can measure the number of LLVM IR lines by function. But it does so on unoptimized IR, meant as a proxy for estimating compile times. I assume that here we’re interested in code size that remains after optimizations.

    Simon Sapin at 2020-06-10 20:52:13