Move oom test out of line with -C panic=abort
The following code:
#[inline(never)]
pub fn foo() -> Vec<u8> {
vec![0; 900]
}
compiles to
example::foo:
push rbp
mov rbp, rsp
push rbx
sub rsp, 40
mov rbx, rdi
lea rdx, [rbp - 32]
mov edi, 900
mov esi, 1
call __rust_alloc_zeroed@PLT
test rax, rax
je .LBB0_1
mov qword ptr [rbx], rax
mov qword ptr [rbx + 8], 900
mov qword ptr [rbx + 16], 900
mov rax, rbx
add rsp, 40
pop rbx
pop rbp
ret
.LBB0_1:
mov rax, qword ptr [rbp - 32]
movups xmm0, xmmword ptr [rbp - 24]
movaps xmmword ptr [rbp - 48], xmm0
mov qword ptr [rbp - 32], rax
movaps xmm0, xmmword ptr [rbp - 48]
movups xmmword ptr [rbp - 24], xmm0
lea rdi, [rbp - 32]
call __rust_oom@PLT
ud2
It would be a good code size improvement for the oom handling to be moved into __rust_alloc_zeroed when compiling with panic=abort. This will also make the branch on null better predicted because it's not duplicated all around the code base, but in just one spot.
This seems like it would apply to not-only oom=abort and would involve rethinking the
#[inline]on functions in liballoc. But overall it seems weird to me that the __rust_oom path has so much code related to it. I would expect it to be pretty much just the call of the function, not what it currently is.cc @alexcrichton @pnkfelix (since both of you worked on allocators)
Simonas Kazlauskas at 2017-12-15 12:23:55
Adding
#![crate_type = "lib"], and runningrustc +nightly -C panic=abort -O a.rs --emit=asmwith rustc 1.28.0-nightly (b68432d56 2018-06-12), I get:.text .file "a0-8787f43e282added376259c1adb08b80.rs" .section .text._ZN1a3foo17h55c63ef928c74adbE,"ax",@progbits .globl _ZN1a3foo17h55c63ef928c74adbE .p2align 4, 0x90 .type _ZN1a3foo17h55c63ef928c74adbE,@function _ZN1a3foo17h55c63ef928c74adbE: pushq %rbx movq %rdi, %rbx movl $900, %edi movl $1, %esi callq __rust_alloc_zeroed@PLT testq %rax, %rax je .LBB0_1 movq %rax, (%rbx) movq $900, 8(%rbx) movq $900, 16(%rbx) movq %rbx, %rax popq %rbx retq .LBB0_1: movl $900, %edi movl $1, %esi callq _ZN5alloc5alloc3oom17h772d6152ef20ca43E@PLT ud2 .Lfunc_end0: .size _ZN1a3foo17h55c63ef928c74adbE, .Lfunc_end0-_ZN1a3foo17h55c63ef928c74adbE .section ".note.GNU-stack","",@progbitsThis seems smaller, but I’m not good at reading assembly. @jrmuizel, is this still an issue? What is it that should change?
Simon Sapin at 2018-06-13 17:02:24
Nope. The test for null and the call to alloc::oom should be moved out of line. i.e. allocation should just be a single call that either returns successfully or throws. We don't want to the control flow to handle OOM to leak into callers.
Jeff Muizelaar at 2018-06-14 19:47:56
OOM has been changing a lot recently but @SimonSapin's post matches what I currently see as well. It's not clear how to easily improve this in the sense that we can't do this at the base layer (
__rust_alloc_zeroeditself) as it's also the foundation for fallible allocation. Instead we may wish to specializeVecitself with#[inline(never)]functions that allocate an internally check for null, always returning a non-null pointer.This is unfortunately collection-specific, however, and would need separate treatment for BTreeMap/HashMap/Box/etc.
Alex Crichton at 2018-06-28 18:05:12
There is discussion at https://internals.rust-lang.org/t/pre-rfc-changing-the-alloc-trait/7487 for infallible allocation APIs that could be used in all collections. They could be marked
#[inline(never)]to keep the null check out of line but this would be an additional call nesting level, not instead of__rust_alloc_zeroed.Is that function call overhead preferable to duplicating the null check? (
__rust_alloc_zeroedis inlined when using LTO, but there would still be an out-of-lineinfallible_somethingcalling out-of-linecalloc.)Simon Sapin at 2018-06-28 22:10:52
This now looks like
.LCPI0_0: .quad 900 .quad 900 example::foo: push rbx mov rbx, rdi mov edi, 900 mov esi, 1 call qword ptr [rip + __rust_alloc_zeroed@GOTPCREL] test rax, rax je .LBB0_1 mov qword ptr [rbx], rax movaps xmm0, xmmword ptr [rip + .LCPI0_0] movups xmmword ptr [rbx + 8], xmm0 mov rax, rbx pop rbx ret .LBB0_1: mov edi, 900 mov esi, 1 call qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL] ud2Steve Klabnik at 2019-12-12 13:33:08
I planed around with #[inline(never)] in raw_vec.rs, it fixed the small test case here but didn't have much affect on a larger program. I think it would be good to get a better feel for how big a problem this is in larger programs. It would be good to to have a tool that can compute how much binary code comes from raw_vec.rs etc to get a true feel for the possible impact.
Jeff Muizelaar at 2020-06-10 19:25:14
It would be good to to have a tool that can compute how much binary code
https://github.com/dtolnay/cargo-llvm-lines can measure the number of LLVM IR lines by function. But it does so on unoptimized IR, meant as a proxy for estimating compile times. I assume that here we’re interested in code size that remains after optimizations.
Simon Sapin at 2020-06-10 20:52:13