Trait object calls not devirtualized

cdf3f69
Opened by Jeff Muizelaar at 2024-12-21 05:41:25
trait A {
    fn k(&self, x: i32) -> i32;
}

pub struct O {
}

pub struct P {
}

impl A for O {
    fn k(&self, x: i32) -> i32 {
        x
    }
}

impl A for P {
    fn k(&self, x: i32) -> i32 {
        x
    }
}

pub enum R {
    P(P),
    O(O),
}
impl R {
    fn inner(&self) -> &A {
        match self {
            &R::P(ref p) => p,
            &R::O(ref o) => o,
        }
    }
    pub fn k(&self, x: i32) -> i32 {
        match self {
            &R::P(ref p) => p.k(x),
            &R::O(ref o) => o.k(x),
        }
    }

    pub fn j(&self, x: i32) -> i32 {
        self.inner().k(x)
    }
}

compiles to:

core::ptr::drop_in_place:
        push    rbp
        mov     rbp, rsp
        pop     rbp
        ret

core::ptr::drop_in_place:
        push    rbp
        mov     rbp, rsp
        pop     rbp
        ret

<example::O as example::A>::k:
        push    rbp
        mov     rbp, rsp
        mov     eax, esi
        pop     rbp
        ret

<example::P as example::A>::k:
        push    rbp
        mov     rbp, rsp
        mov     eax, esi
        pop     rbp
        ret

example::R::k:
        push    rbp
        mov     rbp, rsp
        mov     eax, esi
        pop     rbp
        ret

example::R::j:
        push    rbp
        mov     rbp, rsp
        cmp     byte ptr [rdi], 0
        lea     rdi, [rdi + 1]
        lea     rax, [rip + vtable.1]
        lea     rcx, [rip + vtable.0]
        cmove   rcx, rax
        pop     rbp
        jmp     qword ptr [rcx + 24]

vtable.0:
        .quad   core::ptr::drop_in_place
        .quad   0
        .quad   1
        .quad   <example::O as example::A>::k

vtable.1:
        .quad   core::ptr::drop_in_place
        .quad   0
        .quad   1
        .quad   <example::P as example::A>::k

Notice how much worse R::j() is than R::k().

  1. It's not really expected that non-generic trait object calls will be devirtualized -- if you want guarantees on that, you'll need to use generics. However, this does seem like a case where we can possibly do better.

    cc @arielb1 -- possible LLVM issue

    Mark Rousskov at 2017-11-05 01:26:55

  2. How would you write j() with generics?

    Jeff Muizelaar at 2017-11-05 02:43:46

  3. I think LLVM just never does this sort of split inlining - it never inlines a function unless it knows the exact call target, and it tends to merge, rather than split, similar code paths. Actually using generics to write j without trait objects would require a closure that is generic over types (basically F: for<T: A> FnOnce(&T) -> R), which today's Rust closures don't support, so it requires an ugly manual trait implementation instead.

    I would either duplicate the code by hand or make inner a macro, as in:

    #![crate_type="rlib"]
    
    trait A {
        fn k(&self, x: i32) -> i32;
    }
    
    pub struct O {
    }
    
    pub struct P {
    }
    
    impl A for O {
        fn k(&self, x: i32) -> i32 {
            x
        }
    }
    
    impl A for P {
        fn k(&self, x: i32) -> i32 {
            x
        }
    }
    
    pub enum R {
        P(P),
        O(O),
    }
    
    // hack to work around the lack of generic closures, we don't
    // want to use a trait object here because performance is a
    // problem.
    macro_rules! with_inner {
        ($this:expr, $var:ident => $body:expr) => {
            match $this {
                &R::P(ref $var) => { $body }
                &R::O(ref $var) => { $body }
            }
        }
    }
    
    impl R {
        pub fn k(&self, x: i32) -> i32 {
            match self {
                &R::P(ref p) => p.k(x),
                &R::O(ref o) => o.k(x),
            }
        }
    
        pub fn j(&self, x: i32) -> i32 {
            with_inner!(self, a => a.k(x))
        }
    }
    
    // Marker to allow seeing the assembly easily
    #[no_mangle]
    pub fn j_exported(r: &R, x: i32) -> i32 { r.j(x) }
    

    Ariel Ben-Yehuda at 2017-11-05 11:44:18
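
    [Editor's note] The "ugly manual trait implementation" alluded to in the comment above can be sketched roughly like this. The `WithA` and `CallK` names are illustrative assumptions, not from the thread; the idea is to emulate a closure that is generic over `T: A` with a hand-written trait, so every call site is monomorphized and no trait object is involved:

    ```rust
    trait A {
        fn k(&self, x: i32) -> i32;
    }

    pub struct O;
    pub struct P;

    impl A for O {
        fn k(&self, x: i32) -> i32 { x }
    }
    impl A for P {
        fn k(&self, x: i32) -> i32 { x }
    }

    pub enum R {
        P(P),
        O(O),
    }

    // Stand-in for a generic closure: a trait whose method is generic
    // over T: A. Each use is a concrete type implementing WithA.
    trait WithA {
        type Out;
        fn call<T: A>(self, t: &T) -> Self::Out;
    }

    impl R {
        // One monomorphized call per variant -- no &dyn A, no vtable.
        fn with_inner<F: WithA>(&self, f: F) -> F::Out {
            match self {
                R::P(p) => f.call(p),
                R::O(o) => f.call(o),
            }
        }

        pub fn j(&self, x: i32) -> i32 {
            // The "closure state" (captured x) lives in a small struct.
            struct CallK(i32);
            impl WithA for CallK {
                type Out = i32;
                fn call<T: A>(self, t: &T) -> i32 { t.k(self.0) }
            }
            self.with_inner(CallK(x))
        }
    }

    fn main() {
        assert_eq!(R::P(P).j(7), 7);
        assert_eq!(R::O(O).j(7), 7);
    }
    ```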

  4. Triage: the codegen has changed, but the two functions are still different:

    playground::R::k: # @playground::R::k
    # %bb.0:
    	movl	%esi, %eax
    	retq
                                            # -- End function
    
    playground::R::j: # @playground::R::j
    # %bb.0:
    	cmpb	$1, (%rdi)
    	leaq	.Lanon.7823f3966c6de2987f3ac86f05619038.0(%rip), %rax
    	leaq	.Lanon.7823f3966c6de2987f3ac86f05619038.1(%rip), %rcx
    	cmoveq	%rax, %rcx
    	addq	$1, %rdi
    	jmpq	*24(%rcx)               # TAILCALL
                                            # -- End function
    

    Steve Klabnik at 2019-09-16 12:46:14

  5. On 1.39.0 (with -Copt-level=3) the codegen appears to do the right thing according to godbolt.org.

    The output ASM is:

    j_exported:
            mov     eax, esi
            ret
    

    for the above example


    This does not reproduce on the playground because, I believe, you cannot change the optimization level there.

    Cody Laeder at 2019-12-09 19:22:40

  6. @valarauca That example was showcasing how to avoid the problem in the first place. The original code still produces an indirect call.

    Jonas Schievink at 2019-12-09 20:55:23

  7. Things have changed a bit since the bug was opened (on 1.21.0, I believe) and since steveklabnik's last update. Here's the godbolt output for 1.50 with -C opt-level=3.

    <example::O as example::A>::k:
            mov     eax, esi
            ret
    
    example::R::k:
            mov     eax, esi
            ret
    
    example::R::j:
            add     rdi, 1
            jmp     qword ptr [rip + <example::O as example::A>::k@GOTPCREL]
    

    <P as A>::k seems to have been totally eliminated on account of being identical to <O as A>::k, but R::j is still a bit weird. The add was presumably left behind from a previous version of the code, and a jmp was used instead of <O as A>::k being inlined.

    The above is not great, but because of the elimination of <P as A>::k by recent compiler versions, I think the code also no longer represents the problem very well. Making <O as A>::k and <P as A>::k have different code gives a better example. With the latter returning x + 10, you get (godbolt):

    <example::O as example::A>::k:
            mov     eax, esi
            ret
    
    <example::P as example::A>::k:
            lea     eax, [rsi + 10]
            ret
    
    example::R::k:
            lea     eax, [rsi + 10]
            cmp     byte ptr [rdi], 1
            cmove   eax, esi
            ret
    
    example::R::j:
            mov     rax, rdi
            add     rdi, 1
            cmp     byte ptr [rax], 1
            jne     .LBB3_1
            mov     rax, qword ptr [rip + <example::O as example::A>::k@GOTPCREL]
            jmp     rax
    .LBB3_1:
            mov     rax, qword ptr [rip + <example::P as example::A>::k@GOTPCREL]
            jmp     rax
    

    In the above case, it seems like it should have kept going with optimization -- inlining the calls and eventually optimizing down to the same thing as R::k, ideally.

    On the other hand, in some cases, I might even hope for LLVM to replace the match with a virtualized call, depending on the number of arms and the complexity of the A::k implementations, etc. The match happens to be potentially very profitable in the example case, because the A::k implementations are few and extremely simple, so inlining them would be a no-brainer. LLVM doesn't quite manage to get there though.

    Tyson Nottingham at 2021-02-20 00:26:03
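
[Editor's note] The modified source behind the second godbolt listing above is not shown in the thread; it can be reconstructed as follows (a sketch: `dyn` is added for current Rust, and the structs are written as unit structs for brevity):

```rust
trait A {
    fn k(&self, x: i32) -> i32;
}

pub struct O;
pub struct P;

impl A for O {
    fn k(&self, x: i32) -> i32 { x }
}

impl A for P {
    // Different body, so the two impls can no longer be deduplicated.
    fn k(&self, x: i32) -> i32 { x + 10 }
}

pub enum R {
    P(P),
    O(O),
}

impl R {
    fn inner(&self) -> &dyn A {
        match self {
            R::P(p) => p,
            R::O(o) => o,
        }
    }

    // Direct match on the variant: both call targets are known,
    // so this compiles to branchless code (the R::k listing above).
    pub fn k(&self, x: i32) -> i32 {
        match self {
            R::P(p) => p.k(x),
            R::O(o) => o.k(x),
        }
    }

    // Trait-object call: still compiles to an indirect tail jump
    // (the R::j listing above).
    pub fn j(&self, x: i32) -> i32 {
        self.inner().k(x)
    }
}

fn main() {
    assert_eq!(R::O(O).k(5), 5);
    assert_eq!(R::P(P).k(5), 15);
    assert_eq!(R::P(P).j(5), 15);
}
```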