Vtable access is not hoisted out of a tight loop

73be5a3
Opened by dpc at 2023-10-08 20:50:27

This is a followup of #39992 .

This Rust snippet:

#![crate_type="lib"]

use std::sync::Arc;

pub struct Wrapper {
    x: usize,
    y: usize,
    t: Arc<dyn Foo>,
}

pub trait Foo {
    fn foo(&self);
}

pub fn test(foo: Wrapper) {
    for _ in 0..200 {
        foo.t.foo();
    }
}

compiles to this tight loop:

.LBB1_1:
	incl	%ebp
	cmpl	$200, %ebp
	jge	.LBB1_2
	movq	16(%rbx), %rdi
	movq	24(%rbx), %rax
	leaq	15(%rdi), %rcx
	negq	%rdi
	andq	%rcx, %rdi
	addq	%r12, %rdi
.Ltmp12:
	callq	*%rax
.Ltmp13:
	jmp	.LBB1_1

whereas it seems to me the vtable access (the per-iteration load of the function pointer from 24(%rbx)) should be hoisted out of the loop, and doing so shouldn't be hard.
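One source-level workaround (my sketch, not from the issue) is to borrow the trait object once before the loop, so the fat pointer is loaded a single time regardless of what LLVM does. The `Counter` impl is hypothetical, added only to demonstrate the call count; the rest mirrors the report, with `dyn` spelled out for current editions:

```rust
use std::cell::Cell;
use std::sync::Arc;

pub trait Foo {
    fn foo(&self);
}

pub struct Wrapper {
    pub x: usize,
    pub y: usize,
    pub t: Arc<dyn Foo>,
}

// Borrow the trait object once before the loop: the fat pointer
// (data pointer + vtable pointer) is loaded a single time, so the
// per-iteration vtable load disappears at the source level.
pub fn test_hoisted(foo: Wrapper) {
    let t: &dyn Foo = &*foo.t;
    for _ in 0..200 {
        t.foo();
    }
}

// A hypothetical impl, used only to exercise the function.
pub struct Counter(pub Cell<usize>);

impl Foo for Counter {
    fn foo(&self) {
        self.0.set(self.0.get() + 1);
    }
}
```

This only sidesteps the missed optimization; the issue is about the compiler doing the hoist itself.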

  1. Strangely, changing the signature to take &Wrapper does hoist the vtable access:

    .LBB0_1:
    	inc	ebp
    	mov	rdi, rbx
    	call	r14
    	cmp	ebp, 200
    	jl	.LBB0_1
    

    Hanna Kruppe at 2017-03-07 10:41:25
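For reference, the by-reference variant discussed above, restated as a self-contained snippet (the `Counter` impl is mine, for demonstration; the rest mirrors the report, with `dyn` spelled out for current editions):

```rust
use std::cell::Cell;
use std::sync::Arc;

pub trait Foo {
    fn foo(&self);
}

pub struct Wrapper {
    pub x: usize,
    pub y: usize,
    pub t: Arc<dyn Foo>,
}

// Taking the wrapper by reference: with `&Wrapper`, the vtable load
// is hoisted out of the loop, as the assembly above shows.
pub fn test(foo: &Wrapper) {
    for _ in 0..200 {
        foo.t.foo();
    }
}

// A hypothetical impl, used only to check the call count.
pub struct Counter(pub Cell<usize>);

impl Foo for Counter {
    fn foo(&self) {
        self.0.set(self.0.get() + 1);
    }
}
```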

  2. Looks like the load is hoisted on nightly (but not beta):

    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
    	addl	$1, %ebp
    	cmpl	$200, %ebp
    	jae	.LBB2_2
    # %bb.4:                                #   in Loop: Header=BB2_1 Depth=1
    	movq	%rbx, %rdi
    	callq	*%r13
    	jmp	.LBB2_1
    

    The &Wrapper variant still generates slightly better code (the counter counts down to zero, so the separate compare against 200 disappears):

    
    .LBB0_1:                                # =>This Inner Loop Header: Depth=1
    	movq	%rbx, %rdi
    	callq	*%r14
    	addl	$-1, %ebp
    	jne	.LBB0_1
    

    Nikita Popov at 2018-12-02 20:04:48