optimisation opportunity missed due to assert!

d67123b
Opened by Joe at 2023-10-08 20:50:32

I have noticed interesting behaviour regarding assert! which was unexpected. I am not sure if this is a LLVM or Rust issue.

in the following code (See godbolt link below) I have a vec like structure OwnedRegion. I want to iterate over the region (implemented as slice::IterMut) and populate an array. I expect LLVM to do loop unrolling and loop vectorisation. If you place the necessary intrinsics::assume to trick llvm into removing the bound check you do indeed get the optimisations. However I wanted to replace the assume with an assert! but that removes the optimisation even if combined with an assume! What I think is happening is the bounds check is being reintroduced because the length check in the assert! is being moved into the for loop or elided completely because it duplicates the length check in the index operation.

What should happen is the bounds check in the for loop should be lifted/elided because of the assert!

fn main() {
    let mut region: OwnedRegion<usize> = test::black_box(OwnedRegion::empty());
    let mut my_array: [usize; 128] = [0;128];

    // !!! The following assert blocks the loop vectorization optimization !!!
    // !!! remove the following line to enable Loop vectorization !!!
    assert!(region.len > 128 );

    // I know this assume is unsafe given that we created an empty OwnedRegion. 
    // I am only looking at the output assembly to see if optimization is run 
    unsafe {  assume(region.len() > 128) };
    for i in 0..128 {
       my_array[i] = region[i];
    } 
    println!("{:?}", &my_array[..32]);
}

with assert! and assume:

inner loops is:

  cmp r15, rcx
  jbe .LBB6_5
  lea rcx, [rax - 2]
  mov rdx, qword ptr [rbx + 8*rax - 24]
  mov qword ptr [rbp + 8*rax - 1152], rdx
  cmp r15, rcx
  jbe .LBB6_4
  lea rcx, [rax - 1]
  mov rdx, qword ptr [rbx + 8*rax - 16]
  mov qword ptr [rbp + 8*rax - 1144], rdx
  cmp r15, rcx
  jbe .LBB6_8
  mov rcx, qword ptr [rbx + 8*rax - 8]
  mov qword ptr [rbp + 8*rax - 1136], rcx
  cmp r15, rax
  jbe .LBB6_10
  add rsi, 4
  mov rcx, qword ptr [rbx + 8*rax]
  mov qword ptr [rbp + 8*rax - 1128], rcx

where LBB6_4, LBB6_5, LBB6_8, and LBB6_10 jump to std::panic

with just assume:

inner loops is:

.LBB3_3:
  movups xmm0, xmmword ptr [rbx + 8*rax]
  movups xmm1, xmmword ptr [rbx + 8*rax + 16]
  movups xmmword ptr [rbp + 8*rax - 1120], xmm0
  movups xmmword ptr [rbp + 8*rax - 1104], xmm1
  movups xmm0, xmmword ptr [rbx + 8*rax + 32]
  movups xmm1, xmmword ptr [rbx + 8*rax + 48]
  movups xmmword ptr [rbp + 8*rax - 1088], xmm0
  movups xmmword ptr [rbp + 8*rax - 1072], xmm1
  movups xmm0, xmmword ptr [rbx + 8*rax + 64]
  movups xmm1, xmmword ptr [rbx + 8*rax + 80]
  movups xmmword ptr [rbp + 8*rax - 1056], xmm0
  movups xmmword ptr [rbp + 8*rax - 1040], xmm1
  movups xmm0, xmmword ptr [rbx + 8*rax + 96]
  movups xmm1, xmmword ptr [rbx + 8*rax + 112]
  movups xmmword ptr [rbp + 8*rax - 1024], xmm0
  movups xmmword ptr [rbp + 8*rax - 1008], xmm1
  add rax, 16
  cmp rax, 128
  jne .LBB3_3

here is a godbolt link

  1. What I think might be happening is the assert! before the assume is causing the assume to be optimised away and assert! by itself is not enough for LLVM to discover the necessary invariants. interestingly if you put the assume first the optimisation runs.

    Joe at 2017-10-18 19:13:32

  2. Sidenote, have you tried just using .zip() here? With two slice iterators as input, it should have no problem optimizting this correctly, without asserts and assumes. (Of course if that doesn't happen, it's an interesting bug.)

    Edit:

        assert!(region.len > 128 );
        
        for (a, &b) in my_array.iter_mut().zip(&region[..]) {
           *a = b;
        } 
    

    bluss at 2017-10-18 21:47:07

  3. I have and it does indeed do necessary optimiziation - but only if using in a for loop. A ‘.zip(....).map(|(a,b)| { *a = *b })’ does not optimise

    Joe at 2017-10-18 21:51:22

  4. I was quite pleased to see zip work :)

    Joe at 2017-10-18 21:52:14

  5. The code in this shape is easier to debug (in a crate-type lib), and it reproduces the same issue. Then we don't have to worry about black box or any other effects.

    
    // rest of the trait impls are needed from the previous code.
    
    #[no_mangle]
    pub fn foo(region: OwnedRegion<usize>, my_array: &mut [usize; 128]) {
        assert!(region.len() > 128 );
    
        unsafe {  assume(region.len() > 128) };
        for i in 0..128 {
           my_array[i] = region[i];
        } 
    }
    

    (Good, sorry for the zip distraction then.)

    bluss at 2017-10-18 22:07:08

  6. While this doesn't solve the missed optimization, I find that simply starting such a method with slicing everything to the correct size is often enough to help the optimizer. Does region support slicing?

    let region = &region[0..128];
    let my_array = &mut my_array[0..128];
    

    Thiez at 2017-10-19 12:08:22

  7. Thanks @Thiez thats good to now. Also seems to work well when the range is dynamic. I am wondering if you know whay this slice trick works so well? I can consistently get a average 10ns improvement over any other method for my iterators. Is rust doing something with slices that llvm can see that it cant see in a hand rolled iterator?

    Joe at 2017-10-22 21:30:15

  8. A slice is very simple, just a pointer and a length, so I suspect it is fairly easy for LLVM to reason about. It kind of has to be able to reason well about pointers and lengths to be able to handle IR generated by clang, because point + length is such a common pattern in C. The slicing operation performs the range check (and panics when the thing it is based on does not have the required range), so it is also a lot like having the assert!(region.len() > 128) in your code.

    Thiez at 2017-10-23 10:54:01