Optimisation opportunity missed due to assert!
I have noticed some unexpected behaviour regarding assert!, and I am not sure whether it is an LLVM or a Rust issue.
In the following code (see the godbolt link below) I have a Vec-like structure, OwnedRegion. I want to iterate over the region (iteration is implemented via slice::IterMut) and populate an array. I expect LLVM to do loop unrolling and loop vectorisation. If you place the necessary intrinsics::assume to convince LLVM to remove the bounds check, you do indeed get the optimisations. However, I wanted to replace the assume with an assert!, but that removes the optimisation even when combined with an assume! What I think is happening is that the bounds check is being reintroduced: the length check in the assert! is either being moved into the for loop or elided completely, because it duplicates the length check in the index operation. What should happen is that the bounds check in the for loop is lifted/elided because of the assert!.
fn main() {
    let mut region: OwnedRegion<usize> = test::black_box(OwnedRegion::empty());
    let mut my_array: [usize; 128] = [0; 128];
    // !!! The following assert blocks the loop vectorization optimization !!!
    // !!! Remove the following line to enable loop vectorization !!!
    assert!(region.len() > 128);
    // I know this assume is unsafe given that we created an empty OwnedRegion.
    // I am only looking at the output assembly to see if the optimization is run.
    unsafe { assume(region.len() > 128) };
    for i in 0..128 {
        my_array[i] = region[i];
    }
    println!("{:?}", &my_array[..32]);
}
With assert! and assume, the inner loop is:
cmp r15, rcx
jbe .LBB6_5
lea rcx, [rax - 2]
mov rdx, qword ptr [rbx + 8*rax - 24]
mov qword ptr [rbp + 8*rax - 1152], rdx
cmp r15, rcx
jbe .LBB6_4
lea rcx, [rax - 1]
mov rdx, qword ptr [rbx + 8*rax - 16]
mov qword ptr [rbp + 8*rax - 1144], rdx
cmp r15, rcx
jbe .LBB6_8
mov rcx, qword ptr [rbx + 8*rax - 8]
mov qword ptr [rbp + 8*rax - 1136], rcx
cmp r15, rax
jbe .LBB6_10
add rsi, 4
mov rcx, qword ptr [rbx + 8*rax]
mov qword ptr [rbp + 8*rax - 1128], rcx
where .LBB6_4, .LBB6_5, .LBB6_8, and .LBB6_10 jump to std::panic.
With just the assume, the inner loop is:
.LBB3_3:
movups xmm0, xmmword ptr [rbx + 8*rax]
movups xmm1, xmmword ptr [rbx + 8*rax + 16]
movups xmmword ptr [rbp + 8*rax - 1120], xmm0
movups xmmword ptr [rbp + 8*rax - 1104], xmm1
movups xmm0, xmmword ptr [rbx + 8*rax + 32]
movups xmm1, xmmword ptr [rbx + 8*rax + 48]
movups xmmword ptr [rbp + 8*rax - 1088], xmm0
movups xmmword ptr [rbp + 8*rax - 1072], xmm1
movups xmm0, xmmword ptr [rbx + 8*rax + 64]
movups xmm1, xmmword ptr [rbx + 8*rax + 80]
movups xmmword ptr [rbp + 8*rax - 1056], xmm0
movups xmmword ptr [rbp + 8*rax - 1040], xmm1
movups xmm0, xmmword ptr [rbx + 8*rax + 96]
movups xmm1, xmmword ptr [rbx + 8*rax + 112]
movups xmmword ptr [rbp + 8*rax - 1024], xmm0
movups xmmword ptr [rbp + 8*rax - 1008], xmm1
add rax, 16
cmp rax, 128
jne .LBB3_3
Here is a godbolt link.
What I think might be happening is that the assert! before the assume is causing the assume to be optimised away, and the assert! by itself is not enough for LLVM to discover the necessary invariants. Interestingly, if you put the assume first, the optimisation runs.

Joe at 2017-10-18 19:13:32
Sidenote: have you tried just using .zip() here? With two slice iterators as input, it should have no problem optimizing this correctly, without asserts and assumes. (Of course, if that doesn't happen, it's an interesting bug.)

Edit:

    assert!(region.len() > 128);
    for (a, &b) in my_array.iter_mut().zip(&region[..]) {
        *a = b;
    }

bluss at 2017-10-18 21:47:07
I have, and it does indeed do the necessary optimisation, but only if used in a for loop. A .zip(...).map(|(a, b)| { *a = *b }) does not optimise.
Joe at 2017-10-18 21:51:22
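For reference, the two shapes being compared can be sketched on plain slices (OwnedRegion's implementation is not shown in the thread, so this substitutes &[usize]; the function names are made up for illustration). Both are functionally equivalent; only the reported codegen differs:

```rust
// Sketch of the two iterator shapes discussed above, on plain slices.
// Both copy the first 128 elements of `src` into `dst`.

// for-loop over zip: reported to vectorise
pub fn copy_for_zip(dst: &mut [usize; 128], src: &[usize]) {
    for (a, &b) in dst.iter_mut().zip(&src[..]) {
        *a = b;
    }
}

// adapter chain with the write inside a closure: reported NOT to vectorise
pub fn copy_zip_closure(dst: &mut [usize; 128], src: &[usize]) {
    dst.iter_mut().zip(src.iter()).for_each(|(a, &b)| *a = b);
}

fn main() {
    let src: Vec<usize> = (0..200).collect();
    let mut d1 = [0usize; 128];
    let mut d2 = [0usize; 128];
    copy_for_zip(&mut d1, &src);
    copy_zip_closure(&mut d2, &src);
    // The two forms produce identical results; inspect the assembly
    // (e.g. on godbolt) to compare the generated loops.
    assert_eq!(d1, d2);
}
```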
I was quite pleased to see zip work :)
Joe at 2017-10-18 21:52:14
The code in this shape is easier to debug (in a crate-type lib), and it reproduces the same issue. Then we don't have to worry about black_box or any other effects.

    // rest of the trait impls are needed from the previous code
    #[no_mangle]
    pub fn foo(region: OwnedRegion<usize>, my_array: &mut [usize; 128]) {
        assert!(region.len() > 128);
        unsafe { assume(region.len() > 128) };
        for i in 0..128 {
            my_array[i] = region[i];
        }
    }

(Good, sorry for the zip distraction then.)
bluss at 2017-10-18 22:07:08
While this doesn't solve the missed optimization, I find that simply starting such a method by slicing everything to the correct size is often enough to help the optimizer. Does region support slicing?

    let region = &region[0..128];
    let my_array = &mut my_array[0..128];

Thiez at 2017-10-19 12:08:22
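Written out as a full function, the re-slice-first pattern Thiez suggests looks like this (a sketch on a plain &[usize], since whether OwnedRegion supports slicing is exactly the open question; the function name is made up):

```rust
// Sketch of the re-slice-first pattern. The slicing itself performs the
// range check once, up front, so the loop body needs no per-iteration
// bounds checks and can be vectorised.
pub fn fill_from(region: &[usize], my_array: &mut [usize; 128]) {
    let region = &region[0..128];          // panics here if region is too short
    let my_array = &mut my_array[0..128];  // already exact, but keeps both sides uniform
    for i in 0..128 {
        my_array[i] = region[i];
    }
}

fn main() {
    let src: Vec<usize> = (0..200).map(|x| x * 2).collect();
    let mut arr = [0usize; 128];
    fill_from(&src, &mut arr);
    println!("{:?}", &arr[..4]);
}
```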
Thanks @Thiez, that's good to know. It also seems to work well when the range is dynamic. I am wondering if you know why this slice trick works so well? I can consistently get an average 10 ns improvement over any other method for my iterators. Is Rust doing something with slices that LLVM can see, that it can't see in a hand-rolled iterator?
Joe at 2017-10-22 21:30:15
A slice is very simple, just a pointer and a length, so I suspect it is fairly easy for LLVM to reason about. It more or less has to reason well about pointers and lengths to handle IR generated by clang, because pointer + length is such a common pattern in C. The slicing operation performs the range check (and panics when the thing it is based on does not have the required range), so it is also a lot like having the assert!(region.len() > 128) in your code.

Thiez at 2017-10-23 10:54:01
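Thiez's last point, that the slicing operation itself performs the same check a failing assert! would, can be seen directly in a small standalone sketch (not from the thread):

```rust
use std::panic;

fn main() {
    let short = vec![0usize; 100];
    // Slicing past the end performs a range check and panics,
    // much like a failed assert!(short.len() >= 128) would:
    let out_of_range = panic::catch_unwind(|| short[0..128].len());
    assert!(out_of_range.is_err());
    // Slicing within bounds succeeds and yields exactly the checked range,
    // which the optimizer can then rely on:
    assert_eq!(short[0..64].len(), 64);
}
```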