VecDeque's iterator could be optimized
VecDeque's iterator seems to never allow llvm to vectorize loops, this could be improved.
Even an enum where it uses the slice iterator for its contiguous case, and a fallback iterator, would allow vectorization for the contiguous case.
The following simple benchmark shows the massive difference between the deque iterator and using its two slice parts. The results are the same whether the deque is discontinuous or not. The summation in sum_deque_2 is an order of magnitude faster, because llvm can autovectorize it.
/* Results
test sum_deque ... bench: 1,301 ns/iter (+/- 111)
test sum_deque_2 ... bench: 71 ns/iter (+/- 2)
*/
#![feature(test)]
extern crate test;
use test::Bencher;
use test::black_box;
use std::collections::VecDeque;
#[bench]
fn sum_deque(b: &mut Bencher) {
let mut dq: VecDeque<_> = (0..1024).collect();
/*
for _ in 0..128 {
let elt = dq.pop_back().unwrap();
dq.push_front(elt);
}
*/
let dq = black_box(dq);
b.iter(|| {
let mut sum = 0;
for elt in &dq {
sum += *elt;
}
sum
})
}
#[bench]
fn sum_deque_2(b: &mut Bencher) {
let mut dq: VecDeque<_> = (0..1024).collect();
/*
for _ in 0..128 {
let elt = dq.pop_back().unwrap();
dq.push_front(elt);
}
*/
let dq = black_box(dq);
b.iter(|| {
let mut sum = 0;
let (a, b) = dq.as_slices();
for elt in a {
sum += *elt;
}
for elt in b {
sum += *elt;
}
sum
})
}
@dotdash The two for loops over slices in
sum_deque_2are very efficent, but I've never managed to build a rust iterator that optimizes like that (I guess llvm would have to recognize it as two consecutive loops)..chain()doesn't do it, and simpler or special cases like chain don't. It's a big iterator challenge!I've run into the same in the
ndarraycrate. For non-contiguous arrays, the iterator wants to be by row, and then dispatch to the slice iterator for contiguous rows. I haven't been able to make a rust iterator that does this efficiently (I use explicit nested for loops to implement arithmetic ops instead).bluss at 2016-01-10 10:59:49
Could you clarify which SSE instruction set was used to get that 20x speedup?
On
ARMv7it's 2x before usingNEONand 3x after so I'm not sure if it's the best llvm could do here. (possibly material for an llvm bug report)Even on an old x86 cpu, using
SSE (v1)yields a 2x speedup for a total of 5x.Taylor Trump at 2016-01-10 15:30:36
My tested platform is x86-64, Sandy Bridge, but with default settings, so only default instructions used.
It's using
paddd, the loop core insum_deque_2looks like this:f460: f3 0f 6f 52 90 movdqu xmm2,XMMWORD PTR [rdx-0x70] f465: f3 0f 6f 5a a0 movdqu xmm3,XMMWORD PTR [rdx-0x60] f46a: f3 0f 6f 62 b0 movdqu xmm4,XMMWORD PTR [rdx-0x50] f46f: f3 0f 6f 6a c0 movdqu xmm5,XMMWORD PTR [rdx-0x40] f474: 66 0f fe d0 paddd xmm2,xmm0 f478: 66 0f fe d9 paddd xmm3,xmm1 f47c: 66 0f fe d4 paddd xmm2,xmm4 f480: 66 0f fe dd paddd xmm3,xmm5 f484: f3 0f 6f 62 d0 movdqu xmm4,XMMWORD PTR [rdx-0x30] f489: f3 0f 6f 6a e0 movdqu xmm5,XMMWORD PTR [rdx-0x20] f48e: 66 0f fe e2 paddd xmm4,xmm2 f492: 66 0f fe eb paddd xmm5,xmm3 f496: f3 0f 6f 42 f0 movdqu xmm0,XMMWORD PTR [rdx-0x10] f49b: f3 0f 6f 0a movdqu xmm1,XMMWORD PTR [rdx] f49f: 66 0f fe c4 paddd xmm0,xmm4 f4a3: 66 0f fe cd paddd xmm1,xmm5 f4a7: 48 83 ea 80 sub rdx,0xffffffffffffff80 f4ab: 48 83 c3 e0 add rbx,0xffffffffffffffe0 f4af: 75 af jne f460 <_ZN11sum_deque_220hbee41beab497848aadaE+0x2b0>using
-C target-cpu=native,sum_deque_2improves further, switching to AVX(?) instructionsvpaddand so on.bluss at 2016-01-10 17:23:47
Thanks, just the default
SSE2indeed.Do you think it would be useful to create an
llvmissue containing IR output from the two platforms side-by-side? The issue is here.The llvm ARM guy is interested in some more arm data points, care to chime in @emoon, @warricksothr?
Taylor Trump at 2016-01-10 17:48:23
Your issue is only tangential to this issue, but you should make a more reduced test case if you only want to look at the loop's codegen. Here's what I'd look at, just the summation, which produces a big blob of autovectorized code on x86-64. https://play.rust-lang.org/?gist=8b542d1b17b8d69e44b7&version=nightly
bluss at 2016-01-11 05:39:33
Thanks @bluss and sorry for the diversion. The fact unrolling is disabled on ARM by default was an interesting discovery. Using the
-force-target-max-vector-interleave=4LLVM option actually produces a small improvement. (and stops there for N > 4)Taylor Trump at 2016-01-11 16:07:43
Triage: No change using the
forloop, but due to thefoldimprovement for VecDeque's iterator, the idiomaticdq.iter().sum()is now as fast as it should be (same as the slice).bluss at 2016-12-05 10:18:47