VecDeque's iterator could be optimized

136266d
Opened by bluss at 2019-12-11 16:29:26

VecDeque's iterator seems to never allow llvm to vectorize loops, this could be improved.

Even an enum where it uses the slice iterator for its contiguous case, and a fallback iterator, would allow vectorization for the contiguous case.

The following simple benchmark shows the massive difference between the deque iterator and using its two slice parts. The results are the same whether the deque is discontinuous or not. The summation in sum_deque_2 is an order of magnitude faster, because llvm can autovectorize it.


/* Results
test sum_deque   ... bench:       1,301 ns/iter (+/- 111)
test sum_deque_2 ... bench:          71 ns/iter (+/- 2)
*/

#![feature(test)]
extern crate test;

use test::Bencher;
use test::black_box;

use std::collections::VecDeque;


#[bench]
fn sum_deque(b: &mut Bencher) {
    let mut dq: VecDeque<_> = (0..1024).collect();
    /*
    for _ in 0..128 {
        let elt = dq.pop_back().unwrap();
        dq.push_front(elt);
    }
    */
    let dq = black_box(dq);

    b.iter(|| {
        let mut sum = 0;
        for elt in &dq {
            sum += *elt;
        }
        sum
    })
}


#[bench]
fn sum_deque_2(b: &mut Bencher) {
    let mut dq: VecDeque<_> = (0..1024).collect();
    /*
    for _ in 0..128 {
        let elt = dq.pop_back().unwrap();
        dq.push_front(elt);
    }
    */
    let dq = black_box(dq);

    b.iter(|| {
        let mut sum = 0;
        let (a, b) = dq.as_slices();
        for elt in a {
            sum += *elt;
        }
        for elt in b {
            sum += *elt;
        }
        sum
    })
}
  1. @dotdash The two for loops over slices in sum_deque_2 are very efficent, but I've never managed to build a rust iterator that optimizes like that (I guess llvm would have to recognize it as two consecutive loops). .chain() doesn't do it, and simpler or special cases like chain don't. It's a big iterator challenge!

    I've run into the same in the ndarray crate. For non-contiguous arrays, the iterator wants to be by row, and then dispatch to the slice iterator for contiguous rows. I haven't been able to make a rust iterator that does this efficiently (I use explicit nested for loops to implement arithmetic ops instead).

    bluss at 2016-01-10 10:59:49

  2. Could you clarify which SSE instruction set was used to get that 20x speedup?

    On ARMv7 it's 2x before using NEON and 3x after so I'm not sure if it's the best llvm could do here. (possibly material for an llvm bug report)

    Even on an old x86 cpu, using SSE (v1) yields a 2x speedup for a total of 5x.

    Taylor Trump at 2016-01-10 15:30:36

  3. My tested platform is x86-64, Sandy Bridge, but with default settings, so only default instructions used.

    It's using paddd, the loop core in sum_deque_2 looks like this:

        f460:       f3 0f 6f 52 90          movdqu xmm2,XMMWORD PTR [rdx-0x70]
        f465:       f3 0f 6f 5a a0          movdqu xmm3,XMMWORD PTR [rdx-0x60]
        f46a:       f3 0f 6f 62 b0          movdqu xmm4,XMMWORD PTR [rdx-0x50]
        f46f:       f3 0f 6f 6a c0          movdqu xmm5,XMMWORD PTR [rdx-0x40]
        f474:       66 0f fe d0             paddd  xmm2,xmm0
        f478:       66 0f fe d9             paddd  xmm3,xmm1
        f47c:       66 0f fe d4             paddd  xmm2,xmm4
        f480:       66 0f fe dd             paddd  xmm3,xmm5
        f484:       f3 0f 6f 62 d0          movdqu xmm4,XMMWORD PTR [rdx-0x30]
        f489:       f3 0f 6f 6a e0          movdqu xmm5,XMMWORD PTR [rdx-0x20]
        f48e:       66 0f fe e2             paddd  xmm4,xmm2
        f492:       66 0f fe eb             paddd  xmm5,xmm3
        f496:       f3 0f 6f 42 f0          movdqu xmm0,XMMWORD PTR [rdx-0x10]
        f49b:       f3 0f 6f 0a             movdqu xmm1,XMMWORD PTR [rdx]
        f49f:       66 0f fe c4             paddd  xmm0,xmm4
        f4a3:       66 0f fe cd             paddd  xmm1,xmm5
        f4a7:       48 83 ea 80             sub    rdx,0xffffffffffffff80
        f4ab:       48 83 c3 e0             add    rbx,0xffffffffffffffe0
        f4af:       75 af                   jne    f460 <_ZN11sum_deque_220hbee41beab497848aadaE+0x2b0>
    

    using -C target-cpu=native, sum_deque_2 improves further, switching to AVX(?) instructions vpadd and so on.

    bluss at 2016-01-10 17:23:47

  4. Thanks, just the default SSE2 indeed.

    Do you think it would be useful to create an llvm issue containing IR output from the two platforms side-by-side? The issue is here.

    The llvm ARM guy is interested in some more arm data points, care to chime in @emoon, @warricksothr?

    Taylor Trump at 2016-01-10 17:48:23

  5. Your issue is only tangential to this issue, but you should make a more reduced test case if you only want to look at the loop's codegen. Here's what I'd look at, just the summation, which produces a big blob of autovectorized code on x86-64. https://play.rust-lang.org/?gist=8b542d1b17b8d69e44b7&version=nightly

    bluss at 2016-01-11 05:39:33

  6. Thanks @bluss and sorry for the diversion. The fact unrolling is disabled on ARM by default was an interesting discovery. Using the -force-target-max-vector-interleave=4 LLVM option actually produces a small improvement. (and stops there for N > 4)

    Taylor Trump at 2016-01-11 16:07:43

  7. Triage: No change using the for loop, but due to the fold improvement for VecDeque's iterator, the idiomatic dq.iter().sum() is now as fast as it should be (same as the slice).

    bluss at 2016-12-05 10:18:47