u32 saturating_mul+compare slower than u64 mul+compare

ad4d742
Opened by eefriedman at 2021-04-14 04:50:22
#![feature(test)]
#![feature(core_intrinsics)]

extern crate test;

static mut XXX: u32 = 10;

use test::Bencher;

#[bench]
fn bench_sat_mul(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let r = std::intrinsics::volatile_load(&XXX);
            let r2 = std::intrinsics::volatile_load(&XXX);
            let r = r.saturating_mul(r2);
            let r = if r > 1000000000 { 1000000000 } else { r };
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}

#[bench]
fn bench_sat_mul_2(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let r = std::intrinsics::volatile_load(&XXX);
            let r2 = std::intrinsics::volatile_load(&XXX);
            let r = r as u64 * r2 as u64;
            let r = if r > 1000000000 { 1000000000 } else { r as u32 };
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}

Originally reported at https://users.rust-lang.org/t/unexpected-performance-from-array-bound-tests-and-more/6376/5 .

The two different code sequences, first for bench_sat_mul (32-bit mul with an overflow check) and then for bench_sat_mul_2 (64-bit imul):

    deac:       f7 e2                   mul    %edx
    deae:       0f 40 c7                cmovo  %edi,%eax
    deb1:       3d 00 ca 9a 3b          cmp    $0x3b9aca00,%eax
    deb6:       0f 47 c3                cmova  %ebx,%eax

    df3c:       48 0f af f3             imul   %rbx,%rsi
    df40:       48 81 fe 00 ca 9a 3b    cmp    $0x3b9aca00,%rsi
    df47:       0f 43 f2                cmovae %edx,%esi

There also seems to be some effect on loop unrolling.
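
For comparison outside the benchmark harness, the two loop bodies can be pulled out into plain functions. This is only a sketch: the names `clamp_sat_mul`, `clamp_wide_mul` and `LIMIT` are illustrative and do not appear in the benchmark, and the volatile loads/stores are dropped. Both functions should return the same value for every pair of u32 inputs, since any product that overflows u32 is also above the limit.

```rust
/// Upper bound applied by both benchmark bodies (0x3b9aca00 in the asm).
const LIMIT: u32 = 1_000_000_000;

/// Mirrors bench_sat_mul: saturate at u32::MAX, then clamp to LIMIT.
pub fn clamp_sat_mul(a: u32, b: u32) -> u32 {
    let r = a.saturating_mul(b);
    if r > LIMIT { LIMIT } else { r }
}

/// Mirrors bench_sat_mul_2: multiply in u64, clamp, then narrow to u32.
pub fn clamp_wide_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64;
    if r > LIMIT as u64 { LIMIT } else { r as u32 }
}
```

Since the results agree, the difference measured above is purely about which instruction sequence gets emitted, not about semantics.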

  1. LLVM?

    Ariel Ben-Yehuda at 2016-07-21 09:24:58

  2. aarch64 is affected too:

    test bench_sat_mul   ... bench:       8,476 ns/iter (+/- 11)
    test bench_sat_mul_2 ... bench:       6,536 ns/iter (+/- 256)
    

    Taylor Trump at 2016-07-22 19:49:31

  3. Looks like this one is a duplicate of https://github.com/rust-lang/rust/issues/34948?

    Ingvar Stepanyan at 2016-12-31 16:46:21

  4. Rephrased into https://rust.godbolt.org/z/vcTT3GMeb for ease of ASM-looking.

    I don't see anything obvious that Rust should be doing here, so this is probably up to LLVM. There's not even a umul.sat intrinsic. (Well, I guess we could emit umul.fix.sat with a scale of 0, but that seems unlikely to be better.)

    scottmcm at 2021-04-14 04:50:22
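
To make the "no umul.sat intrinsic" point concrete: an unsigned saturating multiply can be written as an overflow-checked multiply followed by a select, which matches the mul/cmovo pair in the first code sequence. The function below is a hand-written sketch with a hypothetical name (`sat_mul_by_hand`); it is not the standard library's source, only something that should behave the same as `u32::saturating_mul`.

```rust
/// Hand-rolled unsigned saturating multiply (illustrative only).
pub fn sat_mul_by_hand(a: u32, b: u32) -> u32 {
    // overflowing_mul returns the wrapped product plus an overflow flag.
    let (product, overflowed) = a.overflowing_mul(b);
    // Select u32::MAX on overflow, otherwise keep the product.
    if overflowed { u32::MAX } else { product }
}
```

Any improvement would presumably have to come from LLVM recognizing that the select feeding a clamp to a smaller constant can be folded, rather than from changing this shape on the Rust side.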