u32 saturating_mul with small constant is slower than a multiply+compare

3c7cb20
Opened by eefriedman at 2020-07-09 18:28:30
```rust
#![feature(test)] // `test::Bencher` requires nightly

extern crate test;

use std::ptr::{addr_of, addr_of_mut, read_volatile, write_volatile};
use test::Bencher;

// Accessed only through volatile reads/writes, so the compiler cannot
// constant-fold the multiplication or hoist it out of the loop.
static mut XXX: u32 = 10;

#[bench]
fn bench_sat_mul(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let mut r = read_volatile(addr_of!(XXX));
            r = r.saturating_mul(3); // standard library version
            write_volatile(addr_of_mut!(XXX), r);
        }
    });
}

// Widen to u64 (a u32 * u32 product always fits), then clamp to u32::MAX.
#[inline(always)]
fn fast_saturating_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64;
    if r > u32::MAX as u64 { u32::MAX } else { r as u32 }
}

#[bench]
fn bench_sat_mul_2(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let mut r = read_volatile(addr_of!(XXX));
            r = fast_saturating_mul(r, 3); // multiply + compare version
            write_volatile(addr_of_mut!(XXX), r);
        }
    });
}
```
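Before comparing timings, it is worth confirming that the two benchmarks compute the same function. The sketch below (stable Rust, no benchmark harness) checks the widening version against `u32::saturating_mul` on values around the overflow boundary for the constant 3 (`0x5555_5555 * 3` is exactly `u32::MAX`). It also adds a third formulation, the hypothetical `checked_saturating_mul`, built on `checked_mul(..).unwrap_or(u32::MAX)`; this is close to how the standard library currently defines unsigned `saturating_mul`, but that is an implementation detail, so treat it as an assumption.

```rust
// Widening multiply + compare, as in the benchmark above.
fn fast_saturating_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64; // u32 * u32 always fits in u64
    if r > u32::MAX as u64 { u32::MAX } else { r as u32 }
}

// Hypothetical alternative: detect overflow with checked_mul, clamp on None.
fn checked_saturating_mul(a: u32, b: u32) -> u32 {
    a.checked_mul(b).unwrap_or(u32::MAX)
}

fn main() {
    // Edge cases around the overflow boundary, plus the benchmark's constant 3.
    let samples = [
        0u32, 1, 2, 3, 10,
        0x5555_5555, // * 3 == u32::MAX exactly (no overflow)
        0x5555_5556, // * 3 overflows
        u32::MAX / 2, u32::MAX - 1, u32::MAX,
    ];
    for &a in &samples {
        for &b in &samples {
            let expected = a.saturating_mul(b);
            assert_eq!(fast_saturating_mul(a, b), expected, "a={a}, b={b}");
            assert_eq!(checked_saturating_mul(a, b), expected, "a={a}, b={b}");
        }
    }
    println!("all {} pairs agree", samples.len() * samples.len());
}
```

Running it should print `all 100 pairs agree`; an assertion failure would mean the two benchmarks are not measuring equivalent code.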

Resulting timings (x86-64 Linux, Ivy Bridge processor):

```text
test bench_sat_mul   ... bench:       4,354 ns/iter (+/- 231)
test bench_sat_mul_2 ... bench:       3,710 ns/iter (+/- 108)
```
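libtest reports `ns/iter` per call of the `b.iter` closure, and each call runs the inner loop 999 times (`1..1000` excludes the upper bound). Per loop iteration (which also includes one volatile load and one volatile store, not just the multiply), the x86-64 numbers above therefore work out to roughly:

$$
\frac{4354\ \text{ns}}{999} \approx 4.36\ \text{ns}, \qquad
\frac{3710\ \text{ns}}{999} \approx 3.71\ \text{ns}, \qquad
\frac{4354 - 3710}{999}\ \text{ns} \approx 0.64\ \text{ns}
$$

That is, the `saturating_mul` version takes about 17% longer ($4354 / 3710 \approx 1.17$).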

Maybe not a perfect benchmark, but there's probably something worth looking at. ~~Originally reported at https://users.rust-lang.org/t/unexpected-performance-from-array-bound-tests-and-more/6376/5 .~~

  1. Did a bit more testing: apparently the constant `3` makes a substantial difference, presumably because one version gets transformed into an LEA (`x * 3` computed as `x + x * 2` in a single instruction). Still potentially interesting, but maybe not quite in the same way.

    eefriedman at 2016-07-21 03:18:12

  2. aarch64 is affected too (results below). Also, unless you were simply demonstrating the relative difference (say, on a machine under load), the absolute x86_64 performance you report seems far too low.

    test bench_sat_mul   ... bench:       5,211 ns/iter (+/- 32)
    test bench_sat_mul_2 ... bench:       4,561 ns/iter (+/- 20)

    Taylor Trump at 2016-07-22 19:49:41

  3. Triage: today I get:

    running 2 tests
    test bench_sat_mul   ... bench:       2,977 ns/iter (+/- 6)
    test bench_sat_mul_2 ... bench:       3,021 ns/iter (+/- 2)

    Steve Klabnik at 2020-07-09 18:28:30
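Comment 1 attributes much of the gap to the literal `3`, presumably because `x * 3` can become a single `lea` in one version, and the triage run in comment 3 shows the two variants within about 1.5% of each other on a later compiler. A rough way to re-check the constant's role on stable Rust is to hide the multiplier behind `std::hint::black_box` (documented as best-effort only) so the compiler cannot specialize on it, then time that against the literal-3 version. This is a sketch under assumptions: `time_chain`, the iteration count, and the starting value 10 (borrowed from `XXX`) are hypothetical stand-ins for the nightly `#[bench]` harness, `Instant`-based timings are noisy, and comparing the generated assembly would be the more reliable test. Run it with `--release`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Widening multiply + compare, as in the original report.
#[inline(always)]
fn fast_saturating_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64;
    if r > u32::MAX as u64 { u32::MAX } else { r as u32 }
}

// Hypothetical harness: applies `step` `iters` times to a running value that
// starts at 10 (like `XXX`). Passing the value through `black_box` each time
// keeps the compiler from precomputing the chain.
fn time_chain(step: impl Fn(u32) -> u32, iters: u32) -> (u32, Duration) {
    let start = Instant::now();
    let mut r = 10u32;
    for _ in 0..iters {
        r = step(black_box(r));
    }
    (r, start.elapsed())
}

fn main() {
    const ITERS: u32 = 10_000_000;
    // Literal multiplier: the compiler is free to specialize on 3 (e.g. emit an `lea`).
    let (a, t1) = time_chain(|x| x.saturating_mul(3), ITERS);
    let (b, t2) = time_chain(|x| fast_saturating_mul(x, 3), ITERS);
    // Opaque multiplier: black_box hides the 3, so a general multiply should be needed.
    let (c, t3) = time_chain(|x| x.saturating_mul(black_box(3)), ITERS);
    let (d, t4) = time_chain(|x| fast_saturating_mul(x, black_box(3)), ITERS);

    // Starting from 10, every chain hits u32::MAX at step 19 (10 * 3^19 > u32::MAX),
    // so, as in the original benchmark, almost every step takes the overflow path.
    assert!([a, b, c, d].iter().all(|&v| v == u32::MAX));

    println!("saturating_mul, literal 3:      {t1:?}");
    println!("fast_saturating_mul, literal 3: {t2:?}");
    println!("saturating_mul, opaque 3:       {t3:?}");
    println!("fast_saturating_mul, opaque 3:  {t4:?}");
}
```

If the literal-3 pair differs noticeably while the opaque pair does not, that would support the LEA explanation from comment 1; timings will vary by CPU and compiler version.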