u32 saturating_mul with small constant is slower than a multiply+compare
#![feature(test)]
#![feature(core_intrinsics)]

extern crate test;

static mut XXX: u32 = 10;

use test::Bencher;

#[bench]
fn bench_sat_mul(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let mut r = std::intrinsics::volatile_load(&XXX);
            r = r.saturating_mul(3);
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}

#[inline(always)]
fn fast_saturating_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64;
    if r > 0xFFFFFFFF { 0xFFFFFFFF } else { r as u32 }
}

#[bench]
fn bench_sat_mul_2(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let mut r = std::intrinsics::volatile_load(&XXX);
            r = fast_saturating_mul(r, 3);
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}
Resulting timings (x86-64 Linux, Ivy Bridge processor):
test bench_sat_mul ... bench: 4,354 ns/iter (+/- 231)
test bench_sat_mul_2 ... bench: 3,710 ns/iter (+/- 108)
Maybe not a perfect benchmark, but there's probably something worth looking at. ~~Originally reported at https://users.rust-lang.org/t/unexpected-performance-from-array-bound-tests-and-more/6376/5 .~~
Did a bit more testing... apparently the constant "3" makes a substantial difference (presumably because one version gets transformed into an LEA). Still potentially interesting, but maybe not quite in the same way.
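For anyone re-running this on stable Rust, here is a minimal sketch that checks the widening-multiply version against `u32::saturating_mul` before timing anything. The helper `first_mismatch` and the sample values are assumptions made for illustration, not part of the original report, and the check covers sampled inputs only, not the full input space.

```rust
/// Widening-multiply version from the report: compute in u64, then clamp to u32.
fn fast_saturating_mul(a: u32, b: u32) -> u32 {
    let r = a as u64 * b as u64;
    if r > u32::MAX as u64 { u32::MAX } else { r as u32 }
}

/// Hypothetical helper (not from the report): returns the first (a, b) pair
/// for which the two implementations disagree, or None if they always agree.
fn first_mismatch(inputs: &[u32], multipliers: &[u32]) -> Option<(u32, u32)> {
    for &a in inputs {
        for &b in multipliers {
            if fast_saturating_mul(a, b) != a.saturating_mul(b) {
                return Some((a, b));
            }
        }
    }
    None
}

fn main() {
    // Sample values chosen around the overflow boundary for a multiplier of 3.
    let inputs = [0, 1, 10, u32::MAX / 3, u32::MAX / 3 + 1, u32::MAX - 1, u32::MAX];
    let multipliers = [0, 1, 2, 3, 5, 7, 1000, u32::MAX];
    assert_eq!(first_mismatch(&inputs, &multipliers), None);
    println!("both versions agree on all sampled inputs");
}
```

Since the constant seems to matter, it may be worth repeating the benchmark with several of these multipliers rather than only 3.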
eefriedman at 2016-07-21 03:18:12
aarch64 is affected too, and unless you were simply demonstrating the difference (let's say under load), the absolute x86_64 performance seems way too low.
test bench_sat_mul ... bench: 5,211 ns/iter (+/- 32)
test bench_sat_mul_2 ... bench: 4,561 ns/iter (+/- 20)
Taylor Trump at 2016-07-22 19:49:41
Triage: today I get
running 2 tests
test bench_sat_mul ... bench: 2,977 ns/iter (+/- 6)
test bench_sat_mul_2 ... bench: 3,021 ns/iter (+/- 2)
Steve Klabnik at 2020-07-09 18:28:30