u32 saturating_mul+compare slower than u64 mul+compare
#![feature(test)]
#![feature(core_intrinsics)]

extern crate test;

static mut XXX: u32 = 10;

use test::Bencher;

#[bench]
fn bench_sat_mul(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let r = std::intrinsics::volatile_load(&XXX);
            let r2 = std::intrinsics::volatile_load(&XXX);
            let r = r.saturating_mul(r2);
            let r = if r > 1000000000 { 1000000000 } else { r };
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}

#[bench]
fn bench_sat_mul_2(b: &mut Bencher) {
    b.iter(|| unsafe {
        for _ in 1..1000 {
            let r = std::intrinsics::volatile_load(&XXX);
            let r2 = std::intrinsics::volatile_load(&XXX);
            let r = r as u64 * r2 as u64;
            let r = if r > 1000000000 { 1000000000 } else { r as u32 };
            std::intrinsics::volatile_store(&mut XXX, r);
        }
    });
}
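For anyone who wants to compare the two approaches outside the nightly-only benchmark harness, here is a minimal sketch of the same two computations as plain functions on stable Rust. The names `LIMIT`, `clamp_mul_saturating`, and `clamp_mul_widening` are made up for this illustration and do not appear in the original report.

```rust
/// Upper bound used by both benchmark bodies (hypothetical name).
pub const LIMIT: u32 = 1_000_000_000;

/// 32-bit saturating multiply followed by a clamp, as in bench_sat_mul.
pub fn clamp_mul_saturating(a: u32, b: u32) -> u32 {
    // On overflow the product saturates to u32::MAX, which is above LIMIT,
    // so the comparison below still clamps it.
    let r = a.saturating_mul(b);
    if r > LIMIT { LIMIT } else { r }
}

/// 64-bit multiply followed by a clamp, as in bench_sat_mul_2.
pub fn clamp_mul_widening(a: u32, b: u32) -> u32 {
    // A u32 * u32 product always fits in a u64, so no overflow check is needed.
    let r = a as u64 * b as u64;
    if r > LIMIT as u64 { LIMIT } else { r as u32 }
}
```

Both functions should return the same value for every pair of inputs, because any product that overflows `u32` is also larger than `LIMIT`. That equivalence is why the widening version is a fair replacement in the benchmark.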
Originally reported at https://users.rust-lang.org/t/unexpected-performance-from-array-bound-tests-and-more/6376/5 .
The two different code sequences:
bench_sat_mul (u32 saturating_mul):
    deac: f7 e2                   mul    %edx
    deae: 0f 40 c7                cmovo  %edi,%eax
    deb1: 3d 00 ca 9a 3b          cmp    $0x3b9aca00,%eax
    deb6: 0f 47 c3                cmova  %ebx,%eax

bench_sat_mul_2 (u64 mul):
    df3c: 48 0f af f3             imul   %rbx,%rsi
    df40: 48 81 fe 00 ca 9a 3b    cmp    $0x3b9aca00,%rsi
    df47: 0f 43 f2                cmovae %edx,%esi
There also seems to be some effect on loop unrolling.
LLVM?
Ariel Ben-Yehuda at 2016-07-21 09:24:58
aarch64 is affected too:
test bench_sat_mul   ... bench: 8,476 ns/iter (+/- 11)
test bench_sat_mul_2 ... bench: 6,536 ns/iter (+/- 256)
Taylor Trump at 2016-07-22 19:49:31
Looks like this one is a duplicate of https://github.com/rust-lang/rust/issues/34948?
Ingvar Stepanyan at 2016-12-31 16:46:21
Rephrased into https://rust.godbolt.org/z/vcTT3GMeb for ease of ASM-looking.
I don't see anything obvious that rust should be doing here, so this is probably up to LLVM. There's not even a umul.sat intrinsic. (Well, I guess we could emit umul.fix.sat with a scale of 0, but that seems unlikely to be better.)
scottmcm at 2021-04-14 04:50:22
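Without a dedicated saturating-multiply intrinsic, the operation has to be expressed as an overflow-checked multiply followed by a select, which is the shape of the mul/cmovo pair in the disassembly earlier in this thread. A minimal sketch of that shape in stable Rust follows. The function name is hypothetical, and this illustrates the pattern rather than reproducing the standard library's source.

```rust
/// Saturating u32 multiply written as "multiply with overflow flag, then select".
/// Hypothetical helper for illustration only.
pub fn saturating_mul_via_overflow_flag(a: u32, b: u32) -> u32 {
    // overflowing_mul returns the wrapped product and a flag that is true
    // when the mathematical product did not fit in 32 bits.
    let (wrapped, overflowed) = a.overflowing_mul(b);
    // Pick the maximum value when the flag is set, otherwise the product.
    if overflowed { u32::MAX } else { wrapped }
}
```

This should agree with `u32::saturating_mul` on all inputs. The extra select on the overflow flag is the step that the widening `u64` version avoids.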