Performance regression in nightly-2017-01-15
I've recently run into a performance regression in the Rust compiler in a project I'm working on. I've used nightlies to narrow it down to somewhere in 778988174..bf6d7b665. Specifically, the 2017-01-14 nightly is good, and 2017-01-15 is bad. The regression appears regardless of whether CARGO_INCREMENTAL=1 is set.
If you have a machine with a few cores, you can test it on a clone of https://github.com/mit-pdos/distributary/. Run with
$ cargo run -v --release --bin vote -- --avg --articles=100000 --runtime=10 --migrate=3 soup:// | grep 'PUT+'
On -14, the "migration" mentioned in the middle finishes quite quickly. On -15, it takes a long time, and so PUT never transitions to PUT+.
I wanted to try a git bisect to narrow the problem down further, but compiling it takes forever. A cursory glance over the commit range doesn't reveal anything particularly suspect, and #rust-internals told me to post here to get some more eyes on the problem.
Here are the commits that happened between the two commit hashes above:
https://github.com/rust-lang/rust/compare/7789881...bf6d7b6
Corey Farwell at 2017-02-12 23:55:31
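(Editor's aside: for anyone wanting to reproduce the nightly comparison, a minimal sketch, assuming rustup is installed; the toolchain dates and the benchmark command are taken from the report above, and the loop itself is my addition.)

```shell
# Install the last-good and first-bad nightlies named in the report.
rustup toolchain install nightly-2017-01-14 nightly-2017-01-15

# Run the vote benchmark under each toolchain. The regression reportedly
# shows up as a slow "migration" phase, so PUT never transitions to PUT+
# on the later nightly.
cd distributary
for tc in nightly-2017-01-14 nightly-2017-01-15; do
    cargo clean
    CARGO_INCREMENTAL=1 cargo +"$tc" run -v --release --bin vote -- \
        --avg --articles=100000 --runtime=10 --migrate=3 soup:// | grep 'PUT+'
done
```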
In addition to the "migration" performance issue, note also that the benchmark's GET and PUT throughputs degraded substantially from 2017-01-14 to 2017-01-15:

2017-01-14:
avg PUT: 197047.48  avg GET0: 1332991.45  avg GET: 1332991.45
avg PUT+: 153669.90  avg GET0+: 1305670.42  avg GET+: 1332991.45

2017-01-15:
avg PUT: 87309.19  avg GET0: 941140.38  avg GET: 941140.38
avg PUT+: 75320.96  avg GET0+: 921691.27  avg GET+: 941140.38

Malte Schwarzkopf at 2017-02-12 23:58:42
So, after some more digging, it turns out that CARGO_INCREMENTAL does matter. If unset entirely, the performance regression is no longer present. CARGO_INCREMENTAL= and CARGO_INCREMENTAL=0 still experience the bug.

I'm running a git bisect now using ./x.py build --stage 1 src/libtest to re-build rust at each step, then cargo clean, and then env CARGO_INCREMENTAL=1 cargo build --bin vote --release followed by a run. Will report back once I have results.

Jon Gjengset at 2017-02-13 00:43:22
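(Editor's aside: the manual bisect described above can be automated with git bisect run; a rough sketch, assuming a rust checkout, with the helper script path and throughput check being hypothetical. Details like pointing cargo at the stage-1 rustc, e.g. via rustup toolchain link, are elided.)

```shell
cd rust
git bisect start bf6d7b6 7789881      # <bad> <good>

# Hypothetical helper: build a stage-1 compiler, rebuild the project with
# incremental compilation, and exit non-zero when the run is slow.
cat > /tmp/bisect.sh <<'EOF'
#!/bin/sh
set -e
./x.py build --stage 1 src/libtest || exit 125   # 125 tells bisect to skip unbuildable commits
cd ~/distributary
cargo clean
env CARGO_INCREMENTAL=1 cargo build --bin vote --release
# ... run the vote benchmark here and exit 1 if throughput is degraded ...
EOF
chmod +x /tmp/bisect.sh

git bisect run /tmp/bisect.sh
```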
git bisect pointed to 5f90947c2c27d5d4ae30ccc0e83ffc95a8597128 as the first bad commit. I can confirm that using rustc built off of 02c7b117dadcfea4416f53b4bd99828bf750871f with CARGO_INCREMENTAL=1 produces the expected performance, and using rustc from 5f90947c2c27d5d4ae30ccc0e83ffc95a8597128 shows a significant performance decrease. If CARGO_INCREMENTAL is not set, 5f90947c2c27d5d4ae30ccc0e83ffc95a8597128 performs the same as 02c7b117dadcfea4416f53b4bd99828bf750871f.

Jon Gjengset at 2017-02-13 02:48:18
/cc #38944 @michaelwoerister
Jon Gjengset at 2017-02-13 02:49:02
FWIW, the project in question does a significant amount of work with HashMaps, so I would hazard a guess that the performance regression is somehow related to that. I'll see if I can cook up a benchmark. Could it be that this change is somehow preventing inlining of hashing or HashMap insertions when linking with std?

Jon Gjengset at 2017-02-13 02:53:58
And here's a minimal test-case showing that HashMap::insert is ~4x slower with 5f90947c2c27d5d4ae30ccc0e83ffc95a8597128 and CARGO_INCREMENTAL=1:

```rust
#![feature(test)]
extern crate test;
extern crate rand;

use test::Bencher;
use rand::Rng;

#[bench]
fn hashmap_insert(b: &mut Bencher) {
    let mut val: u32 = 0;
    let mut rng = rand::thread_rng();
    let mut map = std::collections::HashMap::new();
    b.iter(|| {
        map.insert(rng.gen::<u8>() as usize, val);
        val += 1;
    })
}
```

```
$ diff -r bench-bad/ bench-good/
$ rustup override list | grep bench-
~/bench-good    rust-good
~/bench-bad     rust-bad
$ git -C ~/rust-good rev-parse HEAD
02c7b117dadcfea4416f53b4bd99828bf750871f
$ git -C ~/rust-bad rev-parse HEAD
5f90947c2c27d5d4ae30ccc0e83ffc95a8597128

$ (~/bench-good) cargo bench
   Compiling libc v0.2.20
   Compiling rand v0.3.15
   Compiling bench-hashmap v0.1.0 (file://~/bench-good)
    Finished release [optimized] target(s) in 7.3 secs
     Running target/release/deps/bench_hashmap-0ee5f7a4b83ca3d0

running 1 test
test hashmap_insert ... bench:          39 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

$ (~/bench-good) cargo clean; env CARGO_INCREMENTAL=1 cargo bench
   Compiling libc v0.2.20
   Compiling rand v0.3.15
   Compiling bench-hashmap v0.1.0 (file://~/bench-good)
    Finished release [optimized] target(s) in 7.39 secs
     Running target/release/deps/bench_hashmap-0ee5f7a4b83ca3d0

running 1 test
test hashmap_insert ... bench:          38 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

$ (~/bench-bad) cargo bench
   Compiling libc v0.2.20
   Compiling rand v0.3.15
   Compiling bench-hashmap v0.1.0 (file://~/bench-bad)
    Finished release [optimized] target(s) in 7.27 secs
     Running target/release/deps/bench_hashmap-e5736083e8c2908f

running 1 test
test hashmap_insert ... bench:          38 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

$ (~/bench-bad) cargo clean; env CARGO_INCREMENTAL=1 cargo bench
   Compiling libc v0.2.20
   Compiling rand v0.3.15
   Compiling bench-hashmap v0.1.0 (file://~/bench-bad)
    Finished release [optimized] target(s) in 7.42 secs
     Running target/release/deps/bench_hashmap-e5736083e8c2908f

running 1 test
test hashmap_insert ... bench:         133 ns/iter (+/- 13)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
```

Jon Gjengset at 2017-02-13 03:11:41
Thanks for the report. My guess is that the code runs slower because less inlining is happening in incremental mode. It could probably be solved by strategic placement of some #[inline] annotations in libstd.

cc @rust-lang/libs
Michael Woerister at 2017-02-13 14:52:18
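(Editor's aside: to illustrate the suggestion above, a self-contained sketch of what #[inline] buys a library method. The Counter type here is hypothetical, not from libstd; it stands in for a hot method like HashMap::insert. Without the attribute, a non-generic function compiled in another crate or codegen unit is only reachable as a call through a symbol; with it, the body is made available so the optimizer can inline it at the call site, which is exactly what incremental compilation's partitioning otherwise prevents.)

```rust
// Hypothetical library type standing in for a hot std method. Marking
// small, frequently-called methods #[inline] makes their bodies available
// across crate and codegen-unit boundaries for inlining.
pub struct Counter {
    n: u64,
}

impl Counter {
    #[inline]
    pub fn bump(&mut self) {
        self.n += 1;
    }

    #[inline]
    pub fn get(&self) -> u64 {
        self.n
    }
}

fn main() {
    let mut c = Counter { n: 0 };
    for _ in 0..1000 {
        c.bump(); // a candidate for inlining even from another codegen unit
    }
    assert_eq!(c.get(), 1000);
}
```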
In general it's expected that incremental does not produce as-fast binaries as non-incremental, right? (due to a lack of inlining)
Alex Crichton at 2017-02-13 15:38:53
@alexcrichton it's still curious to see a regression like this though, no? Also, a 4x performance decrease for HashMap::insert seems like something that might affect a lot of people, and make them want to turn off incremental compilation entirely :/

Jon Gjengset at 2017-02-14 02:31:00
cc @nikomatsakis
Aaron Turon at 2017-02-14 03:42:48
@jonhoo oh sure it's bad, but pick any slowdown factor and someone can come up with a use case where a missed inlining opportunity causes that much slowdown. Release builds should likely avoid incremental for now; the main target today is debug builds, I believe.
Alex Crichton at 2017-02-14 15:12:41
I believe that this is "working as expected", yes, but it's very nice to have some test cases. We've been wanting to see if we can come up with better heuristics to help us with inlining across incremental units, and for that we need tests.
Niko Matsakis at 2017-02-16 17:03:12
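(Editor's aside on the "incremental units" mentioned above: incremental compilation partitions a crate into many codegen units, and LLVM generally only inlines within one unit unless the callee is #[inline] or generic. In a non-incremental build, the trade-off can be approximated by varying the codegen-unit count with rustc's real -C codegen-units flag:)

```shell
# Fewer codegen units: more cross-function inlining, slower compiles.
RUSTFLAGS="-C codegen-units=1" cargo bench

# Many codegen units roughly approximates the partitioning that incremental
# compilation uses, and typically loses some inlining opportunities.
RUSTFLAGS="-C codegen-units=16" cargo bench
```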
ThinLTO might do away with this problem if it is fast enough to be used for incr. comp.
Michael Woerister at 2017-03-06 15:00:33
ThinLTO is now implemented and available to test this. Compiling the above benchmark in release mode, I get:

test hashmap_insert ... bench: 39 ns/iter (+/- 0)

Compiling in release incremental mode, I get:

test hashmap_insert ... bench: 77 ns/iter (+/- 1)

Compiling in release incremental mode with ThinLTO, I get:

test hashmap_insert ... bench: 43 ns/iter (+/- 2)

It's not 100% the same, but I'd be pretty willing to bet that it's "close enough".

Alex Crichton at 2017-10-16 14:49:59
Alex Crichton at 2017-10-16 14:49:59
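(Editor's aside: the experiment above used rustc's internal ThinLTO across a crate's own codegen units, behind a nightly-only flag at the time; the exact invocation isn't recorded in this thread. On modern toolchains, ThinLTO for release builds can be requested declaratively via Cargo's profile settings, sketched below as a config fragment.)

```toml
# Cargo.toml (sketch): request ThinLTO for release builds.
[profile.release]
lto = "thin"
```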
This looks very promising!
Michael Woerister at 2017-10-16 15:15:25
What should be the appropriate course of action for this ticket?
Esteban Kuber at 2019-05-15 22:13:24