2x benchmark loss in rayon-hash from multiple codegen-units
I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.
```
$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
```
`rayon_set_sum_parallel` is the showcase for this crate, and it suffers the most from multiple codegen units.

From profiling and disassembly, this mostly looks like a lost inlining opportunity. In the slower build, the profile is split 35% `bridge_unindexed_producer_consumer`, 34% `Iterator::fold`, 28% `Sum::sum`, and the hot loop in the first looks like:
```
 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.58 │12705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.31 │12713:   mov    %rax,0x50(%rsp)
  0.49 │12718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.85 │12722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.30 │12727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>
```
With CGU=1, 96% of the profile is in `bridge_unindexed_producer_consumer`, with this hot loop:
```
  2.28 │1426d:   test   %rbx,%rbx
       │14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.40 │14272:   mov    (%rbx),%ebx
  2.75 │14274:   add    %rbx,%rax
  1.47 │14277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.56 │14280:   cmp    %rdi,%rsi
  0.04 │14283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.87 │14285:   add    $0x4,%rbx
 20.22 │14289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.00 │14292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.25 │14294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>
```
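For context, a minimal sketch of the operation being measured (assumed shape, not the actual `set_sum` benchmark source): summing a hash set's `u32` elements into a `u64`. With one codegen unit the fold/sum is inlined into the bucket-scan loop; with multiple CGUs every non-empty bucket pays an out-of-line call, as the disassembly above shows.

```rust
// Hypothetical sketch of what set_sum measures, serial version.
// rayon's parallel version drives the same per-bucket iteration
// through bridge_unindexed_producer_consumer.
use std::collections::HashSet;

fn main() {
    let set: HashSet<u32> = (0..1_000).collect();
    // The sum of 0..1000 is 999 * 1000 / 2 = 499_500.
    let total: u64 = set.iter().map(|&x| u64::from(x)).sum();
    assert_eq!(total, 499_500);
}
```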
@cuviper I wonder if we can add any `#[inline]` hints to help resolve this?

Niko Matsakis at 2018-01-25 15:19:35
@nikomatsakis you mean in libstd? Neither `Chain::fold` nor `u64::sum` has an `#[inline]` attribute today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.

But I hope we don't have to generally recommend `#[inline]` all over the place...

Josh Stone at 2018-01-25 18:42:32
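An illustrative sketch (hypothetical type, not from libstd or rayon) of the kind of hint being discussed: marking a generic method `#[inline]` makes its instantiation a candidate for inlining even when it would otherwise land in a different codegen unit.

```rust
// Hypothetical adapter to show where such a hint would go.
struct SumAdapter<I>(I);

impl<I: Iterator<Item = u64>> SumAdapter<I> {
    // Without a hint like this, a multi-CGU build may place the
    // instantiation in its own codegen unit and emit an out-of-line
    // call, as seen in the profiles above.
    #[inline]
    fn sum_all(self) -> u64 {
        self.0.fold(0, |acc, x| acc + x)
    }
}

fn main() {
    let total = SumAdapter(1..=10u64).sum_all();
    assert_eq!(total, 55);
}
```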
I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.
Mark Rousskov at 2018-01-25 21:56:34
@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.
Niko Matsakis at 2018-01-26 14:58:11
FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:
```
$ rustc -Vv
rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
binary: rustc
commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
commit-date: 2018-02-13
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 6.0

$ cargo bench --bench set_sum
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling [...]
    Finished release [optimized] target(s) in 5.37 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,420,138 ns/iter (+/- 34,709)
test rayon_set_sum_serial   ... bench:   7,718,603 ns/iter (+/- 141,127)
test std_set_sum_parallel   ... bench:   8,886,208 ns/iter (+/- 137,102)
test std_set_sum_serial     ... bench:   7,845,670 ns/iter (+/- 106,685)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.40 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,089,784 ns/iter (+/- 167,357)
test rayon_set_sum_serial   ... bench:   6,196,760 ns/iter (+/- 335,661)
test std_set_sum_parallel   ... bench:   8,497,259 ns/iter (+/- 128,929)
test std_set_sum_serial     ... bench:   6,151,935 ns/iter (+/- 76,954)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
```

The profile of `rayon_set_sum_parallel` is now 49.5% `bridge_unindexed_producer_consumer`, 48% `Sum::sum`, with this hot loop:

```
 19.74 │14fa0:   cmpq   $0x0,(%r12,%rbx,8)
  6.54 │14fa5: ↓ jne    14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
 21.06 │14fa7:   add    $0x1,%rbx
       │14fab:   cmp    %rbp,%rbx
  4.04 │14fae: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
  0.04 │14fb0: ↓ jmp    14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
 28.36 │14fb2:   mov    0x0(%r13,%rbx,4),%eax
  4.34 │14fb7:   movq   $0x1,0x58(%rsp)
  5.45 │14fc0:   mov    %rax,0x60(%rsp)
  0.84 │14fc5:   mov    %r14,%rdi
  3.80 │14fc8: → callq  16380 <<u64 as core::iter::traits::Sum>::sum>
  2.52 │14fcd:   add    %rax,%r15
  2.07 │14fd0:   add    $0x1,%rbx
       │14fd4:   cmp    %rbp,%rbx
  0.26 │14fd7: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
```

versus `codegen-units=1`:

```
  0.63 │14092:   lea    (%rdi,%rcx,4),%rbp
  0.01 │14096:   nopw   %cs:0x0(%rax,%rax,1)
 25.16 │140a0:   cmpq   $0x0,(%rsi,%rcx,8)
  7.68 │140a5: ↓ jne    140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
 28.34 │140a7:   add    $0x1,%rcx
  0.21 │140ab:   add    $0x4,%rbp
       │140af:   cmp    %rdx,%rcx
  2.63 │140b2: ↑ jb     140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
  0.04 │140b4: ↓ jmp    140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
 24.54 │140b6:   test   %rbp,%rbp
  0.00 │140b9: ↓ je     140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
  0.32 │140bb:   add    $0x1,%rcx
  5.10 │140bf:   mov    0x0(%rbp),%ebp
  3.31 │140c2:   add    %rbp,%rax
       │140c5:   cmp    %rdx,%rcx
  1.40 │140c8: ↑ jb     14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>
```

Josh Stone at 2018-02-14 19:13:16
@rust-lang/cargo, maybe `cargo bench` should default to one codegen unit?

Michael Woerister at 2018-02-15 09:27:16
If `cargo bench` uses one CGU but `cargo build --release` uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.

Hanna Kruppe at 2018-02-15 09:30:38
Yeah, I was thinking that too after I wrote that. `--release` really sounds like it should be the profile configured for best possible performance. We need an additional `opt-dev` profile that has "good enough" performance.

Michael Woerister at 2018-02-15 09:33:28
@michaelwoerister today Cargo uses separate profiles for `cargo build --release` and `cargo bench`, but by default they use the same options. So it is possible to manually override the number of CGUs for benchmarks specifically.

However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also https://github.com/rust-lang/rfcs/pull/2282, which suggests adding user-defined profiles.
Alex Kladov at 2018-02-15 09:35:47
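The per-profile override mentioned above can be expressed in `Cargo.toml`; a sketch using standard Cargo profile syntax (not taken from rayon-hash itself):

```toml
# Force a single codegen unit for `cargo bench` only, leaving
# `cargo build --release` with its default parallel codegen.
[profile.bench]
codegen-units = 1
```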
The question of whether or not to have an `opt-dev` profile and so forth has been around for some time. My current opinion is that we can probably keep it to two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and a few other minor optimizations. So the default might be something more like -O1 or -Og.

In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like `opt-dev` I guess corresponds to "I don't care about debuginfo but I'm not benchmarking" -- that might be better served by customizing the "default" build settings for your project?
Still, it seems like there is ultimately maybe a need for three profiles:
- super-debug: no opts, I want perfect debuginfo
- debug (default): do some optimization. By default, preserves debuginfo, but maybe you can "tune it up" if you'd rather have faster code.
- release: pull out all the stops.
Probably though this conversation should be happening somewhere else. =)
Niko Matsakis at 2018-02-15 17:05:22
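The three profiles sketched above could be approximated with Cargo's existing profile settings; an illustrative sketch (the `super-debug` name is hypothetical and would require the user-defined profiles proposed in RFC 2282):

```toml
# "debug (default)": mild optimization, debuginfo preserved.
[profile.dev]
opt-level = 1
debug = true

# "release": pull out all the stops.
[profile.release]
opt-level = 3
codegen-units = 1
```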
Probably though this conversation should be happening somewhere else
Yeah, this is closely connected to the work on revamping profiles in Cargo.
Aaron Turon at 2018-02-15 17:07:59
I just benchmarked as part of updating to rayon-1.0, and it's back to a 2x slowdown, with `Chain::fold` appearing in the profile again. I guess the prior inlining improvement was just luck of whether they were grouped in the same codegen unit.

Josh Stone at 2018-02-15 18:44:42
Is this still an issue of concern?
Jubilee at 2022-07-18 03:37:38
Well, `rayon-hash` is deprecated, but I guess the comparison may still be informative?

```
$ rustc -Vv
rustc 1.64.0-nightly (263edd43c 2022-07-17)
binary: rustc
commit-hash: 263edd43c5255084292329423c61a9d69715ebfa
commit-date: 2022-07-17
host: x86_64-unknown-linux-gnu
release: 1.64.0-nightly
LLVM version: 14.0.6

$ cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     214,551 ns/iter (+/- 5,456)
test hashbrown_set_sum_serial    ... bench:   1,828,014 ns/iter (+/- 14,492)
test rayon_hash_set_sum_parallel ... bench:     464,519 ns/iter (+/- 53,670)
test rayon_hash_set_sum_serial   ... bench:   6,327,378 ns/iter (+/- 57,630)
test std_hash_set_sum_parallel   ... bench:   4,392,456 ns/iter (+/- 1,768,937)
test std_hash_set_sum_serial     ... bench:   2,904,370 ns/iter (+/- 18,217)

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     217,752 ns/iter (+/- 43,351)
test hashbrown_set_sum_serial    ... bench:   1,624,025 ns/iter (+/- 23,969)
test rayon_hash_set_sum_parallel ... bench:     494,929 ns/iter (+/- 50,196)
test rayon_hash_set_sum_serial   ... bench:   4,843,448 ns/iter (+/- 72,878)
test std_hash_set_sum_parallel   ... bench:   2,632,228 ns/iter (+/- 628,283)
test std_hash_set_sum_serial     ... bench:   1,773,990 ns/iter (+/- 66,189)
```

I was originally sad about `rayon_set_sum_parallel`, and now that's pretty close: actually slower with CGU=1, but within the +/-. It also looks consistent and great for `hashbrown_set_sum_parallel`. The rest still see a pretty large benefit from CGU=1.

Josh Stone at 2022-07-18 19:03:21