2x benchmark loss in rayon-hash from multiple codegen-units

fd1a083
Opened by Josh Stone at 2023-10-08 20:47:32

I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.

$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from multiple codegen units.

From profiling and disassembly, this mostly looks like a lost inlining opportunity. In the slower build, the profile splits into 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, and 28% Sum::sum, and the hot loop of the first looks like:

 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.58 │12705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.31 │12713:   mov    %rax,0x50(%rsp)
  0.49 │12718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.85 │12722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.30 │12727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>

With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:

  2.28 │1426d:   test   %rbx,%rbx
       │14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.40 │14272:   mov    (%rbx),%ebx
  2.75 │14274:   add    %rbx,%rax
  1.47 │14277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.56 │14280:   cmp    %rdi,%rsi
  0.04 │14283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.87 │14285:   add    $0x4,%rbx
 20.22 │14289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.00 │14292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.25 │14294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>

  1. @cuviper I wonder if we can add any #[inline] hints to help resolve this?

    Niko Matsakis at 2018-01-25 15:19:35

  2. @nikomatsakis you mean in libstd? Neither Chain::fold nor u64::sum have #[inline] attributes today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.

    But I hope we don't have to generally recommend #[inline] all over the place...
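    For reference, a hint of the kind under discussion would look like this (a sketch with a made-up function name, not the actual libstd code; whether the attribute changes CGU placement for generics on this toolchain is exactly what's in question):

    ```rust
    // Hypothetical illustration: marking a generic function #[inline]
    // makes its instantiations candidates for cross-CGU inlining.
    // `generic_sum` is invented for this sketch.
    #[inline]
    fn generic_sum<I: Iterator<Item = u64>>(iter: I) -> u64 {
        iter.fold(0, |acc, x| acc + x)
    }

    fn main() {
        println!("{}", generic_sum(0..10)); // 0 + 1 + ... + 9
    }
    ```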

    Josh Stone at 2018-01-25 18:42:32

  3. I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.

    Mark Rousskov at 2018-01-25 21:56:34

  4. @cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.

    Niko Matsakis at 2018-01-26 14:58:11

  5. FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:

    $ rustc -Vv
    rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
    binary: rustc
    commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
    commit-date: 2018-02-13
    host: x86_64-unknown-linux-gnu
    release: 1.25.0-nightly
    LLVM version: 6.0
    
    $ cargo bench --bench set_sum
        Updating registry `https://github.com/rust-lang/crates.io-index`
       Compiling [...]
        Finished release [optimized] target(s) in 5.37 secs
         Running target/release/deps/set_sum-d85cde9ece817246
    
    running 4 tests
    test rayon_set_sum_parallel ... bench:   1,420,138 ns/iter (+/- 34,709)
    test rayon_set_sum_serial   ... bench:   7,718,603 ns/iter (+/- 141,127)
    test std_set_sum_parallel   ... bench:   8,886,208 ns/iter (+/- 137,102)
    test std_set_sum_serial     ... bench:   7,845,670 ns/iter (+/- 106,685)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
    
    $ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
       Compiling [...]
        Finished release [optimized] target(s) in 6.40 secs
         Running target/release/deps/set_sum-d85cde9ece817246
    
    running 4 tests
    test rayon_set_sum_parallel ... bench:   1,089,784 ns/iter (+/- 167,357)
    test rayon_set_sum_serial   ... bench:   6,196,760 ns/iter (+/- 335,661)
    test std_set_sum_parallel   ... bench:   8,497,259 ns/iter (+/- 128,929)
    test std_set_sum_serial     ... bench:   6,151,935 ns/iter (+/- 76,954)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
    

    The profile of rayon_set_sum_parallel is now 49.5% bridge_unindexed_producer_consumer, 48% Sum::sum, with this hot loop:

     19.74 │14fa0:   cmpq   $0x0,(%r12,%rbx,8)
      6.54 │14fa5: ↓ jne    14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
     21.06 │14fa7:   add    $0x1,%rbx
           │14fab:   cmp    %rbp,%rbx
      4.04 │14fae: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
      0.04 │14fb0: ↓ jmp    14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
     28.36 │14fb2:   mov    0x0(%r13,%rbx,4),%eax
      4.34 │14fb7:   movq   $0x1,0x58(%rsp)
      5.45 │14fc0:   mov    %rax,0x60(%rsp)
      0.84 │14fc5:   mov    %r14,%rdi
      3.80 │14fc8: → callq  16380 <<u64 as core::iter::traits::Sum>::sum>
      2.52 │14fcd:   add    %rax,%r15
      2.07 │14fd0:   add    $0x1,%rbx
           │14fd4:   cmp    %rbp,%rbx
      0.26 │14fd7: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
    

    versus codegen-units=1:

      0.63 │14092:   lea    (%rdi,%rcx,4),%rbp
      0.01 │14096:   nopw   %cs:0x0(%rax,%rax,1)
     25.16 │140a0:   cmpq   $0x0,(%rsi,%rcx,8)
      7.68 │140a5: ↓ jne    140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
     28.34 │140a7:   add    $0x1,%rcx
      0.21 │140ab:   add    $0x4,%rbp
           │140af:   cmp    %rdx,%rcx
      2.63 │140b2: ↑ jb     140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
      0.04 │140b4: ↓ jmp    140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
     24.54 │140b6:   test   %rbp,%rbp
      0.00 │140b9: ↓ je     140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
      0.32 │140bb:   add    $0x1,%rcx
      5.10 │140bf:   mov    0x0(%rbp),%ebp
      3.31 │140c2:   add    %rbp,%rax
           │140c5:   cmp    %rdx,%rcx
      1.40 │140c8: ↑ jb     14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>
    

    Josh Stone at 2018-02-14 19:13:16

  6. @rust-lang/cargo, maybe cargo bench should default to one codegen-unit?

    Michael Woerister at 2018-02-15 09:27:16

  7. If cargo bench uses one CGU but cargo build --release uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.

    Hanna Kruppe at 2018-02-15 09:30:38

  8. Yeah, I was thinking that too after I wrote that. --release really sounds like it should be the profile configured for best possible performance. We need an additional opt-dev profile that has "good enough" performance.

    Michael Woerister at 2018-02-15 09:33:28

  9. @michaelwoerister today Cargo uses separate profiles for cargo build --release and cargo bench, but, by default, they use the same options. So it is already possible to manually override the number of CGUs specifically for benchmarks.

    However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also https://github.com/rust-lang/rfcs/pull/2282 which suggests adding user-defined profiles.
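    Concretely, such an override would be a per-profile setting in Cargo.toml (the `codegen-units` profile key is standard Cargo syntax; whether every toolchain of this era honored it per-profile may vary):

    ```toml
    # Cargo.toml: force one codegen unit only when benchmarking,
    # leaving `cargo build --release` with the parallel default.
    [profile.bench]
    codegen-units = 1
    ```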

    Alex Kladov at 2018-02-15 09:35:47

  10. The question of whether or not to have an opt-dev profile and so forth has been around for some time. My current opinion is that we can probably keep it to two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and a few other minor optimizations. So the default might be something more like -O1 or -Og.

    In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but I'm not benchmarking" -- that might be better served by customizing the "default" build settings for your project?

    Still, it seems like there is ultimately maybe a need for three profiles:

    • super-debug: no opts, I want perfect debuginfo
    • debug (default): do some optimization. By default, preserves debuginfo, but maybe you can "tune it up" if you'd rather have faster code.
    • release: pull out all the stops.
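    In today's Cargo terms, the middle option might be approximated as (a sketch of the idea, not a settled proposal; `opt-level` and `debug` are standard profile keys):

    ```toml
    # Cargo.toml: a mildly-optimized dev profile in the spirit of -O1/-Og,
    # keeping debuginfo intact.
    [profile.dev]
    opt-level = 1
    debug = true
    ```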

    Probably though this conversation should be happening somewhere else. =)

    Niko Matsakis at 2018-02-15 17:05:22

  11. "Probably though this conversation should be happening somewhere else"

    Yeah, this is closely connected to the work on revamping profiles in Cargo.

    Aaron Turon at 2018-02-15 17:07:59

  12. I just benchmarked as part of updating to rayon-1.0, and it's back to a 2x slowdown, with Chain::fold appearing in the profile again. I guess the prior inlining improvement was just luck, depending on whether the functions happened to land in the same codegen unit.

    Josh Stone at 2018-02-15 18:44:42

  13. Is this still an issue of concern?

    Jubilee at 2022-07-18 03:37:38

  14. Well rayon-hash is deprecated, but I guess the comparison may still be informative?

    $ rustc -Vv
    rustc 1.64.0-nightly (263edd43c 2022-07-17)
    binary: rustc
    commit-hash: 263edd43c5255084292329423c61a9d69715ebfa
    commit-date: 2022-07-17
    host: x86_64-unknown-linux-gnu
    release: 1.64.0-nightly
    LLVM version: 14.0.6
    
    $ cargo bench --bench set_sum
    [...]
    running 6 tests
    test hashbrown_set_sum_parallel  ... bench:     214,551 ns/iter (+/- 5,456)
    test hashbrown_set_sum_serial    ... bench:   1,828,014 ns/iter (+/- 14,492)
    test rayon_hash_set_sum_parallel ... bench:     464,519 ns/iter (+/- 53,670)
    test rayon_hash_set_sum_serial   ... bench:   6,327,378 ns/iter (+/- 57,630)
    test std_hash_set_sum_parallel   ... bench:   4,392,456 ns/iter (+/- 1,768,937)
    test std_hash_set_sum_serial     ... bench:   2,904,370 ns/iter (+/- 18,217)
    
    $ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
    [...]
    running 6 tests
    test hashbrown_set_sum_parallel  ... bench:     217,752 ns/iter (+/- 43,351)
    test hashbrown_set_sum_serial    ... bench:   1,624,025 ns/iter (+/- 23,969)
    test rayon_hash_set_sum_parallel ... bench:     494,929 ns/iter (+/- 50,196)
    test rayon_hash_set_sum_serial   ... bench:   4,843,448 ns/iter (+/- 72,878)
    test std_hash_set_sum_parallel   ... bench:   2,632,228 ns/iter (+/- 628,283)
    test std_hash_set_sum_serial     ... bench:   1,773,990 ns/iter (+/- 66,189)
    

    I was originally sad about rayon_set_sum_parallel, and now that's pretty close: nominally slower with CGU=1, but within the error margin. It also looks consistent and great for hashbrown_set_sum_parallel. The rest still benefit substantially from CGU=1.

    Josh Stone at 2022-07-18 19:03:21