Making a function generic (but not using the parameter at all) causes ~12% slowdown
I found a situation where making the function that contains the majority of the hot code generic produces about a 12% performance regression in benchmarks, even though the type parameter is never used in the function body and is only ever instantiated with one type.
The code in question is at https://bitbucket.org/marshallpierce/stream-vbyte-rust. The benchmark that exhibits the largest degradation is `encode_scalar_rand_1k`.
95ba949 is the commit that introduces the generic and associated slowdown.
Comparing the output of `perf annotate -l` for that benchmark in the two cases, the hot code appears to be compiled significantly differently. In the slow case, cmp.rs:846 (an implementation of `lt()`) is the leading offender in `perf annotate`, accounting for about 12% of samples (essentially the same as the performance delta I'm seeing), whereas it does not appear at all in the base case.
In the base case, it looks like inlining collapsed function calls up through encode::encode, a few layers up the call stack:
```
...
0.00 : 110ef: cmp $0x1,%rax
0.00 : 110f3: jbe 11150 <stream_vbyte::encode::encode+0x1f0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.00 : 110f5: cmp $0x2,%rdx
0.00 : 110f9: jb 111d3 <stream_vbyte::encode::encode+0x273>
0.00 : 110ff: movzbl -0x2b(%rbp),%ecx
0.00 : 11103: mov %cl,0x1(%r14,%r12,1)
: _ZN4core4iter5range8{{impl}}11next<usize>E():
0.00 : 11108: cmp $0x3,%rax
0.00 : 1110c: jb 11150 <stream_vbyte::encode::encode+0x1f0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.00 : 1110e: cmp $0x3,%rdx
0.00 : 11112: jb 111da <stream_vbyte::encode::encode+0x27a>
0.00 : 11118: movzbl -0x2a(%rbp),%ecx
0.00 : 1111c: mov %cl,0x2(%r14,%r12,1)
: _ZN4core4iter5range8{{impl}}11next<usize>E():
0.00 : 11121: cmp $0x4,%rax
0.00 : 11125: jb 11150 <stream_vbyte::encode::encode+0x1f0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.00 : 11127: cmp $0x4,%rdx
...
```
In the slower case, encode_num_scalar was inlined only into its direct caller, do_encode_quads:
```
: _ZN9byteorder8{{impl}}9write_u32E():
lib.rs:1726 2.11 : 1149a: mov %r10d,-0x2c(%rbp)
: _ZN12stream_vbyte6encode17encode_num_scalarE():
: output[i] = buf[i];
mod.rs:161 0.77 : 1149e: test %r13,%r13
0.00 : 114a1: je 1167d <stream_vbyte::scalar::do_encode_quads+0x41d>
0.16 : 114a7: movzbl -0x2c(%rbp),%ecx
1.62 : 114ab: mov %cl,(%r9,%rax,1)
: _ZN4core3cmp5impls8{{impl}}2ltE():
cmp.rs:846 2.90 : 114af: cmp $0x1,%r8
0.00 : 114b3: jbe 11510 <stream_vbyte::scalar::do_encode_quads+0x2b0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.41 : 114b5: cmp $0x2,%r13
0.00 : 114b9: jb 11648 <stream_vbyte::scalar::do_encode_quads+0x3e8>
0.08 : 114bf: movzbl -0x2b(%rbp),%ecx
mod.rs:161 0.55 : 114c3: mov %cl,0x1(%r9,%rax,1)
: _ZN4core4iter5range8{{impl}}11next<usize>E():
range.rs:218 1.82 : 114c8: cmp $0x3,%r8
0.00 : 114cc: jb 11510 <stream_vbyte::scalar::do_encode_quads+0x2b0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.16 : 114ce: cmp $0x3,%r13
0.00 : 114d2: jb 1166d <stream_vbyte::scalar::do_encode_quads+0x40d>
0.04 : 114d8: movzbl -0x2a(%rbp),%ecx
0.14 : 114dc: mov %cl,0x2(%r9,%rax,1)
: _ZN4core4iter5range8{{impl}}11next<usize>E():
1.24 : 114e1: cmp $0x4,%r8
0.00 : 114e5: jb 11510 <stream_vbyte::scalar::do_encode_quads+0x2b0>
: _ZN12stream_vbyte6encode17encode_num_scalarE():
0.06 : 114e7: cmp $0x4,%r13
0.00 : 114eb: jb 11674 <stream_vbyte::scalar::do_encode_quads+0x414>
0.00 : 114f1: movzbl -0x29(%rbp),%ecx
```
Note the non-zero percentages attached to various cmp instructions. Also, based on my casual "look at how many times encode_num_scalar appears to have been inlined" analysis, it looks like the slow case unrolled the loop 4x while the fast case did not. (Discussing on rust-internals led to https://github.com/rust-lang/rfcs/issues/2219.)
Meta
`rustc --version --verbose`:

```
rustc 1.21.0 (3b72af97e 2017-10-09)
binary: rustc
commit-hash: 3b72af97e42989b2fe104d8edbaee123cdf7c58f
commit-date: 2017-10-09
host: x86_64-unknown-linux-gnu
release: 1.21.0
LLVM version: 4.0
```