Future::and_then and other adapters are not always zero cost
I expected these two code snippets to produce the same (or very similar) code when fully optimized (release+LTO) but they don't:
let tx = busy_wait(tx.write(4));
busy_wait(tx.write(2));
busy_wait(tx.write(4).and_then(|tx| tx.write(2)));
The first is panic!-free but the second is not; it still contains calls to
panic!("cannot poll a chained future twice") even though it never polls either
future after it has yielded its value.
STR
// src/main.rs
#![feature(lang_items)]
#![feature(never_type)]
#![no_main]
#![no_std]
extern crate futures;
use core::{fmt, ptr};
use futures::{Async, Future};
// entry point
#[no_mangle]
pub fn _start() -> ! {
let tx = Tx { _0: () };
busy_wait(tx.write(4).and_then(|tx| tx.write(2)));
loop {}
}
struct Tx {
_0: (),
}
impl Tx {
fn write(self, byte: u8) -> Write {
Write {
tx: Some(self),
byte: byte,
}
}
}
struct Write {
byte: u8,
tx: Option<Tx>,
}
impl Future for Write {
type Error = !;
type Item = Tx;
fn poll(&mut self) -> Result<Async<Tx>, !> {
unsafe {
let tx = self.tx.take().expect("cannot poll `write` twice");
// NOTE `0x0` and `0x4` emulate memory mapped IO registers
// Can we send data yet?
if ptr::read_volatile(0x0 as *const bool) {
// Send one byte of data
ptr::write_volatile(0x4 as *mut u8, self.byte);
Ok(Async::Ready(tx))
} else {
self.tx = Some(tx);
Ok(Async::NotReady)
}
}
}
}
fn busy_wait<F>(mut f: F) -> F::Item
where F: Future<Error = !>
{
loop {
if let Ok(Async::Ready(t)) = f.poll() {
return t;
}
}
}
#[lang = "panic_fmt"]
extern "C" fn panic_fmt(_fmt: fmt::Arguments,
_file: &'static str,
_line: u32)
-> ! {
// HACK to keep the `_file` string in the final binary
unsafe { ptr::write_volatile(0x8 as *mut _, _file) }
loop {}
}
$ head -n7 Cargo.toml
[dependencies.futures]
default-features = false
version = "0.1.3"
[profile.release]
lto = true
panic = "abort"
$ cargo rustc --release --verbose -- -C link-arg=-nostartfiles
$ objdump -Cd target/release/foo
00000000000002c0 <_start>:
2c0: b0 04 mov $0x4,%al
2c2: 31 c9 xor %ecx,%ecx
2c4: 66 66 66 2e 0f 1f 84 data16 data16 nopw %cs:0x0(%rax,%rax,1)
2cb: 00 00 00 00 00
2d0: 80 f9 01 cmp $0x1,%cl
2d3: 74 3b je 310 <_start+0x50>
2d5: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
2dc: 00 00 00 00
2e0: f6 04 25 00 00 00 00 testb $0x1,0x0
2e7: 01
2e8: 74 f6 je 2e0 <_start+0x20>
2ea: 88 04 25 04 00 00 00 mov %al,0x4
2f1: 84 c9 test %cl,%cl
2f3: 75 3d jne 332 <_start+0x72>
2f5: b1 01 mov $0x1,%cl
2f7: b0 02 mov $0x2,%al
2f9: f6 04 25 00 00 00 00 testb $0x1,0x0
300: 01
301: 74 cd je 2d0 <_start+0x10>
303: c6 04 25 04 00 00 00 movb $0x2,0x4
30a: 02
30b: eb 23 jmp 330 <_start+0x70>
30d: 0f 1f 00 nopl (%rax)
310: f6 04 25 00 00 00 00 testb $0x1,0x0
317: 01
318: 74 f6 je 310 <_start+0x50>
31a: 88 04 25 04 00 00 00 mov %al,0x4
321: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
328: 0f 1f 84 00 00 00 00
32f: 00
330: eb fe jmp 330 <_start+0x70>
332: 50 push %rax
333: e8 38 00 00 00 callq 370 <core::panicking::panic::h194ce5d68a8f28a1>
338: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
33f: 00
0000000000000340 <core::panicking::panic_fmt::h561c5ee168a3d2cb>:
(..)
$ objcopy -O binary target/release/foo foo.bin
$ strings foo.bin
(..)
/home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.3/src/chain.rs
If the program is changed to the no-and_then variant, the disassembly looks
like this:
$ objdump -Cd target/release/foo
0000000000000280 <_start>:
280: f6 04 25 00 00 00 00 testb $0x1,0x0
287: 01
288: 74 f6 je 280 <_start>
28a: c6 04 25 04 00 00 00 movb $0x4,0x4
291: 04
292: 66 66 66 66 66 2e 0f data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
299: 1f 84 00 00 00 00 00
2a0: f6 04 25 00 00 00 00 testb $0x1,0x0
2a7: 01
2a8: 74 f6 je 2a0 <_start+0x20>
2aa: c6 04 25 04 00 00 00 movb $0x2,0x4
2b1: 02
2b2: 66 66 66 66 66 2e 0f data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
2b9: 1f 84 00 00 00 00 00
2c0: eb fe jmp 2c0 <_start+0x40>
There are no panic_fmt calls in it.
Meta
$ rustc -V
rustc 1.15.0-nightly (43006fcea 2016-11-15)
I don't think this is a problem with the implementation of AndThen/Chain in
the futures crate because I have tried to re-implement Future in a few
different ways e.g. without error-handling, sprinkling #[inline] everywhere but
none of that helps.
Code like this:
busy_wait(futures::done(Ok(42)).and_then(|x| futures::done(Ok(x))));
optimizes the same as the "split" version and in that case LLVM evaluates the
expression at compile time and replaces it with 42. So LLVM can actually lower
and_then to panic! free code; it just doesn't optimize well this particular
case (perhaps, because of the volatile memory operations?)
join is another adapter that has the same issue.
cc @alexcrichton @eddyb
Hilariously this is not an issue at
-Copt-level=2(how I initially tried to reproduce this). This is an issue at-Copt-level=3Alex Crichton at 2016-11-22 16:05:06
Opened https://github.com/rust-lang/rust/issues/37933
Alex Crichton at 2016-11-22 16:53:24
Whoa wow I totally thought this was in the futures-rs repo, not the rust-lang/rust repo! Ok, inlining https://github.com/rust-lang/rust/issues/37933 into here:
Originally reported at https://github.com/rust-lang/rust/issues/37930, minimized at https://gist.github.com/alexcrichton/86db63138ef979422d8fa10538dc6dcb, the code optimizes better at opt-level=2 than opt-level=3.
For example:
$ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=2 && grep 'call .* @_ZN4core9panicking5panic' main.ll $ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=3 && grep 'call .* @_ZN4core9panicking5panic' main.ll tail call void @_ZN4core9panicking5panic17h194ce5d68a8f28a1E({ %str_slice, %str_slice, i32 }* noalias readonly dereferenceable(40) bitcast ({ %str_slice, %str_slice, i32, [4 x i8] }* @"_ZN47_$LT$main..Chain$LT$A$C$$u20$B$C$$u20$C$GT$$GT$4poll14_MSG_FILE_LINE17hf0dd0d6c8e2c1d26E" to { %str_slice, %str_slice, i32 }*))That is, with opt-level=2 we can prove the code doesn't panic, but at opt-level=3 something goes awry and we don't prove that it doesn't panic.
Do we need to tweak our opt-level=3 pipeline? Has it diverged from clang? Is it inappropriate to run by default?
Alex Crichton at 2016-11-22 16:54:29
I should also note that the
opt-level=2(set in Cargo.toml's profile.release) has no effect if you are still using the futures crates. IOW, it's necessary to inline the futures code into a single crate and set opt-level=2.Jorge Aparicio at 2016-11-22 16:57:42
I am investigating @alexcrichton's issue.
Some code generated ends up looking like
%f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ] %trunc.i.i.i = trunc i8 %f.sroa.0.0.ph.i to i2 %switch.i = icmp eq i2 %trunc.i.i.i, 0 ... x-block: br i1 %switch.i, label %x2-block, label %x1-block ... y-block: %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0 br i1 %cond.i.i.i, label %y1-block, label %y2-blockInstCombinemanages to combine the 2 compares and transform the phi to a phi on booleans, which leads to SCCP to later optimize out one of the conditionals.-C opt-level=3, through some convoluted path involving it maintaining a branch table of the formswitch i2 %trunc.i.i, label %unreachable.i.i [ i2 0, label %bb2.i.i i2 1, label %bb3.i.i i2 -2, label %bb1.i.i ]through way too many passes, ends up with a conditional of the form
%f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ] %switch = icmp eq i8 %f.sroa.0.0.ph.i, 1 ... x-block: br i1 %switch, label %x1-block, label %x2-block ... y-block: %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0 br i1 %cond.i.i.i, label %y1-block, label %y2-blockWhich it is then unable to optimize.
Ariel Ben-Yehuda at 2016-11-22 21:28:26
Minimal example (if both icmps have the same bitwise value, this optimizes to a switch of true/false. If they don't, it doesn't):
define i8 @_start(i8* %condition) { entry-block: br label %bb1.preheader.i bb1.preheader.i: %f.sroa.0.0.ph.i = phi i8 [ 1, %bb4.i.i.i.i ], [ 0, %entry-block ] %switch.i = icmp eq i8 %f.sroa.0.0.ph.i, 1 br label %loop loop: br i1 %switch.i, label %ret1, label %i101 i101: %0 = load volatile i8, i8* %condition %1 = icmp eq i8 %0, 0 br i1 %1, label %bb4.i.i.i.i, label %loop bb4.i.i.i.i: %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0 br i1 %cond.i.i.i, label %bb1.preheader.i, label %ret0 ret1: ret i8 1 ret0: ret i8 0 }The double nested loop seems to be essential to getting the optimization to trigger. I wonder whether this is the case in the more complicated example too.
Ariel Ben-Yehuda at 2016-11-22 21:49:28
Sure, looks like this issue is encountered on @japaric's posted
no_stdcase too.Ariel Ben-Yehuda at 2016-11-22 22:11:31
https://llvm.org/bugs/show_bug.cgi?id=31125
Ariel Ben-Yehuda at 2016-11-22 22:59:32
Awesome, thanks for the minimization @arielb1!
Alex Crichton at 2016-11-22 23:20:24
I had the minimization correct, but I got confused on finding the pass to blame.
Ariel Ben-Yehuda at 2016-12-09 13:44:18
I'm confused. I can now grep
_ZN4core9panicking5panicwith bothopt-level=2andopt-level=3when compiling the sample in @alexcrichton's comment. Did we regress further?Cc @rust-lang/wg-codegen
Anthony Ramine at 2018-04-02 11:51:55
Triage:
futurehas changed a lot, and so the output is likely to change. I'm not great at future internals yet, but someone wanting to reproduce this bug should port the code in @alexcrichton 's comment and see how it optimizes.Steve Klabnik at 2019-09-22 15:02:59