Future::and_then and other adapters are not always zero cost

a8e15ee
Opened by Jorge Aparicio at 2023-10-08 20:49:35

I expected these two code snippets to produce the same (or very similar) code when fully optimized (release+LTO) but they don't:

let tx = busy_wait(tx.write(4));
busy_wait(tx.write(2));
busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

The first is panic!-free but the second is not; it still contains calls to panic!("cannot poll a chained future twice") even though it never polls either future after it has yielded its value.

STR

// src/main.rs
#![feature(lang_items)]
#![feature(never_type)]
#![no_main]
#![no_std]

extern crate futures;

use core::{fmt, ptr};

use futures::{Async, Future};

// entry point
#[no_mangle]
pub fn _start() -> ! {
    let tx = Tx { _0: () };

    busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

    loop {}
}

struct Tx {
    _0: (),
}

impl Tx {
    fn write(self, byte: u8) -> Write {
        Write {
            tx: Some(self),
            byte: byte,
        }
    }
}

struct Write {
    byte: u8,
    tx: Option<Tx>,
}

impl Future for Write {
    type Error = !;
    type Item = Tx;

    fn poll(&mut self) -> Result<Async<Tx>, !> {
        unsafe {
            let tx = self.tx.take().expect("cannot poll `write` twice");

            // NOTE `0x0` and `0x4` emulate memory mapped IO registers
            // Can we send data yet?
            if ptr::read_volatile(0x0 as *const bool) {
                // Send one byte of data
                ptr::write_volatile(0x4 as *mut u8, self.byte);
                Ok(Async::Ready(tx))
            } else {
                self.tx = Some(tx);
                Ok(Async::NotReady)
            }
        }
    }
}

fn busy_wait<F>(mut f: F) -> F::Item
    where F: Future<Error = !>
{
    loop {
        if let Ok(Async::Ready(t)) = f.poll() {
            return t;
        }
    }
}

#[lang = "panic_fmt"]
extern "C" fn panic_fmt(_fmt: fmt::Arguments,
                        _file: &'static str,
                        _line: u32)
                        -> ! {
    // HACK to keep the `_file` string in the final binary
    unsafe { ptr::write_volatile(0x8 as *mut _, _file) }
    loop {}
}
$ head -n7 Cargo.toml
[dependencies.futures]
default-features = false
version = "0.1.3"

[profile.release]
lto = true
panic = "abort"

$ cargo rustc --release --verbose -- -C link-arg=-nostartfiles
$ objdump -Cd target/release/foo
00000000000002c0 <_start>:
 2c0:   b0 04                   mov    $0x4,%al
 2c2:   31 c9                   xor    %ecx,%ecx
 2c4:   66 66 66 2e 0f 1f 84    data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2cb:   00 00 00 00 00
 2d0:   80 f9 01                cmp    $0x1,%cl
 2d3:   74 3b                   je     310 <_start+0x50>
 2d5:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
 2dc:   00 00 00 00
 2e0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2e7:   01
 2e8:   74 f6                   je     2e0 <_start+0x20>
 2ea:   88 04 25 04 00 00 00    mov    %al,0x4
 2f1:   84 c9                   test   %cl,%cl
 2f3:   75 3d                   jne    332 <_start+0x72>
 2f5:   b1 01                   mov    $0x1,%cl
 2f7:   b0 02                   mov    $0x2,%al
 2f9:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 300:   01
 301:   74 cd                   je     2d0 <_start+0x10>
 303:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 30a:   02
 30b:   eb 23                   jmp    330 <_start+0x70>
 30d:   0f 1f 00                nopl   (%rax)
 310:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 317:   01
 318:   74 f6                   je     310 <_start+0x50>
 31a:   88 04 25 04 00 00 00    mov    %al,0x4
 321:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 328:   0f 1f 84 00 00 00 00
 32f:   00
 330:   eb fe                   jmp    330 <_start+0x70>
 332:   50                      push   %rax
 333:   e8 38 00 00 00          callq  370 <core::panicking::panic::h194ce5d68a8f28a1>
 338:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 33f:   00

0000000000000340 <core::panicking::panic_fmt::h561c5ee168a3d2cb>:
(..)
$ objcopy -O binary target/release/foo foo.bin

$ strings foo.bin
(..)
/home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.3/src/chain.rs

If the program is changed to the no-and_then variant, the disassembly looks like this:

$ objdump -Cd target/release/foo
0000000000000280 <_start>:
 280:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 287:   01
 288:   74 f6                   je     280 <_start>
 28a:   c6 04 25 04 00 00 00    movb   $0x4,0x4
 291:   04
 292:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 299:   1f 84 00 00 00 00 00
 2a0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2a7:   01
 2a8:   74 f6                   je     2a0 <_start+0x20>
 2aa:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 2b1:   02
 2b2:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2b9:   1f 84 00 00 00 00 00
 2c0:   eb fe                   jmp    2c0 <_start+0x40>

There are no panic_fmt calls in it.

Meta

$ rustc -V
rustc 1.15.0-nightly (43006fcea 2016-11-15)

I don't think this is a problem with the implementation of AndThen/Chain in the futures crate because I have tried to re-implement Future in a few different ways e.g. without error-handling, sprinkling #[inline] everywhere but none of that helps.

Code like this:

busy_wait(futures::done(Ok(42)).and_then(|x| futures::done(Ok(x))));

optimizes the same as the "split" version and in that case LLVM evaluates the expression at compile time and replaces it with 42. So LLVM can actually lower and_then to panic! free code; it just doesn't optimize well this particular case (perhaps, because of the volatile memory operations?)

join is another adapter that has the same issue.

cc @alexcrichton @eddyb

  1. Hilariously this is not an issue at -Copt-level=2 (how I initially tried to reproduce this). This is an issue at -Copt-level=3

    Alex Crichton at 2016-11-22 16:05:06

  2. Opened https://github.com/rust-lang/rust/issues/37933

    Alex Crichton at 2016-11-22 16:53:24

  3. Whoa wow I totally thought this was in the futures-rs repo, not the rust-lang/rust repo! Ok, inlining https://github.com/rust-lang/rust/issues/37933 into here:


    Originally reported at https://github.com/rust-lang/rust/issues/37930, minimized at https://gist.github.com/alexcrichton/86db63138ef979422d8fa10538dc6dcb, the code optimizes better at opt-level=2 than opt-level=3.

    For example:

    $ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=2 && grep 'call .* @_ZN4core9panicking5panic' main.ll
    $ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=3 && grep 'call .* @_ZN4core9panicking5panic' main.ll
      tail call void @_ZN4core9panicking5panic17h194ce5d68a8f28a1E({ %str_slice, %str_slice, i32 }* noalias readonly dereferenceable(40) bitcast ({ %str_slice, %str_slice, i32, [4 x i8] }* @"_ZN47_$LT$main..Chain$LT$A$C$$u20$B$C$$u20$C$GT$$GT$4poll14_MSG_FILE_LINE17hf0dd0d6c8e2c1d26E" to { %str_slice, %str_slice, i32 }*))
    

    That is, with opt-level=2 we can prove the code doesn't panic, but at opt-level=3 something goes awry and we don't prove that it doesn't panic.

    Do we need to tweak our opt-level=3 pipeline? Has it diverged from clang? Is it inappropriate to run by default?

    Alex Crichton at 2016-11-22 16:54:29

  4. I should also note that the opt-level=2 (set in Cargo.toml's profile.release) has no effect if you are still using the futures crates. IOW, it's necessary to inline the futures code into a single crate and set opt-level=2.

    Jorge Aparicio at 2016-11-22 16:57:42

  5. I am investigating @alexcrichton's issue.

    Some code generated ends up looking like

      %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
      %trunc.i.i.i = trunc i8 %f.sroa.0.0.ph.i to i2
      %switch.i = icmp eq i2 %trunc.i.i.i, 0
    ...
    x-block:
      br i1 %switch.i, label %x2-block, label %x1-block
    ...
    y-block:
      %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
      br i1 %cond.i.i.i, label %y1-block, label %y2-block
    

    InstCombine manages to combine the 2 compares and transform the phi to a phi on booleans, which leads to SCCP to later optimize out one of the conditionals.

    -C opt-level=3, through some convoluted path involving it maintaining a branch table of the form

      switch i2 %trunc.i.i, label %unreachable.i.i [
        i2 0, label %bb2.i.i
        i2 1, label %bb3.i.i
        i2 -2, label %bb1.i.i
      ]
    

    through way too many passes, ends up with a conditional of the form

      %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
      %switch = icmp eq i8 %f.sroa.0.0.ph.i, 1
    ...
    x-block:
      br i1 %switch, label %x1-block, label %x2-block
    ...
    y-block:
      %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
      br i1 %cond.i.i.i, label %y1-block, label %y2-block
    

    Which it is then unable to optimize.

    Ariel Ben-Yehuda at 2016-11-22 21:28:26

  6. Minimal example (if both icmps have the same bitwise value, this optimizes to a switch of true/false. If they don't, it doesn't):

    define i8 @_start(i8* %condition) {
    entry-block:
      br label %bb1.preheader.i
    
    bb1.preheader.i:
      %f.sroa.0.0.ph.i = phi i8 [ 1, %bb4.i.i.i.i ], [ 0, %entry-block ]
      %switch.i = icmp eq i8 %f.sroa.0.0.ph.i, 1
      br label %loop
    
    loop:
      br i1 %switch.i, label %ret1, label %i101
    
    i101:
      %0 = load volatile i8, i8* %condition
      %1 = icmp eq i8 %0, 0
      br i1 %1, label %bb4.i.i.i.i, label %loop
    
    bb4.i.i.i.i:
      %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
      br i1 %cond.i.i.i, label %bb1.preheader.i, label %ret0
    
    ret1:
      ret i8 1
    
    ret0:
      ret i8 0
    }
    

    The double nested loop seems to be essential to getting the optimization to trigger. I wonder whether this is the case in the more complicated example too.

    Ariel Ben-Yehuda at 2016-11-22 21:49:28

  7. Sure, looks like this issue is encountered on @japaric's posted no_std case too.

    Ariel Ben-Yehuda at 2016-11-22 22:11:31

  8. https://llvm.org/bugs/show_bug.cgi?id=31125

    Ariel Ben-Yehuda at 2016-11-22 22:59:32

  9. Awesome, thanks for the minimization @arielb1!

    Alex Crichton at 2016-11-22 23:20:24

  10. I had the minimization correct, but I got confused on finding the pass to blame.

    Ariel Ben-Yehuda at 2016-12-09 13:44:18

  11. I'm confused. I can now grep _ZN4core9panicking5panic with both opt-level=2 and opt-level=3 when compiling the sample in @alexcrichton's comment. Did we regress further?

    Cc @rust-lang/wg-codegen

    Anthony Ramine at 2018-04-02 11:51:55

  12. Triage: future has changed a lot, and so the output is likely to change. I'm not great at future internals yet, but someone wanting to reproduce this bug should port the code in @alexcrichton 's comment and see how it optimizes.

    Steve Klabnik at 2019-09-22 15:02:59