write!(wr,"foo") is 10% to 72% slower than wr.write("foo".as_bytes())

7d6225f
Opened by Erick Tryzelaar at 2023-12-09 06:49:29

This example demonstrates that the trivial case of write!(wr, "foo") is much slower than calling wr.write("foo".as_bytes()):

extern mod extra;

use std::io::mem::MemWriter;
use extra::test::BenchHarness;

#[bench]
fn bench_write_value(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        for _ in range(0, 1000) {
            mem.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_ref(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            wr.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_macro1(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            write!(wr, "abc");
        }
    });
}

#[bench]
fn bench_write_macro2(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            write!(wr, "{}", "abc");
        }
    });
}

With no optimizations:

running 4 tests
test bench_write_macro1 ... bench:    280153 ns/iter (+/- 73615)
test bench_write_macro2 ... bench:    322462 ns/iter (+/- 24886)
test bench_write_ref    ... bench:     79974 ns/iter (+/- 3850)
test bench_write_value  ... bench:     78709 ns/iter (+/- 4003)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

With --opt-level=3:

running 4 tests
test bench_write_macro1 ... bench:     62397 ns/iter (+/- 5485)
test bench_write_macro2 ... bench:     80203 ns/iter (+/- 3355)
test bench_write_ref    ... bench:     55275 ns/iter (+/- 5156)
test bench_write_value  ... bench:     56273 ns/iter (+/- 7591)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Is there anything we can do to improve this? I can think of a couple options, but I bet there are more:

  • Special-case the no-argument write! so it compiles down to wr.write("foo".as_bytes()). If we go this route, it'd be nice to also collapse a call whose arguments are all string literals, like write!(wr, "foo {} {}", "bar", "baz"), into a single write.
  • Revive wr.write_str("foo"). From what I understand, that's being blocked on #6164.
  • Figure out why LLVM isn't able to optimize away the write! overhead. Are there functions that should be getting inlined but are not? My scattershot attempt didn't get any results.
  1. I suspect that fn bench_write_ref does not actually make virtual calls. Adding this:

    #[inline(never)]
    fn writer_write(w: &mut Writer, b: &[u8]) {
        w.write(b);
    }
    
    #[bench]
    fn bench_write_virt(bh: &mut BenchHarness) {
        bh.iter(|| {
            let mut mem = MemWriter::new();
            let wr = &mut mem as &mut Writer;
            for _ in range(0, 1000) {
                writer_write(wr, "abc".as_bytes());
            }
        });
    }
    

    I get, with --opt-level=3:

    running 5 tests
    test bench_write_macro1 ... bench:    680823 ns/iter (+/- 34497)
    test bench_write_macro2 ... bench:    950790 ns/iter (+/- 72309)
    test bench_write_ref    ... bench:    505846 ns/iter (+/- 41965)
    test bench_write_value  ... bench:    511815 ns/iter (+/- 36681)
    test bench_write_virt   ... bench:    553466 ns/iter (+/- 43716)
    

    So the virtual call has an impact, but it does not fully explain the slowness of write!() shown by this benchmark.

    Guillaume P. at 2013-12-02 08:56:08

  2. Visiting for bug triage. This seems to still be present. The code to run the benchmarks is now:

    extern crate test;
    
    use std::io::MemWriter;
    use test::Bencher;
    
    #[bench]
    fn bench_write_value(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = MemWriter::new();
            for _ in range(0u, 1000) {
                mem.write("abc".as_bytes());
            }
        });
    }
    
    #[bench]
    fn bench_write_ref(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = MemWriter::new();
            let wr = &mut mem as &mut Writer;
            for _ in range(0u, 1000) {
                wr.write("abc".as_bytes());
            }
        });
    }
    
    #[bench]
    fn bench_write_macro1(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = MemWriter::new();
            let wr = &mut mem as &mut Writer;
            for _ in range(0u, 1000) {
                write!(wr, "abc");
            }
        });
    }
    
    #[bench]
    fn bench_write_macro2(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = MemWriter::new();
            let wr = &mut mem as &mut Writer;
            for _ in range(0u, 1000) {
                write!(wr, "{}", "abc");
            }
        });
    }
    

    Without optimizations:

    running 4 tests
    test bench_write_macro1 ... bench:   1470468 ns/iter (+/- 291966)
    test bench_write_macro2 ... bench:   1799612 ns/iter (+/- 316293)
    test bench_write_ref    ... bench:   1336574 ns/iter (+/- 251664)
    test bench_write_value  ... bench:   1317880 ns/iter (+/- 254668)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
    

    With --opt-level=3:

    running 4 tests
    test bench_write_macro1 ... bench:    127671 ns/iter (+/- 1452)
    test bench_write_macro2 ... bench:    196158 ns/iter (+/- 2053)
    test bench_write_ref    ... bench:     43881 ns/iter (+/- 453)
    test bench_write_value  ... bench:     43859 ns/iter (+/- 336)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
    

    Nick Hamann at 2014-08-06 23:35:03

  3. Still an issue, using the following code:

    #![allow(unused_must_use)]
    #![feature(test)]
    
    extern crate test;
    
    use std::io::Write;
    use std::vec::Vec;
    
    use test::Bencher;
    
    #[bench]
    fn bench_write_value(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = Vec::new();
            for _ in 0..1000 {
                mem.write("abc".as_bytes());
            }
        });
    }
    
    #[bench]
    fn bench_write_ref(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = Vec::new();
            let wr = &mut mem as &mut Write;
            for _ in 0..1000 {
                wr.write("abc".as_bytes());
            }
        });
    }
    
    #[bench]
    fn bench_write_macro1(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = Vec::new();
            let wr = &mut mem as &mut Write;
            for _ in 0..1000 {
                write!(wr, "abc");
            }
        });
    }
    
    #[bench]
    fn bench_write_macro2(bh: &mut Bencher) {
        bh.iter(|| {
            let mut mem = Vec::new();
            let wr = &mut mem as &mut Write;
            for _ in 0..1000 {
                write!(wr, "{}", "abc");
            }
        });
    }
    

    Typically giving something like the following:

    $ cargo bench
    running 4 tests
    test bench_write_macro1 ... bench:      21,604 ns/iter (+/- 82)
    test bench_write_macro2 ... bench:      29,273 ns/iter (+/- 85)
    test bench_write_ref    ... bench:       1,396 ns/iter (+/- 387)
    test bench_write_value  ... bench:       1,391 ns/iter (+/- 163)
    

    Deleted user at 2015-10-30 01:15:07

  4. I got approximately the same results with rust nightly.

    ariasuni at 2017-06-23 01:42:02

  5. Just for the record, running with v1.20.0-nightly:

    $ cargo bench
    running 4 tests
    test bench_write_macro1 ... bench:      36,556 ns/iter (+/- 69)
    test bench_write_macro2 ... bench:      54,377 ns/iter (+/- 958)
    test bench_write_ref    ... bench:      13,730 ns/iter (+/- 24)
    test bench_write_value  ... bench:      13,755 ns/iter (+/- 81)
    

    Ritiek Malhotra at 2017-07-14 15:50:39

  6. Today:

    running 4 tests
    test bench_write_macro1 ... bench:      16,220 ns/iter (+/- 982)
    test bench_write_macro2 ... bench:      25,542 ns/iter (+/- 2,220)
    test bench_write_ref    ... bench:       4,889 ns/iter (+/- 314)
    test bench_write_value  ... bench:       4,819 ns/iter (+/- 956)
    

    Steve Klabnik at 2018-09-24 16:50:48

  7. Two years later:

    running 4 tests
    test bench_write_macro1 ... bench:      17,561 ns/iter (+/- 174)
    test bench_write_macro2 ... bench:      23,285 ns/iter (+/- 2,771)
    test bench_write_ref    ... bench:       3,234 ns/iter (+/- 194)
    test bench_write_value  ... bench:       3,238 ns/iter (+/- 123)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
    
    ❯ rustc +nightly --version
    rustc 1.47.0-nightly (30f0a0768 2020-08-18)
    

    Steve Klabnik at 2020-08-26 19:36:34

  8. I was curious if the difference was in the implementation of io::Write vs. fmt::Write, or in the details of write_fmt requiring format_args!, which is relevant since write! can be asked to "serve" either case.

    To discover more information on this, I compared Vec to String, a seemingly "apples to apples...ish" comparison. https://gist.github.com/workingjubilee/2d2e3a7fded1c2101aafb51dc79a7ec5

    running 10 tests
    test string_write_fmt    ... bench:      10,053 ns/iter (+/- 1,141)
    test string_write_macro1 ... bench:      10,177 ns/iter (+/- 2,363)
    test string_write_macro2 ... bench:      17,499 ns/iter (+/- 1,847)
    test string_write_ref    ... bench:       2,270 ns/iter (+/- 265)
    test string_write_value  ... bench:       2,333 ns/iter (+/- 126)
    test vec_write_fmt       ... bench:      15,722 ns/iter (+/- 1,673)
    test vec_write_macro1    ... bench:      15,767 ns/iter (+/- 1,638)
    test vec_write_macro2    ... bench:      23,968 ns/iter (+/- 8,942)
    test vec_write_ref       ... bench:       2,296 ns/iter (+/- 178)
    test vec_write_value     ... bench:       2,230 ns/iter (+/- 235)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out
    

    Jubilee at 2020-09-05 23:54:43

  9. I found it interesting that this example revealed that String actually had less overhead than Vec on write! (the benchmark had high variance but I considered it indicative enough).

    Then I looked and saw:

        fn write_fmt(&mut self, fmt: fmt::Arguments<'_>) -> Result<()> {
            // Create a shim which translates an io::Write to a fmt::Write and saves
            // off I/O errors. instead of discarding them
            struct Adaptor<'a, T: ?Sized + 'a> {
                inner: &'a mut T,
                error: Result<()>,
            }
    
            /* More code related to implementing and using the resulting shim,
             * seemingly involving a lot of poking a reference at runtime??? */
        }
    

    So I am no longer surprised. I believe some cases can be fast-pathed, but this is not actually an apples-to-apples comparison: the orange is write_fmt, which has to drag in a variety of formatting machinery. Either write! should skip write_fmt in simple cases where the formatting machinery is not required, or the formatting machinery itself should be much faster on the obvious fast path.

    Jubilee at 2020-09-07 22:19:46

  10. (checking oldest issues for fun and profit)

    Did fmt::Arguments::as_str improve the situation here?

    Crystal Durham at 2022-07-22 16:56:01

  11. Not really (nightly 2022-07-23):

    running 5 tests
    test string_write_fmt    ... bench:       6,953 ns/iter (+/- 161)
    test string_write_macro1 ... bench:       7,034 ns/iter (+/- 227)
    test string_write_macro2 ... bench:       9,308 ns/iter (+/- 760)
    test string_write_ref    ... bench:         688 ns/iter (+/- 32)
    test string_write_value  ... bench:         673 ns/iter (+/- 34)
    

    Jakub Beránek at 2022-08-10 15:27:57

  12. It seems the remaining gap exists just because nobody has bothered to add the fast path yet. https://gist.github.com/CAD97/2922d3132777310c1e22564f6c6dabdf

    running 8 tests
    test new_string_write_macro1 ... bench:       1,875 ns/iter (+/- 228)
    test new_string_write_macro2 ... bench:      13,594 ns/iter (+/- 999)
    test new_vec_write_macro1    ... bench:       1,637 ns/iter (+/- 175)
    test new_vec_write_macro2    ... bench:      13,808 ns/iter (+/- 1,320)
    test old_string_write_macro1 ... bench:      10,080 ns/iter (+/- 1,618)
    test old_string_write_macro2 ... bench:      13,880 ns/iter (+/- 1,327)
    test old_vec_write_macro1    ... bench:      10,037 ns/iter (+/- 779)
    test old_vec_write_macro2    ... bench:      14,783 ns/iter (+/- 1,280)
    
    --- old/src/std/io/mod.rs
    +++ new/src/std/io/mod.rs
    @@ -1650,8 +1650,12 @@
         #[stable(feature = "rust1", since = "1.0.0")]
         fn write_fmt(&mut self, fmt: fmt::Arguments<'_>) -> Result<()> {
    +        if let Some(s) = fmt.as_str() {
    +            return self.write_all(s.as_bytes());
    +        }
    +
             // Create a shim which translates a Write to a fmt::Write and saves
             // off I/O errors. instead of discarding them
             struct Adapter<'a, T: ?Sized + 'a> {
                 inner: &'a mut T,
                 error: Result<()>,
             }
    --- old/src/core/fmt/mod.rs
    +++ new/src/core/fmt/mod.rs
    @@ -186,4 +186,8 @@
         #[stable(feature = "rust1", since = "1.0.0")]
         fn write_fmt(mut self: &mut Self, args: Arguments<'_>) -> Result {
    +        if let Some(s) = args.as_str() {
    +            self.write_str(s)
    +        } else {
    -        write(&mut self, args)
    +            write(&mut self, args)
    +        }
         }
    

    Crystal Durham at 2022-08-11 10:05:15

  13. It doesn't solve all the cases, but this definitely sounds like a nice improvement that could be done in parallel with the ongoing fmt rewrite attempts. Will you send a PR? :)

    Jakub Beránek at 2022-08-11 10:11:34

  14. ten years later: rustc 1.76.0-nightly (f967532a4 2023-12-08)

    test bench_write_macro1 ... bench:       6,062 ns/iter (+/- 41)
    test bench_write_macro2 ... bench:       6,070 ns/iter (+/- 233)
    test bench_write_ref    ... bench:         838 ns/iter (+/- 49)
    test bench_write_value  ... bench:         813 ns/iter (+/- 7)
    

    kekeimiku at 2023-12-09 04:06:03

  15. The macro1 and macro2 versions should be mostly identical now, thanks to format args flattening. Sadly, it's still not as fast as the non-macro version. Mara has recently posted an awesome blog post where she explains the different trade-offs of the format args macros. It's not practical to perfectly expand the macros everywhere, because while that could improve runtime performance, it could also hurt compile times and binary size a lot. That being said, in this simple case, where only a string is being written and only Display is used, it would be great if we could optimize it better.

    Jakub Beránek at 2023-12-09 06:49:29