Rust will memmove |self| when passed by value to an inline function

Opened by Bobby Holley at 2024-02-24 19:42:47

See the assembly for https://is.gd/0Owhgw . Note that you'll need to explicitly specify Release, because I can't link to Release code for some reason.

Note this bit here:

    movl    $400, %edx
    movq    %rbx, %rsi
    callq   memcpy@PLT

That looks like a very unnecessary and large copy.

I ran into this working on rust-selectors for stylo. In the parser, we have a very large stack-allocated SmallVec, under the assumption that it is never moved. I added some inline functions that take the buffer as |self|, and parsing got several percent slower. I rewrote the code to pass things as &mut, and the overhead went away.
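A minimal sketch of the pattern described above (names and sizes are illustrative, not the actual stylo code): `consume` takes the buffer by value, so rustc emits a copy of the whole struct into the callee's argument slot before LLVM ever gets a chance to inline, while `process` borrows it and passes only a pointer.

```rust
// Hypothetical stand-in for the large stack-allocated buffer.
struct BigBuf {
    data: [u64; 50], // 400 bytes, matching the memcpy size in the assembly above
}

#[inline]
fn consume(buf: BigBuf) -> u64 {
    // Takes `self`-style by-value: rustc generates the copy before LLVM inlines.
    buf.data.iter().sum()
}

#[inline]
fn process(buf: &BigBuf) -> u64 {
    // Takes a reference: only a pointer is passed, no 400-byte copy.
    buf.data.iter().sum()
}

fn main() {
    let buf = BigBuf { data: [5; 50] };
    let a = process(&buf); // no copy of the buffer
    let b = consume(buf);  // buf is moved (memcpy'd) into `consume`
    assert_eq!(a, b);
}
```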

Interestingly, doing:

for x in some_small_vec.into_iter() { ... }

Doesn't have any memcpy in the disassembly, despite the fact that into_iter() takes self. Presumably there's some iterator optimization going on.

CC @SimonSapin @mbrubeck @Manishearth @Gankro @jonathandturner

  1. (Deleted my last comment, which was intended for somewhere else).

    Bobby Holley at 2017-06-20 05:00:09

  2. Actually, I was wrong. This totally affects SmallVec::into_iter() :-(

    Paste the following into play: https://gist.github.com/bholley/69dcf3654dfde0ebd8bb8bd8a7c2386d

    I guess using drain() is a reasonable workaround. Really bad footgun though.
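A minimal sketch of the drain() workaround, using a plain Vec in place of SmallVec so the example is self-contained: drain() takes &mut self, so the container itself is never moved (memcpy'd) into the iterator, while into_iter() takes self and can force a copy of an inline buffer.

```rust
fn main() {
    let mut v: Vec<u64> = (0..50).collect();

    // drain(..) borrows the vector and yields owned elements; nothing the
    // size of the container is moved, unlike into_iter() which takes `self`.
    let total: u64 = v.drain(..).sum();
    assert_eq!(total, 1225); // 0 + 1 + ... + 49

    // The (now empty) vector is still usable afterwards.
    assert!(v.is_empty());
}
```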

    Bobby Holley at 2017-06-20 23:11:18

  3. cc @rust-lang/compiler

    Michael Woerister at 2017-06-21 14:00:06

  4. cc @pcwalton Is this the same memcpy optimization problem?

    Eduard-Mihai Burtescu at 2017-06-21 18:49:14

  5. @eddyb - I'm going to show my ignorance here a bit, but why is this a memcpy optimization problem? If the function is inlined, shouldn't the variable be reused rather than memcpy'd? When @bholley described it, it just sounded like perhaps a bug in the inliner. (Though, again, it's been a while since I worked in that area.)

    Sophia J. Turner at 2017-06-21 18:53:46

  6. @jonathandturner LLVM inlines after rustc generates the copies.

    Eduard-Mihai Burtescu at 2017-06-21 19:04:37

  7. It isn't in general safe to remove these copies when an inline happens. For example, modifications to the inner function's copy shouldn't be visible to the outer function's copy. Similarly, pointers in the inner function may alias the outer function's copy of the value.

    Rust has a lot of info that can prove eliminating these copies is safe, but llvm doesn't generally have that information. For instance you can easily argue the copy is unnecessary when T: !Copy, but I don't think llvm has any notion of move semantics or deinitializing memory.

    So either we need to feed llvm stronger alias information and/or improve llvm's optimization passes, or we need to make Rust (MIR) do the inlining so it can apply its domain-specific knowledge to elide the copies.

    Either way, fixing this is actually pretty non-trivial, afaict.
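A small illustration of the first hazard above (names are illustrative): the callee mutates its own copy of a Copy type, and that mutation must not be visible to the caller, so an optimizer cannot blindly reuse the caller's storage for the callee's parameter.

```rust
#[derive(Clone, Copy)]
struct Counter {
    n: u64,
}

#[inline]
fn bump(mut c: Counter) -> u64 {
    c.n += 1; // modifies the callee's copy only
    c.n
}

fn main() {
    let c = Counter { n: 41 };
    let inner = bump(c);
    assert_eq!(inner, 42);
    assert_eq!(c.n, 41); // the caller's copy must remain untouched
}
```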

    Aria Desires at 2017-06-21 19:20:37

  8. > For example modifications to the inner function's copy shouldn't be visible to the outer function's copy.

    I'm trying to think of a situation where this occurs; it seems like Rust protects you:

    • move: you have ownership, so you're free to modify
    • shared borrow: you can't modify
    • mutable borrow: your changed version will be visible to the calling function

    Though I guess maybe that's less the case than it seems at first blush?

    Sophia J. Turner at 2017-06-21 19:26:47

  9. Doesn't apply to Copy types. We did try to avoid some of the copies for non-Copy types although I'm not clear on how much that helps or if it was even kept in MIR trans.
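A sketch of the move-semantics argument above, with illustrative names: for a type that is not Copy, the caller can never observe the value after the call, so the copy into the callee is provably dead; the borrow checker enforces this at compile time.

```rust
struct Buf {
    data: Vec<u64>, // Vec is not Copy, so Buf is not Copy
}

fn take(b: Buf) -> usize {
    b.data.len()
}

fn main() {
    let b = Buf { data: vec![5; 50] };
    let n = take(b);
    // let _ = b.data.len(); // error[E0382]: borrow of moved value `b` —
    //                       // the caller cannot observe `b` after the move,
    //                       // which is why the copy could in principle be elided.
    assert_eq!(n, 50);
}
```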

    Eduard-Mihai Burtescu at 2017-06-21 19:38:56

  10. This issue and others like it apparently come up quite regularly in stylo and are something of a hindrance to achieving its promised performance. It would be good if the compiler team could think a bit about how to unblock some of the advanced optimizations necessary here. cc @nikomatsakis @rust-lang/compiler

    Brian Anderson at 2017-07-05 21:54:01

  11. Hmm. We discussed this for some time in the @rust-lang/compiler team meeting but didn't decide on a precise course of action. It seems really important to make progress here, but we're not sure how to allocate the "person power" to make it happen, and we didn't want to call this P-high without somebody who is planning to follow-up.

    One thing that I personally think would be helpful is to drill a bit more into the details of when memmoves etc. occur and whether we can categorize the various cases better, and also to get a clearer idea of what sorts of annotations or other changes would help us optimize better.

    cc @arielb1, who might have thoughts on the matter

    Niko Matsakis at 2017-07-06 20:45:01

  12. triage: P-medium

    Niko Matsakis at 2017-08-24 20:51:24

  13. Are there any suggestions on ways to catch instances of this issue? I came across a very obvious case where the run-time of my long-running program increased by a factor of 15x because I did something like this in a loop:

    let a = if flag { s.large_array1 } else { s.large_array2 };
    f(a[x - 1], a[x], a[x + 1]);
    

    I wasn't expecting a memcpy to occur here, but clearly it was copying the entire array every time I ran this code. Using &s.large_array1 fixed it.

    I work with large arrays all over my code, so I'm sure more subtle instances are occurring. This seems like a pretty serious performance issue for anyone working with large Copy types.
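A self-contained sketch of the fix described above (the struct, sizes, and `f` are illustrative): binding a reference to the large array instead of binding it by value avoids copying the whole array on every iteration.

```rust
struct S {
    large_array1: [f64; 512], // 4 KiB each; copying these by value is expensive
    large_array2: [f64; 512],
}

fn f(a: f64, b: f64, c: f64) -> f64 {
    a + b + c
}

fn main() {
    let s = S {
        large_array1: [1.0; 512],
        large_array2: [2.0; 512],
    };
    let flag = true;
    let x = 10;

    // Borrow instead of copy: `a` is a `&[f64; 512]`, so selecting between
    // the two arrays moves only a pointer, not 4 KiB of data.
    let a = if flag { &s.large_array1 } else { &s.large_array2 };
    let y = f(a[x - 1], a[x], a[x + 1]);
    assert_eq!(y, 3.0);
}
```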

    robsmith11 at 2018-05-05 07:27:07

  14. You can dump the assembly and then grep for memcpy with one line of before-context, which should show you the sizes (passed as immediates) of all the memcpys in the binary. From there it's pretty straightforward to look for the biggest ones. I did this successfully for Firefox.

    Bobby Holley at 2018-05-05 19:18:29

  15. FWIW, I put a tool together that helps find these memcpy's: https://github.com/jrmuizel/memcpy-find

    Jeff Muizelaar at 2018-11-28 18:41:34

  16. It looks like the memcpy in the original test case is removed when running with -Zmir-opt-level=2.

    example::main:
            sub     rsp, 472
            mov     qword ptr [rsp + 24], 5
            mov     qword ptr [rsp + 32], 5
            mov     qword ptr [rsp + 40], 5
            mov     qword ptr [rsp + 48], 5
            mov     qword ptr [rsp + 56], 5
            mov     qword ptr [rsp + 64], 5
            mov     qword ptr [rsp + 72], 5
            mov     qword ptr [rsp + 80], 5
            mov     qword ptr [rsp + 88], 5
            mov     qword ptr [rsp + 96], 5
            lea     rax, [rsp + 104]
            mov     qword ptr [rsp + 104], 5
            mov     qword ptr [rsp + 112], 5
            mov     qword ptr [rsp + 120], 5
            mov     qword ptr [rsp + 128], 5
            mov     qword ptr [rsp + 136], 5
            mov     qword ptr [rsp + 144], 5
            mov     qword ptr [rsp + 152], 5
            mov     qword ptr [rsp + 160], 5
            mov     qword ptr [rsp + 168], 5
            mov     qword ptr [rsp + 176], 5
            mov     qword ptr [rsp + 184], 5
            mov     qword ptr [rsp + 192], 5
            mov     qword ptr [rsp + 200], 5
            mov     qword ptr [rsp + 208], 5
            mov     qword ptr [rsp + 216], 5
            mov     qword ptr [rsp + 224], 5
            mov     qword ptr [rsp + 232], 5
            mov     qword ptr [rsp + 240], 5
            mov     qword ptr [rsp + 248], 5
            mov     qword ptr [rsp + 256], 5
            mov     qword ptr [rsp + 264], 5
            mov     qword ptr [rsp + 272], 5
            mov     qword ptr [rsp + 280], 5
            mov     qword ptr [rsp + 288], 5
            mov     qword ptr [rsp + 296], 5
            mov     qword ptr [rsp + 304], 5
            mov     qword ptr [rsp + 312], 5
            mov     qword ptr [rsp + 320], 5
            mov     qword ptr [rsp + 328], 5
            mov     qword ptr [rsp + 336], 5
            mov     qword ptr [rsp + 344], 5
            mov     qword ptr [rsp + 352], 5
            mov     qword ptr [rsp + 360], 5
            mov     qword ptr [rsp + 368], 5
            mov     qword ptr [rsp + 376], 5
            mov     qword ptr [rsp + 384], 5
            mov     qword ptr [rsp + 392], 5
            mov     qword ptr [rsp + 400], 5
            mov     qword ptr [rsp + 408], 5
            mov     qword ptr [rsp + 416], 5
            lea     rcx, [rip + .L__unnamed_1]
            mov     qword ptr [rsp + 424], rcx
            mov     qword ptr [rsp + 432], 2
            mov     qword ptr [rsp + 8], rax
            mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
            mov     qword ptr [rsp + 16], rax
            mov     qword ptr [rsp + 440], 0
            lea     rax, [rsp + 8]
            mov     qword ptr [rsp + 456], rax
            mov     qword ptr [rsp + 464], 1
            lea     rdi, [rsp + 424]
            call    qword ptr [rip + std::io::stdio::_print@GOTPCREL]
            add     rsp, 472
            ret
    

    Jeff Muizelaar at 2021-01-13 19:58:29

  17. Is that just replacing the memcpy with an inline series of moves? If so I hope we can do better. :-)

    Bobby Holley at 2021-01-15 00:23:06

  18. No. There are 50 movs to initialize with 5 and only 8 other movs to set up parameters for the call to print. It would be better if the initialization of the array could get removed too, but that's another issue.

    Here's what it looks like by default without -Zmir-opt-level=2

    example::main:
            push    r14
            push    rbx
            sub     rsp, 872
            mov     qword ptr [rsp + 24], 5
            mov     qword ptr [rsp + 32], 5
            mov     qword ptr [rsp + 40], 5
            mov     qword ptr [rsp + 48], 5
            mov     qword ptr [rsp + 56], 5
            mov     qword ptr [rsp + 64], 5
            mov     qword ptr [rsp + 72], 5
            mov     qword ptr [rsp + 80], 5
            mov     qword ptr [rsp + 88], 5
            mov     qword ptr [rsp + 96], 5
            mov     qword ptr [rsp + 104], 5
            mov     qword ptr [rsp + 112], 5
            mov     qword ptr [rsp + 120], 5
            mov     qword ptr [rsp + 128], 5
            mov     qword ptr [rsp + 136], 5
            mov     qword ptr [rsp + 144], 5
            mov     qword ptr [rsp + 152], 5
            mov     qword ptr [rsp + 160], 5
            mov     qword ptr [rsp + 168], 5
            mov     qword ptr [rsp + 176], 5
            mov     qword ptr [rsp + 184], 5
            mov     qword ptr [rsp + 192], 5
            mov     qword ptr [rsp + 200], 5
            mov     qword ptr [rsp + 208], 5
            mov     qword ptr [rsp + 216], 5
            mov     qword ptr [rsp + 224], 5
            mov     qword ptr [rsp + 232], 5
            mov     qword ptr [rsp + 240], 5
            mov     qword ptr [rsp + 248], 5
            mov     qword ptr [rsp + 256], 5
            mov     qword ptr [rsp + 264], 5
            mov     qword ptr [rsp + 272], 5
            mov     qword ptr [rsp + 280], 5
            mov     qword ptr [rsp + 288], 5
            mov     qword ptr [rsp + 296], 5
            mov     qword ptr [rsp + 304], 5
            mov     qword ptr [rsp + 312], 5
            mov     qword ptr [rsp + 320], 5
            mov     qword ptr [rsp + 328], 5
            mov     qword ptr [rsp + 336], 5
            mov     qword ptr [rsp + 344], 5
            mov     qword ptr [rsp + 352], 5
            mov     qword ptr [rsp + 360], 5
            mov     qword ptr [rsp + 368], 5
            mov     qword ptr [rsp + 376], 5
            mov     qword ptr [rsp + 384], 5
            mov     qword ptr [rsp + 392], 5
            mov     qword ptr [rsp + 400], 5
            mov     qword ptr [rsp + 408], 5
            mov     qword ptr [rsp + 416], 5
            lea     r14, [rsp + 472]
            lea     rsi, [rsp + 24]
            mov     rbx, qword ptr [rip + memcpy@GOTPCREL]
            mov     edx, 400
            mov     rdi, r14
            call    rbx
            lea     rdi, [rsp + 24]
            mov     edx, 400
            mov     rsi, r14
            call    rbx
            lea     rax, [rsp + 104]
            mov     qword ptr [rsp + 8], rax
            mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
            mov     qword ptr [rsp + 16], rax
            lea     rax, [rip + .L__unnamed_1]
            mov     qword ptr [rsp + 424], rax
            mov     qword ptr [rsp + 432], 2
            mov     qword ptr [rsp + 440], 0
            lea     rax, [rsp + 8]
            mov     qword ptr [rsp + 456], rax
            mov     qword ptr [rsp + 464], 1
            lea     rdi, [rsp + 424]
            call    qword ptr [rip + std::io::stdio::_print@GOTPCREL]
            add     rsp, 872
            pop     rbx
            pop     r14
            ret
    

    The initialization is still there, along with two 400-byte memcpys.

    Jeff Muizelaar at 2021-01-15 00:56:36

  19. Looks like nightly (probably #82806) removes the memcpy from the original example (https://godbolt.org/z/nGfxaj), but not from the smallvec example (https://godbolt.org/z/Y34K36)

    cc @nikic

    erikdesjardins at 2021-03-13 22:18:26

  20. The memcpy was removed from the original example in 1.52: https://godbolt.org/z/5fnGqG1KG. The array initialization is still there, but that's gonna be hard to remove since println! takes a reference to the array element. It disappears if you force a move: https://godbolt.org/z/hYWT6deWo.

    The memcpy was removed from the smallvec example in 1.70: https://godbolt.org/z/6oGcfMaor. There's still some unnecessary stack traffic that seems to be due to phase ordering: https://godbolt.org/z/9o8e6eqPc.

    erikdesjardins at 2024-02-24 19:42:20