Match expressions use O(n) stack space with n branches in debug mode

3e9a1d7
Opened by Evan Wallace at 2023-12-01 01:22:06

I ran into this problem while working on a parser (https://github.com/evanw/esbuild/tree/rust). Here's a reduced test case: https://gist.github.com/evanw/06e074a1d6d5c21e8d32e2c26de07714. It contains two recursive functions, small and large, that each contain a match expression. Every call prints out the amount of stack space used.

In debug:

small:
stack space: 0.3kb
stack space: 0.7kb
stack space: 1.0kb
stack space: 1.4kb
stack space: 1.7kb
stack space: 2.1kb
stack space: 2.4kb
stack space: 2.8kb
stack space: 3.1kb
stack space: 3.4kb
stack space: 3.8kb
large:
stack space: 0.6kb
stack space: 1.3kb
stack space: 1.9kb
stack space: 2.6kb
stack space: 3.2kb
stack space: 3.8kb
stack space: 4.5kb
stack space: 5.1kb
stack space: 5.8kb
stack space: 6.4kb
stack space: 7.0kb

In release:

small:
stack space: 0.0kb
stack space: 0.1kb
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.5kb
stack space: 0.6kb
stack space: 0.7kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.0kb
stack space: 1.1kb
large:
stack space: 0.0kb
stack space: 0.1kb
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.5kb
stack space: 0.6kb
stack space: 0.7kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.0kb
stack space: 1.1kb

I would expect the amount of stack space used by a match expression to be proportional to the stack space of the largest branch, not to the total stack space of all branches. The problem isn't too bad here but it causes my actual parser to use huge amounts of stack space and to crash with a stack overflow when parsing virtually all normal-sized inputs.

  1. I have investigated this somewhat and have found that the stack usage is from register spilling. For some reason LLVM is spilling the registers to a separate location in each branch around the overflow check. I have no idea why it might be doing this though.

    Note that without the println! the IR we produce only creates stack slots for the three arguments.

    James Miller at 2016-06-15 06:33:22

  2. /cc @rust-lang/compiler

    James Miller at 2016-06-15 06:33:45

  3. Oh interesting, yeah maybe it only happens when there are early exits inside a match expression. Removing my heavy use of try! inside the match expressions in my actual parser makes my stack size problem completely disappear, but then of course I can't recover from parse errors.

    Evan Wallace at 2016-06-15 07:09:14

  4. I would expect that this may be due to the specifics of the lifetimes that we emit for the bindings.

    Niko Matsakis at 2016-06-15 07:51:40

  5. Oh, never mind, just saw @aatch's comment.

    Niko Matsakis at 2016-06-15 07:51:57

  6. @Aatch seems like the fast register allocator spills all live registers at the end of each basic block.

    Björn Steinbrink at 2016-06-15 12:53:05

  7. I suspect this may have far reaching implications making code worse all over the place. For example Stylo is full of very large match expressions and I wonder if that limitation is making it unnecessarily bloated. I may be completely wrong though, given @Aatch's comment about println!.

    Cc @rust-lang/wg-codegen

    Anthony Ramine at 2018-04-02 15:15:16

  8. Triage (1.44.1)

    We definitely hit this in debug in rust-analyzer. Our code for expression lowering is a single giant recursive match, and it uses 20k of stack space per recursion level in debug if all branches are there. If I comment out all the branches which are not used in my specific test, stack space usage goes down dramatically.

    In release, I think I am seeing stack space proportional to the max branch, as commenting branches doesn't make that huge a difference with --release. I think this is true about the original report as well? With --release, both small and large use the same amount of stack space.

    See https://github.com/rust-analyzer/rust-analyzer/pull/5316/commits/12d52a726a1e6ca0f6b08343d13d0f6b50d94d8c for a real-world example of this.

    Alex Kladov at 2020-07-11 16:02:53

  9. Maybe @rust-lang/wg-mir-opt can do something to clean up the match logic to make it easier for llvm to figure out the register spilling correctly.

    Oli Scherer at 2020-07-12 11:26:40

  10. The problem here is having live values across BB boundaries, because the register allocator in debug mode simply spills and reloads everything, even for unconditional branches.

    Silly example:

    define internal i8 @testcase(i8 %0) {
      br label %bb2
    
    bb2:
      ret i8 %0
    }
    

    becomes:

    testcase:                               # @testcase
      .cfi_startproc
    # %bb.0:
                                              # kill: def $dil killed $dil killed $edi
      movb>-%dil, -1(%rsp)          # 1-byte Spill
      jmp>.LBB15_1
    .LBB15_1:                               # %bb2
      movb>--1(%rsp), %al           # 1-byte Reload
      retq
    

    And in this example, it's not so much the match itself, but the overflow check that causes values that are live across BB boundaries. Compiling with -Cdebug-assertions=no gives the same stack usage for small and large.

    small:
    stack space: 0.2kb
    stack space: 0.4kb
    stack space: 0.6kb
    stack space: 0.8kb
    stack space: 0.9kb
    stack space: 1.1kb
    stack space: 1.3kb
    stack space: 1.5kb
    stack space: 1.7kb
    stack space: 1.9kb
    stack space: 2.1kb
    large:
    stack space: 0.2kb
    stack space: 0.4kb
    stack space: 0.6kb
    stack space: 0.8kb
    stack space: 0.9kb
    stack space: 1.1kb
    stack space: 1.3kb
    stack space: 1.5kb
    stack space: 1.7kb
    stack space: 1.9kb
    stack space: 2.1kb
    

    Each overflow check causes two spill/reload pairs. One for token (1 byte) and one for the result of the subtraction. Which, for alignment reasons, adds up to 8 bytes of stack usage each. I'm not sure there's much we can do there in terms of MIR construction, but I'd love to be proven wrong here :-)

    Also, a good bit of the stack usage is actually used by the println! call. Without the output, small uses 136 bytes of stack with debug assertions, and large uses 440 bytes. Without debug assertions, both use 40 bytes of stack.

    In the general case the difference between debug and release mode, can probably be explained by the fact that in release mode, not only do we get a better register allocator, but we also use lifetime intrinsics in LLVM, which allow stack allocated values that are used in only one arm to share space with values only used in other arms. The latter would explain why the observed stack usage in the rust analyzer example goes from Sum(arms) to Max(arms). Short of doing some form of stack coloring of our own, I don't see a way to improve this in terms of MIR generation either.

    Björn Steinbrink at 2020-07-12 12:59:53

  11. While we can move the count - 1 computation out of the match arms or create a mir optimization that can figure out that the overflow check is unnecessary, because we already checked for count > 0, this won't help in general. E.g. a developer replacing all the count - 1 with count.wrapping_sub(1) just places a function call where there was an overflow check, giving us the same additional basic block. So assuming we want to handle, yea I agree that we can't do anything with a mir-optimization beyond a bunch of not too-high-hanging fruit.

    Oli Scherer at 2020-07-12 13:29:04

  12. This hit full-moon, where users are getting stack overflows for standard input. I am not performing one long match, but several in a row; if this is not the same bug, let me know.

    https://users.rust-lang.org/t/stack-overflow-on-debug-mode-but-not-release-cant-figure-out-solution-without-increasing-stack-allocation-for-release/59869

    boyned//Kampfkarren at 2021-05-17 09:36:10