Tracking issue for oom=panic (RFC 2116)

ee1b88e
Opened by Aria Desires at 2022-10-13 19:51:12

Several users who are invested in the "thread is a task" model that Rust was originally designed around have expressed a desire to unwind in the case of OOM. Specifically each task has large and unpredictable memory usage, and tearing down the first task that encounters OOM is considered a reasonable strategy.

Aborting on OOM is considered unacceptable because it would be expensive to discard all the work that the other tasks have managed to do.

We already (accidentally?) specified that all allocators can panic on OOM, including the global one, with the introduction of generic allocators on collections. So in theory no one "should" be relying on the default oom=abort semantics. This knob would only affect the default global allocator. Presumably there would just be a function somewhere in the stdlib which is cfg'd to be a panic or abort based on this flag?

This is not a replacement for proper fallible allocation handling routines requested in #29802 because most of the potential users of that API build with panic=abort. Also the requesters of this API are unwilling to go through the effort to ensure all their dependents are using fallible allocations everywhere.

I am dubious on this API but I'm filling an issue so that it's written down and everyone who wants it has a place to go and assert that it is in fact good and desirable.

  1. Update: This now the tracking issue for a mechanism to make OOM / memory allocation errors panic (and so by default unwind the thread) instead of aborting the process. This feature was accepted in RFC https://github.com/rust-lang/rfcs/pull/2116.

    It was separated from https://github.com/rust-lang/rust/issues/48043, which tracks another feature of the same RFC, since the two features will likely be stabilized at different times.

    Blockers

    This should be blocked until we can audit all places in the code that allocate to ensure they safely handle unwinding on OOM.

    • [ ] https://github.com/rust-lang/rust/issues/123369
    • [ ] https://github.com/rust-lang/unsafe-code-guidelines/issues/506

    Steps:

    • [ ] Implement the RFC
      • [ ] std::alloc::set_allocation_error_hook (tracked at https://github.com/rust-lang/rust/issues/51245) could potentially be this mechanism, but that hook is currently document as not allowed to unwind.
    • [ ] Adjust documentation (see instructions on forge)
    • [ ] Stabilization PR (see instructions on forge)

    Original feature request:


    Several users who are invested in the "thread is a task" model that Rust was originally designed around have expressed a desire to unwind in the case of OOM. Specifically each task has large and unpredictable memory usage, and tearing down the first task that encounters OOM is considered a reasonable strategy.

    Aborting on OOM is considered unacceptable because it would be expensive to discard all the work that the other tasks have managed to do.

    We already (accidentally?) specified that all allocators can panic on OOM, including the global one, with the introduction of generic allocators on collections. So in theory no one "should" be relying on the default oom=abort semantics. This knob would only affect the default global allocator. Presumably there would just be a function somewhere in the stdlib which is cfg'd to be a panic or abort based on this flag?

    This is not a replacement for proper fallible allocation handling routines requested in #29802 because most of the potential users of that API build with panic=abort. Also the requesters of this API are unwilling to go through the effort to ensure all their dependents are using fallible allocations everywhere.

    I am dubious on this API but I'm filling an issue so that it's written down and everyone who wants it has a place to go and assert that it is in fact good and desirable.

    Simon Sapin at 2018-06-27 12:40:05

  2. Note that in previous topics I stated that libunwind doesn't handle OOM well, but this appears to be outdated or incorrect information. They preallocate their memory before an unwind is even requested. In the limit they might run out of memory and give up but it's not like we've lost anything for trying.

    Aria Desires at 2017-08-01 18:58:33

  3. I figured I'd comment on this with respect to https://github.com/rust-lang/rust/issues/42808 and the outcome there along with the @rust-lang/libs team discussion. It was discovered in https://github.com/rust-lang/rust/issues/42808 that the integration of the Alloc trait caused a number of code size related regressions which when diagnosed were almost exclusively related to unwind tables and landing pads. This investigation then caused https://github.com/rust-lang/rust/pull/44049 as the only two stable implementations of allocators don't panic.

    Note, though, that this is not necessarily directly related to OOM but just allocator implementations themselves. Should we forbid, in general, panics from allocators? Should we only allow the oom method to panic?

    Unsure!

    Alex Crichton at 2017-08-23 21:57:48

  4. I personally think it is reasonable to forbid the allocator itself from panicking and instead have the oom method be responsible for handling allocation failures and panicking or aborting as necessary. I also think it would be useful if oom was told the size of the allocation that failed so it could heuristically determine whether it is an actual memory exhaustion event or just someone accidentally specifying a silly large number.

    Peter Atashian at 2017-08-23 22:05:40

  5. @retep998 fn oom does receive the AllocErr, which for AllocErr::Exhausted will carry the input Layout that was requested. Does that not suffice for what you describe?

    (I personally wanted to try to stuff the Layout onto both AllocErr variants, but I worried about bloating the size of the error type after it was no longer an associated item for the allocator...)

    Felix S Klock II at 2017-08-24 20:32:36

  6. @pnkfelix Yep, that would do exactly what I want. It's my fault for not looking at the existing API and seeing that we already had that 😛

    Peter Atashian at 2017-08-24 20:39:35

  7. RFC https://github.com/rust-lang/rfcs/pull/2116 was accepted, I’ve repurposed this feature request into a tracking issue. See more in this thread’s original message.

    Simon Sapin at 2018-06-27 12:40:45

  8. We now have std::alloc::set_alloc_error_hook (tracked at https://github.com/rust-lang/rust/issues/51245) which is document as not allowed to unwind. If that restriction could be lifted, this hook could be the mechanism to implement this feature (by having the program register at start-up a hook that panics).

    Simon Sapin at 2018-06-27 12:46:35

  9. Minor question: if someone sets oom=panic and panic=abort, should that function identically to oom=abort, or should it be a compilation error with the rationale that such a configuration is likely a mistake on the user's part?

    bstrie at 2018-10-15 05:22:13

  10. I expect it would be mostly identical, though the error message that’s printed might not be exactly the same. I don’t know if it’s worth forbidding explicitly even if it’s not particularly useful. An observable effect (though again not particularly useful since std::alloc::set_alloc_error_hook also exists) would be that a hook set through std::panic::set_hook is run on OOM.

    Simon Sapin at 2018-10-15 07:03:40

  11. Making allocation errors default to panic has undergone FCP successfully: https://github.com/rust-lang/rust/issues/66741

    I wonder if there's a point to this cargo-level feature anymore?

    Ehsanul Hoque at 2020-05-11 02:37:09

  12. https://github.com/rust-lang/rust/issues/66741 is only relevant in programs where libstd is not used at all. This issue is about the behavior of libstd.

    Simon Sapin at 2020-05-11 13:42:51

  13. I had sent my ideas to https://github.com/rust-lang/wg-allocators/issues/62 ; but as it looks like there already is a tracking issue, let's move the comments here.

    Assume a rust server running multiple things on behalf of clients. It uses regular Rust code, that is not written to care specifically about this particular use case.

    The server wants to limit the memory consumption clients could force upon it. It does so by having each client's allocations be in an arena allocator, and then wants to abort just that client's connection in case of “OOM.”

    Unfortunately, this means that OOM must not abort the whole process, and AFAIU that is what the current behavior is, even with set_alloc_error_hook -- or at least so do the docs read like.

    I'm not sure oom=panic as a global setting could work properly in all cases, as potentially panicking requires memory allocation? But I think that already allowing set_alloc_error_hook to panic would allow the user to swap the error hook at the same time as the allocator, and in case of OOM revert to the global allocator and then, once it's possible to allocate again, perform the panic.

    Does that make sense?

    Léo Gaspard at 2020-05-16 01:04:15

  14. It's probably the case that (a) the amount of memory needed to alloc panic data is fairly low and (b) you'd hit oom from asking for some large amount that can't be offered. In this situation, you'd probably be able to allocate the panic data.

    However, if you fail to allocate while panicing you'll just get a panic, and a panic during panic is an abort. So at least you're not trapped in a loop ;P

    But also: you can't carelessly change the global allocator while allocated things are present in the world or they'll get potentially freed on the wrong allocator, which is obviously bad.

    Basically the entire scenario you want can't be done with rust's current conventions regarding memory allocation. You'd need specialized local-allocator using code, or waiting for alloc-wg to make the entire ecosystem more alloc aware over time.

    Lokathor at 2020-05-16 01:17:03

  15. Just for clarification: I was assuming the existence of a global allocator that is able to use different memory pools for allocation depending on some thread-local. Entering a task's context would set the thread-local to the task's memory pool, and either exiting it or hitting an OOM during it would reset it to the global memory pool. It knows which memory pool to free from without the memory pool being necessarily the same as the one that was configured during allocation, thanks to knowing the address of the memory block to free.

    By doing this, I think that what I want to do can be achieved with the current conventions regarding memory allocations, but iff set_alloc_error_hook can be set not to abort, but to panic instead :)

    Léo Gaspard at 2020-05-16 12:37:03

  16. Oh, and I forgot to answer your initial points: unfortunately, a client able to make the server abort is a DoS vulnerability, so… that's not a good thing :( And I can totally see the client being able to fill the pool by making the server allocate one byte at a time -- the server could try to prevent that by keeping its own counter of how much memory each task has used, but it requires potentially large changes in every library used by the server, while the solution based on allocators would work without requiring cooperation from every piece of the code (including those I don't control) :)

    Léo Gaspard at 2020-05-16 12:40:50

  17. It knows which memory pool to free from without the memory pool being necessarily the same as the one that was configured during allocation, thanks to knowing the address of the memory block to free.

    Yes, this aspect is critical to the soundness of this kind of scheme.

    iff set_alloc_error_hook can be set not to abort, but to panic instead :)

    Yes, see https://github.com/rust-lang/rust/issues/43596#issuecomment-400659080. I don’t know why this restriction exists, though.

    Simon Sapin at 2020-05-16 16:02:55

  18. For why this restriction exists, I don't know why either, but (for other people who may be reading this, I know you've already read it) the three messages starting from https://github.com/rust-lang/rust/issues/51245#issuecomment-393503805 explain that it's indeed here :)

    My first guess would be: panicking might need to allocate itself on some platforms… maybe? And potentially this allocation would not be handled by rust, and thus not be covered by rust's allocation handling, and then Things would happen.

    Now, for use cases like the one I describe, where there still is memory available for the program and it's just an arena that got full, allowing the replacement would make sense. But maybe it'd require set_alloc_error_hook to be unsafe?

    Léo Gaspard at 2020-05-16 16:58:59

  19. panicking might need to allocate itself on some platforms… maybe?

    It does, on all platforms.

    First if you use panic! with a formatting string and additional arguments, the formatted string is allocated:

    https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L362-L363

    Then that string is boxed to make panic payload pointer:

    https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L375

    If instead you use panic! with a single argument, that value is the payload which is also boxed:

    https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L424

    panic! with zero argument has a &'static str payload that goes in the same code path as with single argument.

    https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/macros.rs#L11-L12

    potentially this allocation would not be handled by rust, and thus not be covered by rust's allocation handling, and then Things would happen.

    No, these allocations do go through Rust’s global allocator.

    If they fail and call a alloc error hook that panics, this is another case of panicking while panicking that can already happen with Drop impls. This causes the process to abort. This happening may be undesirable, but it’s not Undefined Behavior.

    But maybe it'd require set_alloc_error_hook to be unsafe?

    Why?

    Simon Sapin at 2020-05-16 17:27:04

  20. For why this restriction exists, I don't know why either, but (for other people who may be reading this, I know you've already read it) the three messages starting from #51245 (comment) explain that it's indeed here :)

    Because we mark global allocator functions as nounwind to allow the compiler to optimize better.

    Amanieu d'Antras at 2020-05-16 17:35:59

  21. @Amanieu do you mean functions like std::alloc::alloc? Those return null on failure though. This is about unwinding from std::alloc::handle_alloc_error, which is already in an error path so optimizing it might be less important?

    Simon Sapin at 2020-05-16 17:51:36

  22. But maybe it'd require set_alloc_error_hook to be unsafe?

    Why?

    This was written under the assumption that possibly some platforms would have unwinding happen outside of the scope of what rustc controls, which might end up doing wrong things if in OOM conditions. It's very possible that this assumption is wrong, though -- I would even hope it is, as it would mean that this unsafe would not be deserved :)

    Léo Gaspard at 2020-05-16 21:38:55

  23. It's a little hard to tell from the comments both here and on https://github.com/rust-lang/rust/issues/51245 what is happening and/or needs to happen to make progress on this. My company has a use case very similar to the one described (don't want OOM on one task to tear down the whole process), and we might be able to contribute to moving this forward, but would need some guidance on how to help do that. Could someone point me in the right direction?

    jgallagher-cs at 2020-11-20 14:57:54

  24. Looks like panicking on OOM will never correctly handle all OOM cases, because panic itself can OOM. What about throwing an object that does not require allocation? Instead of

    panic!("oom")
    

    we can do

    resume_unwind(Box::new(AllocError)); // AllocError is ZST
    

    This has the added benefit that the catcher can distinguish this from other panics (which usually are unrecoverable programming errors).

    @alexcrichton @pnkfelix Would the Rust team be interested in reviewing a PR that does this? There are a few other places in libstd that do not allow catching an OOM error - I'm looking into these and so far all of them are fixable.

    Xiang Fan at 2021-09-10 13:59:34

  25. resume_unwind doesn't exist for #![no_std]. In addition it doesn't actually print a panic message. For that you need panic_any which doesn't exist for #![no_std] either. #![no_std] only allows panicking with an &str or core::fmt::Arguments. It doesn't support arbitrary panic payloads.

    bjorn3 at 2021-09-10 14:30:37

  26. resume_unwind doesn't exist for #![no_std]

    I take it that oom=panic is for users with std? See rust-lang/rfcs#2116:

    • For users with unwinding, an oom=panic configuration is added to make global allocators panic on oom.
    • For users without unwinding, a try_reserve() -> Result<(), CollectionAllocErr> method is added.

    ~~Global allocators and~~ unwinding are std-only. If someday no_std gains unwinding support, we'll probably have resume_unwind for no_std as well.

    Xiang Fan at 2021-09-10 14:45:07

  27. Global allocators are available in no_std+alloc scenarios, while resume_unwind is std only I think. Disallowing oom=panic in no_std+alloc would be fine though - those users would likely want to use fallible APIs instead of panicking on oom.

    Robin Lambertz at 2021-09-10 16:42:19

  28. if there isn't a way to deny that dependencies ever use infallible allocation then that's basically an unworkable situation.

    Lokathor at 2021-09-10 16:46:19

  29. Practically speaking, -C oom=panic would be implemented by having the compiler-generated main function call set_alloc_error_hook to change the hook to one that panics instead of aborting. The changes needed to allow alloc_error_handler to unwind are in #88098 and #88700.

    no_std programs should instead define a custom alloc_error_handler which can unwind or abort as needed.

    Amanieu d'Antras at 2021-09-10 18:41:20

  30. The issue I'm raising is oom=panic will never be oom-safe as long as panicking allocates. An allocation error handler simply should not allocate. To solve this, I think oom=unwind is a good replacement: the actual reason we wanted oom=panic is to unwind.

    Bikeshed: Should this be renamed to alloc-error=unwind after 2b789bd0570983e82533f9ed30c80312ac334694 changed everything oom to alloc_error?

    Xiang Fan at 2021-09-10 21:23:51

  31. To solve this, I think oom=unwind is a good replacement: the actual reason we wanted oom=panic is to unwind.

    That still requires allocating the exception object I think. Also catch_unwind which actually catches the panic returns a boxed panic payload.

    The issue I'm raising is oom=panic will never be oom-safe as long as panicking allocates.

    I believe this problem can be mitigated by using a bit of memory reserved just for allocations necessary during panicking. Maybe it could even be an eagerly initialized thread local variable.

    bjorn3 at 2021-09-10 21:47:16

  32. let's also remember that one allocation failure doesn't mean eternal allocation failure. If you try to allocate a billion byte array and fail, maybe it's still possible to allocate 20 bytes for an error to describe what went wrong.

    Lokathor at 2021-09-10 21:49:50

  33. catch_unwind which actually catches the panic returns a boxed panic payload.

    A boxed ZST doesn't require allocation.

    That still requires allocating the exception object I think.

    I believe this problem can be mitigated by using a bit of memory reserved just for allocations necessary during panicking.

    I agree. However, there's a distinction between reserving an exception object and reserving a panic payload. Exception objects are an implementation detail and it's easy to make them use reserved memory. Panic payload is already exposed via the resume_unwind API which allows the user to catch the payload. The user can simply drop the payload and now you've lost your reserved memory :(

    Xiang Fan at 2021-09-10 21:59:29

  34. let's also remember that one allocation failure doesn't mean eternal allocation failure. If you try to allocate a billion byte array and fail, maybe it's still possible to allocate 20 bytes for an error to describe what went wrong.

    True. Still, is it really a good idea to rely on "small objects" being available? Most allocators don't request memory in increments of 20 bytes - they might request, say, a 64K block each time. Allocating 20 bytes can still become a 64K memory request and fail.

    Xiang Fan at 2021-09-10 22:22:50

  35. Panic payload is already exposed via the resume_unwind API which allows the user to catch the payload. The user can simply drop the payload and now you've lost your reserved memory

    If the AllocError is a ZST and if Box of a ZST were implemented such that the inner pointer is NULL and allocators are required to implement freeing a NULL pointer as a no-op then there would be no issue, right?

    More importantly, the oom=panic feature will create memory safety hazards in code that currently doesn't have memory-safety hazards. In general I think we'll have to assume by default that crates are not compatible with oom=panic unless they explicitly claim otherwise, and I think the build system should enforce this. I expanded on this at https://github.com/rust-lang/rfcs/pull/2116#issuecomment-1006144661.

    Brian Smith at 2022-01-05 23:03:58

  36. if Box of a ZST were implemented such that the inner pointer is NULL

    I'm not sure what you mean here, but it is currently guaranteed that Box<T> is a non-null pointer, no matter the T. This allows Option<Box<T>> to be represented as a single pointer. From https://doc.rust-lang.org/stable/std/boxed/index.html#memory-layout:

    For zero-sized values, the Box pointer still has to be valid for reads and writes and sufficiently aligned.

    And the docs on pointer validity says:

    • A null pointer is never valid, not even for accesses of size zero.

    Robin Lambertz at 2022-01-06 00:35:43

  37. Box doesn't actually allocate for zero-sized values - it creates a non-null dangling pointer:

    fn main() {
        let val = Box::new(());
        println!("{:?}", Box::into_raw(val));
    }
    

    prints 0x1

    A while back, I experimented with getting a complete zero-alloc panic to work (I used a custom GlobalAlloc that enforced that no allocations happened after a certain point). I believe I got it working for ZST payloads - I can try to dig up the code.

    Aaron Hill at 2022-01-06 00:53:02

  38. Tagging as needs-summary; it'd help to know the current state of this so we can understand what it would take to stabilize this.

    Josh Triplett at 2022-02-09 18:20:58

  39. It's not implemented yet. The implementation PR is #88098 and is waiting for review.

    Amanieu d'Antras at 2022-02-09 18:42:10

  40. @Amanieu what's the current status of this tracking issue? It looks like you implemented the feature in https://github.com/rust-lang/rust/pull/88098, is this just waiting on an eventual push for stabilization?

    Jane Losare-Lusby at 2022-06-15 18:41:39

  41. I think this just needs to be stabilized. We might need to audit the standard library to check that we stay sound in the face of unwinding from allocations. Also I'm not sure what the state of the documentation is.

    Amanieu d'Antras at 2022-06-16 11:42:39

  42. What are the chances of stabilizing this soon? I've written workaround for it #84612 a year ago, and I wonder if I should push for the workaround still.

    Kornel at 2022-10-12 12:40:33

  43. I don't think std is ready for oom=panic yet. The unwinding path is not allocation-free, so likely currently oom=panic only works when a large allocation fail (but the allocator can still satisfy a few small ones).

    Gary Guo at 2022-10-12 20:29:54

  44. that still seems like a strict improvement over the current situation.

    Lokathor at 2022-10-12 20:56:55

  45. @nbdd0121 As long as this behavior is safe (I assume it causes abort like a panic during a panic), I don't think this issue is a blocker to stabilization, because this deficiency doesn't change the public interface of oom=panic. It's a quality-of-implementation issue, but even aborting sometimes is better than aborting every time.

    Kornel at 2022-10-12 21:30:08

  46. Currently we panic with error message, and that will require allocation, and I consider that a publicly visible behaviour.

    Gary Guo at 2022-10-12 21:36:40

  47. If the error message is fixed, we should be able to pre-allocate the necessary object (though it might require some hacky code in the panic internals).

    Aaron Hill at 2022-10-12 21:42:38

  48. Even for fixed error message we need to allocate a Box<dyn Any> for it, and since &str is not a ZST, there are actual heap allocation required. We couldn't preallocate it because we need to allocate one of them for each thread (not all threads are created by std APIs).

    C++ doesn't have this issue because it uses special __cxa_allocate_exception and __cxa_free_exception instead of normal heap allocation, which we can't do because our API is just a plain Box.

    We could make the panic payload a ZST, and that'll avoid allocation problem on unwind path entirely.

    All other allocations that we currently have can be avoided, I think.

    Gary Guo at 2022-10-12 21:53:02

  49. We’re constrained to Box<dyn Any> because it’s part of std::thread::Result, right?

    Simon Sapin at 2022-10-12 22:28:50

  50. There are 2 levels of allocation when panicking, first a Box<String> or Box<&str> must be allocated to hold the panic message. Then (with dwarf unwinding) an exception objects needs to be allocated on the heap, which holds the Box<Any>.

    Amanieu d'Antras at 2022-10-13 00:55:04

  51. The exception object doesn't need to be allocated on the heap. It can be #[thread_local].

    Gary Guo at 2022-10-13 00:57:06

  52. @nbdd0121 What if the exception object is subsequently passed across a thread boundary, and the thread later terminates while the exception object is still present on some other thread stack or heap?

    Ellen Emilia Anna Zscheile at 2022-10-13 07:37:09

  53. Libunwind exception objects can't be sent to another thread. They are freed when the exception is caught an not rethrown.

    However the process of unwinding could trigger additional memory allocations due to drop impls, which could cause another out-of-memory exception to be thrown. C++ handles this by only having one OOM exception object in TLS, if that one is already taken then the exception is allocated normally (with an abort if that allocation fails).

    Amanieu d'Antras at 2022-10-13 11:55:31

  54. wouldn't it make more sense to collapse multiple out-of-memory exceptions into one (or, like, have a simple counter for the occurrences, not that it matters much), and stop executing that drop impl (I'm not sure how that would affect invariants and such, but the unwinding would be there either way, so I don't think it would change much), are there downsides to that besides even more unpredictable control flow in OOM cases?

    Ellen Emilia Anna Zscheile at 2022-10-13 19:36:28

  55. and stop executing that drop impl

    This breaks the pinning invariant for things on the stack that are pinned and as such is unsound.

    bjorn3 at 2022-10-13 19:51:12