Move HashMap to liballoc

4462a81
Opened by Aria Desires at 2024-08-03 21:28:47

This is blocked on https://github.com/rust-lang/rust/pull/26870

  • [ ] Move FnvHasher from src/librustc/util/nodemap.rs to libcollections
  • [ ] Expose FnvHasher as DeterministicState
  • [ ] Move HashMap's implementation into libcollections defaulting to DeterministicState
  • [ ] Unhardcode HashMap's constructors/methods from using RandomState (instead relying on default fallback)
  • [ ] Re-export HashMap in std::collections as pub type HashMap<K, V, S = RandomState> = core_collections::HashMap<K, V, S>
  • [ ] Do the same for HashSet
  • [ ] Re-export DeterministicState in std as well (so we have a better answer to "HashMap is slow")
  • [ ] Document the performance-stability/security tradeoff between DeterministicState and RandomState, and that users of the std facade will get a different default from direct users of libcollections (because RandomState requires OS rng to seed its state)

I am willing to mentor this if someone else wants to do it.

Note that re-exporting HashMap may be the trickiest one; it's not clear to me that it's trivial, and may involve manually re-exporting everything in std::collections::hash_map to get the right behaviour (annoying and potentially fragile).

  1. This is blocked on https://github.com/rust-lang/rust/pull/26870

    • [ ] Move FnvHasher from src/librustc/util/nodemap.rs to libcollections
    • [ ] Expose FnvHasher as DeterministicState
    • [ ] Move HashMap's implementation into libcollections defaulting to DeterministicState
    • [ ] Unhardcode HashMap's constructors/methods from using RandomState (instead relying on default fallback)
    • [ ] Re-export HashMap in std::collections as pub type HashMap<K, V, S = RandomState> = core_collections::HashMap<K, V, S>
    • [ ] Do the same for HashSet
    • [ ] Re-export DeterministicState in std as well (so we have a better answer to "HashMap is slow")
    • [ ] Document the performance-stability/security tradeoff between DeterministicState and RandomState, and that users of the std facade will get a different default from direct users of libcollections (because RandomState requires OS rng to seed its state)

    I am willing to mentor this if someone else wants to do it.

    Note that re-exporting HashMap may be the trickiest one; it's not clear to me that it's trivial, and may involve manually re-exporting everything in std::collections::hash_map to get the right behaviour (annoying and potentially fragile).

    <!-- TRIAGEBOT_START --> <!-- TRIAGEBOT_ASSIGN_START --> <!-- TRIAGEBOT_ASSIGN_DATA_START$${"user":"clarfonthey"}$$TRIAGEBOT_ASSIGN_DATA_END --> <!-- TRIAGEBOT_ASSIGN_END --> <!-- TRIAGEBOT_END -->

    rustbot at 2023-05-16 05:10:58

  2. CC @jroesch to confirm that this will work correctly with your design

    Aria Desires at 2015-07-23 18:59:54

  3. Yeah this seems to be completely compatible with the current design of defaults. Let me know if there are any issues doing this.

    Jared Roesch at 2015-07-23 20:25:04

  4. I would also be fine just not having a default in libcollections and eschew the DeterministicState entirely.

    Alex Crichton at 2015-07-23 20:40:11

  5. I would like to have DeterministicState in std because it's painful to explain that if you want a fast HashMap you need to use such-and-such external crate.

    Aria Desires at 2015-07-23 21:11:35

  6. Perhaps, yeah, but I also don't mind expanding std::hash to just have more hashing algorithms. We use RandomState as a name so we're not committing to an algorithm, but that's basically just because it's the default. If you have to explicitly opt-in to an algorithm it seems fine to just expose the names themselves.

    Alex Crichton at 2015-07-23 23:15:23

  7. Oh yeah, I'm fine if it's also just straight up FnvState or whatever. Just want a secure and a fast option.

    Aria Desires at 2015-07-24 00:09:59

  8. #26870 landed, but you want a snapshot before you can start moving things around.

    Eduard-Mihai Burtescu at 2015-07-26 13:15:55

  9. Thanks, snapshot landed just now. I would be happy to do this.

    Peter Blackson at 2015-07-28 14:23:07

  10. It sounds like we can't while the feature is gated?

    Aria Desires at 2015-07-28 14:34:12

  11. Niko and I talked today and the feature becoming ungated is a decision the core team has to make. We are already going to make a couple tweaks to the algorithm this week, looking forward to this being stable ;)

    Jared Roesch at 2015-07-29 01:18:02

  12. @Gankro I'd like to work on this in whatever limited capacity I can, if you're still willing to mentor

    Panashe Fundira at 2016-08-22 03:54:49

  13. Looks like I unassigned you? oops

    Panashe Fundira at 2016-08-22 03:55:13

  14. @munyari Sadly I don't think progress can be made on this without adopting some strategy for type parameter defaulting.

    Eduard-Mihai Burtescu at 2016-08-22 03:57:14

  15. libcollections is gone, so I believe this can be closed. @Gankro if that's wrong please let me know!

    Steve Klabnik at 2018-12-10 19:05:19

  16. @steveklabnik all the collections now live in liballoc instead, except for HashMap which is still in std for the same reason it wasn't in libcollections. Please re-open.

    jethrogb at 2018-12-11 02:59:41

  17. Since libstd now uses hashbrown, is this still relevant/doable?

    Camille Gillot at 2020-04-18 21:50:46

  18. So I didn't explain this well, but historically the main blocker was the the default random hasher relied on thread-locals to allow HashMap instances on the same thread to share a seed (since generating it was expensive, and this was regarded as sufficient to guard against HashDOS).

    Thread-local storage is a feature "exclusive" to std, which liballoc (then libcollections) wasn't allowed to use.

    That said it looks like things may have changed a lot with how hashbrown does seeding. @Amanieu: is it now trivial for us to move HashMap to libcollections? Or does the std shim still use the thread-localy stuff?

    Aria Desires at 2020-04-22 18:35:13

  19. hashbrown itself is #[no_std] since it uses a non-HashDOS-safe hasher. The std shim is what adds the SipHash hasher which depends on randomness and TLS.

    Personally, I've just been using hashbrown directly in my no_std code.

    Amanieu d'Antras at 2020-04-22 18:40:09

  20. This is certainly still relevant and hashbrown hasn't made it more doable. The main reason why this hasn't already been done is the random state for the secure hasher. Thread local storage is only part of the story, the main issue is a that a cryptographically-secure randomness source is needed to initialize the state. This is not available in (the platform-independent) libcore/liballoc. This hasn't changed with hashbrown.

    Btw, the tasks in the issue description need updating from the pre-1.0 era.

    Yes, you can use hashbrown in no_std, but library crates that use that in their public APIs are not compatible with the std HashMap.

    jethrogb at 2020-04-24 09:49:12

  21. A blocker I saw is that hashbrown itself depends on liballoc. Therefore, making a hashbrown-powered HashMap in liballoc is impossible. A way out would be to push the libstd compatibility layer to hasbrown itself, and make libstd only do reexports and type aliases. I do not know how this interacts with stability annotations.

    Camille Gillot at 2020-04-24 10:14:11

  22. We can figure out how to handle the cyclic dependency. That's not a problem.

    jethrogb at 2020-04-24 10:23:14

  23. Any way I could help getting this moving ?

    GuillaumeDIDIER at 2020-07-10 09:41:51

  24. @GuillaumeDIDIER you can try to convince the powers-that-be that something like extern existensials need to be added to the language.

    jethrogb at 2020-07-10 09:48:21

  25. @rustbot claim

    Kisaragi at 2023-05-16 05:10:56

  26. an idea I have for moving HashMap to alloc: move RandomState to alloc too and make new just call an extern function something like #[panic_handler] where alloc has a default implementation based on a global atomic iif #[cfg(target_has_atomic = "ptr")], std overrides that default. programs without std can supply their own implementation, programs must supply their own implementation if #[cfg(not(target_has_atomic = "ptr"))]

    Jacob Lifshay at 2023-05-16 09:07:22

  27. That is a breaking change. You can currently use liballoc without providing said implementation.

    bjorn3 at 2023-05-16 12:34:06

  28. hmm, well, if alloc always supplies a default implementation it shouldn't be breaking. the #[cfg(not(target_has_atomic = "ptr"))] version could return a constant or the address of a variable on stack or panic.

    alternatively, maybe RandomState::new and all non-generic callers could be marked #[inline] and rustc would be made to only require the extern function to exist if RandomState::new is transitively used by the user's program.

    Jacob Lifshay at 2023-05-16 15:52:05

  29. Has any progress been made on this recently? I'd like to help out with this if I can, but want to make sure that I'm not repeating the same work someone's already doing.

    Clar Fon at 2023-08-26 05:56:38

  30. Has any progress been made on this recently? I'd like to help out with this if I can, but want to make sure that I'm not repeating the same work someone's already doing.

    No; at least, no progress has been made on my side.

    However, there are some key-pieces (on zulip thread):

    • would we make new HashMap on alloc?: No, we would want both to be same; otherwise it will be not a "solution" to no_std HashMap crates.
    • would we want to default hasher with other?: No, it would be a pitfall (the current implementation is defaulting with HashDOS-resistant one)

    So, it would be something like:

    1. Move definition and generic impls into alloc
    2. Make alias on std, which refers to alloc and defaults with current hasher
    3. Do something about impl HashMap<K, V, RandomState>
    4. Create PR and see what happens

    Kisaragi at 2023-08-26 23:50:23

  31. In terms of that third point, we already can have incoherent impls in the standard library specifically, so, it should be fine to keep the impl HashMap<K, V, RandomState> block in libstd while making all of the impls in alloc not reference RandomState at all.

    Clar Fon at 2023-08-27 03:13:35

  32. We can have incoherent inherent impls, not incoherent trait impls.

    bjorn3 at 2023-08-27 07:05:02

  33. @rustbot claim

    I'll try something soonish to see what I can do.

    Clar Fon at 2023-09-02 18:03:47

  34. So, immediate issue detected: right now, HashMap is implemented by directly including the hashbrown dependency, which itself relies on the alloc crate. This is problematic if we want to include it in the alloc crate itself, since that's a circular dependency.

    There are honestly two paths forward:

    1. We move the code for the hashbrown crate directly into the alloc crate. This seems like the most reasonable option, since the stated reason for this crate being offered separately at all is to work in no_std environments, and that's the goal of the move.
    2. We create a separate alloc_internals crate which contains just the allocator API and its supporting types, and offer the option of making the hashbrown crate depend on that. I'm not actually sure if this is less intrusive than just moving the code in.

    The main consequence of the first option is that this requires effectively moving all development effort for HashBrown into the main rust-lang repo, which would require moving over issues and PRs and changing the development process for that. This feels like the proper decision IMHO, but that actual decision should be left to the libs team.

    The second option does open the doors to move other collection implementations from liballoc into separate crates, but I'm not sure if this is actually something that would be desired.

    Either way, I think that this is a sufficient roadblock that would have to be solved by the libs team before work continues on this. I'm willing to help out after a decision is made, but since I personally cannot make that decision, I'll unclaim this until the libs team can have a proper discussion about it. (If any libs team members are watching, nominating this issue for discussion would be appreciated, but I can't actually compel any of you to do anything about it.)

    Clar Fon at 2023-09-10 01:15:39

  35. imho rather than making a alloc_internals crate, that stuff should just be in core since it's only the interface traits and functions, not actually a global allocator.

    Jacob Lifshay at 2023-09-10 05:42:09

  36. imho rather than making a alloc_internals crate, that stuff should just be in core since it's only the interface traits and functions, not actually a global allocator.

    turns out Allocator is already in core, what does hashbrown need that isn't in core?

    Jacob Lifshay at 2023-09-10 05:45:41

  37. I took a quick look at hashbrown and here's what it uses alloc for:

    • The default type parameter for the HashMap and HashSet allocators.
    • ToOwned for HashSet::get_or_insert_owned.
    • handle_alloc_error for allocation failures.

    This is sufficiently small that it's probably possible to make hashbrown only depend on core (with an optional dependency on alloc), though I'm not entirely sure how to deal with handle_alloc_error. Perhaps we could manually include of copy of the implementation directly in hashbrown? It should be fine to depend on unstable implementation details when building as part of the standard library.

    Amanieu d'Antras at 2023-09-10 11:33:28

  38. Is there any actual reason ToOwned can't be defined in core, setting aside the fact that the compiler currently doesn't allow incoherent trait impls?

    Clar Fon at 2023-09-10 16:09:25

  39. Is there any actual reason ToOwned can't be defined in core, setting aside the fact that the compiler currently doesn't allow incoherent trait impls?

    I don't think ToOwned belongs outside alloc. It is quite specific/biased to alloc types. For example, it considers alloc::string::String to be the (one and only) owned version of &str, even though there exist perfectly good alternatives in other crates (e.g. owned string types with small string optimization, or strings that store the length in the allocation, etc.).

    Mara Bos at 2023-09-21 08:00:45

  40. I would love it if we could find a lang way to allow splitting Clone into

    pub trait CloneFrom<Other: ?Sized> {
        fn clone_from(&mut self, other: &Other);
    }
    pub trait Clone : CloneFrom<Self> where Self: Sized {
        fn clone(&self) -> Self;
    }
    

    so that that CloneFrom can be in core for all the "update my arraystr from that str" cases, and thus reduce some of the ToOwned needs...

    scottmcm at 2023-09-21 18:40:24

  41. I don't think ToOwned belongs outside alloc. It is quite specific/biased to alloc types. For example, it considers alloc::string::String to be the (one and only) owned version of &str, even though there exist perfectly good alternatives in other crates (e.g. owned string types with small string optimization, or strings that store the length in the allocation, etc.).

    So, my reasoning for this is that while most of the relevant types for this do exist in the alloc crate, there's nothing that explicitly requires it other than the fact that it would be incoherent to defer implementations to the alloc crate.

    There currently isn't a supported way to do this besides (maybe) the proposed rust-lang/rfcs#3482, but it feels kind of arbitrary to require that something depend on allocations to itself use the trait. The way it effectively would work is a negative impl in core that is "replaced" with a positive impl in alloc, although again, this is absolutely nothing that is currently supported.

    (This example is probably way beyond the scope of this issue, although I figured I might as well explain what's going on in my head at the moment.)

    Clar Fon at 2023-09-21 19:36:13

  42. I would love it if we could find a lang way to allow splitting Clone into

    pub trait CloneFrom<Other: ?Sized> {
        fn clone_from(&mut self, other: &Other);
    }
    pub trait Clone : CloneFrom<Self> where Self: Sized {
        fn clone(&self) -> Self;
    }
    

    It would be great. This would also eliminate the need for A: Clone in CloneFrom for collections including Hashbrown.

    JustForFun88 at 2023-09-26 20:30:03

  43. https://github.com/rust-lang/rust/pull/116113 has a different approach for ?Sized Clone.

    David Tolnay at 2023-09-26 21:02:23

  44. Is this true that after this change std::collections::HashMap will be vulnerable to that DoS?

    Askar Safin at 2023-11-05 17:33:23

  45. The only difference is HashMap can be referred from #![no_std] environment. It will not change std::collections::HashMap's default hasher (RandomState, which is HashDoS-resistant)

    Kisaragi at 2023-11-05 17:41:47

  46. @KisaragiEffective, cool!

    So, std::collections::HashMap and alloc::collections::HashMap will behave slightly differently? This will be very hard-to-discover, so this should be properly documented on both rustdoc pages.

    Maybe we should call alloc::collections::HashMap differently to make sure nobody will assume it is DoS-resistant?

    Askar Safin at 2023-11-05 18:48:50

  47. As I understand alloc::collections::HashMap will require explicitly passing a hasher (eg alloc::collections::HashMap::<u32, String, MyHasher> is a valid type, but alloc::collections::HashMap::<u32, String> is not due to missing the hasher), while std::collections::HashMap will default to RandomState if you don't explicitly pass a hasher. And unless you depend on libstd, HashMap::new() and other methods which assume RandomState to be the used hasher will be unavailable.

    bjorn3 at 2023-11-05 19:30:22

  48. Poking back in here just to reprocess the current requirements on implementing this.

    My understanding is that right now, the issue is that the hashbrown crate internally uses liballoc for some of its methods in order to work, and thus can't be added as a dependency for the liballoc crate.

    One of the biggest concerns was the fact that it references ToOwned, which we don't really want to add into the core crate.

    Would it be possible to, perhaps, add ToOwned under a perma-unstable feature flag to core, alongside marking it for incoherent impls so it can have all its implementations in alloc? That way, hashbrown should be able to access it without us changing the public API.

    (Also, if anyone with permissions to do so would like to update the issue description, that would be appreciated, since right now it is very outdated.)

    Clar Fon at 2024-07-29 13:53:21

  49. I would love it if we could find a lang way to allow splitting Clone into

    pub trait CloneFrom<Other: ?Sized> {
        fn clone_from(&mut self, other: &Other);
    }
    pub trait Clone : CloneFrom<Self> where Self: Sized {
        fn clone(&self) -> Self;
    }
    

    so that that CloneFrom can be in core for all the "update my arraystr from that str" cases, and thus reduce some of the ToOwned needs...

    There's something similar in #126799

    GrigorenkoPV at 2024-07-30 21:41:45

  50. To make my former request even easier, here's a potential issue description you could copy-paste in to make it more up to date:

    Currently, `HashMap` and `HashSet` are only available in `std` because they depend on `RandomState`, even though this is only a *soft* dependency. These types could be moved to `alloc` as long as the user provides an implementation of `BuildHasher` themself.
    
    Right now, this is difficult because the implementation for `HashMap` and `HashSet` is in the external `hashbrown` crate, which depends on the `alloc` crate, and thus can't be used as a dependency for `alloc`.
    
    So, there are effectively two steps here:
    
    - [ ] Make `hashbrown` crate able to work without depending on the `alloc` crate.
    - [ ] Move code that doesn't depend on `RandomState` into `alloc` crate so we can have these types.
    - [ ] Convert the `std` types into aliases which default the `BuildHasher` type to `RandomState`, and implement the various methods for this default on the alias instead of its own type.
    
    ### Unresolved Questions
    
    * Should `hashbrown` still exist as a standalone crate once this move is complete? Its primary purpose at the moment is to have `no_std` support, so, we could potentially solve a lot of these problems by just moving its code directly into `alloc`. However, its `RawTable` API would have to be made available in these cases, and it's unclear how long that would take to stabilise.
    
    ### Implementation history
    
    *None currently.*
    
    <details><summary>Rendered version</summary>

    Currently, HashMap and HashSet are only available in std because they depend on RandomState, even though this is only a soft dependency. These types could be moved to alloc as long as the user provides an implementation of BuildHasher themself.

    Right now, this is difficult because the implementation for HashMap and HashSet is in the external hashbrown crate, which depends on the alloc crate, and thus can't be used as a dependency for alloc.

    So, there are effectively two steps here:

    • [ ] Make hashbrown crate able to work without depending on the alloc crate.
    • [ ] Move code that doesn't depend on RandomState into alloc crate so we can have these types.
    • [ ] Convert the std types into aliases which default the BuildHasher type to RandomState, and implement the various methods for this default on the alias instead of its own type.

    Unresolved Questions

    • Should hashbrown still exist as a standalone crate once this move is complete? Its primary purpose at the moment is to have no_std support, so, we could potentially solve a lot of these problems by just moving its code directly into alloc. However, its RawTable API would have to be made available in these cases, and it's unclear how long that would take to stabilise.

    Implementation history

    None currently.

    </details>

    Clar Fon at 2024-08-03 19:30:16

  51. Current progress/conclusion on figuring out how to make this work: I'm, very close to just copying the hashbrown code directly into liballoc. This would effectively solve all the current issues, but it would mean that we'd have to find another way to expose the RawTable APIs that crates currently depend on. So, here's my thought process:

    1. We make hashbrown just contain these raw table APIs which are currently unstable. Or at least, make its other stuff under a feature flag which does not require alloc. (Not requiring alloc can be a separate feature flag only used by the standard library, since it will require unstable core::alloc stuff.)
    2. We use those parts as the dependency in alloc, and then implement the proper HashMap and HashSet using those types.
    3. The raw table APIs can continue to evolve and iterate and eventually find their way into alloc, but this doesn't necessarily block alloc from incorporating the parts that we consider stable. (the HashMap/HashSet APIs themselves)

    This feels like the best conclusion since it avoids what I assume are some of the pitfalls to just moving everything into alloc:

    • People still want the RawTable APIs, but they're not stable enough for the standard library to use.
    • People currently depending on the no_std versions can still use them stably while the alloc version is unstable.

    This feels like the right approach to me, but please feel free to provide your feedback/concerns before I go all in on it, since it'll probably be a bit.

    Clar Fon at 2024-08-03 19:57:16

  52. Correct me if I'm wrong, but another reason to use the hashbrown crate right now is the Equivalent trait, which allows you to use some types for lookup that you can't with the standard library API. What will happen to that trait and the related methods under the plan you outlined?

    Juhan Oskar Hennoste at 2024-08-03 20:29:10

  53. Could we handle hashbrown the same way as backtrace-rs? Keep it available as external crate, but at the same time use #[path = "..."] mod foo; to include it in the standard library.

    bjorn3 at 2024-08-03 20:38:00

  54. Could we handle hashbrown the same way as backtrace-rs? Keep it available as external crate, but at the same time use #[path = "..."] mod foo; to include it in the standard library.

    I had no idea we did this in the standard library, or that it works as you'd expect. That feels promising, actually…

    Correct me if I'm wrong, but another reason to use the hashbrown crate right now is the Equivalent trait, which allows you to use some types for lookup that you can't with the standard library API. What will happen to that trait and the related methods under the plan you outlined?

    I think that the best path forward would be to try and get this incorporated into the standard library, rather than rely on it existing in hashbrown. While the raw table API is relatively complicated and thus won't be easy to stabilise, this isn't.


    EDIT: Right now, hashbrown just refers to the equivalent crate, which currently has 8 dependents on crates.io. If people are using this trait and have compelling examples, I would recommend they file the ACP for that on the standard library, likely moving the trait to core::cmp::Equivalent. I was thinking of just writing up the ACP for this but genuinely cannot justify it myself, since as far as the publicly available data is concerned, nobody has ever had this problem.

    EDIT 2: There appears to be an issue open on the equivalent crate repo for moving this to libstd, so, I would encourage you put your discussion there first: indexmap-rs/equivalent#3

    Clar Fon at 2024-08-03 21:15:56