BTreeSet and BTreeMap append optimization

88d6dee
Opened by Andrei Tonkikh at 2023-11-17 21:37:40

When the largest key of one tree is less than the smallest key of another, the two trees can be merged in only O(log n) operations. Perhaps this logarithmic merge should be exposed as a separate method. I'm ready to work on it.
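For reference, this is the disjoint-range case with today's `append` API. The call below currently rebuilds both trees in O(n + m); the proposal is that `append` could detect that the key ranges don't overlap and merge in O(log n) instead:

```rust
use std::collections::BTreeSet;

fn main() {
    // Two sets with disjoint key ranges: every key in `low` is
    // smaller than every key in `high`.
    let mut low: BTreeSet<u32> = (0..5).collect();
    let mut high: BTreeSet<u32> = (10..15).collect();

    // `append` moves all elements of `high` into `low`, leaving
    // `high` empty. Today this is an O(n + m) rebuild even though
    // the ranges are disjoint.
    low.append(&mut high);

    assert!(high.is_empty());
    assert_eq!(low.len(), 10);
    assert_eq!(
        low.iter().copied().collect::<Vec<_>>(),
        vec![0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
    );
}
```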

UPD: Also, sometimes it may be better to push the elements of one tree into the other with O(min_length · log(max_length)) operations instead of rebuilding both with O(n + m) operations (as in #32526).

But the problem is that it's hard to determine which approach is better in a particular case.
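One way to frame the decision is to compare the two cost estimates directly. The helper below is hypothetical (not part of std), and the crossover condition is only a rough guess, not a tuned constant:

```rust
use std::collections::BTreeSet;

// Hypothetical helper: merge `src` into `dst`, picking a strategy
// from the relative sizes. Inserting the elements of `src` one by
// one costs O(m log n); `append` rebuilds both trees in O(n + m).
fn merge_sets(dst: &mut BTreeSet<u64>, src: &mut BTreeSet<u64>) {
    let (n, m) = (dst.len(), src.len());
    // floor(log2(n)) + 1, as a cheap stand-in for log n.
    let log_n = (usize::BITS - n.leading_zeros()) as usize;
    if m * log_n < n + m {
        // `src` is much smaller: repeated insertion wins.
        while let Some(x) = src.pop_first() {
            dst.insert(x);
        }
    } else {
        // Comparable sizes: the O(n + m) rebuild wins.
        dst.append(src);
    }
}

fn main() {
    let mut big: BTreeSet<u64> = (0..100_000).collect();
    let mut small: BTreeSet<u64> = (200_000..200_010).collect();
    merge_sets(&mut big, &mut small);
    assert_eq!(big.len(), 100_010);
    assert!(small.is_empty());
}
```

The hard part, as noted above, is that the real crossover depends on constants (allocation, cache behavior) that a simple size comparison doesn't capture.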

  1. CC @gereeter

    Andrei Tonkikh at 2016-07-07 06:59:06

  2. Am I missing something? I don't see append() on BTreeSet

    Alfie John at 2016-11-25 11:17:36

  3. @alfiedotwtf what's wrong with this one? https://doc.rust-lang.org/std/collections/struct.BTreeSet.html#method.append

    Andrei Tonkikh at 2016-11-25 11:38:11

  4. Gah! I don't know how I managed to land on old documentation:

    https://doc.rust-lang.org/1.6.0/std/collections/struct.BTreeSet.html

    Thanks @xosmig :)

    Alfie John at 2016-11-25 11:43:32

  5. I would love to see this fixed! I have recently developed a data structure [1] that should be faster than BTreeSet for a specific usage pattern, but I can't write a fair benchmark until BTreeSet supports an O(k + log n) delete-range operation.

    [1] https://users.rust-lang.org/t/announcing-teardowntree/8259

    Kirill Khazan at 2016-12-05 11:20:09
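For context, the delete-range mentioned above can be approximated today by combining `split_off` and `append`, but the final `append` step is exactly the O(n + m) rebuild this issue is about, so the whole operation is not yet O(k + log n). A sketch:

```rust
use std::collections::BTreeSet;

fn main() {
    let mut set: BTreeSet<u32> = (0..100).collect();

    // Delete the range [30, 60) via two split_off calls plus append.
    let mut tail = set.split_off(&30); // set = [0, 30), tail = [30, 100)
    let mut keep = tail.split_off(&60); // tail = [30, 60), keep = [60, 100)
    drop(tail); // the deleted range
    set.append(&mut keep); // the O(n + m) rebuild step

    assert_eq!(set.len(), 70);
    assert!(!set.contains(&45));
    assert!(set.contains(&60));
}
```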

  6. I think the current append implementation is a little problematic. Arguably, one would expect a "batch" operation like set.append(other_set) to generally be the better choice when you have to insert one set into the other, because presumably it can apply optimizations that cannot be made when inserting one item at a time. However, the O(n + m) complexity means that if, say, n is large and m is small, the cost of calling append is enormous.

    I ran into this problem today: for a relatively large computational problem, in which I called append roughly 500k times with n ≈ 100k and m ≈ 10, my program had not finished after half an hour. After switching to insert in a loop, the program finished in 2 seconds.

    There are two problems here, as I see it:

    • append sounds like a straightforward batch operation, so at least I personally wouldn't expect it to have such a severe drawback compared to repeated insertion for unbalanced set sizes.
    • This drawback is not communicated by the docs for append itself.

    Of course, ideally append would work well in all cases, but deciding when to pick each strategy is surely not a simple problem...

    Andreas Borgen Longva at 2020-08-18 19:03:34
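The workaround described in the comment above (replacing `append` with an insert loop) looks like this; the sizes are stand-ins for the n ≈ 100k, m ≈ 10 case mentioned there:

```rust
use std::collections::BTreeSet;

fn main() {
    let mut big: BTreeSet<u32> = (0..100_000).collect();
    let small: BTreeSet<u32> = (100_000..100_010).collect();

    // Instead of `big.append(&mut small)`, which rebuilds both
    // trees in O(n + m), insert the handful of elements one by
    // one, costing only O(m log n).
    for x in small {
        big.insert(x);
    }

    assert_eq!(big.len(), 100_010);
}
```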

  7. cc @ssomers (due to their many improvements to BTreeMap/BTreeSet in the last couple of years)

    memoryruins at 2020-08-18 21:36:09

  8. Rest assured, I was subscribed... It's not hard to improve this in many cases, as Andlon proved, but it's hard to prove it doesn't become worse in some other situations. There is a lot of benchmarking to do.

    Stein Somers at 2020-08-18 21:56:47

  9. I was bitten by this in the context of purposefully trying to merge a smaller BTreeSet into a larger one[^1]. Neither the function name nor the documentation helped to diagnose the issue. I think that improving the documentation for append would already help significantly.

    [^1]: "Small to Large" (😅) optimization: https://soi.ch/wiki/smaller-to-larger/

    Matheus Cardoso at 2023-11-17 21:37:40