missed optimization: enum move of the active variant
This code performs a memcpy of 2064 bytes where 32 bytes would suffice. See it live on godbolt:
```rust
#![feature(test)]
#![feature(rustc_private)]
extern crate smallvec;
extern crate test;

use smallvec::SmallVec;

#[inline(never)]
fn clobber<T>(x: T) { test::black_box(x); }

pub fn bar() {
    // Capacity for 256 `f64`s on the stack.
    // The stack size of `SmallVec` is 256 * 8 + 8 (len) + 8 (discriminant) = 2064 bytes.
    let mut v = SmallVec::<[f64; 256]>::new();
    let size = ::std::mem::size_of::<SmallVec<[f64; 256]>>();
    clobber(size);
    for _ in 0..300 {
        v.push(3.14);
    }
    // The vector reallocates to the heap; the size of the active variant
    // is 8 (len) + 8 (capacity) + 8 (ptr) + 8 (discriminant) = 32 bytes.
    clobber(v); // memcpy's 2064 bytes instead of 32 bytes...
}
```
This generates the following assembly for the second call to `clobber`:
```asm
mov  edx, 2064
mov  rdi, rbx
call memcpy@PLT
mov  rdi, rbx
call example::clobber
```
The layout of `SmallVec` is:

```rust
pub struct SmallVec<A: Array> {
    len: usize,
    data: SmallVecData<A>,
}

enum SmallVecData<A: Array> {
    Inline { array: A },
    Heap { ptr: *mut A::Item, capacity: usize },
}
```
Note: if `NonZero` were stable, `SmallVecData` could use it here to remove the 8 bytes of the discriminant.
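(The `NonZero*` types have since been stabilized, and the niche optimization this note refers to is already visible in std types today; the types below are unrelated to `smallvec` and only demonstrate the effect:)

```rust
use std::num::NonZeroUsize;
use std::ptr::NonNull;

fn main() {
    // The forbidden zero value serves as the `None` discriminant, so the
    // Option wrapper adds no extra tag byte.
    assert_eq!(
        std::mem::size_of::<Option<NonNull<u8>>>(),
        std::mem::size_of::<*mut u8>()
    );
    assert_eq!(
        std::mem::size_of::<Option<NonZeroUsize>>(),
        std::mem::size_of::<usize>()
    );
}
```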
So I guess the problem is that moving an enum naively generates a memcpy of the whole enum. That is a good default for small enums, but large enums may contain many small variants alongside a few large ones.
I don't know if there is an optimal way to solve this, since adding branches to memcpy only some variants might incur a performance cost (unless LLVM knows the active variant and can optimize the branches away).
We should probably always memcpy small enums up to some size (16 bytes? 256 bytes?), and for larger enums use a heuristic like:

- Are all variants approximately equally sized? Then just `memcpy` the whole enum.
- Are some variants small and some large? How many "size classes" are there? Generate as few branches as possible (ideally two, one for small and one for large variants, though more might be required).
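A sketch of the two-size-class idea in plain Rust (the compiler would do this at the MIR/LLVM level; `copy_len` and its constants are made up here for illustration): with variants sorted by payload size, the copy length follows from a single comparison on the discriminant.

```rust
// Illustrative only: variants are ordered so that every "small" variant
// has a discriminant below FIRST_LARGE, leaving one branch per move.
const SMALL_LEN: usize = 32;   // e.g. the Heap variant from this issue
const LARGE_LEN: usize = 2064; // e.g. the Inline variant from this issue
const FIRST_LARGE: usize = 1;  // discriminants >= 1 are "large"

fn copy_len(discriminant: usize) -> usize {
    if discriminant < FIRST_LARGE { SMALL_LEN } else { LARGE_LEN }
}

fn main() {
    assert_eq!(copy_len(0), 32);   // Heap: copy only the active variant
    assert_eq!(copy_len(1), 2064); // Inline: copy everything
}
```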
One could sort the variants by size, so checking whether the discriminant falls in the large or small range is a single comparison to branch on. You might even go branchless by computing the memcpy length from the variant as a formula instead of a lookup.
Oli Scherer at 2018-01-27 11:48:06
> You might even go branchless by computing the memcpy length from the variant as a formula instead of a lookup
Since we know the size of the enum at compile-time, we can just do full memcpys for enums whose maximum size is smaller than X, and use the formula otherwise.
gnzlbg at 2018-01-28 10:35:06