Bad codegen: unnecessary on-stack copy in WebRender 2
Here's an unncessary on-stack copy in WebRender 2 with an inefficient mov chain.
Original code: https://github.com/servo/webrender/blob/master/src/tiling.rs#L111
fn add_layer_to_ubo(layer_ubos: &mut Vec<Vec<PackedLayer>>,
layer_to_ubo_map: &mut Vec<Option<usize>>,
layer_index: StackingContextIndex,
ctx: &RenderTargetContext) -> (usize, u32) {
let index_in_ubo = match layer_to_ubo_map[layer_index.0] {
Some(index_in_ubo) => {
index_in_ubo
}
None => {
let need_new_ubo = layer_ubos.is_empty() ||
layer_ubos.last().unwrap().len() == ctx.alpha_batch_max_layers;
if need_new_ubo {
for i in 0..layer_to_ubo_map.len() {
layer_to_ubo_map[i] = None;
}
layer_ubos.push(Vec::new());
}
let layer_ubo = layer_ubos.last_mut().unwrap();
let index = layer_ubo.len();
let sc = &ctx.layer_store[layer_index.0];
layer_ubo.push(PackedLayer {
transform: sc.transform,
inv_transform: sc.transform.invert(),
screen_vertices: sc.xf_rect.as_ref().unwrap().vertices,
world_clip_rect: sc.world_clip_rect.unwrap(),
});
layer_to_ubo_map[layer_index.0] = Some(index);
index
}
};
(layer_ubos.len() - 1, index_in_ubo as u32)
}
out-reduced.ll: https://gist.github.com/pcwalton/3dbfe14ea18a70bd96f7c74e27696663
out-reduced.s: https://gist.github.com/pcwalton/2ca2e9755eb75f7e61ccd252ff87ec9d
cc @eddyb
Patrick Walton at 2016-08-08 22:50:13
This is interesting, the longest BB in ASM comes from this small sequence of
llvm.memcpycalls.EDIT: all local variables are aligned to 8, and the
llvm.memcpycalls use an alignment of 8.Eduard-Mihai Burtescu at 2016-08-09 01:24:46
I got clang to emit similar IR, but it lowers into XMM moves (done in a fashion that could potentially load a whole cache-line in parallel).
EDIT: similar code in Rust produces similarly good assembly, so there's something more going on here.
EDIT2: involving a local variable produces a 8-aligned alloca but it still gets XMM moves.
Eduard-Mihai Burtescu at 2016-08-09 01:39:36
Does it have to do with the floats? Shot in the dark here.
Patrick Walton at 2016-08-09 08:33:52
@pcwalton No difference after changing
u64tof64.Eduard-Mihai Burtescu at 2016-08-09 13:13:16
I believe there are three useless moves here:
[rbp-168] -> [rbp-480] -> [rbp-608] -> [rdx+rax](the last being into the vector). All happen after the call toinvert.Patrick Walton at 2016-08-09 18:40:52
In the LLVM IR, the 3
llvm.memcpychain is:%46->%47->%value.i.sroa.8.0..sroa_cast30->%value.i.sroa.8.0..sroa_cast.Eduard-Mihai Burtescu at 2016-08-09 18:42:42
Triage: is there a way to reproduce this today, easily? Is this issue still tracking something useful?
Steve Klabnik at 2019-09-16 13:23:14