Bad codegen: unnecessary on-stack copy in WebRender 2

a7065b5
Opened by Patrick Walton at 2019-09-16 13:42:35

Here's an unnecessary on-stack copy in WebRender 2, which produces an inefficient mov chain.

Original code: https://github.com/servo/webrender/blob/master/src/tiling.rs#L111

fn add_layer_to_ubo(layer_ubos: &mut Vec<Vec<PackedLayer>>,
                    layer_to_ubo_map: &mut Vec<Option<usize>>,
                    layer_index: StackingContextIndex,
                    ctx: &RenderTargetContext) -> (usize, u32) {
    let index_in_ubo = match layer_to_ubo_map[layer_index.0] {
        Some(index_in_ubo) => {
            index_in_ubo
        }
        None => {
            let need_new_ubo = layer_ubos.is_empty() ||
                               layer_ubos.last().unwrap().len() == ctx.alpha_batch_max_layers;

            if need_new_ubo {
                for i in 0..layer_to_ubo_map.len() {
                    layer_to_ubo_map[i] = None;
                }
                layer_ubos.push(Vec::new());
            }

            let layer_ubo = layer_ubos.last_mut().unwrap();
            let index = layer_ubo.len();
            let sc = &ctx.layer_store[layer_index.0];
            layer_ubo.push(PackedLayer {
                transform: sc.transform,
                inv_transform: sc.transform.invert(),
                screen_vertices: sc.xf_rect.as_ref().unwrap().vertices,
                world_clip_rect: sc.world_clip_rect.unwrap(),
            });
            layer_to_ubo_map[layer_index.0] = Some(index);
            index
        }
    };

    (layer_ubos.len() - 1, index_in_ubo as u32)
}
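A minimal, self-contained sketch of the pattern involved (names and field types are illustrative stand-ins, not taken from the reduced test case): a large `Copy` struct is built from borrowed data and pushed into a `Vec`, which is where the intermediate on-stack copies can show up before the value lands in the vector's buffer.

```rust
// Hypothetical reduction: a large plain-data struct, roughly the shape of
// PackedLayer, constructed and pushed by value.
#[derive(Clone, Copy)]
struct Matrix([f32; 16]);

#[derive(Clone, Copy)]
struct Packed {
    transform: Matrix,
    inv_transform: Matrix,
    vertices: [f32; 16],
    clip_rect: [f32; 4],
}

// The struct literal is a temporary that may be materialized on the stack
// and then memcpy'd into the Vec's storage rather than written in place.
fn push_packed(out: &mut Vec<Packed>, m: &Matrix) -> usize {
    let index = out.len();
    out.push(Packed {
        transform: *m,
        inv_transform: *m, // stand-in for a real invert()
        vertices: [0.0; 16],
        clip_rect: [0.0; 4],
    });
    index
}

fn main() {
    let mut v = Vec::new();
    let m = Matrix([1.0; 16]);
    let i = push_packed(&mut v, &m);
    println!("pushed at index {}", i);
}
```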

out-reduced.ll: https://gist.github.com/pcwalton/3dbfe14ea18a70bd96f7c74e27696663
out-reduced.s: https://gist.github.com/pcwalton/2ca2e9755eb75f7e61ccd252ff87ec9d

  1. cc @eddyb

    Patrick Walton at 2016-08-08 22:50:13

  2. This is interesting: the longest basic block in the assembly comes from this small sequence of llvm.memcpy calls.

    EDIT: all local variables are aligned to 8, and the llvm.memcpy calls use an alignment of 8.

    Eduard-Mihai Burtescu at 2016-08-09 01:24:46

  3. I got clang to emit similar IR, but it gets lowered into XMM moves (done in a fashion that could potentially load a whole cache line in parallel).

    EDIT: similar code in Rust produces similarly good assembly, so there's something more going on here.

    EDIT2: involving a local variable produces an 8-aligned alloca, but it still gets XMM moves.

    Eduard-Mihai Burtescu at 2016-08-09 01:39:36

  4. Does it have to do with the floats? A shot in the dark here.

    Patrick Walton at 2016-08-09 08:33:52

  5. @pcwalton No difference after changing u64 to f64.

    Eduard-Mihai Burtescu at 2016-08-09 13:13:16

  6. I believe there are three useless moves here: [rbp-168] -> [rbp-480] -> [rbp-608] -> [rdx+rax] (the last being into the vector). All happen after the call to invert.

    Patrick Walton at 2016-08-09 18:40:52

  7. In the LLVM IR, the chain of 3 llvm.memcpy calls is: %46 -> %47 -> %value.i.sroa.8.0..sroa_cast30 -> %value.i.sroa.8.0..sroa_cast.

    Eduard-Mihai Burtescu at 2016-08-09 18:42:42

  8. Triage: is there a way to reproduce this today, easily? Is this issue still tracking something useful?

    Steve Klabnik at 2019-09-16 13:23:14