Make LLVM better at using XMM registers to perform structure moves
Here's a small snippet of Servo code from block flow fragmentation:
servo[0x100bf974e] <+6638>: mov r10b, byte ptr [rbp - 0x729]
servo[0x100bf9755] <+6645>: mov r9, qword ptr [rbp - 0x728]
servo[0x100bf975c] <+6652>: mov rax, qword ptr [rbp - 0x720]
servo[0x100bf9763] <+6659>: mov rdi, qword ptr [rbp - 0x718]
servo[0x100bf976a] <+6666>: mov r8, qword ptr [rbp - 0x6c8]
servo[0x100bf9771] <+6673>: mov rdx, qword ptr [rbp - 0x710]
servo[0x100bf9778] <+6680>: mov qword ptr [rbp - 0x710], rdx
servo[0x100bf977f] <+6687>: mov qword ptr [r13 + 0x78], r8
servo[0x100bf9783] <+6691>: mov qword ptr [r13 + 0x80], rdi
servo[0x100bf978a] <+6698>: mov qword ptr [r13 + 0x88], rax
servo[0x100bf9791] <+6705>: mov qword ptr [r13 + 0x90], r9
servo[0x100bf9798] <+6712>: mov byte ptr [r13 + 0x98], r10b
servo[0x100bf979f] <+6719>: mov al, byte ptr [rbp - 0x34a]
servo[0x100bf97a5] <+6725>: mov byte ptr [r13 + 0x9f], al
servo[0x100bf97ac] <+6732>: mov ax, word ptr [rbp - 0x34c]
servo[0x100bf97b3] <+6739>: mov word ptr [r13 + 0x9d], ax
servo[0x100bf97bb] <+6747>: mov eax, dword ptr [rbp - 0x350]
servo[0x100bf97c1] <+6753>: mov dword ptr [r13 + 0x99], eax
servo[0x100bf97c8] <+6760>: mov eax, dword ptr [r13 + 0x15c]
servo[0x100bf97cf] <+6767>: test ah, 0x6
servo[0x100bf97d2] <+6770>: je 0x100bf9949 ; <+7145>
I see this all over the place. It should be using XMM registers instead. This is bad because: (a) it clogs up the instruction stream; (b) it's an inefficient way to perform structure moves; (c) it kills tons of registers, resulting in spills elsewhere (notice rax, rdx, rdi, r8, r9, and r10 are all dead for no good reason); (d) it puts pressure on the register allocator, making compile times worse.
Is there some way to get LLVM to emit the right thing here?
cc @eddyb @brson
Tagging this as help wanted because it's a significant (and kinda obvious) optimization. Unfortunately we don't know how to make LLVM do it. Will require some digging.
Brian Anderson at 2016-07-28 17:30:23
@brson It might be impossible in the general case if LLVM can't get some guarantees about bytes that are never touched by the individual copies (i.e. the padding bytes). Locally, not reading those bytes can inform the optimizer that they are irrelevant, but if pointers to the values are used anywhere else, they have to be marked with the "padding holes" somehow.
Eduard-Mihai Burtescu at 2016-07-28 17:36:24
It may also be undesirable in the general case for security reasons.
Brian Anderson at 2016-07-28 17:38:37
Here's the code: https://github.com/servo/servo/blob/master/components/layout/block.rs#L1792
Patrick Walton at 2016-07-28 17:40:48
LLVM IR: https://www.dropbox.com/s/f00k5a6w660s9eh/layout-7b1ca0dc53bfeae5.ll.xz?dl=0
Patrick Walton at 2016-07-28 18:54:16
servo[0x100bf9771] <+6673>: mov rdx, qword ptr [rbp - 0x710] servo[0x100bf9778] <+6680>: mov qword ptr [rbp - 0x710], rdxAre you sure this is a release build? That looks extremely suspicious.
eefriedman at 2016-07-28 20:26:05
Looking at the IR for
<layout::block::BlockFlow as layout::flow::Flow>::fragment:- 718
alloca- 42
core::raw::TraitObject(you'd expect these to vanish)
- 42
- 649
store - 917
load
Honestly, I have no idea where exactly those assembly instructions come from. It might be possible to use debug info to correlate to the IR and see what exactly is causing them, but at this point, it just looks like a very large number of temporaries and loads/stores between them.
This part is a potential clue, but I can't find the
6constant in the function's IR:servo[0x100bf97c8] <+6760>: mov eax, dword ptr [r13 + 0x15c] servo[0x100bf97cf] <+6767>: test ah, 0x6 servo[0x100bf97d2] <+6770>: je 0x100bf9949 ; <+7145>Eduard-Mihai Burtescu at 2016-07-28 20:36:17
- 718
@pcwalton I convinced myself this is a release + debuginfo build, but that's not actually true, is it?
This explains where all the code comes from (
#[inline(always)]): https://github.com/servo/servo/blob/dfc007e10e8f0815966682e768685f14e55164c2/components/layout/block.rs#L772.Eduard-Mihai Burtescu at 2016-07-28 22:55:39
Is this before or after optimization?
mem2regshould have optimized this out (and should probably be the first pass run).Demi Marie Obenour at 2016-07-29 00:16:09
@DemiMarie @eefriedman Turns out the IR is from a debug build (my fault, forgot to add
--releasein the command I gave @pcwalton), but the assembly is from a release build.Eduard-Mihai Burtescu at 2016-07-29 00:18:29
What are the current options to teach LLVM about our padding holes? See also #17027 and #6736.
Cc @rust-lang/wg-codegen
Anthony Ramine at 2018-04-02 11:42:58
!tbaa.struct is a way to describe padding holes when doing
llvm.memcpyetc.Dan Gohman at 2018-04-02 15:53:55
Triage: not aware of any changes here. This is a pretty abstract bug...
Steve Klabnik at 2019-05-20 12:47:16