Reference to a #[thread_local] is dereferenced too early

ce3740f
Opened by Joshua Liebow-Feeser at 2020-04-20 20:23:42

I'm working on the dynamic library for elfmalloc for Mac, and one of the caveats that we have to work with is that, during dynamic library loading, TLS is unavailable (if you try to access a #[thread_local], the program simply aborts).

In order to work around this, we have a mechanism which checks to make sure that dynamic loading is finished before we access any thread-local storage (see the with method).

Recently, we encountered a strange bug: We were spuriously accessing thread-local values while dynamic loading was still happening. The code in question looks like this:

pub unsafe fn alloc(size: usize) -> *mut u8 {
    LOCAL_ELF_HEAP.with(|h| (*h.get()).alloc.alloc(size))
        .unwrap_or_else(|| super::large_alloc::alloc(size))
}

LOCAL_ELF_HEAP is a #[thread_local] variable, and in the with method, we make sure that dynamic loading is done before we dereference &self. To make a long story short, it appears as though the generated code has reordered the instructions to dereference LOCAL_ELF_HEAP before we check to make sure that dynamic loading is done. You can see that in the disassembly of the alloc method (output from lldb):

    0x10f1b7670 <+0>:  pushq  %rbp
    0x10f1b7671 <+1>:  movq   %rsp, %rbp
    0x10f1b7674 <+4>:  subq   $0x40, %rsp
    0x10f1b7678 <+8>:  leaq   -0x20(%rbp), %rax
    0x10f1b767c <+12>: leaq   -0x28(%rbp), %rcx
    0x10f1b7680 <+16>: movq   %rdi, -0x28(%rbp)
    0x10f1b7684 <+20>: movq   %rcx, -0x10(%rbp)
    0x10f1b7688 <+24>: movq   -0x10(%rbp), %rdx
    0x10f1b768c <+28>: leaq   0xaf5f5(%rip), %rdi       ; elfmalloc::general::global::LOCAL_ELF_HEAP::h7d8e035078da16c9
    0x10f1b7693 <+35>: movq   %rax, -0x30(%rbp)
    0x10f1b7697 <+39>: movq   %rdx, -0x38(%rbp)
    0x10f1b769b <+43>: callq  *(%rdi)
->  0x10f1b769d <+45>: leaq   -0x20(%rbp), %rdi
    0x10f1b76a1 <+49>: movq   %rax, %rsi
    0x10f1b76a4 <+52>: movq   -0x38(%rbp), %rdx
    0x10f1b76a8 <+56>: callq  0x10f1e2ac0               ; _$LT$alloc_tls..TLSSlot$LT$T$GT$$GT$::with::h2a59e6dd7499a672 at lib.rs:209
    0x10f1b76ad <+61>: leaq   -0x20(%rbp), %rdi
    0x10f1b76b1 <+65>: leaq   -0x28(%rbp), %rax
    0x10f1b76b5 <+69>: movq   %rax, -0x8(%rbp)
    0x10f1b76b9 <+73>: movq   -0x8(%rbp), %rsi
    0x10f1b76bd <+77>: callq  0x10f1eb9a0               ; _$LT$core..option..Option$LT$T$GT$$GT$::unwrap_or_else::h09d8edc34df58ca5 at option.rs:373
    0x10f1b76c2 <+82>: movq   %rax, -0x40(%rbp)
    0x10f1b76c6 <+86>: movq   -0x40(%rbp), %rax
    0x10f1b76ca <+90>: addq   $0x40, %rsp
    0x10f1b76ce <+94>: popq   %rbp
    0x10f1b76cf <+95>: retq

The -> points to where the next instruction to be executed at the time that the program aborted. The callq is to libdyld.dylib's _tlv_bootstrap, which is simply implemented as abort();.

The key thing to note is that the call to _tlv_boostrap (which attempts - and fails - to access the #[thread_local]) is before the call to with a couple of instructions later. I'm not positive, but my hypothesis is that the compiler reordered the accesses because immutable references are supposed to guarantee that the referent exists, and so it thought that the reordering was valid.

  1. I think compiler_fence may be a way to enforce the ordering you want?

    Steven Fackler at 2018-03-08 05:59:38

  2. Perhaps; I'll see if that fixes it. However, I feel like this should be something that the compiler fixes, as it seems like a soundness issue to me (namely, it lets you create an immutable reference to something that doesn't exist).

    Joshua Liebow-Feeser at 2018-03-08 06:07:09

  3. You can also produce immutable references to things that don't exist if you munmap chunks of your address space. The language makes some assumptions about the state of the world that it expects. One of those assumptions is that you are not running while the dynamic loader is stitching the world together :)

    Steven Fackler at 2018-03-08 06:11:41

  4. Unfortunately it looks like compiler_fence doesn't fix it. I'm not really sure why.

    You can also produce immutable references to things that don't exist if you munmap chunks of your address space. The language makes some assumptions about the state of the world that it expects. One of those assumptions is that you are not running while the dynamic loader is stitching the world together :)

    But in order to do that, you need to call munmap, which is unsafe. My point is that you can create a #[thread_local] in safe code, which means that you can violate Rust's memory safety guarantees in safe Rust.

    Joshua Liebow-Feeser at 2018-03-09 08:39:51

  5. What do you propose the solution to this is at the language level, then? Should we make #[thread_local] access unsafe?

    Steven Fackler at 2018-03-09 17:19:53

  6. But in order to do that, you need to call munmap, which is unsafe.

    You don't need to do that, something that can affect your process needs to do that.

    Steven Fackler at 2018-03-09 17:42:13

  7. What do you propose the solution to this is at the language level, then? Should we make #[thread_local] access unsafe?

    I'm not sure whether this is simple or not, but I think the ideal solution would simply be to somehow disallow that optimization from taking place when it comes to accessing thread locals on Mac. I agree that technically my argument implies that accessing #[thread_local]s should be unsafe, but I also understand that that's pretty draconian given how niche of a problem this is.

    But in order to do that, you need to call munmap, which is unsafe.

    You don't need to do that, something that can affect your process needs to do that.

    I have a philosophical disagreement with this point, but seeing as I've already ceded that it's fine to have #[thread_local] access be safe even though it technically shouldn't be, I'm not sure us agreeing on this is important for this discussion.

    Joshua Liebow-Feeser at 2018-03-09 18:55:38