dlclose() does not behave properly on Mac

0dd5bef
Opened by Damien Radtke at 2023-01-16 19:27:32

This report will reference this repository which reproduces the issue: https://github.com/dradtke/rust-dylib-issues

The Issue

The repository contains an application library, built as a dylib, and two example main programs, one in Rust and one in C. Each main application runs in a loop, loading the library with dlopen(), calling a method, and then closing with dlclose(). The expectation is that any changes to the library will be picked up immediately by the main application when it is recompiled.
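
For readers without the repository at hand, the load/call/unload cycle described above can be sketched in Rust roughly as follows. This is not the code from the repository. The helper names `load_call_unload` and `last_dl_error` are hypothetical, the `dlopen` family is declared by hand instead of through the `libc` crate, and the sketch assumes the exported function takes no arguments and returns a NUL-terminated string that the caller may borrow but does not free.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_int, c_void};

// Hand-written declarations of the POSIX dynamic-loading API
// (the `libc` crate provides the same functions).
unsafe extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
    fn dlclose(handle: *mut c_void) -> c_int;
    fn dlerror() -> *mut c_char;
}

// RTLD_NOW is 2 on both Linux and macOS.
const RTLD_NOW: c_int = 2;

/// Copies the most recent dl* error message. The pointer returned by
/// dlerror() is owned by the loader, so it is only borrowed here.
fn last_dl_error() -> String {
    unsafe {
        let err = dlerror();
        if err.is_null() {
            String::from("unknown dynamic loader error")
        } else {
            CStr::from_ptr(err).to_string_lossy().into_owned()
        }
    }
}

/// Hypothetical helper: opens `path`, calls the exported `symbol`,
/// copies the returned C string, and closes the library again.
fn load_call_unload(path: &str, symbol: &str) -> Result<String, String> {
    let c_path = CString::new(path).map_err(|e| e.to_string())?;
    let c_symbol = CString::new(symbol).map_err(|e| e.to_string())?;
    unsafe {
        let handle = dlopen(c_path.as_ptr(), RTLD_NOW);
        if handle.is_null() {
            return Err(last_dl_error());
        }
        let sym = dlsym(handle, c_symbol.as_ptr());
        if sym.is_null() {
            let err = last_dl_error();
            dlclose(handle);
            return Err(err);
        }
        // Assumption: the symbol has the signature `extern "C" fn() -> *const c_char`.
        let get_message = std::mem::transmute::<*mut c_void, extern "C" fn() -> *const c_char>(sym);
        // Copy the message before the library is closed.
        let message = CStr::from_ptr(get_message()).to_string_lossy().into_owned();
        dlclose(handle);
        Ok(message)
    }
}
```

A main program would call something like `load_call_unload("../app/target/debug/libapp.dylib", "get_message")` in a loop and print the result. Whether a rebuilt library is picked up on the next iteration depends on whether `dlclose()` actually unloads the image, which is the subject of this report.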

However, the behavior between the two programs differs. If I run the two main programs side-by-side, then make a change to the returned message and recompile the library, only the C program immediately reflects the change. The Rust main program won't reflect any changes until it is fully restarted.

It appears that this is Mac-specific behavior. When the same test is run on Debian, the two main programs behave identically.

The Environment

Operating System: macOS Sierra 10.12.6

Rust Version:

rustc 1.23.0 (766bd11c8 2018-01-01)
binary: rustc
commit-hash: 766bd11c8a3c019ca53febdcd77b2215379dd67d
commit-date: 2018-01-01
host: x86_64-apple-darwin
release: 1.23.0
LLVM version: 4.0

C Compiler:

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

  1. Your main.rs includes extern crate app; -- it may be that the linker on Linux is trimming the unused dependency but macOS is keeping it linked. If the library is loaded at startup, then dlopen/dlclose will just be bumping the reference count up and down.

    Josh Stone at 2018-02-02 21:41:54

  2. Ah, that's a good call. Unfortunately, it looks like removing extern crate app; causes it to segfault, which also doesn't happen on Linux.

    Damien Radtke at 2018-02-02 22:03:32

  3. Can you capture any information about the segfault? Perhaps a debugger backtrace?

    Josh Stone at 2018-02-03 01:30:03

  4. Likely a duplicate of https://github.com/rust-lang/rust/issues/28794.

    Simonas Kazlauskas at 2018-02-03 16:55:47

  5. A quick look at your Rust code reveals that it invokes undefined behaviour. You use CString to null-terminate your literals, however CString::new(&symbol[..]).unwrap().into_raw() will immediately free the buffer CString allocates, so the C code reads an invalid pointer.

    This could also be a cause for different behaviour.

    Simonas Kazlauskas at 2018-02-03 16:58:27

  6. Here's what the debugger says when I run it:

    Process 12004 launched: '/Users/dradtke/Workspace/rust/dylib/main/target/debug/main' (x86_64)
    Message: hello there world
    Process 12004 stopped
    * thread #1: tid = 0x78d98d, 0x00007fff9932dca9 libsystem_platform.dylib`OSSpinLockLock + 7, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
        frame #0: 0x00007fff9932dca9 libsystem_platform.dylib`OSSpinLockLock + 7
    libsystem_platform.dylib`OSSpinLockLock:
    ->  0x7fff9932dca9 <+7>:  lock
        0x7fff9932dcaa <+8>:  cmpxchgl %ecx, (%rdi)
        0x7fff9932dcad <+11>: jne    0x7fff9932dcb0            ; <+14>
        0x7fff9932dcaf <+13>: retq
    

    And the full backtrace:

    warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
    * thread #1: tid = 0x78d98d, 0x00007fff9932dca9 libsystem_platform.dylib`OSSpinLockLock + 7, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
      * frame #0: 0x00007fff9932dca9 libsystem_platform.dylib`OSSpinLockLock + 7
        frame #1: 0x00000001000250c6 main`je_arena_dalloc_large [inlined] je_malloc_mutex_lock + 38 at mutex.h:99 [opt]
        frame #2: 0x00000001000250ba main`je_arena_dalloc_large(tsdn=0x000000010060d008, arena=0x4d746c7561666544, chunk=0x0000000100200000, ptr=0x00000001003002c0) + 26 at arena.c:3075 [opt]
        frame #3: 0x0000000100026625 main`je_arena_ralloc [inlined] je_arena_sdalloc(slow_path=true) + 12 at arena.h:1516 [opt]
        frame #4: 0x0000000100026619 main`je_arena_ralloc [inlined] je_isdalloct(slow_path=true) + 164 at jemalloc_internal.h:1195 [opt]
        frame #5: 0x0000000100026575 main`je_arena_ralloc [inlined] je_isqalloc(slow_path=true) at jemalloc_internal.h:1205 [opt]
        frame #6: 0x0000000100026575 main`je_arena_ralloc(tsd=0x000000010060d008, arena=0x0000000000000000, ptr=<unavailable>, oldsize=<unavailable>, size=<unavailable>, alignment=<unavailable>, zero=<unavailable>, tcache=<unavailable>) + 2037 at arena.c:3376 [opt]
        frame #7: 0x000000010001cc79 main`je_rallocx [inlined] je_iralloct(ptr=<unavailable>, oldsize=<unavailable>, alignment=0, tcache=<unavailable>, arena=0x0000000000000000) + 263 at jemalloc_internal.h:1259 [opt]
        frame #8: 0x000000010001cb72 main`je_rallocx(ptr=0x00000001003002c0, size=33, flags=<unavailable>) + 674 at jemalloc.c:2414 [opt]
        frame #9: 0x0000000100019ee1 main`alloc_jemalloc::contents::__rde_realloc + 81 at lib.rs:170 [opt]
        frame #10: 0x0000000100002d68 main`alloc::vec::{{impl}}::reserve_exact<u8> [inlined] alloc::heap::{{impl}}::realloc + 19 at heap.rs:127 [opt]
        frame #11: 0x0000000100002d55 main`alloc::vec::{{impl}}::reserve_exact<u8> [inlined] alloc::raw_vec::{{impl}}::reserve_exact<u8,alloc::heap::Heap> + 28 at raw_vec.rs:429 [opt]
        frame #12: 0x0000000100002d39 main`alloc::vec::{{impl}}::reserve_exact<u8> + 25 at vec.rs:486 [opt]
        frame #13: 0x0000000100006bee main`std::ffi::c_str::{{impl}}::from_vec_unchecked + 30 at c_str.rs:360 [opt]
        frame #14: 0x0000000100006ba2 main`std::ffi::c_str::{{impl}}::_new + 114 at c_str.rs:335 [opt]
        frame #15: 0x00000001000020fc main`std::ffi::c_str::{{impl}}::new<&str>(t=(data_ptr = "../app/target/debug/libapp.dylibget_message", length = 32)) + 60 at c_str.rs:329
        frame #16: 0x0000000100002246 main`main::main + 102 at main.rs:19
        frame #17: 0x000000010003fc0f main`panic_unwind::__rust_maybe_catch_panic + 31 at lib.rs:101 [opt]
        frame #18: 0x000000010000fab9 main`std::rt::lang_start [inlined] std::panicking::try<(),closure> + 51 at panicking.rs:459 [opt]
        frame #19: 0x000000010000fa86 main`std::rt::lang_start [inlined] std::panic::catch_unwind<closure,()> at panic.rs:365 [opt]
        frame #20: 0x000000010000fa86 main`std::rt::lang_start + 422 at rt.rs:58 [opt]
        frame #21: 0x0000000100002705 main`main + 37
        frame #22: 0x00007fff9911f235 libdyld.dylib`start + 1
        frame #23: 0x00007fff9911f235 libdyld.dylib`start + 1
    

    Damien Radtke at 2018-02-05 16:02:22

  7. @nagisa

    however CString::new(&symbol[..]).unwrap().into_raw() will immediately free the buffer

    That's not true -- CString::into_raw() relinquishes ownership, and that will just leak unless you pass the memory back to CString::from_raw() later.

    But that does highlight to me that the other from_raw() calls are problematic. Especially CString::from_raw(dlerror()), as dlerror()'s return value is not meant to be freed by the caller. That should probably be CStr::from_ptr() instead.

    The other CString::from_raw(func()) might be OK, when you're absolutely sure that func() is returning memory that came from CString::into_raw(). Plus, those CStrings need to be using the same allocator, which is what I suspect broke after removing extern crate app, since the crash is in jemalloc.

    Generally speaking, allocating in one domain and freeing in another is fraught with danger.

    Josh Stone at 2018-02-05 23:05:44
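
The ownership rules from comment 7 can be illustrated with a small sketch. The function names `give_to_c`, `take_back`, and `copy_borrowed` are hypothetical and do not come from the repository. The sketch only uses `CString::into_raw`, `CString::from_raw`, and `CStr::from_ptr` from the standard library, and it assumes both sides of the exchange use the same allocator.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hands a Rust-allocated string to foreign code. `into_raw` gives up
/// ownership, so the buffer stays alive (and leaks) until it is reclaimed.
fn give_to_c(message: &str) -> *mut c_char {
    CString::new(message)
        .expect("message must not contain interior NUL bytes")
        .into_raw()
}

/// Reclaims a pointer that was produced by `give_to_c`.
/// Only valid for pointers that came from `CString::into_raw`
/// and were allocated with the same allocator.
unsafe fn take_back(ptr: *mut c_char) -> String {
    unsafe { CString::from_raw(ptr) }
        .into_string()
        .expect("message must be valid UTF-8")
}

/// Borrows a C string owned by someone else (as with the return value
/// of dlerror()) and copies it without freeing the original.
unsafe fn copy_borrowed(ptr: *const c_char) -> String {
    unsafe { CStr::from_ptr(ptr) }
        .to_string_lossy()
        .into_owned()
}
```

In this sketch a pointer returned by `dlerror()` would go through `copy_borrowed`, never through `take_back`, since the caller does not own that memory.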

  8. I have a somewhat similar problem with my library, so I tried the repository above. My Rust version is the same, but I'm on High Sierra 10.13.3.

    I ran it with DYLD_PRINT_APIS=1 to see the dyld log.

    It (reloading) actually worked correctly.

    dlopen(../app/target/debug/libapp.dylib, 0x00000002)
    dyld_image_path_containing_address(0x1019ce000)
      dlopen(../app/target/debug/libapp.dylib) ==> 0x10261b000
    dlsym(0x10261b000, get_message)
      dlsym(0x10261b000, get_message) ==> 0x1019cf620
    Message: hello world
    dlclose(0x10261b000)
    dlclose(), found unused image 0x10261b000 libapp.dylib
    dlclose(), deleting 0x10261b000 libapp.dylib
    
    dlopen(../app/target/debug/libapp.dylib, 0x00000002)
    dyld_image_path_containing_address(0x1019ce000)
      dlopen(../app/target/debug/libapp.dylib) ==> 0x10261b000
    dlsym(0x10261b000, get_message)
      dlsym(0x10261b000, get_message) ==> 0x1019cf620
    Message: hello world
    dlclose(0x10261b000)
    dlclose(), found unused image 0x10261b000 libapp.dylib
    dlclose(), deleting 0x10261b000 libapp.dylib
    

    When I changed it to crate-type = ["cdylib"], dlclose no longer unloaded the lib (and the program either segfaulted or returned with error).

    dlopen(../app/target/debug/libapp.dylib, 0x00000002)
    dyld_image_path_containing_address(0x10e772000)
      dlopen(../app/target/debug/libapp.dylib) ==> 0x7fe1e0700000
    dlsym(0x7fe1e0700000, get_message)
      dlsym(0x7fe1e0700000, get_message) ==> 0x10e773460
    Message: hello world
    dlclose(0x7fe1e0700000)
    
    dlopen(../app/target/debug/libapp.dylib, 0x00000002)
      dlopen(../app/target/debug/libapp.dylib) ==> 0x7fe1e0700000
    dlsym(0x7fe1e0700000, get_mess)
      dlsym(0x7fe1e0700000, get_mess) ==> NULL
    dlerror()
    Failed to retrieve get_message symbol: dlsym(0x7fe1e0700000, get_mess): symbol not found
    

    This is quite weird, since the problem I have with my library is the opposite: unloading worked in Sierra, but stopped working in High Sierra (regardless of crate-type).

    Tuấn-Anh Nguyễn at 2018-02-09 16:43:38

  9. I'm experiencing this problem as well, on High Sierra (10.13.4). I noticed the following:

    When calling dlclose on a library written in C (clang -shared), the dylib gets unloaded as expected.

    dlclose(0x7fc3eaf8b000)
    dlclose(), found unused image 0x7fc3eaf8b000 libhsgame.dylib
    dlclose(), deleting 0x7fc3eaf8b000 libhsgame.dylib
    

    When I try the same exact thing again with an identical cdylib written in Rust, dlclose does not unload the library. A refcount > 0 can't be the problem, because even when I dlclose the cdylib 100 times in a loop, dlclose still refuses to release the library (DYLD_PRINT_APIS=1 confirms it only gets opened once and closed 100 times).

    To my knowledge, apart from a refcount > 0, dlclose only refuses to release a dylib when it is still in use somewhere (pointers holding addresses of the lib's symbols still exist), to avoid dangling pointers.

    If that's the case, then the question is, where do these pointers come from? If not, what the hell else is going on?

    Daniel Hauser at 2018-04-28 14:53:45

  10. Ok, I think I found something - I tried two more things with my Rust cdylib:

    • Switching to the system allocator, no effect.
    • Switching to the system allocator and turning the cdylib into a no_std crate, this fixes the problem - dlclose releases the lib.

    rustc --version
    rustc 1.27.0-nightly (ac3c2288f 2018-04-18)
    

    Daniel Hauser at 2018-04-28 15:25:34

  11. A recent change in OS X has "improved" dlclose so that it does not actually unload libraries if some conditions are satisfied. See this comment. Perhaps that’s the reason your library wasn’t unloaded?

    Simonas Kazlauskas at 2018-04-28 21:15:09

  12. Thanks for the link @nagisa, this definitely seems to be related. Do you know of a page that lists all the cases that make a dylib un-unloadable? @nanotech's comment only lists a few and I'd like to figure out the exact reason why Rust cdylibs fall into that category.

    Daniel Hauser at 2018-04-28 22:09:16

  13. Historically, thread local storage (__thread) is what causes Rust dylibs to not get unloaded or generally not work right with dlclose.

    I don’t know of a full list though.

    Simonas Kazlauskas at 2018-04-28 23:03:25

  14. This is very likely to be related to the issues described in https://github.com/rust-lang/rust/issues/88737 and https://github.com/rust-lang/rust/issues/88737#issuecomment-1178525208. Fixing this will likely not result in the behavior the user wants though -- the fact that dlclose works anywhere is kind of a bug; failing to unload is actually macOS doing the right thing.

    In general you should not dlclose Rust libraries that use libstd. There's no way for us to support this on many targets (and we don't quite support it correctly on all the platforms where we could, which is why sometimes it will unload, which can be unsound).

    Unfortunately, dlclose is just not really coherent in programs which have thread local storage (in particular if destructors need to be run on that TLS data). See that issue for an explanation of why.

    Thom Chiovoloni at 2022-09-23 02:03:52
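
To make the thread-local destructor point from comments 13 and 14 concrete, here is a minimal sketch of a Rust thread-local value whose destructor runs when its thread exits. It does not call dlclose and is not taken from the linked issue. The names `Guard`, `GUARD`, `DESTRUCTOR_RAN`, and `run_worker` are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Records whether the thread-local destructor has executed.
static DESTRUCTOR_RAN: AtomicBool = AtomicBool::new(false);

struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // This code runs during thread teardown, after the thread's
        // closure has already returned.
        DESTRUCTOR_RAN.store(true, Ordering::SeqCst);
    }
}

thread_local! {
    // Lazily initialized on first access in each thread.
    static GUARD: Guard = Guard;
}

/// Spawns a worker that touches the thread-local, waits for it to exit,
/// and reports whether the destructor ran.
fn run_worker() -> bool {
    DESTRUCTOR_RAN.store(false, Ordering::SeqCst);
    thread::spawn(|| {
        // First access initializes GUARD and registers its destructor.
        GUARD.with(|_| {});
    })
    .join()
    .expect("worker thread panicked");
    DESTRUCTOR_RAN.load(Ordering::SeqCst)
}
```

If a type like `Guard` lived in a dynamically loaded library, its `drop` code would be part of that library's image. Under that assumption, unloading the image while a thread still has a pending destructor would leave the runtime with a pointer into unmapped code, which is one way to read the warning in comment 14.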

  15. Musl libc doesn't even implement dlclose at all. It returns without doing anything.

    bjorn3 at 2023-01-16 19:27:32