ill-typed unused FFI declarations can cause UB
This compiles and prints "p is not null and 0x0":
pub mod bad {
#[allow(improper_ctypes)]
extern {
pub fn malloc(x: usize) -> &'static mut ();
}
#[no_mangle]
pub fn bar() {
let _m = malloc as unsafe extern "C" fn(usize) -> &'static mut ();
}
}
pub mod good {
extern {
fn malloc(x: usize) -> *const u8;
}
pub fn foo() {
unsafe {
let p = malloc(0x13371337deadbeef); // your computer doesn't have enough memory
if p.is_null() {
panic!("p is null");
} else {
panic!("p is not null and {:?}", p);
}
}
}
}
fn main() {
bad::bar();
good::foo();
}
The problem is that the "ill-typed" malloc declaration (bad::malloc), which is compiled in but never actually called, ends up putting a nonnull attribute on malloc, which causes mod good to be miscompiled.
Here's another example that does not involve malloc. It does not get miscompiled currently, but it demonstrates the issue.
This compiles and prints "p is not null and 0x0":
pub mod bad { #[allow(improper_ctypes)] extern { pub fn malloc(x: usize) -> &'static mut (); } #[no_mangle] pub fn bar() { let _m = malloc as unsafe extern "C" fn(usize) -> &'static mut (); } } pub mod good { extern { fn malloc(x: usize) -> *const u8; } pub fn foo() { unsafe { let p = malloc(0x13371337deadbeef); // your computer doesn't have enough memory if p.is_null() { panic!("p is null"); } else { panic!("p is not null and {:?}", p); } } } } fn main() { bad::bar(); good::foo(); }The problem is that we have two declarations of the "malloc" symbol, but LLVM uses a global namespace for these. So during codegen, the 2nd declaration we generate overwrites the first. In this case, the "ill-typed" malloc declaration (
bad::malloc) comes last, up putting anonnullattribute onmalloc, which causesmod goodto be miscompiled.Here's another example that does not involve
malloc. It does not get miscompiled currently, but it demonstrates the issue.See here for a document that summarizes why this happens, and what our options are.
Ralf Jung at 2024-05-07 11:34:20
I haven't actually seen this in practice, but in brendanzab/gl-rs#438 I think someone has (erroneously) declared
GetProcAddressas returning a (non-null) function pointer, which might cause other random code in the same executable to break because the erroneousnonnullreaches them.Ariel Ben-Yehuda at 2017-11-22 22:01:29
Some questions:
- Is it possible for us to have two distinct definitions in LLVM for
malloc, such that the call which is not dead does not inherit the non-null attribute? - Alternatively, or perhaps just additionally, we could detect and lint against the mismatch.
Niko Matsakis at 2018-01-25 16:00:50
- Is it possible for us to have two distinct definitions in LLVM for
triage: P-medium
We should fix this, but not a burning issue.
Niko Matsakis at 2018-01-25 16:01:55
Note that this is not restricted to
extern "C", but it also impacts Rust function declarations: https://llvm.godbolt.org/z/03s-aLI wondered whether, in some cases, LLVM could merge the attributes of the declarations with the definition, introducing UB in safe Rust code like this:
fn foo(x: *mut u8) { ... } extern "Rust" { fn foos_mangled_name(x: &mut u8); }The above link shows that, currently, when merging
noalias, thenoaliasof the declaration can end up being removed when a definition withoutnoaliasis present in the same module.It is unclear to me whether making the definition
noaliaswould be a sound optimization of LLVM-IR.noaliasis modeled after C'srestrict, and in C, the behavior of programs with a declaration like the above, which doesn't match the "restrict"/"noalias" qualifier of the definition, is undefined. So LLVM can assume that it never happens, and if it does, mergingnoaliasin any way would be "ok" for C (not for Rust, and unclear whether it's ok for LLVM-IR, hopefully not).gnzlbg at 2019-08-25 10:08:39
Alternatively, or perhaps just additionally, we could detect and lint against the mismatch.
@nikomatsakis if the incorrect declaration is in a different crate, could it become visible here due to LTO ? If so, then warning against this would need some kind of whole-program analysis.
gnzlbg at 2019-08-25 11:47:31
my review of current status: running the initial example in the playground, it looks like a warning is emitted. running the example through MIRI also in the playground does not produce any additional warnings besides that the allocation is too large. I see that in the rust-lang/unsafe-code-guidelines issue which mentions this one there is interesting discussion, but neither the docs in unsafe-code-guidelines nor the rustonomicon ffi page appear to mention this in the currently built versions of their docs. it doesn't appear that there's a ready-to-implement suggested fix in either this thread or that thread.
Lauren at 2021-02-19 01:14:47
Miri checks the UB listed here, and so far there is no Rust-level UB related to these invalid declarations.
So arguably, right now, the code is fine but the LLVM backend handles it incorrectly. However, I am not sure if there is any way for the LLVM backend to prevent LLVM from "merging" these prototypes, or if there is any hope of adjusting LLVM to not do that any more. Ideally, if "merging" has to happen, the merged definition would have the minimal assumptions of both inputs, and if the inputs are incompatible, that should just be an error.
Ralf Jung at 2021-02-20 10:23:30
Is this still an issue? It looks like since 1.63, rust no longer removes the branch for the shown example https://godbolt.org/z/n9PYEKexh
Lukas Wirth at 2022-10-21 09:47:30
I don't see any reason to assume that this changed in any fundamental way. Examples that show UB are always fragile, but LLVM to my knowledge still assumes the one-definition-rule.
Ralf Jung at 2022-10-21 10:27:10
https://github.com/rust-lang/unsafe-code-guidelines/issues/198 discusses the issue. There isn't a "one definition rule" applicable here; the issue is about compatible/composite declarations and comes from LLVM's C language semantics. I'm not going to repeat everything I said there, but it's all relevant to this.
It is surprising that adding an unused declaration affects the compilation of a Rust program. However, unused declarations affecting the compilation are explicitly allowed by ISO C compatible declaration rules. It seems like rustc could avoid telling LLVM about unused declarations to avoid subjecting Rust programs to those rules. Or, rustc could just emit a different stronger warning for seemingly-unused
externdeclarations telling the user that even though the declaration went unused, the declaration may still affect compilation. (Indeed, because of this issue, there is really no such thing as an unusedexternfunction declaration.)Brian Smith at 2022-10-22 15:19:37
How do you define unused declarations? Are they unused if there are no references, if all referencing functions are dead code, or if all references aren't used during the current execution (but may be during another execution)?
bjorn3 at 2022-10-22 15:25:42
Focusing on unused declarations seems very fragile. The case of two used declarations with different attributes is just as dangerous, where one the attributes of one declaration might be applied to the unsuspecting other declaration.
Ralf Jung at 2022-10-22 16:10:14
https://github.com/rust-lang/unsafe-code-guidelines/issues/198#issuecomment-529092294 explains why this is especially dangerous if things are really operating as described, using an example of how one can publish a crate with no
unsafebut have it still affect the safety of the program. Static analysis tools looking forunsafewould need to be extended to considerexterndeclarations as well.Brian Smith at 2022-10-22 16:24:09
There's a lang team thing about handling this by having people put
unsafe extern, but I'd honestly prefer if this could be fixed by LLVM and/or rustc.Lokathor at 2022-10-22 16:31:06
Cranelift doesn't allow multiple function declarations in the same module with different signatures (at clif ir level) at all. It returns an
IncompatibleSignatureerror. So far I have only seen a single instance where a function was defined with two different signatures in the same module. I fixed it in https://github.com/rust-lang/rust/pull/95088.bjorn3 at 2022-10-22 16:37:57
@bjorn3 I think "same module" matters for non-LTO builds but for LTO builds then all the declarations in the whole program matter.
Brian Smith at 2022-10-23 06:27:46
So far I have only seen a single instance where a function was defined with two different signatures in the same module
What about this? https://github.com/rust-lang/rust/blob/6c9c2d862dd10718ba2b2a320c3390995ad414bc/library/std/src/sys/unix/args.rs#L207-L212
Or does that not count because they have different names (it would be pretty easy to imagine code that didn't use different names for this, though...)
Thom Chiovoloni at 2022-10-23 08:31:25
@bjorn3 I think "same module" matters for non-LTO builds but for LTO builds then all the declarations in the whole program matter.
True. I'm just saying that Cranelift doesn't support multiple declarations, so even if LLVM were to not miscompile them, it may still have to be forbidden in rust for compatibility with Cranelift.
bjorn3 at 2022-10-23 09:03:13
So far I have only seen a single instance where a function was defined with two different signatures in the same module
What about this?
https://github.com/rust-lang/rust/blob/6c9c2d862dd10718ba2b2a320c3390995ad414bc/library/std/src/sys/unix/args.rs#L207-L212
Or does that not count because they have different names (it would be pretty easy to imagine code that didn't use different names for this, though...)
Good point. macOS on M1 is not yet supported by cg_clif as several things are missing. For this I could special case objc_msgSend to have one fixed signature at cranelift_module level, but the right one at cranelift_codegen level I guess. (cranelift_codegen doesn't mind)
bjorn3 at 2022-10-23 09:07:01
Could somebody please share an up-to-date godbolt link showing the incorrect attribute merging behavior? There's a lot of code samples floating around in these threads, but I didn't manage to actually reproduce the issue.
Nikita Popov at 2022-10-23 11:01:52
I think that current state in the terms of initial MIR to LLVM code generation is that attributes attached to a call site are based on the appropriate declaration as used in Rust. At the same time the attributes from the LLVM function declaration are also in effect, and that LLVM function declaration might in fact be based on a distinct Rust declaration, hence the trouble.
An LLVM function declaration is based on a Rust declaration which happened to be generated first.
In the following example the
is_null()is optimized out. If once swaps the function source code order the check remains:#[no_mangle] pub unsafe fn a() -> *const u8 { extern { pub fn malloc(_: usize) -> &'static (); } malloc(usize::MAX) as *const _ as *const _ } #[no_mangle] pub unsafe fn b() -> *const u8 { extern { pub fn malloc(_: usize) -> *const u8; } let p = malloc(usize::MAX); if p.is_null() { panic!(); } p }To reduce the risk of such issues, maybe we could add optimization attributes to definitions and call sites, but omit them from declarations?
Tomasz Miąsko at 2022-10-23 12:13:00
Gah, the mistake I made it looking at
-C opt-level=0IR, rather than-C opt-level=3 -C no-prepopulate-passes. For the former the relevant attributes aren't added at all, so of course the problem does not occur. (https://rust.godbolt.org/z/T7E64ETc5 is a minimal example for just the attributes.)@tmiasko I think in principle applying attributes to definitions and call-sites only should be fine. We would have to be more thorough about call-site attributes though, because I believe we currently don't apply function-level optimization attributes to call-sites (e.g. allocator attributes will only be on the
__rust_allocdeclaration, not call-sites).Nikita Popov at 2022-10-23 13:29:39
For this I could special case objc_msgSend to have one fixed signature at cranelift_module level, but the right one at cranelift_codegen level I guess. (cranelift_codegen doesn't mind)
FWIW, while this is probably a good idea for the broader ecosystem, it's honestly kind of goofy that we are using these because the functions we're avoiding (
_NSGetArgvand such) are in public headers for the platform, and could also be gotten via the arguments to a static ctor, like how we do it on linux-gnu. (We also use_NSGetExecutablePathon iOS anyway).Thom Chiovoloni at 2022-10-23 18:27:40
Mach-O has __mod_init_func as counterpart to .init_array on ELF, but looking at the source of dyld it seems that the functions don't get any arguments, unlike Linux where it gets argc, argv and envp as arguments.
bjorn3 at 2022-10-23 18:53:41
Mach-O has __mod_init_func as counterpart to .init_array on ELF, but looking at the source of dyld it seems that the functions don't get any arguments, unlike Linux where it gets argc, argv and envp as arguments.
This is not true. The init functions get passed argc, argv, envp, applev, and a pointer to ProgramVars. I don't know where you were looking, but see https://github.com/apple-oss-distributions/dyld/blob/f73171cf0a177f453fdb124952908fe83864acab/dyld/DyldProcessConfig.h#L67 and related usage in dyld.
Thom Chiovoloni at 2022-10-23 18:57:22
An example of a crate in the ecosystem (which at least works on my machine, and has been tested several other apple OSes) that uses this is https://github.com/BlackHoleFox/appleargs/blob/797caf96cd9684f8747dbe336bb3c9022fc12b8d/src/lib.rs#L215-L221.
This is getting off-topic, though.
Thom Chiovoloni at 2022-10-23 19:01:06
I looked at https://opensource.apple.com/source/cctools/cctools-384/dyld/mod_init_funcs.c.auto.html. Looks like it is quite an outdated version though.
bjorn3 at 2022-10-23 19:03:52
Yeah,
apple-oss-distributions/*has the latest versions.Thom Chiovoloni at 2022-10-23 19:04:31
This can be done in entirely safe code without triggering the
unsafe_codelint:extern { pub fn malloc(x: usize) -> &'static mut (); } #[used] static FOO: unsafe extern "C" fn(usize) -> &'static mut () = malloc as unsafe extern "C" fn(usize) -> &'static mut (); fn main() { Vec::<u8>::with_capacity(0x13371337deadbeef); }This aborts without optimizations, but doesn't report the allocation failure with
-Copt-level=3.Edit: I think that is LLVM removing the allocation due to being unused. I can't reproduce it if I use the resulting vec in any way.
bjorn3 at 2023-04-09 19:46:10
What is the current state of this?
Is the proposed approach of applying attributes to call-sites rather than definitions a feasible one?
Josh Triplett at 2023-07-05 17:41:47
https://github.com/rust-lang/rfcs/pull/3439 I presume is where this is being discussed
Jules Bertholet at 2023-07-06 05:11:53
in C, the behavior of programs with a declaration like the above, which doesn't match the "restrict"/"noalias" qualifier of the definition, is undefined
This statement was the entire basis for this discussion here, I think. I also thought it to be true, based on what seems like fairly clear statements in the C standard:
"All declarations that refer to the same object or function shall have compatible type; otherwise, the behavior is undefined" [C23 6.2.7 §2] and "For two qualified types to be compatible, both shall have the identically qualified version of a compatible type; the order of type qualifiers within a list of specifiers or qualifiers does not affect the specified type" [C23 6.7.3 §11].
Alas, the C standard is written in a way to maximize confusion. It turns out that this might actually not be UB:
"For two function types to be compatible, [...]. In the determination of type compatibility and of a composite type, each parameter declared with function or array type is taken as having the adjusted type and each parameter declared with qualified type is taken as having the unqualified version of its declared type." [C23 6.7.6.3 §14]
So, it does seem like there's not actually UB when some declarations have
restrictand other do not? But then isn't LLVM wrong for doing the kind of attribute merging that it does?Ralf Jung at 2023-11-28 07:31:28
Relevant comment by @nikic
It's my understanding that this is, at least primarily, a rustc bug, not an LLVM bug. rustc is responsible for how it wants to handle attributes when there are multiple declarations/definitions of the same symbol. IIRC clang does handle this correctly in relevant cases by e.g. not emitting attributes like noalias on declarations. (Within one module, LLVM has no concept of multiple declarations/definitions for one symbol, this is fully a frontend concern. LLVM's own interpretation would only come into play during module linking.)
Only adding attributes to calls and not declarations is perfectly fine as far as LLVM is concerned, I don't see any implementation problems from that side. The only possible downside that comes to mind is that if a call gets devirtualized, we might have attributes with more information on the declaration rather than the (originally virtual) call. I doubt this is a significant problem though.
So... what does LLVM do during module linking? This does seem relevant for LTO.
Ralf Jung at 2024-05-07 11:58:29
I think I found the right place to fix this...
declare_fnis where the LLVM function declarations are generated. Unfortunately this is also used for definitions, so we can't just always omit attributes there. Also, some attributes are load-bearing for the ABI, those should (I assume) also be applied even on declarations. So the information whether this is a foreign item or not needs to be propagated quite deep through all theapply_attrs_llfnfunctions.Ralf Jung at 2024-07-25 20:05:41
I opened a PR for this: https://github.com/rust-lang/rust/pull/128247. However, I think it's only a partial solution... I have no idea what to do for the general case, where there are two imports of the same function with arbitrarily different signature. See the PR description.
Ralf Jung at 2024-07-26 20:09:52
Turns out this might be too hard for me to fix. I am also increasingly skeptical whether LLVM will cope well with functions declared as
@func_name = external global [0 x i8]-- aren't functions special e.g. in how they are treated by the linker on some platforms, and other things like that? Also this will hit entirely untested codepaths in LLVM.Nominating for t-compiler discussion to see what they are thinking about how to best resolve this issue. We may also want to re-consider the priority of this, as 4 years ago when the priority was set, it wasn't yet clear what the actual issue is.
Ralf Jung at 2024-08-05 10:25:08
@rustbot label: +I-lang-nominated
The proposed fixes for this issue might not practical, at least not without some help from the LLVM side, so we should consider what we want to do about this as a language.
Members of T-compiler in particular suspect that merely issuing a hard error in response to a pair of FFI declarations with distinct types may not suffice; we suspect there are users who are relying on being able to do that, and there is some underlying notion of "compatible" pairs of types that is being used to justify such code.
Felix S Klock II at 2024-08-15 14:55:47
The next plan for https://github.com/rust-lang/rust/pull/128247 is to try declaring functions with an "empty" signature (no arguments,
voidreturn type). That seems less likely to cause issues than declaring them as empty arrays.The main concern with that approach seems to be that LLVM's special treatment for specific library functions relies on having the signature in the declaration.
mallocis a typical example but shouldn't concern us much since we have our own allocation functions and Rust code doesn't usually callmallocdirectly, but I am wondering e.g. about all the libm functions we expose on our float types and whether they could still be const-folded.Ralf Jung at 2024-08-15 15:36:09
So... what happens when I import the same symbol name as an extern static and as a function?
Right now, it seems like the static "wins":
@testname = external global i32 ; Function Attrs: nonlazybind uwtable define void @mytest() unnamed_addr #0 !dbg !7 { start: %_2 = load i32, ptr @testname, align 4, !dbg !13 call void @testname(i32 %_2), !dbg !14 ret void, !dbg !15 }This code is obviously UB, but if I make the static a 1-ZST there's not actually going to be any memory access there.
Ralf Jung at 2024-08-18 17:32:17
The proposed fixes for this issue might not practical, at least not without some help from the LLVM side, so we should consider what we want to do about this as a language.
Note that the cranelift backend fundamentally cannot represent such different signatures. I am not sure if there's any way to use a scheme like "declare all functions as taking no arguments and then define the rest for each call site" there -- @bjorn3? The GCC backend can likely use a trick similar to what https://github.com/rust-lang/rust/pull/128247 now does with LLVM.
On the other hand, given cross-crate monomorphization, it is really hard to ensure that we'll never get this cranelift error -- we have to ensure that the signature is consistent (at least insofar as that would affect cranelift function types) across the entire crate graph. If two crates use signatures that are "meaningfully" different (whatever exactly that means), they may work fine in isolation but the program may then fail to build when they are both used in the same dependency tree. That's pretty bad...
Ralf Jung at 2024-08-18 20:38:11
Note that the cranelift backend fundamentally cannot represent such different signatures. I am not sure if there's any way to use a scheme like "declare all functions as taking no arguments and then define the rest for each call site" there -- @bjorn3?
For functions it would be possible to declare them with an empty signature and then after referencing it in a function overwrite the signature for this reference, but that would require knowing all function definitions in the codegen unit in advance (as the definition still needs to match the declaration), which is incompatible with anything like LTO as there you did have to make the choice for function signature in one module and when you merge it with another module containing the definition the function signature has to match. For declaring a function and and a static with the same name, this hack wouldn't work either. You can't reference a function/static as the other type from a module.
bjorn3 at 2024-08-18 20:53:08
For functions it would be possible to declare them with an empty signature and then after referencing it in a function overwrite the signature for this reference, but that would require knowing all function definitions in the codegen unit in advance
I am confused. What does "overwrite" mean here? What I meant above (what https://github.com/rust-lang/rust/pull/128247 does in LLVM) is to always keep the declaration with an empty signature, it never gets changed or "overwritten" in that sense. But when we call the function we give it some arguments anyway and that works.
Maybe this could be encoded in cranelift by taking the global function, casting the function pointer to a different type that matches what this call site uses, and then calling that?
Ralf Jung at 2024-08-18 21:23:24
In Cranelift IR a
Functioncontains a list of all references to external functions with function signatures and an index into the list of function declarations stored in the moduld. This is separate from the signature stored for the function declaration in the module. When you reference an external function from cranelift IR you "import" the function declaration from the module. This normally inherits the function signature, but you can overwrite it after the fact.You can't use an empty signature for a declaration of a function that will be locally defined however. There is a check that the function signature of the
Functionmatches the declaration in the module when defining the function.bjorn3 at 2024-08-18 22:15:10
You can't use an empty signature for a declaration of a function that will be locally defined however. There is a check that the function signature of the Function matches the declaration in the module when defining the function.
That seems to contradict your previous statement where you said "after referencing it in a function overwrite the signature for this reference", so I am still confused. Can the per-function import of the per-module table have a different type than the per-module table or not? If no then this entire thing doesn't seem relevant here as anyway it can't be used. And even if it can, one could have calls of two different imports of the same external function inside one function so a per-caller type signature doesn't help -- we need a per-call-site signature. (Though it would help insofar as it would make the signature consistency check more local, at least if we ignore inlining.)
But I take it the upshot is that cranelift tries to be somewhat typesafe and if a function is declared with a given signature, you have to call it like that. However, presumably it is also possible to create function pointers and cast them to a different type, so that way it'd still be possible to declare them with one signature but call them with another?
Ralf Jung at 2024-08-19 06:30:02
Incorrect static declarations can also introduce immediate undefined behaviour, even when they are never used at runtime. From LLVM reference:
Either global variable definitions or declarations may have an explicit section to be placed in and may have an optional explicit alignment specified. If there is a mismatch between the explicit or inferred section information for the variable declaration and its definition the resulting behavior is undefined.
<details> <summary>An example where incorrect size of an unused static introduces undefined behaviour</summary>For global variable declarations, as well as definitions that may be replaced at link time (linkonce, weak, extern_weak and common linkage types), the allocation size and alignment of the definition it resolves to must be greater than or equal to that of the declaration or replaceable definition, otherwise the behavior is undefined.
extern { // Suppose that the actual definition is zero sized. static S: usize; } // Requires: a == 0 #[inline(never)] pub unsafe fn f(mut a: usize) -> usize { let mut s = 0; while a != 0 { s += unsafe { S }; a -= 1; } s }Is optimized to:
</details>define noundef i64 @f(i64 noundef %0) unnamed_addr #0 { start: %_4 = load i64, ptr @S, align 8 %_4.fr = freeze i64 %_4 %1 = mul i64 %_4.fr, %0 ret i64 %1 }Tomasz Miąsko at 2024-08-19 07:46:05
Incorrect static declarations can also introduce immediate undefined behaviour, even when they are never used at runtime.
Does this affect linkme? The
LINKME_STOPstatic is a non-zero-sized static placed at the end of the section, so after any actual elements, so it does not neccessarily point to any valid memory: https://github.com/dtolnay/linkme/blob/0e5e28bd673014b2b8d0a7337eab9489f74d2624/impl/src/declaration.rs#L139-L143In other words, is this still a problem if the static is never accessed, even in dead code? (It only has its address taken to find out where the section ends).
Max “Goldstein” Siling at 2024-08-19 07:55:07
Yes, such a static is UB even if never accessed. Statics act basically like globally available references that must be always-dereferenceable.
Ralf Jung at 2024-08-19 08:05:05
That seems to contradict your previous statement where you said "after referencing it in a function overwrite the signature for this reference", so I am still confused. Can the per-function import of the per-module table have a different type than the per-module table or not? If no then this entire thing doesn't seem relevant here as anyway it can't be used. And even if it can, one could have calls of two different imports of the same external function inside one function so a per-caller type signature doesn't help -- we need a per-call-site signature. (Though it would help insofar as it would make the signature consistency check more local, at least if we ignore inlining.)
Functions would be represented something like the following in clif ir:
function u0:0(i64) { fn0 = u0:1(i64) -> i32 block0(v0: i64): v1 = call fn0(v0) return }Here
fn0 = u0:1(i64) -> i32is a reference to an external function withu0:1as index into the module and(i64) -> i32as function signature. It is possible to override the latter such that it doesn't match the signature in the module, so declaringu0:1as()(function signature with no arguments and return value) in the module is possible. The functionu0:0for which this is providing a definition will have to be declared as(i64)(i64as argument and no return value) however due to the signature check. So for example the following module contents (made up syntax) would be possible:u0:0 = local "my_function": (i64) u0:1 = import "imported_function": ()However if you want to also import "imported_function" as static, you did get an error as "imported_function" is already declared as function. In addition if you were to merge the above module with a module containing a definition of "imported_function":
u0:0 = export "imported_function": (i64) -> i32that wouldn't be possible as you get two different declarations of "imported_function".
This is not much of an issue currently as Cranelift currently doesn't benefit from LTO (which would be the main reason to want to merge modules) at all due to all optimizations being strictly within a single function, but it might in the future and I don't want to exclude support for it.
But I take it the upshot is that cranelift tries to be somewhat typesafe and if a function is declared with a given signature, you have to call it like that. However, presumably it is also possible to create function pointers and cast them to a different type, so that way it'd still be possible to declare them with one signature but call them with another?
Cranelift has separate instruction for calling a function directly and calling a function pointer. The latter is less efficient as it doesn't allow directly encoding the target address in the call instruction. (By the way cranelift doesn't have typed function pointers, it only has pointer-sized integers. Every use of the call_indirect instruction has to specify the function signature.)
bjorn3 at 2024-08-19 10:53:13
Okay, so encoding calls with the "wrong" signatures via indirect calls works but is inefficient. Makes sense. Maybe it could still be done just for the very rare case where there's a signature mismatch, but it's not pretty. It also seems likely that cranelift will outright reject merging such modules if they have a signature mismatch rather than introducing such casts.
From a Rust perspective, the cranelift signature captures only a small part of what makes up a type, so we could allow some mismatches but not arbitrary mismatches. It would probably have to boil down to something like, the signatures have to be ABI-compatible.
The other point regarding the "signature check" you mention is something I haven't even considered yet for the LLVM case -- merging two modules where one module defines a function and the other one uses it. In that case we'll always want the definition to win. I assume the LLVM encoding via
void @fn()will do that.But IIUC the upshot is, non-ABI-compatible imports of the same symbol anywhere in the crate graph are likely never going to be fully supported in cranelift (as that would require cranelift merging modules with incompatible signatures, and doing the required adjustments to make the result a well-defined module).
Ralf Jung at 2024-08-19 11:30:53
Turns out this is more tricky than expected -- some calling conventions put the number of arguments into the function name, so LLVM has to know the actual signature.
Not sure if LLVM can be fixed to use the arguments at the call site for this, rather than the arguments at declaration site. But it does look increasingly like there's a lot of technical hurdles for importing the same name with different signatures into one module -- this isn't fundamentally anything that should be hard, it's just that LLVM (and cranelift) are making assumptions that make this hard.
Ralf Jung at 2024-09-11 09:50:11
I intend to propose a t-lang design meeting about this problem. Here's the writeup for that: https://hackmd.io/cEi0aL6XSk6DDDASwpVH-w Please let me know if I missed anything important. :)
Ralf Jung at 2024-09-13 07:09:27
Thanks for writing that doc @RalfJung. We discussed during the lang team meeting and my proposal to make incremental progress (which the rest of the team seemed roughly aligned with) is the following:
- Accept for now that the conflict is UB (as it already is), but
- Lint against cross-crate incompatibilities as best as we can. Consider making these lints error-by-default. Improve them over time so their definition of "compatibility" becomes better.
- Lint against using
externinstead ofunsafe externin older editions, so it's clearer where users are making important assertions that might lead to UB. - If/when the cross-crate incompatibility lint matches exactly the behavior of our backend that we want to expose, consider upgrading it to a hard error, possibly over an edition boundary.
Does this seem like a reasonable path forward? If so, maybe we can put off using a design meeting for this.
The details of the compatibility rules are of course the difficult part, but I'm not sure if T-lang will be very helpful in working those out. I think that has to be informed by a combination of implementation details, the backends we use, the targets we support, and the use cases we discover in the wild. It seems to me that this is a complex enough space that it has to be worked out over time instead of designed all at once.
Put another way, if the question is "what definitions of compatibility are we prepared to accept", my answer would be roughly "whatever is most likely to catch actual bugs while staying out of the way in places it isn't helping". I see some value in getting the lang team to sign off on that as a general direction, but doubt we need a full design meeting to do so.
Tyler Mandry at 2024-09-18 18:34:55
@tmandry I agree the exact compatibility rules need to be figured out "on the ground". I was mostly asking for t-lang input on the bigger conceptual questions, like -- should we lint/error against the cross-crate case of this? We generally spend a lot of effort to ensure that two different crates that build independently can always be linked together further downstream (e.g., the orphan rule). But here we'd introduce exactly the kind of thing the orphan rule intends to prevent -- two crates work fine in isolation but cannot be used together.
OTOH, that is the nature of this UB, so our hands a bit tied. But this seems like a fundamental enough point that I wanted to get some t-lang input. If it doesn't need a design meeting that's fine for me. :)
Ralf Jung at 2024-09-18 18:41:25
We generally spend a lot of effort to ensure that two different crates that build independently can always be linked together further downstream (e.g., the orphan rule). But here we'd introduce exactly the kind of thing the orphan rule intends to prevent -- two crates work fine in isolation but cannot be used together.
Good point. I do agree with what you said here..
OTOH, that is the nature of this UB, so our hands a bit tied.
And specifically it's because there's a global namespace shared by all (public) symbols linked into a binary.
I still think the best thing we can do for now is lint on it, letting the lints be allowed anywhere the conflicting crates are brought together, or at the final linked crate. Since it's allowable we can do this without thinking too hard about the implications, and consider them again if we decide to move to a hard error.
Tyler Mandry at 2024-09-19 05:43:40
Okay. :)
So the next steps would be:
- adjust codegen to emit fewer attributes for these imports, so that more imports become "compatible" (e.g. if they only differ in a
nonnull) - tweak the lint to reduce false positives (e.g. from differences behind pointer indirection)
- make the lint cross-crate
I don't know how to do the last step -- is there a way to iterate all imports from all crates in the current crate graph?
Ralf Jung at 2024-09-19 06:19:19
- adjust codegen to emit fewer attributes for these imports, so that more imports become "compatible" (e.g. if they only differ in a