Extremely slow optimizer performance when including large array of strings
Consider the following test program:
```rust
fn main() {
    let words = include!("num_array.rs");
    let mut l = 0;
    for word in words.into_iter() {
        l += word.len();
    }
    println!("{}", l);
}
```
num_array.rs contains an array of 250k strings:
```console
$ (echo '['; seq 1 250000 | sed 's/.*/"&",/' ; echo ']') > num_array.rs
$ head -n 5 num_array.rs
[
"1",
"2",
"3",
"4",
$ tail -n 5 num_array.rs
"249997",
"249998",
"249999",
"250000",
]
```
Compiling this in debug mode took about 45 seconds; long, and potentially a good test case for compiler profiling, but not completely ridiculous.
Compiling this in release mode showed no signs of finishing after 45 minutes. stracing rustc showed two threads, one blocked in a futex and the other repeatedly allocating and freeing a memory buffer:
```
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152) = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152) = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0
```
By way of comparison, an analogous C program compiled with GCC takes 4.6s to compile without optimization, or 5.6s with optimization. Python parses and runs an analogous program in 1.2s. So, 45s seems excessive for an unoptimized compile, and 45m+ seems wildly excessive for an optimized compile.
Complete test case (ready to `cargo run` or `cargo run --release`): testcase.tar.gz
Great writeup. Thanks!
Brian Anderson at 2017-01-27 23:13:01
Parts of timings for non-optimised build:

```
time: 0.000; rss: 52MB    parsing
time: 0.693; rss: 159MB   expansion
time: 3.174; rss: 233MB   item-bodies checking
time: 0.164; rss: 244MB   const checking
time: 0.181; rss: 263MB   MIR dump
time: 0.453; rss: 263MB   borrow checking
time: 2.685; rss: 622MB   translation
time: 9.955; rss: 533MB   codegen passes [0]
time: 10.560; rss: 533MB  LLVM passes
time: 3.200; rss: 144MB   linking
    Finished dev [unoptimized + debuginfo] target(s) in 22.49 secs
```

So the guess wrt parsing is wrong. Timings for the optimised build are pretty much the same, with the exception of LLVM passes:
```
time: 2.194; rss: 549MB   llvm function passes [0]
time: ∞; rss: ∞           llvm module passes [0]
time: ∞; rss: ∞           codegen passes [0]
                          codegen passes [0]
```

~~so this is basically yet another super-linear case in the Super-Linear Optimisation Engine.~~
~~Notable differences in IR between clang (takes no time at all) and rustc are the linkage type of string constants: clang specifies `private unnamed_addr` whereas rustc specifies `internal`.~~

The main function has 250k instances of

```
%749955 = getelementptr inbounds [250000 x %str_slice], [250000 x %str_slice]* %words, i32 0, i32 249985
%749956 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 0
store i8* getelementptr inbounds ([6 x i8], [6 x i8]* @str.1321, i32 0, i32 0), i8** %749956
%749957 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 1
store i64 6, i64* %749957
```

generated to produce the on-stack array, whereas clang produces a constant and `memcpy`s it instead. That is almost guaranteed to be the reason here.

Simonas Kazlauskas at 2017-01-27 23:48:28
Potentially fixed by having rvalue static promotion implemented.
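In the same spirit as static promotion, a workaround is available today (a hypothetical sketch, not the proposed compiler change): declaring the array as a `static` places the data in the binary's read-only section, so the compiler never has to emit per-element stack stores.

```rust
// Sketch: a `static` array lives in the binary's data section, so
// iterating over it does not require building the array on the stack
// element by element. (Shortened to 4 elements for illustration.)
static WORDS: [&str; 4] = ["1", "2", "3", "4"];

fn main() {
    let mut l = 0;
    for word in WORDS.iter() {
        l += word.len();
    }
    println!("{}", l); // prints "4"
}
```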
Simonas Kazlauskas at 2017-01-28 17:07:43
So, even if we fixed this particular case of a completely static array, there’s still the same issue when the array is not exactly fully static:

```
[ ..., "100000", some_var, "100001", ... ]
```

I constructed a program like the above in C and made clang compile it without optimisations. It still hasn’t finished.
EDIT: it has finished (keep in mind, this is not optimised build):
```
---User Time---     --System Time--   --User+System--     ---Wall Time---     --- Name ---
2190.9531 ( 99.9%)  0.3600 ( 77.1%)   2191.3131 ( 99.9%)  2192.0474 ( 99.9%)  X86 DAG->DAG Instruction Selection
```

Which sounds like right about where I’d expect the time to be spent.
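The mixed-constness shape above can be reproduced in Rust as well (a hypothetical sketch; `some_var` stands in for any value only known at runtime):

```rust
// Sketch: one runtime-computed element is enough to prevent the whole
// array from being promoted to constant data, so the compiler again
// has to emit per-element initialization code.
fn main() {
    let some_var = format!("{}", 42); // not a compile-time constant
    let words = ["100000", some_var.as_str(), "100001"];
    let l: usize = words.iter().map(|w| w.len()).sum();
    println!("{}", l); // 6 + 2 + 6 = 14
}
```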
Simonas Kazlauskas at 2017-01-28 19:12:15
@nagisa Interesting. I can reproduce that result with clang with or without optimization, but gcc has no problem with such a program when not optimizing. When optimizing, it too doesn't show any signs of finishing.
The test code I used, for reference:
```c
#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char *array[] = {
#include "num_array.h"
        argv[0],
    };
    size_t sum = 0;
    for (int i = 0; i < sizeof(array)/sizeof(*array); i++) {
        sum += strlen(array[i]);
    }
    printf("%zu\n", sum);
    return 0;
}
```

To generate `num_array.h`, use `seq 1 250000 | sed 's/.*/"&",/' > num_array.h`.

Josh Triplett at 2017-01-28 19:40:19
I reported this test case as a GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79266
Josh Triplett at 2017-01-28 19:51:24
I would just point out that the embedded LLVM version is missing two commits (present in 3.9.1) that could be related to this bug:
- https://github.com/llvm-mirror/llvm/commit/7e79c454db01898586ecb6444ec8bb4ee19dd0a0
- https://github.com/llvm-mirror/llvm/commit/e0af6c9240d5634766925c71a492ef48b5ca3538
Sebastien Marie at 2017-02-17 04:32:35
Cc @rust-lang/wg-compiler-performance
Anthony Ramine at 2018-04-02 11:53:35
Still an issue with current compilers:
https://internals.rust-lang.org/t/x-post-from-users-forum-long-compile-times-for-a-vector-with-50k-u64-values/11472
https://users.rust-lang.org/t/long-compile-times-for-a-vector-with-50k-u64-values/35625
Josh Triplett at 2019-12-12 10:08:39
For people coming here with a similar problem, looking for a workaround: I've noticed that `include_bytes!` does not slow down compiles as much.
I had a case where I needed to include 100,000 strings in a program. Initially I made a static array of them, but this was very slow to compile. I changed it to use two `include_bytes!` invocations instead: one with an array of binary u32 (offset, length)-pairs, and one with the actual (UTF-8-encoded) texts.
This compiled very quickly.
Edit: `include_bytes!` is apparently not super-efficient either, but not as bad as a large array (see https://github.com/rust-lang/rust/issues/65818).
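A minimal sketch of that two-buffer layout (the buffer contents are inlined here; in the real program they would come from `include_bytes!`, and the little-endian u32 encoding is an assumption):

```rust
// Sketch: look up string number `i` using an index buffer of
// little-endian (offset, length) u32 pairs and a concatenated text buffer.
fn get_str<'a>(index: &[u8], texts: &'a [u8], i: usize) -> &'a str {
    let base = i * 8; // each entry is two u32s = 8 bytes
    let off = u32::from_le_bytes(index[base..base + 4].try_into().unwrap()) as usize;
    let len = u32::from_le_bytes(index[base + 4..base + 8].try_into().unwrap()) as usize;
    std::str::from_utf8(&texts[off..off + len]).unwrap()
}

fn main() {
    // Two entries: "1" at offset 0, "22" at offset 1.
    let texts = b"122";
    let mut index = Vec::new();
    for (off, len) in [(0u32, 1u32), (1, 2)] {
        index.extend_from_slice(&off.to_le_bytes());
        index.extend_from_slice(&len.to_le_bytes());
    }
    assert_eq!(get_str(&index, texts, 0), "1");
    assert_eq!(get_str(&index, texts, 1), "22");
}
```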
Anders Musikka at 2020-05-30 09:48:03