Collect and aggregate timing statistics of CI builders
Currently we only have a little bit of insight into how long all our CI builds are taking with a breakdown of where all that time is going. It's hoped that if we have a historical record of the duration of CI builds which is relatively consistent and granular we can more easily see regressions that happen over time and diagnose otherwise obvious-in-retrospect problems with the build.
The current thinking for how this might be done is to instrument rustbuild to, by default on CI, emit timing information about each step in a structured fashion. The thinking is we could then scrape the logs of all our CI builds to aggregate this structure data. The data needs to include enough information so that it can indeed be easily aggregated. For example we shouldn't have to look at the Travis environment to guess whether it's an OSX/Linux build or which architecture is being assembled.
Once we've got this data we can hopefully set up graphs over time like https://rust-lang-nursery.github.io/rustc-pr-tracking/ to see what time is going where. CI timings are very noisy but the hope is that general trends will still be relatively clear.
cc @rust-lang/infra
Alex Crichton at 2018-03-07 20:40:32
my two cents: there is http://scribu.net/travis-stats/#rust-lang/rust/auto which can uncover recent commits that blew up general runtime of a travis build. Another similar approach is http://buildtimetrend.github.io/ , however there has not been a lot of activity in their repos recently and I am not sure how up-to-date it is (haven't tried getting it to work yet).
Matthias Krüger at 2018-03-07 20:59:18
Indeed! Unfortunately though those timings are very coarse in that they generally only cover the entire build itself. The variance of time taken from build to build is huge so I'd particularly be interested in a step-by-step breakdown which we can hopefully track much more reliably over time. I'm also curious in things like 10 minute regressions over 1 hour regression which will be much more difficult to diagnose in looking at graphs of overall build times.
Alex Crichton at 2018-03-07 22:23:56
We need to know the time taken to compile each crate (have each
bootstrap/bin/rustcinvocation write out its start-to-finish duration). Otherwise just knowing "compiling stage1-rustc takes 30 minutes" is not really useful. Judging from the stamp output only will always lead us to wrongly blamesyntax_ext.kennytm at 2018-03-08 19:57:51
cc https://github.com/rust-lang/cargo/issues/5046
Matthias Krüger at 2018-03-08 20:02:29
Unfortunately this needs to apply to the bootstrap cargo, meaning we can't benefit from it until the next 6-week cycle. Modifying
bootstrap/bin/rustc.rscan be done immediately, however.kennytm at 2018-03-08 20:06:50
Are you thinking something like this? https://github.com/rust-lang/rust/pull/48867 (Direct travis log link)
I made some stats tables for that run as an experiment:
<details> <summary>Crates we compile without anything else compiling at the same time (largely rustc, unsurprisingly)</summary>Crate | Serial Time (s) | Blocking ----- | --------------- | -------- stage1-rustc | 660.93 | stage1-ieee stage1-num_cpus stage1-ppc stage0-linkchecker stage1-cc stage1-rustc_apfloat stage1-rustc_typeck stage1-rustc_incremental stage1-rustc_resolve stage1-tempdir stage1-rustc_metadata stage0-rustc | 559.42 | stage0-rustc_resolve stage0-num_cpus stage0-tempdir stage0-rustc_mir stage0-rustc_metadata stage0-rustc_allocator stage0-cc stage1-syntax | 222.75 | stage1-syntax_ext stage2-error_index_generator stage1-proc_macro stage1-rustc_back stage1-rustc_const_math stage1-rustc_mir | 167.09 | stage1-rustc_borrowck stage1-rustc_lint stage1-rustc_passes stage1-rustc_trans_utils stage1-std | 142.18 | stage1-term stage1-getopts stage1-std_unicode_tests stage2-rustdoc | 139.29 | stage2-rustdoc_tool_binary stage0-syntax | 135.36 | stage0-rustc_const_math stage0-rustc_back stage0-proc_macro stage1-rustc_driver | 134.24 | stage1-rustc stage1-rustc_errors stage1-core | 108.36 | stage1-compiler_builtins stage1-std_unicode stage1-libc stage0-rustc_driver | 81.31 | stage0-rustc stage0-rustc_trans | 77.67 | stage1-core stage1-rustc_trans | 77.35 | stage1-coretests | 76.48 | stage1-std stage0-rustc_mir | 55.66 | stage0-rustc_trans_utils stage0-rustc_borrowck stage0-rustc_passes stage0-rustc_lint stage0-mdbook | 51.92 | stage0-rustbook stage1-collectionstests | 44.72 | stage1-coretests stage1-serialize | 31.29 | stage1-rustc_data_structures stage1-syntax (others) | 423.52 |
</details> <details> <summary>Overall, the stage 1 compiler was 1.48 ×⁄÷ 1.62 times slower than stage0 (geometric mean and stdev)</summary>Crate | Time Ratio ----- | ---------- rustc_back | 2.500 rustc_incremental | 1.076 syntax_ext | 1.045 std | 6.229 rustc_resolve | 0.864 rustc_save_analysis | 1.029 rustc_metadata | 0.970 rustc_typeck | 1.060 rustc_mir | 1.743 test | 1.820 rustc | 1.170 rustc_errors | 1.121 alloc | 3.814 core | 3.628 syntax | 1.646 rustc_driver | 1.651 rustc_trans | 0.996 serialize | 9.686 (others) | (that took less than 10 seconds each)
</details>scottmcm at 2018-03-10 07:44:57
Whoa that's awesome @scottmcm!
Out of curiosity, how is the "time ratio" calculated? The serialize crate seems to have a very high one! (I'm also quite surprised that libserialize takes 30s to compile...)
Alex Crichton at 2018-03-12 14:38:46
Very hacky code used to produce the numbers: https://gist.github.com/scottmcm/241223e4964279a71a41d9846384cb87
Hmm, it does look like it's picking up more things than intended for the ratio. For example, the big reason that
stdshows as slower is because it's also picking up the time for its test build (https://travis-ci.org/rust-lang/rust/builds/351202994#L3260-L3261), and the same forserialize(https://travis-ci.org/rust-lang/rust/builds/351202994#L3884-L3885). I guess the logging and/or processing needs to account for that.The overall ratio is a time-weighted geometric mean of the crate ratios.
The other thing I did is split time intervals between all the processes running in the slice (https://gist.github.com/scottmcm/241223e4964279a71a41d9846384cb87#file-timing-rs-L39) in an attempt to not penalize things that are running in parallel with others.
scottmcm at 2018-03-12 21:02:48
I've made a PR for this at https://github.com/rust-lang/rust/pull/49094
I'm hoping with that we can adapt @scottmcm's script and perhaps even get a daily build up and running displaying this information!
Alex Crichton at 2018-03-16 19:31:40
What's the current status of this issue? I see https://github.com/rust-lang/rust/pull/49094 got merged - is that enough? Was the plan to also show that in a dashboard somewhere?
cc @pietroalbini , I think you've done a lot of work on CI metrics.
jyn at 2023-02-03 07:46:31
There was a dashboard at https://alexcrichton.github.io/rust-ci-timing-tracker/, but it broke many years ago. I think it was helpful for tracking unexpected increases in CI time.
Mark has a project at https://github.com/Mark-Simulacrum/rustc-ci-timing for graphing CI time, but it has to be run manually. It would be nice to make that more accessible.
Eric Huss at 2023-02-03 14:36:52