Use the system's page size instead of a hard-coded constant for file I/O?

Opened by Felix Schütt at 2023-01-26 20:38:23

Hi, I have a question regarding the following constant that is used throughout the BufReader:

https://github.com/rust-lang/rust/blob/6ccfe68076abc78392ab9e1d81b5c1a2123af657/src/libstd/sys_common/io.rs#L10

Shouldn't this constant be the system's page size (determined at runtime) instead of a hard-coded 8 KB, for better I/O performance? A hard-coded size means the allocator may split the buffer across pages that should really be contiguous. For example, if my page size were 16 KB, the allocator could place the 8 KB BufReader buffer so that it straddles two pages, whereas a 16 KB buffer could map 1:1 to a single memory page. Why is the BufReader size hard-coded to 8 KB?
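One thing worth noting: callers who want a page-sized (or otherwise tuned) buffer are not stuck with the default, since `BufReader::with_capacity` accepts an explicit size. A minimal sketch, assuming a 16 KiB page size (in practice this would be queried at runtime, e.g. via `sysconf(_SC_PAGESIZE)` through the `libc` crate):

```rust
use std::io::BufReader;

fn main() {
    // Assumption: a 16 KiB page size; a real program would query this at
    // runtime rather than hard-coding it.
    let page_size = 16 * 1024;

    // BufReader's default 8 KiB buffer can be overridden per reader.
    // Any `Read` impl works; a byte slice keeps the sketch self-contained.
    let data: &[u8] = b"hello world";
    let reader = BufReader::with_capacity(page_size, data);
    assert_eq!(reader.capacity(), page_size);
}
```

This sidesteps the question of the default, but it does mean the 8 KB constant is only a starting point, not a hard limit.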

  1. What issues arise when an allocation is split across a page boundary? And what other libraries use a runtime-determined default buffer size?

    jemalloc will page-align allocations that large, but I'm not sure about libc's malloc.

    Steven Fackler at 2017-10-28 11:06:38

  2. There are better metrics by which to size buffers, such as the sizes of the various CPU cache levels. If you want the buffer to stay entirely in L1, it should be no larger than half the size of the L1 cache. Zen, for example, has a per-core L1 data cache of 32 KiB, so the default buffer size of 8192 bytes is fine, but some older architectures have smaller CPU caches.

    I fail to see how straddling page boundaries is an issue, given that the CPU cache doesn't care about page boundaries. If the buffer does straddle a page boundary, the unused space before and after the allocation can still hold other allocations, so no space is really wasted.

    Peter Atashian at 2017-10-28 18:21:35
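The half-of-L1 rule of thumb above is simple arithmetic; a sketch, assuming Zen's 32 KiB per-core L1d as stated in the comment:

```rust
fn main() {
    // Assumption: 32 KiB per-core L1 data cache (Zen), and the rule of
    // thumb that the buffer should take at most half of L1, leaving the
    // other half for the data being processed alongside it.
    let l1d_bytes = 32 * 1024;
    let max_buf = l1d_bytes / 2; // 16 KiB
    assert!(8192 <= max_buf);    // the default 8 KiB buffer fits
}
```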

  3. Cache sizes and page-table effects are not the only performance concerns. When buffering I/O around OS primitives (files, sockets, etc.), syscall overhead tends to be more significant than cache misses. For example, GNU cp uses 128 KiB as its minimum buffer size for this reason.

    the8472 at 2020-10-03 17:28:20
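The syscall-overhead argument can be made concrete by counting the `read(2)` calls a sequential read needs (a sketch that ignores short reads):

```rust
// Sketch: ceiling-division count of read(2) calls needed to consume a file
// of a given size, ignoring short reads. Smaller buffers mean more syscalls.
fn read_calls(file_size: u64, buf_size: u64) -> u64 {
    (file_size + buf_size - 1) / buf_size
}

fn main() {
    let file = 1 << 20; // a 1 MiB file
    assert_eq!(read_calls(file, 8 * 1024), 128); // default 8 KiB buffer
    assert_eq!(read_calls(file, 128 * 1024), 8); // cp-style 128 KiB buffer
}
```

A 16x larger buffer cuts the syscall count by 16x, which is why buffer-size tuning often dwarfs cache or page-alignment effects for sequential I/O.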

  4. For example, GNU cp uses 128 KiB as its minimum buffer size for this reason.

    I think 128 KiB is unsuitable for any OS without overcommit.

    AngelicosPhosphoros at 2021-12-20 21:21:43

  5. Rustup improved decompression performance by reading input with 8 MB buffers.

    The underlying devices (spinning disks, network file systems, NVMe/SSDs) have varying trade-offs, but can, for instance, handle large numbers of simultaneously dispatched requests, which mitigates latency: rather than a CPU -> bus -> device -> response round trip for each call, the OS will split the request and dispatch up to the maximum queue depth in a loop, then read one, dispatch one, and so on until the full set is completed. This lets the underlying device (network, disk, whatever) perform the requests concurrently without depending on readahead heuristics.

    Getting to full zero-copy semantics is even better, and doing that will require page alignment.

    Robert Collins at 2022-05-08 08:55:53
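A page-aligned buffer of the kind zero-copy paths want can be requested directly from the allocator; a sketch assuming a 4 KiB page size, using `std::alloc::Layout` to force the alignment:

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Assumption: a 4 KiB page size. Layout::from_size_align guarantees the
    // returned pointer is aligned to `page`, so the buffer starts exactly on
    // a page boundary -- a prerequisite for many zero-copy / O_DIRECT paths.
    let page = 4096;
    let layout = Layout::from_size_align(64 * 1024, page).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    assert_eq!(ptr as usize % page, 0); // starts on a page boundary
    unsafe { dealloc(ptr, layout) };
}
```

BufReader's internal buffer is an ordinary `Vec`-backed allocation with no such alignment guarantee, so a zero-copy design would need this kind of explicit layout control.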

  6. (I don't think this is a libs-api concern, since it doesn't observably change how this functions, aside from performance.)

    Thom Chiovoloni at 2023-01-26 20:38:23