
fmt: drop #[inline(never)] from pad_integral's write_prefix helper #156822

Draft
gilescope wants to merge 1 commit into rust-lang:main from gilescope:giles-fmt-pad-integral-drop-write-prefix-inline-never
@gilescope
Contributor

Summary

Formatter::pad_integral calls a nested write_prefix helper at three sites. The helper was marked #[inline(never)] in commit ed2157a (Feb 2019) explicitly "for smaller code size", replacing an earlier closure that was duplicated across the four match arms.

Seven years on, that rationale is inverted. Modern LLVM inlines the trivial body and constant-folds the common (sign = None, prefix = None) case (positive non-alternate Display) into a no-op. With the attribute in place, every pad_integral call pays for an unnecessary call/ret + parameter setup to a function that, in the hot path, does nothing.

This PR removes the single #[inline(never)] line.
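To make the change concrete, here is a minimal sketch of the two variants side by side. It is a simplified stand-in rather than the actual libcore source: the function names `write_prefix_outlined` and `write_prefix_inlinable` are hypothetical, and a `&mut dyn Write` target is assumed in place of the real `Formatter`. The only difference between the two is the attribute, so the output is identical by construction.

```rust
use std::fmt::{self, Write};

// Hypothetical stand-in for the helper as it is today (V0): the attribute
// forces a real call even when both arguments are None.
#[inline(never)]
fn write_prefix_outlined(
    out: &mut dyn Write,
    sign: Option<char>,
    prefix: Option<&str>,
) -> fmt::Result {
    if let Some(c) = sign {
        out.write_char(c)?;
    }
    if let Some(p) = prefix {
        out.write_str(p)?;
    }
    Ok(())
}

// Same body without the attribute (V1): the optimizer is free to inline it
// at each call site and drop the branches that cannot be taken there.
fn write_prefix_inlinable(
    out: &mut dyn Write,
    sign: Option<char>,
    prefix: Option<&str>,
) -> fmt::Result {
    if let Some(c) = sign {
        out.write_char(c)?;
    }
    if let Some(p) = prefix {
        out.write_str(p)?;
    }
    Ok(())
}

fn main() {
    let cases = [(None, None), (Some('-'), None), (Some('+'), Some("0x"))];
    for (sign, prefix) in cases {
        let mut v0 = String::new();
        let mut v1 = String::new();
        write_prefix_outlined(&mut v0, sign, prefix).unwrap();
        write_prefix_inlinable(&mut v1, sign, prefix).unwrap();
        // Behaviour is the same; only the generated code differs.
        assert_eq!(v0, v1);
        println!("{v0:?}");
    }
}
```

Whether the compiler actually inlines the second variant depends on the optimization level and its own heuristics; the sketch only shows that removing the attribute does not change observable behaviour.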

Bench results

./x bench library/coretests on aarch64-apple-darwin, median of 3 runs:

| bench | V0 (current) | V1 (this PR) | Δ |
|---|---|---|---|
| fmt::write_i32_hex | 54.97 ns | 38.31 ns | −30.3 % |
| fmt::write_i64_oct | 67.18 ns | 48.45 ns | −27.9 % |
| fmt::write_i16_hex | 50.23 ns | 37.39 ns | −25.6 % |
| fmt::write_i64_bin | 85.43 ns | 65.77 ns | −23.0 % |
| fmt::write_i8_bin | 52.75 ns | 43.13 ns | −18.2 % |
| fmt::write_i64_42 | 9.15 ns | 8.31 ns | −9.2 % |
| fmt::write_f64_42 | 9.42 ns | 8.57 ns | −9.0 % |
| fmt::write_i64_million | 12.45 ns | 11.82 ns | −5.1 % |
| fmt::write_u64_max | 36.87 ns | 37.30 ns | +1.2 % (noise) |
| fmt::write_f64_pi | 30.21 ns | 30.25 ns | +0.1 % |

Wins are largest on the radix benches (bin/hex/oct/exp), which call pad_integral four times per iteration with a mix of positive, negative, and alternate cases; every call saves the wasted helper invocation. No bench regressed beyond the noise band.
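For readers less familiar with the formatting flags, the sketch below shows how the positive, negative, and alternate cases look from the public `format!` side. It assumes, as the description above does, that integer formatting routes through `pad_integral`; the function name `render_examples` is made up for illustration.

```rust
// Each entry exercises a different (sign, prefix) combination.
fn render_examples() -> Vec<String> {
    vec![
        format!("{}", 42),     // no sign, no prefix: the common hot path
        format!("{:+}", 42),   // explicit '+' sign requested
        format!("{}", -42),    // '-' sign for a negative value
        format!("{:x}", 255),  // hex digits, no prefix
        format!("{:#x}", 255), // alternate flag adds the "0x" prefix
        format!("{:#b}", 5),   // alternate flag adds the "0b" prefix
        format!("{:#o}", 8),   // alternate flag adds the "0o" prefix
    ]
}

fn main() {
    for s in render_examples() {
        println!("{s}");
    }
}
```

Only the entries with a sign or the `#` flag have anything for the helper to write; the first and fourth entries pass through it with nothing to do.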

Size

libstd dylib on darwin:

| Metric | V0 | V1 | Δ |
|---|---|---|---|
| pad_integral symbol | 940 B | 1 116 B | +176 B (helper inlined at 3 sites) |
| write_prefix symbol | 124 B | 0 B (inlined away) | −124 B |
| Net functional code | 1 064 B | 1 116 B | +52 B |
| libstd __TEXT section | 835 584 B | 835 584 B | 0 (same page) |
| libstd dylib file | 1 300 864 B | 1 300 752 B | −112 B |

The dylib is actually slightly smaller: dropping the symbol-table entry for the standalone write_prefix more than offsets the 52 bytes of duplicated body.

Why this is safe to land

write_prefix is a nested fn, not generic. There's no per-T monomorphization to multiply across downstream binaries, so the size change is confined to libstd and the per-call overhead removal is confined to pad_integral. (We separately checked the per-type pattern by patching nightly's libstd and rebuilding rust-analyzer; for generic candidates like Arc::drop_slow, removing #[inline(never)] grows downstream binaries. This one doesn't have that risk.)
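The distinction between generic and non-generic helpers can be sketched as follows. Both functions are hypothetical illustrations (`debug_len` and `prefix_len` are not libstd items): the generic one is compiled once per concrete `T` used by a crate, while the non-generic one is compiled once regardless of who calls it.

```rust
use std::fmt::Debug;

// Generic: every distinct T instantiated downstream produces its own compiled
// copy, so inlining decisions here are multiplied across user binaries.
#[inline(never)]
fn debug_len<T: Debug>(value: &T) -> usize {
    format!("{value:?}").len()
}

// Non-generic, like write_prefix: one body, compiled once, so any size
// change from inlining stays with its callers in the same crate.
fn prefix_len(sign: Option<char>, prefix: Option<&str>) -> usize {
    sign.map_or(0, char::len_utf8) + prefix.map_or(0, str::len)
}

fn main() {
    // Three instantiations of debug_len: u8, &str, Vec<i32>.
    println!("{}", debug_len(&1u8));
    println!("{}", debug_len(&"ab"));
    println!("{}", debug_len(&vec![1, 2]));
    // A single copy of prefix_len serves every call.
    println!("{}", prefix_len(Some('-'), Some("0x")));
}
```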

Correctness

  • 68 + 43 fmt:: tests pass on the patched libstd.
  • No source semantics change — only the inlining attribute.

What V1 does structurally

The disassembled pad_integral body in V1 absorbs the helper's three branches inline:

0x758d8  cbz w25, 0x758e0   ; is_nonneg=false ? -> handle sign
0x758dc  cbz w28, 0x758f0   ; sign_plus=false ? -> skip sign write
0x758e0  ldr x8, [x19,#0x20]; load write_char vtable
0x758e8  blr x8             ; call write_char (only when sign present)
0x758f0  cbz x22, 0x75968   ; prefix=None ? -> skip prefix write
…                            ; call write_str (only when prefix present)

For the common positive non-alternate Display (sign = None, prefix = None), control flow takes two cbz branches and never executes a single instruction from the (former) write_prefix body.
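The branch structure in the listing can be modelled in Rust roughly as below. This is a sketch of the decision logic only, under the assumption that the sign and prefix are derived from three flags as the disassembly comments suggest; `select_sign_and_prefix` is a hypothetical name, not a function in libcore.

```rust
// Models which of the two writes would run for a given set of flags.
fn select_sign_and_prefix(
    is_nonnegative: bool,
    sign_plus: bool,
    alternate: bool,
    prefix: &str,
) -> (Option<char>, Option<&str>) {
    let sign = if !is_nonnegative {
        Some('-') // negative values always carry a sign
    } else if sign_plus {
        Some('+') // '+' only when explicitly requested
    } else {
        None
    };
    // The prefix is emitted only for the alternate ('#') form.
    let prefix = if alternate { Some(prefix) } else { None };
    (sign, prefix)
}

fn main() {
    // Common hot path: positive value, no '+', no '#'. Nothing to write.
    println!("{:?}", select_sign_and_prefix(true, false, false, "0x"));
    // Negative alternate hex: both writes happen.
    println!("{:?}", select_sign_and_prefix(false, false, true, "0x"));
}
```

When both results are `None`, neither `write_char` nor `write_str` needs to be called, which corresponds to the path through the branches that skips both calls.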

Test plan

  • ./x bench library/coretests shows the wins above
  • ./x test library/coretests fmt:: passes (68 + 43 tests)
  • CI green

`Formatter::pad_integral` calls a nested `write_prefix` helper at three
sites. The helper was a closure until Feb 2019, when commit ed2157a
("De-duplicate write_prefix lambda in pad_integral. For smaller code
size.") converted it to a `#[inline(never)]` nested fn to share the
duplicated body across the call sites.

Seven years later that rationale is inverted: modern LLVM happily inlines
the trivial body and constant-folds the common `(sign=None, prefix=None)`
case (positive non-alternate `Display`) into a no-op. The explicit
`#[inline(never)]` was forcing an unnecessary call/ret + parameter setup
on every `Formatter::pad_integral` invocation.

Measurements (aarch64-apple-darwin, stage-1 libstd, median of 3 runs):

  bench               V0 (current)  V1 (no attr)    delta
  fmt::write_i32_hex     54.97 ns      38.31 ns    -30.3 %
  fmt::write_i64_oct     67.18 ns      48.45 ns    -27.9 %
  fmt::write_i16_hex     50.23 ns      37.39 ns    -25.6 %
  fmt::write_i64_bin     85.43 ns      65.77 ns    -23.0 %
  fmt::write_i8_bin      52.75 ns      43.13 ns    -18.2 %
  fmt::write_i64_42       9.15 ns       8.31 ns     -9.2 %
  fmt::write_f64_42       9.42 ns       8.57 ns     -9.0 %
  fmt::write_i64_million 12.45 ns      11.82 ns     -5.1 %
  fmt::write_u64_max     36.87 ns      37.30 ns     +1.2 %  (noise)
  fmt::write_f64_pi      30.21 ns      30.25 ns     +0.1 %

Wins are largest on radix benches (bin/hex/oct/exp) which call
`pad_integral` four times per iteration mixing positive/negative/alternate
cases - every call saves the wasted helper invocation.

Size, libstd dylib on darwin:

  pad_integral             940 B  ->  1116 B   (+176 B inlined at 3 sites)
  write_prefix standalone  124 B  ->     0 B   (-124 B, inlined away)
  Net functional code     1064 B  ->  1116 B   (+52 B)
  libstd __TEXT         835 584 B  -> 835 584 B (0; same page)
  libstd dylib file   1 300 864 B  -> 1 300 752 B (-112 B, fewer symbols)

The dylib is slightly smaller with the attribute removed because the
symbol-table entry for the standalone write_prefix is gone, more than
offsetting the +52 B of duplicated body across the three call sites.

write_prefix is a nested fn, not generic. There is no per-T
monomorphization to multiply across downstream binaries - the savings
are local to libstd.

68 + 43 fmt:: tests pass on the patched libstd.
@rustbot added the S-waiting-on-author (awaiting some action from the author) and T-libs (relevant to the library team) labels on May 22, 2026.