fmt: drop #[inline(never)] from pad_integral's write_prefix helper #156822
Draft
gilescope wants to merge 1 commit into
`Formatter::pad_integral` calls a nested `write_prefix` helper at three sites. The helper was a closure until Feb 2019, when commit ed2157a ("De-duplicate write_prefix lambda in pad_integral. For smaller code size.") converted it to a `#[inline(never)]` nested fn to share the duplicated body across the call sites.

Seven years later that rationale is inverted: modern LLVM happily inlines the trivial body and constant-folds the common `(sign=None, prefix=None)` case (positive non-alternate `Display`) into a no-op. The explicit `#[inline(never)]` was forcing an unnecessary call/ret + parameter setup on every `Formatter::pad_integral` invocation.

Measurements (aarch64-apple-darwin, stage-1 libstd, median of 3 runs):

| bench | V0 (current) | V1 (no attr) | delta |
|---|---|---|---|
| `fmt::write_i32_hex` | 54.97 ns | 38.31 ns | -30.3 % |
| `fmt::write_i64_oct` | 67.18 ns | 48.45 ns | -27.9 % |
| `fmt::write_i16_hex` | 50.23 ns | 37.39 ns | -25.6 % |
| `fmt::write_i64_bin` | 85.43 ns | 65.77 ns | -23.0 % |
| `fmt::write_i8_bin` | 52.75 ns | 43.13 ns | -18.2 % |
| `fmt::write_i64_42` | 9.15 ns | 8.31 ns | -9.2 % |
| `fmt::write_f64_42` | 9.42 ns | 8.57 ns | -9.0 % |
| `fmt::write_i64_million` | 12.45 ns | 11.82 ns | -5.1 % |
| `fmt::write_u64_max` | 36.87 ns | 37.30 ns | +1.2 % (noise) |
| `fmt::write_f64_pi` | 30.21 ns | 30.25 ns | +0.1 % |

Wins are largest on radix benches (bin/hex/oct/exp), which call `pad_integral` four times per iteration mixing positive/negative/alternate cases; every call saves the wasted helper invocation.

Size, libstd dylib on darwin:

| item | V0 | V1 | delta |
|---|---|---|---|
| `pad_integral` | 940 B | 1116 B | +176 B (inlined at 3 sites) |
| `write_prefix` standalone | 124 B | 0 B | -124 B (inlined away) |
| Net functional code | 1064 B | 1116 B | +52 B |
| libstd `__TEXT` | 835 584 B | 835 584 B | 0 (same page) |
| libstd dylib file | 1 300 864 B | 1 300 752 B | -112 B (fewer symbols) |

The dylib is slightly smaller with the attribute removed because the symbol-table entry for the standalone `write_prefix` is gone, more than offsetting the +52 B of duplicated body across the three call sites.

`write_prefix` is a nested fn, not generic. There is no per-`T` monomorphization to multiply across downstream binaries; the savings are local to libstd.

68 + 43 `fmt::` tests pass on the patched libstd.
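To make the shape of the change easier to picture, here is a minimal sketch of a helper like `write_prefix` and a cut-down caller with three call sites. It is not the libcore source: `pad_integral_sketch`, `render`, and their parameters are hypothetical names chosen for illustration, and padding is reduced to the bare minimum. The point it shows is that with `sign = None` and `prefix = None` the helper writes nothing, which is the case an inlining optimizer may be able to fold away.

```rust
use std::fmt::{self, Write};

// Simplified stand-in for the nested helper: writes the sign (if any), then
// the radix prefix (if requested). With (None, None) it writes nothing.
// The attribute under discussion would sit on this fn as `#[inline(never)]`.
fn write_prefix(out: &mut dyn Write, sign: Option<char>, prefix: Option<&str>) -> fmt::Result {
    if let Some(c) = sign {
        out.write_char(c)?;
    }
    if let Some(p) = prefix {
        out.write_str(p)?;
    }
    Ok(())
}

// Hypothetical, cut-down analogue of `pad_integral`. `digits` is the already
// formatted magnitude; a `min_width` of 0 means no width was requested.
fn pad_integral_sketch(
    out: &mut dyn Write,
    is_nonnegative: bool,
    alternate: bool,
    zero_pad: bool,
    min_width: usize,
    prefix: &str,
    digits: &str,
) -> fmt::Result {
    let sign = if is_nonnegative { None } else { Some('-') };
    let prefix = if alternate { Some(prefix) } else { None };
    let width = digits.len() + sign.map_or(0, |_| 1) + prefix.map_or(0, |p| p.len());

    if width >= min_width {
        // Call site 1: nothing to pad.
        write_prefix(out, sign, prefix)?;
        out.write_str(digits)
    } else if zero_pad {
        // Call site 2: sign and prefix go before the zeros.
        write_prefix(out, sign, prefix)?;
        for _ in width..min_width {
            out.write_char('0')?;
        }
        out.write_str(digits)
    } else {
        // Call site 3: pad with spaces first, then sign/prefix, then digits.
        for _ in width..min_width {
            out.write_char(' ')?;
        }
        write_prefix(out, sign, prefix)?;
        out.write_str(digits)
    }
}

// Convenience wrapper that collects the output into a String.
fn render(
    is_nonnegative: bool,
    alternate: bool,
    zero_pad: bool,
    min_width: usize,
    prefix: &str,
    digits: &str,
) -> String {
    let mut s = String::new();
    pad_integral_sketch(&mut s, is_nonnegative, alternate, zero_pad, min_width, prefix, digits)
        .expect("writing to a String does not fail");
    s
}
```

For example, `render(true, false, false, 0, "0x", "2a")` goes through call site 1 with `(None, None)`, so the helper contributes no output at all; whether the compiler actually elides the call in the real `pad_integral` is what the measurements above are about.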
ajasad25
approved these changes
May 22, 2026

Summary
`Formatter::pad_integral` calls a nested `write_prefix` helper at three sites. The helper was marked `#[inline(never)]` in commit ed2157a (Feb 2019) explicitly "for smaller code size", replacing an earlier closure that was duplicated across the four match arms.

Seven years on, that rationale is inverted. Modern LLVM inlines the trivial body and constant-folds the common `(sign = None, prefix = None)` case (positive non-alternate `Display`) into a no-op. With the attribute in place, every `pad_integral` call pays for an unnecessary `call`/`ret` + parameter setup to a function that, in the hot path, does nothing.

This PR removes the single `#[inline(never)]` line.

Bench results
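`./x bench library/coretests` on aarch64-apple-darwin, median of 3 runs:

| bench | V0 (current) | V1 (no attr) | delta |
|---|---|---|---|
| `fmt::write_i32_hex` | 54.97 ns | 38.31 ns | -30.3 % |
| `fmt::write_i64_oct` | 67.18 ns | 48.45 ns | -27.9 % |
| `fmt::write_i16_hex` | 50.23 ns | 37.39 ns | -25.6 % |
| `fmt::write_i64_bin` | 85.43 ns | 65.77 ns | -23.0 % |
| `fmt::write_i8_bin` | 52.75 ns | 43.13 ns | -18.2 % |
| `fmt::write_i64_42` | 9.15 ns | 8.31 ns | -9.2 % |
| `fmt::write_f64_42` | 9.42 ns | 8.57 ns | -9.0 % |
| `fmt::write_i64_million` | 12.45 ns | 11.82 ns | -5.1 % |
| `fmt::write_u64_max` | 36.87 ns | 37.30 ns | +1.2 % (noise) |
| `fmt::write_f64_pi` | 30.21 ns | 30.25 ns | +0.1 % |

Wins are largest on radix benches (`bin`/`hex`/`oct`/`exp`), which call `pad_integral` four times per iter mixing positive/negative/alternate cases; every call saves the wasted helper invocation. No bench regressed beyond the noise band.

Size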
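libstd dylib on darwin:

| item | V0 | V1 | delta |
|---|---|---|---|
| `pad_integral` symbol | 940 B | 1116 B | +176 B (inlined at 3 sites) |
| `write_prefix` symbol | 124 B | 0 B | -124 B (inlined away) |
| `__TEXT` section | 835 584 B | 835 584 B | 0 (same page) |

The dylib file is actually slightly smaller (-112 B): dropping the symbol-table entry for the standalone `write_prefix` more than offsets the 52 bytes (+176 B - 124 B) of duplicated body.

Why this is safe to land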
`write_prefix` is a nested fn, not generic. There's no per-`T` monomorphization to multiply across downstream binaries; the savings are local to `libstd` and the per-call overhead removal is local to `pad_integral`. (We separately checked the per-type pattern by patching nightly's libstd and rebuilding rust-analyzer; for generic candidates like `Arc::drop_slow`, removing `#[inline(never)]` grows downstream binaries. This one doesn't have that risk.)

Correctness
`fmt::` tests pass on the patched libstd.

What V1 does structurally
The disassembled `pad_integral` body in V1 absorbs the helper's three branches inline.

For the common positive non-alternate `Display` (`sign = None, prefix = None`), control flow takes two `cbz` branches and never executes a single instruction from the (former) `write_prefix` body.

Test plan
- `./x bench library/coretests` shows the wins above
- `./x test library/coretests fmt::` passes (68 + 43 tests)
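As a quick sanity check alongside the test plan, the sign/prefix combinations discussed in this PR can be exercised from ordinary user code. The sketch below is not part of the PR; `samples` is a hypothetical helper, and the comments on which `(sign, prefix)` case each format spec corresponds to assume that integer formatting routes through `pad_integral` as described above. The output strings themselves are standard `format!` behaviour and should be identical before and after the change.

```rust
// Hypothetical helper: formats a few integers with specs that cover the
// positive, negative, explicit-plus, alternate, and padded cases.
fn samples() -> Vec<String> {
    vec![
        format!("{}", 42),         // assumed: sign = None, prefix = None (common case)
        format!("{}", -42),        // assumed: sign = Some('-')
        format!("{:+}", 42),       // assumed: sign = Some('+')
        format!("{:#x}", 255),     // alternate flag: "0x" prefix
        format!("{:#b}", 5),       // alternate flag: "0b" prefix
        format!("{:#o}", 8),       // alternate flag: "0o" prefix
        format!("{:05}", -42),     // sign-aware zero padding: sign before zeros
        format!("{:#010x}", 255),  // prefix before zeros
        format!("{:6}", 42),       // width with default right alignment for numbers
    ]
}
```

Comparing these strings on a stock toolchain and on the patched libstd is a cheap way to confirm that only the code shape changed, not the formatted output.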