Disable inlining of custom io::Error destructor#149146
Conversation
|
I don't think preventing inlining of the whole function is a good idea. For cases where the compiler knows the error variant, inlining allows removing the entire destructor. I think it'd be better to prevent inlining the destructor of the custom variant, which is the only one that actually needs destruction. |
Those are highly unlikely to occur on hot paths in real programs: the construction point of the error would have to get inlined into the destruction point. It's most certainly not worth inlining the branches (which still worsen the code size, albeit not as much as inlining the destructor) into every |
|
We can at least Is only removing |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Disable inlining of packed `io::Error` destructor
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (07c4fb2): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary 1.3%, secondary -2.7%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary 0.2%, secondary -2.5%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.3%, secondary -1.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 472.873s -> 472.231s (-0.14%) |
| // meanwhile, is a code size increase by a factor of up to 5.4 in the case | ||
| // of dropping multiple io::Results in the same function | ||
| // (https://godbolt.org/z/8hfGchjsT). | ||
| #[inline(never)] |
There was a problem hiding this comment.
Pondering: why is this the correct one to inline(never)? What if we made decode_repr be inline(never) and allowed inlining the destructor as that turns into just a call to decode_repr?
Said otherwise, if this drop is a problem, wouldn't data and data_mut and such also not want this inlining?
There was a problem hiding this comment.
decode_repr is a generic function in which C can be one of &Custom, &mut Custom, or Box<Custom>. In the case of the references, which are instantiated by data and data_mut, no types with destructors are involved and there are no destructors to shuffle around. In the Box<Custom> case, which is what's called by the destructor of Repr, decode_repr returns ownership of that which needs its destructor called and never calls any destructors itself, meaning that putting #[inline(never)] on it and removing it from Repr::drop would just make the destructor get inlined again (except the code bloat would be even worse because the branching would get duplicated).
There was a problem hiding this comment.
The ideal thing would be to ensure ErrorData's drop is outlined but only when it has a nontrivial drop, right? I was thinking this could be done with something like:
trait OutlineDrop { const SHOULD_OUTLINE: bool = false; }
impl<T> OutlineDrop for &T {}
impl<T> OutlineDrop for &mut T {}
impl OutlineDrop for Box<Custom> { const SHOULD_OUTLINE: bool = true; }
impl<C: MaybeOutlineDrop> Drop for ErrorData<C> {
#[inline(always)]
fn drop(&mut self) {
#[inline(never)]
fn drop_slow<C: MaybeOutlineDrop>(this: &mut ErrorData<C>) {
// Insert a variant with no drop needed, then drop the custom variant
// within this outlined function.
let mut dst = ErrorData::Os(0);
mem::swap(this, &mut dst);
drop(dst);
}
// Only call the outlined drop
if C::SHOULD_OUTLINE && matches!(self, ErrorData::Custom(_)) {
drop_slow(self);
}
}
}But from a quick check it seems like the matches! check is outlined, despite the #[inline(always)], which completely defeats the purpose.
The current optimization is kind of weird for this too. Looking at your godbolt link, it seems like it decides to inline the first drop and outline the second drop? Interesting heuristics.
There was a problem hiding this comment.
I do think it's worth double checking whether there is an option to get this behavior in more places the nontrivial drop may happen, as Scott mentioned. But if there doesn't seem to be a good way, I don't see any reason not to make this change given the perf results.
What is the result without any #[inline]/#[inline(never)] btw? Seems like that could give it some flexibility.
|
A note on the code size figures: the only benchmarks that regressed are rlibs (probably code that would otherwise be instantiated downstream being moved to the rlibs themselves, in actuality likely reducing the code size after linking), and the executables are wholly in the green. |
|
Would you be able to follow up with the questions at #149146 (comment)? @rustbot author |
|
Reminder, once the PR becomes ready for a review, use |
This comment has been minimized.
This comment has been minimized.
|
@rust-timer queue |
This comment has been minimized.
This comment has been minimized.
|
@tgross35 I don't actually have a proper setup for building and benchmarking standard library changes, so I'll need someone with appropriate permissions to invoke rust-timer to test how it performs with neither |
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Disable inlining of packed `io::Error` destructor
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Disable inlining of custom `io::Error` destructor
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (e7abc97): comparison URL. Overall result: ❌✅ regressions and improvements - please read:Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf. Next, please: If you can, justify the regressions found in this try perf run in writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -0.7%, secondary -2.2%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary 2.6%, secondary -1.7%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary 0.0%, secondary -0.3%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 513.181s -> 510.613s (-0.50%) |
|
I've eyeballed the standard library from the try build and it seems like the culprit of the dismal results is that |
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
|
With me having finally set up |
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Disable inlining of custom `io::Error` destructor
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (719220c): comparison URL. Overall result: ❌✅ regressions and improvements - please read:Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf. Next, please: If you can, justify the regressions found in this try perf run in writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -7.0%, secondary -1.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary -1.2%, secondary -7.6%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.3%, secondary -4.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 510.282s -> 511.986s (0.33%) |
|
I've made the amazing discovery that the most recent version of the compiler is actually doing the thing this PR originally proposed: Still, I can't reproduce the ripgrep size regression on my machine, and it's actually a huge improvement: Also worth mentioning: I've eyeballed some standard library disassembly and the With all that being said, I think the PR is good to land. |



View all comments
Inlining the destructor of
ErrorData(which currently gets inlined through the destructor of the packedRepr) is in no way helpful in real programs, as the source of the error will not be inlined, so there will not be any match assumptions to gain. The cost, meanwhile, is a code size increase by a factor of up to 5.4 in the case of dropping multipleio::Results in the same function. Accordingly, this disables the inlining to avoid unhelpful code bloat inopt-level = 3programs.The destructor of
ErrorDataon 32-bit platforms might be suffering from the same problem, but fixing that would require some sort of annotation that puts#[inline(never)]on the compiler-generated part of the destructor.