Optimize `impl_read_unsigned_leb128` by nnethercote · Pull Request #92604 · rust...
source link: https://github.com/rust-lang/rust/pull/92604
Conversation
I see instruction count improvements of up to 3.5% locally with these changes, mostly on the smaller benchmarks.
Awaiting bors try build completion.
@rustbot label: +S-waiting-on-perf
Try build successful - checks-actions
Build commit: 7775b1f (7775b1f299a2e55b29c22ee1c04e4b31a08e34f8)
Finished benchmarking commit (7775b1f): comparison url.
Summary: This change led to moderate relevant mixed results in compiler performance.
- Moderate improvement in instruction counts (up to -2.2% on incr-unchanged builds of ucd)
- Moderate regression in instruction counts (up to 1.4% on incr-unchanged builds of ctfe-stress-4)
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged
along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.
@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression
// The first iteration of the loop is unpeeled because this code is
// hot and integers less than 128 are very common, occurring 80%+
// of the time.
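For context, a minimal sketch of the unpeeling pattern being referred to (illustrative only; the function name and buffer handling here are assumptions, not the exact rustc macro):

fn read_u64_leb128(slice: &[u8], position: &mut usize) -> u64 {
    // Unpeeled first iteration: values below 128 fit in a single byte,
    // so the common case returns without entering the loop at all.
    let byte = slice[*position];
    *position += 1;
    if (byte & 0x80) == 0 {
        return byte as u64;
    }
    let mut result = (byte & 0x7f) as u64;
    let mut shift = 7;
    loop {
        let byte = slice[*position];
        *position += 1;
        if (byte & 0x80) == 0 {
            result |= (byte as u64) << shift;
            return result;
        }
        result |= ((byte & 0x7f) as u64) << shift;
        shift += 7;
    }
}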
Is that true for all types? IIRC, that's very common for usize but much less so for u64, for example. Maybe it would make sense to have two implementations, one optimized for small values and another optimized for larger values?
Indeed, it could be a good idea to generalize the impl for various "categories" of numbers (often small, often large); it helped recently in hashing: #92103.
Good point! I did some follow-up investigation, and saw that this accounts for the regressions seen in some cases. For example, the ctfe-stress-4 check incr-unchanged regression seen on the CI run has this distribution of leb128 lengths for read u64 values:
2361790 counts:
( 1) 1311716 (55.5%, 55.5%): u64 1
( 2) 1049356 (44.4%,100.0%): u64 10
( 3) 574 ( 0.0%,100.0%): u64 9
( 4) 115 ( 0.0%,100.0%): u64 2
( 5) 20 ( 0.0%,100.0%): u64 4
( 6) 5 ( 0.0%,100.0%): u64 3
( 7) 3 ( 0.0%,100.0%): u64 8
( 8) 1 ( 0.0%,100.0%): u64 7
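For reference, unsigned LEB128 uses one byte per 7 bits of value, so the two big buckets above are values below 128 and values with the topmost bits set. A small sketch (the helper name is assumed, this is not rustc code) that computes the encoded length:

fn leb128_len(mut value: u64) -> usize {
    // One byte covers 7 bits; keep shifting until the value fits.
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

fn main() {
    assert_eq!(leb128_len(1), 1);          // the 55.5% bucket
    assert_eq!(leb128_len(u64::MAX), 10);  // the 44.4% bucket
}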
FWIW, I looked at reading/writing LEB128 recently. I tried two things:
- Removing bound checks (let byte = slice[*position];) from reading: didn't help.
- Removing the loop from writing and performing an explicit match on the number of bytes that should be written, like:
match value {
    0..=0x7F => leb_write(),
    0x80..=0x3FFF => { leb_write(); leb_write() }
}
etc. It helped for some benchmarks, but in the end it was mostly a regression, probably because the code was much larger. But I think that if a good compromise on the number of branches were found, it could be an improvement, because the current loop is quite tight and the repeated increment can slow it down.
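A fuller, runnable sketch of that loop-free write idea (the function name and byte ranges are assumptions, not the code that was actually benchmarked):

fn write_u32_leb128(out: &mut Vec<u8>, value: u32) {
    // Pick the number of output bytes up front by matching on the
    // value's magnitude, then emit exactly that many bytes.
    match value {
        0..=0x7f => out.push(value as u8),
        0x80..=0x3fff => {
            out.push((value as u8 & 0x7f) | 0x80);
            out.push((value >> 7) as u8);
        }
        _ => {
            // Fall back to the usual loop for longer encodings.
            let mut v = value;
            while v >= 0x80 {
                out.push((v as u8 & 0x7f) | 0x80);
                v >>= 7;
            }
            out.push(v as u8);
        }
    }
}

Whether this pays off depends on how the extra up-front branches trade against the larger code size, which matches the mixed results described above.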
I have split out the second commit ("Modify the buffer position directly when reading leb128 values") into its own PR (#92631) because it's a straightforward win for all types. Once that is merged I'll return here and work through the more subtle changes that are type-dependent. Thanks for the feedback.
force-pushed the optimize-impl_read_unsigned_leb128 branch from a18ddee to facba24 13 days ago
I did some more investigation.
- Modifying the buffer position directly is a universal win, as #92631 shows (a rough before/after sketch follows this comment).
- Unpeeling the first iteration of the loop is also a universal win, even for u64 and u128. (I tried removing it for just u64 and u128; that made things slightly worse.)
- Putting the loop into a separate function is a regression in a lot of cases, so I have ditched that.
So I now have two commits in this PR, and if the CI run works well I think we should land them both here, and drop #92631. Sorry for any confusion!
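As a rough illustration of the first point, a before/after sketch with assumed shapes (not the actual diff): the earlier form decodes from an offset and returns the byte count for the caller to apply, while the later form bumps the caller's position in place.

// Before: return how many bytes were consumed; the caller advances
// its own position afterwards.
fn read_u64_before(slice: &[u8], start: usize) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0;
    let mut len = 0;
    loop {
        let byte = slice[start + len];
        len += 1;
        value |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return (value, len);
        }
        shift += 7;
    }
}

// After: take the position by mutable reference and bump it in place,
// so no separate "bytes read" bookkeeping is needed.
fn read_u64_after(slice: &[u8], position: &mut usize) -> u64 {
    let mut value = 0u64;
    let mut shift = 0;
    loop {
        let byte = slice[*position];
        *position += 1;
        value |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return value;
        }
        shift += 7;
    }
}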
@bors try @rust-timer queue
Commit facba24 has been approved by michaelwoerister
It seems like getting any benefit from that crate requires compiling rustc with at least the ssse3 target feature. Rustc is currently compiled to be suitable for all x86_64 CPUs.
Test timed out
@bors retry Test timed out
Test failed - checks-actions
@bors retry Some apple builder is having network issues. "unable to access 'https://github.com/rust-lang-ci/rust/': Could not resolve host: github.com"
Test successful - checks-actions
Approved by: michaelwoerister
Pushing 38c22af to master...
Finished benchmarking commit (38c22af): comparison url.
Summary: This change led to large relevant improvements in compiler performance.
- Large improvement in instruction counts (up to -3.6% on full builds of helloworld)
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
@rustbot label: -perf-regression