0

Optimize `impl_read_unsigned_leb128` by nnethercote · Pull Request #92604 · rust...

 3 months ago
source link: https://github.com/rust-lang/rust/pull/92604
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is damaged, please click the button below to view the snapshot at that time.

Conversation

Copy link

Contributor

nnethercote commented 14 days ago

I see instruction count improvements of up to 3.5% locally with these changes, mostly on the smaller benchmarks.

r? @michaelwoerister

Copy link

Contributor

Author

nnethercote commented 14 days ago

Copy link

Collaborator

rust-timer commented 14 days ago

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

Copy link

Contributor

bors commented 14 days ago

hourglass Trying commit ac6e14f with merge 2817d16...

This comment has been hidden.

This comment has been hidden.

Copy link

Collaborator

rust-timer commented 14 days ago

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

Copy link

Contributor

bors commented 14 days ago

hourglass Trying commit a18ddee with merge 7775b1f...

Copy link

Contributor

bors commented 14 days ago

sunny Try build successful - checks-actions
Build commit: 7775b1f (7775b1f299a2e55b29c22ee1c04e4b31a08e34f8)

Copy link

Collaborator

rust-timer commented 14 days ago

Copy link

Collaborator

rust-timer commented 14 days ago

Finished benchmarking commit (7775b1f): comparison url.

Summary: This change led to moderate relevant mixed results shrug in compiler performance.

  • Moderate improvement in instruction counts (up to -2.2% on incr-unchanged builds of ucd)
  • Moderate regression in instruction counts (up to 1.4% on incr-unchanged builds of ctfe-stress-4)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

// The first iteration of the loop is unpeeled because this code is

// hot and integers less than 128 are very common, occurring 80%+

// of the time.

Is that true for all types? IIRC, that's very common for usize but much less so for u64, for example. Maybe it would make sense to have two implementations, one optimized for small values and another optimized for larger values?

Copy link

Contributor

@Kobzol Kobzol 14 days ago

Indeed it could be a good idea to generalize the impl for various "categories" of numbers (often small, often large), it helped recently in hashing: #92103.

Copy link

Contributor

Author

@nnethercote nnethercote 14 days ago

Good point! I did some follow-up investigation, and saw that this accounts for the regressions seen in some cases. For example, the ctfe-stress-4 check incr-unchanged regression seen on the CI run has this distribution of leb128 lengths for read u64 values:

2361790 counts:
(  1)  1311716 (55.5%, 55.5%): u64 1
(  2)  1049356 (44.4%,100.0%): u64 10
(  3)      574 ( 0.0%,100.0%): u64 9
(  4)      115 ( 0.0%,100.0%): u64 2
(  5)       20 ( 0.0%,100.0%): u64 4
(  6)        5 ( 0.0%,100.0%): u64 3
(  7)        3 ( 0.0%,100.0%): u64 8
(  8)        1 ( 0.0%,100.0%): u64 7

Copy link

Contributor

Kobzol commented 14 days ago

FWIW, I looked at reading/writing LEB128 recently. I tried two things:

  • Removing bound checks (let byte = slice[*position];) from reading: didn't help
  • Removing the loop from writing and performing an explicit match on the number of bytes that should be written, like:
match value {
    0..0x7F => leb_write(),
    0x80... => leb_write(), leb_write()
}

etc. It helped for some benchmarks, but in the end it was mostly a regression, probably because the code was much larger. But I think that maybe if a good compromise on the number of branches was found, it could be an improvement, because now the loop is quite tight and the repeated increment can slow it down.

Copy link

Contributor

Author

nnethercote commented 14 days ago

I have split out the second commit ("Modify the buffer position directly when reading leb128 values") into its own PR (#92631) because it's a straightforward win for all types. Once that is merged I'll return here and work through the more subtle changes that are type-dependent. Thanks for the feedback.

nnethercote

force-pushed the optimize-impl_read_unsigned_leb128

branch from a18ddee to facba24 13 days ago

Copy link

Contributor

Author

nnethercote commented 13 days ago

edited

I did some more investigation.

  • Modifying the buffer position directly is a universal win, as #92631 shows.
  • Unpeeling the first iteration of the loop is also a universal win, even for u64 and u128. (I tried removing it for just u64 and u128, that made things slightly worse.)
  • Putting the loop into a separate function is a regression on a lot of case, so I have ditched that.

So I now have two commits in this PR, and if the CI run works well I think we should land them both here, and drop #92631. Sorry for any confusion!

@bors try @rust-timer queue

Copy link

Contributor

bors commented 13 days ago

pushpin Commit facba24 has been approved by michaelwoerister

Copy link

Contributor

bjorn3 commented 12 days ago

It seems like getting any benefit from that crate requires compiling rustc with at least the ssse3 target feature. Rustc is currently compiled to be suitable for all x86_64 cpu's.

Copy link

Contributor

bors commented 7 days ago

hourglass Testing commit facba24 with merge e7fc933...

Copy link

Contributor

bors commented 7 days ago

boom Test timed out

Copy link

Member

lqd commented 7 days ago

@bors retry Test timed out

Copy link

Collaborator

rust-log-analyzer commented 7 days ago

A job failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

Copy link

Contributor

bors commented 6 days ago

hourglass Testing commit facba24 with merge 1078f1d...

Copy link

Collaborator

rust-log-analyzer commented 6 days ago

A job failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

Copy link

Contributor

bors commented 6 days ago

broken_heart Test failed - checks-actions

Copy link

Member

lqd commented 6 days ago

@bors retry Some apple builder is having network issues. "unable to access 'https://github.com/rust-lang-ci/rust/': Could not resolve host: github.com"

Copy link

Contributor

bors commented 5 days ago

hourglass Testing commit facba24 with merge 38c22af...

Copy link

Contributor

bors commented 5 days ago

sunny Test successful - checks-actions
Approved by: michaelwoerister
Pushing 38c22af to master...

bors

merged commit 38c22af into

rust-lang:master 5 days ago

11 checks passed

rustbot

added this to the 1.60.0 milestone

5 days ago

Copy link

Collaborator

rust-timer commented 5 days ago

Finished benchmarking commit (38c22af): comparison url.

Summary: This change led to large relevant improvements tada in compiler performance.

  • Large improvement in instruction counts (up to -3.6% on full builds of helloworld)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

nnethercote

deleted the optimize-impl_read_unsigned_leb128 branch

5 days ago

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Milestone

1.60.0

Linked issues

Successfully merging this pull request may close these issues.

None yet

11 participants

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK