
manually implement `Hash` for `DefId` by llogiq · Pull Request #91660 · rust-lan...

source link: https://github.com/rust-lang/rust/pull/91660

llogiq (Contributor) commented on Dec 8, 2021 (edited)

This might speed up hashing for hashers that can work on individual u64s. Just an experiment, suggested in a Reddit thread on FxHasher. cc @nnethercote

Note that this should not be merged as-is without #[cfg]-gating the code path to 64-bit targets.
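
As a rough illustration of the idea (a minimal sketch with a stand-in type, not the actual rustc definitions), packing the two 32-bit halves into one u64 turns two hasher writes into one:

    use std::hash::{Hash, Hasher};

    // Stand-in for DefId; the real type wraps CrateNum and DefIndex newtypes.
    struct Id {
        krate: u32,
        index: u32,
    }

    impl Hash for Id {
        fn hash<H: Hasher>(&self, h: &mut H) {
            // One u64 write instead of two u32 writes, so a 64-bit hasher
            // like FxHasher only has to mix a single word.
            (((self.krate as u64) << 32) | (self.index as u64)).hash(h)
        }
    }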

rust-highfive (Collaborator) commented on Dec 8, 2021

r? @jackh726

(rust-highfive has picked a reviewer for you, use r? to override)

rust-timer (Collaborator) commented on Dec 8, 2021

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors (Contributor) commented on Dec 8, 2021

Trying commit ad29818 with merge abe401b...

bors (Contributor) commented on Dec 8, 2021

Try build successful - checks-actions
Build commit: abe401b (abe401b21ff392a1e7153686e94b22d7949b9a0c)

rust-timer (Collaborator) commented on Dec 8, 2021

Finished benchmarking commit (abe401b): comparison url.

Summary: This change led to moderate relevant improvements in compiler performance.

  • Moderate improvement in instruction counts (up to -0.4% on full builds of externs)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

llogiq (Contributor, Author) commented on Dec 8, 2021

Those results appear to be a mixed bag. I suppose this might be because the CrateNum gets shifted out so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.

    // hash one u64 instead of two u32s
    impl Hash for DefId {
        fn hash<H: Hasher>(&self, h: &mut H) {
            ((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)

Suggested change:

            ((u64::from(u32::from(self.index)) << 32) | u64::from(u32::from(self.krate))).hash(h)

It may be faster this way: https://rust.godbolt.org/z/EMeebaGb3

llogiq (Contributor, Author) replied on Dec 8, 2021

Yes, but that would exacerbate the problem of def index bit distribution (at least my theory is that we usually have more defs within a crate than crates within one build, but I may be wrong).

I'd rather reorder the fields in the type.


This is a little-endian only improvement, right?

Yeah (probably 64-bit LE only to boot) -- if you throw a --target on the godbolt for a BE arch you can see what it looks like.

In practice the most popular architectures these days are LE (x86-64 and aarch64), but I guess if one were really motivated they could use #[cfg] to keep it optimal on both.

Thanks! I always find endianness tricky to think about, just checking my reasoning :)
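
For reference, a hedged sketch of the #[cfg] idea mentioned above (field names and order are illustrative, and repr(Rust) gives no layout guarantee): pick the combination order whose bit layout matches the in-memory order of the two fields, so the shift-and-or can collapse into a single 8-byte load on either endianness.

    struct Id {
        krate: u32,
        index: u32,
    }

    impl Id {
        #[cfg(target_endian = "little")]
        fn as_u64(&self) -> u64 {
            // On little-endian the first field sits in the low bytes of an
            // 8-byte load, so `krate` goes in the low half.
            ((self.index as u64) << 32) | (self.krate as u64)
        }

        #[cfg(target_endian = "big")]
        fn as_u64(&self) -> u64 {
            // On big-endian the first field sits in the high bytes instead.
            ((self.krate as u64) << 32) | (self.index as u64)
        }
    }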

nnethercote (Contributor) commented on Dec 8, 2021

Those results appear to be a mixed bag.

Are they? The instruction count changes of significance are all small improvements.

I suppose this might be because the CrateNum gets shifted out so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.

I don't understand this. Can you elaborate?

llogiq (Contributor, Author) commented on Dec 8, 2021

I guess I was reading too much into the results. After looking into the hash function again, I think it's probably noise. I'd like to see if reordering the fields brings more benefit. What's your take? Should we try to optimize the Hash derive for 64-bit hashers?


nnethercote (Contributor) commented on Dec 8, 2021

I will experiment today with different variations. I have local benchmarking set up which is much faster than CI for doing multiple perf runs.

nnethercote (Contributor) commented on Dec 8, 2021

Ok, some very surprising results.

v1: like the original in this PR (with slightly different syntax for the integer conversions):

    fn hash<H: Hasher>(&self, h: &mut H) {
        (((self.krate.as_u32() as u64) << 32) | (self.index.as_u32() as u64)).hash(h)
    }

Gave some tiny instruction count wins (I only measured Full builds):

Benchmark & Profile   Scenario   % Change   Significance Factor
externs debug         full       -0.60%     3.00x
externs opt           full       -0.60%     3.00x
deep-vector debug     full       -0.55%     2.77x
inflate debug         full       -0.35%     1.76x

v2: Alex's suggestion of swapping the order:

    fn hash<H: Hasher>(&self, h: &mut H) {
        (((self.index.as_u32() as u64) << 32) | (self.krate.as_u32() as u64)).hash(h)
    }

Gave gigantic slowdowns. Here's the first part of the results:

Benchmark & Profile       Scenario   % Change   Significance Factor
stm32f4 check             full       1350.47%   6752.37x
stm32f4 debug             full       1264.39%   6321.93x
stm32f4 opt               full       1242.46%   6212.28x
unused-warnings check     full        416.26%   2081.28x
unused-warnings debug     full        385.00%   1925.00x
unused-warnings opt       full        382.23%   1911.17x
externs debug             full        336.14%   1680.69x
externs opt               full        335.54%   1677.72x
many-assoc-items debug    full        161.70%    808.52x
many-assoc-items check    full        161.44%    807.21x
many-assoc-items opt      full        161.37%    806.83x
externs check             full         94.20%    470.98x

I did a Cachegrind diff and confirmed that it was due to huge numbers of collisions. E.g. this hashbrown function:

         .               fn move_next(&mut self, bucket_mask: usize) {
         .                   // We should have found an empty bucket by now and ended the probe.
         .                   debug_assert!(
         .                       self.stride <= bucket_mask,
         .                       "Went past end of probe sequence"
         .                   );
         .               
   386,363 ( 0.02%)          self.stride += Group::WIDTH;
         6 ( 0.00%)          self.pos += self.stride;
 1,571,177 ( 0.09%)          self.pos &= bucket_mask;
         .               }

changed to this:

            .               fn move_next(&mut self, bucket_mask: usize) {
            .                   // We should have found an empty bucket by now and ended the probe.
            .                   debug_assert!(
            .                       self.stride <= bucket_mask,
            .                       "Went past end of probe sequence"
            .                   ); 
            .               
  337,778,897 ( 3.86%)          self.stride += Group::WIDTH;
            6 ( 0.00%)          self.pos += self.stride;
1,476,729,111 (16.86%)          self.pos &= bucket_mask;
            .               }   
            .           }

Wow. FxHasher does really poorly when most of the entropy is in the higher bits.
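
To make the failure mode concrete, here is a small self-contained model of that effect. The rotate/multiply step mirrors the rustc-hash FxHasher of the time (constant included), but treat it as an illustration rather than the real hasher: when all of a key's entropy sits in the upper 32 bits, the wrapping multiply only pushes it further up, so the low bits that hashbrown masks with bucket_mask are identical for every key and every insert starts probing from the same bucket.

    use std::collections::HashSet;

    // Core mixing step of the 64-bit FxHasher, reproduced for illustration.
    const SEED: u64 = 0x51_7c_c1_b7_27_22_0a_95;

    fn fx_mix(hash: u64, word: u64) -> u64 {
        (hash.rotate_left(5) ^ word).wrapping_mul(SEED)
    }

    fn main() {
        let bucket_mask: u64 = (1 << 12) - 1; // a 4096-bucket table
        let buckets: HashSet<u64> = (0..1000u64)
            .map(|k| fx_mix(0, k << 32)) // entropy only in bits 32..63
            .map(|h| h & bucket_mask)    // bucket index = low bits of the hash
            .collect();
        // Every one of the 1000 distinct keys maps to the same bucket index.
        assert_eq!(buckets.len(), 1);
    }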

v3: like v1, but with the fields in DefId swapped. This gave slightly better results than v1:

Benchmark & Profile       Scenario   % Change   Significance Factor
match-stress-enum check   full       -2.14%     10.72x
match-stress-enum debug   full       -2.04%     10.21x
match-stress-enum opt     full       -2.04%     10.19x
externs opt               full       -1.22%     6.09x
externs debug             full       -1.22%     6.08x
deep-vector debug         full       -0.56%     2.80x
webrender-wrench debug    full       -0.35%     1.75x

But I get the same test failures seen on CI, which I don't understand. I get those same failures even if I write hash like this:

    fn hash<H: Hasher>(&self, h: &mut H) {
        self.krate.hash(h);
        self.index.hash(h);
    }
}

So now I'm really confused.

llogiq (Contributor, Author) commented on Dec 8, 2021

Perhaps those tests rely on the field order of DefId? I'm not sure how (I haven't had the time to look into them), but given your suggestion, I'll do that next.

llogiq (Contributor, Author) commented on Dec 8, 2021 (edited)

Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.

nnethercote (Contributor) commented on Dec 9, 2021

Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.

By "large shift" do you mean the left shift by 32? It's the obvious way to combine two 32 bit values into a 64 bit value. If you're talking about doing something more complex like xor or rotate then that's just a kind of pre-hash before the real hash, which doesn't seem worthwhile; might as well just hash the two fields in the normal way.

I did try another thing: changing FxHasher to be more like the fallback hasher from aHash.

pub(crate) const MULTIPLE: u64 = 6364136223846793005;

#[inline(always)]
pub(crate) const fn folded_multiply(s: u64, by: u64) -> u64 {
    let result = (s as u128).wrapping_mul(by as u128);
    ((result & 0xffff_ffff_ffff_ffff) as u64) ^ ((result >> 64) as u64)
}

impl FxHasher {
    #[inline]
    fn add_to_hash(&mut self, i: u64) {
        self.hash = folded_multiply(i ^ self.hash, MULTIPLE);
    }
}

The folded multiply is clever: it folds the overflow bytes from the multiply back into the result. This avoids the FxHasher weakness with higher bits and compiles down to the same number of instructions as FxHasher: https://rust.godbolt.org/z/17xqa5e5e

It doesn't go bad when the DefId fields are hashed in the other order, but overall it produces a tiny bit more instructions than FxHasher. Not sure why; maybe it inlines a little differently.
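
A quick self-contained check of why the fold helps (the constant and helper are copied from the snippet above; this is an illustration, not rustc code): with a plain wrapping multiply, two inputs that differ only in their upper 32 bits yield results with identical low 32 bits, while the folded multiply carries that difference down into the low bits as well.

    const MULTIPLE: u64 = 6364136223846793005;

    const fn folded_multiply(s: u64, by: u64) -> u64 {
        let result = (s as u128).wrapping_mul(by as u128);
        ((result & 0xffff_ffff_ffff_ffff) as u64) ^ ((result >> 64) as u64)
    }

    fn main() {
        // Two inputs that differ only in bits 32..63.
        let a = 1u64 << 32;
        let b = 2u64 << 32;

        // Plain wrapping multiply: the low 32 bits of both results are equal
        // (zero here), so a table indexed by low bits cannot tell them apart.
        assert_eq!(a.wrapping_mul(MULTIPLE) as u32, b.wrapping_mul(MULTIPLE) as u32);

        // Folded multiply: the upper-bit difference reaches the low 32 bits.
        assert_ne!(
            folded_multiply(a, MULTIPLE) as u32,
            folded_multiply(b, MULTIPLE) as u32
        );
    }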

nnethercote (Contributor) commented on Dec 9, 2021

Thinking some more: within the compiler FxHasher is mostly hashing integers and pointers, and some strings.

  • The integers are all or almost all 32-bits or less, so the weakness in bits 32-63 doesn't matter.
  • As for the pointers, on 64-bit machines they're only really 48-bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.
  • As for strings (e.g. symbols), they are processed in 64-bit chunks, and so the upper bits weakness will have some effect there.

alex (Contributor) commented on Dec 9, 2021

As for the pointers, on 64-bit machines they're only really 48-bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.

Except the bottom 2-3 bits, which are always 0. (I'm not sure if this meaningfully impacts your point.)

(Sorry for the confusion of commenting from two different accounts!)

llogiq (Contributor, Author) commented on Dec 9, 2021

The UI tests appear to check trait coherence and ambiguity. Probably a good idea to dig into that code, as it may be less robust than we'd like.

llogiq (Contributor, Author) commented on Dec 9, 2021

Having had a deeper look at the failing tests, the difference appears to hinge on the iteration order of the trait lookup, which is a HashMap, so it's kind of expected that order would change when the hashing function does. I think updating the tests here should be fine.

llogiq (Contributor, Author) commented on Dec 9, 2021

Ok, so I guess we want to avoid doing such optimizations generically. The risk of getting an avalanche of collisions (pardon the pun) is very real.

For this PR, before merging I think I should #[cfg]-out the change for all but 64-bit platforms. I also wonder whether it would be worth making the order of the DefId fields endianness-dependent, but then we'd get different UI test output on those platforms. Given that the difference isn't that large, I don't think it should block this PR.

llogiq (Contributor, Author) commented on Dec 9, 2021

This should be ready now. In practice it boils down to @nnethercote's v3 (which gave slightly better results in his benchmarks) on 64-bit systems, keeping the derived impls on all other pointer widths. Do we need another perf run, or can we merge this, @jackh726?
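
For concreteness, a hedged sketch of the shape described here (not the exact rustc code; the field types are stand-ins and the derive list is abbreviated): a manual impl that hashes one u64 on 64-bit targets, with the derived impl kept everywhere else.

    use std::hash::{Hash, Hasher};

    #[cfg_attr(not(target_pointer_width = "64"), derive(Hash))]
    #[derive(Clone, Copy, PartialEq, Eq)]
    pub struct DefId {
        // Field order as in v3: index first, krate second.
        pub index: u32,
        pub krate: u32,
    }

    #[cfg(target_pointer_width = "64")]
    impl Hash for DefId {
        fn hash<H: Hasher>(&self, h: &mut H) {
            // One u64 write: krate in the high half, index in the low half.
            (((self.krate as u64) << 32) | (self.index as u64)).hash(h)
        }
    }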


camelid (Member) commented on Dec 9, 2021

Also, I find this form more readable; do others?

Extracting krate = self.krate.as_u32() as u64 and index = self.index.as_u32() as u64 variables might improve it even further.
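
For illustration, the suggested shape with the same kind of stand-in type as earlier (a sketch of camelid's suggestion, not the merged code):

    use std::hash::{Hash, Hasher};

    struct Id {
        krate: u32,
        index: u32,
    }

    impl Hash for Id {
        fn hash<H: Hasher>(&self, h: &mut H) {
            // Named locals instead of one long expression, per the suggestion.
            let krate = self.krate as u64;
            let index = self.index as u64;
            ((krate << 32) | index).hash(h)
        }
    }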

llogiq (Contributor, Author) commented on Dec 9, 2021

@camelid with the comments I added, this should hopefully now be sufficiently clear.

camelid (Member) commented on Dec 9, 2021

I'd suggest making the comments into doc-comments so rustdoc picks them up. Also, what's the disadvantage of extracting local variables? I'm not sure why adding comments helps with that.

llogiq (Contributor, Author) commented on Dec 9, 2021

I even made some ASCII art to show where each bit ends up. So what's unclear? :)

My reasoning for not making those doc comments is that they are for folks working on exactly that file whereas docs are for people using the type. If I just use a DefId, I don't need to care how it's going to be hashed. On the other hand, if I want to change DefId, I read the source anyway, which will include the comments.

rust-timer (Collaborator) commented on Dec 9, 2021

Finished benchmarking commit (03e04a6): comparison url.

Summary: This change led to large relevant improvements in compiler performance.

  • Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

camelid (Member) commented on Dec 9, 2021

I even made some ASCII art to show where each bit ends up. So what's unclear? :)

Oh, I wasn't saying it's unclear, just that the code would be a little easier to read if local variables were extracted. Your comment is definitely helpful :)

nnethercote (Contributor) commented on Dec 10, 2021

@bors r+

bors (Contributor) commented on Dec 10, 2021

Commit 635533b has been approved by nnethercote

klensy (Contributor) commented on Dec 10, 2021

Is it actually a good idea to change the order of fields and rely on the current behavior? The default repr(Rust) makes no layout promises ("There are no guarantees of data layout made by this representation."), so is relying on field order UB?

llogiq (Contributor, Author) commented on Dec 10, 2021

We only rely on field order for performance, not correctness. Nothing to worry about here.

bors (Contributor) commented on Dec 14, 2021

Testing commit 635533b with merge a2d25b4...

bors (Contributor) commented on Dec 14, 2021

Test successful - checks-actions
Approved by: nnethercote
Pushing a2d25b4 to master...

bors merged commit a2d25b4 into rust-lang:master on Dec 14, 2021. 11 checks passed.

llogiq deleted the make-a-hash-of-def-ids branch.

rust-timer (Collaborator) commented on Dec 14, 2021

Finished benchmarking commit (a2d25b4): comparison url.

Summary: This change led to large relevant improvements in compiler performance.

  • Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

We only rely on field order for performance, not correctness. Nothing to worry about here.

Field order could be guaranteed by #[repr(C)], right? Would that have any downsides?

llogiq (Contributor, Author) commented on Dec 16, 2021

Probably not. Although at that point we may also want to #[cfg] the field order for 64-bit big-endian systems.

I'll create a follow-up PR.

