manually implement `Hash` for `DefId` by llogiq · Pull Request #91660 · rust-lan...
source link: https://github.com/rust-lang/rust/pull/91660
This might speed up hashing for hashers that can work on individual u64s. Just as an experiment, suggested in a reddit thread on FxHasher. cc @nnethercote
Note that this should not be merged as is without cfg-ing the code path for 64 bits.
r? @jackh726
(rust-highfive has picked a reviewer for you, use r? to override)
Awaiting bors try build completion.
@rustbot label: +S-waiting-on-perf
Try build successful - checks-actions
Build commit: abe401b (abe401b21ff392a1e7153686e94b22d7949b9a0c)
Finished benchmarking commit (abe401b): comparison url.
Summary: This change led to moderate relevant improvements in compiler performance.
- Moderate improvement in instruction counts (up to -0.4% on full builds of externs)
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression
Those results appear to be a mixed bag. I suppose this might be because the CrateNum gets shifted out, so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.
compiler/rustc_span/src/def_id.rs (outdated)

```rust
// hash one u64 instead of two u32s
impl Hash for DefId {
    fn hash<H: Hasher>(&self, h: &mut H) {
        ((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)
    }
}
```
It may be faster this way: https://rust.godbolt.org/z/EMeebaGb3
Yes but that would exacerbate the problem of def index bit distribution further (at least my theory is that we usually have more defs in a crate than we have crates within one build, but I may be wrong).
I'd rather reorder the fields in the type.
This is a little-endian only improvement, right?
Yeah (probably 64-bit LE only to boot) -- if you throw a --target on the godbolt for a BE arch you can see what it looks like.
In practice the most popular architectures these days are LE (x86-64 and aarch64), but I guess if one were really motivated they could use #[cfg] to keep it optimal on both.
Thanks! I always find endianness tricky to think about, just checking my reasoning :)
> Those results appear to be a mixed bag.
Are they? The instruction count changes of significance are all small improvements.
> I suppose this might be because the CrateNum gets shifted out so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.
I don't understand this. Can you elaborate?
I guess I was reading too much into the results. After looking into the hash function again, I think it's probably noise. I'd like to see if reordering the fields brings more benefit. What's your take? Should we try to optimize the Hash derive for 64-bit hashers?
I will experiment today with different variations. I have local benchmarking set up which is much faster than CI for doing multiple perf runs.
Ok, some very surprising results.
Like the original in this PR (with slightly different syntax for the integer conversions):

```rust
fn hash<H: Hasher>(&self, h: &mut H) {
    (((self.krate.as_u32() as u64) << 32) | (self.index.as_u32() as u64)).hash(h)
}
```
Gave some tiny instruction count wins (I only measured full builds):

| Benchmark | Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|---|
| externs | debug | full | -0.60% | 3.00x |
| externs | opt | full | -0.60% | 3.00x |
| deep-vector | debug | full | -0.55% | 2.77x |
| inflate | debug | full | -0.35% | 1.76x |
Alex's suggestion of swapping the order:
```rust
fn hash<H: Hasher>(&self, h: &mut H) {
    (((self.index.as_u32() as u64) << 32) | (self.krate.as_u32() as u64)).hash(h)
}
```
Gave gigantic slowdowns. Here's the first part of the results:
| Benchmark | Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|---|
| stm32f4 | check | full | 1350.47% | 6752.37x |
| stm32f4 | debug | full | 1264.39% | 6321.93x |
| stm32f4 | opt | full | 1242.46% | 6212.28x |
| unused-warnings | check | full | 416.26% | 2081.28x |
| unused-warnings | debug | full | 385.00% | 1925.00x |
| unused-warnings | opt | full | 382.23% | 1911.17x |
| externs | debug | full | 336.14% | 1680.69x |
| externs | opt | full | 335.54% | 1677.72x |
| many-assoc-items | debug | full | 161.70% | 808.52x |
| many-assoc-items | check | full | 161.44% | 807.21x |
| many-assoc-items | opt | full | 161.37% | 806.83x |
| externs | check | full | 94.20% | 470.98x |
I did a Cachegrind diff and confirmed that it was due to huge numbers of collisions. E.g. this hashbrown function:
```text
        .             fn move_next(&mut self, bucket_mask: usize) {
        .                 // We should have found an empty bucket by now and ended the probe.
        .                 debug_assert!(
        .                     self.stride <= bucket_mask,
        .                     "Went past end of probe sequence"
        .                 );
        .
  386,363 ( 0.02%)        self.stride += Group::WIDTH;
        6 ( 0.00%)        self.pos += self.stride;
1,571,177 ( 0.09%)        self.pos &= bucket_mask;
        .             }
```
changed to this:

```text
            .             fn move_next(&mut self, bucket_mask: usize) {
            .                 // We should have found an empty bucket by now and ended the probe.
            .                 debug_assert!(
            .                     self.stride <= bucket_mask,
            .                     "Went past end of probe sequence"
            .                 );
            .
  337,778,897 ( 3.86%)        self.stride += Group::WIDTH;
            6 ( 0.00%)        self.pos += self.stride;
1,476,729,111 (16.86%)        self.pos &= bucket_mask;
            .             }
```
Wow. FxHasher does really poorly when most of the entropy is in the higher bits.
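That weakness is easy to demonstrate outside the compiler. Here's a minimal standalone sketch (assuming the classic 64-bit FxHasher update step, `hash = (hash.rotate_left(5) ^ value).wrapping_mul(K)`, with rustc-hash's usual seed constant): when all the entropy sits in the high 32 bits of the input, the low bits of the product are constant, so a table like hashbrown that derives its bucket index from the low hash bits degenerates into one giant probe chain.

```rust
use std::collections::HashSet;

// Assumed seed constant of rustc-hash's 64-bit FxHasher.
const K: u64 = 0x51_7c_c1_b7_27_22_0a_95;

// One FxHasher update step, starting from an empty (zero) hash state.
fn fx_step(value: u64) -> u64 {
    (0u64.rotate_left(5) ^ value).wrapping_mul(K)
}

fn main() {
    // Entropy only in the high 32 bits (index << 32 | krate with krate == 0,
    // i.e. the swapped packing above): the low 32 bits of value * K are
    // always zero, so bucketing by the low 7 bits hits a single bucket.
    let high: HashSet<u64> = (0u64..1024).map(|i| fx_step(i << 32) & 0x7f).collect();

    // Entropy in the low 32 bits: K is odd, so the low bits mix freely and
    // all 128 low-7-bit buckets get hit.
    let low: HashSet<u64> = (0u64..1024).map(|i| fx_step(i) & 0x7f).collect();

    println!("high-bit entropy buckets: {}", high.len()); // 1
    println!("low-bit entropy buckets:  {}", low.len()); // 128
}
```

This matches the Cachegrind diff above: every probe lands in the same handful of buckets, so `move_next` runs vastly more often.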
Like v1, but with the fields in DefId swapped. Gave slightly better results than v1.
| Benchmark | Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|---|
| match-stress-enum | check | full | -2.14% | 10.72x |
| match-stress-enum | debug | full | -2.04% | 10.21x |
| match-stress-enum | opt | full | -2.04% | 10.19x |
| externs | opt | full | -1.22% | 6.09x |
| externs | debug | full | -1.22% | 6.08x |
| deep-vector | debug | full | -0.56% | 2.80x |
| webrender-wrench | debug | full | -0.35% | 1.75x |
But I get the same test failures seen on CI, which I don't understand. I get those same failures even if I write hash like this:
```rust
fn hash<H: Hasher>(&self, h: &mut H) {
    self.krate.hash(h);
    self.index.hash(h);
}
```
So now I'm really confused.
Perhaps those tests rely on the field order of DefId? I'm not sure how (I haven't had the time to look into them), but given your suggestion, I'll do that next.
Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.
> Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.
By "large shift" do you mean the left shift by 32? It's the obvious way to combine two 32 bit values into a 64 bit value. If you're talking about doing something more complex like xor or rotate then that's just a kind of pre-hash before the real hash, which doesn't seem worthwhile; might as well just hash the two fields in the normal way.
I did try another thing: changing FxHasher to be more like the fallback hasher from aHash.
```rust
pub(crate) const MULTIPLE: u64 = 6364136223846793005;

#[inline(always)]
pub(crate) const fn folded_multiply(s: u64, by: u64) -> u64 {
    let result = (s as u128).wrapping_mul(by as u128);
    ((result & 0xffff_ffff_ffff_ffff) as u64) ^ ((result >> 64) as u64)
}

impl FxHasher {
    #[inline]
    fn add_to_hash(&mut self, i: u64) {
        self.hash = folded_multiply(i ^ self.hash, MULTIPLE);
    }
}
```
The folded multiply is clever: it folds the overflow bytes from the multiply back into the result. This avoids the FxHasher weakness with higher bits and compiles down to the same number of instructions as FxHasher: https://rust.godbolt.org/z/17xqa5e5e
It doesn't go bad when the DefId fields are hashed in the different order, but overall it's a tiny bit more instructions than FxHasher. Not sure why, maybe it inlines a little differently.
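To see why the fold helps, here's a small standalone check (toy code, not the real rustc-hash crate): with a plain multiply, high-only entropy leaves the low output bits constant, whereas the folded multiply xors the high 64 bits of the 128-bit product back into the low half, so the low bits vary again.

```rust
use std::collections::HashSet;

const MULTIPLE: u64 = 6364136223846793005;

// Plain Fx-style multiply: low output bits depend only on low input bits.
fn plain_multiply(s: u64) -> u64 {
    s.wrapping_mul(MULTIPLE)
}

// Folded multiply as in the snippet above: the overflow half of the
// 128-bit product is xor'd back into the low half.
fn folded_multiply(s: u64, by: u64) -> u64 {
    let result = (s as u128).wrapping_mul(by as u128);
    ((result & 0xffff_ffff_ffff_ffff) as u64) ^ ((result >> 64) as u64)
}

fn main() {
    // Inputs with entropy only in the high 32 bits, bucketed by low 7 bits.
    let plain: HashSet<u64> =
        (0u64..1024).map(|i| plain_multiply(i << 32) & 0x7f).collect();
    let folded: HashSet<u64> =
        (0u64..1024).map(|i| folded_multiply(i << 32, MULTIPLE) & 0x7f).collect();

    println!("plain buckets:  {}", plain.len()); // 1: everything collides
    println!("folded buckets: {}", folded.len()); // many distinct buckets
}
```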
Thinking some more: within the compiler FxHasher is mostly hashing integers and pointers, and some strings.
- The integers are all or almost all 32 bits or less, so the weakness in bits 32-63 doesn't matter.
- As for the pointers, on 64-bit machines they're only really 48 bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.
- As for strings (e.g. symbols), they are processed in 64-bit chunks, and so the upper-bits weakness will have some effect there.
> As for the pointers, on 64-bit machines they're only really 48 bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.

Except the bottom 2-3 bits, which are always 0. (I'm not sure if this meaningfully impacts your point.)
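The alignment point is easy to check; a tiny standalone illustration (not from the PR): heap allocations of a u64 are at least 8-byte aligned, so the low three bits of every such address are zero and carry no distinguishing information for a hasher.

```rust
fn main() {
    // Allocate a few u64s on the heap and inspect their addresses.
    let boxes: Vec<Box<u64>> = (0u64..8).map(Box::new).collect();
    for b in &boxes {
        let addr = &**b as *const u64 as usize;
        // u64 has 8-byte alignment, so the low 3 bits are always 0.
        assert_eq!(addr & 0x7, 0);
    }
    println!("all low 3 address bits are zero");
}
```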
(Sorry for the confusion of commenting from two different accounts!)
The UI tests appear to check trait coherence and ambiguity. Probably a good idea to dig into that code, as it may be less robust than we'd like.
Having had a deeper look at the failing tests, the difference appears to hinge on the iteration order of the trait lookup, which is a HashMap, so it's kind of expected that order would change when the hashing function does. I think updating the tests here should be fine.
Ok, so I guess we want to avoid doing such optimizations generically. The risk of getting an avalanche of collisions (pardon the pun) is very real.
For this PR, before merging I think I should #[cfg]-out the change for all but 64-bit platforms. I also wonder if it would be worth making the order of the DefId fields platform-endian-dependent, but then we'd get different UI test output on those platforms. Given that the difference isn't that large, I don't think it should block this PR.
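A hypothetical sketch of that gating (using a stand-in struct with plain u32 fields rather than the real CrateNum/DefIndex newtypes): pack the two halves into one u64 on 64-bit targets, and keep field-by-field hashing everywhere else.

```rust
use std::hash::{Hash, Hasher};

// Stand-in for rustc's DefId. Field order follows the v3 layout, with the
// high-entropy index ending up in the low 32 bits of the packed value.
struct DefId {
    index: u32,
    krate: u32,
}

impl Hash for DefId {
    #[cfg(target_pointer_width = "64")]
    fn hash<H: Hasher>(&self, h: &mut H) {
        // Hash one u64 instead of two u32s.
        (((self.krate as u64) << 32) | (self.index as u64)).hash(h)
    }

    #[cfg(not(target_pointer_width = "64"))]
    fn hash<H: Hasher>(&self, h: &mut H) {
        // Same behavior as the derived impl.
        self.krate.hash(h);
        self.index.hash(h);
    }
}

fn main() {
    // A recording hasher showing which u64 the 64-bit path feeds downstream.
    struct Capture(u64);
    impl Hasher for Capture {
        fn finish(&self) -> u64 { self.0 }
        fn write(&mut self, _bytes: &[u8]) {}
        fn write_u64(&mut self, i: u64) { self.0 = i; }
    }

    let id = DefId { index: 7, krate: 1 };
    let mut h = Capture(0);
    id.hash(&mut h);
    #[cfg(target_pointer_width = "64")]
    assert_eq!(h.finish(), (1u64 << 32) | 7);
    println!("packed: {:#x}", h.finish());
}
```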
This should be ready now. In practice it boils down to @nnethercote's v3 (which gave slightly better results in his benchmarks) on 64-bit systems and keeping the derived impls on all other pointer widths. Do we need another perf run or can we merge this, @jackh726?
Also, I find this form more readable, do others?
Extracting `krate = self.krate.as_u32() as u64` and `index = self.index.as_u32() as u64` variables might improve it even further.
@camelid with the comments I added, this should hopefully now be sufficiently clear.
I'd suggest making the comments into doc-comments so rustdoc picks them up. Also, what's the disadvantage of extracting local variables? I'm not sure why adding comments helps with that.
I even made some ascii art to show where each bit ends up. So what's unclear?
My reasoning for not making those doc comments is that they are for folks working on exactly that file, whereas docs are for people using the type. If I just use a DefId, I don't need to care how it's going to be hashed. On the other hand, if I want to change DefId, I read the source anyway, which will include the comments.
Finished benchmarking commit (03e04a6): comparison url.
Summary: This change led to large relevant improvements in compiler performance.
- Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression
> I even made some ascii art to show where each bit ends up. So what's unclear?
Oh, I wasn't saying it's unclear, just that the code would be a little easier to read if local variables were extracted. Your comment is definitely helpful :)
@bors r+
Commit 635533b has been approved by nnethercote
Is it actually a good idea to change the order of fields and rely on current behavior? As for the default repr(Rust): "There are no guarantees of data layout made by this representation.", so relying on field order is UB?
We only rely on field order for performance, not correctness. Nothing to worry here.
Test successful - checks-actions
Approved by: nnethercote
Pushing a2d25b4 to master...
Finished benchmarking commit (a2d25b4): comparison url.
Summary: This change led to large relevant improvements in compiler performance.
- Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
@rustbot label: -perf-regression
> We only rely on field order for performance, not correctness. Nothing to worry here.
Field order could be guaranteed by #[repr(C)], right? Would that have any downsides?

Probably not. Although at that point we may also want to #[cfg] the field order for 64-bit big-endian systems.
I'll create a follow-up PR.
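A hypothetical sketch of what that follow-up could look like (toy types, not the real rustc definitions): #[repr(C)] pins the field order, and a #[cfg(target_endian)] swap keeps a single 8-byte load of the struct equal to the same packed (krate << 32) | index value on both byte orders.

```rust
// Little-endian: the field at offset 0 becomes the low 32 bits of a u64
// load, so index goes first to keep the high-entropy half in the low bits.
#[cfg(target_endian = "little")]
#[repr(C)]
struct DefId {
    index: u32,
    krate: u32,
}

// Big-endian: the field at offset 0 becomes the high 32 bits, so the
// order is flipped to produce the same packed value.
#[cfg(target_endian = "big")]
#[repr(C)]
struct DefId {
    krate: u32,
    index: u32,
}

fn main() {
    let id = DefId { index: 7, krate: 1 };
    // On either byte order this reads back (krate << 32) | index.
    let packed =
        unsafe { std::ptr::read_unaligned(&id as *const DefId as *const u64) };
    assert_eq!(packed, (1u64 << 32) | 7);
    println!("{:#018x}", packed); // 0x0000000100000007
}
```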