Compress amount of hashed bytes for `isize` values in StableHasher · by Kobzol
source link: https://github.com/rust-lang/rust/pull/93432
This is another attempt to land #92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was reverted because it could produce the same hash for different values even in quite simple situations. I have since added a basic test that should guard against that situation, and I also added a new test in this PR, specialised for this optimization.
### Why this optimization helps
Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashed `i64` values actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where this datatype is used directly in the compiler, and this place has also been showing up in the traces that I used to find out when `isize` is being hashed. This place is `rustc_span::FileName::DocTest`; however, I suppose that `isize`s also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it quite a lot.
Now, I'm not sure if special-casing `isize` is the correct solution here; maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion, I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and the few other benchmarks that I was testing).
### New approach
Since the most common case is a single byte, I added a fast path for hashing `isize` values whose value (treated as an unsigned number) fits within a single byte, and a cold path for the rest of the values.
To avoid the previous correctness problem, we need to make sure that each unique sequence of `isize` values will produce a unique hash stream to the hasher. By hash stream I mean the sequence of bytes that will be hashed (a different sequence should produce a different hash, although that is of course not guaranteed).
We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen.
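To make the failure mode concrete, here is a small illustrative sketch (not the compiler's actual code; `naive_encode` and `encode_pair` are made-up names) of such a naive leading-zero-skipping scheme and the collision it causes:

```rust
// Hypothetical sketch of a naive encoding that drops the leading zero
// bytes of any value that fits in a single byte. This is NOT the actual
// StableHasher code; it only demonstrates the correctness problem.
fn naive_encode(v: u64) -> Vec<u8> {
    if v <= 0xFF {
        vec![v as u8] // small values: emit just one byte
    } else {
        v.to_le_bytes().to_vec() // otherwise emit all eight bytes
    }
}

fn encode_pair(a: u64, b: u64) -> Vec<u8> {
    let mut out = naive_encode(a);
    out.extend(naive_encode(b));
    out
}

fn main() {
    // Two distinct pairs produce the identical byte stream
    // (nine 0xFF bytes in both cases), so the hasher cannot
    // tell them apart:
    let s1 = encode_pair(0xFF, 0xFFFF_FFFF_FFFF_FFFF);
    let s2 = encode_pair(0xFFFF_FFFF_FFFF_FFFF, 0xFF);
    assert_eq!(s1, s2);
    println!("collision: {:?}", s1);
}
```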
To avoid this situation, values in `[0, 0xFE]` are hashed as a single byte. When we hash a larger value (treating `isize` as `u64`), we first hash an additional marker byte `0xFF`. Since `0xFF` cannot occur when we apply the single-byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`:
- When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different.
- When neither `a` nor `b` is within `[0, 0xFE]`, their hash streams will be different.
- When `a` is within `[0, 0xFE]` and `b` isn't, then when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`, while when we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different.
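Putting the scheme together, a minimal sketch of the encoding could look like this (illustrative names and a standalone function, not the actual `StableHasher` internals, which write into the hasher rather than returning a `Vec`):

```rust
// Illustrative sketch of the encoding described above; `encode_isize`
// is a made-up name and this is not the real StableHasher code.
fn encode_isize(v: isize) -> Vec<u8> {
    let x = v as u64; // reinterpret the bit pattern, treating isize as u64
    if x < 0xFF {
        // Fast path: values in [0, 0xFE] become a single byte,
        // which can never equal the 0xFF marker.
        vec![x as u8]
    } else {
        // Cold path: the 0xFF marker byte, then all eight bytes.
        let mut out = vec![0xFF];
        out.extend_from_slice(&x.to_le_bytes());
        out
    }
}

fn main() {
    // Single-byte fast path:
    assert_eq!(encode_isize(0x42), vec![0x42]);
    assert_eq!(encode_isize(0xFE), vec![0xFE]);
    // Larger values take the marker-prefixed cold path (9 bytes):
    assert_eq!(encode_isize(0xFF).len(), 9);

    // The third case above: a = 0x42 is in [0, 0xFE], b = -1 is not
    // (as u64, -1 is 0xFFFFFFFFFFFFFFFF).
    let (a, b) = (0x42isize, -1isize);
    let mut ab = encode_isize(a);
    ab.extend(encode_isize(b)); // stream starts with 0x42, not 0xFF
    let mut ba = encode_isize(b);
    ba.extend(encode_isize(a)); // stream starts with the 0xFF marker
    assert_ne!(ab, ba);
    println!("streams differ: {:?} vs {:?}", &ab[..2], &ba[..2]);
}
```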
r? @the8472