
Compress amount of hashed bytes for `isize` values in StableHasher by Kobzol · P...

 2 years ago
source link: https://github.com/rust-lang/rust/pull/93432


Contributor

Kobzol commented 20 days ago

edited

This is another attempt to land #92103, this time hopefully with an implementation that is correct w.r.t. stable hashing guarantees. The previous PR was reverted because it could produce the same hash for different values even in quite simple situations. I have since added a basic test that should guard against that situation, and in this PR I added a new test specialised for this optimization.

Why this optimization helps

Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for clap). I found that the vast majority of stable-hashed i64 values actually comes from hashing isize (which is converted to i64 in the stable hasher). I only found a single place where this datatype is used directly in the compiler, and this place has also been showing up in the traces I used to find out when isize is being hashed. This place is rustc_span::FileName::DocTest; however, I suppose that isizes also come from other places, but they might not be as easy to find (there were some other entries in the trace). clap hashes about 8.5 million isizes, and all of them fit into a single byte, which is why this optimization has helped it quite a lot.

Now, I'm not sure if special-casing isize is the correct solution here; maybe something could be done with that isize inside DocTest or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside SipHasher128, I put it into StableHasher, and only used it for isize (I tested that it doesn't help for i64, or at least not for clap and the few other benchmarks that I was testing).

New approach

Since the most common case is a single byte, I added a fast path for hashing isize values whose (non-negative) value fits within a single byte, and a cold path for the rest of the values.

To avoid the previous correctness problem, we need to make sure that each unique isize value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed).

We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we simply skipped the leading zero bytes for values that fit within a single byte, (0xFF, 0xFFFFFFFFFFFFFFFF) and (0xFFFFFFFFFFFFFFFF, 0xFF) would send the same hash stream to the hasher, which must not happen.
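The collision above can be demonstrated directly. The sketch below (not the actual compiler code; `naive_encode` is a hypothetical name for the broken scheme) skips leading zero bytes for small values and shows that the two pairs produce byte-for-byte identical hash streams:

```rust
// Hypothetical *broken* encoding: small values are hashed as one byte,
// large values as their full 8 little-endian bytes, with no marker to
// tell the two cases apart.
fn naive_encode(v: u64, out: &mut Vec<u8>) {
    if v <= 0xFF {
        out.push(v as u8); // single significant byte
    } else {
        out.extend_from_slice(&v.to_le_bytes()); // all 8 bytes
    }
}

fn main() {
    let mut a = Vec::new();
    naive_encode(0xFF, &mut a);
    naive_encode(0xFFFFFFFFFFFFFFFF, &mut a);

    let mut b = Vec::new();
    naive_encode(0xFFFFFFFFFFFFFFFF, &mut b);
    naive_encode(0xFF, &mut b);

    // Both streams are nine 0xFF bytes; the hasher cannot distinguish
    // (0xFF, u64::MAX) from (u64::MAX, 0xFF).
    assert_eq!(a, b);
    println!("collision: {:02x?}", a);
}
```

Both pairs collapse into the same nine 0xFF bytes, so any hasher fed this stream would return the same hash for both inputs.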

To avoid this situation, values in [0, 0xFE] are hashed as a single byte. When we hash a larger value (treating isize as u64), we first hash an additional byte 0xFF. Since 0xFF cannot occur when we apply the single-byte optimization, we guarantee that the hash streams will be unique when hashing two values (a, b) and (b, a) if a != b:

  1. When both a and b are within [0, 0xFE], their hash streams will be different.
  2. When neither a nor b is within [0, 0xFE], their hash streams will be different.
  3. When a is within [0, 0xFE] and b isn't, when we hash (a, b), the hash stream will definitely not begin with 0xFF. When we hash (b, a), the hash stream will definitely begin with 0xFF. Therefore the hash streams will be different.
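The three cases above can be sketched as follows. This is an illustrative model of the encoding, not the actual StableHasher code (`encode` is a hypothetical name); it shows the 0xFF marker byte making case 3 unambiguous:

```rust
// Sketch of the PR's scheme: values in [0, 0xFE] take the fast path and
// are hashed as one byte; everything else takes the cold path, which
// first hashes a 0xFF marker byte. The fast path can never emit 0xFF,
// so the marker unambiguously signals "full-width value follows".
fn encode(v: u64, out: &mut Vec<u8>) {
    if v < 0xFF {
        // Fast path: one byte, always in [0x00, 0xFE].
        out.push(v as u8);
    } else {
        // Cold path: marker byte, then the full little-endian value.
        out.push(0xFF);
        out.extend_from_slice(&v.to_le_bytes());
    }
}

fn main() {
    // Case 3: `a` is small, `b` is not.
    let (small, large) = (5u64, u64::MAX);

    let mut ab = Vec::new();
    encode(small, &mut ab); // stream starts with 0x05
    encode(large, &mut ab);

    let mut ba = Vec::new();
    encode(large, &mut ba); // stream starts with the 0xFF marker
    encode(small, &mut ba);

    // The streams differ already at the first byte.
    assert_ne!(ab, ba);
    println!("(a, b): {:02x?}", ab);
    println!("(b, a): {:02x?}", ba);
}
```

With this encoding, a stream beginning with 0xFF can only have come from the cold path, which is exactly the property case 3 relies on.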

r? @the8472

