3

Do not hash leading zero bytes of i64 numbers in Sip128 hasher by Kobzol · Pull...

 2 years ago
source link: https://github.com/rust-lang/rust/pull/92103
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Copy link

Contributor

Kobzol commented 29 days ago

I was poking into the stable hasher, trying to improve its performance by compressing the number of hashed bytes. First I was experimenting with LEB128, but it was painful to implement because of the many assumptions that the SipHasher makes, so I tried something simpler - just ignoring leading zero bytes. For example, if an 8-byte integer can fit into a 4-byte integer, I will just hash the four bytes.

I wonder if this could produce any hashing ambiguity. Originally I thought so, but then I struggled to find any counter-example where this could cause different values to have the same hash. I'd be glad for any examples that could be broken by this (there are some ways of mitigating it if that would be the case). It could happen if you had e.g. 2x u8 vs 1x u16 hashed after one another in two different runs, but that can also happen now, without this "trick". And with collections, it should be fine, because the length is included in their hash.

I gathered some statistics for common values used in the clap benchmark. I observed that especially i64 often had very low values, so I started with that type, let's see what perf does on CI.

There are some tradeoffs that we can try:

  1. What types to use this optimization for? u64, u32, u16? Locally it was a slight loss for u64, I noticed that its values are often quite large.
  2. What byte sizes to check? E.g. we can only distinguish between u64/u32 or u64/u8 instead of u64/u32/u16/u8 to reduce branching (with i64 it seemed to be better to go all the way down to u8 locally though).

(The macro was introduced because I expect that I will be trying out this "trick" for different types).

Can you please schedule a perf. run? Thanks.

r? @the8472


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK