String.hashCode() is plenty unique

I’ve been running across this article all over Reddit for the past couple of days.

It’s driving me crazy.

The article (rightfully) points out that Java’s humble String.hashCode() method — which maps arbitrary-length String objects to 32-bit int values — has collisions. The article also (wrongfully) talks about this basic result from information theory as a revelation, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable however some hashes are worse than others. You can expect String to be fairly poor.

Besides being misspelled, poorly punctuated, and grammatically incorrect, the claim is also wrong. Yes, there are many odd, short String values that generate String.hashCode() collisions. (The article “helpfully” gives many examples, such as !~ and "_ .) However, String.hashCode() manages collisions quite well for more realistic input.

To prove the point, I banged this out:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import java.nio.charset.StandardCharsets;

public class HashTest {
    private static  Map<Integer,Set> collisions(Collection values) {
        Map<Integer,Set> result=new HashMap<>();
        for(T value : values) {
            Integer hc=Integer.valueOf(value.hashCode());
    
            Set bucket=result.get(hc);
            if(bucket == null)
                result.put(hc, bucket = new TreeSet<>());
    
            bucket.add(value);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.err.println("Loading lines from stdin...");
        Set lines=new HashSet<>();
        try (BufferedReader r=new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            for(String line=r.readLine();line!=null;line=r.readLine())
                lines.add(line);
        }

        // Warm up, if you please
        System.err.print("Warming up");
        for(int i=0;i<10;i++) {
            System.err.print(".");
            collisions(lines);
        }
        System.err.println();

        System.err.println("Computing collisions...");
        long start=System.nanoTime();
        Map<Integer,Set> collisions=collisions(lines);
        long finish=System.nanoTime();
        long elapsed=finish-start;

        int maxhc=0, maxsize=0;
        for(Map.Entry<Integer,Set> e : collisions.entrySet()) {
            Integer hc=e.getKey();
            Set bucket=e.getValue();
            if(bucket.size() > maxsize) {
                maxhc = hc.intValue();
                maxsize = bucket.size();
            }
        }

        System.out.println("Elapsed time: "+elapsed+"ns");
        System.out.println("Total unique lines: "+lines.size());
        System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
        System.out.println("Total unique hashcodes: "+collisions.size());
        System.out.println("Total collisions: "+(lines.size()-collisions.size()));
        System.out.println("Collision rate: "+String.format("%.4f", 1.0*(lines.size()-collisions.size())/lines.size()));
        if(maxsize != 0)
            System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
    }
}

Withwords.txt as input, it produces the following output:

$ cat words.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 49117411ns
Total unique lines: 466544
Time per hashcode: 105.2793ns
Total unique hashcodes: 466188
Total collisions: 356
Collision rate: 0.0008
Max collisions: 3 [Jr, KS, L4]

On English words, String.hashCode() ‘s collision rate is 0.0008 (that is, 8 collisions per 10,000).

That seems OK to me.

But why stop there? If we throw the complete works of Shakespeare through, treating each unique line as a unit of input, we get this output:

$ cat shakespeare.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 24106163ns
Total unique lines: 111385
Time per hashcode: 216.4220ns
Total unique hashcodes: 111384
Total collisions: 1
Collision rate: 0.0000
Max collisions: 2 [    There's half a dozen sweets.,   PISANIO. He hath been search'd among the dead and living,]

On lines from Shakespeare, String.hashCode() ‘s collision rate is <0.00001 (that is, less than 1 per 100,000).

So no, String.hashCode() isn’t unique. But it isn’t supposed to be. And it’s plenty unique for its intended purpose, which is splitting String values out across a hash table.

As an aside, claims and “outcomes” like this are why you should never trust any figures that aren’t based on Real World Data. They prove nothing at best, and are actually misleading at worst.

Recommend

很少见的网络故障的快速恢复

威胁预测的“新”定义

What's wrong with popular articles telling you that foo is faster than bar? - Tr...

Machine Learning Can Identify the Authors of Anonymous Code

What is Data Cleansing?

VuePress: a static site generator that excels at working with documentation

Vue CLI 3.0 is here!

1/0 = 0

GitHub - pstadler/dotfiles: For my very own convenience.

为什么团队合作类竞技游戏单排普遍会产生难以爬出鱼塘的现象？ - 知乎

About Joyk