57

GitHub - axiomhq/hyperminhash: HyperMinHash: Bringing intersections to HyperLogL...

 6 years ago
source link: https://github.com/axiomhq/hyperminhash
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

HyperMinSketch

Besides being a compact and pretty speedy HyperLogLog implementation for cardinality counting, this modified HyperLogLog allows intersection and similarity estimation of different HyperLogLogs.

Details

A simple implementation of HyperLogLog (LogLog-Beta to be specific):

  • 16 bit registers instead of 6 bit, the new 10 bit are for b-bit signatures
  • Similarity function estimates Jaccard indices (a number between 0-1) of 0.01 for set cardinalities on the order of 1e9 with accuracy around 5%
  • Intersection applies the Jaccard index on the union of the sets to return the intersecting set cardinality

The work is based on "HyperMinHash: Jaccard index sketching in LogLog space - Yun William Yu, Griffin M. Weber"

Example Usage

sk1 := hyperminhash.New()
sk2 := hyperminhash.New()

for i := 0; i < 10000; i++ {
    sk1.Add([]byte(strconv.Itoa(i)))
}

sk1.Cardinality() // 10001 (should be 10000)

for i := 3333; i < 23333; i++ {
    sk2.Add([]byte(strconv.Itoa(i)))
}

sk2.Cardinality()     // 19977 (should be 20000)
sk1.Similarity(sk2)   // 0.284589082 (should be 0.2857326533)
sk1.Intersection(sk2) // 6623 (should be 6667)

sk1.Merge(sk2)
sk1.Cardinality() // 23271 (should be 23333)

Results

Max Cardinality 1e3

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
350 352 752 752 831 832 271 (32.611312%) 274 (32.932692%)
746 748 591 590 834 835 503 (60.311751%) 501 (60.000000%)
248 248 789 791 897 899 140 (15.607581%) 144 (16.017798%)
9 9 818 818 824 825 3 (0.364078%) 3 (0.363636%)
408 411 412 408 771 771 49 (6.355383%) 47 (6.095979%)

Max Cardinality 1e4

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
2126 2138 1162 1158 3063 3060 225 (7.345739%) 223 (7.287582%)
7767 7706 7054 7064 8889 8887 5932 (66.734166%) 5888 (66.254079%)
842 844 5183 5135 5880 5842 145 (2.465986%) 135 (2.310852%)
6833 6791 664 666 7410 7345 87 (1.174089%) 89 (1.211709%)
1814 1820 6214 6169 7697 7639 331 (4.300377%) 320 (4.189030%)

Max Cardinality 1e5

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
29667 29540 88700 88167 92444 91667 25923 (28.041842%) 25036 (27.311901%)
79242 78731 30216 30137 83502 82953 25956 (31.084285%) 25995 (31.337022%)
57830 57223 79550 79194 82112 81595 55268 (67.308067%) 54684 (67.018812%)
64610 63501 21696 21729 75895 74816 10411 (13.717636%) 10083 (13.477064%)
92204 91453 96417 95556 165025 163370 23596 (14.298440%) 24130 (14.770154%)

Max Cardinality 1e6

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
150443 149810 974366 979514 1088517 1096991 36292 (3.334077%) 37417 (3.410876%)
156337 155347 19083 19070 167353 165433 8067 (4.820350%) 8017 (4.846071%)
800969 802044 51053 51244 851388 853396 634 (0.074467%) 511 (0.059878%)
176155 174707 520111 516822 570092 569289 126174 (22.132217%) 123766 (21.740452%)
485954 481362 967341 972651 1081990 1091296 371305 (34.316861%) 376007 (34.455088%)

Max Cardinality 1e7

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
7132942 7150720 122116 121539 7243153 7261709 11905 (0.164362%) 12550 (0.172824%)
8646240 8649049 1277784 1295017 9821480 9854242 102544 (1.044079%) 99163 (1.006298%)
4192390 4164637 2788913 2779975 4526476 4499897 2454827 (54.232630%) 2454356 (54.542493%)
9803344 9826412 1705700 1715798 10255010 10262719 1254034 (12.228501%) 1273821 (12.412120%)
1308849 1322604 9940327 9971519 11179030 11201850 70146 (0.627478%) 80717 (0.720568%)

Max Cardinality 1e8

Set1 HLL1 Set2 HLL2 S1 ∪ S2 HLL1 ∪ HLL2 S1 ∩ S2 HLL1 ∩ HLL2
13237748 13298469 57073758 57124720 59474437 59394847 10837069 (18.221390%) 11143669 (18.762013%)
90757994 88576114 5717797 5701796 95061178 93016636 1414613 (1.488108%) 1350058 (1.451416%)
60150663 60033013 79238333 77672994 110438475 108311818 28950521 (26.214162%) 27666946 (25.543792%)
30187492 30718889 37756209 37153655 67443566 66938074 500135 (0.741561%) 447406 (0.668388%)
53196095 53461989 48344583 47535284 93284291 91321031 8256387 (8.850780%) 8036467 (8.800237%)

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK