22

Deletion from hash tables without tombstones

 4 years ago
source link: https://attractivechaos.wordpress.com/2019/12/28/deletion-from-hash-tables-without-tombstones/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

TL;NR: With linear probing, we can delete elements from an open addressing hash table without tombstones. Here are the C and the C++ implementations.

Introduction

When implementing a hash table based on open addressing , we usually set a tombstone for each deleted element, which indicates a bucket used to have an element. These tombstones maintain proper probe chains in the presence of hash collisions. They are critical to the correctness of open-addressing hash tables. My khash.h and most hash table implementations use tombstones.

However, the tombstone strategy is problematic. These tombstones waste memory and may increase the frequency of expensive rehashing. To alleviate these adverse effects, the FaceBook F14 hash table implements reference-counted tombstones. Such tombstones are removed from the hash table when they are at the end of probe chains. Note that F14 still uses tombstones. It just removes them more effectively.

Deletions without tombstones for linear probing

For a long time, I thought tombstones are inevitable. I was wrong. A recent reddit post pointed out that the wiki linear probing page has already offered a no-tombstone solution for years. The basic idea is not complex: when we delete an element, we move the next element to the deleted location if it is pushed away by the deleted element due to hash collision; we repeat this process until we come to an empty bucket. In C++, this algorithm only takes ~10 lines .

The current wiki page describes the idea well, but it is incomplete –– you can’t implement the algorithm with the description there. Fortunately, Google search directs me to an old StackOverflow answer which gives the correct pseudocode. Unlike F14 or robin-hood hashing, this algorithm doesn’t require any additional storage.

Implementation

I implemented the algorithm in a new experimental hash table library khashl.h along with its C++ version khashl.hpp . I started to use Fibonacci hashing and optional hash value caching as are described inmy earlier post. It uses one bit per bucket to indicate whether the bucket is empty.

Benchmark

I modified myearlier benchmark to evaluate deletions. The new benchmark feeds each hash table library a list of random integers. We insert an integer if it is absent from the table; we delete an integer if it is already in the table. For the largest dataset, there are 50 million input integers but there are only 6.2 million integers left in the final table. There are plenty of deletions.

The timing and memory of several hash tables are shown below:

eiYnY36.png!web

In the figure, each library is associated with five points corresponding to 10, 18, 26, 34, 42 and 50 million input integers. The red circle line shows khashl, the new implementation. It has lower memory footprint across the board.

Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, khash.h doubles the number of buckets, resulting in fewer collisions. It unintentionally trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Multiple copying can be slow. Nonetheless, the no-tombstone strategy is only a little slower and is faster than other libraries for this particular task. It may be faster than khash under a different payload (e.g. batch deletions and then batch insertions). Overall, I think the no-tombstone algorithm is better.

Appendix: comments on other libraries

  • I am only showing fast libraries in the plot. Adding slow libraries will squeeze all the current points into a corner, making them harder to see.
  • Many libraries are also evaluated in another benchmark .
  • phmap is largely a more portable version of abseil hash map . It has similar performance according the other benchmark I was recommending.
  • I could compile absl::flat_hash_map, but its performance is abysmal. Probably I got something wrong. By the way, Abseil is also harder to link than before. Well, that is google…
  • Consistent with the other benchmark, emilib is the fastest on inserting 32-bit integers . It is faster than khashl possibly because it uses a small load balancing threshold of 67% (vs 75% with khashl). Emilib is very slow on deletions. I am not sure why.
  • Like khashl and the Google dense hash table, emilib implements a relatively simple hash table with linear probing. I often wonder if those complex tricks in Abseil and F14 are overkilling.

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK