Deletion from hash tables without tombstones

TL;NR: With linear probing, we can delete elements from an open addressing hash table without tombstones. Here are the C and the C++ implementations.

Introduction

When implementing a hash table based on open addressing , we usually set a tombstone for each deleted element, which indicates a bucket used to have an element. These tombstones maintain proper probe chains in the presence of hash collisions. They are critical to the correctness of open-addressing hash tables. My khash.h and most hash table implementations use tombstones.

However, the tombstone strategy is problematic. These tombstones waste memory and may increase the frequency of expensive rehashing. To alleviate these adverse effects, the FaceBook F14 hash table implements reference-counted tombstones. Such tombstones are removed from the hash table when they are at the end of probe chains. Note that F14 still uses tombstones. It just removes them more effectively.

Deletions without tombstones for linear probing

For a long time, I thought tombstones are inevitable. I was wrong. A recent reddit post pointed out that the wiki linear probing page has already offered a no-tombstone solution for years. The basic idea is not complex: when we delete an element, we move the next element to the deleted location if it is pushed away by the deleted element due to hash collision; we repeat this process until we come to an empty bucket. In C++, this algorithm only takes ~10 lines .

The current wiki page describes the idea well, but it is incomplete –– you can’t implement the algorithm with the description there. Fortunately, Google search directs me to an old StackOverflow answer which gives the correct pseudocode. Unlike F14 or robin-hood hashing, this algorithm doesn’t require any additional storage.

Implementation

I implemented the algorithm in a new experimental hash table library khashl.h along with its C++ version khashl.hpp . I started to use Fibonacci hashing and optional hash value caching as are described inmy earlier post. It uses one bit per bucket to indicate whether the bucket is empty.

Benchmark

I modified myearlier benchmark to evaluate deletions. The new benchmark feeds each hash table library a list of random integers. We insert an integer if it is absent from the table; we delete an integer if it is already in the table. For the largest dataset, there are 50 million input integers but there are only 6.2 million integers left in the final table. There are plenty of deletions.

The timing and memory of several hash tables are shown below:

eiYnY36.png!web

In the figure, each library is associated with five points corresponding to 10, 18, 26, 34, 42 and 50 million input integers. The red circle line shows khashl, the new implementation. It has lower memory footprint across the board.

Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, khash.h doubles the number of buckets, resulting in fewer collisions. It unintentionally trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Multiple copying can be slow. Nonetheless, the no-tombstone strategy is only a little slower and is faster than other libraries for this particular task. It may be faster than khash under a different payload (e.g. batch deletions and then batch insertions). Overall, I think the no-tombstone algorithm is better.

Appendix: comments on other libraries

I am only showing fast libraries in the plot. Adding slow libraries will squeeze all the current points into a corner, making them harder to see.
Many libraries are also evaluated in another benchmark .
phmap is largely a more portable version of abseil hash map . It has similar performance according the other benchmark I was recommending.
I could compile absl::flat_hash_map, but its performance is abysmal. Probably I got something wrong. By the way, Abseil is also harder to link than before. Well, that is google…
Consistent with the other benchmark, emilib is the fastest on inserting 32-bit integers . It is faster than khashl possibly because it uses a small load balancing threshold of 67% (vs 75% with khashl). Emilib is very slow on deletions. I am not sure why.
Like khashl and the Google dense hash table, emilib implements a relatively simple hash table with linear probing. I often wonder if those complex tricks in Abseil and F14 are overkilling.

Introduction

Deletions without tombstones for linear probing

Implementation

Benchmark

Appendix: comments on other libraries

Recommend

神操作！美国程序员把工作外包给中国程序员，上班摸鱼吸猫年入 20 万美元

5G手机竞争激烈：厮杀至千元档 OPPO等掉队

常用Vim命令及实用Vim按键映射配置详解(2019年最新)

Web workers driven real world application built with neo.mjs

无货源跨境电商，先从家里人做起的“新骗局”？

请问用光猫拨号，需要如何配置才能联网？

数据可视化工具 Metabase：不会写 SQL 代码也能做数据分析

MySQL性能优化：MySQL中的隐式转换造成的索引失效

5G 超级 SIM 卡就是伪创新

5G iPhone成最大推力：苹果市值有望超1.7万亿美元

About Joyk