42

SmoothieMap 2: the lowest memory hash table - Roman Leventov - Medium

 4 years ago
source link: https://medium.com/@leventov/smoothiemap-2-the-lowest-memory-hash-table-ever-6bebd06780a3
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

SmoothieMap 2: the lowest memory hash table

In this post, I want to announce the revamp of SmoothieMap, an implementation of extendible hashing (a.k.a. dynamic hashing) in Java. This post also features the first implementation of SwissTable algorithm in Java. Performance comparisons are here, too, of course!

Introduction
Memory usage and performance comparisons
Memory usage
Performance
Hash table benchmarking is thankless work
HashMap’s and SwissTable’s performance variability is low
HashMap is still very strong
What’s new in SmoothieMap 2
Applications
Concurrency and specializations

Introduction

In a recent post, I’ve presented my mental framework of thinking about hash table tradeoffs. In my view, the three main dimensions in the play in the hash table algorithm design space are:

  • CPU cost (of hash table operations such as a key search or an insertion) not related to memory access.
  • Memory. The subfactors are the memory footprint, the number of accessed locations during hash table operations, and the relative locality of these places.
  • Variability of the CPU and the memory factors depending on a specific instantiation of a hash table and a specific point in its lifetime.

In this tradeoff space, SmoothieMap fares extremely well on the memory footprint: it‘s probably the leanest among all production hash table implementations ever written, and the variability factor: there is no “cliff” in the memory footprint and performance around the threshold rehashing size (which is typical for most hash table implementations) because SmoothieMap doesn’t have rehashing in the first place: it grows smoothly as the entries are inserted. SmoothieMap also doesn’t have a load factor and performs equally well regardless of whether the expected population of the hash table is specified at the creation time or not.

The absence of rehash and the related latency spike was the main original goal of SmoothieMap which makes it useful in real-time applications.

On the other hand, SmoothieMap has relatively high non-memory related CPU cost of key search and insertion operations. It may be about 50% slower than java.util.HashMap.

This compromise: low memory usage and absence of latency spikes traded for raw throughput may resemble the behavior of low-latency garbage collection algorithms for the JVM, such as Shenandoah and ZGC, versus G1 or Parallel garbage collectors:

Image for post
Image for post
The tradeoff of GC behavior. From Shenandoah GC talk by Aleskey Shipilëv

Memory usage and performance comparisons

Memory usage

Image for post
Image for post
Footprint of Map implementations

The chart above shows the average footprint per entry of several Map implementations depending on the map size. Measurements are done with -XX:-UseCompressedOps JVM flag, i. e. with 8-byte object references. The picture is not radically different with 4-byte object references.

  • HashMap: 58.8–69.3 bytes. The swing is 18% (or less than that, if we take the footprint of the key and value objects stored in the map into consideration).
  • UnifiedMap from Eclipse Collections: 36.2–51.4 bytes, 42% swing.
  • SmoothieMap 1.x: 26.7–33.4 bytes, 25% swing.
  • SmoothieMap 2 in the low-garbage mode, which is equivalent to the only mode available in SmoothieMap 1.x: 25.8–30.9 bytes, 20% swing.
  • SmoothieMap 2 in the mixed mode: 24.8–27.9 bytes, 12.5% swing.
  • SmoothieMap 2 in the footprint mode: 23.7–25.1 bytes, 6% swing. The difference between the three modes of SmoothieMap 2 is explained below in this post, in the section “What’s new in SmoothieMap 2”.
  • SwissTable: 22.8–45.2 bytes, 98% swing. This is a proof-of-concept Java implementation of the SwissTable algorithm. The difference from the original code in C++ is that 0.75 is used as the load factor instead of 0.875, because currently (in the absence of Panama’s Vector API), it’s not possible to make JVM use _mm_cmpeq_epi8 SIMD instruction on 128-bit vectors.

You may notice that all HashMap, UnifiedMap, and SwissTable have the footprint cliff at the same point: this is because they all use 0.75 load factor in this test. It’s also easy to verify the SwissTable’s footprint numbers knowing its data structure layout: theoretically, the average footprint per entry should vary from (8 + 8) / 0.75 + 1 / 0.75 = 22.7 bytes to (8 + 8) / 0.375 + 1 / 0.375 = 45.3 bytes. This matches with the numbers that we’ve got.

There is a characteristic tendency that can be seen in the data above: the more maps based on a single array table (HashMap, UnifiedMap, SwissTable) squeeze the average footprint down, the more pronounced the footprint cliff around the rehash size becomes. This is an example of a tradeoff between the memory footprint and the variability factors in play (see my earlier post for other examples of such tradeoffs).

SwissTable is an open addressing hash table with entries stored inline and it uses power-of-two-sized array tables. This predictably leads to the footprint swing close to 100%. The same swing of 100% would be observed for other open addressing hash table implementations with power-of-two-sized tables, such as fastutil’s Object2ObjectHashMap or Koloboke’s HashObjObjMap. The only difference is that their average footprint would be greater than the footprint of SwissTable because they can’t afford using load factor as high as 0.75.

Performance

Image for post
Image for post
Performance of successful get() operation on Map<Integer, Integer>

The chart above shows the average throughput of successful key lookups (Map.get() is called in a loop on keys in randomized order) depending on the map size (ranging from 1k to 50 million entries) for three implementations: HashMap, SmoothieMap 2 in the low-garbage mode, and SwissTable. The benchmarks were run on OpenJDK 11.0.3, using G1 GC, C2 compiler, with -XX:-UseCompressedOps flag on an n1-standard-8 instance in GCP (that’s why the results are so noisy). Note that both X and Y axes are logarithmic and don’t start with 0.

Hash table benchmarking is thankless work

The most striking on the chart above is that of six possible rankings of the three Maps compared, for some map sizes, five different rankings can be observed with confidence. The sixth ranking (1. Smoothie Map 2. SwissTable 3. HashTable) can actually be found on the chart, too, but at the map size when the throughput of all three implementations is very close, so this is just accidental :)

The explanation is that the dominating factor in hash table’s performance during operations such as key lookup is how well the memory tiers of the data structure fit certain levels of CPU cache and the memory access pattern within these tiers (see my previous post for a more elaborate discussion). Due to their very different memory layouts and footprints, HashMap, SmoothieMap, and SwissTable “exhaust” certain levels of CPU cache and therefore begin performance phase transitions at different map sizes.

Image for post
Image for post

To me, this effect resembles F1 races where tire wear and tank fullness affect car’s speed so pilots may surpass each other on the track many times during the race because they decide to pitstop at different laps, rather than because they suddenly start to race better than their contenders.

Returning to maps in Java. To make matters worse, in addition to map size, the following factors change the performance picture completely:

  • Whether a successful or an unsuccessful key lookup or an insertion operation is tested, or some mix of these operations.
  • The footprint of the keys and the value objects themselves.
  • The garbage collection algorithm used.
  • Whether the keys and the values were allocated shortly before they were put into the map or not. This affects the memory latency cost of HashMap’s Entry objects: if an entry is allocated on the same or an adjacent cache line as the key object itself in TLAB or a memory region then the pointer chasing through the entry may contribute much less to the latency than otherwise.
  • Whether the keys and the values are referenced from somewhere else in the Java heap or solely from the map. This affects how the garbage collector may rearrange the keys, the values, and the entry objects (for HashMap) during compaction cycles and improve their locality as a side effect. See Moving GC and Locality post by Aleksey Shipilëv for a simpler showcase of this effect. For example, many Java Map benchmarks published before might be flawed if they store the keys to query in the benchmarking loop in an array and iterate that array in order. Even if they shuffle the keys in the array after allocation (hoping to avoid the predictable TLAB allocation order in their memory access pattern), a moving GC algorithm may void this attempt by rearranging the key objects according to the new array order. In my benchmark, I avoid this pitfall by adding another level of indirection: an int[] array of indexes in the array of key objects, and then shuffle the array of indexes.
  • Whether the key objects that are stored in the map are passed to get() themselves or other objects which are equal, but not identical to the map-stored objects are queried in the benchmarking loop.
  • In the case of testing successful key search operation, whether we read something from the stored value’s object memory or just check than the key is present in the map, in other words, whether we are interested in Map.get() or merely Map.containsKey().

This means that performance comparison of Map implementations outside of the context of a specific production use case is not just problematic, it’s impossible.

HashMap’s and SwissTable’s performance variability is low

Another feature that we can notice on the performance chart is the “saw” pattern of throughput of HashMap and, perhaps, SwissTable (however, since the “saw” of SwissTable at some map sizes seems to be in antiphase with the “saw” of HashMap, despite logically they should be synchronous because the two Maps use the same load factor 0.75 in this benchmark, the apparent “saw” of SwissTable might be just noise). Nevertheless, the “saw” of HashMap is indisputable. The spikes of the saw correspond to map populations when it hits the load factor and doubles the underlying table.

This shouldn’t be surprising. The valuable observation that we can make, though, is that the maximum local difference between the highs and the lows of HashMap’s performance is approximately 30%, i. e. not that much and unlikely should ever be a concern. This is thanks to HashMap’s collision resolution via separate chaining which is a low-variability approach both in terms of memory footprint and performance. SwissTable (as well as F14) achieve low performance variability while still having high memory footprint variability by checking 8–16 slots at a time using SIMD instructions, effectively leveling the collision chain length factor.

HashMap is still very strong

At least in the specific test configuration used in my benchmark, HashMap is the fastest at many map sizes, and at no point is far behind SmoothieMap or SwissTable. Someone may wonder why HashMap performs so well in Java, while in C++, SwissTable, F14, and many other open addressing hash table variants written by enthusiasts beat std::unordered_map by a huge margin, although both java.util.HashMap and std::unordered_map use the same algorithm: the old good entry-based hash table with collision resolution via chaining and power-of-two-sized arrays.

I suspect this might be not due to an inherent inferiority of collision resolution via separate chaining, but rather have more to do with the particular inefficiency of unordered_map: the extra indirection hop in the data structure and, consequently, during all hash table operations (see the part of Matt Kulukundis’s CppCon 2017 talk for a nice visual representation) imposed by some nasty C++ standard requirement.

Note that in all Martin Ankerl’s benchmarks linked above, “node” versions of all algorithms perform surprisingly close to “flat” versions with entries stored inline, as he notes himself.

There are a few extra factors that might give HashMap a relative edge in Java against SmoothieMap and SwissTable, as well as other open addressing hash tables:

  • The corresponding keys and entry objects being placed adjacently in memory (either in a TLAB or in a memory region, perhaps, after a rearrangement performed by the GC). In C++, this could only happen if the key size is equal to the internal unordered_map’s node size which puts them into the same size class for the memory allocator.
  • Both SmoothieMap and SwissTable use poor man’s replacement for a single _mm_cmpeq_epi8instruction with 7 operations on the dependency chain. Project Panama will change this.
  • The key lookup and insertion operations on SmoothieMap and SwissTable are much more complex than on HashMap. This leads to high register pressure. In JVM, some registers are always used by the runtime (current thread, heap base) and therefore are out of the register allocation pool. Hotspot’s C2 compiler doesn’t do a particularly good job at register allocation. Combined, these factors cause a lot of register trashing (moving variables from stack to registers and back) in the generated assembly and L1/L2 cache misses sprinkled over these operations. Graal performs no better. So, it might be that GCC or clang can crunch the complex SwissTable’s or F14’s code (SmoothieMap’s segments are essentially miniature fixed-sized F14 tables) relatively better than JVM compilers.

What’s new in SmoothieMap 2

SmoothieMap 1.0 was introduced in 2015. Here is the summary of the changes in SmoothieMap 2:

Automatic shrinking. When a SmoothieMap reduces in size after a peak, it can release the extra unused heap space. This feature is turned on by default and could be turned off by calling SmoothieMapBuilder.doShrink(false).

Modes of operation. SmoothieMap 2 introduces three modes of operation to navigate the garbage — footprint tradeoff space:

  • Low-garbage: SmoothieMap generates very little garbage. A SmoothieMap instance keeps using almost all heap memory that it ever allocated throughout its lifetime. The low-garbage mode is equivalent to the behavior of SmoothieMap 1.x.
  • Footprint: SmoothieMap exhibits the lowest possible footprint: the peak footprint in this mode is almost 20% less than the peak footprint in the low-garbage mode (not considering the footprint of the payload key and value objects). On the other hand, in this mode, SmoothieMap produces a substantial amount of garbage while it grows: the cumulative “garbage toll” of a SmoothieMap in this mode might be between 1x and 2x of its footprint. This is comparable to the “garbage toll” of a SwissTable and higher than that of a HashMap (assuming they grow from size 0 and were created without preconfigured capacity).
  • Mixed: a compromise between the low-garbage and the footprint modes. Not too much garbage is produced. The map’s footprint is higher than in the footprint mode and lower than in the low-garbage mode.

To understand how these modes are implemented, check out the documentation for allocateIntermediateCapacitySegments() and splitBetweenTwoNewSegments() methods in SmoothieMap’s Javadoc.

Coping well with a poor distribution of key hash codes (or direct collisions). SmoothieMap 1.x didn’t handle well the case when a lot of keys inserted into a map have colliding hash codes (either fully, or in some range of the bits), leading to excessive memory usage at best, or IllegalStateException at worst. SmoothieMap 2 handles these situations robustly and without performance degradation for the rest of the keys. If there are too many keys with fully colliding hash codes, SmoothieMap 2 organizes them in a balanced tree, like HashMap.

This addition moves SmoothieMap from the proof-of-concept category to production readiness.

Lower memory footprint and better successful lookup performance. SmoothieMap 2 in the low-garbage mode has slightly better footprint per entry than SmoothieMap 1.x (see the memory usage comparison above in this post).

Worse unsuccessful lookup and insertion performance. The throughput of unsuccessful get(), containsKey(), etc. methods, as well as put(), putIfAbsent(), compute(), etc. which insert a new entry into a map is lower in SmoothieMap 2 than in SmoothieMap 1.x, especially when SmoothieMap 2 is used in the mixed or the footprint mode.

Rejecting null key and values. SmoothieMap 1.x aimed to mimic java.util.HashMap as much as possible, allowing to store null key and values. SmoothieMap 2 models more after immutable maps (Map.of(), Guava’s ImmutableMaps) and ConcurrentHashMap which don’t allow storing null key and values and throw a NullPointerException when null is passed to query methods such as get().

Configuration of custom key and value equivalences and hash function using builder. In SmoothieMap 1.x, it was possible to subclass SmoothieMap to override the default usage of Java object equality for comparing keys and values (e. g. to emulate IdentityHashMap’s behavior, or storing objects of classes without properly defined equals(), like byte[], as keys) and specifying a custom hash function instead of calling key.hashCode(). In SmoothieMap 2, the SmoothieMap class is not open for inheritance anymore. Custom equivalences and a hash function should now be configured via SmoothieMapBuilder.

Applications

I think that SmoothieMap may be an interesting choice in the following areas:

  • Applications with hard real-time environments (because SmoothieMap doesn’t exhibit a rehash latency spike), and/or turning off garbage collector (e. g. by using Epsilon GC) and therefore needing the libraries to produce close to zero garbage, as SmoothieMap in the low-garbage mode does.
  • Centralized control nodes in distributed systems that operate humongous maps in their memory, such as NameNode in HDFS or Coordinator in Apache Druid.
  • In-memory data grids and caches. In particular, SmoothieMap’s segments of ~25–50 entries might enable interesting cache replacement policies that are local to the segments (therefore, scalable) but don’t lose much of the precision. This topic can be a subject for future research.

Concurrency and specializations

There are four things that didn’t make the cut into the SmoothieMap library as it is released to Maven Central (io.timeandspace:smoothie-map):

Concurrent SmoothieMap. SmoothieMap is not concurrent, though its internals (striping into small segments) naturally enables efficient implementation of concurrency.

Primitive specializations (such as long keys), Set specialization (not storing values) and KeyedSet abstraction: a set of values that store key/id in a field and could be queried by this id.

Monitoring of poor hash code distribution and active protection against malicious keys inserted into a SmoothieMap. I’ve experimented a lot with these ideas and SmoothieMap has all the necessary infrastructure in place for making such detections, but ultimately decided to leave this functionality out of the publically released version because of some problems with the statistical model.

Specializations for specific modes of operation, “IdentitySmoothieMap”. The version of SmoothieMap which is released as a library to Maven Central is generic, that is, enables configuration of the operation mode, custom key/value equivalences, etc. via flags and fields which adds some inefficiency compared to a version with hard-coded configurations.

If you are interested in one of these things, please let me know at [email protected].


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK