

Sorting Unsigned Integers Faster in Java
source link: https://richardstartin.github.io/posts/sorting-unsigned-integers-faster-in-java
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Sorting Unsigned Integers Faster in Java
I discovered a curious resource for audio-visualising sort algorithms, which is exciting for two reasons. The first is that I finally feel like I understand Alexander Scriabin: he was not a composer. He discovered Tim Sort 80 years before Tim Peters and called it Black Mass. (If you aren’t familiar with the piece, fast-forward to 1:40 to hear the congruence.)
The second reason was that I noticed Radix Sort (LSD). While it was an affront to my senses, it used a mere 800 array accesses and no comparisons! I was unaware of this algorithm so delved deeper and implemented it for integers, and benchmarked my code against Arrays.sort
.
Radix Sort
It is taken as given by many (myself included, or am I just projecting my thoughts on to others?) that latexO(nlogn)latexO(nlogn) is the best you can do in a sort algorithm. But this is actually only true for sort algorithms which depend on comparison. If you can afford to restrict the data types your sort algorithm supports to types with a positional interpretation (java.util
can’t because it needs to be ubiquitous and maintainable), you can get away with a linear time algorithm.
Radix sort, along with the closely related counting sort, does not use comparisons. Instead, the data is interpreted as a fixed length string of symbols. For each position, the cumulative histogram of symbols is computed to calculate sort indices. While the data needs to be scanned several times, the algorithm scales linearly and the overhead of the multiple scans is amortised for large arrays.
As you can see on Wikipedia, there are two kinds of radix sort: Least Significant Digit and Most Significant Digit. This dichotomy relates to the order the (representational) string of symbols is traversed in. I implemented and benchmarked the LSD version for integers.
Implementation
The implementation interprets an integer as the concatenation of n bit string symbols of fixed size size 32/n. It performs n passes over the array, starting with the least significant bits, which it modifies in place. For each pass the data is scanned three times, in order to:
- Compute the cumulative histogram over the symbols in their natural sort order.
- Copy the value with symbol k to the mth position in a buffer, where m is defined by the cumulative density of k.
- Copy the buffer back into the original array.
The implementation, which won’t work unless the chunks are proper divisors of 32, is below. The bonus (or caveat) is that it automatically supports unsigned integers. The code could be modified slightly to work with signed integers at a performance cost.
import java.util.Arrays;
public class RadixSort {
private final int radix;
public RadixSort() {
this(Byte.SIZE);
}
public RadixSort(int radix) {
assert 32 % radix== 0;
this.radix= radix;
}
public void sort(int[] data) {
int[] histogram = new int[(1 << radix) + 1];
int shift = 0;
int mask = (1 << radix) - 1;
int[] copy = new int[data.length];
while (shift < Integer.SIZE) {
Arrays.fill(histogram, 0);
for (int i = 0; i < data.length; ++i) {
++histogram[((data[i] & mask) >> shift) + 1];
}
for (int i = 0; i < 1 << radix; ++i) {
histogram[i + 1] += histogram[i];
}
for (int i = 0; i < data.length; ++i) {
copy[histogram[(data[i] & mask) >> shift]++] = data[i];
}
for (int i = 0; i < data.length; ++i) {
data[i] = copy[i];
}
shift += radix;
mask <<= radix;
}
}
}
The time complexity is obviously linear, a temporary buffer is allocated, but in comparison to Arrays.sort
it looks fairly spartan. Instinctively, cache locality looks fairly poor because the second inner loop of the three jumps all over the place. Will this implementation beat Arrays.sort
(for integers)?
Benchmark
The algorithm is measured using arrays of random positive integers, for which both algorithms are equivalent, from a range of sizes. This isn’t always the best idea (the Tim Sort algorithm comes into its own on nearly sorted data), so take the result below with a pinch of salt. Care must be taken to copy the array in the benchmark since both algorithms are in-place.
public void launchBenchmark(String... jvmArgs) throws Exception {
Options opt = new OptionsBuilder()
.include(this.getClass().getName() + ".*")
.mode(Mode.SampleTime)
.mode(Mode.Throughput)
.timeUnit(TimeUnit.MILLISECONDS)
.measurementTime(TimeValue.seconds(10))
.warmupIterations(10)
.measurementIterations(10)
.forks(1)
.shouldFailOnError(true)
.shouldDoGC(true)
.jvmArgs(jvmArgs)
.resultFormat(ResultFormatType.CSV)
.build();
new Runner(opt).run();
}
@Benchmark
public void Arrays_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
Arrays.sort(array);
bh.consume(array);
}
@Benchmark
public void Radix_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
data.radixSort.sort(array);
bh.consume(array);
}
@State(Scope.Thread)
public static class Data {
@Param({"100", "1000", "10000", "100000", "1000000"})
int size;
int[] data;
RadixSort radixSort = new RadixSort();
@Setup(Level.Trial)
public void init() {
data = createArray(size);
}
}
public static int[] createArray(int size) {
int[] array = new int[size];
Random random = new Random(0);
for (int i = 0; i < size; ++i) {
array[i] = Math.abs(random.nextInt());
}
return array;
}
Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size Arrays_Sort thrpt 1 10 1304.687189 147.380334 ops/ms 100 Arrays_Sort thrpt 1 10 78.518664 9.339994 ops/ms 1000 Arrays_Sort thrpt 1 10 1.700208 0.091836 ops/ms 10000 Arrays_Sort thrpt 1 10 0.133835 0.007146 ops/ms 100000 Arrays_Sort thrpt 1 10 0.010560 0.000409 ops/ms 1000000 Radix_Sort thrpt 1 10 404.807772 24.930898 ops/ms 100 Radix_Sort thrpt 1 10 51.787409 4.881181 ops/ms 1000 Radix_Sort thrpt 1 10 6.065590 0.576709 ops/ms 10000 Radix_Sort thrpt 1 10 0.620338 0.068776 ops/ms 100000 Radix_Sort thrpt 1 10 0.043098 0.004481 ops/ms 1000000 Arrays_Sort sample 1 3088586 0.000902 0.000018 ms/op 100 Arrays_Sort·p0.00 sample 1 1 0.000394
ms/op 100 Arrays_Sort·p0.50 sample 1 1 0.000790
ms/op 100 Arrays_Sort·p0.90 sample 1 1 0.000791
ms/op 100 Arrays_Sort·p0.95 sample 1 1 0.001186
ms/op 100 Arrays_Sort·p0.99 sample 1 1 0.001974
ms/op 100 Arrays_Sort·p0.999 sample 1 1 0.020128
ms/op 100 Arrays_Sort·p0.9999 sample 1 1 0.084096
ms/op 100 Arrays_Sort·p1.00 sample 1 1 4.096000
ms/op 100 Arrays_Sort sample 1 2127794 0.011876 0.000037 ms/op 1000 Arrays_Sort·p0.00 sample 1 1 0.007896
ms/op 1000 Arrays_Sort·p0.50 sample 1 1 0.009872
ms/op 1000 Arrays_Sort·p0.90 sample 1 1 0.015408
ms/op 1000 Arrays_Sort·p0.95 sample 1 1 0.024096
ms/op 1000 Arrays_Sort·p0.99 sample 1 1 0.033920
ms/op 1000 Arrays_Sort·p0.999 sample 1 1 0.061568
ms/op 1000 Arrays_Sort·p0.9999 sample 1 1 0.894976
ms/op 1000 Arrays_Sort·p1.00 sample 1 1 4.448256
ms/op 1000 Arrays_Sort sample 1 168991 0.591169 0.001671 ms/op 10000 Arrays_Sort·p0.00 sample 1 1 0.483840
ms/op 10000 Arrays_Sort·p0.50 sample 1 1 0.563200
ms/op 10000 Arrays_Sort·p0.90 sample 1 1 0.707584
ms/op 10000 Arrays_Sort·p0.95 sample 1 1 0.766976
ms/op 10000 Arrays_Sort·p0.99 sample 1 1 0.942080
ms/op 10000 Arrays_Sort·p0.999 sample 1 1 2.058273
ms/op 10000 Arrays_Sort·p0.9999 sample 1 1 7.526102
ms/op 10000 Arrays_Sort·p1.00 sample 1 1 46.333952
ms/op 10000 Arrays_Sort sample 1 13027 7.670135 0.021512 ms/op 100000 Arrays_Sort·p0.00 sample 1 1 6.356992
ms/op 100000 Arrays_Sort·p0.50 sample 1 1 7.634944
ms/op 100000 Arrays_Sort·p0.90 sample 1 1 8.454144
ms/op 100000 Arrays_Sort·p0.95 sample 1 1 8.742502
ms/op 100000 Arrays_Sort·p0.99 sample 1 1 9.666560
ms/op 100000 Arrays_Sort·p0.999 sample 1 1 12.916883
ms/op 100000 Arrays_Sort·p0.9999 sample 1 1 28.037900
ms/op 100000 Arrays_Sort·p1.00 sample 1 1 28.573696
ms/op 100000 Arrays_Sort sample 1 1042 96.278673 0.603645 ms/op 1000000 Arrays_Sort·p0.00 sample 1 1 86.114304
ms/op 1000000 Arrays_Sort·p0.50 sample 1 1 94.896128
ms/op 1000000 Arrays_Sort·p0.90 sample 1 1 104.293990
ms/op 1000000 Arrays_Sort·p0.95 sample 1 1 106.430464
ms/op 1000000 Arrays_Sort·p0.99 sample 1 1 111.223767
ms/op 1000000 Arrays_Sort·p0.999 sample 1 1 134.172770
ms/op 1000000 Arrays_Sort·p0.9999 sample 1 1 134.742016
ms/op 1000000 Arrays_Sort·p1.00 sample 1 1 134.742016
ms/op 1000000 Radix_Sort sample 1 2240042 0.002941 0.000033 ms/op 100 Radix_Sort·p0.00 sample 1 1 0.001578
ms/op 100 Radix_Sort·p0.50 sample 1 1 0.002368
ms/op 100 Radix_Sort·p0.90 sample 1 1 0.003556
ms/op 100 Radix_Sort·p0.95 sample 1 1 0.004344
ms/op 100 Radix_Sort·p0.99 sample 1 1 0.011056
ms/op 100 Radix_Sort·p0.999 sample 1 1 0.027232
ms/op 100 Radix_Sort·p0.9999 sample 1 1 0.731127
ms/op 100 Radix_Sort·p1.00 sample 1 1 5.660672
ms/op 100 Radix_Sort sample 1 2695825 0.018553 0.000038 ms/op 1000 Radix_Sort·p0.00 sample 1 1 0.013424
ms/op 1000 Radix_Sort·p0.50 sample 1 1 0.016576
ms/op 1000 Radix_Sort·p0.90 sample 1 1 0.025280
ms/op 1000 Radix_Sort·p0.95 sample 1 1 0.031200
ms/op 1000 Radix_Sort·p0.99 sample 1 1 0.050944
ms/op 1000 Radix_Sort·p0.999 sample 1 1 0.082944
ms/op 1000 Radix_Sort·p0.9999 sample 1 1 0.830295
ms/op 1000 Radix_Sort·p1.00 sample 1 1 6.660096
ms/op 1000 Radix_Sort sample 1 685589 0.145695 0.000234 ms/op 10000 Radix_Sort·p0.00 sample 1 1 0.112512
ms/op 10000 Radix_Sort·p0.50 sample 1 1 0.128000
ms/op 10000 Radix_Sort·p0.90 sample 1 1 0.196608
ms/op 10000 Radix_Sort·p0.95 sample 1 1 0.225792
ms/op 10000 Radix_Sort·p0.99 sample 1 1 0.309248
ms/op 10000 Radix_Sort·p0.999 sample 1 1 0.805888
ms/op 10000 Radix_Sort·p0.9999 sample 1 1 1.818141
ms/op 10000 Radix_Sort·p1.00 sample 1 1 14.401536
ms/op 10000 Radix_Sort sample 1 60843 1.641961 0.005783 ms/op 100000 Radix_Sort·p0.00 sample 1 1 1.251328
ms/op 100000 Radix_Sort·p0.50 sample 1 1 1.542144
ms/op 100000 Radix_Sort·p0.90 sample 1 1 2.002944
ms/op 100000 Radix_Sort·p0.95 sample 1 1 2.375680
ms/op 100000 Radix_Sort·p0.99 sample 1 1 3.447030
ms/op 100000 Radix_Sort·p0.999 sample 1 1 5.719294
ms/op 100000 Radix_Sort·p0.9999 sample 1 1 8.724165
ms/op 100000 Radix_Sort·p1.00 sample 1 1 13.074432
ms/op 100000 Radix_Sort sample 1 4846 20.640787 0.260926 ms/op 1000000 Radix_Sort·p0.00 sample 1 1 14.893056
ms/op 1000000 Radix_Sort·p0.50 sample 1 1 18.743296
ms/op 1000000 Radix_Sort·p0.90 sample 1 1 26.673152
ms/op 1000000 Radix_Sort·p0.95 sample 1 1 30.724915
ms/op 1000000 Radix_Sort·p0.99 sample 1 1 40.470446
ms/op 1000000 Radix_Sort·p0.999 sample 1 1 63.016600
ms/op 1000000 Radix_Sort·p0.9999 sample 1 1 136.052736
ms/op 1000000 Radix_Sort·p1.00 sample 1 1 136.052736
ms/op 1000000
The table tells an interesting story. Arrays.sort
is vastly superior for small arrays (the arrays most people have), but for large arrays the custom implementation comes into its own. Interestingly, this is consistent with the computer science. If you need to sort large arrays of (unsigned) integers and care about performance, think about implementing radix sort.
Recommend
-
20
Radix sort sorts n w-bit integers by splitting them up into chunks of log n \log n lo g ...
-
9
Nick Scialli • April 19, 2021 • 🚀 4 minute readIf you use high-level programming languages, you may have come across signed and unsigned integers. Here’s a quick primer on what they are. ...
-
9
Sorting the table Ignoring special characters and integers? advertisements I have been trying to sort and array with strings composed of lette...
-
10
Copy link Contributor a1phyr comme...
-
4
On finding the average of two unsigned integers without overflow Raymond February 7th, 2022
-
13
Converting integers to decimal strings faster with AVX-512 In most systems, integers are stored using a fixed binary representation. It is common to store integers using 32-bit or 64-bit words. You sometimes need to convert...
-
8
Faster sorting with Go generics ...
-
13
SIMD accelerated sorting in Java - how it works and why it was 3x faster 09 Jun 2022 In this post I explain a little about how to use Java’s Vector APIs, attempt to explain how they turn out fast, and then use them to implement a sort...
-
6
Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy
-
10
Faster sorting algorithms discovered using deep reinforcement learningAbstractFundamental algorithms such as sorting or hashing are used trillions of times on...
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK