Sorting Unsigned Integers Faster in Java

Jun 30, 2017 | java analysis

I discovered a curious resource for audio-visualising sort algorithms, which is exciting for two reasons. The first is that I finally feel like I understand Alexander Scriabin: he was not a composer. He discovered Tim Sort 80 years before Tim Peters and called it Black Mass. (If you aren’t familiar with the piece, fast-forward to 1:40 to hear the congruence.)

The second reason was that I noticed Radix Sort (LSD). While it was an affront to my senses, it used a mere 800 array accesses and no comparisons! I was unaware of this algorithm so delved deeper and implemented it for integers, and benchmarked my code against Arrays.sort.

Radix Sort

It is taken as given by many (myself included, or am I just projecting my thoughts on to others?) that latexO(nlogn)latexO(nlog⁡n) is the best you can do in a sort algorithm. But this is actually only true for sort algorithms which depend on comparison. If you can afford to restrict the data types your sort algorithm supports to types with a positional interpretation (java.util can’t because it needs to be ubiquitous and maintainable), you can get away with a linear time algorithm.

Radix sort, along with the closely related counting sort, does not use comparisons. Instead, the data is interpreted as a fixed length string of symbols. For each position, the cumulative histogram of symbols is computed to calculate sort indices. While the data needs to be scanned several times, the algorithm scales linearly and the overhead of the multiple scans is amortised for large arrays.

As you can see on Wikipedia, there are two kinds of radix sort: Least Significant Digit and Most Significant Digit. This dichotomy relates to the order the (representational) string of symbols is traversed in. I implemented and benchmarked the LSD version for integers.

Implementation

The implementation interprets an integer as the concatenation of n bit string symbols of fixed size size 32/n. It performs n passes over the array, starting with the least significant bits, which it modifies in place. For each pass the data is scanned three times, in order to:

Compute the cumulative histogram over the symbols in their natural sort order.
Copy the value with symbol k to the mth position in a buffer, where m is defined by the cumulative density of k.
Copy the buffer back into the original array.

The implementation, which won’t work unless the chunks are proper divisors of 32, is below. The bonus (or caveat) is that it automatically supports unsigned integers. The code could be modified slightly to work with signed integers at a performance cost.

import java.util.Arrays;

public class RadixSort {

    private final int radix;

    public RadixSort() {
        this(Byte.SIZE);
    }

    public RadixSort(int radix) {
        assert 32 % radix== 0;
        this.radix= radix;
    }

    public void sort(int[] data) {
        int[] histogram = new int[(1 << radix) + 1];
        int shift = 0;
        int mask = (1 << radix) - 1;
        int[] copy = new int[data.length];
        while (shift < Integer.SIZE) {
            Arrays.fill(histogram, 0);
            for (int i = 0; i < data.length; ++i) {
                ++histogram[((data[i] & mask) >> shift) + 1];
            }
            for (int i = 0; i < 1 << radix; ++i) {
                histogram[i + 1] += histogram[i];
            }
            for (int i = 0; i < data.length; ++i) {
                copy[histogram[(data[i] & mask) >> shift]++] = data[i];
            }
            for (int i = 0; i < data.length; ++i) {
                data[i] = copy[i];
            }
            shift += radix;
            mask <<= radix;
        }
    }
}

The time complexity is obviously linear, a temporary buffer is allocated, but in comparison to Arrays.sort it looks fairly spartan. Instinctively, cache locality looks fairly poor because the second inner loop of the three jumps all over the place. Will this implementation beat Arrays.sort (for integers)?

Benchmark

The algorithm is measured using arrays of random positive integers, for which both algorithms are equivalent, from a range of sizes. This isn’t always the best idea (the Tim Sort algorithm comes into its own on nearly sorted data), so take the result below with a pinch of salt. Care must be taken to copy the array in the benchmark since both algorithms are in-place.

public void launchBenchmark(String... jvmArgs) throws Exception {
        Options opt = new OptionsBuilder()
                .include(this.getClass().getName() + ".*")
                .mode(Mode.SampleTime)
                .mode(Mode.Throughput)
                .timeUnit(TimeUnit.MILLISECONDS)
                .measurementTime(TimeValue.seconds(10))
                .warmupIterations(10)
                .measurementIterations(10)
                .forks(1)
                .shouldFailOnError(true)
                .shouldDoGC(true)
                .jvmArgs(jvmArgs)
                .resultFormat(ResultFormatType.CSV) 
                .build();

        new Runner(opt).run();
    }


    @Benchmark
    public void Arrays_Sort(Data data, Blackhole bh) {
        int[] array = Arrays.copyOf(data.data, data.size);
        Arrays.sort(array);
        bh.consume(array);
    }

    @Benchmark
    public void Radix_Sort(Data data, Blackhole bh) {
        int[] array = Arrays.copyOf(data.data, data.size);
        data.radixSort.sort(array);
        bh.consume(array);
    }


    @State(Scope.Thread)
    public static class Data {

        @Param({"100", "1000", "10000", "100000", "1000000"})
        int size;

        int[] data;
        RadixSort radixSort = new RadixSort();

        @Setup(Level.Trial)
        public void init() {
            data = createArray(size);
        }
    }

    public static int[] createArray(int size) {
        int[] array = new int[size];
        Random random = new Random(0);
        for (int i = 0; i < size; ++i) {
            array[i] = Math.abs(random.nextInt());
        }
        return array;
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size Arrays_Sort thrpt 1 10 1304.687189 147.380334 ops/ms 100 Arrays_Sort thrpt 1 10 78.518664 9.339994 ops/ms 1000 Arrays_Sort thrpt 1 10 1.700208 0.091836 ops/ms 10000 Arrays_Sort thrpt 1 10 0.133835 0.007146 ops/ms 100000 Arrays_Sort thrpt 1 10 0.010560 0.000409 ops/ms 1000000 Radix_Sort thrpt 1 10 404.807772 24.930898 ops/ms 100 Radix_Sort thrpt 1 10 51.787409 4.881181 ops/ms 1000 Radix_Sort thrpt 1 10 6.065590 0.576709 ops/ms 10000 Radix_Sort thrpt 1 10 0.620338 0.068776 ops/ms 100000 Radix_Sort thrpt 1 10 0.043098 0.004481 ops/ms 1000000 Arrays_Sort sample 1 3088586 0.000902 0.000018 ms/op 100 Arrays_Sort·p0.00 sample 1 1 0.000394

ms/op 100 Arrays_Sort·p0.50 sample 1 1 0.000790

ms/op 100 Arrays_Sort·p0.90 sample 1 1 0.000791

ms/op 100 Arrays_Sort·p0.95 sample 1 1 0.001186

ms/op 100 Arrays_Sort·p0.99 sample 1 1 0.001974

ms/op 100 Arrays_Sort·p0.999 sample 1 1 0.020128

ms/op 100 Arrays_Sort·p0.9999 sample 1 1 0.084096

ms/op 100 Arrays_Sort·p1.00 sample 1 1 4.096000

ms/op 100 Arrays_Sort sample 1 2127794 0.011876 0.000037 ms/op 1000 Arrays_Sort·p0.00 sample 1 1 0.007896

ms/op 1000 Arrays_Sort·p0.50 sample 1 1 0.009872

ms/op 1000 Arrays_Sort·p0.90 sample 1 1 0.015408

ms/op 1000 Arrays_Sort·p0.95 sample 1 1 0.024096

ms/op 1000 Arrays_Sort·p0.99 sample 1 1 0.033920

ms/op 1000 Arrays_Sort·p0.999 sample 1 1 0.061568

ms/op 1000 Arrays_Sort·p0.9999 sample 1 1 0.894976

ms/op 1000 Arrays_Sort·p1.00 sample 1 1 4.448256

ms/op 1000 Arrays_Sort sample 1 168991 0.591169 0.001671 ms/op 10000 Arrays_Sort·p0.00 sample 1 1 0.483840

ms/op 10000 Arrays_Sort·p0.50 sample 1 1 0.563200

ms/op 10000 Arrays_Sort·p0.90 sample 1 1 0.707584

ms/op 10000 Arrays_Sort·p0.95 sample 1 1 0.766976

ms/op 10000 Arrays_Sort·p0.99 sample 1 1 0.942080

ms/op 10000 Arrays_Sort·p0.999 sample 1 1 2.058273

ms/op 10000 Arrays_Sort·p0.9999 sample 1 1 7.526102

ms/op 10000 Arrays_Sort·p1.00 sample 1 1 46.333952

ms/op 10000 Arrays_Sort sample 1 13027 7.670135 0.021512 ms/op 100000 Arrays_Sort·p0.00 sample 1 1 6.356992

ms/op 100000 Arrays_Sort·p0.50 sample 1 1 7.634944

ms/op 100000 Arrays_Sort·p0.90 sample 1 1 8.454144

ms/op 100000 Arrays_Sort·p0.95 sample 1 1 8.742502

ms/op 100000 Arrays_Sort·p0.99 sample 1 1 9.666560

ms/op 100000 Arrays_Sort·p0.999 sample 1 1 12.916883

ms/op 100000 Arrays_Sort·p0.9999 sample 1 1 28.037900

ms/op 100000 Arrays_Sort·p1.00 sample 1 1 28.573696

ms/op 100000 Arrays_Sort sample 1 1042 96.278673 0.603645 ms/op 1000000 Arrays_Sort·p0.00 sample 1 1 86.114304

ms/op 1000000 Arrays_Sort·p0.50 sample 1 1 94.896128

ms/op 1000000 Arrays_Sort·p0.90 sample 1 1 104.293990

ms/op 1000000 Arrays_Sort·p0.95 sample 1 1 106.430464

ms/op 1000000 Arrays_Sort·p0.99 sample 1 1 111.223767

ms/op 1000000 Arrays_Sort·p0.999 sample 1 1 134.172770

ms/op 1000000 Arrays_Sort·p0.9999 sample 1 1 134.742016

ms/op 1000000 Arrays_Sort·p1.00 sample 1 1 134.742016

ms/op 1000000 Radix_Sort sample 1 2240042 0.002941 0.000033 ms/op 100 Radix_Sort·p0.00 sample 1 1 0.001578

ms/op 100 Radix_Sort·p0.50 sample 1 1 0.002368

ms/op 100 Radix_Sort·p0.90 sample 1 1 0.003556

ms/op 100 Radix_Sort·p0.95 sample 1 1 0.004344

ms/op 100 Radix_Sort·p0.99 sample 1 1 0.011056

ms/op 100 Radix_Sort·p0.999 sample 1 1 0.027232

ms/op 100 Radix_Sort·p0.9999 sample 1 1 0.731127

ms/op 100 Radix_Sort·p1.00 sample 1 1 5.660672

ms/op 100 Radix_Sort sample 1 2695825 0.018553 0.000038 ms/op 1000 Radix_Sort·p0.00 sample 1 1 0.013424

ms/op 1000 Radix_Sort·p0.50 sample 1 1 0.016576

ms/op 1000 Radix_Sort·p0.90 sample 1 1 0.025280

ms/op 1000 Radix_Sort·p0.95 sample 1 1 0.031200

ms/op 1000 Radix_Sort·p0.99 sample 1 1 0.050944

ms/op 1000 Radix_Sort·p0.999 sample 1 1 0.082944

ms/op 1000 Radix_Sort·p0.9999 sample 1 1 0.830295

ms/op 1000 Radix_Sort·p1.00 sample 1 1 6.660096

ms/op 1000 Radix_Sort sample 1 685589 0.145695 0.000234 ms/op 10000 Radix_Sort·p0.00 sample 1 1 0.112512

ms/op 10000 Radix_Sort·p0.50 sample 1 1 0.128000

ms/op 10000 Radix_Sort·p0.90 sample 1 1 0.196608

ms/op 10000 Radix_Sort·p0.95 sample 1 1 0.225792

ms/op 10000 Radix_Sort·p0.99 sample 1 1 0.309248

ms/op 10000 Radix_Sort·p0.999 sample 1 1 0.805888

ms/op 10000 Radix_Sort·p0.9999 sample 1 1 1.818141

ms/op 10000 Radix_Sort·p1.00 sample 1 1 14.401536

ms/op 10000 Radix_Sort sample 1 60843 1.641961 0.005783 ms/op 100000 Radix_Sort·p0.00 sample 1 1 1.251328

ms/op 100000 Radix_Sort·p0.50 sample 1 1 1.542144

ms/op 100000 Radix_Sort·p0.90 sample 1 1 2.002944

ms/op 100000 Radix_Sort·p0.95 sample 1 1 2.375680

ms/op 100000 Radix_Sort·p0.99 sample 1 1 3.447030

ms/op 100000 Radix_Sort·p0.999 sample 1 1 5.719294

ms/op 100000 Radix_Sort·p0.9999 sample 1 1 8.724165

ms/op 100000 Radix_Sort·p1.00 sample 1 1 13.074432

ms/op 100000 Radix_Sort sample 1 4846 20.640787 0.260926 ms/op 1000000 Radix_Sort·p0.00 sample 1 1 14.893056

ms/op 1000000 Radix_Sort·p0.50 sample 1 1 18.743296

ms/op 1000000 Radix_Sort·p0.90 sample 1 1 26.673152

ms/op 1000000 Radix_Sort·p0.95 sample 1 1 30.724915

ms/op 1000000 Radix_Sort·p0.99 sample 1 1 40.470446

ms/op 1000000 Radix_Sort·p0.999 sample 1 1 63.016600

ms/op 1000000 Radix_Sort·p0.9999 sample 1 1 136.052736

ms/op 1000000 Radix_Sort·p1.00 sample 1 1 136.052736

ms/op 1000000

The table tells an interesting story. Arrays.sort is vastly superior for small arrays (the arrays most people have), but for large arrays the custom implementation comes into its own. Interestingly, this is consistent with the computer science. If you need to sort large arrays of (unsigned) integers and care about performance, think about implementing radix sort.

Sorting Unsigned Integers Faster in Java

Sorting Unsigned Integers Faster in Java

Radix Sort

Implementation

Benchmark

Recommend

Faster than radix sort: Kirkpatrick-Reisch sorting

A Quick Primer on Signed vs. Unsigned Integers

Sorting the table Ignoring special characters and integers?

Add functions to add unsigned and signed integers by a1phyr · Pull Request #8760...

On finding the average of two unsigned integers without overflow

Converting integers to decimal strings faster with AVX-512

Faster Sorting with Go Generics

SIMD accelerated sorting in Java - how it works and why it was 3x faster

Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy

Faster sorting algorithms discovered using deep reinforcement learning

About Joyk