9

Linq to Collections: Beyond IEnumerable<T>

 3 years ago
source link: http://twistedoakstudios.com/blog/Post1585_linq-to-collections-beyond-ienumerablet
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Linq to Collections: Beyond IEnumerable<T>

posted by Craig Gidney on December 18, 2012

The latest version of the .Net framework (4.5) includes several new interfaces representing readonly collections: IReadOnlyCollection<T>, IReadOnlyList<T>, and IReadOnlyDictionary<T>. In this post I’ll be explaining how to use these types to improve the ease-of-reasoning-about and efficiency (particularly with respect to memory usage).

The methods and types I’ll be describing are available as a software library (called “LinqToCollections”, in the public domain). You can view the library’s source code on GitHub or directly reference its NuGet package.

Motivation

Before I start talking about really taking advantage of the new readonly collection interfaces, I want to justify why it’s worth bothering with them in the first place. Why not just use the array or IList<T> types? Why not just use the more general IEnumerable<T> type? It’s worked in the past, so why not continue doing so?

First, it is better to ask for an IReadOnlyList<T>, as opposed to an IList<T> or an array, when you don’t need the ability to mutate. Doing so obeys the principle of asking only for what you need. If you don’t need the ability to mutate what you’re given, don’t ask for a thing with that ability. This increases the number of cases where your method can be used and makes reasoning about it easier. If you ask for an IReadOnlyList<T>, then a user (or automated tool) can trivially determine that their precious list won’t be mangled by your method. But, if you ask for an IList<T>, they have to dig through documentation, inspect the source code, or (more commonly) guess in order to make the same determination. For example, you’re less likely to mistakenly reverse the arguments to a memcpy-like method that asks for a readonly source because the wrong ordering can fail type checking.

Second, it is better to expose an IReadOnlyList<T>, instead of an IList<T> or an array, when you don’t allow the ability to mutate. This follows the principle of providing what you promise. If you return an IList<T> (that happens to be readonly at runtime) instead of an IReadOnlyList<T>, users will be mislead into trying to modify the result. For example, before following the link to the documentation, tell me whether Dictionary<K, V>.KeyCollection.Remove does the same thing as Dictionary<K, V>.Remove or else fails with a NotSupportedException. The operation makes sense, implying it should do the same thing is Dictionary.Remove, but it’s a bit odd to have a collection you can remove from but not add to, implying it should fail. This question wouldn’t even come up if KeyCollection implemented IReadOnlyCollection<T> instead of ICollection<T>.

Finally, asking for an IReadOnlyList<T>, instead of an IEnumerable<T> that you immediately transform into an IReadOnlyList<T>, is more efficient. You should be asking for what you actually need, so that callers that happen to already have it can avoid unnecessary transformation costs (asking for what you actually need also helps with unit testing). Note that the overload taking IEnumerable<T> can still exist, but it becomes a mere convenience method that delegates to the overload taking an IReadOnlyList<T>. A good example of this optimization is the Enumerable.Reverse method. Consider this slightly-simplified implementation:

public static IEnumerable<T> Reverse<T>(this IEnumerable<T> source) {
    var buffer = source.ToArray();
    for (int i = buffer.Length - 1; i >= 0; --i)
        yield return buffer[i];
}

Enumerable.Reverse has to make a copy of the input sequence, using linear time and consuming linear memory before it can even yield the first item! However, if callers have an IReadOnlyList<T>, the expensive step can be omitted altogether:

public static IEnumerable<T> Reverse<T>(this IReadOnlyList<T> source) {
    for (int i = source.Count - 1; i >= 0; --i)
        yield return source[i];
}

Asking for a list instead of an enumerable reduces both the memory usage and time-to-first-item from linear to constant. Calling this a big improvement is a bit of an understatement. This is not an isolated case either: the cost of lots of other operations drops from linear to constant when working on a list instead of an enumerable. For example: “pick a random item”, “skip the first/last N items”, “take the last N items”, “give me every N’th item”, “partition items into adjacent groups of size N”, not to mention things like binary search and graph traversal (when using an adjacency matrix).

Again, readonly interfaces help you ask for what you actually need, not ask for more than you need, and not promise more than you provide. Just refactoring existing parameter/return/getter types to be readonly types has understandability and (in some cases) performance benefits.

Implicit Savings

One of the useful ideas present in Linq (and functional programming in general) is to representing things implicitly, via a computational process. This idea applies well to readonly collections. If you’ve used Linq extensively, you’re already familiar with this concept via enumerables:

// "contains" 0,1,4,9,16,...,(10^9-1)^2 but doesn't use gigabytes of memory
IEnumerable<int> squares = Enumerable.Range(0, 1000000000).Select(e => (long)e * e);

The above code creates an enumerable containing a billion items, but consumes less than a hundred bytes of memory. This is possible because the enumerable’s items are defined by a computation instead of being stored explicitly. The range enumerator doesn’t store an array containing the numbers 0 through a billion and iterate over it, it stores a counter that is yielded and incremented each time an item is requested. The squaring projection also isn’t storing all of its results anywhere: it just squares and yields the next result from the range enumerator whenever the next projected item is requested.

The same trick can be applied to lists. We don’t have the benefit of iterator methods to keep the code compact, but we can use an anonymous implementation class for the same effect. For example, here is a Range method that returns an IReadOnlyList<int> instead of an IEnumerable<int>:

public static IReadOnlyList<int> Range(this int count) {
    return new AnonymousReadOnlyList<int>(count, getter: index => index);
}

AnonymousReadOnlyList<T> implements IReadOnlyList<T> by using the count and item-getter being provided. When a caller evaluates “someRange[2]”, AnonymousReadOnlyList will check that 0 <= 2 < count and then pass 2 into the anonymous function “index => index” in order to get the result (also 2) to pass back to the caller. Because the list is defined by a computational process (instead of being stored as an array), it takes ~80 bytes (instead of billions) to store. The benefit of having a list range instead of an enumerable range is that it supports random access:

IEnumerable<long> enumRange = Enumerable.Range(1000000000);
IReadOnlyList<long> listRange = ReadOnlyList.Range(1000000000);
// takes nanoseconds to evaluate
var s1 = listRange[999999999];
// takes SECONDS to evaluate
var s2 = enumRange.ElementAt(999999999);

Many more methods can be lifted from enumerable-land to list-land. The Reverse method, mentioned earlier, is a great example:

public static IReadOnlyList<T> Reverse<T>(this IReadOnlyList<T> source) {
    return new AnonymousReadOnlyList<T>(
        counter: () => source.Count,
        getter: index => source[source.Count - 1 - index]);
}

Now when we reverse a list we don’t lose random-access capabilities. We even get all the standard benefits of deferred execution. For example, if you store “var rev = list.Reverse()” then execute “list.Add(-3)”, you’ll find that rev has been prepended with -3.

Another good example is the Select method, which enables writing query expressions that take and return lists:

public static IReadOnlyList<R> Select<T, R>(this IReadOnlyList<T> list, Func<T, R> projection) {
    return new AnonymousReadOnlyList<R>(
        counter: () => list.Count,
        getter: index => projection(list[index]));
}
IReadOnlyList<bool> x = from i in 10000000.Range()
                        select i < 5000;
bool b = x[5000]; // evaluates to false in time independent of the index

Implicit-definition can be performed with readonly collections and dictionaries, too:

public static IReadOnlyDictionary<K, R> Select<K, V, R>(this IReadOnlyDictionary<K, V> dictionary,
                                                        Func<KeyValuePair<K, V>, R> projection) {
    return new AnonymousReadOnlyDictionary<K, R>(
        counter: () => dictionary.Count,
        keys: dictionary.Keys,
        getter: (K k, out R r) => {
            V v;
            if (!dictionary.TryGetValue(k, out v)) {
                r = default(R);
                return false;
            }
            r = projection(new KeyValuePair<K, V>(k, v));
            return true;
        });
}

Again, although these implicit collections may be “large”, they are created and stored using only constant time and memory. The downside is that their items are individually more expensive to access, especially if you have implicitness stacked on implicitness a hundred levels deep. Usually you can avoid these costs with a well-placed “ToArray”, but it’s important to keep them in mind. (Technically the compiler could optimize many of the costs away, but don’t rely on that before it actually exists.)

Library

As I mentioned near the start of this post, I’ve packaged an implementation of these concepts into a “LinqToCollections” library. This section just describes what’s in the library, so it’s a bit dry (sorry).

Most of the library is devoted to working with readonly lists, because that is the case I have run into the most often. You can create a custom readonly list by constructing an instance of AnonymousReadOnlyList, or use the factory and extension methods in the ReadOnlyList class. Amongst the list methods there are two substantial groups: sublist methods and range methods.

Sublist methods include Take, TakeLast, Skip, SkipLast, TakeRequire, TakeLastRequire, SkipRequire, and SkipLastRequire. The ‘Require’ variants check that the list is large enough to take or skip the given number of items. There are several optimizations that these methods guarantee, in order to avoid quadratic behavior in use cases like “iteratively parse N bytes and skip N bytes”. Sequences of Take, sequences of TakeLast, and sequences of Skip/SkipLast are all collapsed automatically (e.g. list.Skip(2).Skip(2) is optimized into list.Skip(4)). The ‘Require’ check is also collapsed automatically across all the Skip/Take transformations (e.g. list.SkipRequire(2).TakeLastRequire(2) is optimized into list.PrivateRequire(4).Skip(2).TakeLast(2)). In many cases a requested operation will be entirely omitted at runtime, based on knowledge of the collection’s size (e.g. if x = new int[10], then x.Skip(20) returns ReadOnlyList.Empty() and x.SkipRequire(5) returns x.Skip(5)). Most of the library’s methods try to preserve ‘bounds on size’ information, in order to make these optimizations more effective.

Range methods include int.Range, byte.Range, sbyte.Range, ushort.Range, short.Range, AllBytes, AllSignedBytes, AllUnsigned16BitIntegers, and AllSigned16BitIntegers. They’re trivial, but nice to have around. I personally strongly prefer “foreach (var i in n.Range())” to “for (var i = 0; i < n; i++)", because foreach loops scope the iterator variable inside the loop, you’re less likely to make an off-by-one typo, and I can refactor transformations into the thing-being-iterated very easily. (If I was going to pick a part of the library as “controversial”, it would be the int.Range method.)

The remaining list methods each have different purposes, but are mostly self-descriptive: AsIList, IList.AsReadOnlyList (IList doesn’t inherit from IReadOnlyList for backwards compatibility reasons), Select, Zip, Reverse, Last, LastOrDefault, Empty, Singleton, Repeat, Stride, Deinterleave, and Partition. All of these transformations uses constant memory and take constant time to create. Also, their enumerators use the collection’s enumerator when possible so that functionality like lists detecting if they are modified while being enumerated continues to work.

The remainder of the library, devoted to readonly dictionaries and collections, is just equivalents of Empty, AnonymousReadOnlyList, AsIlist, and IList.AsReadOnlyList (and a Select method for readonly dictionaries). A lot more methods could be implemented for readonly dictionaries and collections, I just haven’t encountered enough situations where I needed them in order to justify their inclusion.

Usage

There are several usage examples in the main window of the example project.

Example Project Screenshot

In practice you use Linq to Collections in exactly the same ways you would use Linq to Enumerables, but with the benefit of a more flexible type. A simple use case is initializing an array:

TaskCompletionSource[] sources = 100.Range()
    .Select(e => new TaskCompletionSource())
    .ToArray(); // construct all of the sources once and only once

This is more compact than creating the array and then iterating over its range while filling it in. A more complicated example is massaging a Fourier transform’s output:

static void GraphMagnitudesAsFrequencies(IReadOnlyList<float> magnitudes) {
    // this method happens to interleave the real and imaginary components of its result
    IReadOnlyList<float> frequenciesCompact = magnitudes.FastFourierTransformWithInterleavedOutput();
    // de-interleave the result into complex numbers
    var realImag = frequenciesCompact.Deinterleave(2);
    IReadOnlyList<Complex> frequencies = realImag[0].Zip(realImag[1], (r, i) => new Complex(r, i));
    // get the frequency amplitudes
    IReadOnlyList<double> amplitudes = frequencies.Select(e => e.Magnitude);
    // graph it, omitting the "mirrored" second half of frequency space
    DrawMagicalGraph(amplitudes.Take(amplitudes.Count / 2));
}

In this example, a wave specified as a list over the time domain is transformed into a list of amplitudes over the frequency domain before being passed along. The passed-along list is actually a nicer view of the output of the transform, computed on-demand as its items are accessed. For example, when a caller accesses “amplitudes[2]” this causes “frequencies[2].Magnitude” to be evaluated, which causes “new Complex(realImag[0][2], realImag[1][2]).Magnitude” to be evaluated, which causes “new Complex(frequenciesCompact[4], frequenciesCompact[5]).Magnitude” to be evaluated.

Summary

You can make code easier to understand by asking for exactly what you want and promising exactly what you provide. The readonly collection interfaces allow you to do this in more situations. Asking for what you want, instead of creating it from an enumerable, can save a lot of execution time and memory.

Readonly collections can be defined implicitly, typically by projecting or otherwise transforming existing instances. Huge implicit collections can be created in constant time and stored in constant additional memory. You pay for this benefit with overhead when items are accessed.

All of this is implemented and ready to be referenced.

Appendix

Discuss on Reddit

Comments are closed.


Twisted Oak Studios offers consulting and development on high-tech interactive projects. Check out our portfolio, or Give us a shout if you have anything you think some really rad engineers should help you with.

Archive


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK