Collection Equality is Hard

posted by Craig Gidney on December 31, 2013

In this post: dealing with the corner cases inherent in checking if collections are equivalent.

Collection Equality

One of Objective-C’s features that I really like, but forgot to mention in A Year’s Worth of Opinions about Objective-C, is the built-in support for equating collections by value. When I’m working in C#, and I want to determine if two dictionaries have the same keys mapped to the same values, I have to write/find my own method to do so. In Objective-C I just write [dict1 isEqual:dict2] and it works (for the most part). It even works when there are other sorts of collections in the dictionaries, like arrays and sets and perhaps things that aren’t even implemented yet, because they also know how to determine equality.

But Obj-C’s collection equality, like most implementations of the concept, is not perfect. It falls apart in two cases: when collections are redundantly nested, and when collections are recursively nested.

Let’s start with the redundant case.

Unifying Redundant Collections

Imagine you’re implementing Hashlife. You decide to represent each node of your quad tree as an array containing the four child nodes. Hashlife makes sure that nodes covering areas with equal contents are actually the same node. They literally reference the same array, allowing exponential savings on the size of the board representation in memory (depending on how repetitive the simulation is).

Basically, Hashlife’s quad tree tends to have a lot of ways to get from the root to a particular node. A naive traversal of the tree will hit nodes many times. That’s what I mean when I say they are redundantly nested.

What would happen if you took two Hashlife quad trees and checked if they were equal? Well, Obj-C determines if arrays are equal by checking if they have the same length, then recursively checking if their corresponding items are equal. In the case of Hashlife quad trees, that corresponds to doing a depth-first traversal of the two trees, looking for discrepancies. The traversal does not avoid redundant paths to the same node, and so can end up spending exponentially more time than necessary rechecking things it has already checked.

That’s a practical example of how this issue might crop up. Let’s consider a simpler example. How long will it take to run the following code? How long should it take?

// at the bottom: lots of "hi"
NSArray* a = @[@"hi"].mutableCopy;
NSArray* b = @[@"hi"].mutableCopy;

// nest it fifty levels deep, with branching factor 2
for (int i = 0; i < 50; i++) {
    a = @[a, a];
    b = @[b, b];
}

// they should be equal
BOOL areEqual = [a isEqual:b];
NSLog(@"%@", areEqual ? @"true" : @"broken"); // prints 'true' (eventually)

If you do try running the above code, I suggest not waiting for it to finish. It will take a month or three, because the naive equality algorithm ignores the fact that the 2^50 paths join up again and again, which suggests a slightly faster way to confirm the arrays have the same structure. In particular, the fact that a[0] == a[1], b[0] == b[1], and [a[0] isEqual:b[0]] are all true implies that [a[1] isEqual:b[1]] is true. Taking advantage of that fact would cut the number of recursive calls at each level from 2 to 1, saving fifty compounding factors of 2 on the running time.
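One simple way to take advantage of that fact is to remember which pairs of array objects have already compared equal. The following is a minimal sketch, assuming the collections are plain nested NSArrays with no cycles; MemoizedArrayEqual is a hypothetical helper name, not part of Foundation, and other collection types simply fall back to isEqual:.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, not a Foundation API. Compares nested NSArrays by value
// while remembering which pairs of array objects have already compared equal.
// Assumes the arrays are acyclic and are not mutated during the comparison.
static BOOL MemoizedArrayEqual(id x, id y, NSMutableSet<NSString *> *knownEqual) {
    if (x == y) return YES;  // same object, trivially equal
    if (![x isKindOfClass:[NSArray class]] || ![y isKindOfClass:[NSArray class]]) {
        return [x isEqual:y];  // non-array items fall back to ordinary isEqual:
    }
    NSArray *a = x;
    NSArray *b = y;
    // Key the pair by object identity (pointer values), not by contents.
    NSString *pairKey = [NSString stringWithFormat:@"%p|%p", a, b];
    if ([knownEqual containsObject:pairKey]) return YES;  // already checked
    if (a.count != b.count) return NO;
    for (NSUInteger i = 0; i < a.count; i++) {
        if (!MemoizedArrayEqual(a[i], b[i], knownEqual)) return NO;
    }
    [knownEqual addObject:pairKey];
    return YES;
}
```

Calling MemoizedArrayEqual(a, b, [NSMutableSet set]) on the fifty-level arrays above should visit each level once instead of following every path, because the second child pair at each level is the same pair of objects as the first.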

This raises the question: is there an algorithm that can catch this sort of redundancy? What about more complicated cases, where the redundant arrays are scattered across multiple levels?

The problem we're trying to solve is a classic in computer science, known as unification. The input to a unification problem is two arrays that contain variables, values, and nested arrays. The solution is an assignment of values to the variables that causes the two inputs to be the same.

Here's an example unification problem: find a variable assignment for A,B,C,D that satisfies (A, B, C, D) = (B, "hey", (3, D), 4). In this case the solution is A=B="hey", C=(3,4), D=4.
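To make the example concrete, here is a small recursive unifier with an occurs check. It is only a sketch under stated assumptions: the Var class and the Resolve, Occurs and Unify functions are hypothetical names invented for illustration, the code assumes ARC, and this textbook-style approach is not a linear-time algorithm.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical term representation for this sketch:
// a Var is a variable, an NSArray is a nested array, anything else is a plain value.
@interface Var : NSObject
@property (nonatomic, copy) NSString *name;
+ (instancetype)named:(NSString *)name;
@end
@implementation Var
+ (instancetype)named:(NSString *)name { Var *v = [Var new]; v.name = name; return v; }
@end

// Follow bindings until reaching an unbound variable or a non-variable term.
static id Resolve(id term, NSDictionary<NSString *, id> *bindings) {
    while ([term isKindOfClass:[Var class]]) {
        id bound = bindings[((Var *)term).name];
        if (bound == nil) break;
        term = bound;
    }
    return term;
}

// Occurs check: does variable v appear anywhere inside term?
static BOOL Occurs(Var *v, id term, NSDictionary<NSString *, id> *bindings) {
    term = Resolve(term, bindings);
    if ([term isKindOfClass:[Var class]]) {
        return [((Var *)term).name isEqualToString:v.name];
    }
    if ([term isKindOfClass:[NSArray class]]) {
        for (id item in (NSArray *)term) {
            if (Occurs(v, item, bindings)) return YES;
        }
    }
    return NO;
}

// Try to make x and y the same, recording one rewrite rule per bound variable.
static BOOL Unify(id x, id y, NSMutableDictionary<NSString *, id> *bindings) {
    x = Resolve(x, bindings);
    y = Resolve(y, bindings);
    BOOL xIsVar = [x isKindOfClass:[Var class]];
    BOOL yIsVar = [y isKindOfClass:[Var class]];
    if (xIsVar && yIsVar && [((Var *)x).name isEqualToString:((Var *)y).name]) return YES;
    if (xIsVar) {
        if (Occurs(x, y, bindings)) return NO;  // would create a cyclic rule
        bindings[((Var *)x).name] = y;
        return YES;
    }
    if (yIsVar) return Unify(y, x, bindings);
    if ([x isKindOfClass:[NSArray class]] && [y isKindOfClass:[NSArray class]]) {
        NSArray *a = x;
        NSArray *b = y;
        if (a.count != b.count) return NO;
        for (NSUInteger i = 0; i < a.count; i++) {
            if (!Unify(a[i], b[i], bindings)) return NO;
        }
        return YES;
    }
    return [x isEqual:y];  // plain values must match exactly
}
```

Unifying @[A, B, C, D] with @[B, @"hey", @[@3, D], @4] (where A through D are Var instances) should succeed and leave the bindings A -> B, B -> "hey", C -> (3, D), D -> 4, which resolve to the solution given above.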

Sometimes actually writing out the solution to a unification problem can be extremely expensive. For example, the solution to the unification problem corresponding to our 50-levels-of-double-nesting arrays would be over 2^50 characters long. Fortunately we can output the solution in a different form, as a series of rewrite rules for each variable that eventually result in the full solution if followed. (It's important that the rewrite rules not contain any cycles.)

For reference, the 50-levels-of-double-nesting problem looks something like this:

(A0, A0,   B0,   A1, A1,      B1,      ..., A50, A50,       B50)
=
(B0, "hi", "hi", B1, (A0,A0), (B0,B0), ..., B50, (A49,A49), (B49,B49))

and the solution, in terms of rewrites, follows this pattern:

A0=B0 -> "hi"
A1=B1 -> (A0, A0)
...
A50=B50 -> (A49, A49)

Assuming you're satisfied with the algorithm returning only the rewrite rules, and we are, unification can be solved in linear time. (This is really surprising to me, because unification feels so much like union-find, which just barely doesn't have a linear-time algorithm. Before I was convinced that it worked, I had to look it up and prove it for myself.)

Anyways, we can translate our collection equality problem into a unification problem and thereby solve it in linear time even when there's a lot of redundant nesting. To create the unification problem we do a traversal of the collections and their sub-collections (avoiding re-traversing when we've already seen a collection), building up constraints as we go. Everything we encounter must be equal to the corresponding item or collection on the other side. Our collections are equal if and only if the unification problem has a solution.

I don't really want to go too deep into the details of the mapping between the two problems. Suffice it to say that it's feasible for collection equality to be implemented in terms of unification algorithms that don't fall apart under redundant nesting. The resulting algorithm will even still be linear time. The main downside of using the more general algorithm is that the typical case, no redundant nesting, becomes less efficient. Also, there would need to be a protocol for building up the overall unification problem (so we can work across unknown collection types).
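As a rough illustration of the traversal step, the sketch below names each distinct array once (by pointer identity) and records one rewrite-style rule per array. It assumes plain nested NSArrays with no cycles; CollectRules and its parameters are hypothetical names, and the rules are built as strings purely for display.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: give every distinct array reachable from `node` a name
// and record one rule per array, such as "A1 -> (A0, A0)".
// Returns the array's name, or a quoted description for non-array items.
// Assumes the structure is acyclic.
static NSString *CollectRules(id node, NSString *prefix,
                              NSMutableDictionary<NSString *, NSString *> *names,
                              NSMutableArray<NSString *> *rules) {
    if (![node isKindOfClass:[NSArray class]]) {
        return [NSString stringWithFormat:@"\"%@\"", node];
    }
    // Identify the array by its pointer so shared arrays are recognized.
    NSString *key = [NSString stringWithFormat:@"%p", node];
    NSString *known = names[key];
    if (known != nil) return known;  // already seen: do not traverse again

    NSMutableArray<NSString *> *parts = [NSMutableArray array];
    for (id child in (NSArray *)node) {
        [parts addObject:CollectRules(child, prefix, names, rules)];
    }
    // Children are named first, so the innermost array gets index 0.
    NSString *name = [NSString stringWithFormat:@"%@%lu", prefix,
                      (unsigned long)names.count];
    names[key] = name;
    [rules addObject:[NSString stringWithFormat:@"%@ -> (%@)", name,
                      [parts componentsJoinedByString:@", "]]];
    return name;
}
```

Run on the fifty-level array from earlier with the prefix @"A", this should produce 51 rules rather than 2^50 expanded leaves, since each shared array is visited only once.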

Because it's based on a well understood algorithm that runs in linear time, I don't think ensuring good performance for redundantly nested collections is too crazy. Recursive collections, on the other hand...

Graphing Recursive Collections

A recursive collection is a collection that contains itself, either directly or indirectly (e.g. by containing a collection that contains the original collection). Recursive collections break a lot of naive methods. For example, asking for a description of a recursive collection often leads to a crash as the program tries to create a string nested to infinity.

If even describing recursive collections doesn't work, what chance do we have of comparing them? Well, sometimes you'll get lucky and the comparison will short-circuit due to two compared collections or their items being equal by reference (i.e. same pointers, same collection, trivially equal). But that's not always the case:

// two arrays
NSMutableArray* a = [NSMutableArray new];
NSMutableArray* b = [NSMutableArray new];

// containing each other
[a addObject:b];
[b addObject:a];

// have the same structure, so should be equal
BOOL areEqual = [a isEqual:b]; // *CRUNCH*
NSLog(@"%@", areEqual ? @"true" : @"broken");

If you run the above code your program will crash. The arrays have the same structure, so I would consider them to be equal, but their cyclic nature is causing the naive algorithm to get stuck in a loop recursing deeper and deeper until it overflows the stack.

Unification won't help us here. The constraint A = (A) is not considered solvable, because the resulting rewrite rule A -> (A) makes a variable depend on itself. If we tried unification, we'd determine that the arrays are not equal (which is wrong). We have a more general problem now: graph isomorphism.

An algorithm that solves the graph isomorphism problem takes two graphs and determines if they are "the same". That is to say: is it possible to relabel the nodes of one of the graphs so that it becomes the other graph? We can translate our collection equality problem into a graph isomorphism problem by making each collection into a node and, whenever a collection contains another collection, inserting an edge from the parent collection's node to the child collection's node. We'd need a bit of trickery to turn non-collection items into non-interchangeable leaf widgets, but again let's skip over the details to focus on high-level things like the running time.
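The translation step on its own can be sketched as follows. This is only an illustration for nested NSArrays: AddNode and the "node"/"leaf" string labels are hypothetical choices, and the isomorphism check itself is not shown.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: turn nested (possibly cyclic) arrays into an edge list.
// Each distinct array becomes a numbered node; edges[i] lists what node i
// contains, with non-collection items recorded as labelled leaves.
static NSNumber *AddNode(NSArray *array,
                         NSMutableDictionary<NSString *, NSNumber *> *ids,
                         NSMutableArray<NSMutableArray<NSString *> *> *edges) {
    // Identify the array by its pointer, never by its contents.
    NSString *key = [NSString stringWithFormat:@"%p", array];
    NSNumber *existing = ids[key];
    if (existing != nil) return existing;

    NSNumber *nodeId = @(edges.count);
    ids[key] = nodeId;  // register before recursing, so cycles terminate
    NSMutableArray<NSString *> *outgoing = [NSMutableArray array];
    [edges addObject:outgoing];

    for (id item in array) {
        if ([item isKindOfClass:[NSArray class]]) {
            NSNumber *child = AddNode(item, ids, edges);
            [outgoing addObject:[NSString stringWithFormat:@"node %@", child]];
        } else {
            [outgoing addObject:[NSString stringWithFormat:@"leaf %@", item]];
        }
    }
    return nodeId;
}
```

For the two mutually-containing arrays above, calling AddNode on a with an empty dictionary and an empty edge list should yield two nodes, where node 0 points at node 1 and node 1 points back at node 0. Building that graph never calls isEqual: on the arrays, so it does not recurse forever.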

How long does it take to determine if two graphs are isomorphic? That's actually a really interesting question that no one knows the answer to. Graph isomorphism is one of the few problems currently in limbo between P and NP-Complete. There's no known polynomial-time algorithm, but it seems like there could be one. Suffice it to say that the problem has its own complexity class (GI). The best algorithms we have tend to run quickly in practice, but have exponential worst cases.

So it's possible to make collection equality work in the fully general case, where collections reference themselves in cycles, by translating the equality check into a graph isomorphism problem. The problem is that we can't be sure the check won't take an amount of time exponential in the size of the collections, and that's a really nasty edge case to leave in.

Maybe it's simpler to just not handle this case.

Summary

When collections are nested deeply and tend to be repeated, the naive comparison algorithm degenerates to taking exponential time. A unification-based algorithm would continue to work in linear time.

When collections can indirectly contain themselves, the naive comparison algorithm crashes and unification-based algorithms give false negatives. A graph-isomorphism-based algorithm will return the correct answer, but it's not known if this can be done in polynomial time or not.

Collection equality seems simple, but hides hard computer science problems underneath.

My Twitter: @CraigGidney
