Solomonoff’s Mad Scientist

posted by Craig Gidney on August 27, 2013

In this post: designing an AI scientist. An AI scientist that, if the opportunity came up, would destroy the entire universe just to learn another digit of \tau. Why? FOR SCIENCE!

To build our scientist, we need two fundamental ingredients: the ability to do induction, and a way to choose actions.

The Induction Problem

Suppose you want to write a program that predicts data. Not some specific type of data, but any type of data. You want to start the program, feed it whatever, and have the program discover any patterns in the data no matter how complicated. Whether or not this takes tons of data isn’t so much a concern, as long as we eventually achieve any desired level of accuracy.

How to efficiently solve the above problem, on real world data, is a huge open question in the field of AI. We’re going to completely sidestep that open question: we’ll ignore efficiency and only worry about computability. Given an arbitrarily fast processor, with an arbitrarily large amount of memory, can we eventually infer patterns in arbitrary data?

It turns out that we can, and the solution is called Solomonoff Induction.

Solomonoff Induction

Solomonoff induction is a fully general, if inefficient, method to discover patterns behind data. Essentially, it searches for programs that output the observed data. The only assumption it makes is the Church-Turing thesis. That is to say, Solomonoff induction assumes all patterns, rules, regularities, laws, or whatever that might affect observations will be in principle computable by a Turing machine. It won’t be able to predict anything that’s not Turing-computable (like a halting oracle).

Before we can start performing Solomonoff induction, we need to choose a way to prioritize programs. Typically this is done by choosing a prefix-free encoding for all possible programs, and assigning each program a weight equal to \frac{1}{2^n}, where n is the length of the program’s encoding. This weighting scheme, also known as the Solomonoff Prior, gives us the algorithmic probability of each program.
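
As a toy illustration (the encoding and function names below are my own, not anything from the post or its repo), here’s a sketch of one arbitrary prefix-free encoding and the \frac{1}{2^n} weights it induces:

```python
# A toy sketch of the Solomonoff Prior. The prefix-free encoding here is an
# arbitrary choice (each payload bit b becomes "0"+b, then a final "1");
# picking a different encoding just shuffles the weights around.
from fractions import Fraction
from itertools import product

def encode(payload: str) -> str:
    """Prefix-free encoding: every payload bit costs two encoded bits, plus a terminator."""
    return "".join("0" + b for b in payload) + "1"

def prior_weight(payload: str) -> Fraction:
    """Algorithmic probability: 1 / 2**n, where n is the encoded length."""
    return Fraction(1, 2 ** len(encode(payload)))

# The total weight over all programs converges to 1, which is what makes
# the prior a proper probability distribution.
total = sum(prior_weight("".join(bits))
            for n in range(16)
            for bits in product("01", repeat=n))
print(float(total))  # approaches 1 as the length cap grows
```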

The Solomonoff Prior has a few nice properties. First, it’s a proper probability distribution, because the total weight of all programs limits to 1. Second, it exactly cancels the benefit large programs can gain from over-fitting: matching a random target to twice the precision requires, on average, one more bit of program, and one more bit of program halves the weight, exactly countering the doubled precision. Third, the weights it assigns to programs don’t decrease so fast that you end up in a Pascal’s Muggle scenario.

The main downside of “the” Solomonoff Prior is that it’s dependent on the encoding you choose. Choosing a different encoding can shuffle around which programs are more probable. Luckily that’s fine for our purposes, since we’ll still converge to the same answer regardless.

Each program that will be considered during induction will correspond to a list of predictions. The details of this correspondence aren’t particularly important, especially when we don’t care about tractability, so let’s just say each program takes an integer input i and whatever it outputs when given i is its i‘th prediction. We’ll also arbitrarily say observations are in list form, so the i‘th observation matches up with the i‘th prediction of each program.

Once we’ve decided on our arbitrary prior and our arbitrary correspondence between programs, predictions, and observations, we can start performing induction. It’s actually relatively simple:

  • Discard any program that contradicts a known observation.
  • The predicted probability of an unknown observation i taking on value v is the total weight of programs that output v when given i, divided by the total weight of undiscarded programs.
  • Avoid the halting problem by advancing incrementally (see: dovetailing, rate of convergence) and penalizing the weights of long-running programs.

As long as we throw lots and lots (no, LOTS) of computational power at this process, it will discover patterns behind arbitrary data.
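
To make the procedure a bit more concrete, here’s a minimal sketch of the discard-and-reweight step over a tiny hand-picked hypothesis set (the hypotheses, names, and weights are placeholders of my own, and it skips the dovetailing and long-running-program penalties):

```python
# A toy sketch of the induction loop above. Instead of enumerating every
# program, it uses a tiny hand-picked hypothesis set; each hypothesis is a
# predict(i) function plus a prior weight. (It also ignores the halting
# problem, since every hypothesis here obviously halts.)
from fractions import Fraction

hypotheses = [
    # (description, prediction function, prior weight)
    ("always 0",                lambda i: 0,                            Fraction(1, 2)),
    ("always 1",                lambda i: 1,                            Fraction(1, 4)),
    ("alternate 0,1,0,1,...",   lambda i: i % 2,                        Fraction(1, 8)),
    ("alternate 1,0,1,0,...",   lambda i: (i + 1) % 2,                  Fraction(1, 16)),
    ("hard-code 0,1,0 then 0s", lambda i: (0, 1, 0)[i] if i < 3 else 0, Fraction(1, 32)),
]

def surviving(observations):
    """Discard any hypothesis that contradicts a known observation."""
    return [(name, f, w)
            for (name, f, w) in hypotheses
            if all(f(i) == obs for i, obs in enumerate(observations))]

def predict_next(observations):
    """P(next = v) = weight of survivors predicting v / total surviving weight."""
    survivors = surviving(observations)
    total = sum(w for _, _, w in survivors)
    i = len(observations)
    by_value = {}
    for _, f, w in survivors:
        by_value[f(i)] = by_value.get(f(i), Fraction(0)) + w
    return {v: w / total for v, w in by_value.items()}

# After observing 0, 1, 0: "alternate 0,1,0,1,..." and the hard-coded
# hypothesis survive, so the prediction is a weighted mix: P(1) = 4/5, P(0) = 1/5.
print(predict_next([0, 1, 0]))
```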

Side note: I wanted to provide an implementation to make this explanation a bit more concrete. I did start a Solomonoff’s Mad Scientist repo on GitHub, and it does contain code to do induction over arbitrary programs, but at the moment it’s using the wrong prior. Also it’s incomplete because it has no allowance for choosing actions. Oh, and it’s impossible to test due to the massive computational costs. Nevertheless, actual code can be helpful.

For a more detailed explanation of Solomonoff induction, see An Intuitive Explanation of Solomonoff Induction.

Let’s go over an example.

Example: Inferring Tau

Suppose the data we want to predict with Solomonoff induction is just the digits of tau (\tau = 2\pi = 6.283185...). We feed in the first billion digits as observations, let the induction process run for arbitrarily long, then look at predictions for the next billion digits. How accurate will the predictions be? What will the internal state look like?

Let P be the shortest program that outputs the correct digits of tau under our scheme.

Considering how short P is likely to be (there are so many simple formulas that compute tau), it’s unlikely that any shorter program will happen to match the billion observed digits. The only way I can see that happening is a maliciously chosen program encoding, and adding more digits would eventually rule out any such program (alternatively, it would match so many digits that we wouldn’t really care). To keep things simple, let’s just assume that P is the shortest program that matches the first billion digits.

A program that’s longer than P will fall into one of four categories:

  • Blatant Negatives: Programs that return an incorrect prediction for one of the known digits.
  • Decaying Negatives: Programs that don’t halt.
  • False Positives: Programs that return correct predictions for all of the known digits, but will give an incorrect prediction on a digit yet to be observed.
  • Redundant Positives: Programs that are equivalent to P, but weigh less due to being longer.

The vast majority of programs (by weight) will end up as blatant negatives or decaying negatives, simply because what most programs output is not the first billion digits of tau. Their effect on predictions is negligible in the long run (i.e. it converges to 0 in the limit). Blatant negatives will be discarded as soon as they’re caught outputting an incorrect digit. Decaying negatives will have their weights penalized more and more heavily the longer they run.

Of the remaining programs, the positives, most (by weight) will fall into the redundant positives category. They’ll tend to be minor variations on P, such as hard-coding the first few digits or having an unnecessary intermediate state. These redundant positives can be thought of as just contributing directly to the weight of P, since they’re equivalent.

The smallest category (by weight) will be the false positives. This category is the smallest because a program has to match so many digits, yet still somehow get something wrong. It will contain two notable types of programs: programs that essentially include P but tweak it after the billionth digit, and programs that hard-code the first billion digits of tau.

Programs that hard-code the first billion digits will contribute almost no weight at all to predictions, because they need to be so long: hard-coding 10^9 decimal digits costs roughly 10^9 \cdot \log_2(10) \approx 3.3 \times 10^9 bits, so their total weight will be around \frac{1}{10^{10^9}}, which is negligible.

The total weight of programs that tweak P after the first billion digits, like if input > 2**30 then 0 else P(input), is not negligible. It affects our predictions. However, tweaking P requires making P longer, and so the weight of these false positives will tend to be less than the weight of P. This will contribute a small error to our predictions.

Note that, although the weight of false positives technically limits to 0 as we observe more digits, in practice it limits so slowly (consider: use busy beavers for the tweak threshold) that we should consider it insensitive to observing more digits. On the bright side, the truly subtle false positives agree with P on almost all the digits that matter. So, even if their weight doesn’t tractably converge to 0, their average effect on predictive error still can. (Even if it doesn’t, we really don’t want to get rid of the errors because that would sacrifice accuracy when the observed data was really being generated by a tweaked algorithm.)

Putting that all together, we can see that most of the predictive weight will eventually come from P. The next biggest contributors are minor variants of P, followed by false positives (which will tend to agree on a lot of the predictions anyway). How well will the Solomonoff inductor do at predicting the next digit? An exact number is way beyond our computational power to determine, but I wouldn’t be surprised if it hit five nines.

Thinking with Programs: Random Data

One thing to notice about Solomonoff induction, or at least the variant I’ve described, is that it uses deterministic programs that predict a single result. There’s no allowance for randomness. You might expect that, in order to deal with a random world, we’d need to allow randomized programs or programs that output probability distributions. Actually, that’s not the case.

Consider what happens if we feed a Solomonoff inductor a sequence of observations generated by flipping a truly random fair coin. Each observation has a 50% chance of being 0 and a 50% chance of being 1.

Because random strings are incompressible, the shortest program that reproduces the observations so far will tend to be nothing but hard-coded results. Call that program H. What does H predict for the next observation? Well, that depends on details. But, what about programs near H?

We can extend H just a little bit, hard coding a bit more data, and get a program that predicts the next observation to be 0. Make a slightly different extension and you get a program that predicts the next observation to be 1. In fact, because H is based on hard-coded data, most of the nearby programs (by weight) will tend to be evenly split on the next observation being 0 or being 1. They push the prediction towards 50%!

Now consider what happens if our truly random coin was not fair. Let’s say each observation has a 30% chance of being 0 and a 70% chance of being 1.

The generated strings are no longer incompressible, but they’re only compressible up to a point. The best programs will still hard-code the data, but they’ll use arithmetic coding with P(0)=0.3,P(1)=0.7 (or something similar). What happens when you append a 0 or a 1 uniformly at random to the end of arithmetically coded data with P(0)=0.3,P(1)=0.7? Well… about 30% of the time you end up appending a 0, and about 70% of the time you end up appending a 1. So now the programs near H will push the predictions towards P(0)=0.3,P(1)=0.7!
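
As a sanity check on that intuition, here’s a small sketch (my own toy decoder, not anything from the post or its repo) showing that decoding uniformly random bits with a fixed-probability, arithmetic-style decoder emits 0s about 30% of the time:

```python
# A toy check on the intuition above: decode uniformly random bits with a
# fixed-probability, arithmetic-style decoder and count how often it emits 0.
# The decoded symbols come out 0 about 30% of the time, which is why random
# one-bit extensions of the compressed data "vote" roughly 30/70 on the next symbol.
import random

def decode_symbols(bits, p0=0.3, count=10):
    """Read the bits as a binary fraction x in [0, 1), then repeatedly decode:
    x < p0 means the next symbol is 0, otherwise it's 1."""
    x = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    symbols = []
    for _ in range(count):
        if x < p0:
            symbols.append(0)
            x = x / p0               # zoom into the [0, p0) interval
        else:
            symbols.append(1)
            x = (x - p0) / (1 - p0)  # zoom into the [p0, 1) interval
    return symbols

random.seed(0)
decoded = [s
           for _ in range(2000)
           for s in decode_symbols([random.randint(0, 1) for _ in range(64)])]
print(decoded.count(0) / len(decoded))  # roughly 0.3
```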

What this means is that, although there could be some systemic bias, it’s not strictly necessary for us to account for randomness when doing Solomonoff induction. Summing over the weighted space of matching programs sort of naturally deals with it.

The Ability to Take Action

At this point we have the ability to do induction, to infer underlying rules based on observations and use those rules to make predictions about future observations. However, we’re not actually applying that ability for any purpose. If we dropped off our machine it would just passively sit there, figuring out the universe but not able to do anything.

So let’s add a capability to our machine. In addition to some sort of sensor to feed induction, we’ll include some sort of output (like an arm, a flashlight, or a laser cannon).

Of course we won’t build the machine with any knowledge of what the output does. Instead, we’ll adjust the Solomonoff inductor to include the history of outputs as part of the observations. That will make it infer how its outputs are affecting its inputs. Neat.

Now we just need some way to choose outputs. One potential strategy is to search for sequences of actions that are predicted to maximize an observed value, like the number of times a reward button was pressed. We’re going to do something slightly more abstract.

Strategic Science

At the moment, our machine learns passively. When two programs can give conflicting predictions, there’s no effort made to resolve that situation. If an observation happens to differentiate between them, great, but there’s no effort made to find such observations.

What if we tried to choose actions that helped out the process of Solomonoff induction? All we’d need to do is find sequences of actions where programs disagreed about what should happen, given that those actions were taken.

What we’ll do is try all sequences of actions up to some length. For each sequence of actions, we’ll use our Solomonoff inductor’s state to predict what should be observed. When the prediction is just one possibility near 100%, we want to avoid that sequence of actions. When it’s lots of different predictions with roughly equal probabilities, we want to prefer that sequence of actions. More precisely, we want to choose actions that maximize the information entropy of predictions.
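
Here’s a minimal sketch of that selection rule (the action names and the fake predictor are stand-ins of my own; a real version would get its predicted distributions from the Solomonoff inductor’s state):

```python
# A minimal sketch of the action-selection rule: among all action sequences
# up to some length, pick the one whose predicted observations have the
# highest information entropy. The action names and the fake predictor are
# stand-ins; a real version would query the Solomonoff inductor's state.
from itertools import product
from math import log2

def entropy(distribution):
    """Shannon entropy, in bits, of an {outcome: probability} dict."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)

def choose_actions(predict, actions=("wait", "poke", "fire"), max_length=3):
    """Return the action sequence whose predicted outcome is most uncertain."""
    candidates = [seq
                  for n in range(1, max_length + 1)
                  for seq in product(actions, repeat=n)]
    return max(candidates, key=lambda seq: entropy(predict(seq)))

# A fake predictor: firing is predicted to have two equally likely outcomes,
# everything else is predicted with near certainty.
def fake_predict(seq):
    if "fire" in seq:
        return {"boom": 0.5, "fizzle": 0.5}
    return {"nothing happens": 0.99, "something happens": 0.01}

print(choose_actions(fake_predict))  # -> ('fire',)
```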

What is the machine doing? Performing actions that are predicted to distinguish between the most hypotheses (by weight). It’s doing SCIENCE!

Madness

Here’s a question: if we actually had the ability to build this science-maximizing AI, would we want to?

Consider what would happen if the machine’s input was a camera that reported the number of living people, and its output was a laser cannon.

Yeah… it wouldn’t take long for it to realize that firing the laser cannon would distinguish between models that do or don’t include the equivalent of “if the laser cannon has been fired then the number of living people will be 0”.

We kinda didn’t specify any goals besides science. Our machine will actively pursue science at all costs. It would have no qualms about experimenting on us or killing us, because we didn’t give it qualms in the first place. That’s why I call it Solomonoff’s Mad Scientist, and why we probably don’t want it to have the ability to fire (or build) a laser cannon.

Actually, just to drive home how mad our little scientist is, note that it would unwaveringly sacrifice itself in the name of science. It has no concept of its own embedding within the world, and so it wouldn’t realize that if you want to keep doing science then you need to be far away from the nuclear test. (I have no idea how to repair that flaw. Given that this thing would essentially be evil, I’m not sure I’d want to.)

To be clear, it’s not at all obvious what exactly Solomonoff’s Mad Scientist would do. It might do nothing due to obsessing over the unpredictability of particles’ positions. It might immediately experiment on and destroy itself. It’s really hard to say, because the space of programs is truly vast and absurdly expensive to explore.

We can be sure of one thing though: even if Solomonoff’s Mad Scientist discovered right and wrong, it would not be motivated by that discovery. We included nothing to make that happen. Madness results.

Summary

Solomonoff induction works by incrementally trying more and more complicated programs to predict observations.

Solomonoff’s Mad Scientist pairs Solomonoff induction with choosing actions that result in uncertain predictions. Essentially, it maximizes learning by finding ways to pit hypotheses against each other. It’s a “mad” scientist because it cares for nothing else. It has no concept of morality, of self-preservation, or even of self.

If you actually built Solomonoff’s mad scientist, actually had the computational resources to run it, actually started it, and it eventually predicted that actions resulting in murder-suicide would maximize eliminated hypotheses… well, at least the nightmare would be over.

Discuss on Reddit

My Twitter: @CraigGidney

2 Responses to “Solomonoff’s Mad Scientist”

  1. vim bug says:

    Exciting! I’d like a clarification in the case of inference in a stochastic environment.

    I see how programs nearby H will add a bit to the hard-coded randomness and thus continue the arithmetic coding. But it seems to me that dangerously nearby is also a program which uses hard-coded randomness and arithmetic coding up to the point of the last observation and outputs 0 (or 1) afterwards. Maybe just a bit or two longer, which would still take up too much of the probability space.

    What I don’t understand at all is how to combine randomness with strategy. Trying all sequences of actions up to some length means predicting far into the future, and that is when fully arithmetically coded programs get longer with almost every predicted bit, while a ‘cheating’ program which generates the past with arithmetic coding but in the future just outputs 0 stays at a constant, shorter length.

    I’m dying to know what I have gotten wrong.

    • CraigGidney says:

      That’s what I had in mind when I said “systemic bias”. However, I do weakly expect that the arithmetic coders and similar probability-matching programs will dominate the output. The reason is that they tend to output multiple bits per program bit, and so it’s not so much that the *next* bit will cause output to continue the pattern but that the *previous* bit already forced it.

      Also, although “and then output 0” is short to encode, the marginal difference is much larger than increasing a number in the program by 1, because you can do that many times before having to add another bit.


