
The Doomsday Argument in Chess

source link: https://rjlipton.wpcomstaging.com/2020/06/07/the-doomsday-argument-in-chess/

June 7, 2020

Framing a controversial conversation piece as a conservation law

John Gott III is an emeritus professor of astrophysical sciences at Princeton. He was one of several independent inventors of the controversial Doomsday Argument (DA). He may have been the first to think of it but the last to expound it in a paper or presentation.

Today we expound DA as a defense against thought experiments that require unreasonable lengths of time.

Gott thought of the argument when he saw the Berlin Wall as a 22-year-old touring Berlin in 1969. He reasoned that his visit was a uniformly random event in the lifetime L of the wall. That assumption gave him a 75% likelihood that he was not observing the wall in the first quarter of its lifetime. Since the wall was then 8 years old, that became a 75% likelihood that the wall would not last beyond 1993. It came down in late 1989.
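Gott’s arithmetic can be replayed in a few lines of Python (a sketch using only the dates above):

```python
# Gott's 1969 Berlin Wall estimate, assuming his visit is a uniformly
# random point within the wall's total lifetime L.
built, visit = 1961, 1969
age_at_visit = visit - built              # the wall was 8 years old

# With 75% probability the visit is NOT in the first quarter of L,
# i.e. age_at_visit > L/4, hence L < 4 * age_at_visit.
confidence = 0.75
max_lifetime = age_at_visit / (1 - confidence)  # 32 years
latest_end = built + max_lifetime               # 1993
print(latest_end)  # 1993.0 -- the wall actually came down in late 1989
```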

The “Doomsday” name comes from regarding one’s own birthdate as a uniformly random sample from the sequence of all human births. If you are my age, your birth is probably closer to ordinal 60 billion than to 70 or 100 billion. We can then say we are 95% confident that we are not in the initial 5% of this sequence. That entails the sequence stopping before 1.2 trillion births. If our population levels off at 10 billion with 80 years’ life expectancy, that makes the lifetime of humanity extend no further than roughly the year 12,000 AD. The upshot is that asserting a longer span entails asserting that our random sample gave a point unusually early in it. The purer form of DA also argues that the sample is not unusually late, giving this picture:
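The same arithmetic for the birth sequence (a sketch; the 60-billion ordinal and the 10-billion/80-year steady state are the figures above):

```python
births_so_far = 60e9               # approximate ordinal of a present-day birth
# 95% confidence of not being in the first 5% of all births caps the total:
total_births_cap = births_so_far / 0.05        # 1.2 trillion
remaining = total_births_cap - births_so_far   # about 1.14 trillion to come

population, life_expectancy = 10e9, 80
births_per_year = population / life_expectancy  # 125 million per year
years_left = remaining / births_per_year
print(round(years_left))  # about 9,120 more years of births
```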

[Figure: the Doomsday interval diagram. Modified from a Michael Stock source]

This doubles the span allowed with 95% confidence while giving reason—at the time the sample is taken—to believe that the end is not imminent: at least about 1.75 billion more births will come after it. For my birth, however, this is already a given.

Debating DA

The dependence on which observer is taken as the reference point is one shiftable parameter of the DA. If you are a preteen reader, then your own birth may be closer to ordinal 70 billion in the sequence, which becomes your reference point. You can then tack on another 2,000 years to the estimate. The earliest human cave painters may have been among the first 3 billion Homo sapiens. With regard to their reference point, humanity has already gone past their 95% limit.

A more fundamental rebuff to DA comes from the equal reasonableness of an alternate uniformity assumption: that you are a uniformly random element of the set of all possible human beings. Only a subset of those possible people will ever be born. The longer humanity lasts, the higher was your prior probability of belonging to that subset. Thus the fact of your birth can be construed as weighting the odds toward a longer span in a way that cancels out the short-span reasoning of DA.

Even when an instance of DA passes these objections, the inference remains controversial. We wrote about DA last year in connection with estimating how long open problems remain open. A clear non-instance is trying to apply DA to estimate the lifespan of the Covid-19 pandemic. We have all been going through the span together, so the present moment is not a uniformly random sample of it.

The DA assumptions would however hold if an alien tourist with no prior knowledge of events dropped in on Earth today. The delicacy of the assumptions makes it significant to seek scenarios where DA firmly applies—and better, where the inference may be deemed necessary to preserve the validity of established modes of inference against extreme skeptical hypotheses. This is what we will try to argue in regard to inferences of cheating at chess.

The 1-in-100,000 Question

We have posted numerous times about my statistical chess model, which renders judgments of odds against null hypotheses of fair play in the form of z-scores, and about my means of validating them. We will take as granted for this argument that the modeling is true in the sense that the distribution of z-scores from testing honest players conforms to the standard normal distribution.

Now let us talk about chess in the years B.C.—before Covid—when the game was played over-the-board (OTB) in-person across a table. Suppose I obtained a z-score of 4.265 from a test of one player in one tournament. I have chosen this number for all of the following reasons:

  • It corresponds to what I call “face-value odds” of 100,000-to-1 against the null hypothesis, as one can see from any standard normal tail calculator.
  • It is close to my number from an actual case in the year 1 B.C., that is, last year.
  • It is also typical of z-scores I have been obtaining these past three months since chess went online, at the points where certain online platforms have made their own decisions to impose sanctions. Here I must add that the platforms’ cheating-detection systems draw on information about the manner of play through the platform GUI, which often furnishes much greater statistical evidence, whereas my minimalist model uses only the record of the moves played in the games.

Suppose there were no other relevant information about the case. How would one assess the significance of the z-score of 4.265? Here are two different ways of reasoning that—in the case of OTB chess—arrive at similar answers:

  1. The Bayesian prior probability of cheating in OTB chess has been estimated between 1-in-10,000 and 1-in-5,000. Suppose the former, and consider a thought experiment in which 100,000 players are tested. For simplicity, let’s suppose all true instances—that is, cheating players—give z-scores above 4.265. We expect there to be ten of them, plus one natural occurrence of 4.265 or more. Thus the odds that our score represents a true positive are only 10-to-1 (a 10-in-11 chance). This is well short of the odds range usually needed to meet the standard of comfortable satisfaction used for example by the Court of Arbitration for Sport. Thus the 4.265 datum alone should not be sufficient grounds for sanction.
  2. Suppose there were a policy of sanction above a threshold of 4.25. The sum of playing fields in events held under auspices of the International Chess Federation (FIDE) each year exceeds 100,000. Thus we would expect to find at least one z-score over 4.265 per year by natural chance, whose sanctioning would be a serious human-rights error. FIDE cannot afford a rate of one such error per year. Thus it is insufficient for sanction.

A 5.0 standard, however, gives a natural frequency of just over 1-in-3.5 million. The resulting error rate of once in 20-to-30 years might be acceptable in prospect. And the Bayesian argument based on a 0.0001 prior leaves about 350-to-1 odds against the null hypothesis, which is comfortably within the comfortable-satisfaction range as it has been applied.
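Both thresholds can be checked numerically (a sketch; the normal upper tail comes from `math.erfc`, and the 1-in-10,000 prior is the estimate above):

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

prior = 1e-4        # estimated 1-in-10,000 OTB cheating rate
tested = 100_000    # thought-experiment sample size

for z in (4.265, 5.0):
    cheaters = prior * tested             # expected true positives (all assumed above z)
    false_pos = upper_tail(z) * tested    # expected natural exceedances
    print(f"z = {z}: 1-in-{1 / upper_tail(z):,.0f} natural chance, "
          f"about {cheaters / false_pos:.0f}-to-1 that a positive is genuine")
```

This reproduces the roughly 10-to-1 posterior odds at 4.265 and the roughly 350-to-1 odds at 5.0.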

FIDE nevertheless has maintained a policy that statistical evidence must be accompanied by some other kind of evidence. If a player is caught looking at a chess position in a bathroom, or found to have a buzzing device or wires on his or her person, or signaling behavior is observed, then in fact much lower z-scores (down to a threshold of 2.50, about 160-to-1 odds, in current FIDE regulations) are deemed to lend strong support to such evidence.

[Photo: 2015 Peter Doggers/Chess.com source]

I posted a similar rationale on my own website in early 2012, where causal evidence is likened to the “black spot” in the novel Treasure Island.

One More Datum

Now, however, suppose we have the 4.265 and one more piece of “evidence” that is pertinent but not as clearly causal. It could be:

  • The player wore a hat that covers the ears, or
  • An unusually bulky sweater (worn on a hot day), or
  • Unusual gestures or movements during the games.

Say a search of the player turned up nothing, but this occurred after the sequence of games giving the 4.265, a day after the player had been put on notice of suspicion. So the extra information is not a black spot but instead a “grey spot.” What can we conclude now?

The Bayesian argument seems to depend on judging how this information affects the prior probability of cheating. Does it make cheating a more likely hypothesis? We don’t actually know. Whereas the 1-in-10,000 global prior estimate was based on knowing dozens of cases over the past decade, only a handful conformed to this level of indication—short of more obvious things like making frequent visits to the restroom or being seen with an ear adornment. The most we can say is that the datum is not irrelevant. An example of an irrelevant datum would be if the player were wearing neon green sneakers—not bulky, no wires, just a weird green.

I would like, however, to argue that the player’s membership in a smaller pertinent sample B enhances the significance of the z-score. B must be defined by criteria that are not only independent of my statistical analysis of the games but also pertinent, so as to avoid selection bias. What is needed to quantify this enhancement is:

(a) to collect all (other) kinds of items on a par with the above—say ostentatious bracelets that could camouflage electronic indicators—and

(b) to establish that the frequency of players having any such accoutrement over the global mass of tournaments is at most, say, 1-in-100.

Now there are several equivalent ways to continue the reasoning. One is to say that since B is “at worst” independent, the face-value odds are amplified by a factor of at least 100. The Bayesian mitigation then still leaves about 1,000-to-1 odds against the null hypothesis. Another is to say that in any given year, the natural chance of seeing the conjunction of B and the z-score is at most 1-in-100. Thus aside from the frequency of true positives, a policy of sanctioning in such cases would have a prospective error rate of once in 100 years. The conjoined error rate of that and sanctioning on 5.0 in isolation would be acceptable.
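The arithmetic of both views can be sketched directly (the 1-in-100 frequency of B is the assumed bound from point (b) above):

```python
z_tail = 1e-5              # face-value chance of z >= 4.265 (100,000-to-1)
b_freq = 1e-2              # assumed bound: at most 1-in-100 honest players fall in B
prior = 1e-4               # 1-in-10,000 prior probability of cheating
fields_per_year = 100_000  # annual total of FIDE playing fields

# View 1: B is "at worst" independent, so the face-value odds gain a factor
# of 1/b_freq = 100; multiplying by the prior odds mitigates them back down.
posterior_odds = prior / (z_tail * b_freq)
print(posterior_odds)       # about 1,000-to-1 against the null hypothesis

# View 2: natural chance of the conjunction of B and the z-score per year.
false_per_year = z_tail * b_freq * fields_per_year
print(1 / false_per_year)   # about one expected error per 100 years
```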

A Bayesian defense attorney might still counter: Consider a thought experiment in which we test 100,000 such “bulky” players. We don’t have any new information on the prior rate of cheating by players in B. For all we know, it is still 1-in-10,000. Thus the same terms as before will apply: our experiment will expect to have 10 cheaters in B plus the one natural false positive, leaving the odds only 10-to-1 as before. Put another way: without knowing the import of specializing to B on the likelihood of cheating, you can’t reach any further conclusion.

Doomsday to the Rescue

The nub of rejecting this counter-argument is that:

Because there are only about 1,000 players in B per year, the thought experiment of testing 100,000 players in B now takes 100 years.

Moreover, the defense attorney is asserting that the mistaken false positive has occurred unusually early in this span. If this is the first year under consideration, then it is a uniformly random event in the first 1% of the span. By the same reasoning as DA, the odds of this are only 1-in-100. Compounded by the 10-to-1 odds against this particular score in the thought experiment being the false positive, we recover something near the 1,000-to-1 odds of the original reasoning.

We might allow that we are not in the first year of “the cheating era” in chess. The thicket of high-profile cases with solid grounds for judgment goes back a little over 10 years. The factor from DA then goes down to 1-in-10. But this still leaves the overall odds about 100-to-1 against the null hypothesis, and that is commonly taken as an anchor point for the standard of comfortable satisfaction after all mitigating factors have been addressed.
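The compounding in the last two paragraphs is a one-loop sketch (assuming the defense’s 10-to-1 figure and the spans stated above):

```python
defense_odds = 10   # 10-to-1 from the defense's 100,000-player thought experiment
span_years = 100    # time to actually test 100,000 players in B at ~1,000 per year

for years_elapsed in (1, 10):
    # Claiming the lone natural false positive fell in the years already
    # elapsed means a uniformly random event landed unusually early:
    # probability years_elapsed / span_years.
    da_factor = span_years / years_elapsed
    overall = defense_odds * da_factor
    print(f"{years_elapsed} year(s) in: overall odds about {overall:.0f}-to-1")
```

The first year gives the 1,000-to-1 figure; ten years into the cheating era it drops to 100-to-1, as in the text.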

Thus I am casting the Doomsday interval argument as a defense against unreasonably long thought experiments. It restores a dimension of time that is ignored by the Bayesian objection. This dimension of time is correctly preserved in the analysis of the expected error rate from a policy of imposing sanctions under this B–z combination of circumstances.

Is my line of reasoning valid? You can be the judge. If so, then it is a class of instances where DA is applied merely to conserve an inference of unlikelihood that was originally made by other means. This supports the validity of DA-type inference in general.

Online Chess and the Time Warp

We are now in the third month of “the online era” in chess. Even though online platforms can process many more kinds of information than I can glean from OTB play, my work has proved highly relevant for global early indications, second opinions, and transparent explanations. Alas, the sanction rate at the new featured tournaments has been well in excess of 1%. We hope this will come down as the playing pool—which has been greatly democratized in massive online events—wises up to the reality of getting caught.

What I want to discuss here is how this brave new world flips the Bayesian reasoning in a way that may come on too strong for the prosecution, again by its indifference to the element of time.

Take the 4.265 z-score with a 1-in-100 prior. The face-value odds from the z-score are now mitigated only to 1,000-to-1. This gives 99.9% confidence in imposing a sanction. However, the rate of errors would be higher than once per year because more players total have been involved per tournament. The tournaments are played at faster Rapid and Blitz paces allowing eight or more games per day, whereas classic OTB tournaments feature one game per day, sometimes two, over a span of a week to ten days.
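The online-era mitigation works out as follows (a sketch; the 1-in-100 prior is an assumption consistent with the sanction rate “well in excess of 1%” noted above):

```python
z_tail = 1e-5   # face-value chance of z >= 4.265 (100,000-to-1)
prior = 1e-2    # assumed online cheating rate of about 1-in-100

# Posterior odds = likelihood ratio times prior odds.
posterior_odds = prior / z_tail
confidence = posterior_odds / (posterior_odds + 1)
print(posterior_odds, confidence)  # roughly 1,000-to-1, i.e. about 99.9%
```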

This is also set against a vastly higher global sample size. Whereas the entire historical record of OTB chess represented by the ChessBase Mega database has yet to hit the 10 million games mark, the online platform Lichess has now hit 75 million games played per month. Adding in ICC and chess.com and ChessBase’s and FIDE’s own servers yields an equation that recalls Ps 90:4 and 2 Peter 3:8:

A thousand years of OTB are but a day that passes online.

For online platforms in isolation, absent anything to distinguish one player’s set of games from any other’s (such as their belonging to a highest-profile tournament), this means that even a 5.0 standard is inadequate for sure judgment. At their volume, online sites can see deviations of 5.0 by natural chance more than once per day. Thus they must either tolerate a higher rate of errors or adopt a standard so high as to let many more guilty parties through the sieve.

Such volume means all the more that one should hold a score of 4.265 as insufficient for judgment. This is despite the vastly higher Bayesian likelihood that a sanction based on that score is correct. The greater frequency of actual cheating does mean that the rate of error per positive reading declines, but the rate per absolute time, with regard to the fixed population of honest players, may matter more. This has accompanied deliberations of whether sanctions for online cheating must be given less permanent consequences in order to allow setting thresholds so that a high percentage of actual cheaters are flagged and the error rate can be tolerated.

Open Problems

Does this analysis square with you? Does it help in understanding controversies over the original Doomsday Argument’s paradigm?

For another pass over the argument, suppose I get a z-score of 4.265 in a narrowly-defined event such as one country’s championship league. Does that limit the sample size, so that the score is more dispositive? The kind of reasoning in point (b) above, where we had to gather all possible indicators that would lead us to constrain the sample, would however mandate widening it at least to include other countries’ leagues. This is an aspect of the “look elsewhere” effect where the space of potential tests is widened even before actual tests are considered. Possibly it should be widened to include all tournaments with similar levels of players, in which case we are back to the “square 1” of the 1-in-100,000 section of this post. The point of the analysis of the extra datum about the player is that the sample expansion has an effective pre-defined limit.

[Added note about online cheating detection to third bullet in section 3. Clarified: “… the conjunction of these two factors” –> “… the conjunction of B and the z-score” and changed the succeeding sentence.]
