

Do You Want to Know a Secret?
source link: https://rjlipton.wpcomstaging.com/2018/08/25/do-you-want-to-know-a-secret/

A riff on writing style and rating systems
Mark Glickman is a statistician at Harvard University. With Jason Brown of Dalhousie University and Ryan Song also of Harvard—we’ll call them GBS—he has used musical stylometry to resolve questions about which Beatle wrote which parts of which songs. He is also a nonpareil designer of rating systems for chess and other games and sports.
Today we discuss wider issues and challenges arising from this kind of work.
In fact, we’ll pose a challenge right away. Let’s call it The GLL Challenge. Many posts on this blog have both our names. In most of them the writing is split quite evenly. Others like this are by just one of us. Can you find regularities in the style of the single-author ones and match them up to parts of the joint ones?
Most Beatles songs have single authors, but some were joint. Almost all the joint ones were between John Lennon and Paul McCartney, and in a number of those there are different accounts of who wrote what and how much. Here are examples of how GBS weighed in:
- Although the 1962 song “Do You Want to Know a Secret?” was credited as “Lennon/McCartney”—and even as “McCartney/Lennon” by a band who covered it in 1963—it has long been agreed to be mostly Lennon’s, as labeled on this authorship list. GBS confirm this.
- The two composers differed, however, in their accounts of “In My Life” and it has taken GBS to credit it all to Lennon with over 98% confidence.
- The song “And I Love Her” is mainly by McCartney, but GBS support Lennon’s claim to have written the 16-syllable bridge verse.
- Lennon said “The Word” was mainly his, but GBS found McCartney’s tracks all over it.
Tell Me Why Baby It’s You
To convey how it works, let’s go back to the GLL Challenge. I tend to use longer words and sentences, often chaining further thoughts within a sentence when I could have stopped it at the comma. The simplest approach is just to treat my sole posts as “bags of words” and average their length. Do the same for Dick’s, and then compare blocks of the joint posts. The wider the gap you find in our sole writings, the more confidently you can ascribe blocks of our joint posts that approach one of our word-length means or the other.
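The bag-of-words baseline can be sketched in a few lines; the texts below are hypothetical stand-ins, not actual GLL posts:

```python
import re

def mean_word_length(text):
    # "bag of words": discard order, keep only the word lengths
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words)

solo_ken = "Chaining further thoughts within sentences lengthens them considerably."
solo_dick = "I like short words. They are fun."
joint_block = "These regularities in style can be matched up."

# the wider the gap between the solo means, the more confident the match
gap = abs(mean_word_length(solo_ken) - mean_word_length(solo_dick))
closer_to_ken = (abs(mean_word_length(joint_block) - mean_word_length(solo_ken))
                 < abs(mean_word_length(joint_block) - mean_word_length(solo_dick)))
print(gap, closer_to_ken)
```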
For greater sophistication, you might count cases of two consecutive multisyllabic words, especially when a simple word like “long” could have replaced the second one. Then you are bagging the pairs of words while discarding information about sentence structure and sequencing. An opposite approach would be to model the probability of a word of length $\ell$ following a whole sequence of words of lengths $\ell_1,\dots,\ell_k$. This retains sequencing information even if $k$ is small because one sequence is chained to the previous one.
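A toy version of the pair-bagging idea, classing words as short or long and counting transitions (the 6-letter threshold is an arbitrary assumption):

```python
from collections import Counter

def length_transitions(text, long_at=6):
    # label each word 'S'hort or 'L'ong, then bag the adjacent pairs;
    # all sentence structure beyond adjacency is thrown away
    labels = ['L' if len(w) >= long_at else 'S' for w in text.split()]
    return Counter(zip(labels, labels[1:]))

print(length_transitions("the marginal value of extra skill tails off"))
```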
GBS counted pairs—that is, transitions from one note or chord to another—but did not analyze whole musical phrases. The foremost factor, highlighted in lots of popular coverage this past month, is that McCartney’s transitions jump around whereas Lennon’s stay closer to medieval chant. Although GBS covered songs from 1962–1966 only, the contrast survives in post-1970 songs such as Lennon’s “Imagine” and “Woman” versus McCartney’s “Live and Let Die” and the refrain of “Band on the Run.”
To my ears, the verses of the last creep like Lennon, whereas Lennon’s “Watching the Wheels” has swoops like McCartney. Back when they collaborated they may have taken leaves from each other, as I sometimes channel Dick. The NPR segment ended with Scott Simon asking Keith Devlin about collaborative imitation; Devlin replied:
For sure. And that’s why it’s hard for the human ear to tell the thing apart. It’s also hard for them to realize who did it and this is why actually the only reliable answer is the mathematics because no matter how much people collaborate, they’re still the same people, and they have their preferences without realizing it. [Lennon’s and McCartney’s] things come together—that works—but they were still separate little bits. The mathematics isolates those little bits that are unique to the two people.
GBS isolated 149 bits that built a confident distinguisher of Lennon versus McCartney. This raises the specter of AI revealing more about us than we ourselves can plumb, let alone already know. It leads to the wider matter of models for personnel evaluation—rating the quality of performance—and keeping them explainable.
A Paradox of Projections
Glickman created the rating system Glicko and partnered in the design of URS, the Universal Rating System. Rather than present them in detail we will talk about the problems they intend to solve.
The purpose is to predict how a player $P$ will do against an opponent $Q$ from the difference $x = R_P - R_Q$ in their ratings $R_P$ and $R_Q$:

$$p = f(x).$$

Here $p$ gives the probability for $P$ to win, or more generally the percentage score expectation over a series of games. The function $f$ should obey the following axioms:

- $f(0) = \frac{1}{2}$;
- $f(-x) = 1 - f(x)$;
- $f$ is strictly increasing;
- $f$ is concave for $x > 0$.
The last says that the marginal value of extra skill tails off the more one is already superior to one’s opponent. Together these say $f$ is some kind of sigmoidal curve, like the red or green curve in this graphic from the “Elo Win Probability Calculator” page:
To use the calculator, pop in the difference as $x$, choose the red curve (for US ratings) or green curve (for international ratings), and out pops the expectation $p$. What could be simpler? Such simplicity and elegance go together. But the paradox—a kind of “Murphy’s Law”—is:
Unless the players are equally rated, the projection is certainly wrong. It overestimates the chances of the stronger player. Moreover, every projection system that obeys the above axioms has the same defect.
Here’s why: We do not know each rating exactly. Hence their difference $x$ likewise comes with a $\pm\epsilon$ component. Thus our projection really needs to average $f(x-\epsilon)$ and $f(x+\epsilon)$ over a range of $\epsilon$ values. However, because $f$ is concave for $x > 0$, all such averages will be below $f(x)$.
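This concavity effect is just Jensen’s inequality, and it is easy to check numerically with the standard logistic Elo curve (a minimal sketch; the symmetric $\pm\epsilon$ uncertainty is the only assumption):

```python
# Averaging a sigmoid that is concave at x > 0 over rating uncertainty
# pulls the projection below the pinpoint value f(x).
def f(x):
    # logistic expectation curve on the usual 400-point Elo scale
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

x, eps = 200.0, 100.0
pinpoint = f(x)
averaged = 0.5 * (f(x - eps) + f(x + eps))
print(pinpoint, averaged)  # the average lands below the pinpoint value
```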
We might think we can evade this issue by using the curves

$$f_\epsilon(x) = \frac{1}{2}\left(f(x-\epsilon) + f(x+\epsilon)\right).$$

This shifts the original curve left and right and averages them. Provided $\epsilon$ is not too big, $f_\epsilon$ is another sigmoid curve. Now define $g$ by aggregating the functions $f_\epsilon$, say over $\epsilon$ normally distributed around $0$. Have we solved the problem? No: $g$ still needs to obey the axioms. It still has sigmoid shape concave above $x = 0$. Thus $g(x)$ will still be too high for $x > 0$ and too low for $x < 0$. The following “Law”—whom to name it for?—tries not to be hyperbolic:
All simple and elegant prediction models are overconfident.
Indeed, Glickman’s own explanation on page 11 of his survey paper, “A Comprehensive Guide to Chess Ratings,” is philosophically general:
At first, this consistent overestimation of the expected score formula may seem surprising [but] it is actually a statistical property of the expected score formula.
To paraphrase what he says next: In a world with total ignorance of playing skill, we would have to put $p = \frac{1}{2}$ for every game. Any curve $f$ comes from a model purporting pinpoint knowledge of playing skill. Our real world is somewhere between such knowledge and ignorance. Hence we always get some interpolation of $f$ and the flat line $p = \frac{1}{2}$. In chess this is really an issue: although both the red and green curve project a difference of $200$ rating points to give almost 76% expectation to the stronger player, observed results are about 72% (see Figure 6 in the survey).
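For the record, the logistic expectation formula behind the green (international) curve reproduces the near-76% figure for a 200-point gap; the 72% is Glickman’s empirical observation, not something the formula yields:

```python
def expectation(x):
    # expected score for the stronger player at rating difference x
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

print(round(expectation(200), 4))  # ~0.76, versus roughly 0.72 observed
```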
Newtonian Ratings and Grothendieck Nulls
The Glicko system solves this problem by giving every player a rating $R$ and an uncertainty parameter $\sigma$. Instead of creating $f_\epsilon$’s and $g$ (or etc.) it keeps $\sigma$ a separate parameter. This solves the problem by making the prediction $p$ a function of $\sigma$ as well as $x$, with optional further dependence on how the $\sigma$ “glob” may skew as $R$ grows into the tail of high outliers and on other dynamics of the population of rated players.
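A sketch of how a Glicko-style projection folds the uncertainty parameter into the prediction. The damping factor $g$ is from Glickman’s Glicko description; combining the two players’ deviations by root-sum-of-squares is one common convention, not necessarily the system’s exact rule:

```python
import math

Q = math.log(10) / 400.0  # scale constant from the Glicko system

def g(rd):
    # damping factor: g(0) = 1, and g shrinks toward 0 as rd grows,
    # flattening the projection toward the 50% line
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd / math.pi) ** 2)

def project(r1, rd1, r2, rd2):
    # combine the two rating deviations by root-sum-of-squares
    rd = math.sqrt(rd1 ** 2 + rd2 ** 2)
    return 1.0 / (1.0 + 10.0 ** (-g(rd) * (r1 - r2) / 400.0))

# same 200-point gap, but growing uncertainty shrinks the projection
print(project(1800, 30, 1600, 30), project(1800, 150, 1600, 150))
```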
However, Newton’s laws behave as though bodies have pinpoint mass values at their centers of gravity, no matter how the mass may “glob” around it. Trying to capture an inverse-square law for chess ratings leads to a curious calculation. Put

$$f_c(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{x}{\sqrt{c}}$$

for $c > 0$; its derivative $f_c'(x) = \frac{\sqrt{c}}{\pi(c + x^2)}$ follows an inverse-square law. Taking an uncertainty $\pm\epsilon$ gives $f_c(x+\epsilon)$ and $f_c(x-\epsilon)$ and allows gluing their average $A(x) = \frac{1}{2}(f_c(x+\epsilon) + f_c(x-\epsilon))$. Simplifying $A(x) - f_c(x)$ by the arctangent addition formula gives $\frac{1}{2\pi}\arctan$ of a fraction with denominator $(c + x^2)^2 + \epsilon^2(c - x^2)$ and numerator given by $-2\sqrt{c}\,x\,\epsilon^2$. The two first-order terms in $\epsilon$ cancel out, leaving the numerator with no part below $\epsilon^2$.
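A numeric check of this averaging discrepancy for an arctangent (“inverse-square”) sigmoid, assuming $f(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x/\sqrt{c})$; the direct average and the closed-form fraction should agree:

```python
import math

def f(x, c):
    # "inverse-square" sigmoid: f'(x) = sqrt(c) / (pi * (c + x**2))
    return 0.5 + math.atan(x / math.sqrt(c)) / math.pi

def avg_minus_pinpoint(x, eps, c):
    # direct computation of the averaging discrepancy
    return 0.5 * (f(x - eps, c) + f(x + eps, c)) - f(x, c)

def closed_form(x, eps, c):
    # arctan of a fraction: the numerator -2*sqrt(c)*x*eps^2 has no
    # linear term in eps; the denominator is (c+x^2)^2 + eps^2*(c-x^2)
    num = -2.0 * math.sqrt(c) * x * eps ** 2
    den = (c + x ** 2) ** 2 + eps ** 2 * (c - x ** 2)
    return math.atan2(num, den) / (2.0 * math.pi)

x, eps, c = 1.5, 0.4, 2.0
print(avg_minus_pinpoint(x, eps, c), closed_form(x, eps, c))  # both ~ -0.006
```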
David Mumford and John Tate, in their 2015 obituary for Alexander Grothendieck, motivated Grothendieck’s use of nilpotent elements via situations where one can consider $\epsilon^2$ to be truly negligible—that is, to put $\epsilon^2 = 0$.
Here we have an ostensibly better situation: In our expression for the discrepancy, the whole numerator has to stay pretty small. The linear term in $\epsilon$ has coefficient $0$ and the $\epsilon^2$ term has coefficient $-2\sqrt{c}\,x$. Thus if we could work in an algebra where $\epsilon^2 = 0$ then the pinpoint value $f_c(x)$ and all averages $A(x)$ for uncertainty $\pm\epsilon$ would exactly agree. No separate parameter $\sigma$ would be needed.
Alas, insofar as the real world runs on real algebra rather than Grothendieck algebra, we have to keep the numerator $-2\sqrt{c}\,x\,\epsilon^2$ and the denominator $(c + x^2)^2 + \epsilon^2(c - x^2)$. One can choose $c$ to approximate the above green or red chess rating curves in various ways, and then compare the discrepancy for various combinations of $x$ and $\epsilon$. The discrepancies for my “Newtonian” $f_c$ tend to run about twice as great as for the standard curves. That is too bad. But I still wonder whether the above calculation of the prediction discrepancy—and its curious feature that $\epsilon$ enters only through $\epsilon^2$—has further uses.
Open Problems
What will AI be able to tell from our “track records” that we cannot?
Several theories of test-taking postulate a sigmoid relationship between a student’s ability and his/her likelihood $p$ of getting a given exam question right. Changing the difficulty of the question shifts the curve left or right. For a multiple-choice question with $k$ choices the floor might be $\frac{1}{k}$ rather than $0$ to allow for “guessing,” but otherwise similar axioms hold. Inverting the various curves gives a grading rubric for the exam. Do outcomes tend to be bunched toward the middle more than predicted? Are exam “ratings” (that is, grades) robust enough—as chess ratings are—to tell?
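These theories go under the name item response theory; here is a minimal sketch of the standard three-parameter logistic item curve (the parameter names a, b, c are the conventional ones, not from this post):

```python
import math

def p_correct(theta, a=1.7, b=0.0, c=0.25):
    # probability of a correct answer at ability theta:
    # c = guessing floor (1/k for k choices), b = difficulty (shifts the
    # curve left/right), a = discrimination (slope)
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# raising the difficulty b shifts the same sigmoid to the right
print(p_correct(0.0, b=0.0), p_correct(0.0, b=1.0))
```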
Aggregating the curves for various questions on an exam involves computing weighted averages of logistic curves. Is there literature on mathematical properties of the space of such averaged curves? Is there a theory of handling discrepancy terms like mine above?
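One small observation on that question, easy to verify: an equal-weight average of two logistic curves is in general not itself logistic, since its logit fails to be affine in $x$ (a sketch with hypothetical item parameters):

```python
import math

def logistic(x, a, b):
    return 1.0 / (1.0 + math.exp(-a * (x - b)))

def averaged(x):
    # equal-weight mix of two items of different difficulty
    return 0.5 * (logistic(x, 1.0, -2.0) + logistic(x, 1.0, 2.0))

def logit(p):
    return math.log(p / (1.0 - p))

# a true logistic curve has logit(f(x)) affine in x; a nonzero second
# difference over equally spaced points rules that out
ys = [logit(averaged(x)) for x in (0.0, 1.0, 2.0)]
print(ys[0] + ys[2] - 2.0 * ys[1])  # clearly nonzero
```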