
Statistical Modeling, Causal Inference, and Social Science

source link: https://statmodeling.stat.columbia.edu/

In this particular battle between physicists and economists, I’m taking the economists’ side.

Posted by Andrew on 19 December 2020, 9:13 am

Palko writes, “When the arrogance of physicists and economists collide, it’s kind of like watching Godzilla battle Rodan . . . you aren’t really rooting for either side but you can still enjoy the show.”

Hey! Some of my best friends are physicists and economists!

But I know what he’s talking about. Here’s the story he’s linking to:

Everything We’ve Learned About Modern Economic Theory Is Wrong

Ole Peters, a theoretical physicist in the U.K., claims to have the solution. All it would do is upend three centuries of economic thought. . . .

Ole Peters is no ordinary crank. A physicist by training, his theory draws on research done in close collaboration with the late Nobel laureate Murray Gell-Mann, father of the quark. . . .

Peters takes aim at expected utility theory, the bedrock that modern economics is built on. It explains that when we make decisions, we conduct a cost-benefit analysis and try to choose the option that maximizes our wealth.

The problem, Peters says, is the model fails to predict how humans actually behave because the math is flawed. Expected utility is calculated as an average of all possible outcomes for a given event. What this misses is how a single outlier can, in effect, skew perceptions. Or put another way, what you might expect on average has little resemblance to what most people experience.

Consider a simple coin-flip game, which Peters uses to illustrate his point.

Starting with $100, your bankroll increases 50% every time you flip heads. But if the coin lands on tails, you lose 40% of your total. Since you’re just as likely to flip heads as tails, it would appear that you should, on average, come out ahead if you played enough times because your potential payoff each time is greater than your potential loss. In economics jargon, the expected utility is positive, so one might assume that taking the bet is a no-brainer.

Yet in real life, people routinely decline the bet. Paradoxes like these are often used to highlight irrationality or human bias in decision making. But to Peters, it’s simply because people understand it’s a bad deal. . . .

This is not quite a “no-brainer”; it’s kinda subtle, but it’s not so subtle as all that.

Here’s the story. First, yeah, people don’t like uncertainty. This has nothing to do with the economic or decision-theoretic concept of utility, except to remind us that utility theory is a mathematical model that doesn’t always apply to real decisions (see section 5 of this article or further discussion here). Second, from an economics point of view you should take the bet. The expected return is positive and the risk is low. I’m assuming this is 100 marginal dollars for you, not the last $100 in your life. One of the problems with this sort of toy problem is that it’s often not made clear what the money will be used for. There’s a big difference between a middle-class American choosing to wager the $100 in his wallet and a farmer in some third-world country who has only $100 cash, period, which he’s planning to use to buy seed or whatever. Money has no inherent utility; the utility comes from what you’ll buy with it. Third, from an economics point of view maybe you should not take the bet if it requires that you play 50 times in succession, as this can get you into the range where the extra money has strongly declining marginal value. It depends on what the $100 means to you, and also on what $1,000,000 can do for you.
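To make the arithmetic concrete, here’s a quick simulation sketch in R (the number of simulated players is my own arbitrary choice; the start-at-$100, gain-50%, lose-40%, play-50-times setup is from the example above):

set.seed(1)
n_players <- 1e5                 # arbitrary number of simulated players
n_flips   <- 50                  # play the bet 50 times in succession
growth <- matrix(sample(c(1.5, 0.6), n_players * n_flips, replace = TRUE),
                 nrow = n_players)
final <- 100 * apply(growth, 1, prod)
mean(final)        # large: the theoretical mean is 100 * 1.05^50, about $1,150
median(final)      # around $7: the typical player ends up nearly wiped out
mean(final < 100)  # roughly three quarters of the players finish below $100

The mean is dragged up by a few enormous outcomes while the typical (median) player loses badly, which is exactly the gap between the average over players and the experience of any one player that the quoted article is pointing to.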

The above-linked argument refers to “plenty of high-level math,” which is fine—mathematicians need to be kept busy too—but the basic principles are clear enough.

And then there’s this:

Peters asserts his methods will free economics from thinking in terms of expected values over non-existent parallel universes and focus on how people make decisions in this one.

That’s just silly. Economists “focus on how people make decisions” all the time.

That said, economists deserve much of the blame for utility theory being misunderstood by outsiders, just as statisticians deserve much of the blame for misunderstandings about p-values. Both statisticians and economists trained generations of students in oversimplified theories. In the case of statistics, it’s all that crap about null hypothesis significance testing, the idea that scientific theories are statistical models that are true or false and that statistical tests can give effective certainty. In the case of economics, it’s all that crap about risk aversion corresponding to “a declining utility function for money,” which is just wrong (see section 5 of my article linked above). Sensible statisticians know better about the limitations of hypothesis testing, and sensible economists know better about the limitations of utility theory, but that doesn’t always make it into the textbooks.

Also, economists don’t do themselves any favors by hyping themselves, for example by making claims about how they are different “from almost anyone else in society” in their open-mindedness, or by taking commonplace observations about economics as evidence of brilliance.

So, sure, economists deserve some blame, both in their presentations of utility theory and in their smug attitude toward the rest of social science. But they’re not as clueless as implied by the above story. The decision to bet once is not the same as the decision to make a series of 50 or 100 bets, and economists know this. And the above analysis relies entirely on the value of $100, without ever specifying the scenario in which the bet is offered. Economists know, at least when they’re doing serious work, that context matters and the value of money is not defined in a vacuum.

So I guess I’ll have to go with the economists here. It’s the physicists who are being more annoying this time. It’s happened before.

The whole thing is sooooo annoying. Economists go around pushing this silly idea of defining risk aversion in terms of a utility curve for money. Then this physicist comes along and notes the problem, but instead of getting serious about it, he just oversimplifies in another direction, then we get name-dropping of Nobel prize winners . . . ugh! It’s the worst of both worlds. I’m with Peters in his disagreement with the textbook model, but, yeah, we know that already. It’s not a stunning new idea, any more than it would be stunning and new if a physicist came in and pointed out all the problems that thoughtful statisticians already know about p-values.

OK, I guess it would help if, when economists explain how these ideas are not new, they could also apologize for pushing the oversimplified utility model for several decades, which left them open to this sort of criticism from clueless outsiders.

The likelihood principle in model check and model evaluation

Posted by Yuling Yao on 18 December 2020, 2:30 pm

(This post is by Yuling)
The likelihood principle is often phrased as an axiom in Bayesian statistics. It applies when we are (only) interested in estimating an unknown parameter \theta, and there are two data-generating experiments both involving \theta, with observable outcomes y_1 and y_2 and likelihoods p_1(y_1 \vert \theta) and p_2(y_2 \vert \theta). If the outcome-experiment pairs satisfy p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta), viewed as a function of \theta, then the two experiments and the two observations provide the same inferential information about \theta.

Consider a classic example. Someone was running an A/B test and was only interested in the treatment effect, and he told his manager that among all n=10 respondents, y=9 saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability \theta with an independent Bernoulli trial likelihood: y\sim binomial(\theta\vert n=10). Informative priors could be used too, but they are not relevant to the discussion here.

What is relevant is that the manager later found out that the experiment was not done appropriately. Instead of collecting data independently, the experiment was designed to keep recruiting respondents sequentially until y=9 were positive. The actual random outcome is n, while y is fixed. So the correct model is n-y\sim negative binomial(\theta\vert y=9).

Luckily, the likelihood principle kicks in, thanks to the fact that binomial_lpmf(y\vert n, \theta) = neg_binomial_lpmf(n-y\vert y, \theta) + constant. No matter how the experiment was done, the inference remains invariant.

At the abstract level, the likelihood principle says that information about the parameters can be extracted only via the likelihood, not from experiments that could have been done but were not.

What can go wrong in model check

The likelihood is dual-purposed in Bayesian inference. For inference, it is just one component of the unnormalized density. But for model checking and model evaluation, the likelihood function is what enables the generative model to generate posterior predictions of y.

In the binomial/negative binomial example, it is fine to stop at the inference of \theta. But as soon as we want to check the model, we do need to distinguish between the two possible sampling distributions and decide which variable (n or y) is random.

Suppose we observe y=9 positive cases among n=10 trials and plug in the estimate \theta=0.9. The likelihoods of the binomial and negative binomial models are

> y = 9
> n = 10
> dnbinom(n - y, y, 0.9)
[1] 0.3486784
> dbinom(y, n, 0.9)
[1] 0.3874205

Not really identical. But the likelihood principle does not require them to be identical. What is needed is a constant density ratio, and that is easy to verify:

> dnbinom(n-y,y, prob=seq(0.05,0.95,length.out = 100))/dbinom(y,n, prob=seq(0.05,0.95,length.out = 100))

The result is a constant ratio, 0.9.

However, the posterior predictive check (PPC) will have different p-values:

> 1 - pnbinom(n - y, y, 0.9)
[1] 0.2639011
> 1 - pbinom(y, n, 0.9)
[1] 0.3486784

The difference between the PPC p-values can be even more dramatic with another \theta:

> 1 - pnbinom(n - y, y, 0.99)
[1] 0.0042662
> 1 - pbinom(y, n, 0.99)
[1] 0.9043821

Just very different!

Clearly, using the Bayesian posterior of \theta instead of a point estimate does not fix the issue. The problem is that the likelihood principle ensures a constant ratio as a function of \theta, not as a function of y_1 or y_2.
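Here is a small continuation of the same R example to see what that means: the two likelihoods agree up to a constant as functions of \theta, but as sampling distributions over future data they are not proportional at all.

theta <- 0.9; y <- 9; n <- 10
dbinom(0:10, n, theta)    # pmf over the number of successes out of n = 10 trials
dnbinom(0:10, y, theta)   # pmf over the number of failures before y = 9 successes
# The two vectors live on different outcome spaces and are not proportional,
# so posterior predictive replications, and hence PPC p-values, differ.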

Model selection?

Unlike the unnormalized likelihood in the likelihood principle, the marginal likelihood in model evaluation is required to be normalized.

In the previous A/B testing example, given data (y,n), if we know that one and only one of the binomial or negative binomial experiments was run, we may want to do model selection based on the marginal likelihood. For simplicity, consider the point estimate \hat \theta=0.9. Then we obtain a likelihood ratio of 0.9, slightly favoring the binomial model. In fact this marginal likelihood ratio is the constant y/n, independent of the posterior distribution of \theta. If y/n=0.001, we would get a Bayes factor of 1000 favoring the binomial model.

Except it is wrong. It is not sensible to compare a likelihood on y and a likelihood on n.

What can go wrong in cross-validation

CV requires a loss function, and the same predictive density does not imply the same loss (L2 loss, interval loss, etc.). For concreteness, we adopt the log predictive density here.

CV also needs some part of the data to be exchangeable, and that depends on the sampling distribution.

On the other hand, the computed LOO-CV of the log predictive density seems to depend on the data only through the likelihood. Consider two model-data pairs, M1: p_1(\theta\vert y_1) and M2: p_2(\theta\vert y_2). We compute \text{LOOCV}_1= \sum_i \log \int_\theta \frac{ p_\text{post}(\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta)} \left( \int_{\theta'} \frac{ p_\text{post}(\theta'\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta')} d\theta' \right)^{-1} p_1(y_{1i}\vert\theta)\, d\theta, and \text{LOOCV}_2 by replacing all the 1s with 2s.

The likelihood principle does say that p_\text{post}(\theta\vert M_1, y_1)=p_\text{post}(\theta\vert M_2, y_2), and if there is some generalized, pointwise likelihood principle ensuring that p_1(y_{1i}\vert\theta)\propto p_2(y_{2i}\vert\theta) for each i, then \text{LOOCV}_1= \text{constant} + \text{LOOCV}_2.

Sure, but that is an extra assumption. Arguably the pointwise likelihood principle is such a strong assumption that it would hardly hold beyond toy examples.

The basic form of the likelihood principle does not even involve the notion of an individual y_i. It is possible for y_1 and y_2 to have different sample sizes: consider a meta-analysis of many polls, where each poll is a binomial model with y_i\sim binomial(n_i, \theta). If I have 100 polls, I have 100 data points. Alternatively, I can view the data as \sum n_i Bernoulli trials, and the sample size becomes \sum_{i=1}^{100} n_i.
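As a concrete sketch of that two-resolutions-of-the-same-data situation, here is some R code with a conjugate Beta(1,1) prior so the leave-one-out predictive densities are available in closed form (the number of polls and the poll sizes are made up for illustration):

set.seed(2)
J  <- 100
ni <- sample(200:1000, J, replace = TRUE)   # hypothetical poll sizes
yi <- rbinom(J, ni, 0.55)
Y  <- sum(yi); N <- sum(ni)
# Either way, the posterior is Beta(1 + Y, 1 + N - Y): the inference is identical.

# (1) Each poll is one data point: leave-one-poll-out log predictive density,
#     a beta-binomial under the posterior from the other J - 1 polls.
loo_polls <- sum(sapply(1:J, function(i) {
  a <- 1 + (Y - yi[i])
  b <- 1 + (N - ni[i]) - (Y - yi[i])
  lchoose(ni[i], yi[i]) + lbeta(yi[i] + a, ni[i] - yi[i] + b) - lbeta(a, b)
}))

# (2) Each Bernoulli trial is one data point: leave-one-trial-out.
loo_trials <- Y * log(Y / (N + 1)) + (N - Y) * log((N - Y) / (N + 1))

c(loo_polls = loo_polls, loo_trials = loo_trials, n_polls = J, n_trials = N)
# The two LOO scores are sums over different numbers of observations on
# different outcome spaces, so they are not directly comparable.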

Finally, just as in the marginal likelihood case, even if all the conditions above hold, and regardless of the identity, it is conceptually wrong to compare \text{LOOCV}_1 with \text{LOOCV}_2. They are scoring rules on two different spaces (probability measures on y_1 and y_2, respectively) and should not be compared directly.

PPC again

Although it is bad practice, we sometimes compare PPC p-values from two models for the purpose of model comparison. In the y=9, n=10, \hat\theta=0.99 case, we can compute the two-sided p-value min(Pr(y_{sim} > y), Pr(y_{sim} < y)) for the binomial model and min(Pr(n_{sim} > n), Pr(n_{sim} < n)) for the negative binomial model.

> min(pnbinom(n - y, y, 0.99), 1 - pnbinom(n - y, y, 0.99))
[1] 0.0042662
> min(pbinom(y, n, 0.99), 1 - pbinom(y, n, 0.99))
[1] 0.09561792

In the marginal likelihood and log score cases, we know we cannot directly compare two likelihoods or two log scores defined on two different sampling spaces. Here, the p-value is naturally normalized. Does that mean the negative binomial model is rejected while the binomial model passes the PPC?

Still we cannot. We should not compare p-values at all.

Model evaluation on the joint space

To avoid unfair comparisons of marginal likelihoods and log scores across two sampling spaces, one remedy is to consider the product space: both y and n are now viewed as random variables.

The binomial and negative binomial narratives specify two joint models: p(n,y\vert \theta)= 1(n=n_{obs})\, p(y\vert n, \theta) and p(n,y\vert \theta)= 1(y=y_{obs})\, p(n\vert y, \theta).

The ratio of these two densities admits only three values: 0, infinity, or the constant y/n.

If we observe several pairs of (n, y), we can easily decide which margin is fixed. The harder problem is when we observe only one (n, y) pair. Based on the comparisons of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now viewed as a sampling distribution on the product space).

Well, that is almost correct, except that (1) the sample log score is not meaningful if there is only one observation, and (2) we need a prior on the models to go from marginal likelihoods to a Bayes factor. After all, under either sampling model, the event admitting nontrivial density ratios, 1(y=y_{obs}) 1(n=n_{obs}), has zero measure. It is legitimate to do model selection/comparison on the product space, but at that point we could do whatever we want without affecting any property in the almost-sure sense.

Some causal inference

In short, the convenient inference invariance granted by the likelihood principle also makes it hard to practice model selection and model evaluation. Those two tasks rely on the sampling distribution, not just the likelihood.

To make this blog post more confusing, I would like to draw some remote connection to causal inference.

Assume we have data (binary treatment z, outcome y, covariate x) from a known model M1: y = b0 + b1 z + b2 x + iid noise. If the model is correct and there are no unobserved confounders, we estimate the treatment effect of z by b1.

The unconfoundedness assumption requires that y(z=0) and y(z=1) are independent of z given x. This assumption is a statement about the causal interpretation only, and it never appears in the sampling distribution or the likelihood. Suppose there does exist a confounder c, and the true data-generating process is M2: y | (x, z, c) = b0 + b1 z + b2 x + c + iid noise, with z | c = c + another iid noise. Marginalizing out c (because we cannot observe it in data collection), M2 becomes y | x, z = b0 + b1′ z + b2 x + iid noise, which has exactly the same form as M1 (only the coefficient on z changes). Therefore (M1, (x, y, z)) and (M2, (x, y, z)) form an experiment-data pair on which the likelihood principle holds. It is precisely the otherwise lovely likelihood principle that rules out any method for testing the unconfoundedness assumption.
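A minimal simulation sketch of this point, with made-up coefficients: data generated under M2 (with the confounder) are fit perfectly happily by M1, but the coefficient on z no longer equals the causal effect b1, and nothing in the likelihood can tell you.

set.seed(123)
n  <- 1e5
b0 <- 1; b1 <- 2; b2 <- 0.5
x <- rnorm(n)
c <- rnorm(n)                              # the unobserved confounder
z <- c + rnorm(n)                          # treatment depends on c
y <- b0 + b1 * z + b2 * x + c + rnorm(n)   # outcome depends on c too

fit <- lm(y ~ z + x)   # the analyst's model M1 (c is not in the dataset)
coef(fit)["z"]         # about 2.5, not the causal effect b1 = 2
# Residual plots, R^2, information criteria all look fine: the marginal model
# for y given (x, z) really is linear with iid noise, so no likelihood-based
# check can reveal the confounding.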

“Inferring the effectiveness of government interventions against COVID-19”

Posted by Andrew on 18 December 2020, 10:39 am

John Salvatier points us to this article by Jan Brauner et al. that states:

We gathered chronological data on the implementation of NPIs [non-pharmaceutical interventions, i.e. policy or behavioral interventions] for several European, and other, countries between January and the end of May 2020. We estimate the effectiveness of NPIs, ranging from limiting gathering sizes, business closures, and closure of educational institutions to stay-at-home orders. To do so, we used a Bayesian hierarchical model that links NPI implementation dates to national case and death counts and supported the results with extensive empirical validation. Closing all educational institutions, limiting gatherings to 10 people or less, and closing face-to-face businesses each reduced transmission considerably. The additional effect of stay-at-home orders was comparatively small.

This seems similar to the hierarchical Bayesian analysis of Flaxman et al. In their new paper, Brauner et al. write:

Our model builds on the semi-mechanistic Bayesian hierarchical model of Flaxman et al. (1), with several additions. First, we allow our model to observe both case and death data. This increases the amount of data from which we can extract NPI effects, reduces distinct biases in case and death reporting, and reduces the bias from including only countries with many deaths. Second, since epidemiological parameters are only known with uncertainty, we place priors over them, following recent recommended practice (42). Third, as we do not aim to infer the total number of COVID-19 infections, we can avoid assuming a specific infection fatality rate (IFR) or ascertainment rate (rate of testing). Fourth, we allow the effects of all NPIs to vary across countries, reflecting differences in NPI implementation and adherence. . . .

Some NPIs frequently co-occurred, i.e., were partly collinear. However, we were able to isolate the effects of individual NPIs since the collinearity was imperfect and our dataset large. For every pair of NPIs, we observed one without the other for 504 country-days on average (table S5). The minimum number of country-days for any NPI pair is 148 (for limiting gatherings to 1000 or 100 attendees). Additionally, under excessive collinearity, and insufficient data to overcome it, individual effectiveness estimates would be highly sensitive to variations in the data and model parameters (15). Indeed, high sensitivity prevented Flaxman et al. (1), who had a smaller dataset, from disentangling NPI effects (9). In contrast, our effectiveness estimates are substantially less sensitive . . . Finally, the posterior correlations between the effectiveness estimates are weak, further suggesting manageable collinearity.

I don’t have anything really to say about their multilevel Bayesian analysis, because I didn’t look at it carefully. In general I like the Bayesian approach because the assumptions are out in the open and can be examined and changed if necessary.

Also, I didn’t quite catch how they decided which countries to include in their analyses. They have a bunch of countries in Europe and near Europe, two in Africa, three in Asia, one in North America, and one in Oceania. No United States, Japan, or Korea.

The other thing that comes up with all these sorts of analyses is the role of individual behavior. In New York everything pretty much shut down in March. Some of this was due to governmental actions (including ridiculous steps such as the Parks Department padlocking the basketball courts and taking down the hoops), but a lot of it was just individual choices to stay at home. It’s a challenge to estimate the effects of policies when many of the key decisions are made by individuals and groups.

What George Michael’s song Freedom! was really about

Posted by Lizzie on 17 December 2020, 4:08 pm

I present an alternative reading of George Michael’s 1990’s hit song Freedom! While many interpret this song as about Michael’s struggles with fame in an industry that constantly aimed to warp his true identity, it can also be interpreted as a researcher progressing in a field where data ownership and data ‘rights’ are still hotly contested.

Heaven knows I was just a young boy
Didn’t know what I wanted to be
I was every little hungry schoolgirl’s pride and joy
And I guess it was enough for me

In these first lines the researcher describes the heady days of early grad school, where most folks were still in their twenties, enjoying a cadre of new friends, and still not sure of their future.

To win the race, a prettier face
Brand new clothes and a big fat place
On your rock and roll TV
But today the way I play the game is not the same, no way
Think I’m gonna get me some happy

The researcher describes a few things here: moving on to their postdoc, the excitement of their first few conferences where they got to present their exciting new results, but also hints at a big change in how they are approaching science recently. A change that could bring great happiness.

 I think there’s something you should know
(I think it’s time I told you so)
There’s something deep inside of me
(There’s someone else I’ve got to be)
Take back your picture in a frame
(Take back your singing in the rain)
I just hope you understand
Sometimes the clothes do not make the man

The researcher is nervous about what they will share. They know it is not the view of many in the field and they hope others will understand. Even though they may wear the tevas-with-socks and pleated shorts of an ecologist, they have views they fear will not be widely accepted by their community.

All we have to do now
Is take these lies and make them true somehow
All we have to see
Is that I don’t belong to you and you don’t belong to me

Here the researcher sings out their truth! They stare down the lies they have repeatedly heard, including:

  • Data you collected are owned by you and you should hold onto them possessively.
  • If you publish your data or don’t tightly guard it, it will be stolen by others, and then your career may be ruined.
  • People build entire careers on reusing other people’s data and they get more fame and recognition than those who toil away collecting data and are not recognized for their efforts.
  • Your data can never be fully understood without your presence, and thus should probably not be used without you around in some way.
  • If we all publish the data we collected regularly we will never have good data again, and we will never ever have long-term data because people will stop collecting long-term data.

The researcher sings out to their colleagues that these are lies and that for science to progress, data should be free, that data don’t ‘belong’ to any of us. They encourage their colleagues to let go of possessiveness (it never makes you happy!) around data.

Freedom (I won’t let you down)
Freedom (I will not give you up)
Freedom (Gotta have some faith in the sound)
You’ve got to give what you take (It’s the one good thing that I’ve got)
Freedom (I won’t let you down)
Freedom (So please don’t give me up)
Freedom (‘Cause I would really)
You’ve got to give what you take (really love to stick around)

Here they sing out for data freedom (“Freedom!”), alternating with pulls they have felt from colleagues who believe data sharing may destroy the field (“I will not give you [data] up”).

Heaven knows we sure had some fun, boy
What a kick just a buddy and me
We had every big-shot goodtime band on the run, boy
We were living in a fantasy

The researcher again looks back fondly on their PhD, remembering days in the field when they pulled on their waders, grabbed their plastic bucket and collected data (for example, see opening images here), and then published their first exciting papers.

We won the race, got out of the place
Went back home, got a brand new face for the boys on MTV (Boys on MTV)
But today the way I play the game has got to change, oh yeah
Now I’m gonna get myself happy

The researcher remembers wrapping up their PhD, submitting a great Dance Your PhD (here’s a favorite example), and then returns to their refrain on realizing that their field must change. Both for the field and for personal happiness.

The chorus repeats ….

All we have to do now
Is take these lies and make them true somehow
All we have to see
Is that I don’t belong to you and you don’t belong to me
Freedom!
Freedom!
Freedom!
It’s the one good thing that I’ve got

Sing it with me! Data freedom! Data freedom! If you love ‘your’ data set it free!

And if you’re an ecologist or in any similar field with a contingent of folks who speak some of the lies mentioned above, I encourage you to ask for examples. Ask for the list of people whose careers have been ruined by data sharing, ask also for the list of happy people who publish data they collect—and try to actually figure out what the distribution of these ruined versus non-ruined people looks like. If they tell you someday your field will be destroyed, ask them for examples of other fields where people have been made to share data (think GenBank, parts of medicine, please help me expand this list!) and what actually happened.

If—

Posted by Andrew on 17 December 2020, 9:04 am

If you can argue that knuckleheads are rational
And that assholes serve the public good,
But then turn around and tell us we need you
To nudge us as you know we should;
If you can be proud of your “repugnant ideas”
And style yourself a rogue without taboo,
If you assume everyone is fundamentally alike
But circulate among the favored few;

If you can do math—and not make math your master;
If you can analyze—and not make thoughts your aim;
If you can meet with Cost and Benefit
And treat those two impostors just the same;
If you can bear to hear the truth you’ve spoken
Twisted by knaves to make a trap for fools,
Or watch the things you gave your life to, broken,
And stoop and build ’em up with worn-out tools:

If you can make one heap of all your winnings
And risk it on a top-5 journal,
And lose, and start again at your beginnings
And never breathe a word diurnal;
If you can force your models and assumptions
To serve your turn long after they are gone,
And so hold on when there is nothing in you
Except the Will which says to them: ‘Hold on!’

If you can talk with rich people and keep your virtue,
Or walk with NGOs—nor lose the common touch,
If neither data nor experience can hurt your theories,
If all measurements count with you, but none too much;
If you can fill the unforgiving minute
With sixty seconds’ worth of consulting,
Yours is Academia and everything that’s in it,
And—which is more—you’ll be an Economist, my son!

With apologies to you-know-who.

P.S. #NotAllEconomists

Deterministic thinking meets the fallacy of the one-sided bet

Posted by Andrew on 16 December 2020, 9:27 am

Kevin Lewis asked me what I thought of this news article:

Could walking barefoot on grass improve your health? Some science suggests it can. . . .

The idea behind grounding, also called earthing, is humans evolved in direct contact with the Earth’s subtle electric charge, but have lost that sustained connection thanks to inventions such as buildings, furniture and shoes with insulated synthetic soles.

Advocates of grounding say this disconnect might be contributing to the chronic diseases that are particularly prevalent in industrialized societies. There is actually some science behind this. Research has shown barefoot contact with the earth can produce nearly instant changes in a variety of physiological measures, helping improve sleep, reduce pain, decrease muscle tension and lower stress. . . .

One reason direct physical contact with the ground might have beneficial physiological effects is the earth’s surface has a negative charge and is constantly generating electrons that could neutralize free radicals, acting as antioxidants. . . .

Research also suggests physical contact with the Earth’s surface can help regulate our autonomic nervous system . . .

While many clinical studies have demonstrated beneficial physical changes when participants are grounded, studies tend to be small and are done indoors using wires that connect to ground outlets. . . . Still, since being outdoors is proved to be good for you, it probably would not hurt to try it yourself to see if you notice any benefits. So how do you ground? Simply allow your skin to be in contact with any natural conductors of the earth’s electricity, working up to at least 30 minutes at a time (unfortunately, studies do not seem to have addressed how often grounding should occur). . . .

Vagal tone is often assessed by measuring the variation in your heart rate when you breathe in and out, and in one study, grounding was shown to improve heart rate variability and thus vagal tone in preterm infants. In another small study of adults, one two-hour session of grounding reduced inflammation and improved blood flow. . . .

If you have concerns about whether it is sanitary to walk barefoot outside, there are options. Keep a patch of lawn off-limits to your dog. Or put a blanket or towel between your skin and the ground; natural fibers such as cotton and wool do not interfere with grounding. You can even wear leather-soled shoes. . . .

Leather-soled shoes, huh? But no socks, I guess. Unless the socks are made of cotton or wool or, ummm, leather, I guess that would work? Seems like some magical thinking is going on here.

My response is, as always, that these things could help for some people and situations and be counterproductive at other times. I don’t think the deterministic framing is helpful.

Also, there seems to be some admirable skepticism in the following quote:

If you do notice you are more relaxed, or you are sleeping better, or you have less pain or fatigue – is it the grounding or a placebo effect?

But here they’re making the fallacy of the one-sided bet, by implicitly assuming that the effects can only be positive. If grounding has real physical effects, then I doubt they’d always be positive.

P.S. I see lots of silly science stuff every day, including ridiculous claims, not at all supported by data (just a lot of studies with N=40, p=0.04, and more forking paths than a . . . ummm, I dunno what). Just today I came across a published paper that included a pilot study. Fine . . . but don’t’ja know it, they found statistical significance there too and made some general claims. Which is not what a pilot study is for. (But no surprise they found some p-values below 0.05, given the flexibility they had in what to look for.) Anyway, we see this All. The. Time. Even in prestigious journals. I usually don’t bother posting on these bits of routine cargo cult science. When posting, the idea is to make some more general point. As above.

Literally a textbook problem: if you get a positive COVID test, how likely is it that it’s a false positive?

Posted by Phil on 15 December 2020, 7:44 pm

This post is by Phil Price, not Andrew.

This will be obvious to most readers of this blog, who have seen this before and probably thought about it within the past few months, but the blog gets lots of readers and this might be new to some of you.

A friend of mine just tested positive for COVID. But out of every 1000 people who do NOT have COVID, 5 of them will test positive anyway. So how likely is it that my friend’s test is a false positive? If you said “0.5%” then (1) you’re wrong, but (2) you’re in good company, lots of people, including lots of doctors, give that answer. It’s a textbook example of the ‘base rate fallacy.’

To get the right answer you need more information. I’ll illustrate with the relevant real-world numbers.

My friend lives in Berkeley, CA, where, at the moment, about 2% of people who get tested have COVID. That means that when 1000 people get tested, about 20 of them will be COVID-positive and will result in a positive test. But that leaves 980 people who do NOT have COVID, and about 5 of them will test positive anyway (because the false positive rate is 0.5%, and 0.5% of 980 is 4.9). So for every 1000 people tested in Berkeley these days, there are about 25 positives and 5 of those are false positives. Thus there’s about a 5/25 =  1/5 chance that my friend’s positive test is a false positive.
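If you like, here’s the same arithmetic wrapped in a tiny R function (the 100% test sensitivity is the implicit assumption in the numbers above):

p_false_positive <- function(prevalence, false_pos_rate, sensitivity = 1) {
  true_pos  <- prevalence * sensitivity           # true positives per person tested
  false_pos <- (1 - prevalence) * false_pos_rate  # false positives per person tested
  false_pos / (true_pos + false_pos)              # share of positives that are false
}
p_false_positive(prevalence = 0.02, false_pos_rate = 0.005)   # about 0.2, i.e., 1 in 5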

(That’s if we had no other information. In fact we have the additional information that he is asymptomatic, which increases the chance. He still probably has COVID, but it’s very far from the 99.5% chance that a naive estimate would suggest. Maybe more like a 65% chance).

If you think about this issue once, it will be ‘obvious’ for the rest of your life. Of course the answer to the question depends on the base rate! If literally nobody had the virus, then every positive would be a false positive. If literally everybody had the virus, then no positive would be a false positive. So it’s obvious that the probability that a given positive is a false positive depends on the base rate. Then you just have to think through the numbers, which is really easy as I have illustrated above.

Apologies to all of you who have seen this a zillion times. Or twice. 

This post is by Phil.

You can figure out the approximate length of our blog lag now.

Posted by Andrew on 15 December 2020, 9:59 am

Sekhar Ramakrishnan writes:

I wanted to relate an episode of informal probabilistic reasoning that occurred this morning, which I thought you might find entertaining.

Jan 6th is the Christian feast day of the Epiphany, which is known as Dreikönigstag (Three Kings’ Day), here in Zürich, Switzerland, where I live (I work at ETH). There is a tradition to have a dish called three kings’ cake in which a plastic king is hidden in one of the pieces of the cake. Whoever finds the king gets some privileges that day (like deciding what’s for dinner).

Two years ago, the bakery we get our three kings’ cake from decided to put a king in *every* piece of cake. They received many complaints about this, and last year they returned to the normal tradition of one king per cake. Today, we were speculating on whether they were going to try their every-piece-a-king experiment again this year.

My 12-year-old son picked the first piece of cake: he had a king! He said, “It looks like they probably did put a king in every piece again this year.” We had a cake with 5 pieces, so, assuming that one king per cake and five kings per cake are equally likely, I get a posterior probability of 5/6 that there was a king in every piece. I thought it was interesting that my son intuitively concluded that a king in every piece was more likely as well, even though he hasn’t had any formal exposure to statistics or statistical reasoning.

As it turns out, though, there was only one king in the cake — my son just got lucky!

Indeed, there are some settings where probabilistic reasoning is intuitive.
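For the record, the 5/6 in the story is just Bayes’ rule with equal prior probability on the two bakery policies and five pieces in the cake, for example in R:

prior <- c(every_piece = 0.5, one_king = 0.5)   # equal prior on the two policies
lik   <- c(every_piece = 1, one_king = 1/5)     # P(the first piece has a king)
post  <- prior * lik / sum(prior * lik)
post["every_piece"]                             # 5/6, about 0.83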

Debate involving a bad analysis of GRE scores

Posted by Andrew on 14 December 2020, 9:29 am

This is one of these academic ping-pong stories of a general opinion, an article that challenges the general opinion, a rebuttal to that article, a rebuttal to the rebuttal, etc. I’ll label the positions as A1, B1, A2, B2, and so forth:

A1: The starting point is that Ph.D. programs in the United States typically require that applicants take the Graduate Record Examination (GRE) as part of the admissions process.

B1: In 2019, Miller, Zwick, Posselt, Silvestrini, and Hodapp published an article saying:

Multivariate statistical analysis of roughly one in eight physics Ph.D. students from 2000 to 2010 indicates that the traditional admissions metrics of undergraduate grade point average (GPA) and the Graduate Record Examination (GRE) Quantitative, Verbal, and Physics Subject Tests do not predict completion as effectively admissions committees presume. Significant associations with completion were found for undergraduate GPA in all models and for GRE Quantitative in two of four studied models; GRE Physics and GRE Verbal were not significant in any model. It is notable that completion changed by less than 10% for U.S. physics major test takers scoring in the 10th versus 90th percentile on the Quantitative test.

They fit logistic regressions predicting Ph.D. completion given undergraduate grade point average, three GRE scores (Quantitative, Verbal, Physics), indicators for whether the Ph.D. program is Tier 1 or Tier 2, indicators for six different ethnic categories (with white as a baseline), an indicator for sex, and an indicator for whether the student came from the United States. Their results are summarized by the statistical significance of the coefficients for the GRE predictors. Their conclusion:

The weight of evidence in this paper contradicts conventional wisdom and indicates that lower than average scores on admissions exams do not imply a lower than average probability of earning a physics Ph.D. Continued overreliance on metrics that do not predict Ph.D. completion but have large gaps based on demographics works against both the fairness of admissions practices and the health of physics as a discipline.

A2: Weissman responded with an article in the same journal, saying:

A recent paper in Science Advances by Miller et al. concludes that Graduate Record Examinations (GREs) do not help predict whether physics graduate students will get Ph.D.’s. Here, I argue that the presented analyses reflect collider-like stratification bias, variance inflation by collinearity and range restriction, omission of parts of a needed correlation matrix, a peculiar choice of null hypothesis on subsamples, blurring the distinction between failure to reject a null and accepting a null, and an unusual procedure that inflates the confidence intervals in a figure. Release of results of a model that leaves out stratification by the rank of the graduate program would fix many of the problems.

One point that Weissman makes is that the GRE Quantitative and Physics scores are positively correlated, so (a) when you include them both as predictors in the model, each of their individual coefficients will end up with a larger standard error, and (b) the statement in the original article that “completion changed by less than 10% for U.S. physics major test takers scoring in the 10th versus 90th percentile on the Quantitative test” is incorrect: students scoring higher on the Quantitative test will, on average, score higher on the Physics test too, and you have to account for that in making your prediction.
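Here’s a toy simulation of those two points (the numbers are invented; this is not the Miller et al. data or model):

set.seed(7)
n <- 2000
quant    <- rnorm(n)                                   # GRE Quantitative, standardized
physics  <- 0.7 * quant + sqrt(1 - 0.7^2) * rnorm(n)   # GRE Physics, correlated 0.7
complete <- rbinom(n, 1, plogis(0.3 * quant + 0.3 * physics))

round(summary(glm(complete ~ quant + physics, family = binomial))$coefficients, 2)
# Each individual coefficient comes with an inflated standard error because the
# two scores carry overlapping information. And a 10th-vs-90th-percentile
# comparison on the Quantitative score alone implicitly holds the Physics score
# fixed, understating the real-world contrast: students with higher Quantitative
# scores also tend to have higher Physics scores.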

Weissman also points out that adjusting for the tier of the student’s graduate program is misleading: the purpose of the analysis is to consider admissions decisions, but the graduate program is not determined until after admissions. Miller et al. are thus making the mistake of adjusting for post-treatment variables (see section 19.6 of Regression and Other Stories for more on why you shouldn’t do that).
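And here’s an equally artificial sketch of the post-treatment problem: if stronger scores get students into top-tier programs, and being in a top-tier program itself helps completion, then adjusting for tier strips out part of what the score is telling you about eventual completion.

set.seed(42)
n <- 5e4
gre      <- rnorm(n)
tier1    <- rbinom(n, 1, plogis(2 * gre))                   # strong scores -> top-tier program
complete <- rbinom(n, 1, plogis(0.5 * gre + 1.0 * tier1))   # both help completion

coef(glm(complete ~ gre, family = binomial))["gre"]          # total predictive effect, larger
coef(glm(complete ~ gre + tier1, family = binomial))["gre"]  # smaller: conditioning on the
# post-treatment variable removes the part of the score's predictive value that
# operates through program placement.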

At the end of his discussion, Weissman recommends preregistration, but I don’t see the point of that. The analysis made mistakes. If these are preregistered, they’re still mistakes. More relevant would be making the data available. But I guess a serious analysis of this topic would not limit itself to physics students, as I assume the same issues arise in other science Ph.D. programs.

B2: Miller et al. then responded:

We provide statistical measures and additional analyses showing that our original analyses were sound. We use a generalized linear mixed model to account for program-to-program differences with program as a random effect without stratifying with tier and found the GRE-P (Graduate Record Examination physics test) effect is not different from our previous findings, thereby alleviating concern of collider bias. Variance inflation factors for each variable were low, showing that multicollinearity was not a concern.

Noooooooo! They totally missed the point. “Program-to-program differences” are post-treatment variables (“colliders”), so you don’t want to adjust for them. And the issue of multicollinearity is not that it’s a concern with the regression (although it is, as the regression is unregularized, but that’s another story); the problem is in the interpretation of the coefficients.

A3: Weissman posted on Arxiv a response to the response; go to page 18 here.

Weissman doesn’t say this, but what I’m seeing here is the old, old story of researchers making a mistake, getting it into print, and then sticking to the conclusion no matter what. No surprise, really: that’s how scientists are trained to respond to criticism.

It’s sad, though: Miller et al. are physicists, and they have so much technical knowledge, but their statistical analysis is so crude. To try to answer tough predictive questions with unregularized logistic regression and statistical significance thresholding . . . that’s just not the way to go.

Weissman summarizes in an email:

Although the GRE issue is petty in the current context, the deeper issue of honesty and competence in science is as important as ever. Here are some concerns.

1. Science Advances tells me they (and Science?) are eliminating Technical Comments. How will they then deal with pure crap that makes it through fallible peer review?

2. Correct interpretation of observational studies can obviously be a matter of life or death now. The big general-purpose journals need editors who are experts in modern causal inference.

3. Two of the authors of the atrocious papers now serve on a select NAS policy panel for this topic.

4. Honesty still matters.

My take on this is slightly different. I don’t see any evidence for dishonesty here; this just seems like run-of-the-mill incompetence. Remember Clarke’s law. I think calls for honesty miss the point, because then you’ll just get honest but incompetent people who think that they’re in the clear because they’re honest.

Regarding item 3 . . . yeah, the National Academy of Sciences. That’s a political organization. I guess there’s no alternative: if a group has power, then politics will be involved. But, by now, I can’t take the National Academy of Sciences seriously. I guess they’ll do some things right, but you’ll really have to look carefully at who’s on the committee in question.


What do Americans think about coronavirus restrictions? Let’s see what the data say . . .

Posted by Andrew on 13 December 2020, 9:56 am

Back in May, I looked at a debate regarding attitudes toward coronavirus restrictions.

The whole thing was kind of meta, in the sense that rather than arguing about what sorts of behavioral and social restrictions would be appropriate to control the disease at minimal cost, people were arguing about what were the attitudes held in the general population.

It started with this observation from columnist Michelle Goldberg, who wrote:

Lately some commentators have suggested that the coronavirus lockdowns pit an affluent professional class comfortable staying home indefinitely against a working class more willing to take risks to do their jobs. . . . Writing in The Post, Fareed Zakaria tried to make sense of the partisan split over coronavirus restrictions, describing a “class divide” with pro-lockdown experts on one side and those who work with their hands on the other. . . . The Wall Street Journal’s Peggy Noonan wrote: “Here’s a generalization based on a lifetime of experience and observation. The working-class people who are pushing back have had harder lives than those now determining their fate.”

But it seemed that Zakaria and Noonan were wrong. Goldberg continued:

The assumptions underlying this generalization, however, are not based on even a cursory look at actual data. In a recent Washington Post/Ipsos survey, 74 percent of respondents agreed that the “U.S. should keep trying to slow the spread of the coronavirus, even if that means keeping many businesses closed.” Agreement was slightly higher — 79 percent — among respondents who’d been laid off or furloughed. . . .

I followed up with some data from sociologist David Weakliem, who reported on polling data showing strong majorities in both parties in support of coronavirus restrictions. 71% of Republicans and 91% of Democrats thought the restrictions at the time were “appropriate” or “not enough,” with the rest thinking they were “too restrictive.” Weakliem also looked at the breakdown by education and income and reported that “income is similar to education, with lower income people more likely to take both ‘extreme’ positions; non-whites, women, and younger people more likely to say ‘not restrictive enough’ and less likely to say ‘too restrictive’. All of those differences are considerably smaller than the party differences. Region and urban/rural residence seem relevant in principle, but aren’t included in the report.”

I also reported on a data-free assertion from economist Robin Hanson, who wrote, “The public is feeling the accumulated pain, and itching to break out. . . . Elites are now loudly and consistently saying that this is not time to open; we must stay closed and try harder to contain. . . . So while the public will uniformly push for more opening, elites and experts push in a dozen different directions. . . . elites and experts don’t speak with a unified voice, while the public does.” As I said at the time, that made no sense to me as it completely contradicted the polling data, but I think Hanson was going with his gut, or with the Zakarian or Noonanesque intuition he has about how ordinary Americans should feel, if only they were to agree with him.

My post didn’t appear until November, and at that time Hanson responded in comments that “lockdowns soon weakened and ended. So ‘elites’ in the sense of quoted experts and pundits, vs the rest of the society via their political pressures.” But I still don’t buy it. First, it’s not clear that we should take changes in government policy as representative of pressure from the masses. Elites can apply pressure too. To put it another way, if you take various government policies that Hanson doesn’t approve of, I doubt he’d automatically take these as evidence of mass opinion in favor of those policies. He might instead speak of regulatory capture, elite opinion (politicians are, after all, part of the elite), and so on.

Beyond all this, it’s no mystery why restrictions were loosened between May and November. Rates of positive tests were lower during that period, the initial worries about people dying in the street were gone, and the demonstrated effectiveness of lockdowns gave state officials the confidence to reduce restrictions, secure in the understanding that if the number of cases shot up again, restrictions could be re-implemented. Hence from a straight policy perspective, it made sense to reduce restrictions. There’s no need to appeal to a mythical pressure from “the rest of society” or to a battle between elites and others.

That was then, this is now

As noted, I wrote my earlier discussion in May and posted in November. It’s now December, and Weakliem has been back on the job, again studying public opinion on this issue.

Here’s what he reports:

Yes, ‘elites’ support coronavirus restrictions. So do working-class Americans.

Pundits keep insisting — without evidence — that there’s a class divide over reopening

Throughout the pandemic, pundits have often argued that there are substantial class divisions in attitudes about coronavirus-related restrictions. Seeking an explanation for President Trump’s surprisingly strong electoral performance, Will Wilkinson, vice president of policy at the Niskanen Center, wrote in the New York Times last weekend that Republican calls to reopen businesses appealed to “working-class breadwinners who can’t bus tables, process chickens, sell smoothies or clean hotel rooms over Zoom,” but they were “less compelling to college-educated suburbanites, who tend to trust experts and can work from home, watch their kids and spare a laptop for online kindergarten.”

The Wall Street Journal columnist Peggy Noonan made a similar argument in May . . .

But these observations have been based on rough impressions or intuition, rather than evidence. Surveys — whether conducted recently or earlier in the pandemic — don’t show the class divide that some pundits believe is self-evident. . . .

To be sure, most working-class people can’t do their jobs from home, so they suffer a bigger financial loss from shutdowns. It is at least plausible that they might look skeptically on the views of elites and experts. Perhaps working-class people are more fatalistic (or realistic) and think you must accept some risks in life. But you can also think of reasons that middle-class people might oppose restrictions. Middle-class jobs are more likely to allow some distance from co-workers and customers, for example, and middle-class people tend to go out more frequently for dining and entertainment. As a result, they might risk less and gain more from reopening.

That’s why we need data. Although many surveys have asked for opinions of the government’s handling of the pandemic, only a few have asked about restrictions. However, two recent surveys sponsored by Fox News contain a good measure of general attitudes about the issue: “Which of the following do you think should be the federal government’s priority: limiting the spread of coronavirus, even if it hurts the economy, or restarting the economy, even if it increases the risk to public health?” The first survey was conducted Oct. 3-6, when the recent surge in cases was beginning; the second was conducted Oct. 27-29, when it was well advanced. . . .

The reports on the surveys do not have a general breakdown by education, but they do show opinions among White registered voters with and without a college degree. In the first survey, 36 percent of White voters with a college degree — and 37 percent of Whites without one — thought that restarting the economy should be the priority. In the second survey, 43 percent of White college graduates — and 38 percent without a degree — took that position. There is some evidence, in short, that it is White people with degrees who are becoming more anxious to get back to normal: Their support for focusing on the economy rose more between the surveys, while support among Whites without degrees increased less. But the class differences in both surveys were within the margin of error — they could easily be due to chance — so the safest conclusion is that there is no compelling evidence of a class-based divergence of opinion. . . .
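To put rough numbers on “within the margin of error”: the reports don’t give exact subgroup sizes, but if we assume, say, a few hundred White registered voters in each education group (my assumption, not the survey’s), the 43 versus 38 percent comparison is well inside sampling noise:

p1 <- 0.43; p2 <- 0.38   # White voters with / without a college degree, second survey
n1 <- 400;  n2 <- 400    # hypothetical subgroup sizes, not reported in the writeup
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
c(difference = p1 - p2, std_error = se_diff, ratio = (p1 - p2) / se_diff)
# a 5-point gap with a standard error of about 3.5 points, i.e., well within a
# 95% margin of error of roughly 7 points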

Weakliem follows up with more detail in a blog post. tl;dr summary: Pundits really really want to tell a story, even if it’s unsupported by the data. Weakliem does a good dissection of the way that Noonan shifts from “those with power or access to it” to “the top half,” and of her juxtaposition of “figures in government, politics, and the media”—a true elite—with people living in “nice neighborhoods, safe ones,” having a family that functions and kids that go to good schools, which, as Weakliem points out, describes a pretty large part of the population.

I agree with Weakliem that the fact the Pulitzer Prize committee admired this stuff (“beautifully rendered columns that connected readers to the shared virtues of Americans”) is interesting in its own right.

But, again, the big story is that policies and attitudes on coronavirus-motivated restrictions have shifted over time in response to changes in actual and perceived risks, and the public is not as divided on the issue as one might think from some news reports. There are some differences by political party, but not much going on when comparing different income and education levels, which is something that pundits maybe don’t want to hear because it gets in the way of their Pulitzer prize-winning stories or edgy hot takes.


17 state attorney generals, 100 congressmembers, and the Association for Psychological Science walk into a bar

Posted by Andrew on 12 December 2020, 9:00 am

I don’t have much to add to all that’s been said about this horrible story.

The statistics errors involved are pretty bad—actually commonplace in published scientific articles, but mistakes that seem recondite and technical in a paper about ESP, say, or beauty and sex ratio, become much clearer when the topic is something familiar such as voting. Rejecting an empty null hypothesis is often an empty exercise—but it’s obviously empty when the hypothesis is that votes at different times are random samples from a common population. This sort of thing might get you published in PNAS and featured in Freakonomics and NPR but it has no place in serious decision making.

In any case, statistics is hard, even if you’re not an oncologist, so you can’t blame people—even Williams College math professors or former program directors at Harvard—for getting things wrong. I’d like to blame these people for being so clueless as to not realize how clueless they are, but perhaps years of being deferred to by business clients and fawning students has made them overestimate their competence.

I can, however, blame the attorney generals of 17 states, and the 100+ members of the U.S. Congress, for signing on to this document—after it had been ridiculed in every corner of the internet. To sign on to something when you know it’s wrong—that’s the kind of behavior we expect from the Association for Psychological Science, not from the political leaders of the greatest country on earth.

Back when the Soviet scientific establishment was extolling Lysenko and the Soviet political establishment was following Stalin’s dictates, at least they had the excuse that their lives and livelihoods were at stake. I don’t think this is the case for tenured psychology professors or those 17 state attorney generals and 100 congressmembers. Even if they get primaried, I’m sure they can get good jobs as lawyers or lobbyists or whatever. No, they’re just endorsing something they know is wrong because politics. As a political scientist, I can try to understand this. But I don’t have to like it.

P.S. I can’t bring myself to say “attorneys general,” any more than I can bring myself to say “whom.” It’s a ticket to Baaaath kind of thing.

P.P.S. But I disagree with Senator Chris Murphy who described the recent attempts to reverse the election as “the most serious attempt to overthrow our democracy in the history of our country.” It’s not as serious as when all those southern states stopped letting black people vote, right? Also I disagree with former congressmember Joe Scarborough, who wrote that “a party that elevated Donald Trump from Manhattan’s class clown to the U.S. presidency no longer has any use for the likes of Kirk, Edmund Burke or William F. Buckley.” I don’t know about Kirk or Burke, but Buckley was a big supporter of Joe McCarthy, who had a lot in common with Trump. The recent post-election statistical clown show is far from the worst behavior seen in American electoral politics, but there’s something horrifying about it, in that it is outrageous and at the same time has been anticipated for months. Kind of like when you’re watching a glass slowly fall off the table but you can’t seem to get your hand there in time to catch it.

P.P.P.S. Yet another thing that’s going on here, I think, is the tradition in this country of celebrating lawyers and advocates who can vigorously argue a case in spite of the facts. Consider Robert Kardashian, Al Sharpton, etc. Senator Ted Cruz offered to argue that latest case in front of a court. Cruz has to know that the argument is b.s., but in this country, lawyers who argue cases that they know are wrong are not just tolerated but sometimes celebrated. I guess Cruz’s implicit case, which he could never say in public, is that that consultant’s argument is beyond stupid, but who cares because it serves the larger cause, which is his political party and his political movement. This is similar to the Soviet scientists supporting Lysenko on the implicit grounds that socialism is more important than scientific accuracy, or the Association for Psychological Science supporting the publication of bad science on the implicit grounds that the most important thing for psychology is NPR appearances, Ted talks, and the production of tenure-track jobs.


“I Can’t Believe It’s Not Better”

Posted by Andrew on 11 December 2020, 9:42 am

Check out this session Saturday at Neurips. It’s a great idea, to ask people to speak on methods that didn’t work. I have a lot of experience with that!

Here are the talks:

Max Welling: The LIAR (Learning with Interval Arithmetic Regularization) is Dead

Danielle Belgrave: Machine Learning for Personalised Healthcare: Why is it not better?

Michael C. Hughes: The Case for Prediction Constrained Training

Andrew Gelman: It Doesn’t Work, But The Alternative Is Even Worse: Living With Approximate Computation

Roger Grosse: Why Isn’t Everyone Using Second-Order Optimization?

Weiwei Pan: What are Useful Uncertainties for Deep Learning and How Do We Get Them?

Charline Le Lan, Laurent Dinh: Perfect density models cannot guarantee anomaly detection

Fan Bao, Kun Xu, Chongxuan Li, Lanqing Hong, Jun Zhu, Bo Zhang. Variational (Gradient) Estimate of the Score Function in Energy-based Latent Variable Models

Emilio Jorge, Hannes Eriksson, Christos Dimitrakakis, Debabrota Basu, Divya Grover. Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning

Tin D. Nguyen, Jonathan H. Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick. Independent versus truncated finite approximations for Bayesian nonparametric inference

Ricky T. Q. Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig. Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering

Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, John Patrick Cunningham. Uses and Abuses of the Cross-Entropy Loss: Case Studies in Modern Deep Learning

P.S. The name of the session is a parody of a slogan from a TV commercial from my childhood. When I was asked to speak in this workshop, I was surprised that they would use such an old-fashioned reference. Most Neurips participants are much younger than me, right? I asked around and was told that the slogan has been revived recently in social media.

IEEE’s Refusal to Issue Corrections

Posted by Jessica Hullman on 10 December 2020, 5:27 pm

This is Jessica. The following was written by a colleague, Steve Haroz, about his attempt to correct a paper he wrote that was published by IEEE (which, according to Wikipedia, publishes “over 30% of the world’s literature in the electrical and electronics engineering and computer science fields”).

One of the basic Mertonian norms of science is that it is self-correcting. And one of the basic norms of being an adult is acknowledging when you make a mistake. As an author, I would like to abide by those norms. Sadly, IEEE conference proceedings do not abide by the standards of science… or of being an adult.

Two years ago Robert Kosara and I published a position paper titled, “Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them”, in the proceedings of “Evaluation and Beyond – Methodological Approaches for Visualization”, which goes by “BELIV”. It describes a collection of problems with studies, how they may arise, and measures to mitigate them. It broke down threats to validity from data collection, analysis mistakes, poorly formed research questions, and a lack of replication publication opportunities. There was another validity threat that we clearly missed… a publisher that doesn’t make corrections.

Requesting to fix a mistake

A few months after the paper was published, a colleague, Pierre Dragicevic, noticed a couple problems. We immediately corrected and annotated them on the OSF postprint, added an acknowledgment to Pierre, and then sent an email to the paper chairs summarizing the issues and asking for a correction to be issued.

Dear organizers of Evaluation and Beyond – Methodological Approaches for Visualization (BELIV),

This past year, we published a paper titled “Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them”. Since then, we have been made aware of two mistakes in the paper:

  1. The implications of a false positive rate

In section 3.1, we wrote:

…a 5% false positive rate means that one out of every 20 studies in visualization (potentially several each year!) reports on an effect that does not exist.

But a more accurate statement would be:

…a 5% false positive rate means that one out of every 20 non-existent effects studied in visualization (potentially several each year!) is incorrectly reported as being a likely effect.

  2. The magnitude of p-values

In section 3.2, we wrote:

…p-values between 0.1 and 0.5 are actually much less likely than ones below 0.1 when the effect is in fact present…

But the intended statement was:

…p-values between 0.01 and 0.05 are actually much less likely than ones below 0.01 when the effect is in fact present…

As the main topic of the paper is the validity of research publications, we feel that it is important to correct these mistakes, even if seemingly minor. We have uploaded a new version to OSF with interactive comments highlighting the original errors (https://osf.io/f8qey/). We would also like to update the IEEE DL with the version attached. Please let us know how we can help accomplish that.

Thank you,

Steve Haroz and Robert Kosara

Summary of what we wanted to fix

  1. We should have noted that the false positive rate applies to non-existent effects. (A sloppy intro-to-stats level mistake.)
  2.  We put some decimals in the wrong place. (It probably happened when hurriedly moving from a Google doc to latex right before the deadline.)

We knew better than this, but we made a couple mistakes. They’re minor mistakes that don’t impact conclusions, but mistakes nonetheless. Especially in a paper that is about the validity of scientific publications, we should correct them. And for a scientific publication, the process for making corrections should be in place.
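To see the distinction concretely, here is a minimal simulation sketch (not part of the paper or the correction; the sample size and effect size are arbitrary choices for illustration). It shows both corrected statements in action: the 5% false positive rate describes only studies of effects that do not exist, and when an effect does exist, p-values below 0.01 are more common than p-values between 0.01 and 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n = 10_000, 50   # 10,000 simulated studies, 50 observations each (arbitrary)

    def sim_pvalues(true_effect):
        # Two-sided one-sample t-test p-values for studies of an effect of the
        # given size (in standard-deviation units).
        x = rng.normal(true_effect, 1, size=(n_sims, n))
        return stats.ttest_1samp(x, 0, axis=1).pvalue

    # Correction 1: the 5% false positive rate applies to non-existent effects only.
    p_null = sim_pvalues(0.0)
    print("null effects flagged at p < .05:", np.mean(p_null < 0.05))   # about 0.05

    # Correction 2: when an effect is present, p < .01 is more common than .01 <= p < .05.
    p_alt = sim_pvalues(0.5)
    print("p < .01:        ", np.mean(p_alt < 0.01))
    print(".01 <= p < .05: ", np.mean((p_alt >= 0.01) & (p_alt < 0.05)))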

Redirected to IEEE

The paper chairs acknowledged receiving the email but took some time to get back to us. Besides the email arriving during everyone’s summer vacation, there was apparently no precedent for requesting a corrigendum (corrections for mistakes made by the authors) at this publication venue, so they needed a couple of months to figure out how to go about it. Here is what IEEE eventually told them:

Generally updates to the final PDF files are not allowed once they are posted in Xplore. However, the author may be able to add an addendum to address the issue. They should contact [email protected] to make the request. 

So we contacted that email address and after a month and a half got the following reply:

We have received your request to correct an error in your work published in the IEEE Xplore digital library. IEEE does not allow for corrections within the full-text publication document (e.g., PDF) within IEEE Xplore, and the IEEE Xplore metadata must match the PDF exactly.  Unfortunately, we are unable to change the information on your paper at this time.  We do apologize for any inconveniences this may cause.

This response is absurd. For any publisher of scientific research, there is always some mechanism for corrigenda. But IEEE has a policy against it.

Trying a different approach

I emailed IEEE again asking how this complies with the IEEE code of ethics:

I am surprised by this response, as it does not appear consistent with the IEEE code of ethics (https://www.ieee.org/about/corporate/governance/p7-8.html), which states that IEEE members agree:

“7 … to acknowledge and correct errors…”

I would appreciate advice on how we can comply with an ethical code that requires correcting errors when IEEE does not allow for it. 

And one of the BELIV organizers, to their credit, backed us up by replying as well:

As the organizer of the scientific event for which the error is meant to be reported, […] I am concerned about the IEEE support response that there are NO mechanisms in place to correct errors in published articles. I have put the IEEE ethics board in the cc to this response and hope for an answer on how to acknowledge and correct errors as an author of an IEEE published paper.

The IEEE ethics board was CCed, but we never heard from them. However, we did hear from someone involved in “Board Governance & Intellectual Property Operations”:

IEEE conference papers are published as received. The papers are submitted by the conference organizers after the event has been held, and are not edited by IEEE. Each author assumes complete responsibility for the accuracy of the paper at the time of publication. Each conference is considered a stand-alone publication and thus there is no mechanism for publishing corrections (e.g., in a later issue of a journal). The conference proceedings serves as a ‘snapshot’ of what was distributed at the conference at the time of presentation and must remain as is. IEEE will make metadata corrections (misspelled author name, affiliation, etc) in our database, but per IEEE Publications policy, we do not edit a published PDF unless the PDF is unreadable. 

That said, any conference author who identifies an error in their work is free to build upon and correct a previously published work by submitting to a subsequent conference or journal. We apologize for any inconvenience this may cause.

The problem with IEEE’s suggestion

Rather than follow the norm of scientific publishing and even its own ethics policies, IEEE suggests that we submit an updated version of the paper to another conference or journal. This approach is unworkable for multiple reasons:

1) It doesn’t solve the problem that the incorrect statements are available and citable.

Keeping the paper available potentially spreads misinformation. In our paper, these issues are minor and can be checked via other sources. But what if they substantially impacted the conclusions? This year, IEEE published a number of papers about COVID-19 and pandemics. Are they saying that one of these papers should not be corrected even if the authors and paper chairs acknowledge they include a mistake? 

2) A new version would be rejected for being too similar to the old version.

According to IEEE’s policies, if you update a paper and submit a new version, it must include “substantial additional technical material with respect to the … articles of which they represent an evolution” (see IEEE PSPB 8.1.7 F(2)). Informally, this policy is often described as meaning that papers need 30% new content to be publishable. But some authors have added entire additional experiments to their papers and gotten negative reviews about the lack of major improvements over previous publications. In other words, minor updates would get rejected. And I don’t see any need to artificially inflate the paper with 30% more content just for the heck of it.

It could even be rejected for self-plagiarism unless we specifically cite the original paper somehow. What a great way to bump up your h-index! “And in conclusion, as we already said in last year’s paper…”

3) An obnoxious amount of work for everyone involved.

The new version would need to be handled by a paper chair (conference) or editor (journal), assigned to a program committee member (conference) or action editor (journal), have reviewers recruited, be reviewed, have a meta-review compiled, and be discussed by the paper chairs or editors. What a blatant disregard for other people’s time.

The sledgehammer option

I keep cringing every time I get a Google Scholar alert for the paper. That’s not a good place to be. I looked into options for retracting it, but IEEE doesn’t seem very interested in retracting papers that make demonstrably incorrect statements or that incorrectly convey the authors’ intent:

Under an extraordinary situation, it may be desirable to remove access to the content in IEEE Xplore for a specific article, standard, or press book. Removal of access shall only be considered in rare instances, and examples include, but are not limited to, a fraudulent article, a duplicate copy of the same article, a draft version conference article, a direct threat of legal action, and an article published without copyright transfers. Requests for removal may be submitted to the Director, IEEE Publications. Such requests shall identify the publication and provide a detailed justification for removing access.  -IEEE PSPB 8.1.11-A

So attempting to retract is unlikely to succeed. Also, there’s no guarantee that we would not get accused of self-plagiarism if we retracted it and then submitted the updated version. And really, it’d be such a stupid way to fix a minor problem. I don’t have a better word to describe this situation. Just stupid.

Next steps

  1. Robert and I ask any authors who would cite our paper to cite the updated OSF version. Please do not cite the IEEE version. You can find multiple reference formats on the bottom right of the OSF page.
  2. This policy degrades the trustworthiness and citability of papers in IEEE conference proceedings. And any authors who have published with IEEE would be understandably disturbed by IEEE denigrating the reliability of their work. What if a paper contained substantial errors? And what if it misinformed and endangered the public? It is difficult to see these proceedings as any more trustworthy than a preprint. At least preprints have a chance of authors updating them. So use caution when reading or citing IEEE conference proceedings, as the authors may be aware of errors but unable to correct them.
  3. IEEE needs to make up its mind. It could decide to label conference proceedings as in-progress work and allow them to be republished elsewhere. However, if updated versions of conference papers cannot be resubmitted due to lack of novelty or “self-plagiarism”, IEEE needs to treat these conference papers the way that scientific journals treat their articles. In other words, if IEEE is to be a credible publisher of scientific content, it needs to abide by the basic Mertonian norm of enabling correction and the basic adult norm of acknowledging and correcting mistakes.

What about this idea of rapid antigen testing?

Posted by Andrew on 10 December 2020, 9:48 am

So, there’s this idea going around that seems to make sense, but then again if it makes so much sense I wonder why they’re not doing it already.

Here’s the background. A blog commenter pointed me to this op-ed from mid-November by Michael Mina, an epidemiologist and immunologist who wrote:

Widespread and frequent rapid antigen testing (public health screening to suppress outbreaks) is the best possible tool we have at our disposal today—and we are not using it.

It would significantly reduce the spread of the virus without having to shut down the country again—and if we act today, could allow us to see our loved ones, go back to school and work, and travel—all before Christmas.

Antigen tests are “contagiousness” tests. They are extremely effective (>98% sensitive compared to the typically used PCR test) in detecting COVID-19 when individuals are most contagious. Paper-strip antigen tests are inexpensive, simple to manufacture, give results within minutes, and can be used within the privacy of our own home . . .

If only 50% of the population tested themselves in this way every 4 days, we can achieve vaccine-like “herd effects” . . . Unlike vaccines, which stop onward transmission through immunity, testing can do this by giving people the tools to know, in real-time, that they are contagious and thus stop themselves from unknowingly spreading to others.

Mina continues:

The U.S. government can produce and pay for a full nation-wide rapid antigen testing program at a minute fraction (0.05% – 0.2%) of the cost that this virus is wreaking on our economy.

The return on investment would be massive, in lives saved, health preserved, and of course, in dollars. The cost is so low ($5 billion) that not trying should not even be an option for a program that could turn the tables on the virus in weeks, as we are now seeing in Slovakia—where massive screening has, in two weeks, completely turned the epidemic around.

The government would ship the tests to participating households and make them available in schools or workplaces. . . . Even if half of the community disregards their results or chooses to not participate altogether, outbreaks would still be turned around in weeks. . . .

The sensitivity and specificity of these tests has been a central debate – but that debate is settled. . . . These tests are incredibly sensitive in catching nearly all who are currently transmitting virus. . . .

But wait—if this is such a great idea, why isn’t it already happening here? Mina writes:

The antigen test technology exists and some companies overseas have already produced exactly what would work for this program. However, in the U.S., the FDA hasn’t figured out a way to authorize the at-home rapid antigen tests . . . We need to create a new authorization pathway within the FDA (or the CDC) that can review and approve the use of at-home antigen testing . . . Unlike vaccines, these tests exist today—the U.S. government simply needs to allocate the funding and manufacture them. We need an upfront investment of $5 billion to build the manufacturing capacity and an additional $10 billion to achieve production of 10-20 million tests per day for a full year. This is a drop in the bucket compared to the money spent already and lives lost due to COVID-19. . . .

I read all this and wasn’t sure what to think. On one hand, it sounds so persuasive. On the other hand, lots of tests are being done around here and I haven’t heard of these rapid paper tests. Mina talks about at-home use, but I haven’t heard about these tests being given at schools either. Also, Mina talks about the low false-positive rate of these tests, but I’d think the big concern would be false negatives. Also, it’s hard to believe that there’s this great solution and it’s only being done by two countries in the world (Britain and Slovakia). You can’t blame the FDA bureaucracy for things not happening in other countries, right?

Anyway, I wasn’t sure what to think so I contacted my epidemiologist colleague Julien Riou, who wrote:

I think the idea does make sense from a purely epi side, even though the author appears extremely confident in something that has basically never been done (but maybe that’s what you need to do to be published in Time magazine). In principle, rapid antigen testing every 4 days (followed by isolation of all positive cases) would probably reduce transmissibility enough if people are relatively compliant and if the sensitivity is high. The author is quick to dismiss the issue of sensitivity, saying:

People have said these tests aren’t sensitive enough compared to PCR. This simply is not true. It is a misunderstanding. These tests are incredibly sensitive in catching nearly all who are currently transmitting virus. People have said these tests aren’t specific enough and there will be too many false positives. However, in most recent Abbott BinaxNOW rapid test studies, the false positive rate has been ~1/200.

Looking at the paper the author himself links (link), the sensitivity of the Abbott BinaxNOW is “93.3% (14/15), 95% CI: 68.1-99.8%”. I find it a bit dishonest not to present the actual number (he even writes “>98%” somewhere else, without a source so I couldn’t check) and deflect on specificity which is not the issue here (especially if there is a confirmation with RT-PCR). The authors of the linked paper even conclude that “this inherent lower sensitivity may be offset by faster turn-around, the ability to test more frequently, and overall lower cost, relative to traditional RT-PCR methods”. Fair enough, but far from “these tests are incredibly sensitive” in the Time piece.

Two more points on the sensitivity of rapid antigen tests. First, it is measured with the RT-PCR as the reference, and we know that the sensitivity of RT-PCR itself is not excellent. There are a lot of papers on that; I randomly picked this one, where the sensitivity is measured at 82.2% (95%CI 79.0-85.1%) for RT-PCR in hospitalised people. This should be combined with that of rapid antigen testing if you assume both tests are independent. Of course there is a lot more to say about this: sensitivity probably depends on who is tested, when, and whether there are symptoms, and both tests are probably not independent. Still, I think it’s worth mentioning, and again far from “these tests are incredibly sensitive”. Second, the sensitivity is measured in lab conditions, and while I don’t have a lot of experience with this I doubt that you can expect everyone to use the test perfectly. And on top of that, people might not comply with isolation (especially if they have to work) and logistics problems are likely to occur.

Even with all these caveats, I think that this mass testing strategy might be sufficient to curb cases if we can pull it off. Combined with contact tracing, social distancing, masks and all the other control measures in place in most of the world, being able to identify and isolate even a small proportion of infectious cases that you wouldn’t see otherwise can be very helpful. We’ll soon be able to observe the impact empirically in Slovakia and Liverpool.
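To make Riou’s arithmetic concrete, here is a back-of-envelope sketch using only the numbers quoted above and the independence assumption he flags as questionable; treat it as illustration, not as an estimate.

    # Combine the antigen test's sensitivity (measured against RT-PCR) with the
    # sensitivity of RT-PCR itself, assuming the two tests miss infections
    # independently -- an assumption Riou explicitly questions.
    sens_antigen_vs_pcr = 14 / 15   # 93.3% for BinaxNOW in the linked study
    sens_pcr = 0.822                # 82.2% for RT-PCR in the cited paper

    sens_vs_all_infections = sens_antigen_vs_pcr * sens_pcr
    print(f"implied sensitivity vs. all infections: {sens_vs_all_infections:.1%}")   # ~76.7%

    # The offsetting factor mentioned in the linked paper: with testing every few
    # days, someone missed once gets another chance while still contagious.
    missed_twice = (1 - sens_vs_all_infections) ** 2
    print(f"chance of being missed on two consecutive tests: {missed_twice:.1%}")    # ~5.4%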

So, again, I’m not sure what to think. I’d think that even a crappy test if applied widely enough would be better than the current setting in which people use more accurate tests but then have to wait many days for the results. Especially if the alternative is some mix of lots of people not going to work and to school and other people, who do have to go to work, being at risk. On the other hand, some of the specifics in that above-linked article seem fishy. But maybe Riou is right that this is just how things go in the mass media.

“It’s turtles for quite a way down, but at some point it’s solid bedrock.”

Posted by Andrew on 9 December 2020, 7:35 pm

Just once, I’d like to hear the above expression.

It can’t always be turtles all the way down, right? Cos if it was, we wouldn’t need the expression. Kind of like if everything was red, we wouldn’t need any words for colors.

What are the most important statistical ideas of the past 50 years?

Posted by Andrew on 9 December 2020, 9:53 am

Aki and I wrote this article, doing our best to present a broad perspective.

We argue that the most important statistical ideas of the past half century are: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. These eight ideas represent a categorization based on our experiences and reading of the literature and are not listed in chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics. We discuss common features of these ideas, how they relate to modern computing and big data, and how they might be developed and extended in future decades.

An earlier version of this paper appeared on Arxiv but then we and others noticed some places to fix it, so we updated it.

Here are the sections of the paper:

1. The most important statistical ideas of the past 50 years

1.1. Counterfactual causal inference
1.2. Bootstrapping and simulation-based inference
1.3. Overparameterized models and regularization
1.4. Multilevel models
1.5. Generic computation algorithms
1.6. Adaptive decision analysis
1.7. Robust inference
1.8. Exploratory data analysis

2. What these ideas have in common and how they differ

2.1. Ideas lead to methods and workflows
2.2. Advances in computing
2.3. Big data
2.4. Connections and interactions among these ideas
2.5. Theory motivating application and vice versa
2.6. Links to other new and useful developments in statistics

3. What will be the important statistical ideas of the next few decades?

3.1. Looking backward
3.2. Looking forward

The article was fun to write and to revise, and I hope it will motivate others to share their views.

The p-value is 4.76×10^−264 1 in a quadrillion

Posted by Andrew on 8 December 2020, 9:51 pm

Ethan Steinberg writes:

It might be useful for you to cover the hilariously bad use of statistics used in the latest Texas election lawsuit.

Here is the raw source, with the statistics starting on page 22 under the heading “Z-Scores For Georgia”. . . .

The main thing about this analysis that’s so funny is that the question itself is so pointless. Of course Hillary’s vote count is different from Joe’s vote count! They were different candidates! Testing the null hypothesis is really pointless and it’s expected that you would get such extreme z-scores. I think this provides a good example of how statistics can be misused and it’s funny to see this level of bad analysis in a high level legal filing.

Here’s the key bit:

[Screenshot: excerpt from the expert report’s “Z-Scores For Georgia” calculations.]

There are a few delightful—by which I mean, horrible—items here:

First off, did you notice how he says “In 2016, Trump won Georgia” . . . but he can’t bring himself to say that Biden won in 2020? Instead, he refers to “The Biden and Trump percentages of the tabulations.” So tacky. Tacky tacky tacky. If you want to maintain uncertainty, fine, but then refer to “the Clinton and Trump percentages of the tabulations” in 2016.

Second, the binomial distribution makes no sense here. This corresponds to a model in which voters are independently flipping coins (approximately; not quite coin flips because the probability isn’t quite 50%) to decide how to vote. That’s not how voting works. Actually, most voters know well ahead of time who they will be voting for. So even if you wanted to test the null hypothesis of no change (which, as my correspondent noted above, you don’t), this would be the wrong model to use.
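To see why the coin-flip model produces such extreme z-scores, here is a toy calculation with made-up round numbers (not the figures from the filing):

    import math

    # Treat each of 5 million voters as an independent coin flip with p = 0.5
    # (the model implicit in the report; numbers are round figures for illustration).
    n_voters = 5_000_000
    p = 0.5
    se = math.sqrt(p * (1 - p) / n_voters)   # binomial standard error of a vote share
    print(f"implied standard error of a vote share: {se:.5f}")   # about 0.00022

    # Under this (wrong) model, even an ordinary 2-point swing between elections
    # with different candidates yields an astronomical z-score:
    shift = 0.02
    print(f"z-score for a 2-point shift: {shift / se:.0f}")      # about 89

Any real difference between two elections, however mundane, blows past any conventional significance threshold under this model, which is exactly why the test tells us nothing.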

Third . . . don’t you love that footnote 3? Good to be educating the court on the names of big powers of ten. Next step, the killion, which, as every mathematician knows, is a number so big it can kill you.

Footnote 3 is just adorable.

What next, a p-value of 4.76×10^−264?

The author of the expert report is a Charles J. Cicchetti, a Ph.D. economist who has had many positions during his long career, including “Deputy Directory of the Energy and Environment Policy Center at the John F. Kennedy School of Government at Harvard University.”

The moral of the story is: just because someone was the director of a program at Harvard University, or a professor of mathematics at Williams College, don’t assume they know anything at all about statistics.

The lawsuit was filed by the State of Texas. That’s right, Texas tax dollars were spent hiring this guy. Or maybe he was working for free. If his consulting fee was $0/hour, that would still be too high.

Given that the purpose of this lawsuit is to subvert the express will of the voters, I’m glad they hired such an incompetent consultant, but I feel bad for the residents of Texas that they had to pay for it. But, jeez, this is really sad. Even sadder is that these sorts of statistical tests continue to be performed, 55 years after this guy graduated from college.

P.S. The lawsuit has now been supported by 17 other states. There’s no way they can believe these claims. This is serious Dreyfus-level action. And I’m not talking about Amity Beach.

Postdoc at the Polarization and Social Change Lab

Posted by Andrew on 8 December 2020, 1:48 pm

Robb Willer informs us that the Polarization and Social Change Lab has an open postdoctoral position:

The Postdoctoral Associate will be responsible for co-designing and leading research projects in one or more of the following areas: political polarization; framing, messaging, and persuasion; political dimensions of inequality; social movement mobilization; and online political behavior.

This looks super-interesting!

Also, the lab is at Stanford, so maybe they could do some local anthropology and study what’s going on with the Hoover Institution.

“A better way to roll out Covid-19 vaccines: Vaccinate everyone in several hot zones”?

Posted by Andrew on 8 December 2020, 11:08 am

Peter Dorman writes:

This [by Daniel Teres and Martin Strossberg] is an interesting proposal, no? Since vaccines are being rushed out the door with limited testing, there’s a stronger than usual case for adaptive management: implementing in a way that maximizes learning. I [Dorman] suspect there would also be large economies in distribution if localities were the units of sequencing rather than individuals. It would be useful to hear from your readers what they think a good distribution-cum-research-design plan would look like.

In the article, Teres and Strossberg write:

Vaccines are on the brink of crossing the finish line of approval, but the confusion surrounding the presidential transition has brought great uncertainty to the distribution plan.

The National Academies of Sciences, Engineering, and Medicine developed an ethical framework for equitable distribution of Covid-19 vaccines, as have others. But national plans based on these frameworks are problematic. They recommend giving the vaccine first to Phase 1a front line high-risk health workers and first responders. That stretches the supply chain to include workers in every hospital, nursing home, long-term care facility, as well as all ambulance, fire rescue, and police first responders. . . .

We propose a different approach: target several hot zones with high numbers of Covid-19 cases, especially those zones with rising Covid-19 hospitalization rates. . . . Vaccination within each hot zone would begin with Phase 1a individuals and then move on quickly through Phases 1b, 2, 3, and 4. . . .

We believe that our approach offers the best way to break the chain of transmission. The first 60 days will be key in showing the results. Our plan has many advantages over the “phase” plan proposed by the National Academies.

This seems reasonable to me. On the other hand, I don’t know anything about this, and I’m easily persuaded. What do youall think?

Covid crowdsourcing

Posted by Andrew on 7 December 2020, 7:06 pm

Macartan Humphreys writes:

We put together a platform that lets researchers contribute predictive models of cross national (and within country) Covid mortality, focusing on political and social accounts.

The plan then is to aggregate using a stacking approach.

Go take a look.
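For readers who haven’t seen stacking: the idea is to give each contributed model a nonnegative weight, with the weights summing to 1 and chosen to optimize held-out predictive performance, so that better models get more weight and redundant models get little. Here is a minimal sketch under assumptions of my own (squared-error loss rather than the log score, and simulated data standing in for cross-validated predictions); the platform’s actual aggregation may differ.

    import numpy as np
    from scipy.optimize import minimize

    def stacking_weights(preds, y):
        # preds: (n_obs, n_models) held-out predictions from each contributed model
        # y:     (n_obs,) observed outcomes (e.g., Covid mortality)
        n_models = preds.shape[1]
        loss = lambda w: np.mean((preds @ w - y) ** 2)
        result = minimize(
            loss,
            x0=np.full(n_models, 1 / n_models),             # start from equal weights
            bounds=[(0, 1)] * n_models,                     # nonnegative weights
            constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],  # sum to 1
        )
        return result.x

    # Made-up example: three models whose predictions differ only in noise level.
    rng = np.random.default_rng(1)
    y = rng.normal(size=200)
    preds = np.column_stack([y + rng.normal(0, s, size=200) for s in (0.5, 1.0, 2.0)])
    print(stacking_weights(preds, y).round(2))   # the least noisy model gets the most weight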

