
September 23, 2010

Fetishizing p-Values

Posted by Tom Leinster

The first time I understood the problem was when I read this:

P values are not a substitute for real measures of effect size, and despite its popularity with researchers and journal editors, testing a null hypothesis is rarely the appropriate model in science [long list of references, from 1969 onwards]. In natural populations, the null hypothesis of zero differentiation is virtually always false, and if sample size is large enough, this can be demonstrated with any desired degree of statistical significance.

(Lou Jost, Molecular Ecology 18 (2009), 2088–2091.)

If Jost’s criticism is valid, it’s shockingly important. He is attacking what is perhaps the most common use of statistics in science.

This goes as follows:

  • Formulate a null hypothesis.
  • Collect some data.
  • Perform a statistical test on the data.
  • Conclude that if the null hypothesis is correct, the probability of obtaining data as extreme as, or more extreme than, yours is less than 0.01 (say). That’s a p-value of 0.01, or, sloppily, 99% certainty that the null hypothesis is false.
  • Trumpet the high statistical significance of your conclusion.

Low p-values are taken to indicate high certainty, and they’re used all over science. And that, Jost claims, is a problem. I’ll explain that further in a moment.

Now there’s a whole book making the same point: The Cult of Statistical Significance, by two economists, Stephen T. Ziliak and Deirdre N. McCloskey. You can see their argument in this 15-page paper with the same title. Just because they’re economists doesn’t mean their prose is sober: according to one subheading, ‘Precision is Nice but Oomph is the Bomb’.

So, is this the scandalous state of affairs that Jost, Ziliak and McCloskey seem to believe?

First I’ll explain what I believe the basic point to be. Take a coin from your pocket. A truly unbiased coin has never been made. In particular, your coin is biased. That means that if you toss your coin enough times, you can demonstrate with 99.99% certainty that it is biased. Or, to speak properly, you can reject the null hypothesis that the coin is fair with a p-value of 0.0001.

Does that matter? No. The coin in your pocket is almost certainly as fair as it needs to be for the purposes of making coin-toss-based decisions. The precision of the conclusion has nothing to do with the magnitude of the effect.
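
To make this concrete, here is a minimal simulation sketch (in Python; the 50.5% bias and the sample sizes are made up for illustration) of how the p-value of a fairness test can be driven as low as you like just by tossing more, while the effect size stays tiny:

```python
# Minimal sketch: a nearly fair coin (assumed bias 0.505) fails the fairness
# test at any desired significance level once n is large enough, even though
# the effect size (about 0.005) never changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
bias = 0.505  # hypothetical true probability of heads

for n in [1_000, 100_000, 10_000_000]:
    heads = rng.binomial(n, bias)
    effect = heads / n - 0.5                 # observed departure from fairness
    se = 0.5 / np.sqrt(n)                    # std error of the proportion under H0
    z = effect / se
    p_value = 2 * stats.norm.sf(abs(z))      # two-sided normal approximation
    print(f"n = {n:>10,}   effect = {effect:+.4f}   p-value = {p_value:.1e}")
```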

Coins don’t matter, but drug trials do. Suppose that you and I are each trying to develop a drug to speed up the healing of broken legs. You do some trials and run some statistics and conclude that your drug works—in other words, you reject the null hypothesis that patients taking your drug heal no faster than a control group. I do the same for my drug. Your p-value is 0.1 (a certainty of 90%), and mine is 0.001 (a certainty of 99.9%).

What does this tell us about the comparative usefulness of our drugs? Nothing. That’s because we know nothing about the magnitude of the effect. I can now reveal that my wonder drug for broken legs is… an apple a day. Who knows, maybe this does have a minuscule positive effect; for the sake of argument, let’s say that it does. As with the coin example, given enough time to conduct trials, I can truthfully claim any degree of statistical significance I like.

The danger is that someone who buys into the ‘cult of statistical significance’ might simply look at the numbers and say ‘90% is less than 99.9%, so the second drug must be better’.

Ziliak and McCloskey’s book is reviewed in the Notices of the AMS (fairly positive), the Times Higher Education (positive but not so detailed), and an economics-ish blog called PanCrit (‘a poorly argued rant’). Olle Häggström, author of the Notices review, takes up the story:

A major point in The Cult of Statistical Significance is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect sizeless science.

[…]

In one study, they have gone over all of the 369 papers published in the prestigious journal American Economic Review during the 1980s and 1990s that involve regression analysis. In the 1980s, 70 percent of the studied papers committed sizeless science, and in the 1990s this alarming figure had increased to a stunning 79 percent.

This analysis backs up Jost’s swipe at journal editors.

(How many academics would be brave enough to criticize journal editors like that? It may be significant that Jost, whose work I’ve enjoyed very much and rate very highly, is not employed by a university.)

So what should scientists be doing? Jost makes a brief suggestion:

The important scientific question is the real magnitude of the differentiation, not the smallness of the P value (which confounds the magnitude of the effect with the sample size). Answering the question is a matter of parameter estimation, not hypothesis testing. In this approach, the final result should be an estimate of a meaningful measure of the magnitude of differentiation, accompanied by a confidence interval that describes the statistical uncertainty in this estimate. If the confidence interval includes zero, then the null hypothesis cannot be rejected. If the confidence interval does not include zero, then not only can we reject the null hypothesis, but we can have an idea of whether the real magnitude of the differentiation is large or small.

It seems to me that the basic point made by Jost and by Ziliak and McCloskey has to be right. But I’m not very knowledgeable about statistics or its application to science, so I’m less sure that it’s as common a fallacy as they suggest. Maybe you think they’re attacking a straw man. And maybe there are differences between the positions of Jost on the one hand, and Ziliak and McCloskey on the other, that I’m not detecting. I’d be interested to hear from those who know more about this than I do.

Posted at September 23, 2010 5:44 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2281

47 Comments & 0 Trackbacks

Re: Fetishizing p-Values

Do you have a direct link to the paper by Jost that you quote at the start of the post? I can’t find any journal called Molecular Biology that was at volume 18 in 2009.

Posted by: Ben Martin on September 23, 2010 7:23 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Sorry, Ben: my mistake. That should be Molecular Ecology. I’ve changed the text above now, so as not to waste anyone else’s time.

The journal seems to be having a server problem at the moment, but this link to the paper briefly worked for me, then stopped working. (Moreover, I’m sure it’s behind a paywall.) The journal homepage is here.

Posted by: Tom Leinster on September 23, 2010 7:35 PM | Permalink | Reply to this

Re: Fetishizing p-Values

That’s great Tom, thanks. Now I need to figure out if we’ve got a subscription. :)

Posted by: Ben Martin on September 23, 2010 8:15 PM | Permalink | Reply to this

Re: Fetishizing p-Values

(The reader can decide for themselves whether “sour grapes” are a significant component in this comment.)

Firstly, I’ll call this a “collective self-deception” rather than a “fallacy”, since I think that’s what it is. I also think “fetishising” is the wrong word, although I can’t think of a better one.

Secondly, the quote

the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance

strikes me as being “politely inaccurate”: in an experimental science, if you don’t demonstrate “statistical significance” your paper/study is pretty much guaranteed to be rejected by the journal/government licensing agency (delete as appropriate), whereas it’s relatively rare to be rejected on the basis of effect magnitude. It’s also somewhat difficult for a reviewer to decide what size of effect ought to be important: I know I’ve reviewed papers (my field’s computer engineering, not medicine though) where I’ve thought “This seems like an awful, awful lot of work for not very much benefit” without feeling that this was a strong enough feeling for me to reject a piece of work. As such, it’s “to be expected” that researchers are going to care about statistical significance and much less about magnitude. (For those who aren’t involved in experimental fields, the amount of hassle and work even if you don’t do any of the data collection yourself means you’re strongly motivated to do everything that’s not fraudulent to produce a publishable paper; if you’re involved in the “back-breaking months” of data collection yourself you’re even more strongly motivated.)

The other really big “collective deception” that crops up in engineering is essentially the use of percentages in prediction/classification without reference to the number/type of “candidates”. E.g., suppose the security services’ human intelligence can classify 80 percent of “terrorist/non-terrorist” cases correctly; if I can classify 98 percent correctly by analysing everyone’s text messages, that sounds like I’m doing better. But say the security services only look at 50 thousand people, whereas, because I’m analysing say 3 million people’s text messages, I’m actually making a much bigger number of errors (and possibly I’m doing better because I’ve got a lot more easy cases, although this is not guaranteed and the technique might be better for “genuine” reasons). This case is fairly obvious, but it can be much more obscure. (This is related to, but distinct from, the “base-rate fallacy” in medicine.) In this and in p-values, nothing that is being said is wrong, but the more nuanced full picture is not being highlighted: it’s unfortunately the case that reviewers prefer a paper making a binary “this is better” claim that they can accept or reject rather than one that claims it’s incomparable to existing work.
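
A back-of-envelope version of this, using the made-up numbers above (purely illustrative):

```python
# Back-of-envelope version of the point above: accuracy alone hides the
# absolute number of errors. All figures are the made-up ones from the text.
scenarios = {
    "human intelligence: 80% correct on 50,000 people": (0.80, 50_000),
    "text-message analysis: 98% correct on 3,000,000 people": (0.98, 3_000_000),
}
for name, (accuracy, population) in scenarios.items():
    errors = (1 - accuracy) * population
    print(f"{name} -> about {errors:,.0f} misclassifications")
```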

Regarding the suggestion of estimating parameters, it clearly involves a shift towards a more model-based interpretation (since the model of what you’re measuring the parameter of is likely to be more restricted than the model that you’re using to model the null hypothesis). As such, there’s a need to discuss what assumptions and “conceptual framework” lie behind the model, but that’s actually a good thing. But I’d have to think further about what effects it might have in practice.

Posted by: bane on September 23, 2010 10:31 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Thanks for your thoughts, bane. You wrote:

my field’s computer engineering, not medicine though

Although Jost is primarily an ecologist, and Ziliak and McCloskey are primarily economists, I think they’re both making points about the whole of science (and indeed non-sciences that use this kind of statistics). So as I understand it, your experiences are directly relevant to the point they’re making.

I find what you wrote in your last paragraph especially interesting. At an abstract level, I suppose it shouldn’t be surprising: Jost argues for a less simplistic approach (involving the magnitude of effect as well as its statistical significance), so perhaps it’s inevitable that, as you say, the whole framework becomes more complex.

Posted by: Tom Leinster on September 23, 2010 10:57 PM | Permalink | Reply to this

Re: Fetishizing p-Values

For those who aren’t involved in experimental fields, the amount of hassle and work even if you don’t do any of the data collection yourself means you’re strongly motivated to do everything that’s not fraudulent to produce a publishable paper; if you’re involved in the “back-breaking months” of data collection yourself you’re even more strongly motivated.

Well, sadly, one word should be enough to show that the above quote doesn’t hold for all ‘scientists’:

Stapel

Posted by: Kim on December 21, 2011 2:26 PM | Permalink | Reply to this

Re: Fetishizing p-Values

A simple example of fetishizing p-Values is any radio piece or second-hand popular science article that says scientists have found “a significant effect.” It’s very easy to hear the word significant as if it means big enough to matter.

Of course, the fact that one hears and sees this often in popular science coverage doesn’t actually prove that people are fooled by it… much…

Posted by: Steve Witham on September 23, 2010 11:17 PM | Permalink | Reply to this

Re: Fetishizing p-Values

People have a hard time understanding “significant” when it does matter. Just look at the mess caused when climate scientist Phil Jones said there was no significant global warming for the last 15 years. What got repeated was that global warming was not significant, even though Jones went on to clarify that the 0.12 °C increase per decade was near but not quite 95% significant, and that 15 years is often too short an analysis period to yield significant results.

Posted by: Rod McGuire on September 23, 2010 11:58 PM | Permalink | Reply to this

Re: Fetishizing p-Values

This depends on your area’s standard of confidence.

If you are looking at it as a physics theory, it is somewhat lacking. A few months back a paper was released that showed that attribution to AGW is now possible at the 2 sigma level. (I’ll omit the reference since I don’t have it handy, but I can find it on request, I think. Interestingly, IIRC, it answered Jones’s outstanding question of whether there are possible natural causes with “no”, by testing the naturalness (consistency) of the models’ parameters. AGW is now a necessary component.)

This is however well inside the certainty of medical diagnoses of similarly complicated natural systems called “humans” for symptoms correlated with “sickness”. As Jones admits, he and the IPCC are satisfied with such certainty. In fact, the IPCC 2007 report had 80% as its overall estimate, and only 60% on similar modelling as above, I’m reasonably sure. I think it is a reasonable judgment, but if not, I’d say it’s roughly even odds we hit 3 sigma by the IPCC 2014 report with the above trend figures.

Now, it is even worse if you take a diagnosed patient and treat them with a medicine. A 60–80% diagnosis meets an individual response of a 40–80% full cure, say (but at a good p-level :-D). Ouch! Luckily the Earth is a bit better characterized!

Posted by: Torbjörn Larsson, OM on September 25, 2010 9:35 AM | Permalink | Reply to this

Re: Fetishizing p-Values

I’m just an econ grad student, but after a few semesters of stats and econometrics, I’m pretty confident that I understand what a p-value is (and I’m sure most economists do as well). Thus it seems to me that these kinds of criticisms are (at best) overstated. Regarding McCloskey and Ziliak’s book, you may find interesting a (rather critical) reply by Hoover and Siegler.

About the topic itself: it is true that a p-value does not measure the magnitude of the effect (but then, anyone who has taken at least one course in statistics should know that); that’s why you usually also provide a numerical estimate of the effect. What a p-value really measures is whether the signal is strong enough given the uncertainty in the data - and if it is not, we cannot be sure whether it is real, or just a random fluke. Therefore, it seems reasonable that statistical significance should be an important criterion - of course, by itself it does not guarantee that your result is important, but it is something like a necessary condition.

And while I didn’t find the criticism of p-values convincing, the proposed “solution” (focusing on confidence interval and whether it includes zero) does not even make sense: it provides no new information, because the confidence interval and statistical significance are closely related. For example, if you are measuring some variable, and your null hypothesis is that its mean is zero, then you reject H0 at the 95% level if the sample average is at least (approximately) twice as big as its standard error. On the other hand, the 95% confidence interval for the mean will be the sample average plus/minus twice the standard error - so if your result is significant, the confidence interval will not cross zero, and vice versa!
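
A small sketch of the duality Ivan describes, with simulated data (the t-based test and interval below are the standard textbook ones; all numbers are made up):

```python
# Sketch of the point above: the 95% confidence interval excludes zero
# exactly when the two-sided t-test rejects H0: mean = 0 at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # hypothetical sample

mean = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))
t = mean / se                                  # test statistic for H0: mean = 0
p_value = 2 * stats.t.sf(abs(t), df=len(x) - 1)

half_width = stats.t.ppf(0.975, df=len(x) - 1) * se
ci = (mean - half_width, mean + half_width)

print(f"estimate = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4f}")
print("CI excludes zero:", ci[0] > 0 or ci[1] < 0, "| reject at 5% level:", p_value < 0.05)
```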

Posted by: Ivan Sutoris on September 24, 2010 1:28 AM | Permalink | Reply to this

Re: Fetishizing p-Values

the proposed “solution” (focusing on confidence interval and whether it includes zero) does not even make sense: it provides no new information, because the confidence interval and statistical significance are closely related

I think the final sentence of the quote addresses this:

If the confidence interval does not include zero, then not only can we reject the null hypothesis, but we can have an idea of whether the real magnitude of the differentiation is large or small.

Indeed, whether or not the confidence interval includes zero contains essentially the same information as a p-value — but giving the whole confidence interval contains strictly more information. E.g. a confidence interval of 0.0001 to 0.0002 shows a very small effect (though statistically significant), while a confidence interval of 0.73 to 0.74 shows a more substantial effect. (All depending on the units, of course, but you get the point.)

(Although, having been mostly converted to Bayesianism myself, I would be inclined to say that a “confidence interval” is a confused frequentist notion and what should really be reported is the posterior probability distribution of the parameter being estimated.)
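
A toy sketch of the Bayesian alternative Mike mentions, assuming (my choice, purely for illustration) a uniform Beta(1, 1) prior on a coin’s bias:

```python
# Toy illustration of reporting a posterior distribution: with a Beta(1, 1)
# (uniform) prior on a coin's bias and binomial data, the posterior is
# Beta(1 + heads, 1 + tails), and one can report the whole distribution
# (or any summary of it) instead of a p-value.
from scipy import stats

heads, tails = 5300, 4700          # hypothetical data
posterior = stats.beta(1 + heads, 1 + tails)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
print("P(bias > 0.5 | data):", 1 - posterior.cdf(0.5))
```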

Posted by: Mike Shulman on September 24, 2010 6:09 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Another Bayesian. Which of the 46656 varieties are you?

On the subject of confidence intervals, a classic text is by the inimitable Edwin Jaynes:

“Confidence Intervals vs. Bayesian Intervals”, in Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, W. L. Harper and C. A. Hooker (eds.), D. Reidel, Dordrecht, 1976, pg. 175. [8Mb pdf].

Posted by: David Corfield on September 24, 2010 9:03 AM | Permalink | Reply to this

Re: Fetishizing p-Values

That’s a cracking article, David (the one by Jaynes). I wish I had time to read it. I began, but it started to have the same effect on me as a really good page-turner of a novel… and I’m really supposed to be doing something else.

Thanks, anyway!

Posted by: Tom Leinster on September 24, 2010 4:28 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Well, not being a practicing statistician/probabilist, I haven’t really had occasion to classify myself in most of those categories. Thanks for the link; I’ll put it on my reading list.

Posted by: Mike Shulman on September 24, 2010 9:30 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Indeed, whether or not the confidence interval includes zero contains essentially the same information as a p-value — but giving the whole confidence interval contains strictly more information.

Of course, that is true. But you will get the same information also if you report the estimated value of an effect and its p-value (or the value and its standard error, or the value and its t-statistic). So if the whole point of this controversy is that you should report the size of the effect in addition to its significance, it is a rather trivial one… and is it really a problem at all? Can’t speak for other sciences, but in economics, empirical papers include pages of tables with regression estimates, almost always reporting both the size and significance of coefficients.

Posted by: Ivan Sutoris on September 24, 2010 9:43 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Ivan Sutoris wrote:

So if the whole point of this controversy is that you should report the size of the effect in addition to its significance, it is a rather trivial one… and is it really a problem at all?

Yes, that’s the whole point (it seems to me); no, it’s not trivial; and yes, it’s a serious problem. See the aforementioned The Cult of Statistical Significance for evidence (contrary to your ‘everyone I know knows this already’ claim) that most papers across disciplines don’t report the size, and that, in making policy decisions, people care about significance at the cost of size. (E.g., something with a significance/p-value as high as p=0.2 but a very large size would be entirely ignored, when it shouldn’t be.)

Posted by: svat on September 24, 2010 12:13 PM | Permalink | Reply to this

Re: Fetishizing p-Values

“what should really be reported is the posterior probability distribution of the parameter being estimated”

Yes. If we just made everyone into Bayesians, these problems would go away. No-one would care about an arbitrary 5% significance; they would simply care about the probability distribution of outcomes given intervention, and outcomes given no intervention. Once they had that, one would ask harder questions about what counts as a “big” difference and what doesn’t, for which one would need an analysis of the utility difference between the two outcome distributions.

Posted by: Todel on October 16, 2011 12:43 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Thanks, Ivan. No need to be humble about your status; we all have different backgrounds here, and you’re probably better-placed to discuss Ziliak and McCloskey’s book than most of us.

It’s interesting that Hoover and Siegler—in the paper you link to—question the accuracy of Ziliak and McCloskey’s figures on how widespread a sin this is. (It doesn’t bode well for their own accuracy that they misspell McCloskey’s first name in the abstract and keywords, but let’s be generous and blame an editor for that.) Of course I haven’t gone through the data myself. I’d like to see comparable figures for disciplines other than economics.

You wrote:

it is true that a p-value does not measure the magnitude of the effect (but then, anyone who has taken at least one course in statistics should know that)

I think Jost, Ziliak and McCloskey would completely agree that anyone who has taken at least one course in statistics should know that. They’re pointing out, open-mouthed, that this incredibly basic mistake is being made on a massive scale, including by many people who should know much, much better. Bane used the term ‘collective self-deception’; one might go further and say ‘mass delusion’. It’s a situation where a fundamental mistake has become so ingrained in how science is done that it’s hard to get your paper accepted if you don’t perpetuate that mistake.

That last statement is probably putting it too strongly, but as I understand it, the point they’re making is along those lines.

Posted by: Tom Leinster on September 24, 2010 12:53 PM | Permalink | Reply to this

Re: Fetishizing p-Values

They’re pointing out, open-mouthed, that this incredibly basic mistake is being made on a massive scale, including by many people who should know much, much better.

Actually I think that McCloskey and Ziliak not only criticize this misunderstanding of p-values, but have some deeper reservations about statistical significance testing itself, and that animosity is then carried over to their research on the actual practices of economists. Regarding these actual practices, whether the confusion between statistical and economic significance really happens on a massive scale is an empirical question, and their main argument here is a survey of a couple hundred articles published in the AER.

In their critique, Hoover and Siegler argue that this survey is flawed, based on arbitrary, subjective and questionable criteria. I admit I haven’t read McCloskey and Ziliak’s book itself, so maybe I am missing some of their better arguments, but Hoover and Siegler’s critique seems reasonable and well argued to me. Anyway, if anybody is interested, McCloskey and Ziliak wrote a reply (I tried to read it myself, but the ratio of rhetoric to factual arguments is too high for my taste).

Posted by: Ivan Sutoris on September 24, 2010 4:17 PM | Permalink | Reply to this

Re: Fetishizing p-Values

This is actually a big issue in natural language processing, but sort of from the other direction. In any large-scale NLP project you’re going to be training your models on hundreds of thousands to millions of words. This means that, according to standard statistical significance tests, the results of large-scale NLP research are almost always “significant”. And yet, a minor change of genre or a move to a different language is often enough to completely change the results of an experiment. So you never see p-values in NLP, in part because it highlights this issue that you can get whatever precision you like without it meaning a thing.

So it ends up that the only thing that matters is the magnitude of an effect because with a large enough effect, chances are that the effect is real rather than being a fluke of having chosen a particular genre of text to work with. Unfortunately, there are no hard guidelines on what magnitude counts as “good”. It varies dramatically by task, and every task has a non-linear relationship where +1% accuracy is worthless if the baseline is 60%, 70%, or 80% but it’s great when the baseline is 97%. Of course, when you’re talking about translating a million words, 99.9% accuracy means you’re still making an error every page or two; which isn’t really anywhere near human proficiency.

Posted by: wren ng thornton on September 24, 2010 8:17 AM | Permalink | Reply to this

Re: Fetishizing p-Values

That sounds like the p-value critique in Ioannidis’s seminal PLoS paper “Why most published research findings are false”.

http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

This _appears_ to be a separate issue from what Tom Leinster talks about. It’s very important in molecular biology. Basically, you have two groups of people, one ill, one healthy, and you look at half a million genes and proteins in each person. You’re going to come up with, say, a gene that is present only in the ill group and conclude that it’s causing the disease. But of course with half a million genes to choose from you’ll have several that are distributed like this in the groups just by chance.

The p-value actually warns you about this. It says that the conclusion could be wrong 5% of the time. A 5% chance and half a million chances to get it wrong give you 25,000 false positives and maybe 1 true result, so you get swamped in false positives.
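
A minimal simulation of the effect Robert describes (the group sizes and everything else here are made up): half a million genes with no real difference still yield tens of thousands of “significant” hits at the 5% level.

```python
# Sketch of the multiple-testing problem above: 500,000 genes with no true
# group difference still produce roughly 0.05 * 500,000 = 25,000 hits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n_per_group = 500_000, 20

ill = rng.normal(size=(n_genes, n_per_group))      # no true difference at all
healthy = rng.normal(size=(n_genes, n_per_group))
p_values = stats.ttest_ind(ill, healthy, axis=1).pvalue

print("genes 'significant' at p < 0.05:", int((p_values < 0.05).sum()))
```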

Posted by: Robert on September 24, 2010 1:36 PM | Permalink | Reply to this

Re: Fetishizing p-Values

This is, as one comment says, an old animosity against tests. Here it is shored up by confusing the good principle with the bad practice. Ironically, they seem to admit the good principle by suggesting another test.

As for the drug vs drug “apple” example, if you are fishing for effects by lowering the sensitivity (observed magnitude) until you equivocate, that is one problem. But to then reverse and claim that differences are equivocated is different. To sort out a real drug’s causative effect you would do a parameter study before the epidemic study, not because you are comparing to other drugs primarily but to test the drug and dosage as such. (Robert mentions a similar problem.)

Only if the drug passes as an actual drug can you start comparing.

Posted by: Torbjörn Larsson, OM on September 25, 2010 9:09 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Torbjörn wrote:

This is, as one comment says, an old animosity against tests. Here it is shored up by confusing the good principle with the bad practice.

It’s certainly old. As it says in the opening quote, Jost’s references go back to 1969. And Ziliak and McCloskey’s hero, making a similar point, was Gosset (‘Student’), who died in 1937. But being an old point doesn’t make it wrong.

What ‘good principle’ and ‘bad practice’ do you mean?

As for the drug vs drug “apple” example, if you are fishing for effects by lowering the sensitivity (observed magnitude) until you equivocate, that is one problem. But to then reverse and claim that differences are equivocated is different.

I don’t understand what you mean. Perhaps you’re using ‘equivocate’ in some technical sense that I don’t know. Could you explain?

I hope it’s clear that the drug example was just for the sake of making a general point. I wasn’t saying that real-life drug trials are that simple.

Posted by: Tom Leinster on September 25, 2010 2:58 PM | Permalink | Reply to this

Re: Fetishizing p-Values

I have not read the whole post and comments yet, but here are a couple of cents (for the meantime): In a clinical trial, it is not sufficient to show a statistically significant difference between two treatments (e.g. experimental drug and placebo) to claim efficacy of the treatment. One also needs to show that the difference is clinically meaningful.

The usual practice is to determine what effect DELTA would be clinically meaningful, and then test the null hypothesis that the difference is smaller than DELTA vs. the alternative that the effect is equal to DELTA or larger.

Posted by: Joseph Levy on September 26, 2010 12:32 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Thanks: I don’t know much about clinical trial design.

What is DELTA?

Without knowing what DELTA is, I think I can still formulate an objection to the “usual practice” that you describe in your last paragraph. Suppose you have two drugs, and you conduct some trials. Here are various possible outcomes, and possible problems that they raise.

  1. Outcome: for both drugs, the result is statistically significant, but drug A is only just over the “clinically meaningful” threshold, whereas drug B is way better than the threshold. Problem: the practice you describe is a simple yes/no procedure, and therefore fails to reflect that difference.
  2. Outcome: for both drugs, the result is statistically significant, but drug A is just under the “clinically meaningful” threshold, and drug B is just over. Problem: even though there’s only a small difference between the performances of the two drugs, one gets a “yes” and the other a “no”.
  3. Outcome: for drug A the results are only just statistically significant, and it’s only just over the “clinically meaningful” threshold; for drug B, the results are not quite statistically significant, but it’s much much better than the threshold.

In cases 1 and 2 it’s the yes/no nature of the procedure that causes the trouble. There’s a simple remedy: always state the magnitude of the effect. But the question is whether people actually do this in practice.

Case 3 is more subtle. It’s not just a matter of the inherent problems with yes/no tests (that there’s a sudden change at the borderline). Something deeper, involving precision vs. magnitude, seems to be going on.

Posted by: Tom Leinster on September 26, 2010 10:36 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Firstly, my remarks upthread were purely about the case of publishing results about a single “new process” that, whilst statistically significant, have a “practically non-useful” magnitude. This thread seems to also be discussing comparisons between two or more new processes, sometimes in invalid ways: this is something that wasn’t in what I was talking about, especially not things that are actually wrong. (The “collective self-deception” is essentially that, since statistical significance is the dominant criterion that will be applied in reviewing, that’s genuinely the most important thing to think about.)

Now, I’m a bit confused about your point 3, which implies that everyone has to act on the basis of the statistics of the currently performed trials. Part of the reason that you’re using statistics is that you expect there will be both random and “unknown systematic” contamination of the basic mechanisms you are observing. If you can convince whoever controls the money/time/etc that it’s justified (not necessarily based on rigorous statistics), you can probably both attempt to see if you can spot systematic patterns in your data (eg, this drug appears to cause problems in diabetics, so it’s not a drug for diabetics anymore, and new trials can exclude potential candidates with diabetes) and conduct a new study. In as much as you probably have theoretical reasons for believing your treatment should work, it’s a rational thing to do (even if it’s not formalised in a Bayesian way). The reason for having statistics is to provide convincing evidence based on as few theoretical assumptions as possible (eg, for drug licensing agencies). [Apart from anything else, your “new process” may be beneficial even though you’re actually wrong about why.]

(I actually think a Bayesian process is better, but when dealing with “community standards” things change slowly.)

Posted by: bane on September 27, 2010 12:25 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Tom wrote:

What is DELTA?

I imagine it’s just a variable standing for the change in some real-valued quantity. People like to use the letter $\Delta$ to stand for ‘difference’.

For example, if we’re trying to test a hair tonic, $\Delta$ could be the area of previously bald scalp covered in newly grown hair, measured in square centimeters.

This makes Joseph’s comment make sense to me:

The usual practice is to determine what effect DELTA would be clinically meaningful, and then test the null hypothesis that the difference is smaller than DELTA vs. the alternative that the effect is equal to DELTA or larger.

So, for example, we might decide that the effect of a hair tonic is clinically meaningful if at least one square centimeter of previously bald scalp gets covered with newly grown hair. Then our null hypothesis would be $\Delta < 1$ and the alternative would be $\Delta \ge 1$.
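
A sketch of that kind of test with made-up hair-tonic numbers (the 1 cm² threshold and the data below are purely illustrative):

```python
# Sketch of testing against a clinically meaningful threshold rather than
# zero: H0 is "mean regrowth < 1 cm^2", tested with a one-sided t-test.
# All numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
regrowth = rng.normal(loc=1.4, scale=0.8, size=40)   # cm^2 per patient

threshold = 1.0
t = (regrowth.mean() - threshold) / (regrowth.std(ddof=1) / np.sqrt(len(regrowth)))
p_value = stats.t.sf(t, df=len(regrowth) - 1)        # one-sided: mean >= threshold

print(f"mean regrowth = {regrowth.mean():.2f} cm^2, p-value vs threshold = {p_value:.4f}")
```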

Posted by: John Baez on September 27, 2010 11:00 AM | Permalink | Reply to this

Biomedical issues; Re: Fetishizing p-Values

In my experience with publishing biomedical papers, writing biomedical grants applications, and studying Venture Capital as it applies to newly invented pharmaceuticals and medical devices, it is not enough to be shown clinically effective by statistical analysis.

A new drug, besides the roughly $300,000 expense in the USA of being proven nontoxic, must be seen as enough better than existing alternatives to persuade physicians to switch to it for their patients, despite the switching barriers (or switching costs), terms from Microeconomics, Strategic Management, and Marketing which describe any impediment to a customer’s changing of suppliers.

There is what is called “The Valley of Death” where an openly published scientific paper about a promising molecule or apparatus must be developed with infusion of enough funding to test toxicity and the like. Investors have to decide which of hundreds of candidates deserve their money. The Catch-22 is that the testing is only feasible with the money in hand, yet the money cannot be obtained without prior risk reduction through the testing.

Posted by: Jonathan Vos Post on September 27, 2010 5:16 PM | Permalink | Reply to this

Re: Biomedical issues; Re: Fetishizing p-Values

A new drug, besides the roughly $300,000 expense in the USA of being proven nontoxic, must be seen as enough better than existing alternatives to promote physicians shifting to its use for their patients, despite the switching barriers (or switching costs) which are terms from Microeconomics, Strategic Management, and Marketing, which describe any impediment to a customer’s changing of suppliers.

Don’t worry this problem is very much solvable…

Posted by: J-L Delatre on September 29, 2010 5:55 PM | Permalink | Reply to this

“Desperately Seeking Cures”; Re: Biomedical issues; Re: Fetishizing p-Values

Independent of
Merlo-Pich E, Alexander RC, Fava M, Gomeni R., A New Population-Enrichment Strategy to Improve Efficiency of Placebo-Controlled Clinical Trials of Antidepressant Drugs, Clin Pharmacol Ther. 2010 Sep 22,

Francis Collins, head of NIH, describes the following process, which I paraphrase and annotate based on his comments, and provide as context from
“Desperately Seeking Cures: How the road from promising scientific breakthrough to real-world remedy has become all but a dead end,” by Sharon Begley, an editor at Newsweek.

The Newsweek story ends: “In perhaps the clearest sign that patience among even the staunchest supporters of biomedical research is running thin, the health-care-reform bill that became law in March includes a Cures Acceleration Network that Sen. Arlen Specter, a longtime supporter of biomedical research, sponsored. Located at the NIH, the network would give grants ($500 million is authorized this year) to biotech companies, academic researchers, and advocacy groups to help promising discoveries cross the valley of death. It may or may not make a difference. But something had better, and soon.”

Posted by: Jonathan Vos Post on September 30, 2010 1:13 AM | Permalink | Reply to this

The earth is round (p < .05)

You may enjoy reading The earth is round (p < .05) by Jacob Cohen, as it is a similar style of criticism.

Posted by: Anon on October 4, 2010 9:39 AM | Permalink | Reply to this

Re: Fetishizing p-Values

As a theoretical physicist who now works in biology (not biophysics, actual pure biology), I can attest to the fact that the real-life situation is both better and worse. The criticism levelled above (i.e. confidence interval is better) is not very cogent, because people do usually list both p-value and magnitude. However, far too little discussion ever points out the absolute insanity of using p-values for anything.

Consider the following problem: I flip two biased coins, and want to find out if they’re biased the same way (side-track: I’m putting my biologist/mathematician hat on here — a physicist recognises the absolute absurdity in “biased coins”). How might I do this? One might suggest a significance test, i.e. use the results of the first coin to guess at the “true” bias, and check the probability of the 2nd coin being just a statistical fluctuation. However, two serious problems arise, one theoretical, the other practical. The theoretical objection is that the procedure is not symmetric: if I use the second coin as the null hypothesis, then I get (slightly) different numbers. The practical consideration is to ask what does p=0.05 (say) really mean. Should I then be 95% confident that the coins are biased differently? No. If you actually compute the ratio of likelihoods between a model with only one bias and a model with two, you find the ratio to be about 2:3, i.e. given the data it’s only 50% more likely to actually be biased differently.

The reason for the large difference is one of Occam’s Razor — two biases simply gives you a massively large space to potentially fit it, but you should then also pay the cost (exponential in number of fitting parameters). The 2nd effect is essentially completely missing from the language of hypothesis testing, and is a far greater problem in the sciences (I should qualify that with “not in physics”, because physicists don’t get taught statistics (thankfully) and so don’t try and do silly things like using it).
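
One way to make the comparison Gen describes concrete is a Bayes factor with uniform priors on each coin’s bias; the priors and the counts below are my own assumptions for illustration (Gen’s 2:3 figure refers to his own example), but the automatic Occam penalty for the extra parameter is the point.

```python
# Sketch of the model comparison described above: compare "one shared bias"
# against "two separate biases" for two coins via a Bayes factor, using
# uniform Beta(1, 1) priors on each bias (an assumption made here purely
# for illustration).
from math import comb, lgamma, exp, log


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(k, n):
    """Log marginal likelihood of k heads in n tosses under a uniform prior."""
    return log(comb(n, k)) + log_beta(k + 1, n - k + 1)


k1, n1 = 60, 100   # hypothetical results for coin 1
k2, n2 = 45, 100   # hypothetical results for coin 2

log_m_shared = log(comb(n1, k1)) + log(comb(n2, k2)) \
    + log_beta(k1 + k2 + 1, (n1 - k1) + (n2 - k2) + 1)
log_m_separate = log_marginal(k1, n1) + log_marginal(k2, n2)

bayes_factor = exp(log_m_separate - log_m_shared)
print(f"Bayes factor (separate biases vs shared bias): {bayes_factor:.2f}")
```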

Posted by: Gen Zhang on October 4, 2010 10:30 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Related to this discussion is this paper (found via a story on Slashdot):

Why Most Published Research Findings are False

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/

Posted by: bane on October 15, 2010 10:04 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Also related to this discussion is this answer to a question on CrossValidated (like MathOverflow but for statistics).

Posted by: Mark Meckes on December 1, 2010 6:16 PM | Permalink | Reply to this

Re: Fetishizing p-Values

I hope it is not too late to join this discussion. Tom, thanks for the kind words above, and may I say I am fascinated by your posts connecting biodiversity to cardinality. Wish I had seen them sooner, as I have been groping towards some of that from a more primitive empirical viewpoint. I had no idea there was such a thing as category theory. Your work will help connect biological diversity analysis to truly rigorous math. That is thrilling!

Anyway, on the subject of this thread, there is one small difference between my position and that of Ziliak and McCloskey. They seem to believe that testing null hypotheses is never good science, while I think it depends on the question under investigation. Is the mere falseness of the null hypothesis interesting, in and of itself? Then a p-value is a good final answer. If my null hypothesis is that the speed of light in a vacuum is identical for each observer regardless of his state of motion, and I had a body of experimental evidence to reject this null hypothesis at p<0.0001, it would deserve every physicist’s attention.

Biology, my adopted field, is not such an exact science, and what little theoretical structure we have is widely recognized as a rough approximation to reality. Here the mere existence of an effect will rarely be earthshaking.

In ecology and population genetics, the standard null hypothesis is that the effect (say, the difference in species compositions among communities) is zero. This is a particularly uninteresting null hypothesis. We know it is false even before we leave the office or forest hut to take data. This kind of science is an empty game, where success (rejecting the null hypothesis at a given level of significance) is guaranteed if the scientist has enough resources and patience to achieve a large sample size. [At least the probability of Type 1 errors is zero in this approach, as noted in Cohen’s wonderfully written “The earth is round (p<0.05)”; cited above.]

The effect of this is to cheapen the notion of meaning in these sciences. Measures become primarily tools for extracting p-values; their magnitudes may be very difficult or impossible to interpret, and may not even be monotonically related to the real quantity of interest.

This disrespect for meaning affects some branches of biology to their very core. The best example is population genetics. One of the basic concerns of this science is explaining genetic differences between populations. For this, a good meaningful measure of genetic differentiation between populations would seem essential. However, the main measures of genetic differentiation, Fst and Gst, are not independent of within-group diversity, so they are impossible to interpret in terms of differentiation. A set of completely differentiated populations (no shared gene types or “alleles” between any populations) can have a Gst value of unity, or near zero, or anything in between. This does not stop geneticists from using Gst to measure differentiation (though to be fair, it does have other, perfectly legitimate uses and interpretations).

I suspect part of the reason for the lack of awareness of this measure’s interpretational problems was that p-values could be so easily obtained from it (using the always false null hypothesis of no difference between populations). As a result of this abuse, some of our basic ideas about evolution and speciation are wrong, because they are based on analyses of a misinterpreted measure. That is why I wrote the paper mentioned at the head of this post.

The same measure (actually its 1-complement) was popular in ecology as a measure of similarity of species compositions among communities. It gave nonsensical magnitudes but nice p-values (rejecting the always false null hypothesis of no differences among communities).

Posted by: Lou on January 8, 2011 2:22 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Welcome, Lou! It’s really great to have your comments.

It’s interesting that you say that

The same measure (actually its 1-complement) was popular in ecology

(my emphasis). I take it you’re referring here to the Gini–Simpson index. Do you see signs that there’s been a shift in attitude? I ask because I get the impression that there’s a great deal of inertia in these things. You’ve done a lot to dispel various misconceptions, e.g. misuse of the Gini–Simpson index and $G_{ST}$, and inappropriate reliance on p-values. The kind of people who read and write theory papers are clearly listening and responding, but I wonder how fast this is filtering through to experimentalists and data-gatherers. Some of these methods seem so entrenched in current practice, and in the literature, that it seems like it could take a while to turn the ship around. What positive signs do you see?

My biologist friends often remind me that most field ecologists are unlikely to think about the advantages and disadvantages of the measures they’re using: they’ll simply use whatever measure everyone else in their subject uses. (And I don’t particularly blame them for it; it’s simply not what they’re interested in.) Maybe it would take some daring for them to report their results using measures that make more sense but are unfamiliar to most of their peers.

For those reading who have no idea what I’m talking about, the idea here is that (as in this previous post) an ecological community is modelled as a collection of $S$ species in proportions $p_1, \ldots, p_S$, so that $p_i \geq 0$ and $\sum_i p_i = 1$. In other words, it’s a finite probability space. One of the most popular measures of the diversity of a community is the Simpson or Gini–Simpson index, $1 - \sum_{i = 1}^S p_i^2$. This looks plausible at first (and has seen many decades of vigorous use). It takes its minimum value, $0$, when some $p_i$ is $1$ and the rest are $0$—that is, one species has taken over entirely. That’s appropriate, because that’s the least diverse situation possible. It takes its maximum value, $1 - 1/S$, when $p_1 = \cdots = p_S = 1/S$—that is, the species are equally represented. That’s appropriate too, because that’s the most diverse situation possible.

The problem is that the magnitude of the Gini–Simpson index is near-impossible to interpret. As Lou has pointed out (page 4), if a plague hits a continent of 30 million species and wipes out half of the species, any sensible measure of diversity should fall by 50%… but the Gini–Simpson index drops by 0.000004%. And there really are papers that treat percentage changes in the Gini–Simpson index as meaningful.
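
A quick numerical check of that plague example (30 million equally common species is of course a stylised assumption):

```python
# Numerical check of the plague example above: wipe out half of 30 million
# equally common species. The Gini-Simpson index barely moves, while the
# corresponding effective number of species (1 / sum p_i^2) halves, as a
# sensible diversity measure should.
import numpy as np

def gini_simpson(p):
    return 1.0 - np.dot(p, p)

def effective_number(p):      # the "true diversity" of order 2, in Lou's terms
    return 1.0 / np.dot(p, p)

S = 30_000_000
before = np.full(S, 1.0 / S)                 # 30 million equally common species
after = np.full(S // 2, 1.0 / (S // 2))      # plague kills half of them

print(f"Gini-Simpson: {gini_simpson(before):.10f} -> {gini_simpson(after):.10f}")
print(f"effective number of species: {effective_number(before):,.0f} -> {effective_number(after):,.0f}")
```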

So clearly there’s something very wrong here. I’m interested to know how fast people are catching on to the problem (and indeed, its solution).

Posted by: Tom Leinster on January 8, 2011 11:05 AM | Permalink | Reply to this

Re: Fetishizing p-Values

Cohen’s wonderfully written “The earth is round (p < 0.05)”; cited above.

It is wonderfully written! Who can fail to be charmed by

Like many men my age, I mostly grouse.

One interesting thing about it is that it’s written by an (experimental) psychologist, for other psychologists, so the discussion of why people fetishize p-values has a slight psychological flavour to it. He writes, for instance, of the “Bayesian Id’s wishful thinking”, and he asks

What’s wrong with NHST [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is “Given these data, what is the probability that $H_0$ [the null hypothesis] is true?” But as most of us know, what it tells us is “Given that $H_0$ is true, what is the probability of these (or more extreme) data?”

I’m not doing justice to the serious argument by quoting just the particularly entertaining bits. (But it’s only five and a half pages.) While I’m not doing it justice, here’s another great quote, this one itself a quotation by Cohen from a paper of B. Thompson. To give just a little context: Cohen is making the point that the larger the sample size, the easier it is to reject the “nil hypothesis” that the size of an effect is precisely 0 (because in reality, it never is).

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge.

Posted by: Tom Leinster on January 8, 2011 12:05 PM | Permalink | Reply to this

Re: Fetishizing p-Values

I sent the Cohen paper to everyone I know…

One clarification on your penultimate post, Tom. The Gini-Simpson index was used as a measure of diversity of a single community in ecology, while the measures I mentioned above, Gst and its 1-complement, are measures of similarity or differentiation among multiple communities. (Diversity might be considered the self-similarity of a single community?)

Gst is indeed built from the Gini-Simpson index, which is still used as the main measure of diversity in genetics, where it travels under the alias “heterozygosity” or “gene diversity”. If Ht is the Gini-Simpson index of the pooled communities, and Hs is the mean of the Gini-Simpson indices of the individual communities, Gst is (Ht-Hs)/Ht and its 1-complement is Hs/Ht. The thinking here was that if the communities are very similar in composition, pooling them will not increase their diversity much, so Ht will be very close to Hs. If the communities have little in common, the pooled diversity Ht will be much greater than the mean of the diversities of the individual communities.

The problem you pointed out with the Gini-Simpson index throws a monkey-wrench into this thinking. It has an asymptote of unity when diversity is high. So if within-group diversity Hs is high (close to unity), pooling the groups cannot possibly lead to a diversity much greater than Hs, even if the groups have no elements in common. Ht can never exceed unity. So you get a ratio Hs/Ht essentially equal to unity (which would be interpreted as indicating high compositional similarity between groups) even if the groups or communities have no elements in common.
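
A small numerical illustration of this (two completely distinct groups, each with 1,000 equally common alleles; the numbers are arbitrary):

```python
# Illustration of the problem described above: two completely distinct,
# highly diverse groups (no shared alleles at all) still give Gst near 0
# and Hs/Ht near 1, which reads as "almost identical composition".
import numpy as np

S = 1_000                                   # alleles per group, all equally common
group1 = np.concatenate([np.full(S, 1 / S), np.zeros(S)])
group2 = np.concatenate([np.zeros(S), np.full(S, 1 / S)])   # disjoint support

def gini_simpson(p):
    return 1.0 - np.dot(p, p)

Hs = (gini_simpson(group1) + gini_simpson(group2)) / 2
Ht = gini_simpson((group1 + group2) / 2)     # pooled, equal weights
Gst = (Ht - Hs) / Ht

print(f"Hs = {Hs:.4f}, Ht = {Ht:.4f}, Gst = {Gst:.6f}, Hs/Ht = {Hs / Ht:.4f}")
```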

This loose way of inventing diversity and similarity measures will probably leave the physicists and mathematicians here laughing at us biologists. The ratio of mean within-group measure divided by the total measure is commonly used throughout the soft sciences without much reflection on the mathematical requirements for this measure to be meaningful in a given application.

The same problem that you point out for the Gini-Simpson index (which is the Havrda-Charvat-Daroczy-Tsallis entropy of order 2) is also found in Shannon entropy, to a lesser degree. As a result, when entropy is high, the ratio Hs/Ht (now using H for Shannon entropy) approaches unity even if the groups or communities have no elements in common.

If this kind of ratio is to truly reflect the compositional similarity of the groups, a sufficient (perhaps necessary?) condition is that the diversity measure must obey the Replication Principle developed in economics. A measure obeying this principle is linear with respect to pooling of equally large, completely distinct, equally diverse groups. Then the ratio can be interpreted as a measure of community similarity, and for N equally weighted communities, it will range from 1/N to unity, with 1/N indicating no similarity between groups. The mean within-group diversity here has to be a generalized mean. Normalizing this ratio or its reciprocal ( the between-group component of diversity) so it ranges from 0 to 1 makes similarity indices, and the 1-complement of these similarity indices makes rigorous measures of compositional differentiation. See Jost 2007 for details. I’ll be happy to send anyone pdfs of any of these papers, by the way.

It turns out to be simple to transform any standard entropy measure into a measure obeying the Replication Principle. I call such transformed measures “true diversities” because they support the rules of inference biologists have traditionally, if inappropriately, applied to their classical diversity measures like Shannon entropy and the Gini-Simpson index. They are also called “numbers equivalents” or “effective numbers”. In the case of Shannon entropy using natural logs as the base, the transformation is accomplished by just taking the exponential.
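
A sketch of that transformation and of the Replication Principle it buys you, with a stylised example (equal-sized, completely distinct, equally diverse communities):

```python
# Sketch of converting Shannon entropy to a "true diversity" (effective
# number of species) by exponentiating: pooling two equally large,
# completely distinct, equally diverse communities doubles the effective
# number, while the raw entropy only gains ln 2.
import numpy as np

def shannon_entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

S = 100
community = np.full(S, 1 / S)                        # S equally common species
pooled = np.full(2 * S, 1 / (2 * S))                 # two disjoint copies pooled

for name, p in [("single community", community), ("pooled (2 distinct copies)", pooled)]:
    H = shannon_entropy(p)
    print(f"{name}: Shannon entropy = {H:.3f}, true diversity = exp(H) = {np.exp(H):.1f}")
```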

To get back to your question, yes, there is a lot of inertia. In both ecology and genetics, the classical diversity and similarity measures have been used for many decades (seventy or eighty years in genetics!) In both sciences, there have been dissenting voices for decades: in ecology, Robert MacArthur and Mark Hill, among others, argued in the 70’s that it would be better to make the transformation to true diversities (though they did not use that term). In genetics there was some awareness (thanks to the work of Gregorius and others) that Fst and Gst was not actually a normal measure of differentiation between groups (even the developer of Fst, Sewall Wright, was aware of some of this). But these reservations did not attract the attention they deserved, perhaps because most of these authors did not forcefully point out that wildly wrong answers will result if biologists used the traditional methods, and because there wasn’t a completely-developed alternative.

I started pointing out these wildly wrong answers in ecology in Jost (2006), before I had a full understanding of the solution. That article got considerable attention. In Jost (2007) I figured out how to partition “true diversity” into independent within- and between-group components, and how to derive valid similarity measures from these components. This provided a viable alternative to the classical approach, and this made my criticism much easier to accept. While not everyone swallowed the whole pill, most people saw the advantages of using “true diversities” instead of the classical measures. The July 2010 issue of Ecology had a 30 page, multiple-author Forum on my 2007 paper. Though some authors were very critical of other aspects of my work, there was general agreement on this point. The editorial preface to the Forum concluded that “All of the authors in this Forum agree that using numbers equivalents instead of the classical diversity indices (entropies) such as H0 should be used in any diversity partitioning. One could go further and suggest that, even if the interest is only in describing the diversity of a single assemblage, the numbers equivalent, not the entropy, should be the diversity measure of choice. But my goal in organizing this Forum was to move beyond this easy point of agreement…” (Ellison 2010).

So this went from heresy to “easy point of agreement” very quickly in ecology (though if you read those Forum articles, you will find misunderstandings still persist). There are still people who are not aware of the issues, but no one in ecology is explicitly defending the old way of using these measures.

It is not so in genetics. In Jost (2008) I introduced a rigorous replacement for Gst, and argued for the use of numbers equivalents instead of the Gini-Simpson index (heterozygosity) as a measure of gene diversity. Though it also got a lot of attention (eg. it immediately spawned a meta-analysis of many prior studies subtitled “How wrong have we been?” by Heller and Siegismund 2009), it also has drawn much criticism (eg. Ryman and Leimar 2009) and some mixed reviews (eg. Miermans and Hedrick 2010). Very few studies use my approach. It may be that since I am a complete outsider to this field, I am not expressing myself in a way that resonates with practitioners, or maybe I do not appreciate all the background behind these old measures. But my colleague Anne Chao and I are continuing to develop this new approach, which turns out to be surprisingly rich, and I think that once people realize that this alternative is a useful tool, they will come around.

I have to say that overall, I have been impressed with the open-mindedness of scientists in both ecology and genetics. Even though I have little face-to-face interaction with the outside world (I live on a volcano in the Andes) and no university affiliation, they have still listened to me and published my papers.

Jost L 2006 Entropy and diversity. Oikos 113: 363–375.

Jost L 2007 Partitioning diversity into independent alpha and beta components. Ecology 88: 2427–2439.

Jost L 2008 Gst and its relatives do not measure differentiation. Molecular Ecology 17: 4015–4026.

Meirmans P and Hedrick P 2010 Assessing population structure: Fst and related measures. Molecular Ecology Resources 11: 5–18.

Heller R and Siegismund H 2009 Relationship between three measures of genetic differentiation Gst, Dest, and Gst’: how wrong have we been? Molecular Ecology 18: 2080–2083.

Ryman N and Leimar O 2009 Gst is still a useful measure of differentiation: a comment on Jost’s D. Molecular Ecology 18: 2084–2087.

Ellison A 2010 Partitioning diversity. Ecology 91: 1962–1963.

Posted by: Lou on January 8, 2011 3:50 PM | Permalink | Reply to this

Re: Fetishizing p-Values

The Gini-Simpson index was used as a measure of diversity of a single community in ecology, while the measures I mentioned above, Gst and its 1-complement, are measures of similarity or differentiation among multiple communities.

Yes, right.

If Ht is the Gini-Simpson index of the pooled communities, and Hs is the mean of the Gini-Simpson indices of the individual communities, Gst is (Ht-Hs)/Ht and its 1-complement is Hs/Ht. The thinking here was that if the communities are very similar in composition, pooling them will not increase their diversity much, so Ht will be very close to Hs.
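Concretely, here is a minimal sketch of that calculation (illustrative code only, giving each community equal statistical weight):

```python
import numpy as np

def gini_simpson(p):
    """Gini-Simpson index 1 - sum(p_i^2) of a vector of relative frequencies."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def gst(communities):
    """Gst = (Ht - Hs)/Ht for a list of equally weighted frequency vectors."""
    communities = [np.asarray(c, dtype=float) for c in communities]
    Hs = np.mean([gini_simpson(c) for c in communities])  # mean within-community index
    Ht = gini_simpson(np.mean(communities, axis=0))       # index of the pooled community
    return (Ht - Hs) / Ht

# Two highly diverse communities with no species in common:
a = [0.1] * 10 + [0.0] * 10
b = [0.0] * 10 + [0.1] * 10
print(gst([a, b]))  # about 0.053, i.e. "almost no differentiation"
```

The closing example is exactly the kind of case at issue in this thread: the two communities share no species at all, yet Gst comes out near zero, because Ht can never exceed Hs by much when within-community diversity is high.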

Do you know how physicists handle this? Do they also need some quantity analogous to between-community diversity?

When I first got interested in diversity I skimmed several physics papers that appeared in the wake of Tsallis’s rediscovery of the entropy of Havrda, Charvat, Patil, Taillie and co. I remember there being various generalized means and things called “escort probabilities”. But I’ve mostly forgotten the story now. I assume that your canonical alpha-beta partitioning hadn’t already appeared in the physics literature.

Incidentally, the proof of your partitioning theorem in the appendix to your 2007 paper is something I’d be interested in discussing some time. I spent a while reshaping it in various ways, and though other things eventually intervened and I had to stop, I was left with the impression that there was more that could be done with the mathematics of it. It’s something I’ve been meaning to come back to. I’m interested in biological diversity, and regard it as highly important, but I’m also interested in understanding diversity as an extremely broad scientific concept, applicable in many settings. As you say, these measures should have a firm logical foundation.

I call such transformed measures “true diversities” because they support the rules of inference biologists have traditionally, if inappropriately, applied to their classical diversity measures like Shannon entropy and the Gini-Simpson index.

One thing I regret about my previous posts is my choice of terminology. I (mis)used the word “diversity” for what I’d now call “expected surprise”, or more vaguely “entropy”. (My biologist colleague Richard Reeve promptly told me off.) I used “cardinality” for what you’d call “true diversity”. At that point I hadn’t seen your papers.

I saw that Ecology supplement, and do remember being struck by how strongly everyone agreed that effective numbers were a good thing. I guess all the contributors will tell their postdocs and students, and the message will spread.

I have to say that overall, I have been impressed with the open-mindedness of scientists in both ecology and genetics.

Me too. I don’t have any formal training in biology, but almost every biologist I’ve been in contact with has been interested and willing to discuss new ideas.

Posted by: Tom Leinster on January 8, 2011 6:48 PM | Permalink | Reply to this

Re: Fetishizing p-Values

I remember the night I first derived the formula for within-group diversity, with weird weights: each group weight raised to the power q and normalized by the sum of these powered weights… and they were the same as the weird q-normalized or “third choice” weighting that had recently taken hold in nonextensive thermodynamics. As I understand it, the physicists arrived at this weighting scheme for empirical reasons: things didn’t work out properly in thermodynamics unless you did it this way. They had gone through several years of “first choice” or simple weighting, then had a brief love affair with “second choice” or w^q weighting, and finally switched to this third choice weighting. Yet that night, the third choice scheme came directly out of the math, so it wasn’t an empirical choice at all. That was exciting!

Then reality set in. The between-group component you get with these weights is uninterpretable when q is neither 0 nor 1, unless all the weights are equal (in which case the third choice scheme collapses onto the first choice weighting scheme). So now I have some doubts about the whole Tsallis program. But that is beyond my ability to address.
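For concreteness, here is a rough sketch of the order-q alpha and gamma diversities with community weights w_j and within-community frequencies p_ij, following the definitions in Jost (2007); the code is illustrative only, not taken from that paper:

```python
import numpy as np

def alpha_diversity(p, w, q):
    """Within-group ('alpha') true diversity of order q (illustrative implementation).
    p[j, i] = frequency of species i in community j (each row sums to 1);
    w[j]    = statistical weight of community j (the weights sum to 1); q >= 0."""
    p, w = np.asarray(p, dtype=float), np.asarray(w, dtype=float)
    if np.isclose(q, 1.0):
        # q = 1: ordinary ('first choice') weights; exponential of the mean Shannon entropy
        H = sum(wj * -np.sum(pj[pj > 0] * np.log(pj[pj > 0])) for wj, pj in zip(w, p))
        return np.exp(H)
    # q != 1: the 'third choice' weights, w_j^q normalized by their own sum
    wq = w ** q
    basic_sum = np.sum((wq / wq.sum())[:, None] * np.where(p > 0, p ** q, 0.0))
    return basic_sum ** (1.0 / (1.0 - q))

def gamma_diversity(p, w, q):
    """True diversity of order q of the pooled community."""
    p, w = np.asarray(p, dtype=float), np.asarray(w, dtype=float)
    pooled = w @ p
    pooled = pooled[pooled > 0]
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(pooled * np.log(pooled)))
    return np.sum(pooled ** q) ** (1.0 / (1.0 - q))

# beta = gamma / alpha; with equal community weights it runs from 1 (identical
# communities) up to the number of communities (completely distinct communities).
```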

I agree with you completely that “there was more that could be done” (thank you for being so polite!) on my proof of independence. I would love to see a more rigorous and profound reformulation of the concept. I hope you have time to get back to it. Like you, I regard diversity as a concept of very wide applicability and also of some abstract mathematical interest. Lots of discussions about entropies and generalized entropies are enormously simplified when reframed in terms of diversity. But my approach, of necessity, is the approach of a tropical biologist—crawling through the tangled undergrowth of imprecise empirical concepts, covered with mud, being distracted by swatting flies, barely catching glimpses of the pure concept in the shadowy distance. You meanwhile are coming at this from the other direction, from the clear blue sky of pure logic….may we meet in the middle someday!

Posted by: Lou on January 8, 2011 7:42 PM | Permalink | Reply to this

Re: Fetishizing p-Values

That’s interesting that the physicists got there empirically. There are other regulars here at the Café who know all about thermodynamics. Unfortunately I’m not one of them. But if any of them are reading, maybe they have a point of view.

I agree with you completely that “there was more that could be done” (thank you for being so polite!) on my proof of independence.

I should make clear that this wasn’t a euphemism or veiled criticism — I meant that I’ve been hoping your result can be extended and used in other ways. It wasn’t politeness!

We’ve talked a lot here about counting and measurement. A banal but fundamental property of counting is that if you have a pile of apples and split it into two smaller piles, then the number of apples in the original pile is the sum of the numbers in each of the two small piles. This simple rule alone, applied in contexts other than collections of discrete objects, turns out to get you an incredibly long way: e.g. it gets you the concept of Euler characteristic in topology.

Partitioning an ecological community, and measuring the diversities within and between the subcommunities, seems like it should fit right into this framework. Actually, it’s not precisely clear what “this framework” is, and thinking about diversity is expanding it in interesting ways.

Posted by: Tom Leinster on January 9, 2011 10:35 PM | Permalink | Reply to this

Re: Fetishizing p-Values

Here’s a vaguely related article:

We all like to laugh at quacks when they misuse basic statistics. But what if academics, en masse, deploy errors that are equally foolish? This week Sander Nieuwenhuis and colleagues publish a mighty torpedo in the journal Nature Neuroscience.

They’ve identified one direct, stark statistical error so widespread it appears in about half of all the published papers surveyed from the academic psychology research literature.

It’s not quite the same error—is it? But it seems related somehow.

Posted by: John Baez on September 10, 2011 9:45 AM | Permalink | Reply to this

Re: Fetishizing p-Values

It’s a nice article! The error he discusses isn’t exactly the same as the one discussed here, as I understand it, but they’re certainly related. Let me have a go at explaining.

The theme of my post is over-attention to statistical significance, at the expense of attention to magnitude of effect. I can be 99.9% sure that my coin is biased, but that says nothing about how biased it is. Since every coin is biased, I can always achieve that level of certainty simply by flipping it enough times.
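A quick simulation makes the point vivid; the 1% bias and the sample sizes below are arbitrary choices for illustration, using SciPy's exact binomial test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
bias = 0.51                      # a coin that is only very slightly unfair
for n in (100, 10_000, 1_000_000):
    heads = rng.binomial(n, bias)
    # exact two-sided binomial test of the null hypothesis "the coin is fair"
    p = stats.binomtest(heads, n, 0.5).pvalue
    print(f"n = {n:>9}: observed frequency {heads / n:.3f}, p = {p:.2g}")
# At n = 100 the bias is essentially invisible; at n = 1,000,000 the p-value is
# astronomically small, yet the coin is still only 1% away from fair.
```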

The theme of Goldacre’s article is experiments in which you have two different magnitudes of effect, and you have to decide whether they’re significantly different. In his example, you have two different groups of mice, one normal and one mutant, and the question is whether their nerve cells respond differently to a certain chemical. Apparently lots of authors do this kind of statistics incorrectly.

I liked Goldacre’s article, but for me his point would have been made more strongly if he’d changed his example a bit. In his hypothetical mouse experiment, he imagines that:

When you drop the chemical on the mutant mice nerve cells, their firing rate drops, by 30%, say. With the number of mice you have this difference is statistically significant, and so unlikely to be due to chance. That’s a useful finding, which you can maybe publish. When you drop the chemical on the normal mice nerve cells, there is a bit of a drop, but not as much – let’s say 15%, which doesn’t reach statistical significance.

He goes on to say:

But here’s the catch. You can say there is a statistically significant effect for your chemical reducing the firing rate in the mutant cells. And you can say there is no such statistically significant effect in the normal cells. But you can’t say mutant and normal cells respond to the chemical differently: to say that, you would have to do a third statistical test, specifically comparing the “difference in differences”

But this becomes really obvious if you change that 15% to 29%. We’re imagining, then, that the threshold for statistical significance is between 29% and 30%. The mutant mice have a response of 30%, which is statistically significant, and the normal mice have a response of 29%, which is not. Nevertheless, you clearly can’t say with confidence that the mutant and normal mice respond differently.
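Here is a rough sketch of that sharpened scenario, with invented numbers, showing the three tests involved:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# hypothetical per-cell percentage drops in firing rate, 12 cells per group
mutant = rng.normal(loc=30, scale=45, size=12)   # true mean drop 30%
normal = rng.normal(loc=29, scale=45, size=12)   # true mean drop 29%

# the two tests that get reported: each group against "no effect"
print(stats.ttest_1samp(mutant, 0).pvalue)
print(stats.ttest_1samp(normal, 0).pvalue)

# the test the question actually calls for: mutant vs. normal, compared directly
print(stats.ttest_ind(mutant, normal).pvalue)
# Depending on the draw, one group can squeak under p = 0.05 while the other
# misses it, but the direct comparison of a 30% drop with a 29% drop will
# essentially never be significant with samples this size.
```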

It’s also clear that the same phenomenon’s going to come up all over the place, whenever you have results near the threshold for statistical significance. And that’s probably most of the time.

And it’s clear that the law is an ass: treating the “threshold of statistical significance” as if it meant something special is absurd. It is, after all, statistical significance at a chosen level, maybe p = 0.05 or some other conventional round number.

That, I think, is where this post and Goldacre’s make contact. People who have made the error Goldacre describes seem to have fallen prey to the cult of statistical significance.

Posted by: Tom Leinster on September 11, 2011 5:59 AM | Permalink | Reply to this

Re: Fetishizing p-Values

There’s an article on the arXiv today (19th December 2011) which discusses this in relation to the neutrino claim: arXiv 1112.3620. It’s well written (barring occasional errors due, no doubt, to being written in a non-native language) and well worth reading.

Posted by: Andrew Stacey on December 19, 2011 11:42 AM | Permalink | Reply to this

Re: Fetishizing p-Values

What a great idea! The American Psychological Association runs over 50 journals, and if you publish there you sign a contract promising to share your raw data with anyone who asks for it.

But wait! When some researchers asked for the data for 141 papers in these journals, 73% of the authors didn’t share it!

Other researchers went further. They asked 49 authors for their data. 27% didn’t reply at all - neither to the original request, nor to two followups. 25% said they would share the data but never did. 6% said the data were lost or there was no time to write it up. And the rest said their dog ate the data.

Seriously, 67% of authors didn’t share their data. But here’s the really damning part. A result often gets counted as ‘statistically significant’ if the ‘null hypothesis’ would give the observed result or something more extreme with probability p less than 5%. This approach is full of problems, and the American Psychological Association Task Force on Statistical Inference actually considered banning it in 1999, but they didn’t. And it turned out that authors whose results came very close to the 5% cutoff were less likely to share their data!

In other words, psychologists whose results are just barely ‘significant’, according to this problematic but widely used measure, are less likely to share their data… even though they promised to!

And for all seven papers with mistakes where correctly computed statistics would have made the findings non-significant, none of the authors shared their data.

This sucks. I think we should ‘name and shame’ scientists like this. Here are the studies:

Both are free, and the first one appears in a journal where others can comment - that’s great! And both suggest solutions to the problem.

For more, try:

Posted by: John Baez on January 27, 2013 12:51 AM | Permalink | Reply to this

Re: Fetishizing p-Values

An interesting blog article about ‘p-hacking’ in certain psychological studies:

A quote:

A new paper from British psychologists David Shanks and colleagues will add to the growing sense of a “reproducibility crisis” in the field of psychology.

The paper is called Romance, Risk, and Replication and it examines the question of whether subtle reminders of ‘mating motives’ (i.e. sex) can make people more willing to spend money and take risks. In ‘romantic priming’ experiments, participants are first ‘primed’ e.g. by reading a story about meeting an attractive member of the opposite sex. Then, they are asked to do an ostensibly unrelated test, e.g. being asked to say how much money they would be willing to spend on a new watch.

There have been many published studies of romantic priming (43 experiments across 15 papers, according to Shanks et al.) and the vast majority have found statistically significant effects. The effect would appear to be reproducible! But in the new paper, Shanks et al. report that they tried to replicate these effects in eight experiments, with a total of over 1600 participants, and they came up with nothing. Romantic priming had no effect.

So what happened? Why do the replication results differ so much from the results of the original studies?

The answer is rather depressing and it lies in a graph plotted by Shanks et al.

You’ve got to look at the graph, and take the time to understand it. It’s damning!

Posted by: John Baez on November 12, 2015 6:08 AM | Permalink | Reply to this

Re: Fetishizing p-Values

I wrote a comment here, but I got tons of W3C errors, so I gave up posting it here directly.

You can see the comment here.

The comment was about the fact that I hadn’t found any remark here on the problem of sample size and p-values. I hope I haven’t overlooked something.

For obvious reasons, drug testing on animals is usually done with rather few animals, so testing only for significance seems highly problematic. However, this seems to be a common method if I look, for example, at animal experiments with glyphosate in 2013.

And then I give examples from the report.

In this context I also wanted to ask whether there is some standard procedure for weighting samples according to their sizes when one has to combine the means and variances of several samples of different sizes.
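One standard recipe, for reference: weight each sample mean by its sample size, and include a between-sample term when combining variances. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pooled_mean_and_variance(samples):
    """Combine samples of different sizes into one overall mean and unbiased
    variance, using only each sample's size, mean and variance."""
    ns = np.array([len(s) for s in samples], dtype=float)
    means = np.array([np.mean(s) for s in samples])
    variances = np.array([np.var(s, ddof=1) for s in samples])
    N = ns.sum()
    grand_mean = np.sum(ns * means) / N                  # size-weighted mean
    # within-sample scatter plus scatter of the sample means around the grand mean
    ss = np.sum((ns - 1) * variances) + np.sum(ns * (means - grand_mean) ** 2)
    return grand_mean, ss / (N - 1)

# sanity check: agrees with computing mean and variance of all the data at once
rng = np.random.default_rng(7)
samples = [rng.normal(size=n) for n in (5, 20, 75)]
print(pooled_mean_and_variance(samples))
pooled = np.concatenate(samples)
print(pooled.mean(), pooled.var(ddof=1))
```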

Posted by: nad on May 19, 2016 9:57 AM | Permalink | Reply to this
