Since statistics is about extracting information, one of the most important components of a statistical analysis is communicating the appropriate conclusions. Unfortunately, it can be subtle and unforgiving territory, where double negatives and positives may be vastly different and terms we think we know in general can suddenly assume rather specific meanings.
We’ll look at what’s probably the most common problematic scenario here: the classical p-value hypothesis test. One of the major criticisms of this framework is actually how amenable it is to misinterpretation, so if you find you’ve fallen into some of the following traps you’ve got plenty of company. Still, there are really only a few pain points to be mindful of.
Recall that such a hypothesis test goes something like this:
- We form null and alternative hypotheses about some quantity of interest, where the null hypothesis is a sort of fallback in the absence of evidence to the contrary.
- We acquire data on a sample of subjects to help choose between the hypotheses for the population of interest.
- We calculate how likely it would be to see sample results at least as extreme as ours if the null hypothesis were true for the population. This probability is called the p-value, and if it is sufficiently low (i.e. below a pre-specified threshold) we reject the null hypothesis in favor of the alternative.
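To make the three steps concrete, here is a minimal sketch using a made-up example: testing whether a coin is fair after seeing 16 heads in 20 flips. The function name `binomial_p_value`, the data, and the .05 threshold are all assumptions for illustration. "At least as extreme" is taken here to mean "no more probable under the null than what we saw", which is one common two-sided convention, not the only one.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Two-sided exact p-value: the probability, under the null hypothesis,
    of any outcome no more likely than the one observed."""
    def prob(k: int) -> float:
        # Probability of exactly k heads if the null hypothesis is true.
        return comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)

    observed = prob(heads)
    # Add up every outcome at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(flips + 1)
               if prob(k) <= observed * (1 + 1e-9))

alpha = 0.05  # significance level, fixed before looking at the data
p_value = binomial_p_value(heads=16, flips=20)
print(f"p-value = {p_value:.4f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```

Note that the calculation only ever uses `p_null`: the null hypothesis is assumed true throughout, which matters for the first issue below.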
Describing the third step is usually where linguistic problems arise. Here are three big ones:
Issue 1: Thinking that the p-value measures the probability of the null hypothesis being true
We’re trying to decide whether the null hypothesis is true and have this probability lying around that is roughly proportional to our trust in the null hypothesis; surely it’s the probability of the null hypothesis being true!
In a sense, it’s the opposite of that. The p-value is the probability of the results (or more extreme results) given the null hypothesis, NOT the probability of the null hypothesis given the results. To even calculate the p-value, we must assume the null hypothesis is true.
Wrong:
We performed the 35-sample j-test and arrived at a p-value of .04. We reject the null hypothesis because it has less than a 5% chance of being true.
Not Bad:
We performed the 666-sample i-test and arrived at a p-value of .04. We reject the null hypothesis at the .05 level.
Super Not Bad:
We performed the 69-sample x-test and arrived at a p-value of .04. We reject the null hypothesis because for samples of this size there would be less than a 5% chance of arriving at results at least this different from the expected results if the null hypothesis were true.
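To see why the probability of the null hypothesis can't be read off the test alone, here is a rough sketch using Bayes' rule. It asks a slightly simpler question than "what is the probability of the null given p = .04": namely, how often the null is true among tests that ended in a rejection. The function name and all the numbers (the prior share of true nulls and a power of 0.8) are hypothetical, chosen only for illustration.

```python
def prob_null_given_rejection(prior_null: float, alpha: float, power: float) -> float:
    """Share of rejections in which the null hypothesis was actually true."""
    false_alarms = alpha * prior_null          # null true, rejected anyway
    true_detections = power * (1 - prior_null)  # null false, correctly rejected
    return false_alarms / (false_alarms + true_detections)

# Same test, same .05 level; only the assumed prior changes.
for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_rejection(prior_null, alpha=0.05, power=0.8)
    print(f"P(null) before = {prior_null:.2f} -> after a rejection = {posterior:.2f}")
```

Under these assumptions the answer ranges from about 6% to about 86%, depending entirely on inputs the p-value knows nothing about.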
Issue 2: Confusion about the term “statistically significant”
I see a tendency to allow “statistically significant” to overstep its bounds and serve as a comment on the end-to-end validity of a hypothesis test.
Used correctly it’s a much weaker claim, and not even dependent on the hypothesis test being applicable or the underlying experiment being properly designed.
Saying that results are statistically significant simply means that your p-value was below the pre-specified threshold you set, called the significance level of the test—you met your criterion for rejecting the null hypothesis.
This means that you could design and execute a perfectly rigorous experiment, choose the perfect hypothesis test for assessing the results, validate its assumptions, and arrive at a result that is not statistically significant. In fact, that’s the desired result whenever the null hypothesis is actually true.
On the other hand, you can do everything wrong and arrive at a result that is statistically significant, if also useless and dangerous.
An unfortunate side effect is that you have to resort to general pleasantries to describe how on point your experimental design was; “statistically significant” doesn’t capture it.
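A small simulation can show how little "statistically significant" says on its own. The sketch below repeatedly runs an experiment in which the null hypothesis is true by construction, and counts how often the result is significant anyway. The z-test with known standard deviation, the sample size, and the function names are assumptions picked to keep the example short.

```python
import random
from math import erfc, sqrt
from statistics import fmean

def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value for a z-test with known standard deviation."""
    z = (fmean(sample) - null_mean) / (sigma / sqrt(len(sample)))
    return erfc(abs(z) / sqrt(2))

def false_positive_rate(alpha, experiments=5000, sample_size=30, seed=0):
    """Fraction of experiments declared significant when the null is true."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(experiments):
        # Data drawn with mean 0, so the null hypothesis (mean = 0) holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if z_test_p_value(sample) < alpha:
            significant += 1
    return significant / experiments

print(false_positive_rate(alpha=0.05))
```

The printed fraction should land near the significance level itself: roughly one in twenty of these flawless experiments is "statistically significant" even though there is nothing to find.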
Issue 3: Not reporting the significance level of a test
No p-value is inherently statistically significant. It only becomes significant or not when compared to the subjectively prescribed significance level of the test. Therefore, it’s sloppy to omit the significance level from a claim of statistical significance. People are going to assume that you used the standard .05 if you don’t mention it, but that’s not the standard in all domains and you should show you at least thought about it.
Sloppy:
We performed the 8-sample maids-a-milking-test and arrived at a p-value of .04. The results are statistically significant.
Better:
We performed the 5-sample golden-ring-test and arrived at a p-value of .04. The results are statistically significant at the .05 level.
Also Better:
We performed the partridge-in-a-pear-tree-test and arrived at a p-value of .04. The results are not statistically significant at the .01 level.
Note that the same p-value of .04 is statistically significant at the .05 level but not at the .01 level. Which level is appropriate is context-dependent and up to the investigator. The famous experiments around the Higgs boson used a significance level of 3 × 10⁻⁷. Explain what you’re doing!
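Here is a short sketch of the same point in code. The helper names `report` and `one_sided_p_from_sigma` are made up for this example. The second one assumes the particle-physics figure comes from the usual "five sigma" convention, read as a one-sided tail probability of a standard normal distribution.

```python
from math import erfc, sqrt

def one_sided_p_from_sigma(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * erfc(n_sigma / sqrt(2))

def report(p_value: float, alpha: float) -> str:
    """State the verdict together with the level it was judged against."""
    verdict = ("statistically significant" if p_value < alpha
               else "not statistically significant")
    return f"p = {p_value:g}: {verdict} at the {alpha:g} level"

print(report(0.04, 0.05))   # same p-value...
print(report(0.04, 0.01))   # ...different verdict
print(f"5 sigma, one-sided: {one_sided_p_from_sigma(5):.2e}")
```

Having `report` always include `alpha` in its output is the design choice that matters here: the verdict never appears without the level it depends on.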
That’s all for now. Got more pain points? Feel free to share them in the comments.