Some of you may be familiar with this Nature news article describing the effort to lower the p-value threshold for statistical significance (currently 0.05). In the wake of major reproducibility crises in psychology and biomedicine, this issue is more pertinent than ever.
What’s the p-value anyway? In the language of statisticians, it’s the probability that you would have gotten your data, or more extreme data, if the null hypothesis were true.
Wait? Probability? Null hypothesis? I thought in science we were trying to do an experiment! Yep, but interpreting the results of your experiment is done in the most backwards way possible. You make a null hypothesis and you try to disprove it.
Say you want to test whether Drug A lowers pain scores more than Drug B. You gather two groups of people matched on every relevant characteristic: one taking Drug A, one taking Drug B. You start from the premise (the null hypothesis) that there is no difference in mean pain score change between the Drug A group and the Drug B group (remember: you have more than one patient in each group!). There are a variety of null hypothesis tests out there. Let’s make the probably wrong assumption that the pain score changes follow a normal distribution; thus we will use a two-sample t-test. I’m not going to go into the fine details of the t-test, as most software will do it for you (see here for a good description). Ultimately, you use the data and their variability to calculate a t statistic, the test statistic for the t-test. You then determine whether the t statistic is large enough (based on the t distribution) to be statistically significant. If it is, you conclude that you can reject the null hypothesis. But you can’t definitively say that Drug A lowers pain more than Drug B! That judgment is based on your scientific knowledge: you determine whether the change in pain score is practically significant.
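To make those steps concrete, here is a minimal sketch in Python using only the standard library. The pain score changes are made-up numbers for illustration, and the function names are my own. It computes the equal-variance two-sample t statistic by hand. Because the standard library has no t distribution, it then approximates the p-value with a permutation test (shuffling the group labels) rather than the t-distribution lookup that statistical software would normally do.

```python
import random
import statistics

# Hypothetical pain score reductions (illustrative numbers, not real data).
drug_a = [3.1, 2.8, 3.6, 2.9, 3.3, 3.0, 3.4, 2.7]
drug_b = [2.2, 2.6, 1.9, 2.4, 2.8, 2.1, 2.5, 2.3]


def pooled_t_statistic(a, b):
    """Two-sample t statistic, assuming both groups share one variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Approximate two-sided p-value: the share of label shufflings whose
    t statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(pooled_t_statistic(a, b))
    combined = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(combined)
        shuffled_t = pooled_t_statistic(combined[: len(a)], combined[len(a):])
        if abs(shuffled_t) >= observed:
            hits += 1
    return hits / n_shuffles


t_stat = pooled_t_statistic(drug_a, drug_b)
p_value = permutation_p_value(drug_a, drug_b)
print(f"t = {t_stat:.2f}, approximate p = {p_value:.4f}")
```

The shuffling step is the null hypothesis in action: if the drug label made no difference, any relabeling of the patients would be just as plausible as the one you observed, so the p-value is simply how often a relabeling looks at least as extreme as your real data.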
Ok, so why the hubbub? In medical science, we use a threshold of less than 0.05 to determine statistical significance. That is, if the null hypothesis were true, there would be a less than 5% chance of getting a pain score difference at least as large as the one we saw. This threshold was popularized by R.A. Fisher, a famous statistician, in the 1920s. I’m not going to spend $60 to buy Fisher’s Statistical Methods for Research Workers, but Gerard Dallal summarizes the history nicely. Fisher felt that data with only a 1 in 20 chance of arising by random chance alone were a good indication that something was going on. Of course, he also stated that it was a guide and that the scientist should make a determination based on the current evidence. There is an art in science!
If the threshold is such a big deal, then why not just lower the threshold like the Nature piece suggests? Well, aside from trying to get every scientist to agree on one thing, you would need a substantially larger sample size if you wanted to avoid false negatives (that is, to keep the same statistical power). Good luck convincing someone to fund that when you’re a grad student!
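How much larger? Here is a rough sketch using the textbook normal-approximation formula for comparing two group means, n per group = 2 * ((z for alpha + z for power) / effect size)^2. The specific inputs are my assumptions for illustration: a medium standardized effect size of 0.5, 80% power, and 0.005 as an example of a stricter threshold. An exact t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha, power=0.80):
    """Approximate patients needed per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # significance cutoff
    z_power = standard_normal.inv_cdf(power)          # guards against false negatives
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed inputs: standardized effect size 0.5, 80% power.
n_at_05 = n_per_group(0.5, alpha=0.05)
n_at_005 = n_per_group(0.5, alpha=0.005)
print(f"alpha = 0.05:  {n_at_05} per group")
print(f"alpha = 0.005: {n_at_005} per group")
```

Under these assumptions the requirement goes from roughly 63 to roughly 107 patients per group, about 70% more people to recruit for the same chance of detecting the same effect.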
There are alternatives, such as confidence intervals, Bayesian credible intervals, and maybe machine learning. However, they come with their own complexities and not everyone agrees that they would work.
What do I think? I think a p-value threshold of less than 0.05 is still useful, and not just because I use it in my work! Of course, it is key that you don’t depend solely on the threshold to draw your conclusions. I myself use confidence intervals and effect sizes to make careful interpretations. Still, I think a big shift in thinking will occur in the near future. Best to prepare by brushing up on all those other techniques.
Summary: The p-value is used as evidence in much of science to help determine if you’re observing a true effect. The problem is that too many people take it as a hard and fast rule, and thus risk publishing a false positive. Science may mostly be wrong! But are the alternatives better? Who knows if the debate will be settled (maybe we’ll all become Bayesians?).