by Salvador Balkus

Collectively, scientists conduct a lot of experiments. Whether they study addiction, air pollution, or animal populations, most basic scientific experiments have one thing in common: data. 

To perform an experiment, scientists first formulate a hypothesis about how something works. Then, they collect data – measurements, sensor information, images, surveys, and the like – that either support their hypothesis or prove it false.

Usually, though, it is impossible to measure all of the data. After all, we cannot track every person with addiction, or measure the particles in every cubic inch of the air – that would be impractical. Instead, scientists take a random sample. They gather data for only a small number of people, or a small set of locations, and use the results to inform our knowledge about the world at large (Figure 1).

Figure 1. Since scientists cannot measure all possible data for a given phenomenon, they conduct experiments or analyses on a small subset of the data. Collecting this data is called “sampling.” Here, the process is depicted for two potential scientific topics – air pollution studies and clinical drug trials. 

However, drawing conclusions from only a sample of the possible data can be risky. Suppose the results show some novel finding – perhaps that a specific drug is less addictive than others. If the finding is based on a random sample of people, then it is possible that the scientists just happened to select people for whom the drug was less addictive, and that the findings would fail to hold among the general population.

In this case, how do the scientists know if their hypothesis is supported or if their hypothesis is wrong and the results simply occurred randomly?

To answer this question, scientists rely on a mathematical calculation called a “p-value.” Though ubiquitous – p-values have been included in millions of scientific papers – these calculations can also be controversial. And even if you’re not a scientist, the debate around p-values holds crucial implications regarding the public’s trust in science as a whole.

So, what is a p-value?

When a scientist sets up an experiment, they want to test a hypothesis that “some interesting phenomenon” happens. No amount of evidence can ever prove a hypothesis is correct 100% of the time. Instead, scientists first assume that the phenomenon does not actually happen (which, in technical terms, is called the null hypothesis), and attempt to reject this idea.

Once they gather data, they calculate a p-value: the probability of obtaining results at least as extreme as the ones observed simply by chance, assuming the null hypothesis – that the phenomenon does not occur – is true. A low p-value means the observed data would be surprising if the null hypothesis were true, which casts doubt on the null hypothesis and lends credence to the researcher’s own hypothesis that the phenomenon does exist. Let’s explore an example.

Imagine you just met a fine lady at a tea shop. The lady claims that by tasting a cup of tea made with milk, her delicate palate can detect whether the milk or the tea was poured into the cup first. You’re skeptical, so you devise an experiment. You prepare 8 cups of tea – 4 with milk added first, and 4 with tea – order them randomly, and ask her to taste each and say how it was prepared. How many cups would she need to classify correctly in order for you to believe her? (Figure 2)

Figure 2. In Fisher’s famous experiment, a lady claims she can distinguish a cup of tea made by adding milk first from one made by adding tea first. Fisher tests her with 8 cups of tea – 4 of each type. How many would she need to classify correctly for us to be sure of her claim and that she is not just a lucky guesser? Fisher calculates p-values of potential events assuming she was guessing randomly. (1) If she classified 4 cups correctly, p ≈ 0.01 – so rare that we can be fairly sure she was not just randomly guessing. (2) If she classified 3 cups correctly, p ≈ 0.20 – not enough evidence to rule out the possibility that she randomly guessed. Hence, the lady would need to get all 4 cups right for us to be reasonably sure of her claim!

This story was originally recounted in 1935 by a statistician named Ronald Fisher in order to motivate the use of probability in designing experiments. Fisher began by counting the number of possible ways in which a person could label all 8 cups, knowing that 4 were prepared with milk first and 4 with tea. 

As he explains, a person with no distinguishing ability would be expected to, just by chance, classify all 4 cups of each type correctly in only 1 out of 70 experiments, or about 1 percent of the time. Since such an event would be exceedingly rare, if the lady classified all of the cups correctly, you could rule out random guessing and be fairly certain of her claim. 

On the other hand, an unskilled taster would be expected to classify just 3 cups of each type correctly (with one of each incorrect) about 16 times out of 70, or about 20 percent of the time. Since an event with 20 percent probability happens fairly often, if the lady only classified 3 cups of each type correctly, there would not be enough evidence to ascertain if her claim is true or if she was simply a lucky guesser.

Each of these values – the probability of a person with no special tasting ability classifying 3 cups correctly (20%) or 4 cups correctly (1%) – is an example of a p-value. In Fisher’s tea experiment, he assumed that the lady did not have the ability to tell if milk was added to the cup first. Then, since she could classify all 8 cups correctly – a highly improbable event under Fisher’s “null hypothesis” – he concluded that she probably did have the ability to tell if milk was added first or second.
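The counting behind these numbers can be checked in a few lines. The sketch below is only an illustration (the function name is made up for this example): it counts the ways a pure guesser could pick 4 cups as "milk first" out of 8. Note that 16 out of 70 is closer to 23 percent, which the text rounds to about 20 percent, and that a p-value in the strict sense counts results at least as extreme as the one observed, so "3 or more correct" gives 17 out of 70.

```python
from math import comb

def prob_correct_milk_first(k, cups=8, milk_first=4):
    """Chance that a pure guesser picks exactly k of the true milk-first cups."""
    # Ways to pick k true milk-first cups and (milk_first - k) of the other cups,
    # divided by all ways to pick milk_first cups out of the total.
    others = cups - milk_first
    return comb(milk_first, k) * comb(others, milk_first - k) / comb(cups, milk_first)

total_ways = comb(8, 4)                          # 70 possible ways to choose 4 cups
p_all_four = prob_correct_milk_first(4)          # 1/70
p_exactly_three = prob_correct_milk_first(3)     # 16/70
p_three_or_more = p_all_four + p_exactly_three   # 17/70

print(total_ways, round(p_all_four, 3), round(p_exactly_three, 3), round(p_three_or_more, 3))
# prints: 70 0.014 0.229 0.243
```

Either way the conclusion is the same: a perfect score is rare for a guesser, while getting 3 of each type right is not.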

Researchers rely on similar logic every day. Though Fisher did not invent the p-value, he did popularize its use in scientific studies and introduce the common threshold of 1 in 20 (p < 0.05) for what counts as a “rare” event. The smaller the p-value, the less likely that the results obtained are due to random chance assuming the hypothesized phenomenon does not exist – and the stronger the evidence that the phenomenon under study really occurred.

Why are p-values controversial?

Since Fisher’s time, millions of experiments have used p-values to test whether their models of the world reflect the data they gathered. Today, however, the practice is debated. In 2019, over 800 academics and researchers signed an open letter in Nature to abolish the use of p-values “to decide whether a result refutes or supports a scientific hypothesis.” In addition, select journals have banned the publication of papers containing p-values.

One major reason for this is the arbitrary definition of “rare.” Though Fisher only mentioned p < 0.05 as an example of rarity, his offhand comment has morphed into a hard threshold that scientists must often meet in order for their studies to be published at all. This can lead to publication bias and frustration from researchers who cannot publish important results that only attain, say, p = 0.06. 

Conversely, some scientists mistake a low p-value to mean that their results are consequential. This is wrong: a low p-value only helps us rule out the possibility of the results occurring due to random chance under a null hypothesis. Results could have a low p-value but limited practical importance (sometimes called “clinically insignificant”).

A commonly cited example is the Physicians’ Health Study, which found that taking aspirin reduced subjects’ risk of having a heart attack, albeit only by 0.8%, with p < 0.00001. Though the low p-value made it very unlikely that aspirin had no effect at all, the effect of taking aspirin was so tiny as to be meaningless for most people – which is why not everyone should take aspirin every day. Figure 3 illustrates the same idea using Fisher’s tea experiment.

Figure 3. A demonstration, using Fisher’s tea tasting experiment, of the difference between a p-value and the magnitude of results. (1) A high p-value would not prove that the woman has no tasting ability; it only means there is not enough evidence to support a claim of the opposite. (2) With enough data, a p-value can detect even the smallest phenomenon – like a boorish rube’s ability to distinguish the type of tea only 50.1% of the time. A p-value provides evidence that some phenomenon exists – it does not tell us how important that phenomenon actually is!
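To see how a tiny effect can still produce a very low p-value, here is a rough sketch. It makes assumptions that go beyond the original experiment: the taster judges each cup independently with a 50% chance of being right when guessing, and the p-value comes from a normal approximation rather than Fisher's exact counting. The function name and the cup counts are made up for illustration.

```python
from math import erfc, sqrt

def one_sided_p_value(accuracy, n_cups, chance=0.5):
    """Approximate p-value for scoring `accuracy` over n_cups if truly guessing."""
    # Under the null hypothesis, the observed proportion is roughly normal with
    # mean `chance` and standard error sqrt(chance * (1 - chance) / n_cups).
    standard_error = sqrt(chance * (1 - chance) / n_cups)
    z = (accuracy - chance) / standard_error
    # Upper-tail probability of a standard normal variable.
    return 0.5 * erfc(z / sqrt(2))

# Same 50.1% accuracy, increasingly large (hypothetical) numbers of cups.
for n_cups in (100, 10_000, 10_000_000, 100_000_000):
    print(n_cups, one_sided_p_value(0.501, n_cups))
```

The accuracy never changes, yet the p-value falls from roughly one half with 100 cups to far below 0.05 with ten million. The p-value tracks how much data was collected as well as the size of the effect.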

Another problem is what some refer to as “p-hacking.” In the famous “dead salmon study,” which won an Ig Nobel Prize in 2012, researchers placed a dead salmon in an fMRI scanner, showed it pictures of people, and found that the salmon actually responded positively!

Of course, this conclusion is nonsense. In fact, the study was written specifically to show how errors arise when calculating many p-values at once. The problem was that, when an fMRI machine scans a human brain (or in this case, a salmon), it measures changes in thousands of tiny sections, called voxels – and a p-value is computed for each. If you run thousands of tests like this, even events with low probability (low p-values under the null hypothesis) are bound to occur eventually by chance. 
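A quick calculation shows how fast this problem grows. The sketch below assumes the tests are independent of one another, which is a simplification (neighboring voxels in a real scan are related), and the function name is made up for this example.

```python
def chance_of_any_false_positive(n_tests, threshold=0.05):
    """Probability that at least one of n_tests null tests gives p < threshold."""
    # Each test avoids a false positive with probability (1 - threshold);
    # assuming independence, all n_tests avoid one with that value to the n-th power.
    return 1 - (1 - threshold) ** n_tests

for n_tests in (1, 20, 100, 1000):
    print(n_tests, round(chance_of_any_false_positive(n_tests), 3))
```

With a single test the chance of a false alarm is 5 percent. With 20 tests it is already about 64 percent, and with a thousand it is a near certainty.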

This is just one of many types of p-hacking: repeating multiple statistical tests until something “significant” is found. If the insignificant p-values are not reported, this is also considered “cherry-picking” – but even if they are, the presented p-values will be incorrect if the authors fail to correct for the number of tests run (which is not always feasible). The dead salmon study demonstrates how authors can misuse statistical techniques to present misleading results.
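One standard way to correct for the number of tests is the Bonferroni correction, which divides the threshold by the number of tests run. The sketch below is a simulation under stated assumptions, not a reconstruction of the salmon study: it invents 10,000 tests in which nothing real is happening, so every "significant" result is a false positive. The voxel count and variable names are hypothetical.

```python
import random

def count_significant(p_values, threshold):
    """Count how many p-values fall below the given threshold."""
    return sum(p < threshold for p in p_values)

rng = random.Random(0)   # fixed seed so the run is repeatable
n_tests = 10_000         # hypothetical number of voxels, not the study's real count

# When nothing is really happening, p-values are spread evenly between 0 and 1.
null_p_values = [rng.random() for _ in range(n_tests)]

uncorrected = count_significant(null_p_values, 0.05)
bonferroni = count_significant(null_p_values, 0.05 / n_tests)
print(uncorrected, bonferroni)
```

Without correction, around 5 percent of the tests (several hundred) look "significant" even though no effect exists. With the stricter Bonferroni threshold, few or none survive. The trade-off is that such a strict threshold can also hide real effects, which is one reason correcting is not always straightforward.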

Figure 4. The Dead Salmon Study. (1) Researchers used fMRI (normally used for brain imaging) to test whether a dead salmon reacted when shown pictures of people in social situations. (2) fMRI output tests thousands of individual parts of the image (voxels) for a response. (3) Across thousands of statistical tests, even rare false results (low p-values) will eventually occur by chance. (4) Obtaining a few positives out of many negative results, the researchers found that the salmon “reacted” to the images, demonstrating how p-values that do not correct for a large number of tests can produce absurd findings.

Why this controversy is not as bad as it may seem

You might then wonder: isn’t running thousands of experiments exactly what scientists do every day? If doing so is bad, and if p-values are controversial and often misinterpreted, how can we trust scientific papers?

The reason you can rest easy is that scientific evidence requires consensus. Any given phenomenon can have innumerable explanations. Hence, to collectively conduct their work, scientists must propose many hypotheses – and since only one can be correct, most must be wrong.

But that isn’t a bad thing! Though the media often reports on single articles, one sole study proves little. Accurate knowledge requires replication – repeated experiments that support and refine the theory – as well as disproof of other, incorrect hypotheses. If one poor study happens to publish a false positive or use p-values incorrectly, later studies will correct the error and disprove the previous conclusions.

Yet, such limitations usually are not emphasized in news coverage. That’s why it is important to keep them in mind when reading or listening to news on scientific studies. Does the story report on only a single scientific paper? Does it disprove an existing hypothesis? Importantly, does the paper in question build on years of previous research? Even if an individual paper is trustworthy, it is important to consider questions like these to properly digest scientific news.

Even though a single p-value cannot “prove” a hypothesis, p-values help scientists avoid publishing results that are attributable more to randomness than any relevant phenomenon. They are just one tool out of many that allow scientists to critique their own findings and, in the process, build the types of consensus that really are scientifically important.  

So, if you see low p-values reported from a scientific article, know that the authors took care to ensure the quality of their findings – but also know that the p-value is far from the end of their story.

Salvador Balkus is a PhD student in Biostatistics at the Harvard T.H. Chan School of Public Health.

Cover image by ColiN00B on pixabay.

