Thursday, March 12, 2026

The Birth—and Trouble—of Statistical Significance

How p < 0.05 Became Science’s Most Famous Threshold

If you have ever read a scientific paper, you have almost certainly seen a statement like:

“The result was statistically significant (p < 0.05).”

This tiny number—0.05—has quietly shaped scientific conclusions for nearly a century. It determines which results are celebrated, which papers get published, and sometimes even which careers flourish.

But where did this number come from?
Why 0.05 and not 0.03, 0.1, or 0.001?

And why are many scientists today arguing that our dependence on this threshold has created serious problems for science?

Let’s explore the story of statistical significance, the errors it produces, and the historical accident that turned 0.05 into a scientific dogma.


The Birth of Statistical Significance

The idea of statistical significance emerged in the early 20th century when scientists began grappling with a fundamental problem:

How do we know whether an observed effect is real or just due to random chance?

To address this, statisticians developed hypothesis testing.

The framework works like this:

  1. Start with a null hypothesis (H₀)
    Example: There is no difference between two treatments.

  2. Collect data.

  3. Calculate how surprising the data would be if the null hypothesis were true.

This probability is the p-value.

A small p-value suggests that the observed result would be unlikely if the null hypothesis were correct.
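
To make the three steps concrete, here is a minimal sketch in Python (NumPy and SciPy; the treatment data are simulated, and all numbers are made up purely for illustration):

    import numpy as np
    from scipy import stats

    # Step 1: the null hypothesis H0 says the two treatments do not differ.
    # Step 2: collect data (simulated here purely for illustration).
    rng = np.random.default_rng(42)
    treatment_a = rng.normal(loc=10.0, scale=2.0, size=30)
    treatment_b = rng.normal(loc=11.0, scale=2.0, size=30)

    # Step 3: ask how surprising this difference would be if H0 were true.
    result = stats.ttest_ind(treatment_a, treatment_b)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")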


What Exactly is a p-value?

Mathematically, the p-value represents:

p = P(data at least as extreme as observed | H₀ is true)

In simple words:

The p-value measures how surprising your data would be if nothing interesting were actually happening.

A small p-value means the data would be rare under the null hypothesis.
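
One way to build intuition for this definition is to compute a p-value by brute force. The sketch below (hypothetical data, pure NumPy) repeatedly shuffles the group labels, which is what "nothing interesting is happening" looks like, and counts how often a shuffled difference is at least as extreme as the observed one:

    import numpy as np

    rng = np.random.default_rng(0)
    group_a = rng.normal(10.0, 2.0, 30)
    group_b = rng.normal(11.2, 2.0, 30)
    observed = abs(group_a.mean() - group_b.mean())

    # Under H0 the group labels are exchangeable, so shuffling
    # them generates the null distribution of the difference.
    pooled = np.concatenate([group_a, group_b])
    n_perm = 10_000
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:30].mean() - pooled[30:].mean())
        if diff >= observed:
            count += 1

    print(f"empirical p-value: {count / n_perm:.4f}")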


The Man Who Made 0.05 Famous

The threshold p < 0.05 is closely associated with the British statistician and geneticist Ronald Aylmer Fisher.



Fisher was one of the founders of modern statistics and population genetics. In 1925, he published a landmark book:

Statistical Methods for Research Workers

In that book, Fisher suggested a convenient guideline:

Results with p < 0.05 could be considered statistically significant.

But here is the crucial point:

Fisher never intended this to be a strict rule.

He wrote that 0.05 was merely a convenient benchmark, not a universal law.

Yet over time, scientists treated it exactly like one.


What Did Scientists Use Before Fisher?

Before Fisher’s work, statistics was far less standardized.

Researchers used several different approaches:

1. Probability thresholds varied widely

Different fields used thresholds like:

  • 1/10 (0.1)

  • 1/20 (0.05)

  • 1/100 (0.01)

There was no universal standard.


2. Direct probability reasoning

Scientists often asked questions like:

“What is the probability that this difference arose by chance?”

But they rarely used formal hypothesis testing.


3. Early work by Karl Pearson

Another major figure was Karl Pearson, who developed tests like the chi-square test.

However, Pearson did not promote a universal significance threshold.


The Neyman–Pearson Revolution

In the 1930s, statisticians Jerzy Neyman and Egon Pearson introduced a more formal decision framework.

They introduced the concepts of:

  • Type I error

  • Type II error

  • statistical power


The Two Errors of Significance

When we perform hypothesis testing, two kinds of mistakes can occur.

Type I Error: False Positive

A Type I error occurs when we reject the null hypothesis even though it is true.

Example:

You conclude a drug works when in fact it does not.

The probability of making this error is called the significance level (α).

This is where 0.05 comes from.
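
You can see α at work by simulating a world where the null hypothesis is true by construction. In the sketch below, about 5% of tests still come out significant at the 0.05 level; those are Type I errors:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_trials = 10_000
    false_positives = 0

    for _ in range(n_trials):
        # H0 is true by construction: identical distributions.
        a = rng.normal(0.0, 1.0, 30)
        b = rng.normal(0.0, 1.0, 30)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1

    print(f"Type I error rate: {false_positives / n_trials:.3f}")  # about 0.05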


Type II Error: False Negative

A Type II error occurs when we fail to detect a real effect.

Example:

A drug actually works, but your study concludes that it does not.

The probability of making this error is called β.

The power of a statistical test is:

Power = 1 − β

This measures how likely we are to detect a real effect.
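
Power can be estimated the same way: simulate a world where the effect is real and count how often the test detects it. A sketch, assuming a true difference of half a standard deviation and 30 subjects per group:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_trials = 10_000
    detections = 0

    for _ in range(n_trials):
        # A real effect exists: the means differ by 0.5 standard deviations.
        a = rng.normal(0.0, 1.0, 30)
        b = rng.normal(0.5, 1.0, 30)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            detections += 1

    print(f"estimated power: {detections / n_trials:.2f}")  # roughly 0.5

Under these assumptions the power comes out near 0.5, well below the commonly recommended 0.8, which is one reason small studies so often miss real effects.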


When “Significant” Became a Problem

Over time, the threshold p < 0.05 became deeply embedded in scientific practice.

But this created several problems.


The First Problem: Misinterpreting the p-value

Many scientists misunderstand what a p-value means.

Common incorrect interpretations include:

  • “p = 0.05 means there is a 95% chance the hypothesis is correct.”

  • “p = 0.05 means the result will replicate 95% of the time.”

  • “p < 0.05 proves the effect is real.”

All of these interpretations are incorrect.

The p-value only measures:

How surprising the data are under the null hypothesis.

It says nothing about whether the hypothesis itself is true.


The Binary Thinking Trap

The 0.05 threshold created a dangerous habit:

p-value    Conclusion
0.049      Significant
0.051      Not significant

Yet these results are almost identical statistically.

Still, one result may be published while the other is ignored.

This binary thinking has distorted scientific decision-making.


The Rise of p-Hacking

Because journals prefer statistically significant results, researchers sometimes adjust their analyses, often without realizing it, until significance appears.

This practice is called p-hacking.

Examples include:

  • trying multiple statistical tests

  • analyzing subsets of the data

  • removing outliers selectively

  • collecting data until p < 0.05 appears

Each of these increases the chance of false discoveries.
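
The last item, collecting data until p < 0.05 appears, is easy to simulate. In the sketch below (no real effect anywhere, by construction), checking the p-value after every batch of 10 observations per group and stopping at the first "significant" result pushes the false positive rate well above the nominal 5%:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_studies = 2_000
    false_positives = 0

    for _ in range(n_studies):
        a, b = [], []
        # Optional stopping: add 10 subjects per group, test, repeat.
        for _batch in range(10):
            a.extend(rng.normal(0.0, 1.0, 10))
            b.extend(rng.normal(0.0, 1.0, 10))
            if stats.ttest_ind(a, b).pvalue < 0.05:
                false_positives += 1  # stop and declare "success"
                break

    # Well above the nominal 0.05, even though no effect exists.
    print(f"false positive rate: {false_positives / n_studies:.3f}")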


The Multiple Testing Problem

This issue becomes especially severe in modern biology.

Imagine testing 20,000 genes for association with a disease.

If we use p < 0.05:

  • about 1,000 genes (5% of 20,000) may appear significant purely by chance

This is why fields like genomics now use multiple testing correction, such as:

  • Bonferroni correction

  • False Discovery Rate (FDR)
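
A quick simulation shows the arithmetic. Under the null hypothesis p-values are uniformly distributed, so with 20,000 tests a naive 0.05 cutoff flags roughly 1,000 false positives, while a Bonferroni-corrected cutoff of 0.05 / 20,000 flags almost none:

    import numpy as np

    rng = np.random.default_rng(4)
    n_genes = 20_000

    # Under the null, p-values are uniform on [0, 1].
    p_values = rng.uniform(0.0, 1.0, n_genes)

    naive_hits = np.sum(p_values < 0.05)
    bonferroni_hits = np.sum(p_values < 0.05 / n_genes)

    print(f"naive hits:      {naive_hits}")       # about 1,000, all false
    print(f"Bonferroni hits: {bonferroni_hits}")  # almost always 0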


Statistical Significance vs Scientific Importance

Another issue is that statistical significance does not measure importance.

With very large datasets, even tiny effects become significant.

Example:

  • effect size = extremely small

  • p-value = extremely small

The result may be statistically significant but biologically trivial.
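
A sketch of how this happens (made-up numbers: a true difference of 0.01 standard deviations and one million samples per group):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 1_000_000

    # A biologically trivial effect: 0.01 standard deviations.
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)

    result = stats.ttest_ind(a, b)
    print(f"p-value: {result.pvalue:.2e}")                     # highly "significant"
    print(f"difference in means: {b.mean() - a.mean():.4f}")   # tiny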

This is why modern papers increasingly report:

  • effect sizes

  • confidence intervals

  • biological interpretation


The Replication Crisis

In the last decade, large replication studies in psychology and biomedical research discovered something troubling:

Many statistically significant findings could not be reproduced.

This became known as the replication crisis.

Major contributing factors include:

  • small sample sizes

  • publication bias

  • p-hacking

  • misuse of p-values

  • multiple testing


The “Winner’s Curse”

There is also a phenomenon called the Winner’s Curse.

When a result barely passes the significance threshold, the estimated effect size tends to be inflated.

Later studies typically find much smaller effects.

This explains why many exciting discoveries initially appear strong but weaken over time.
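
The Winner's Curse is easy to reproduce in simulation. In the sketch below (assumed true effect of 0.2 standard deviations, small studies of 20 subjects per group), averaging the effect estimates from only the studies that reached significance gives a value several times larger than the truth:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    true_effect = 0.2
    significant_effects = []

    for _ in range(10_000):
        a = rng.normal(0.0, 1.0, 20)
        b = rng.normal(true_effect, 1.0, 20)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            significant_effects.append(b.mean() - a.mean())

    print(f"true effect:                 {true_effect}")
    # Several times the true value: only inflated estimates "win".
    print(f"mean effect among 'winners': {np.mean(significant_effects):.2f}")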


The Modern Reform Movement

In response to these problems, statisticians are proposing several reforms.


1. Lowering the significance threshold

Some scientists propose using:

  • p < 0.005 instead of 0.05

to reduce false discoveries.


2. Reporting effect sizes

Instead of only reporting p-values, studies should report:

  • magnitude of the effect

  • uncertainty around the estimate
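
A sketch of what that reporting looks like in code (hypothetical data; Cohen's d computed from the pooled standard deviation, and a normal-approximation 95% confidence interval for the mean difference):

    import numpy as np

    rng = np.random.default_rng(7)
    a = rng.normal(10.0, 2.0, 50)
    b = rng.normal(11.0, 2.0, 50)

    diff = b.mean() - a.mean()
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohens_d = diff / pooled_sd  # standardized effect size

    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

    print(f"effect size (Cohen's d): {cohens_d:.2f}")
    print(f"mean difference: {diff:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")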


3. Multiple testing correction

Common approaches include:

  • Bonferroni correction

  • False Discovery Rate (FDR)

These are essential in genomics and proteomics.


4. Pre-registration

Researchers declare their:

  • hypotheses

  • analysis plans

before collecting data.

This reduces p-hacking.


Where Science Stands Today

A century after Fisher’s suggestion, the 0.05 threshold still dominates scientific research.

Yet many statisticians now argue that science should move away from rigid thresholds and toward a more nuanced approach.

Instead of asking:

“Is this result significant?”

we should ask:

  • How large is the effect?

  • How reliable is the evidence?

  • Does it replicate?

  • Is it biologically meaningful?


A Final Irony

The most fascinating part of this story is that the famous threshold 0.05 was never meant to be a rule.

It was simply a convenient guideline suggested by Fisher nearly 100 years ago.

Yet that small number went on to shape the structure of modern science.

Sometimes, the biggest forces in science begin with the smallest numbers.
