Experimentation By Gregor Spielmann, Adasight

Statistical Significance Explained: A Plain-English Guide

Statistical significance is one of the most misunderstood concepts in growth analytics. Most people who run A/B tests use statistical significance without being able to explain what it actually means — and that misunderstanding leads to real mistakes: shipping losing variants, missing real improvements, and building false confidence in unreliable results. This guide explains statistical significance in plain English, without equations.

🧮 Use the free tool: A/B Test Sample Size Calculator — no signup required

Open tool →

The coin flip analogy: what statistical significance is actually measuring

Imagine you flip a coin 10 times and get 7 heads. Is the coin biased? Maybe. But 7 heads out of 10 isn't that unusual: a fair coin lands on 7 or more heads about 17% of the time. Now imagine you flip it 1,000 times and get 700 heads. That is almost impossible by chance (probability < 0.0001). Statistical significance is that calculation: given the data I observed, how likely is it that the result is just random noise? When we say 'statistically significant at 95% confidence', we mean: if there were no real effect, we would observe a result this extreme or more extreme only 5% of the time by random chance.
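If you want to check the coin-flip numbers yourself, the calculation is just a binomial tail probability. This is a minimal sketch using only the Python standard library; the function name `prob_at_least` is ours, not part of any library:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n trials (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 7 or more heads in 10 flips of a fair coin: about 17%
print(prob_at_least(7, 10))    # 0.171875

# 700 or more heads in 1,000 flips: far below 0.0001
print(prob_at_least(700, 1000))
```

The second number is so small that 700 heads out of 1,000 flips is effectively impossible for a fair coin, which is exactly the intuition behind a significance test.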

What statistical significance does NOT mean

- It does not mean there is a 95% chance that your variant is better. This is the most common misconception: the 95% confidence level is a property of the testing procedure, not a probability about the specific result.
- It does not mean the effect is large or practically important. A 0.01% improvement can be statistically significant if your sample size is large enough.
- It does not mean you should ship the variant. Statistical significance tells you that an effect probably exists; it says nothing about whether that effect is worth the implementation cost, user experience trade-offs, or technical debt.

The relationship between sample size and significance

Statistical significance depends directly on sample size. With a large enough sample, even tiny differences become statistically significant. This is why you should always pre-calculate your required sample size based on the minimum effect you care about (the minimum detectable effect), not just run a test until it 'gets significant'. Running a test until significance is reached — without a pre-determined sample size — is called 'peeking' and it guarantees a false positive rate much higher than 5%, even if you're using a 95% confidence threshold.

Practical statistical significance vs. statistical significance

Statistical significance answers: is this effect real (not random noise)? Practical significance answers: is this effect large enough to matter for the business? These are different questions. A test that shows a statistically significant 0.1% improvement in signup conversion is almost certainly not worth shipping unless your signup volume is enormous. A test that shows a 15% improvement but only barely reaches statistical significance (p = 0.049) might be worth shipping because the effect size is large. Good experimentation culture evaluates both — effect size (confidence interval) alongside p-value — not just whether p < 0.05.

Statistical significance checklist for A/B tests

- Pre-calculate the required sample size from your minimum detectable effect before starting the test.
- Run the test to the pre-determined sample size; don't peek and stop the moment significance appears.
- Report the confidence interval and effect size alongside the p-value, not just whether p < 0.05.
- Evaluate practical significance: is the effect large enough to justify the implementation cost and trade-offs?
- Expect false positives: at 95% confidence, roughly 1 in 20 no-effect tests will look significant, so replicate surprising wins and document null results.

Need expert help applying this?

Adasight works with scaling D2C and SaaS companies to build the analytics foundations and experimentation programs that make this work in practice.

Talk to Adasight →

Frequently asked questions

What does 95% confidence mean in an A/B test?

95% confidence means that if the null hypothesis (no real effect) were true, you would see a result as extreme as yours or more extreme only 5% of the time. It does not mean there is a 95% probability that your variant is better — that interpretation (the posterior probability) requires Bayesian methods, not frequentist significance testing.

What is the difference between statistical significance and confidence interval?

Statistical significance is a binary outcome (significant or not), based on a threshold. A confidence interval gives you a range of plausible values for the true effect. A 95% confidence interval on a conversion rate improvement of '4% ± 2%' tells you both that the effect is significant and that the true effect is likely between 2% and 6%. Confidence intervals are more informative than p-values alone and should always be reported alongside significance.
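Computing such an interval is straightforward. This is a minimal sketch using the standard Wald interval for the difference of two proportions; the function name `diff_ci` and the example numbers are ours, chosen for illustration:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a: int, n_a: int, conv_b: int, n_b: int,
            confidence: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# control: 500/10,000 converted; variant: 600/10,000 converted
lo, hi = diff_ci(500, 10_000, 600, 10_000)
print(f"lift: {lo:+.2%} to {hi:+.2%}")
```

Here the whole interval sits above zero, so the result is significant at 95% confidence, and the interval also tells you how large the true lift plausibly is, which a bare p-value never does.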

Can an A/B test be statistically significant but wrong?

Yes. Statistical significance at 95% confidence means you'll have a false positive 5% of the time — 1 in 20 tests with no real effect will appear significant. In a high-velocity program running 50 tests per year on features with no real effect, you'd expect 2–3 false positives per year by chance alone. This is why replication and null result documentation matter: a result that can't be replicated in a follow-up test was probably a false positive.
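You can see this false-positive rate directly by simulating A/A tests, where both variants share the same true conversion rate so every 'significant' result is a false positive. A rough sketch, with all names and numbers ours:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)
Z95 = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence

def aa_test_significant(p: float = 0.05, n: int = 2_000) -> bool:
    """Run one A/A test (both variants share true rate p); return True if a
    two-proportion z-test wrongly calls it significant at 95% confidence."""
    conv_a = sum(random.random() < p for _ in range(n))
    conv_b = sum(random.random() < p for _ in range(n))
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_b / n - conv_a / n) / se > Z95

trials = 400
false_positives = sum(aa_test_significant() for _ in range(trials))
print(f"{false_positives / trials:.1%} of no-effect tests came out 'significant'")
```

Across many runs this hovers around 5%, which is exactly the false positive rate the 95% confidence threshold promises, and exactly why a busy testing program should expect a few spurious winners per year.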
