A/B Testing Statistics: The Complete Guide for Growth Teams
Most growth teams have a statistics problem they don't know they have. They run A/B tests, see a p-value below 0.05, and ship the variant — not knowing that the way they ran the test almost guarantees a high false positive rate. This guide covers everything you need to know about A/B testing statistics to run tests that produce reliable results: statistical significance, power, sample size, effect size, and the critical mistake of peeking at results before your test is complete.
🧮 Use the free tool: A/B Test Sample Size Calculator — no signup required
Open tool →
Statistical significance, p-values, and what they actually mean
A p-value is the probability of observing your result (or something more extreme) if there were actually no effect. A p-value of 0.05 means there is a 5% chance of seeing this result by random chance alone if the null hypothesis is true. 'Statistical significance at 95% confidence' does not mean there is a 95% chance your variant is better — it means you'd see a false positive 5% of the time if you ran this test on no-effect data. The practical implication: if you run 20 tests with no real effect, you expect 1 false positive. In a high-velocity testing program, false positives compound quickly unless you control for them.
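The arithmetic behind that compounding is easy to check. A minimal sketch (plain Python, no libraries) of how false positives accumulate across a no-effect testing program at a 5% significance threshold:

```python
# False positives across many A/B tests with no real effect,
# each run at a 5% significance threshold.
alpha = 0.05
num_tests = 20

# Each no-effect test independently "wins" with probability alpha
expected_false_positives = alpha * num_tests
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"Expected false positives in {num_tests} tests: {expected_false_positives:.1f}")
print(f"Probability of at least one false positive: {prob_at_least_one:.1%}")
```

With 20 no-effect tests you expect one false positive on average, and the chance of at least one is roughly 64%, which is why high-velocity programs need explicit multiple-testing discipline.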
Statistical power and the danger of underpowered tests
Statistical power is the probability of detecting a real effect if one exists. The standard target is 80% power, meaning you accept a 20% chance of a false negative (missing a real effect). An underpowered test doesn't just risk missing effects: when it does reach significance, it produces inflated effect size estimates (the 'winner's curse'). The most common cause of underpowered tests is stopping early: seeing a positive trend on day 3 and ending the test before reaching your pre-calculated sample size. Stopping early like this can inflate your measured effect by 2–5× and pushes your false positive rate far above the nominal 5%.
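You can see how quickly power collapses at small sample sizes with the standard normal approximation for a two-proportion z-test. This is a sketch, not the exact formula every calculator uses, and the 3,000-user scenario is a hypothetical illustration:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation, unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_effect = abs(p2 - p1) / se
    # Probability the test statistic clears the critical value
    return NormalDist().cdf(z_effect - z_alpha)

# Hypothetical signup flow: 3% baseline, hoping to detect 3.6% (a 20% lift)
print(power_two_proportions(0.03, 0.036, 3_000))   # severely underpowered
print(power_two_proportions(0.03, 0.036, 14_000))  # near the 80% target
```

At 3,000 users per variant this test has only about a one-in-four chance of detecting a real 20% lift; any 'significant' result it does produce is disproportionately likely to be an overestimate.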
How to calculate the right sample size
Sample size depends on four inputs: your baseline conversion rate, your minimum detectable effect (MDE), your desired statistical power (default 80%), and your significance threshold (default 5%). For a signup flow with a 3% baseline conversion rate and a target of detecting a 20% relative improvement (0.6 percentage points absolute), you need roughly 11,000 users per variant at 80% power with a one-sided test (about 13,900 with the standard two-sided test). Smaller MDEs require sharply larger samples: sample size scales with 1/MDE², so halving the MDE roughly quadruples the users you need. The most common mistake is setting MDE too small (trying to detect a 5% relative change with insufficient traffic) and either running underpowered tests or running tests for months. Our sample size calculator handles this math automatically.
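The formula behind these numbers can be sketched in a few lines. Note that calculators differ slightly in their assumptions (one-sided vs. two-sided tests, pooled vs. unpooled variance), which is why published figures for the same inputs vary; this version uses the unpooled normal approximation:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, power=0.80,
                            alpha=0.05, two_sided=True):
    """Required users per variant for a two-proportion z-test
    (normal approximation, unpooled variance)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 if two_sided else 1))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 20% relative MDE (3.0% -> 3.6%)
print(sample_size_per_variant(0.03, 0.20, two_sided=False))  # ~11,000 per variant
print(sample_size_per_variant(0.03, 0.20, two_sided=True))   # ~13,900 per variant
```

Because the MDE enters the denominator squared, asking to detect a 10% lift instead of a 20% lift quadruples the required sample, which is the trade-off that sinks most low-traffic testing programs.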
The peeking problem and how to solve it
Peeking — looking at test results before reaching your pre-calculated sample size and potentially stopping early — inflates your false positive rate dramatically. A test run with an 'optional stopping' rule (stop when p < 0.05) has an actual false positive rate of 22–26%, not 5%. The solutions: (1) pre-commit to your sample size and don't look at results until you've reached it (hard to maintain in practice), (2) use sequential testing methods (SPRT or mSPRT) that account for multiple looks, or (3) use Bayesian methods that naturally accommodate early stopping without inflating error rates. Tools like Statsig, Eppo, and Amplitude Experiment have sequential testing built in.
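You can demonstrate the peeking problem directly with a Monte Carlo A/A simulation: both arms have the identical conversion rate, so every 'significant' result is a false positive by construction. All parameters here (5% base rate, look interval, horizon) are illustrative:

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def aa_test_with_peeking(rng, rate=0.05, look_every=200, max_n=2000):
    """One simulated A/A test (no real effect). Peek at every interim
    look and stop as soon as p < 0.05. Returns True on a false positive."""
    conv_a = conv_b = 0
    for n in range(1, max_n + 1):
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n % look_every == 0 and p_value(conv_a, n, conv_b, n) < 0.05:
            return True
    return False

rng = random.Random(42)
sims = 500
false_positives = sum(aa_test_with_peeking(rng) for _ in range(sims))
fp_rate = false_positives / sims
print(f"False positive rate with peeking: {fp_rate:.1%}")  # well above the nominal 5%
```

With ten interim looks and an optional-stopping rule, the realized false positive rate lands around 20% rather than the nominal 5%, which matches the inflation this section describes.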
A/B test statistical standards checklist
- Sample size is calculated before the test launches (not after)
- Minimum detectable effect is defined and realistic for your traffic volume
- Statistical power is set to 80% or higher
- Significance threshold is defined (95% standard; 99% for high-stakes decisions)
- Test runs for the full pre-calculated duration regardless of early trends
- No peeking: results not reviewed until sample size is reached
- Null results are documented the same way as positive results
- A minimum test duration of 7 days accounts for weekday/weekend variation
Need expert help applying this?
Adasight works with scaling D2C and SaaS companies to build the analytics foundations and experimentation programs that make this work in practice.
Talk to Adasight →
Frequently asked questions
What p-value should I use for A/B tests?
The standard is p < 0.05 (95% confidence). Use p < 0.01 (99% confidence) for high-stakes decisions like pricing changes or permanent feature removals. Using p < 0.10 (90% confidence) is acceptable for low-traffic, low-stakes tests where the cost of a false negative is higher than the cost of a false positive.
How do I know if my A/B test has enough traffic?
Before launching, calculate your required sample size based on your baseline conversion rate and minimum detectable effect. If reaching that sample size would take more than 4 weeks at your current traffic volume, either increase your MDE (accept detecting larger effects only) or wait until you have more traffic. Running underpowered tests that never reach significance is one of the most common wastes of experimentation resources.
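The feasibility check described above reduces to one division. A quick sketch (the sample size and traffic figures are hypothetical; plug in your own from the calculator):

```python
def weeks_to_complete(n_per_variant, daily_traffic, num_variants=2):
    """Weeks needed to reach the pre-calculated sample size,
    assuming eligible traffic is split evenly across variants."""
    total_needed = n_per_variant * num_variants
    return total_needed / daily_traffic / 7

# Hypothetical: ~11,000 users needed per variant, 500 eligible users/day
weeks = weeks_to_complete(11_000, 500)
print(f"{weeks:.1f} weeks")  # over the 4-week guideline: raise the MDE or wait
```

If the result exceeds roughly four weeks, treat it as a signal to widen your MDE or defer the test rather than launch it anyway.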
What is the minimum test duration for an A/B test?
Regardless of when you reach your sample size, run every test for at least 7 full days (including at least one full weekend cycle). This ensures you capture weekly behavioral patterns — many products have 30–50% higher conversion on weekdays vs. weekends, and stopping a test on a Tuesday when it reached significance Sunday dramatically overstates the effect.
Related guides
What Is Growth Analytics? A Complete Guide for 2026
Read guide →
The 12 Growth Analytics Metrics Every Team Should Track
Read guide →