Experimentation · By Gregor Spielmann, Adasight

How to Build an Experimentation Program from Scratch

Most companies don't have an experimentation program — they have a collection of A/B tests. The difference is infrastructure, process, and culture. A real experimentation program runs tests continuously, captures learnings systematically, generates better hypotheses over time, and compounds into a durable competitive advantage. This guide covers how to build one from scratch — or how to upgrade what you have.

🧮 Use the free tool: Experimentation ROI Calculator — no signup required

Open tool →

Step 1: Get the data foundation right before running tests

Experimentation requires trustworthy data. Before launching your first A/B test, you need: clean event tracking on the metrics you plan to test (with no duplicate events, no data gaps, no inconsistent naming), a validated baseline conversion rate that you can reproduce from your analytics tool, and a sample size estimate that confirms your traffic volume is sufficient to detect effects of the size you care about. Starting experimentation on a weak data foundation is one of the most common and costly mistakes — you'll run 6 months of tests and not know whether to trust any of the results.
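As a quick sanity check on that last point, the sample size can be estimated with a standard two-proportion power calculation before any tooling is in place. Here's a minimal sketch in Python, assuming scipy is available; the 3% baseline and 10% relative lift are hypothetical numbers, not a recommendation:

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)   # smallest lift worth detecting
    z_alpha = norm.ppf(1 - alpha / 2)         # critical value for the significance level
    z_power = norm.ppf(power)                 # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_power) ** 2 * variance) / (p2 - p1) ** 2
    return int(n) + 1

# Hypothetical example: 3% baseline conversion, 10% relative lift we care about
n = sample_size_per_variant(0.03, 0.10)
print(f"~{n:,} users per variant")  # roughly 53,000 per variant
```

If that number is larger than the traffic you can realistically send to the test in a few weeks, the hypothesis needs a bigger expected effect or a higher-traffic surface.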

Step 2: Choose your tooling stack

The minimum experimentation stack: a feature flagging and A/B testing tool + an analytics tool for measuring results. Options: Amplitude Experiment (if you use Amplitude for analytics — tight integration is a major advantage), Statsig (feature flags, A/B testing, and analytics in one, strong sequential testing), GrowthBook (open-source, free, integrates with any analytics warehouse), or VWO/Optimizely (legacy web testing tools, less suited for product-side experiments). The common mistake is using your analytics tool's built-in 'experiment' feature without understanding its statistical methodology — some tools default to non-rigorous methods.
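Whichever tool you choose, the assignment mechanics underneath are similar: each user is hashed deterministically into a bucket so the same user always sees the same variant. A minimal sketch of that idea in Python (the experiment name and 50/50 split are hypothetical; in practice the SDKs of tools like Statsig or GrowthBook handle this for you):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant by hashing user_id + experiment name."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000  # stable bucket in [0, 9999]
    return variants[0] if bucket < 5_000 else variants[1]       # 50/50 split

# The same user always lands in the same variant for a given experiment
print(assign_variant("user_42", "new_onboarding_flow"))
```

Understanding this layer also makes it easier to audit a vendor's statistical methodology, because you can separate "how users are assigned" from "how results are analyzed."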

Step 3: Build the process (hypothesis → test → learn → repeat)

A structured experimentation process has five steps: (1) Hypothesis generation — a written hypothesis in the format 'We believe that [change] for [user segment] will [metric impact] because [mechanism]'. (2) Prioritization — ranking hypotheses by potential impact and confidence (ICE or PIE scoring). (3) Test design — defining primary metric, guardrail metrics, sample size, duration. (4) Analysis — reading results after reaching the pre-determined sample size, not before. (5) Documentation — recording the hypothesis, result, effect size, and learning, regardless of outcome. The last step is where most programs fail. Learnings that aren't documented don't compound.
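Prioritization stays honest when the scoring lives somewhere reproducible rather than in people's heads. A minimal ICE-scoring sketch in Python (the hypotheses and scores are invented for illustration; some teams average the three factors instead of multiplying them):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str   # "We believe that [change] for [segment] will [impact] because [mechanism]"
    impact: int      # 1-10: how much the primary metric could move
    confidence: int  # 1-10: how much evidence supports the mechanism
    ease: int        # 1-10: how cheap it is to build and ship

    @property
    def ice(self) -> int:
        return self.impact * self.confidence * self.ease  # product form of ICE

backlog = [
    Hypothesis("Shorter signup form for mobile users will lift activation", 7, 6, 8),
    Hypothesis("Social proof on the pricing page will lift trial starts", 5, 4, 9),
]

for h in sorted(backlog, key=lambda h: h.ice, reverse=True):
    print(f"{h.ice:4d}  {h.statement}")
```

The same structure works as a spreadsheet; what matters is that every hypothesis in the backlog carries the written statement and a comparable score.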

Step 4: Scale the program

A mature experimentation program (20+ tests per month) requires more than a single analyst. The operational requirements: a hypothesis backlog that's always populated with 20+ ranked hypotheses (so teams never wait for ideas), a learning repository that's searchable and referenced when generating new hypotheses, a weekly experimentation review where results are shared cross-functionally, and statistical governance that prevents the most common errors (underpowered tests, peeking, incorrect metric selection). The velocity ceiling for most programs isn't traffic — it's hypothesis quality and engineering capacity to implement tests.
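Statistical governance doesn't have to mean heavy process. Even a small automated gate that refuses to report significance before the pre-registered sample size is reached removes the most common failure mode (peeking). A minimal sketch, assuming results are read with a simple two-proportion z-test; a real program would use its tool's sequential or Bayesian method instead:

```python
from math import sqrt
from scipy.stats import norm

def read_result(conv_a: int, n_a: int, conv_b: int, n_b: int,
                planned_n_per_variant: int) -> str:
    """Only report a p-value once both variants hit the pre-registered sample size."""
    if n_a < planned_n_per_variant or n_b < planned_n_per_variant:
        return "Still collecting data - no peeking at significance yet."
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return f"Lift: {(p_b - p_a) / p_a:+.1%}, p-value: {p_value:.3f}"

# Hypothetical mid-test read: blocked because the planned sample size isn't reached
print(read_result(conv_a=1450, n_a=30_000, conv_b=1580, n_b=30_000,
                  planned_n_per_variant=53_000))
```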

Experimentation program launch checklist

Need expert help applying this?

Adasight works with scaling D2C and SaaS companies to build the analytics foundations and experimentation programs that make this work in practice.

Talk to Adasight →

Frequently asked questions

How many tests should an experimentation program run per month?

The right number depends on your traffic, engineering capacity, and hypothesis pipeline. Early programs typically run 2–5 tests per month. Mature programs at well-funded growth companies run 20–50. The constraint usually shifts from traffic (early) to hypothesis quality and engineering velocity (as the program matures). Velocity without quality produces false learnings.

What is a good win rate for an experimentation program?

20–35% is healthy for a well-run program. Below 20% suggests hypothesis quality is low or the program is testing too conservatively (only obvious improvements). Above 40% often indicates peeking or underpowered tests producing false positives. A 'winner' in a low-quality program is often a false positive, not a real improvement.

How do you get leadership buy-in for experimentation?

The most effective approach is calculating the ROI of one well-run A/B test — take an existing conversion rate, model a 10% improvement, and multiply by annual revenue to show the dollar impact. Then use the Experimentation ROI Calculator to project the cumulative impact of running 5–10 tests per month. Leadership resistance usually comes from not understanding the compounding nature of experimentation: each winning test permanently improves the baseline for all future tests.
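A minimal version of that back-of-the-envelope calculation in Python (all figures are hypothetical; the Experimentation ROI Calculator does a fuller version of this):

```python
annual_revenue = 12_000_000    # hypothetical: $12M/year flowing through the funnel
baseline_conversion = 0.03     # hypothetical current conversion rate
relative_lift = 0.10           # the 10% improvement being modeled

# With traffic and order value held constant, revenue scales with conversion,
# so a 10% relative lift in conversion is roughly 10% more revenue.
new_conversion = baseline_conversion * (1 + relative_lift)
incremental_revenue = annual_revenue * (new_conversion / baseline_conversion - 1)
print(f"One winning test worth ~${incremental_revenue:,.0f} per year")  # ~$1,200,000
```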
