
A/B Test Plan Template

The number one experimentation mistake is not a bad hypothesis. It is deciding what 'success' means after you have already seen the results. This template makes you commit before you run anything.
Talvinder Singh, Pragmatic Leaders

Every team that runs experiments eventually ships a change that “won” only because someone picked the metric that looked good after the fact. A written test plan eliminates that. You define the hypothesis, the primary metric, and the success threshold before any code runs. If you cannot fill this template, you are not ready to experiment — you are guessing with extra steps.


The template

Copy the entire block below, from the experiment title through the Results section.


# Experiment Plan: [Experiment Name]

**Owner:** [PM or experimenter name]
**Created:** [Date]
**Status:** Planned | Running | Completed | Killed

---

## Hypothesis

We believe that [specific change]
will [expected effect — increase / decrease / improve]
on [metric name]
because [reasoning grounded in evidence — user research, data pattern, prior experiment].

---

## Primary metric (ONE — pick one, commit)

- **Metric:** [e.g. Trial-to-paid conversion rate]
- **Current baseline:** [Measured number — go pull it before writing this]
- **Data source:** [Exact dashboard, event, or query — not "analytics"]

## Secondary metrics (guardrails — these must not degrade)

| Metric | Baseline | Acceptable range | Why it matters |
|--------|----------|-----------------|----------------|
| [e.g. Page load time] | [current] | [must stay under Xms] | [degradation kills the win] |
| [e.g. Support ticket volume] | [current] | [no more than X% increase] | [hidden cost of the change] |

---

## Control vs Variant

**Control (A):** [Describe current experience — what users see today]

**Variant (B):** [Describe the change — be specific enough that an engineer can build it]

If testing multiple variants, add Variant C/D with the same level of detail.
Each additional variant increases the sample size you need.

---

## Target audience

- **Segment:** [e.g. New users, first 7 days, mobile web, India]
- **Traffic allocation:** [e.g. 50/50 split]
- **Exclusions:** [Who is excluded and why — e.g. internal users, users already past onboarding]

---

## Sample size and duration

- **Minimum detectable effect (MDE):** [The smallest improvement worth shipping — e.g. 3% relative lift]
- **Required sample size per variant:** [Calculate at https://www.evanmiller.org/ab-testing/sample-size.html]
- **Expected daily traffic to this flow:** [number]
- **Minimum runtime:** [Calculated: required sample per variant ÷ expected daily traffic per variant, in days]
- **Hard minimum:** 2 full business cycles (typically 2 weeks) regardless of sample size
- **Do not peek rule:** No decisions before the minimum runtime completes. No exceptions.
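The runtime math above can be sketched in a few lines. This is the standard two-proportion power calculation (the same family of formula the calculator linked above uses); the 4% baseline, 3% relative MDE, and 5,000 users/day are made-up numbers for illustration, not recommendations.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_rel, alpha=0.05, power=0.8):
    """Approximate sample size per variant for a two-proportion test.

    baseline: current conversion rate, e.g. 0.04 for 4%
    mde_rel:  minimum detectable *relative* lift, e.g. 0.03 for a 3% lift
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    n = ((z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return int(n) + 1

# Hypothetical numbers: 4% baseline, 3% relative MDE, 5,000 users/day
n = sample_size_per_variant(0.04, 0.03)  # hundreds of thousands per variant
runtime_days = n * 2 / 5000              # both variants share the daily traffic
```

At a 4% baseline, a 3% relative MDE needs a sample so large that this flow would take well over 150 days, which is exactly the template's cue to widen the audience, pick a higher-traffic surface, or accept a larger MDE.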

---

## Success threshold

We ship if:
- Primary metric improves by at least [X%] with 95% statistical confidence (p < 0.05)
- No secondary metric degrades beyond acceptable range above

We iterate if:
- Directionally positive (>0% lift) but below confidence threshold — extend runtime or redesign
- Primary metric wins but a guardrail metric degrades — fix the guardrail, retest

We kill if:
- Primary metric is flat or negative at full sample
- Guardrail metric degrades significantly with no clear fix
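The p-value behind the ship/iterate/kill decision is typically a two-sided two-proportion z-test. A minimal stdlib sketch, with illustrative counts in the usage line (not data from any real experiment):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_*: number of conversions; n_*: users per variant.
    Returns (relative_lift, p_value). Assumes samples are large enough
    for the normal approximation to hold.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (p_b - p_a) / p_a
    return lift, p_value

# Illustrative: 4.0% -> 4.6% conversion on 10,000 users per variant
lift, p = two_proportion_z_test(400, 10_000, 460, 10_000)
```

Ship only when the p-value clears the threshold you committed to above and the lift clears your MDE; a big lift with a large p-value is an "iterate", not a win.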

---

## Risks and mitigation

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| [e.g. Novelty effect inflates early results] | [High/Med/Low] | [Run for 4+ weeks, compare week 1 vs week 4 cohorts] |
| [e.g. Segment too small for significance] | | [Widen audience or accept longer runtime] |
| [e.g. Engineering cannot feature-flag cleanly] | | [Scope change to flaggable surface only] |

---

## Results (fill after experiment completes)

- **Runtime:** [Start date] to [End date]
- **Sample size achieved:** [Control: N / Variant: N]
- **Primary metric result:** [Control: X% / Variant: Y% / Lift: Z% / p-value: ___ ]
- **Secondary metric results:** [List each]
- **Decision:** Ship / Iterate / Kill
- **Reasoning:** [Why — reference the thresholds you set above]
- **Follow-up:** [Next experiment, if any]

Common traps

Peeking. Checking results daily and stopping the moment you see significance is not experimentation — it is confirmation bias with a dashboard. Statistical significance fluctuates early. Set the runtime. Wait for it to finish. If your tool shows a “winner” on day 3, ignore it.
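The inflation from daily peeking is easy to demonstrate with an A/A simulation, where both groups are identical so every "winner" is a false positive. All traffic numbers here are made up for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)

def aa_p_value(succ_a, succ_b, n):
    """Two-sided p-value for equal-sized groups (normal approximation)."""
    pooled = (succ_a + succ_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = abs(succ_b - succ_a) / n / se
    return 2 * (1 - NormalDist().cdf(z))

# A/A test: both groups convert at 5%, so there is nothing to detect.
TRIALS, DAYS, USERS_PER_DAY, RATE = 300, 14, 200, 0.05
peek_false_positives = final_false_positives = 0
for _ in range(TRIALS):
    a = b = n = 0
    significant_on_some_day = False
    for _ in range(DAYS):
        a += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        b += sum(random.random() < RATE for _ in range(USERS_PER_DAY))
        n += USERS_PER_DAY
        if aa_p_value(a, b, n) < 0.05:
            significant_on_some_day = True  # a peeker would stop and "ship" here
    peek_false_positives += significant_on_some_day
    final_false_positives += aa_p_value(a, b, n) < 0.05
```

Reading only the final-day result keeps the false-positive rate near the 5% you paid for; stopping on the first significant daily reading pushes it several times higher, with no real effect anywhere in the data.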

Multiple comparisons. Testing five metrics and declaring victory on the one that moved is the equivalent of rolling a die five times and celebrating the six. One primary metric. Define it upfront. Everything else is a guardrail or an observation, not a success criterion.
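The die-rolling analogy is just the standard multiple-comparisons arithmetic: at alpha = 0.05, the chance of at least one metric crossing the significance threshold by luck alone grows quickly with the number of independent metrics you check.

```python
# Probability of at least one false positive when checking k independent
# metrics at alpha = 0.05, when the change truly does nothing:
alpha = 0.05
for k in (1, 3, 5, 10):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} metrics: {p_any_false_positive:.1%}")
# ->  1 metrics: 5.0%
#     3 metrics: 14.3%
#     5 metrics: 22.6%
#    10 metrics: 40.1%
```

With five metrics you "win" nearly a quarter of the time on noise alone, which is why the template forces a single primary metric committed upfront.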

Novelty effects. Users click on new things because they are new. If your experiment runs for five days and shows a 40% lift, that number will decay. Run for at least two full business cycles. Compare the first-week cohort to the second-week cohort. If the lift shrinks, the novelty is doing the work, not the design.

Underpowered tests. Running an experiment on 200 users and declaring “no significant difference” means you did not have enough data to detect a real effect — not that there was no effect. Calculate sample size before you start. If you cannot reach the required sample in a reasonable timeframe, the experiment is not viable for this traffic level. Find a higher-traffic surface or accept a larger MDE.


Related reading

- Experimentation — the full framework: when to experiment, when not to, and how to build an experimentation culture
- Activation Optimization — where most early-stage experiments should focus
- Retention Loops — experimenting on retention without confusing engagement with value
- Data-Informed Decision Making — when to trust the data, when to override it, and what “statistical significance” actually means for product decisions