a/b testing & experimentation
An A/B test does not tell you what is best. It tells you which of two options is less bad. If both options are wrong, the test will cheerfully declare a winner anyway.
A PM at a food delivery company in Bangalore runs an A/B test on two checkout button colors. Green versus orange. The test runs for a week. Orange wins with 95% confidence. The PM presents this in the sprint review as a data-driven decision. Everyone nods.
Nobody asks: why were users dropping off at checkout in the first place? The answer was a confusing address form that required re-entry on every order. No button color was going to fix that.
This is the most dangerous thing about A/B testing. It gives you the feeling of rigor while letting you avoid the hard question. The hard question is never “which variant wins?” It is “are we testing the right thing?”
When to experiment and when not to
Most product decisions do not need an A/B test. This is the first thing to internalize.
Use A/B testing when:
- You have a clear hypothesis about user behavior that can be validated by a measurable metric change
- Both variants are acceptable outcomes — you are optimizing, not deciding direction
- You have enough traffic to reach statistical significance in a reasonable timeframe (more on this below)
- The cost of being wrong is low enough that you can afford to show a worse experience to half your users for two weeks
Do not A/B test when:
- You are making a strategic product decision (should we build payments or not?)
- You have fewer than a few thousand users hitting the relevant flow per week
- The change is a bug fix or a compliance requirement — just ship it
- You already have strong qualitative signal from user interviews or support tickets
- The variants are so different that the test is really two product strategies pretending to be a UI comparison
A painted door test — showing a feature that does not exist yet and measuring interest — is often more valuable than a full A/B test at the discovery stage. You learn if users want the thing before you build the thing. I have seen PMs in Pragmatic Leaders cohorts use painted doors to kill ideas in two days that would have taken two sprints to build and test properly.
Product review at a B2B SaaS company in Pune. The growth PM is presenting experimentation results.
Growth PM: “We tested two onboarding flows. Variant B had a 12% higher activation rate. I recommend we roll out B to all users.”
VP Product: “How long did the test run?”
Growth PM: “Five days. We had about 800 users in each bucket.”
VP Product: “What was your minimum detectable effect when you set up the test?”
Growth PM: “We didn't calculate that upfront. But 12% is a big difference, right?”
Data Analyst: “I ran the numbers. With 800 per variant and a baseline activation of 34%, you would need a minimum of 2,100 users per variant to detect a 12% relative lift at 95% confidence. This result is not statistically significant.”
The team decided to extend the test for three more weeks. The final result: a 3% lift, within the margin of error. They shipped it anyway because they had already told the CEO about the 12%.
Running a test without calculating sample size upfront is like measuring a room without a tape measure. You will get a number. It will be wrong.
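The analyst's number can be checked with the standard two-proportion sample size formula. This is a minimal sketch using hardcoded z-score approximations for 95% confidence and 80% power (an online calculator like Evan Miller's gives essentially the same answer):

```python
import math

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Users needed per variant, normal approximation.

    z_alpha ~ two-sided 5% significance, z_beta ~ 80% power.
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

baseline = 0.34
target = baseline * 1.12  # the 12% relative lift from the review
print(sample_size_per_variant(baseline, target))  # ~2,200 per variant
```

With 800 users per bucket against a requirement of roughly 2,100 to 2,200, the test was underpowered before it started.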
The hypothesis comes first, not the tool
Every experiment starts with a hypothesis. Not a guess. Not a hunch. A falsifiable statement that connects an action to a measurable outcome.
Bad hypothesis: “We think the new design is better.” Better is not measurable. Better for whom? By what metric? Over what timeframe?
Good hypothesis: “Adding a progress bar to the KYC flow will increase completion rate from 62% to 70% within 14 days of launch, because users currently drop off at step 3 without knowing how many steps remain.”
A good hypothesis has four parts:
- The change — what you are doing (adding a progress bar)
- The metric — what you are measuring (KYC completion rate)
- The expected magnitude — how much movement you expect (62% to 70%)
- The reasoning — why you believe this will work (users drop off at step 3, session recordings show confusion about remaining steps)
The reasoning is the most important part and the part most PMs skip. Without it, you are testing randomly. With it, you learn something regardless of whether the test wins or loses. If the progress bar does not improve completion, your mental model about why users drop off was wrong — and that is valuable information.
Statistical significance without a PhD
You do not need to be a statistician. You need to understand four concepts well enough to not get fooled by your own data.
Sample size. Before running a test, calculate how many users you need in each variant. This depends on your baseline conversion rate, the minimum effect you want to detect, and your desired confidence level. Use an online calculator — Evan Miller’s is the standard. If your calculator says you need 5,000 per variant and you have 500 users per week hitting the flow, a 50/50 split puts 250 users per week into each bucket. The variants run in parallel, so reaching 5,000 in each takes 20 weeks. That is not a two-week sprint experiment. That is a commitment of well over a quarter. Know this before you start.
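The traffic arithmetic above is worth making mechanical. A tiny sketch (the 5,000 and 500 figures are the illustrative numbers from this section, not recommendations):

```python
def weeks_to_significance(needed_per_variant, weekly_eligible, n_variants=2):
    """Variants run in parallel, so weekly traffic is split between them."""
    return needed_per_variant / (weekly_eligible / n_variants)

# 5,000 needed per variant, 500 eligible users per week, 50/50 split
print(weeks_to_significance(5000, 500))  # -> 20.0 weeks
```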
P-value. The probability of seeing a difference at least as large as the one you observed if there were actually no difference between the variants. A p-value of 0.05 means that if the variants performed identically, random noise alone would produce a gap this big about 5% of the time. Most experimentation tools calculate this for you — VWO, Google Optimize (now sunset), Mixpanel, PostHog, Statsig. You do not compute this by hand. But you need to understand that p < 0.05 does not mean “this is definitely true.” It means “this is unlikely to be noise.”
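Your tool does this for you, but seeing the calculation once demystifies it. A minimal two-proportion z-test, with conversion counts that are purely hypothetical:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # convert |z| to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# hypothetical: 34% vs 38% conversion with 2,200 users per variant
p = two_proportion_p_value(748, 2200, 836, 2200)
print(f"p = {p:.4f}")  # small p means "unlikely to be noise", nothing more
```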
Confidence interval. The range within which the true effect probably falls. If your test shows a 5% lift with a 95% confidence interval of -1% to +11%, your result includes zero. That means the true effect might be negative. Do not ship based on point estimates alone. Look at the interval.
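A quick sketch of the interval check, using a simple Wald interval and hypothetical counts chosen to produce a range like the one described above:

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% Wald confidence interval for the lift (p_b - p_a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# hypothetical: 30% vs 35% conversion with only 500 users per variant
lo, hi = lift_ci(150, 500, 175, 500)
print(f"observed lift +5.0%, 95% CI: {lo:+.1%} to {hi:+.1%}")
# the interval includes zero -> the true effect could be negative
```

The point estimate says "ship it"; the interval says "you do not actually know the sign of the effect yet."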
Minimum detectable effect (MDE). The smallest difference you care about detecting. If a 1% improvement in conversion is not worth the engineering effort to ship, set your MDE at 3% or 5%. This directly determines your sample size requirement. Sample size scales roughly with 1/MDE², so halving the MDE roughly quadruples the users you need.
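The scaling is easy to see by tabulating a few MDEs against the same sample size formula used earlier (the 30% baseline is illustrative):

```python
import math

def n_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size at 95% confidence / 80% power, normal approximation."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

baseline = 0.30  # illustrative baseline conversion rate
for rel_mde in (0.10, 0.05, 0.025):  # each step halves the relative MDE
    n = n_per_variant(baseline, baseline * (1 + rel_mde))
    print(f"relative MDE {rel_mde:.1%}: {n:>7,} users per variant")
```

Each halving of the MDE roughly quadruples the requirement, which is why "let's detect tiny effects" quietly turns a two-week test into a two-quarter one.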
Here is the practical rule: if you have fewer than 10,000 monthly active users hitting the flow you want to test, you probably cannot run meaningful A/B tests on that flow. Use qualitative methods instead — interviews, session recordings, surveys. They give you faster signal with smaller numbers.
The experimentation process in practice
Here is how I teach experimentation to PMs who have never run a test before. It is six steps, and most of the work happens before you touch any tool.
Step 1: Identify the problem. Start with your funnel or your core metric. Where is the drop-off? Where is the friction? Use analytics, session recordings, and support tickets. Do not start with “I want to A/B test something.”
Step 2: Form a hypothesis. Use the four-part structure above. Write it down. Share it with your team. If the engineer or designer immediately says “that is not why users drop off,” you have saved yourself two weeks.
Step 3: Calculate feasibility. Run the sample size calculation. If you cannot reach significance in a reasonable timeframe (two to four weeks for most consumer products, four to eight weeks for B2B), either increase your MDE or consider a different method — a pre/post analysis, a cohort comparison, or qualitative research.
Step 4: Design the variants. Change one thing. Not three. Not five. One. If you change the copy, the layout, and the button color simultaneously, you will not know which change caused the effect. Multivariate testing exists but requires dramatically more traffic.
Step 5: Run and monitor. Do not peek at results daily and stop the test early when it looks good. This is called the peeking problem and it inflates your false positive rate. Set a duration upfront based on your sample size calculation and commit to it. Most tools now offer sequential testing that accounts for peeking, but the discipline of pre-commitment is still valuable.
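The peeking problem is easy to demonstrate with a simulation. This sketch runs an A/A test (both variants identical, so every "winner" is a false positive) and compares stopping at the first daily p < 0.05 against one pre-committed look; all traffic numbers are made up for illustration:

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(7)
TRUE_RATE, DAILY, DAYS, SIMS = 0.30, 200, 10, 500  # A/A: no real difference

def run_once(peek_daily):
    ca = na = cb = nb = 0
    for _ in range(DAYS):
        ca += sum(random.random() < TRUE_RATE for _ in range(DAILY))
        cb += sum(random.random() < TRUE_RATE for _ in range(DAILY))
        na += DAILY
        nb += DAILY
        if peek_daily and p_value(ca, na, cb, nb) < 0.05:
            return True  # stopped early and declared a "winner"
    return p_value(ca, na, cb, nb) < 0.05  # one pre-committed final look

peeking = sum(run_once(True) for _ in range(SIMS)) / SIMS
one_look = sum(run_once(False) for _ in range(SIMS)) / SIMS
print(f"false positive rate, peeking daily:  {peeking:.1%}")
print(f"false positive rate, one final look: {one_look:.1%}")
```

With one look, false positives land near the nominal 5%. With ten daily looks, the rate inflates several-fold — the test "finds" differences that do not exist.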
Step 6: Decide and document. If the result is significant, ship the winner. If it is not significant, document what you learned and move on. Do not run the test again with a slightly different variant hoping for a different result — that is p-hacking, and it will eventually blow up in your face.
The most underrated step is documentation. Three months from now, someone on your team will want to test the same thing. If you documented your hypothesis, sample size, results, and learnings, they will build on your work instead of repeating it.
B2B experimentation: the hard mode
If you work in enterprise SaaS or B2B, everything above gets harder. Your user base is smaller. Your sales cycles are longer. A single large account can skew your metrics.
Approaches that work in B2B:
Early access programs. Instead of random assignment, invite specific accounts to try a new feature. This is not a clean A/B test — it has selection bias — but it gives you real usage data from real accounts. DocuSign, Freshworks, and Zoho all use this pattern.
Cohort analysis. Compare metrics for accounts that adopted a feature versus those that did not. Control for account size and tenure. This is observational, not experimental, so the causal inference is weaker. But with 200 accounts, it is often the best you can do.
Qualitative-first. Run five to ten deep interviews with users. Watch them use the product. Identify the friction. Fix it. Measure the before/after. This is not statistically rigorous but it is far better than running an underpowered A/B test that tells you nothing.
The biggest mistake B2B PMs make is importing B2C experimentation practices without adjusting for their context. You do not have Swiggy’s traffic. Stop pretending you do.
Common traps
The HIPPO override. You run a clean test. The Highest Paid Person’s Opinion disagrees with the result. The test gets ignored. This is an organizational problem, not a statistical one. The fix is to get alignment on the test criteria before you run it, not after.
Testing when you should be deciding. Some product decisions are judgment calls. Whether to enter a new market, whether to sunset a feature, whether to change your pricing model — these are strategic decisions that require conviction, not A/B tests. Testing them is a way of avoiding the responsibility of deciding.
The local maximum trap. A/B tests optimize within the current design space. They will tell you the best shade of blue for your button. They will not tell you to remove the button entirely and use a different interaction pattern. For that, you need to step back and rethink the flow from first principles.
Survivorship bias in results. You are only testing users who reached the point of the experiment. If 60% of users drop off before they see your test, your results only apply to the 40% who stayed. The bigger opportunity might be in the 60% you never reached.
Pick one metric in your product that has been flat or declining for the past month. Then:
- Write the hypothesis using the four-part structure: the change, the metric, the expected magnitude, and the reasoning. If you cannot articulate the reasoning, stop — you need more discovery before you experiment.
- Calculate sample size using an online calculator (Evan Miller or similar). Plug in your baseline rate, your minimum detectable effect, and 95% confidence / 80% power. How many weeks will the test take at your current traffic?
- Decide: test or not? If the test takes longer than four weeks, write down two alternative approaches (qualitative research, cohort analysis, painted door test) that could give you directional signal faster.
- Document the decision. Whether you run the test or not, write a one-paragraph memo explaining why. Share it with your team. This builds the habit of treating experimentation as a deliberate choice, not a default.
If you do not have a product to work with, use this scenario: a food delivery app where the reorder rate (users placing a second order within 7 days of their first) has been stuck at 28% for three months. Your target is 35%.
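As a worked check on that scenario, the same sample size formula applies; the 400-orderers-per-week traffic figure below is an assumption for illustration, not part of the scenario:

```python
import math

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """95% confidence / 80% power, normal approximation."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variant(0.28, 0.35)  # 28% baseline, 35% target
print(f"{n} users per variant")
# with an assumed 400 eligible first-time orderers per week, split 50/50:
print(f"about {n / 200:.1f} weeks to reach significance")
```

A 28% to 35% jump is a large effect, which is why the required sample is modest; ask yourself whether a single change is really likely to move the metric that far.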
Test yourself
You are a PM at an edtech startup in Hyderabad. Your team ran a three-week A/B test on the course recommendation algorithm. Variant B — which uses collaborative filtering instead of rule-based recommendations — shows a 9% improvement in course enrollment. The p-value is 0.03, well within significance. Your VP of Engineering is excited and wants to deploy Variant B to all users immediately. But you notice something in the segment breakdown: the improvement is entirely driven by users in metros (Bangalore, Mumbai, Delhi). For tier-2 and tier-3 city users, Variant B actually decreased enrollment by 4%. Your tier-2/3 users make up 45% of the user base and are the company's growth focus for this year.
The VP of Engineering is waiting for your go/no-go decision. The CEO has a board meeting next week and wants to present the 9% improvement.
Where to go from here
- Metrics & KPIs — You cannot experiment without knowing what to measure. Start here if your metric definitions are fuzzy.
- Activation Optimization — Where experiments have the highest ROI: onboarding, aha moments, and time-to-value.
- Retention Loops — Experiments on engagement loops and habit formation — where most growth teams should focus.
- User Research Methods — When your traffic is too low for A/B testing, qualitative methods are your best alternative.
- Prioritization — Deciding what to experiment on is itself a prioritization decision. Do not test low-impact changes.
- Writing PRDs — Your experiment design belongs in the PRD. The hypothesis, sample size, and success criteria should be documented before engineering starts.
PhonePe's PM wants to A/B test a new onboarding flow. The data scientist says they need 4 weeks to reach statistical significance at 95% confidence. The product head wants results in 2 weeks for a board deck. Both are citing real constraints.
The call: Do you run the 2-week test anyway, ship with inconclusive results, or wait 4 weeks?