Measuring outcomes
If you cannot quantify it, you are not doing a good job. Without data, it is all a fairy tale — you are just telling a story that this awesome feature will produce these outcomes.
Most PMs treat launch day as the finish line. The feature ships, the team celebrates, someone posts on Slack, and everyone moves on to the next thing on the roadmap.
Then three months later, someone asks “did that feature actually work?” and nobody has an answer. The dashboard was never set up. The success metric was never defined. The feature is live, consuming engineering maintenance cycles, and nobody knows if a single user benefited from it.
This is the most common failure mode in product execution. Not building the wrong thing — building the right thing and never learning whether it worked.
The measurement gap
Here is what typically happens after a feature launches at most Indian startups I have seen:
Two weeks after launching a new referral program. Monthly business review.
VP Product: “How is the referral feature performing?”
PM: “We have had 2,400 referral links generated since launch.”
VP Product: “Great. How many converted to paying users?”
PM: “I... need to check with analytics on that.”
VP Product: “What was the target we set before launch?”
Long pause. There was no target.
PM: “We were focused on getting it out before the quarter ended. We planned to set up tracking after launch.”
VP Product: “So we shipped a referral program two weeks ago and we do not know if it is working.”
Activity metrics (links generated) are not outcome metrics (paying users acquired). The PM confused output with impact.
The problem is not laziness. The problem is that most teams define success implicitly — in the PM’s head, never written down, never agreed upon — and then measure whatever is easiest to pull from the database after the fact.
The Measurement Contract: define success before you build
Every feature should have a Measurement Contract before engineering writes a single line of code. Not after launch. Not “when we have time to set up the dashboard.” Before.
| Part | What it is | Example |
|---|---|---|
| Primary metric | One number that tells you if it worked | D7 retention |
| Target | Specific threshold | Increase from 32% to 38% within 60 days |
| Baseline | Current value, measured before launch | 32% (measured week before ship) |
| Decision trigger | What you will do based on the result | Below 34% after 30 days → kill. 34-37% → iterate. 38%+ → scale |
If you cannot fill in all four, you do not understand the feature well enough to build it. The four parts in detail:
1. Primary metric. One number that tells you whether the feature achieved its purpose. Not three numbers. Not a dashboard with twelve charts. One metric.
2. Target. A specific, quantified threshold. “Improve retention” is not a target. “Increase D7 retention from 32% to 38% within 60 days of launch” is a target.
3. Baseline. The current value of that metric, measured before you ship. Without a baseline, you cannot calculate impact. Measure it the week before launch, not the day after.
4. Decision trigger. What you will do based on the result. If the metric hits the target, what happens? (Scale it, invest more.) If it misses by a little? (Iterate.) If it misses badly? (Kill it.) Write these decisions down before you have the data, when you can think clearly without sunk-cost bias.
This is not bureaucracy. It takes thirty minutes. And it saves you the three months of ambiguity where a half-working feature sits in production because nobody agreed on what “working” means.
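One way to make the contract unambiguous is to write the decision triggers down as plain logic before launch. A minimal sketch, using the D7-retention example above (the thresholds are from that example; substitute your own):

```python
def decide(d7_retention: float) -> str:
    """Map a measured D7 retention value to the pre-agreed decision.

    Thresholds follow the example contract: baseline 32%, target 38%,
    kill trigger at 34% after 30 days. Tune them to your own feature.
    """
    if d7_retention >= 0.38:
        return "scale"    # hit the target: invest more
    if d7_retention >= 0.34:
        return "iterate"  # partial lift: fix and re-measure
    return "kill"         # below the kill trigger: shut it down

print(decide(0.35))  # → iterate
```

Writing the thresholds as code before launch removes the temptation to reinterpret them after the data arrives, which is exactly where sunk-cost bias creeps in.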
The Outcome Ladder: three layers of post-launch measurement
Most PMs stop at adoption and call it success. The Outcome Ladder forces you to climb all three layers — because a feature with high adoption and zero business impact is a vanity ship.
| Layer | Question | What it measures | If it fails here… |
|---|---|---|---|
| 1. Adoption | Did anyone use it? | Discovery, activation, first use | Fix distribution, not the feature |
| 2. Effectiveness | Did it solve the problem? | Task completion, error rate, support tickets | Fix the feature design |
| 3. Impact | Did it move the business? | Revenue, retention, cost | Re-evaluate whether the problem matters |
Layer 1: Did anyone use it?
This is adoption. Sounds obvious, but I have seen features launched behind three clicks in a navigation menu that nobody ever discovered. Before you measure whether a feature is effective, measure whether anyone found it.
- Feature discovery rate: What percentage of eligible users encountered the feature?
- Activation rate: Of those who encountered it, what percentage completed the core action?
- Time to first use: How long after the feature went live did users start engaging?
If discovery is below 20%, your feature does not have a quality problem. It has a distribution problem. Fix the entry point before you touch the feature itself.
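The three adoption metrics above fall out of a simple funnel over your event log. A sketch, assuming hypothetical event names ("feature_viewed", "core_action_completed"); substitute whatever your instrumentation actually emits:

```python
from datetime import date

# Stand-in for your raw event log: (user_id, event_name, date).
events = [
    ("u1", "feature_viewed", date(2024, 3, 1)),
    ("u1", "core_action_completed", date(2024, 3, 1)),
    ("u2", "feature_viewed", date(2024, 3, 3)),
    ("u3", "feature_viewed", date(2024, 3, 5)),
    ("u3", "core_action_completed", date(2024, 3, 6)),
]
eligible_users = {"u1", "u2", "u3", "u4", "u5"}

viewed = {u for u, name, _ in events if name == "feature_viewed"}
activated = {u for u, name, _ in events if name == "core_action_completed"}

# Discovery: eligible users who encountered the feature at all.
discovery_rate = len(viewed & eligible_users) / len(eligible_users)
# Activation: of those who encountered it, who completed the core action.
activation_rate = len(activated & viewed) / len(viewed)

print(f"discovery: {discovery_rate:.0%}, activation: {activation_rate:.0%}")
# 3 of 5 eligible users found it; 2 of the 3 who found it activated.
```

The point of the split is diagnostic: low discovery with high activation is an entry-point problem, not a feature problem.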
Layer 2: Did it solve the problem?
This is effectiveness. The user found the feature and used it. Did it actually do what you intended?
- Task completion rate: Did users finish what they started?
- Error rate: How often did users fail or need to retry?
- Support ticket volume: Did tickets related to this problem go down after launch?
One method I have found reliable: compare users who adopted the feature with a matched cohort who did not. If the adopters show better retention, lower churn, or higher transaction frequency, the feature is working. If the cohorts look the same, the feature is noise.
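The cohort comparison can be sketched in a few lines. This is only the final arithmetic: in practice you would match the cohorts on signup date, plan, and prior activity level to control for self-selection, since adopters are usually your most engaged users to begin with. The user IDs and retention flags here are illustrative:

```python
# user_id -> retained at D30? (illustrative data)
adopters = {"u1": True, "u2": True, "u3": False}
non_adopters = {"u4": False, "u5": True, "u6": False}

def retention(cohort: dict) -> float:
    """Fraction of the cohort still retained."""
    return sum(cohort.values()) / len(cohort)

lift = retention(adopters) - retention(non_adopters)
print(f"adopters: {retention(adopters):.0%}, "
      f"non-adopters: {retention(non_adopters):.0%}, lift: {lift:+.0%}")
```

If the lift is indistinguishable from zero after matching, the feature is noise, however healthy its usage counts look.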
Layer 3: Did it move the business?
This is impact. The feature works for users. Does it matter for the business?
- Revenue impact: Did the feature affect conversion, average order value, or lifetime value?
- Retention impact: Are users who engage with this feature more likely to come back?
- Cost impact: Did this feature reduce support load, operational cost, or acquisition cost?
Most PMs stop at Layer 1 — counting users. Senior PMs get to Layer 3 — connecting feature usage to business outcomes. That connection is what makes your work legible to leadership.
The HEART framework in practice
Google’s HEART framework gives you a systematic way to cover your measurement bases. I have adapted it for how Indian product teams actually operate:
| Dimension | What it measures | Example metric (fintech app) |
|---|---|---|
| Happiness | User satisfaction | CSAT score for UPI payment flow |
| Engagement | Depth of usage | Transactions per active user per week |
| Adoption | New usage | % of MAU who used the new “split bill” feature |
| Retention | Continued usage | D30 retention for users who completed KYC |
| Task success | Efficiency | % of UPI payments completed on first attempt |
You do not need all five for every feature. Pick two or three that match the feature’s purpose. A new onboarding flow cares about Adoption and Task Success. A social feature cares about Engagement and Retention. A checkout redesign cares about Task Success and Happiness.
The trap is measuring all five poorly. Better to measure two well — with baselines, targets, and clean instrumentation — than to have five dashboards full of numbers nobody acts on.
When to iterate vs. when to kill
This is the decision most PMs avoid. The feature launched. It did not hit the target. Now what?
Here is a decision framework that has worked for me:
Iterate when:
- The metric missed the target but the direction is right (moving up, not flat)
- You can identify a specific, fixable cause (discovery, onboarding, a single broken step)
- The fix is small relative to the original investment (days, not months)
- Users who do engage show strong signals (retention, repeat usage, organic sharing)
Kill when:
- The metric is flat or declining after two iteration cycles
- Users who engage show no difference from non-users on downstream metrics
- The only argument for keeping it is sunk cost (“we already built it”)
- Maintaining it creates ongoing engineering burden with no measurable return
Escalate when:
- The data is ambiguous and the stakes are high
- Killing the feature has political consequences (an executive’s pet project)
- The feature serves a strategic purpose that metrics do not capture (market positioning, regulatory compliance)
The hardest part is not making the decision. It is making it quickly. Every week a dying feature stays in production, it consumes maintenance cycles, creates edge cases for other features, and sends the team a signal that shipping matters more than outcomes.
Instrumenting before launch
Measurement does not happen by magic. Someone has to add the event tracking, build the dashboard, and verify the data is clean. This is engineering work, and it needs to be scoped inside the feature work — not as a follow-up ticket that never gets prioritized.
A practical checklist:
- Define events during PRD review. List every user action you need to track. Get engineering agreement that these events are part of the build, not a post-launch task.
- Validate instrumentation in staging. Fire every event manually. Confirm it shows up in your analytics tool with the right properties. I have seen teams launch with broken tracking because nobody tested the events before production.
- Set up the dashboard before launch day. Not after. Before. When the feature goes live, you should be able to open a dashboard and see real-time data within hours.
- Baseline the metric one week before launch. Take a snapshot. Store it somewhere permanent — not a Slack message that will scroll away. Put it in the PRD, the feature ticket, or a shared doc.
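The staging validation step can itself be automated. A sketch of the idea, using a hypothetical `track`/`captured` interface as a stand-in for your analytics SDK's debug or test mode: fire each required event, then assert it landed with the expected properties.

```python
# Required events and the properties each must carry (example names
# from the referral-program scenario; these are assumptions, not a
# real SDK contract).
REQUIRED_EVENTS = {
    "referral_link_generated": {"user_id", "channel"},
    "referral_signup_completed": {"user_id", "referrer_id"},
}

captured: list[tuple[str, dict]] = []

def track(name: str, props: dict) -> None:
    captured.append((name, props))  # stand-in for the real SDK call

# Fire every event manually, as you would in a staging run-through.
track("referral_link_generated", {"user_id": "u1", "channel": "whatsapp"})
track("referral_signup_completed", {"user_id": "u2", "referrer_id": "u1"})

for name, required_props in REQUIRED_EVENTS.items():
    matches = [props for n, props in captured if n == name]
    assert matches, f"event never fired: {name}"
    missing = required_props - set(matches[0])
    assert not missing, f"{name} missing properties: {missing}"
print("all required events fired with expected properties")
```

A check like this takes an hour to write and catches the "launched with broken tracking" failure before production ever sees it.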
If your organization treats measurement as optional post-launch work, you will never measure anything well. Measurement is part of shipping. A feature without instrumentation is not shipped — it is abandoned in production.
Pick a feature you recently shipped (or are about to ship). Write its measurement contract:
- Primary metric: What single number tells you if this worked?
- Baseline: What is the current value? (If you do not know, that is your first problem to solve.)
- Target: What specific number do you need to hit, by when?
- Decision triggers:
- If metric exceeds target by Day 30: ____
- If metric is 50-100% of target by Day 30: ____
- If metric is below 50% of target by Day 14: ____
If you cannot fill in the decision triggers, you do not yet have a measurement plan. You have a dashboard wish.
The iteration loop
When you decide to iterate — not kill, not ship-and-forget, but deliberately improve — you need a structured loop. Otherwise “iterate” becomes “tinker aimlessly for three sprints.”
Step 1: Diagnose. Why did the metric miss? Use your three layers. Is it an adoption problem (nobody found it), an effectiveness problem (they found it but it did not work), or an impact problem (it works but does not move the business)?
Step 2: Hypothesize. Form a single, testable hypothesis. “Moving the entry point from the settings menu to the home feed will increase feature discovery from 11% to 25%.” Not three hypotheses. One. You need clean signal.
Step 3: Scope. The iteration should be smaller than the original build. If your iteration is the same size as the original feature, you are not iterating — you are rebuilding. And rebuilding is a different decision with a different cost-benefit calculation.
Step 4: Timebox. Set a measurement window before you start. Two weeks. Four weeks. Whatever is appropriate for your usage frequency. When the window closes, you make the next decision: iterate again, scale, or kill.
Step 5: Decide. This is the step everyone skips. The timebox ends. The data comes in. And someone has to make a call. Not “let us keep watching it.” A call. Continue, change direction, or stop.
Common traps
Vanity metrics. Page views, app installs, registered users. These numbers go up and to the right and tell you nothing about whether your product is working. If your CEO asks for a dashboard, give them one with engagement and retention metrics, not raw counts.
Metric fixation. Once you pick a metric, you will optimize for it — sometimes at the expense of things you did not measure. A team optimizing for daily active users might build addictive notification patterns that damage long-term retention. Always pair your primary metric with a guardrail metric that catches unintended harm.
Survivorship bias. You survey users who love the feature and conclude it is a success. But you never talk to users who tried it once and abandoned it. The people who left have the information you actually need.
The “more data” delay. “We need more data before we can decide.” Sometimes this is legitimate. More often it is decision avoidance disguised as rigor. If you have two weeks of data and the metric is at 15% of target, you do not need four more weeks to know the feature is struggling.
Test yourself
You are a PM at a logistics startup in Bengaluru. Three weeks ago, you launched a route optimization feature for delivery partners. The hypothesis was that it would reduce average delivery time by 15%. Actual result: delivery time dropped by only 4%. Your engineering team spent six weeks building it. The next planning cycle starts Monday.
Your manager asks for your recommendation in tomorrow's review. The data is clear: 4% improvement vs. 15% target. What do you propose?
Your path
You are PM at Flipkart working on the grocery delivery team (Flipkart Quick). You launched a 'scheduled delivery' feature three weeks ago that lets users book grocery slots up to 3 days in advance. Your dashboard shows: 40,000 scheduled deliveries booked (strong adoption), average slot fill rate 87% (looks healthy), cancellation rate 2.3% (within normal range). Your VP presents this at the monthly review as a clear success. But you have been looking at a different number: same-day reorder rate. Users who used scheduled delivery once are reordering same-day (non-scheduled) at 34% lower frequency than users who never used it. The scheduled delivery adoption metric is up. The engagement depth metric you care about is down.
The call: Do you raise the reorder rate drop in the VP review, or wait until you have more data? What does the divergence between the adoption metric and the reorder metric actually tell you?
Where to go next
- Set the right metrics from the start: Metrics and KPIs
- Write measurement into your specs: Writing PRDs
- Understand what users actually need: User Research Methods
- Build the strategic context for what to measure: Product Vision and Strategy