Measuring outcomes
If you cannot quantify it, you are not doing a good job. Without data, it is all a fairy tale — you are just telling a story that this awesome feature will produce these outcomes.
Most PMs treat launch day as the finish line. The feature ships, the team celebrates, someone posts on Slack, and everyone moves on to the next thing on the roadmap.
Then three months later, someone asks “did that feature actually work?” and nobody has an answer. The dashboard was never set up. The success metric was never defined. The feature is live, consuming engineering maintenance cycles, and nobody knows if a single user benefited from it.
This is the most common failure mode in product execution. Not building the wrong thing — building the right thing and never learning whether it worked.
The measurement gap
Here is what typically happens after a feature launches at most Indian startups I have seen:
Two weeks after launching a new referral program. Monthly business review.
VP Product: “How is the referral feature performing?”
PM: “We have had 2,400 referral links generated since launch.”
VP Product: “Great. How many converted to paying users?”
PM: “I... need to check with analytics on that.”
VP Product: “What was the target we set before launch?”
Long pause. There was no target.
PM: “We were focused on getting it out before the quarter ended. We planned to set up tracking after launch.”
VP Product: “So we shipped a referral program two weeks ago and we do not know if it is working.”
Activity metrics (links generated) are not outcome metrics (paying users acquired). The PM confused output with impact.
The problem is not laziness. The problem is that most teams define success implicitly — in the PM’s head, never written down, never agreed upon — and then measure whatever is easiest to pull from the database after the fact.
The Measurement Contract: define success before you build
Every feature should have a Measurement Contract before engineering writes a single line of code. Not after launch. Not “when we have time to set up the dashboard.” Before.
| Part | What it is | Example |
|---|---|---|
| Primary metric | One number that tells you if it worked | D7 retention |
| Target | Specific threshold | Increase from 32% to 38% within 60 days |
| Baseline | Current value, measured before launch | 32% (measured week before ship) |
| Decision trigger | What you will do based on the result | Below 34% after 30 days → kill. 34-37% → iterate. 38%+ → scale |
If you cannot fill in all four, you do not understand the feature well enough to build it. The four parts in detail:
1. Primary metric. One number that tells you whether the feature achieved its purpose. Not three numbers. Not a dashboard with twelve charts. One metric.
2. Target. A specific, quantified threshold. “Improve retention” is not a target. “Increase D7 retention from 32% to 38% within 60 days of launch” is a target.
3. Baseline. The current value of that metric, measured before you ship. Without a baseline, you cannot calculate impact. Measure it the week before launch, not the day after.
4. Decision trigger. What you will do based on the result. If the metric hits the target, what happens? (Scale it, invest more.) If it misses by a little? (Iterate.) If it misses badly? (Kill it.) Write these decisions down before you have the data, when you can think clearly without sunk-cost bias.
This is not bureaucracy. It takes thirty minutes. And it saves you the three months of ambiguity where a half-working feature sits in production because nobody agreed on what “working” means.
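One way to make the contract unambiguous is to write the decision triggers down as plain logic before launch. A minimal sketch, using the D7-retention example above (the thresholds are from that example; substitute your own):

```python
def decide(d7_retention: float) -> str:
    """Map a measured D7 retention value to the pre-agreed decision.

    Thresholds follow the example contract: baseline 32%, target 38%,
    kill trigger at 34% after 30 days. Tune them to your own feature.
    """
    if d7_retention >= 0.38:
        return "scale"    # hit the target: invest more
    if d7_retention >= 0.34:
        return "iterate"  # partial lift: fix and re-measure
    return "kill"         # below the kill trigger: shut it down

print(decide(0.35))  # → iterate
```

Writing the thresholds as code before launch removes the temptation to reinterpret them after the data arrives, which is exactly where sunk-cost bias creeps in.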
The Outcome Ladder: three layers of post-launch measurement
Most PMs stop at adoption and call it success. The Outcome Ladder forces you to climb all three layers — because a feature with high adoption and zero business impact is a vanity ship.
| Layer | Question | What it measures | If it fails here… |
|---|---|---|---|
| 1. Adoption | Did anyone use it? | Discovery, activation, first use | Fix distribution, not the feature |
| 2. Effectiveness | Did it solve the problem? | Task completion, error rate, support tickets | Fix the feature design |
| 3. Impact | Did it move the business? | Revenue, retention, cost | Re-evaluate whether the problem matters |
Layer 1: Did anyone use it?
This is adoption. Sounds obvious, but I have seen features launched behind three clicks in a navigation menu that nobody ever discovered. Before you measure whether a feature is effective, measure whether anyone found it.
- Feature discovery rate: What percentage of eligible users encountered the feature?
- Activation rate: Of those who encountered it, what percentage completed the core action?
- Time to first use: How long after the feature went live did users start engaging?
If discovery is below 20%, your feature does not have a quality problem. It has a distribution problem. Fix the entry point before you touch the feature itself.
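The three adoption metrics above fall out of a simple funnel over your event log. A sketch, assuming hypothetical event names ("feature_viewed", "core_action_completed"); substitute whatever your instrumentation actually emits:

```python
from datetime import date

# Stand-in for your raw event log: (user_id, event_name, date).
events = [
    ("u1", "feature_viewed", date(2024, 3, 1)),
    ("u1", "core_action_completed", date(2024, 3, 1)),
    ("u2", "feature_viewed", date(2024, 3, 3)),
    ("u3", "feature_viewed", date(2024, 3, 5)),
    ("u3", "core_action_completed", date(2024, 3, 6)),
]
eligible_users = {"u1", "u2", "u3", "u4", "u5"}

viewed = {u for u, name, _ in events if name == "feature_viewed"}
activated = {u for u, name, _ in events if name == "core_action_completed"}

# Discovery: eligible users who encountered the feature at all.
discovery_rate = len(viewed & eligible_users) / len(eligible_users)
# Activation: of those who encountered it, who completed the core action.
activation_rate = len(activated & viewed) / len(viewed)

print(f"discovery: {discovery_rate:.0%}, activation: {activation_rate:.0%}")
# 3 of 5 eligible users found it; 2 of the 3 who found it activated.
```

The point of the split is diagnostic: low discovery with high activation is an entry-point problem, not a feature problem.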
Layer 2: Did it solve the problem?
This is effectiveness. The user found the feature and used it. Did it actually do what you intended?
- Task completion rate: Did users finish what they started?
- Error rate: How often did users fail or need to retry?
- Support ticket volume: Did tickets related to this problem go down after launch?
One method I have found reliable: compare users who adopted the feature with a matched cohort who did not. If the adopters show better retention, lower churn, or higher transaction frequency, the feature is working. If the cohorts look the same, the feature is noise.
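The cohort comparison can be sketched in a few lines. This is only the final arithmetic: in practice you would match the cohorts on signup date, plan, and prior activity level to control for self-selection, since adopters are usually your most engaged users to begin with. The user IDs and retention flags here are illustrative:

```python
# user_id -> retained at D30? (illustrative data)
adopters = {"u1": True, "u2": True, "u3": False}
non_adopters = {"u4": False, "u5": True, "u6": False}

def retention(cohort: dict) -> float:
    """Fraction of the cohort still retained."""
    return sum(cohort.values()) / len(cohort)

lift = retention(adopters) - retention(non_adopters)
print(f"adopters: {retention(adopters):.0%}, "
      f"non-adopters: {retention(non_adopters):.0%}, lift: {lift:+.0%}")
```

If the lift is indistinguishable from zero after matching, the feature is noise, however healthy its usage counts look.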
Layer 3: Did it move the business?
This is impact. The feature works for users. Does it matter for the business?
- Revenue impact: Did the feature affect conversion, average order value, or lifetime value?
- Retention impact: Are users who engage with this feature more likely to come back?
- Cost impact: Did this feature reduce support load, operational cost, or acquisition cost?
Most PMs stop at Layer 1 — counting users. Senior PMs get to Layer 3 — connecting feature usage to business outcomes. That connection is what makes your work legible to leadership.
The HEART framework in practice
Google’s HEART framework gives you a systematic way to cover your measurement bases. I have adapted it for how Indian product teams actually operate:
| Dimension | What it measures | Example metric (fintech app) |
|---|---|---|
| Happiness | User satisfaction | CSAT score for UPI payment flow |
| Engagement | Depth of usage | Transactions per active user per week |
| Adoption | New usage | % of MAU who used the new “split bill” feature |
| Retention | Continued usage | D30 retention for users who completed KYC |
| Task success | Efficiency | % of UPI payments completed on first attempt |
You do not need all five for every feature. Pick two or three that match the feature’s purpose. A new onboarding flow cares about Adoption and Task Success. A social feature cares about Engagement and Retention. A checkout redesign cares about Task Success and Happiness.
The trap is measuring all five poorly. Better to measure two well — with baselines, targets, and clean instrumentation — than to have five dashboards full of numbers nobody acts on.
When to iterate vs. when to kill
This is the decision most PMs avoid. The feature launched. It did not hit the target. Now what?
Here is a decision framework that has worked for me:
Iterate when:
- The metric missed the target but the direction is right (moving up, not flat)
- You can identify a specific, fixable cause (discovery, onboarding, a single broken step)
- The fix is small relative to the original investment (days, not months)
- Users who do engage show strong signals (retention, repeat usage, organic sharing)
Kill when:
- The metric is flat or declining after two iteration cycles
- Users who engage show no difference from non-users on downstream metrics
- The only argument for keeping it is sunk cost (“we already built it”)
- Maintaining it creates ongoing engineering burden with no measurable return
Escalate when:
- The data is ambiguous and the stakes are high
- Killing the feature has political consequences (an executive’s pet project)
- The feature serves a strategic purpose that metrics do not capture (market positioning, regulatory compliance)
The hardest part is not making the decision. It is making it quickly. Every week a dying feature stays in production, it consumes maintenance cycles, creates edge cases for other features, and sends the team a signal that shipping matters more than outcomes.
Instrumenting before launch
Measurement does not happen by magic. Someone has to add the event tracking, build the dashboard, and verify the data is clean. This is engineering work, and it needs to be scoped inside the feature work — not as a follow-up ticket that never gets prioritized.
A practical checklist:
- Define events during PRD review. List every user action you need to track. Get engineering agreement that these events are part of the build, not a post-launch task.
- Validate instrumentation in staging. Fire every event manually. Confirm it shows up in your analytics tool with the right properties. I have seen teams launch with broken tracking because nobody tested the events before production.
- Set up the dashboard before launch day. Not after. Before. When the feature goes live, you should be able to open a dashboard and see real-time data within hours.
- Baseline the metric one week before launch. Take a snapshot. Store it somewhere permanent — not a Slack message that will scroll away. Put it in the PRD, the feature ticket, or a shared doc.
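The staging validation step can itself be automated. A sketch of the idea, using a hypothetical `track`/`captured` interface as a stand-in for your analytics SDK's debug or test mode: fire each required event, then assert it landed with the expected properties.

```python
# Required events and the properties each must carry (example names
# from the referral-program scenario; these are assumptions, not a
# real SDK contract).
REQUIRED_EVENTS = {
    "referral_link_generated": {"user_id", "channel"},
    "referral_signup_completed": {"user_id", "referrer_id"},
}

captured: list[tuple[str, dict]] = []

def track(name: str, props: dict) -> None:
    captured.append((name, props))  # stand-in for the real SDK call

# Fire every event manually, as you would in a staging run-through.
track("referral_link_generated", {"user_id": "u1", "channel": "whatsapp"})
track("referral_signup_completed", {"user_id": "u2", "referrer_id": "u1"})

for name, required_props in REQUIRED_EVENTS.items():
    matches = [props for n, props in captured if n == name]
    assert matches, f"event never fired: {name}"
    missing = required_props - set(matches[0])
    assert not missing, f"{name} missing properties: {missing}"
print("all required events fired with expected properties")
```

A check like this takes an hour to write and catches the "launched with broken tracking" failure before production ever sees it.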
If your organization treats measurement as optional post-launch work, you will never measure anything well. Measurement is part of shipping. A feature without instrumentation is not shipped — it is abandoned in production.
Pick a feature you recently shipped (or are about to ship). Write its measurement contract:
- Primary metric: What single number tells you if this worked?
- Baseline: What is the current value? (If you do not know, that is your first problem to solve.)
- Target: What specific number do you need to hit, by when?
- Decision triggers:
- If metric exceeds target by Day 30: ____
- If metric is 50-100% of target by Day 30: ____
- If metric is below 50% of target by Day 14: ____
If you cannot fill in the decision triggers, you do not yet have a measurement plan. You have a dashboard wish.
The iteration loop
When you decide to iterate — not kill, not ship-and-forget, but deliberately improve — you need a structured loop. Otherwise “iterate” becomes “tinker aimlessly for three sprints.”
Step 1: Diagnose. Why did the metric miss? Use your three layers. Is it an adoption problem (nobody found it), an effectiveness problem (they found it but it did not work), or an impact problem (it works but does not move the business)?
Step 2: Hypothesize. Form a single, testable hypothesis. “Moving the entry point from the settings menu to the home feed will increase feature discovery from 11% to 25%.” Not three hypotheses. One. You need clean signal.
Step 3: Scope. The iteration should be smaller than the original build. If your iteration is the same size as the original feature, you are not iterating — you are rebuilding. And rebuilding is a different decision with a different cost-benefit calculation.
Step 4: Timebox. Set a measurement window before you start. Two weeks. Four weeks. Whatever is appropriate for your usage frequency. When the window closes, you make the next decision: iterate again, scale, or kill.
Step 5: Decide. This is the step everyone skips. The timebox ends. The data comes in. And someone has to make a call. Not “let us keep watching it.” A call. Continue, change direction, or stop.
Common traps
Vanity metrics. Page views, app installs, registered users. These numbers go up and to the right and tell you nothing about whether your product is working. If your CEO asks for a dashboard, give them one with engagement and retention metrics, not raw counts.
Metric fixation. Once you pick a metric, you will optimize for it — sometimes at the expense of things you did not measure. A team optimizing for daily active users might build addictive notification patterns that damage long-term retention. Always pair your primary metric with a guardrail metric that catches unintended harm.
Survivorship bias. You survey users who love the feature and conclude it is a success. But you never talk to users who tried it once and abandoned it. The people who left have the information you actually need.
The “more data” delay. “We need more data before we can decide.” Sometimes this is legitimate. More often it is decision avoidance disguised as rigor. If you have two weeks of data and the metric is at 15% of target, you do not need four more weeks to know the feature is struggling.
Test yourself
You are a PM at a logistics startup in Bengaluru. Three weeks ago, you launched a route optimization feature for delivery partners. The hypothesis was that it would reduce average delivery time by 15%. Actual result: delivery time dropped by only 4%. Your engineering team spent six weeks building it. The next planning cycle starts Monday.
Your manager asks for your recommendation in tomorrow's review. The data is clear: 4% improvement vs. 15% target. What do you propose?
Your path
You are PM at Flipkart working on the grocery delivery team (Flipkart Quick). You launched a 'scheduled delivery' feature three weeks ago that lets users book grocery slots up to 3 days in advance. Your dashboard shows: 40,000 scheduled deliveries booked (strong adoption), average slot fill rate 87% (looks healthy), cancellation rate 2.3% (within normal range). Your VP presents this at the monthly review as a clear success. But you have been looking at a different number: same-day reorder rate. Users who used scheduled delivery once are reordering same-day (non-scheduled) at 34% lower frequency than users who never used it. The scheduled delivery adoption metric is up. The engagement depth metric you care about is down.
The call: Do you raise the reorder rate drop in the VP review, or wait until you have more data? What does the divergence between the adoption metric and the reorder metric actually tell you?
Where to go next
- Set the right metrics from the start: Metrics and KPIs
- Write measurement into your specs: Writing PRDs
- Understand what users actually need: User Research Methods
- Build the strategic context for what to measure: Product Vision and Strategy