AI product cases
I have lost count of the product managers who tell me they will “use AI” to solve whatever product problem I give them. It sounds impressive right up until the first follow-up question, when they go silent.
Every second PM candidate in India now answers every case study with “we will use AI.” Recommendation engine. Chatbot. Personalization. GenAI. The words flow easily. The understanding does not.
I have interviewed and trained over 10,000 PMs across Pragmatic Leaders cohorts. The pattern is consistent: candidates who say “AI” without understanding the product decisions behind AI get rejected. Not because AI is wrong — but because they cannot answer the follow-up. What data do you need? How do you measure if the model is working? What happens when the model is wrong? What is the fallback? How expensive is this to run?
AI products are not regular products with a model bolted on. They have fundamentally different failure modes, feedback loops, and success metrics. If you want to build them — or even discuss them intelligently in an interview — you need to understand those differences.
Here are four worked cases that cover the AI product decisions most PMs will face.
Case 1: The recommendation engine that nobody trusted
A mid-size Indian e-commerce company — think Myntra-scale, fashion vertical — launched a recommendation engine. Collaborative filtering, content-based signals, purchase history, browse behavior. The ML team spent four months building it. The model’s offline precision was strong. Click-through rate on recommended items in A/B testing was 3.2x higher than the generic “popular items” carousel.
They shipped it. Conversion from recommendations dropped after two weeks.
Why? Because the model worked too well. It surfaced items that were eerily accurate — a user who browsed maternity wear started seeing baby products everywhere. A user who searched for plus-size clothing once saw nothing but plus-size recommendations for weeks. The algorithm had no concept of sensitivity, context, or the difference between “interested” and “actively shopping.”
| Dimension | Traditional product | AI-powered product |
|---|---|---|
| Failure mode | Feature does not work (bug) | Feature works but produces wrong/harmful output |
| User feedback | Explicit — clicks, ratings, complaints | Implicit — engagement drops, but users do not tell you why |
| Testing | A/B test on/off states | A/B test plus model quality metrics (precision, recall, diversity) |
| Iteration cycle | Ship fix → deploy → done | Retrain model → validate offline → A/B test → monitor drift |
| Edge cases | Definable and testable | Infinite and emergent |
Weekly product review. The recommendation engine's conversion metrics have declined for the second straight week despite high CTR.
ML Lead: “The model is performing well. Precision at top-5 is 0.34, which is above our benchmark. CTR on reco widgets is still 3x the control.”
PM: “Then why is conversion from recommendations down 18% since launch?”
ML Lead: “Conversion is a product metric, not a model metric. The model recommends relevant items. Whether users buy depends on price, inventory, delivery time...”
PM: “Pull the recommendations for users who clicked but did not buy. What are they seeing?”
Data Analyst: “...a lot of repeat categories. Users who browsed kurtas once are seeing only kurtas. The diversity score is 0.12 — almost zero.”
PM: “So the model is precise but narrow. It is showing users what they already looked at, not what they might want next. That is a product problem, not a model problem. We need a diversity constraint.”
After adding a diversity threshold — at least 3 categories in every 10 recommendations — conversion recovered and surpassed the pre-launch baseline by 11%.
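The diversity threshold the team added can be sketched as a greedy re-rank over the model's output. This is a minimal illustration, not the team's actual implementation: the function name, the `(item_id, category, score)` shape, and the exact reservation logic are assumptions for the example.

```python
from collections import Counter

def rerank_with_diversity(candidates, top_n=10, min_categories=3):
    """Greedy re-rank: take items in relevance order, but once slots run low,
    skip items from already-seen categories so every top_n slate spans at
    least min_categories categories (e.g. 3 categories in every 10 recos).

    candidates: list of (item_id, category, relevance_score),
    already sorted by relevance, highest first.
    """
    slate, skipped = [], []
    categories = Counter()
    for item_id, category, score in candidates:
        slots_left = top_n - len(slate)
        categories_still_needed = min_categories - len(categories)
        # Reserve the last few slots for categories we have not shown yet.
        if category in categories and slots_left <= categories_still_needed:
            skipped.append((item_id, category, score))
            continue
        slate.append((item_id, category, score))
        categories[category] += 1
        if len(slate) == top_n:
            break
    # Backfill with the most relevant skipped items if the slate is short
    # (e.g. the candidate pool simply does not have enough categories).
    for item in skipped:
        if len(slate) == top_n:
            break
        slate.append(item)
    return slate
```

The point of the sketch: the constraint lives in the ranking layer the PM specifies, not inside the model. The ML team's relevance scores are untouched; the product decides how they are consumed.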
The ML team optimized for relevance. The PM should have defined the objective function to include diversity from day one.
The PM lesson: In AI products, the PM owns the objective function — not the model architecture. If you tell your ML team “maximize click-through,” they will build a model that maximizes click-through. If that model creates a filter bubble that kills conversion, that is a product failure, not an engineering failure. The PM must define what “good” looks like holistically: relevance AND diversity AND sensitivity AND recency. You do not need to know how gradient descent works. You need to know what the model should optimize for and what constraints it must respect.
The India-specific angle: Indian e-commerce has a unique challenge — the same user shops for themselves, their parents, their in-laws, and their children. A single account is often a household. Recommendation engines trained on individual user behavior break down when the “user” is actually four people with different tastes, sizes, and price sensitivities. Flipkart and Myntra both had to build profile-switching features to solve this. If you are building recommendations for the Indian market, ask: is this one user or one household?
Case 2: The chatbot that could not say “I don’t know”
An Indian insurtech startup — Series B, 2 million customers — replaced their FAQ section with an LLM-powered chatbot. The pitch was compelling: reduce support ticket volume by 40%, handle policy queries in natural language, available 24/7 in Hindi and English.
The chatbot launched and performed well on common queries. “What is my policy number?” “When is my premium due?” “How do I file a claim?” It handled these with 92% accuracy.
Then a user asked: “My father just died. What do I do about his life insurance policy?”
The chatbot responded with a cheerful: “I can help you with that! To make changes to a policy, please provide the policy number and the policyholder’s date of birth.”
No empathy. No sensitivity flag. No escalation to a human agent. Just a standard process flow applied to a grief-stricken customer.
This is the fundamental challenge of LLM-powered products: they are fluent but not intelligent. They generate responses that sound right, and they keep sounding right even when they are catastrophically wrong. The failure mode is not a crash or an error message. It is confident wrongness that erodes trust.
The PM lesson: The most important feature of an AI product is knowing its own limits. An LLM chatbot that confidently answers questions it should not answer is worse than no chatbot at all. As the PM, your job is to define the boundary between “the AI handles this” and “a human handles this” — and that boundary is not about accuracy alone. It is about consequence. A wrong answer to “what are your office hours” is annoying. A wrong answer to “is my mother’s cancer treatment covered” is devastating. Map your query space by consequence severity, not just by frequency.
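Mapping the query space by consequence severity translates directly into routing logic. Here is a minimal sketch with illustrative keyword lists and confidence thresholds; a production system would use a trained intent classifier rather than substring matching, and the topic lists would come from support ticket analysis, not guesswork.

```python
# Illustrative only: real topic detection needs an intent classifier,
# and these keyword lists and thresholds are assumptions for the example.
SENSITIVE_TOPICS = {"died", "death", "deceased", "cancer", "accident"}
HIGH_STAKES_TOPICS = {"claim", "coverage", "nominee", "surrender"}

def route_query(query: str, model_confidence: float) -> str:
    """Return 'bot', 'bot_with_disclaimer', or 'human'.

    The boundary is drawn by consequence, not accuracy alone: a sensitive
    query never reaches the bot, no matter how confident the model is.
    """
    q = query.lower()
    if any(topic in q for topic in SENSITIVE_TOPICS):
        # Grief, illness, accidents: escalate to a human immediately.
        return "human"
    if any(topic in q for topic in HIGH_STAKES_TOPICS):
        # Wrong answers here are devastating: bot answers only when very
        # confident, and always with a visible path to a human agent.
        return "bot_with_disclaimer" if model_confidence >= 0.9 else "human"
    # Low-stakes FAQ: the bot answers, but hands off when unsure.
    return "bot" if model_confidence >= 0.6 else "human"
```

Notice that the sensitive branch checks topics before confidence. That ordering is the product decision: no confidence score buys the bot the right to answer a grieving customer.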
The India-specific angle: India has two chatbot challenges that Western markets do not. First, code-switching — users in Indian cities routinely mix Hindi and English in the same sentence (“Mera premium kab due hai?”). Most LLMs handle pure Hindi or pure English well but stumble on Hinglish. Second, the trust gap. Indian users who interact with a chatbot and get a wrong answer do not just leave — they call the customer support number, tell the agent the chatbot gave wrong information, and create a more expensive support interaction than if the chatbot had never existed. The chatbot must be right or silent. There is no middle ground.
Case 3: Content generation — when AI writes and humans curate
A content platform — edtech, focused on competitive exam preparation — decided to use generative AI to scale their content library. The logic: they had 200 human-written practice questions per topic. They needed 2,000. At their current rate of manual creation, that would take 18 months. GPT-4 could generate 2,000 questions in a week.
The generated questions looked good. Grammar was correct. Difficulty levels seemed appropriate. The team did a spot check — 50 random questions reviewed by subject experts. 46 were usable. 92% accuracy. They shipped the full set.
Within a month, students started complaining. Not about wrong answers — about subtle wrongness. A physics question that used the right formula but set up an impossible scenario (a ball thrown at 500 m/s). A history question where two options were technically correct depending on which textbook you referenced. A math question where the “correct” answer had a rounding error that only mattered if you were solving it properly instead of plugging in options.
The 92% accuracy from the spot check was misleading because the spot check was done by subject experts who unconsciously corrected minor errors while reading. When students — who are learning, not reviewing — encountered those same errors, they were confused and lost trust in the platform.
| Content quality dimension | Human-written | AI-generated | PM implication |
|---|---|---|---|
| Grammar and structure | High | High | Not a differentiator |
| Factual accuracy | High (expert reviewed) | Variable (confidently wrong) | Needs expert review pipeline, not spot checks |
| Pedagogical quality | Designed for learning | Designed to look correct | Rubric needed: does this question teach or just test? |
| Edge case handling | Expert catches impossible scenarios | Model does not know what is impossible | Domain-specific validation rules required |
| Cost per question | Rs 150-300 | Rs 1-3 | 100x cheaper but useless without review |
Content ops review after student NPS drops 12 points. The team is diagnosing whether AI-generated questions are the cause.
Content Lead: “We added 1,800 AI-generated questions last month. Student complaints about 'wrong questions' are up 4x.”
PM: “What is our review process for AI-generated content?”
Content Lead: “Spot check. 5% sample reviewed by subject experts before publishing.”
PM: “So 95% of AI-generated questions went live without any human review?”
Content Lead: “The alternative is reviewing all 1,800. That takes the same time as writing them manually.”
PM: “Not exactly. Reviewing is faster than writing. But the real question is: can we build a validation layer? Auto-check for impossible physical values, duplicate answer options, ambiguous phrasing. That catches the mechanical errors. Experts review only what passes the auto-check — and they review for pedagogical quality, not grammar.”
The team built a rule-based validator that caught 23% of AI-generated questions as potentially flawed. Expert review of the remaining 77% took one-third the time of full manual creation. Net result: 3x content throughput at 98% quality.
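A rule-based validator like the one described can start very simply. The checks and bounds below are illustrative assumptions, not the team's actual rules; real thresholds and patterns come from subject-matter experts, one domain at a time.

```python
import re

# Assumption for the example: no exam question should involve a person
# throwing an object faster than roughly 70 m/s.
MAX_THROWN_SPEED_MS = 70

def validate_question(question: str, options: list[str]) -> list[str]:
    """Return a list of flags; an empty list means the question passed
    the auto-check. Flagged questions are routed to expert review."""
    flags = []
    # Mechanical check 1: duplicate answer options.
    if len({o.strip().lower() for o in options}) < len(options):
        flags.append("duplicate_options")
    # Mechanical check 2: physically impossible thrown-object speeds,
    # e.g. the "ball thrown at 500 m/s" failure from the case.
    for speed in re.findall(r"thrown at (\d+(?:\.\d+)?)\s*m/s", question):
        if float(speed) > MAX_THROWN_SPEED_MS:
            flags.append(f"impossible_speed:{speed}")
    # Mechanical check 3: phrasing that signals textbook-dependent answers.
    if re.search(r"\b(arguably|some sources|depending on)\b", question, re.I):
        flags.append("ambiguous_phrasing")
    return flags
```

None of these checks require ML. That is the point: the automated layer catches mechanical errors cheaply so experts spend their time on pedagogical judgment, not proofreading.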
AI-generated content is not free content. The cost just shifts from creation to validation. A PM who does not plan for that shift ships garbage at scale.
The PM lesson: AI content generation does not eliminate humans from the loop — it changes what humans do. The PM must design the human-AI workflow before the AI generates a single piece of content. The workflow is: AI generates, automated rules validate, humans review for judgment calls. If you skip the middle layer (automated validation), humans drown in review volume and start rubber-stamping. If you skip the last layer (human review), you ship confidently wrong content at scale. Neither is acceptable.
The India-specific angle: India’s competitive exam ecosystem — JEE, NEET, UPSC, CAT — has zero tolerance for question errors. A single wrong question in a mock test can send thousands of students down an incorrect preparation path. The stakes are not “user engagement drops.” The stakes are “a student spends 3 hours mastering a concept that was incorrectly represented in your practice set.” If you are building AI-generated educational content for the Indian market, your quality bar is not “good enough.” It is “would you bet a student’s rank on this question?”
Case 4: OYO’s dynamic pricing — when the model is the product
OYO built a dynamic pricing engine that adjusts room rates based on demand signals: booking patterns, local events, seasonality, day of week, competitor pricing. The model runs for over 100,000 properties. It is not a feature within the product — it IS the product for the supply side.
The PM challenge with dynamic pricing is not the model — it is the stakeholder management. Hotel owners do not understand why their room rate changed from Rs 1,200 to Rs 800 on a Tuesday. They see a 33% revenue cut. The model sees optimal occupancy-rate tradeoff. Same data, completely different interpretation.
The PM lesson: When AI makes decisions that affect stakeholders who did not ask for AI, explainability is not a nice-to-have — it is a retention feature. OYO’s pricing model was mathematically correct and commercially sound. It still caused churn because the affected stakeholders (hotel owners) could not see why. The PM’s job is not just to ship the model. It is to ship the model with the explanation layer, the override mechanism, and the trust-building dashboard. If your AI changes someone’s revenue and they cannot understand why, you have built a black box that humans will reject — no matter how accurate it is.
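An explanation layer can start as simple templating over the model's dominant demand signal. A sketch, with hypothetical inputs; a real system would pull these values from the pricing model's feature attributions rather than passing them by hand.

```python
def explain_price_change(old_rate: int, new_rate: int, city: str,
                         day: str, demand_vs_baseline: float) -> str:
    """Turn a model's price decision into an owner-facing narrative.

    demand_vs_baseline: percent difference from baseline demand,
    negative when demand is running below baseline.
    """
    direction = "lowered" if new_rate < old_rate else "raised"
    pct = abs(new_rate - old_rate) / old_rate * 100
    trend = "below" if demand_vs_baseline < 0 else "above"
    return (
        f"We {direction} your rate from Rs {old_rate} to Rs {new_rate} "
        f"({pct:.0f}%) because {day} demand in {city} is running "
        f"{abs(demand_vs_baseline):.0f}% {trend} baseline. Lower rates fill "
        f"rooms that would otherwise stay empty. You can set a floor price "
        f"or turn off automatic pricing for your property at any time."
    )
```

The last sentence matters as much as the numbers: the override mechanism is part of the explanation, because trust comes from control, not just transparency.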
The India-specific angle: OYO operates across 800+ cities in India. A hotel owner in Varanasi has a fundamentally different mental model of pricing than a hotel owner in Gurgaon. The Varanasi owner prices based on season and pilgrimage dates — fixed rates that have not changed in years. The Gurgaon owner already does informal dynamic pricing based on corporate demand. A single pricing model with a single explanation UI cannot serve both. The PM must segment the supply side by pricing sophistication, not just geography. The model might be the same. The explanation — and the degree of owner control — must vary.
The AI product playbook — what these cases share
Four cases. One discipline. AI products succeed when the PM gets five things right:
1. Own the objective function. The ML team builds the model. You define what “good” means. If the model optimizes for the wrong thing, that is your failure, not theirs. Write down what the model should maximize, what it should constrain, and what it must never do — before the first line of training code is written.
2. Design the failure mode. Every AI product will be wrong sometimes. The question is not “how do we prevent errors” — it is “what happens when the error occurs?” A recommendation that is slightly off is forgivable. A chatbot that gives wrong medical advice is a lawsuit. Map your error space by consequence, not probability.
3. Build the human-AI boundary. Decide what the AI handles autonomously, what it handles with human oversight, and what it must not handle at all. This boundary is the most important product decision in any AI feature. Get it wrong and you either waste the AI’s capabilities (too conservative) or create trust-destroying failures (too aggressive).
4. Ship the explanation, not just the output. If your AI makes a decision that affects a user or stakeholder, they must be able to understand why. “The algorithm decided” is never an acceptable explanation. Even if the full technical explanation is complex, a simplified narrative (“We lowered the rate because Tuesday demand in Jaipur is typically 40% below weekend demand, and lower rates fill rooms that would otherwise stay empty”) builds trust.
5. Measure the whole system, not just the model. Precision, recall, and F1 scores are model metrics. They are necessary but not sufficient. Product metrics — conversion, retention, NPS, support ticket volume, stakeholder churn — tell you whether the AI product is working. A model with 95% accuracy that drives users away is a failed product.
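The whole-system view can be encoded as a launch scorecard in which model metrics act as gates but product metrics make the call. The metric names and thresholds below are illustrative assumptions, not a standard:

```python
def launch_verdict(model_metrics: dict, product_metrics: dict) -> str:
    """Go/no-go sketch: a model can pass its own benchmarks and still
    fail the product. Thresholds here are illustrative only."""
    model_ok = (model_metrics["precision_at_5"] >= 0.30
                and model_metrics["diversity"] >= 0.30)
    product_ok = (product_metrics["conversion_delta_pct"] >= 0
                  and product_metrics["support_tickets_delta_pct"] <= 5)
    if model_ok and product_ok:
        return "ship"
    if model_ok and not product_ok:
        # The Case 1 trap: model metrics look healthy while the product
        # bleeds. The fix is usually the objective function, not the model.
        return "hold: model healthy, product regressing; revisit the objective"
    return "hold: model below bar"
```

Run this mentally against Case 1: precision 0.34 passes, diversity 0.12 fails, conversion is down 18%. The scorecard would have held the launch long before the second week of declining conversion.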
Practice case
Context: Pick one of these real Indian products: Zepto (10-minute grocery delivery), Practo (doctor appointments), Policybazaar (insurance comparison), Urban Company (home services), or CRED (credit card management).
Your brief:
- Identify one user problem in the product that could benefit from AI (recommendation, prediction, generation, or classification).
- Define the objective function: what should the AI maximize? What constraints must it respect? What must it never do?
- Map the failure modes by consequence:
- Low consequence failures — wrong but harmless (e.g., recommending a mop when the user wanted a broom)
- Medium consequence failures — wrong and annoying (e.g., predicting a 10-minute delivery that takes 40 minutes)
- High consequence failures — wrong and damaging (e.g., recommending a medication interaction that is dangerous)
- Design the human-AI boundary: what does the AI handle alone? What requires human oversight? What is off-limits for AI entirely?
- Define three product metrics (not model metrics) that tell you whether this AI feature is working.
Constraint: You must specify what happens when the AI is wrong. “The model will be accurate” is not an answer. “When the model is wrong, the user sees X and can do Y” is an answer.
You are a PM at an Indian neobank (think Jupiter or Fi Money). The CEO wants to launch an LLM-powered financial assistant that helps users manage spending, understand their transactions, and get personalized savings advice. The CEO says: “Every fintech will have this in 6 months. We need to be first.” You have an ML team of 4 and a 10-week timeline.
The CEO wants a full-featured AI financial advisor. Your ML team says they can fine-tune an LLM on your transaction data in 6 weeks, leaving 4 weeks for testing and launch. The compliance team has not been consulted yet.
The call: Do you commit to the CEO’s full-featured advisor on this timeline, or scope down to what compliance can approve and the team can properly test in 10 weeks?
You are PM on Zerodha’s Nudge feature — an AI system that warns traders about risky options positions before they execute. The model flags 30% of all orders as “high risk,” leading users to dismiss warnings habitually. The ML team wants 2 more months to retrain the model to reduce false positives.
The call: Do you wait 2 months for the better model, or ship a rule-based threshold now to stop warning fatigue?
Where to go next
- Understand AI fundamentals for PMs: AI Fundamentals
- Build AI into your product strategy: AI Product Strategy
- Practice other case study types: How to Approach Any Case Study
- Indian market context for your cases: Indian Market Cases
- Metrics that matter for AI products: Metrics and KPIs