working with data science & ml teams
When you work with engineering, you argue about scope. When you work with data science, you argue about reality. The model is not wrong — your expectations are. That is the first lesson every PM working with ML needs to internalise.
Here is the uncomfortable truth about ML features: they are probabilistic. Not sometimes. Always. Every prediction your model makes comes with a confidence score, and that confidence score is not 100%. It will never be 100%. If you cannot accept that, you should not be building ML features.
I have watched dozens of PMs walk into their first ML project treating it like a software feature. They write a spec, hand it to the data science team, and expect a deliverable in two sprints. Six months later, they are still waiting, the stakeholders are furious, and the data scientist has quit. The PM blames the DS team for being slow. The DS team blames the PM for not understanding the problem.
Both are right. And both could have avoided it.
ML timelines are not engineering timelines
When you ask an engineer to build a feature, they can give you a rough estimate. It might be wrong by 30-50%, but the shape of the work is knowable. Requirements in, working software out. The path has uncertainty, but the destination is clear.
Data science does not work this way. The path and the destination are both uncertain. You are asking the team to answer a question that might not have an answer: can we predict X from the data we have?
The typical ML project has three phases that PMs consistently underestimate:
Phase 1: Data exploration (2-6 weeks). Before anyone builds a model, they need to understand the data. Is it clean? Is it representative? Does it contain the signal you think it does? I have seen teams spend four weeks just discovering that their training data has a label imbalance so severe that any model trained on it will predict the majority class 95% of the time and be useless. This phase cannot be skipped or shortened. It is the foundation.
Phase 2: Experimentation (4-12 weeks). The team tries multiple approaches. Some fail. This is not waste — it is the scientific method. You try logistic regression. It gets 72% accuracy. You try a gradient-boosted tree. It gets 78%. You try a neural network. It gets 79% but takes 10x longer to train and 50x longer to run in production. Now you have a decision to make, and the PM needs to be in the room for it.
Phase 3: Productionisation (2-8 weeks). The model that works in a Jupyter notebook is not the model that works in production. Latency, throughput, monitoring, retraining pipelines, feature stores, A/B testing infrastructure — this is where ML engineering meets software engineering, and it is where most timelines blow up.
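Much of Phase 1 is sanity checks like the label-imbalance discovery above, and they are cheap to run early. A minimal sketch in Python; the function name and the 90% warning threshold are illustrative choices, not a standard:

```python
from collections import Counter

def label_balance_report(labels, majority_warn=0.90):
    """Summarise class balance before anyone trains a model.

    If one class dominates, a trivial model that always predicts it
    will look deceptively accurate. Threshold is an illustrative choice.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    majority_class, majority_count = counts.most_common(1)[0]
    majority_share = majority_count / total
    return {
        "counts": dict(counts),
        "majority_class": majority_class,
        # The "dumb baseline": always predict the majority class.
        "baseline_accuracy": round(majority_share, 3),
        "severely_imbalanced": majority_share >= majority_warn,
    }

# 95 legitimate transactions for every 5 frauds: 95% accuracy for free.
report = label_balance_report(["legit"] * 95 + ["fraud"] * 5)
print(report)
```

If `severely_imbalanced` comes back true, the conversation with the DS team shifts from "what accuracy can we hit?" to "by how much does the model beat the dumb baseline?"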
If you are planning an ML feature and your timeline is “6 weeks,” you are lying to yourself. The honest answer is “3-6 months for the first version, with significant uncertainty about the outcome.”
Stop writing deterministic requirements for probabilistic systems
The single most damaging mistake PMs make with ML features is writing requirements as if the output is deterministic. “The system will recommend the right product to the user.” No, it will not. It will recommend a product that, based on historical patterns, has a higher probability of being relevant. Sometimes it will be wrong.
Here is what a bad ML requirement looks like:
The fraud detection system will flag all fraudulent transactions and allow all legitimate ones.
Here is what a useful ML requirement looks like:
The fraud detection system will catch at least 90% of fraudulent transactions (recall >= 0.90) while keeping false positives below 5% of total flagged transactions (precision >= 0.95). We accept that approximately 10% of fraudulent transactions will go undetected, and we will handle those through manual review.
The second version does three things the first does not:
- It acknowledges imperfection. Every ML system has errors. The question is not whether it will be wrong, but how often and in which direction.
- It specifies the tradeoff. More fraud caught means more legitimate transactions falsely flagged. The PM decides where to draw that line — not the data scientist.
- It defines the fallback. When the model is wrong (and it will be), what happens? Manual review, customer support escalation, a refund process. The system around the model matters as much as the model itself.
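One advantage of writing the requirement this way is that it becomes mechanically checkable. A hedged sketch, using the thresholds from the sample requirement above; the function name and the evaluation counts are mine:

```python
def meets_launch_bar(tp, fp, fn, min_precision=0.95, min_recall=0.90):
    """Evaluate fraud-model error counts against the written requirement.

    tp: fraudulent transactions correctly flagged
    fp: legitimate transactions wrongly flagged
    fn: fraudulent transactions the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "ship": precision >= min_precision and recall >= min_recall,
    }

# Hypothetical evaluation run: 92 of 100 frauds caught, 4 false alarms.
verdict = meets_launch_bar(tp=92, fp=4, fn=8)
print(verdict)
```

The point is not the code; it is that "ship" is now a yes/no question that both the PM and the DS team agreed to before the first experiment ran.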
The accuracy-coverage tradeoff will define your career in ML
Every ML PM faces this tradeoff, and most handle it poorly because nobody taught them the vocabulary.
Accuracy (or precision): Of the predictions the model makes, how many are correct?

Coverage (or recall): Of all the cases where the model should act, how many does it actually catch?
You cannot maximise both. If you want the model to catch every possible case (high coverage), it will also flag many false positives (low accuracy). If you want every prediction to be correct (high accuracy), the model will be conservative and miss many real cases (low coverage).
A concrete example from Indian fintech. You are building a credit risk model for a digital lending product. Two options:
Option A: High accuracy, low coverage. The model approves only applicants it is very confident about. 95% of approved applicants repay on time. But you reject 60% of applicants who would have repaid — they go to a competitor.
Option B: High coverage, low accuracy. The model approves most applicants who might be creditworthy. You capture 90% of the addressable market. But 15% of approved applicants default, and your NPAs eat into margins.
Neither is correct. The right answer depends on your business model, your margin structure, your collection infrastructure, and your competitive position. That is a PM decision, not a data science decision. The DS team can build either model. Your job is to tell them which tradeoff to optimise for.
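In practice the lever is usually a single approval threshold on the model's score: Option A and Option B are often the same model with different cutoffs. A toy sketch with invented applicants makes the tradeoff visible:

```python
def sweep_threshold(scored_applicants, thresholds):
    """For each approval cutoff, report coverage (share approved) and the
    repayment rate among those approved. Input is (score, repaid) pairs,
    where score is the model's confidence that the applicant repays.
    """
    results = {}
    for cutoff in thresholds:
        approved = [repaid for score, repaid in scored_applicants if score >= cutoff]
        coverage = len(approved) / len(scored_applicants)
        repay_rate = sum(approved) / len(approved) if approved else None
        results[cutoff] = (round(coverage, 2),
                           None if repay_rate is None else round(repay_rate, 2))
    return results

# Invented data: ten applicants, 1 = repaid, 0 = defaulted.
applicants = [(0.95, 1), (0.90, 1), (0.85, 1), (0.80, 0), (0.70, 1),
              (0.60, 1), (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0)]
tradeoff = sweep_threshold(applicants, [0.85, 0.50])
print(tradeoff)  # strict cutoff: low coverage; loose cutoff: more defaults
```

Here the 0.85 cutoff behaves like Option A and the 0.50 cutoff like Option B. The PM's job is choosing the row, not building the table.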
Offline metrics lie. Online metrics reveal.
Your data scientist comes to you excited. “The model achieved 92% accuracy on the test set!” You present this to leadership. The feature launches. Conversion does not move. Usage is flat. What happened?
The test set is not the real world. Offline evaluation tells you whether the model learned the patterns in historical data. Online evaluation tells you whether those patterns matter to users.
Three reasons offline metrics mislead:
1. Distribution shift. The model was trained on data from January to June. You launch in September. User behaviour changed — maybe a competitor launched, maybe there is a seasonal pattern, maybe a pandemic started. The model is optimising for a world that no longer exists.
2. Proxy metrics. Accuracy on a test set is a proxy for what you actually care about, which is a business outcome. A recommendation engine with 85% accuracy on click prediction might produce worse revenue than one with 75% accuracy, if the 75% model recommends higher-margin items.
3. User interaction effects. In offline evaluation, the model sees static data. In production, users react to the model’s output, which changes their behaviour, which changes the data, which changes the model’s performance. A content ranking model that promotes clickbait will train users to click on clickbait, which will reinforce the model’s belief that clickbait is good content. This is a feedback loop, and it can destroy your product.
The only metric that matters is the one you measure in production, with real users, over a meaningful time window. Everything else is a hypothesis.
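The proxy-metric trap in point 2 is just arithmetic. A back-of-envelope sketch; all figures are invented for illustration:

```python
def revenue_per_impression(click_through_rate, avg_margin_per_click):
    """Expected margin earned each time a recommendation is shown."""
    return click_through_rate * avg_margin_per_click

# Model A: better at predicting clicks, but recommends low-margin bestsellers.
model_a = revenue_per_impression(click_through_rate=0.085, avg_margin_per_click=40)
# Model B: worse click prediction, but surfaces high-margin long-tail items.
model_b = revenue_per_impression(click_through_rate=0.075, avg_margin_per_click=55)
print(f"A: {model_a:.2f}  B: {model_b:.2f}  B wins: {model_b > model_a}")
```

The "worse" model by the offline metric is the better business, which is exactly why the launch metric has to be revenue, not click accuracy.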
Quarterly business review. The PM is presenting an ML-powered feature to the CEO.
PM: “We are launching smart product recommendations next month. The model achieves 88% accuracy on our test set.”
CEO: “88%? So 12% of the time it shows users the wrong product? That is terrible. Our manual curation team gets it right almost every time.”
PM: “The manual team curates 200 products. We have 50,000 SKUs. The model covers the entire catalogue at 88%. The manual team covers 0.4% at near-100%.”
CEO: “I want 95% accuracy before we launch this to customers.”
PM: “We can get to 95% by restricting recommendations to the 5,000 most popular products. But that defeats the purpose — the long tail is where the margin is.”
CEO: “Then make the model better.”
Data Science Lead: “To go from 88% to 95%, we need three things: six months of additional behavioural data, a dedicated ML engineer for real-time feature serving, and a labelling team to annotate edge cases. We estimate five to six months and roughly forty lakhs in compute and labelling costs.”
The room goes quiet. The CEO did not expect that going from 88% to 95% would cost more than the first 88%.
PM: “Here is what I recommend. We launch at 88% for the long tail, keep manual curation for the top 200, and measure conversion lift over 8 weeks. If the lift justifies the investment, we fund the push to 95%.”
Notice what the PM did there: reframed an argument about accuracy into an argument about economics, with a measurable exit criterion for the next decision.
This meeting happens in some form at every company building ML features. The CEO hears “accuracy” and thinks it should be 100%. The PM’s job is to reframe the conversation: not “how accurate is it?” but “what is the cost of being wrong, and does the value of being right exceed that cost?”
The “good enough” threshold ships products
I have a phrase I repeat in every ML product workshop: 85% accuracy that ships beats 95% accuracy that never does.
This is not an argument for sloppy work. It is an argument against perfectionism in a domain where perfection is mathematically impossible. Every percentage point of improvement past a certain threshold costs exponentially more — more data, more compute, more time, more specialised talent. The marginal cost of improvement increases while the marginal user benefit often does not.
The PM’s job is to define the “good enough” threshold before the project starts. This threshold should be based on three things:
- The cost of errors. A wrong movie recommendation wastes a user’s evening. A wrong medical diagnosis kills someone. The stakes define the threshold.
- The current baseline. If users currently have no recommendation and you give them one that is right 80% of the time, that is a massive improvement. If they currently have a rule-based system that is right 75% of the time and you give them one that is right 78%, they may not notice.
- The alternative. What happens if you do not launch? The status quo has a cost too. Waiting six months for 95% accuracy means six months of users suffering with no recommendations at all.
Write the threshold into your requirements document. Make the DS team agree to it before they start. Revisit it only if new information changes the cost-benefit calculation — not because someone in leadership “feels like it should be higher.”
Feature flags are non-negotiable for ML
You would not launch a deterministic software feature without a kill switch. For ML features, the need is even greater, because the failure mode is not “it crashes” — it is “it slowly degrades and nobody notices until revenue drops.”
Every ML feature should launch with:
1. A feature flag with percentage rollout. Start at 5%. Watch the metrics. Ramp to 10%, 25%, 50%, 100%. If something breaks at 25%, you roll back to 10% while you investigate. This is more important for ML than for regular features because ML failures are often statistical — they only show up at scale.
2. A shadow mode option. The model runs in production but does not affect the user experience. You log what it would have done and compare it to what actually happened. This costs compute but buys you confidence before you expose users to a model that might be wrong.
3. Real-time monitoring dashboards. Not just model accuracy — business metrics that the model should affect. If your recommendation engine launches and average order value drops, you need to know within hours, not weeks.
4. An automatic circuit breaker. If the model’s prediction distribution shifts beyond a defined threshold (indicating something has gone wrong with the input data or the model itself), the system falls back to a rule-based default. No human needs to be awake at 3am for this.
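A common way to implement point 4 is the Population Stability Index (PSI) computed over the model's binned prediction scores. A sketch, assuming the widely used 0.25 rule of thumb as the trip threshold (not a universal constant); the histograms below are invented:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned score distributions.

    Both inputs are lists of bin proportions summing to ~1.0.
    Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 significant shift.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

def circuit_breaker_tripped(training_dist, live_dist, threshold=0.25):
    """True means: stop serving the model, fall back to the rule-based default."""
    return psi(training_dist, live_dist) > threshold

training = [0.25, 0.50, 0.25]    # score histogram at training time
normal_day = [0.24, 0.52, 0.24]  # live traffic that resembles training
bad_day = [0.05, 0.30, 0.65]     # upstream data broke; scores pile up high
print(circuit_breaker_tripped(training, normal_day),
      circuit_breaker_tripped(training, bad_day))
```

The comparison runs on every batch of live predictions, so the fallback happens in minutes, with no human awake at 3am.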
The India-specific reality
If you are building ML products in India, there are realities that Western ML playbooks do not cover.
The talent landscape is deep but narrow. India produces exceptional ML researchers — the IIT-to-ML pipeline is real, and Indian data scientists regularly publish at top conferences. But the gap between ML research talent and ML engineering talent is wide. The person who can design a novel architecture in a Kaggle competition may not know how to deploy a model behind an API with 99.9% uptime. You need both, and finding both in one person is rare.
Distributed DS teams are the norm, not the exception. Your data scientist might be in Bangalore, your ML engineer in Hyderabad, your data engineer in Pune, and your PM in Mumbai. This is standard at Indian startups and MNCs alike. The collaboration patterns that work when your DS team sits next to you — whiteboard sessions, quick “can you check this?” conversations — do not work here. You need written experiment briefs, documented model cards, and async review processes. Invest in documentation infrastructure early.
Compute costs hit differently. At a Bay Area startup with $50M in funding, spinning up GPU clusters for training is a line item. At an Indian startup with $5M in Series A, every GPU hour matters. This changes your technical choices. Fine-tuning a large language model might be out of reach. Training a smaller, task-specific model with careful feature engineering might be the better path — and your DS team knows this. Listen to them when they say “we do not need a transformer for this, a gradient-boosted tree will do.” They are often right, and they are saving you money.
Data quality is a bigger problem than model quality. India’s digital ecosystem generates massive data — UPI alone processes billions of transactions. But the data is messy. Addresses are inconsistent. Names are transliterated differently across systems. Phone numbers change frequently. Regional language data is scarce. If your model depends on clean, structured Indian user data, budget 40% of your timeline for data cleaning and preprocessing. This is not a failure of planning — it is the reality of building on Indian data infrastructure.
How to work with your DS team without driving them insane
Data scientists are not engineers. They are closer to researchers. This is not a judgment — it is a workflow observation that changes how you collaborate.
Give them the problem, not the solution. Do not say “build me a recommendation engine using collaborative filtering.” Say “users are not discovering relevant products beyond the top 50. Here is the data we have on browsing and purchase behaviour. What approaches could surface relevant long-tail products?” Let them choose the method. They know more about algorithms than you do.
Define success metrics before they start experimenting. “Make it good” is not a metric. “Increase click-through rate on recommended products from 3% to 5% within 60 days” is a metric. The DS team needs a target to optimise for. Without one, they will optimise for the metric that is easiest to improve — which may not be the one your business cares about.
Set checkpoints, not deadlines. Instead of “deliver a model in 8 weeks,” say “Week 2: share data exploration findings and feasibility assessment. Week 4: share baseline model results. Week 6: share improved model results with error analysis. Week 8: production readiness review.” If the Week 2 checkpoint reveals that the data does not contain enough signal, you have saved six weeks. That is the whole point.
Learn to read a confusion matrix. You do not need to understand backpropagation. But you must understand precision, recall, F1 score, and how they relate to your business problem. If your DS team shows you a confusion matrix and you cannot interpret it, you cannot make product decisions about the model. Spend two hours learning this. It will be the highest-ROI two hours of your ML PM career.
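Those two hours can start here. A minimal sketch of the four cells of a binary confusion matrix and the metrics derived from them; the counts are invented, and chosen so that a healthy-looking accuracy hides a poor recall:

```python
def read_confusion_matrix(tp, fp, fn, tn):
    """Derive the metrics a PM needs from the four cells of a 2x2 matrix."""
    precision = tp / (tp + fp)  # when the model says "positive", how often is it right?
    recall = tp / (tp + fn)     # of the real positives, how many did it catch?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": round(precision, 3), "recall": round(recall, 3),
            "f1": round(f1, 3), "accuracy": round(accuracy, 3)}

# Invented churn model: 100 real churners among 1,000 users, model catches 30.
metrics = read_confusion_matrix(tp=30, fp=10, fn=70, tn=890)
print(metrics)
```

92% accuracy, 30% recall: the model misses seven of every ten churners, and "what is the accuracy?" would never have surfaced that.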
Respect the “I do not know yet.” In engineering, uncertainty is a planning failure. In data science, uncertainty is the normal state. When your DS lead says “I am not sure if this approach will work — I need to run an experiment,” that is not incompetence. That is honesty. The PM who punishes honesty gets optimistic estimates and late surprises. The PM who rewards honesty gets early signals and better decisions.
You are the PM for a large Indian e-commerce company. Your recommendation engine currently has 78% precision — meaning 78% of recommended products are relevant to the user. The VP of Product wants 90% precision before the next board meeting in 4 months. Your DS lead says reaching 90% requires 6 more months of behavioural data collection plus a new real-time feature pipeline. You have a board meeting, a DS team of three, and a VP who does not understand why "AI is so slow."
Monday morning standup. The VP just sent a message: "Where are we on the 90% target? The board expects an update." Your DS lead looks at you. The team has been at 78-80% for three weeks despite multiple experiments. The call: what do you say to the VP, and what do you say to the team?
your path
Pick an ML feature — either one you are building, or choose from this list:
- A spam filter for a messaging app
- A dynamic pricing engine for a ride-hailing service
- A content moderation system for a social platform
- A churn prediction model for a subscription product
Now write requirements using this structure:
- What the model predicts: (one sentence, specific)
- Primary metric and target: (precision, recall, F1, AUC — pick one, set a number)
- Secondary metric and constraint: (the tradeoff metric — what you are willing to sacrifice and by how much)
- Acceptable error rate and error type: (false positives vs false negatives — which is worse for your business and why?)
- Fallback behaviour: (when the model is uncertain or wrong, what happens?)
- Monitoring trigger: (what threshold, if breached, should pause the feature?)
- Baseline comparison: (what is the current state without ML, and what improvement justifies the investment?)
If you cannot fill in every field, you are not ready to brief your DS team. The gaps in your requirements document are the gaps that will become production incidents.
The questions you should be asking in every ML review
Stop asking “what is the accuracy?” Start asking these:
- What does the confusion matrix look like? Where is the model wrong, and what kind of wrong is it?
- What is the performance on edge cases and minority classes? A model that is 95% accurate overall but 40% accurate on your most valuable user segment is a problem.
- How does performance change over time? Is there evidence of model drift?
- What happens when the model encounters data it has never seen before? Does it fail gracefully or catastrophically?
- What is the latency in production? A model that takes 3 seconds to return a recommendation is a model that users will never wait for.
- How often does the model need to be retrained, and what does that pipeline look like?
- What are the biggest risks if we launch this tomorrow?
The DS team will respect you for asking these questions. It shows you understand their world without pretending to be one of them.
You are PM at Stashfin (the credit app). The data science team has built a credit risk model that they say is 94% accurate. You want to launch a new instant loan product using this model. The risk head says 94% accuracy is not enough when 6% false negatives mean bad loans.
The call: Do you launch with the 94% model, delay until the model improves, or launch with additional safeguards?
Where to go next
- Understand the broader AI product landscape: Building AI Features
- Connect ML strategy to product strategy: AI Product Strategy
- Apply experimentation rigour to ML launches: Experimentation
- See how ML collaboration fits into the engineering relationship: Working with Engineering