AI fundamentals for PMs
Strip away the hype and AI is software that learns to make predictions and decisions from data. As a product manager, your job is not to build the model — it is to know what the model can and cannot do, and to build the right product around it.
You do not need a PhD in machine learning to be an AI PM. You also cannot fake it. The PMs who fail at AI products are not the ones who lack technical depth — they are the ones who either treat AI as magic (“just add AI to it”) or treat it as someone else’s problem (“the data science team will figure it out”).
This page gives you the mental model you need. Not to build models. To make product decisions about them.
The AI stack in plain language
Think of AI as a hierarchy. Each layer builds on the one below it:
| Layer | What it is | PM cares about |
|---|---|---|
| Artificial Intelligence | Machines doing things that normally require human intelligence | The product promise to users |
| Machine Learning | Algorithms that learn patterns from data instead of being explicitly programmed | Data quality, training costs, iteration cycles |
| Deep Learning | ML using neural networks with many layers — good at unstructured data (images, text, audio) | Compute costs, latency, accuracy tradeoffs |
| Large Language Models (LLMs) | Deep learning models trained on massive text corpora — GPT-4, Claude, Gemini | Prompt design, context windows, hallucination risk |
| Agents | LLMs that can take actions, use tools, and chain multiple steps together | Reliability, error handling, user trust |
Most AI product conversations in 2026 happen at the LLM and agent layers. But understanding the layers below them tells you why things break and what you can actually control.
Training vs inference: the two phases that matter
Every ML system has two phases. Confusing them is the fastest way to make bad product decisions.
Training is when the model learns. You feed it data — millions of documents, images, transactions — and the model adjusts its internal parameters to find patterns. Training is expensive, slow, and done infrequently. GPT-4 cost over $100 million to train. You will almost certainly never train a foundation model. You might fine-tune one, which is cheaper but still non-trivial.
Inference is when the model answers. A user asks a question, the model processes it, and returns a response. Inference happens every time someone uses your product. It is fast (seconds) but costs money per request — typically $0.001 to $0.05 per call depending on the model and input size.
Why does this matter for PMs?
- Training costs are fixed, inference costs scale with usage. If your product gets popular, your inference bill grows linearly. This changes unit economics in ways traditional SaaS does not face.
- Training data determines what the model knows. If your training data has no examples of Indian GST invoices, the model will hallucinate GST rules. No amount of prompt engineering fixes a data gap.
- You can influence inference (prompts, context, guardrails) much more easily than training. This is where the PM’s influence lives.
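To make the unit-economics point concrete, here is a minimal sketch of how inference spend scales with traffic. All prices, token counts, and request volumes below are illustrative assumptions, not real model pricing:

```python
# Illustrative sketch: inference cost scales linearly with usage.
# Every number below is an assumption for illustration only.

def monthly_inference_cost(daily_requests: int,
                           avg_input_tokens: int,
                           avg_output_tokens: int,
                           price_per_1k_input: float,
                           price_per_1k_output: float) -> float:
    """Estimate monthly inference spend in dollars."""
    cost_per_call = (avg_input_tokens / 1000) * price_per_1k_input \
                  + (avg_output_tokens / 1000) * price_per_1k_output
    return daily_requests * 30 * cost_per_call

# Same assumed prices, 10x the traffic -> 10x the bill:
low = monthly_inference_cost(10_000, 1_500, 300, 0.005, 0.015)
high = monthly_inference_cost(100_000, 1_500, 300, 0.005, 0.015)
print(f"${low:,.0f}/month at 10k req/day, ${high:,.0f}/month at 100k req/day")
```

Run this with your own expected prompt sizes before committing to a pricing model: a feature that is cheap in a demo can be ruinous at scale.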
Sprint planning. The team is discussing a new AI feature for an ed-tech product.
PM: “Can we train the model to understand our specific curriculum?”
ML Engineer: “Fine-tuning would take 3-4 weeks and we need at least 10,000 labeled examples. Do we have that?”
PM: “We have maybe 500 lesson plans. What if we use RAG instead — feed the curriculum docs as context at inference time?”
ML Engineer: “That could work for Q&A. Retrieval is faster to ship and we can iterate on the document corpus without retraining.”
The PM understood the training-inference distinction well enough to propose an alternative. That saved the team a month.
Knowing the difference between training and inference is not academic — it directly determines your build timeline and approach.
Prompts, context windows, and RAG
When you use an LLM, you are not programming it. You are prompting it — giving it instructions in natural language and hoping it follows them. This is fundamentally different from traditional software where code executes deterministically.
Prompts are the instructions you give the model. A good prompt is specific, structured, and includes examples. “Summarize this document” is a weak prompt. “Summarize this document in three bullet points, each under 20 words, focusing on action items for the engineering team” is a strong prompt. Prompt quality is the single highest-impact thing a PM can influence in an LLM product.
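One way to operationalize this is to treat the prompt as a versioned product artifact rather than ad-hoc text. A minimal sketch, with illustrative wording and field names:

```python
# Sketch: a structured, versioned prompt template. The wording and
# fields are illustrative; the point is that prompts deserve the same
# versioning and review discipline as any other spec.

SUMMARY_PROMPT_V2 = (
    "Summarize the document below in exactly three bullet points.\n"
    "Each bullet must be under 20 words and focus on action items "
    "for the engineering team.\n\n"
    "Document:\n{document}"
)

def build_prompt(document: str) -> str:
    """Fill the template with the user's document."""
    return SUMMARY_PROMPT_V2.format(document=document)

prompt = build_prompt("Q3 incident review: API latency doubled after the cache migration.")
print(prompt.splitlines()[0])  # the instruction line the model sees first
```

When the prompt is a named constant, you can diff it, A/B test it, and roll it back, none of which is possible when prompts live inline in application code.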
Context windows are the model’s working memory. GPT-4 can hold about 128,000 tokens (roughly 100,000 words) in a single conversation, and some newer models advertise windows of a million tokens or more. Anything outside the context window does not exist to the model. This is why LLMs “forget” things from earlier in long conversations — the earlier content falls out of the window.
RAG (Retrieval-Augmented Generation) is the pattern that solves the context window limitation. Instead of stuffing everything into the prompt, you:
- Store your documents in a searchable database (vector store)
- When a user asks a question, search for the most relevant documents
- Inject those documents into the prompt as context
- The LLM answers using that context
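The four steps above can be sketched end to end. This toy version uses word-overlap scoring in place of a real vector store and embedding model, so the mechanics fit in a few lines; everything here is illustrative:

```python
# Toy RAG pipeline: word-overlap retrieval stands in for a real
# embedding model + vector store. Illustrative only.

def score(query: str, doc: str) -> int:
    """Crude relevance signal: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 2: find the most relevant documents."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Step 3: inject retrieved documents into the prompt as context."""
    context = "\n---\n".join(retrieve(query, docs))
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Refunds are processed within 5 business days.",
    "GST invoices must include the supplier's GSTIN.",
    "Our office is closed on public holidays.",
]
prompt = build_prompt("What must a GST invoice include?", corpus)
# Step 4: send `prompt` to the LLM of your choice.
```

Notice that answer quality is decided before the model ever runs: if `retrieve` surfaces the wrong documents, the best LLM in the world answers from the wrong context.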
RAG is how most enterprise AI products work today. It is cheaper than fine-tuning, faster to iterate, and lets you control what the model knows by controlling what documents you feed it.
The PM implication: If your AI product gives wrong answers, the first question is not “is the model bad?” It is “are we retrieving the right context?” Garbage in, garbage out applies to RAG pipelines just as much as it applies to ML training data.
Hallucinations and why they are a product problem
LLMs hallucinate. They generate confident, plausible text that is factually wrong. This is not a bug that will be fixed in the next model version — it is an inherent property of how these models work. They predict the next likely token based on patterns, not based on truth.
As a PM, you need to design around hallucinations, not wish them away:
1. Never let the model be the sole authority. If your product shows AI-generated medical advice, legal guidance, or financial calculations without a verification layer, you have a liability, not a feature.
2. Give the model ground truth. RAG, function calling, and database lookups reduce hallucinations by giving the model facts to reference instead of forcing it to recall from training data.
3. Design the UI to signal uncertainty. Show sources. Show confidence. Let users verify. “Based on your pricing document (uploaded Jan 2026)” is better than presenting information as if the AI simply knows it.
4. Measure hallucination rate as a product metric. Sample AI outputs weekly. Have a human check them against ground truth. Track the percentage that contain factual errors. If you are not measuring it, you are not managing it.
In India specifically, hallucination risk compounds in vernacular contexts. Models trained primarily on English text will hallucinate more when handling Hindi, Tamil, or Kannada content — whether that is customer support, document processing, or regional regulatory compliance. If your product serves Indian users, test in the languages they actually use, not just English.
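Measuring hallucination rate (point 4 above) can be as lightweight as a weekly review spreadsheet, but it helps to be precise about the arithmetic. A minimal sketch, with illustrative sample data in place of your real labeling tool:

```python
# Sketch: weekly hallucination rate from human-reviewed samples.
# `reviews` would come from your labeling tool; this data is illustrative.

def hallucination_rate(reviews: list[dict]) -> float:
    """Fraction of sampled outputs containing at least one factual error."""
    if not reviews:
        raise ValueError("Need at least one reviewed sample")
    flagged = sum(1 for r in reviews if r["has_factual_error"])
    return flagged / len(reviews)

reviews = [
    {"output_id": 1, "has_factual_error": False},
    {"output_id": 2, "has_factual_error": True},
    {"output_id": 3, "has_factual_error": False},
    {"output_id": 4, "has_factual_error": False},
]
print(f"Hallucination rate: {hallucination_rate(reviews):.0%}")  # 25%
```

Track this number week over week, segmented by language if you serve vernacular users, and treat a regression the way you would treat a conversion-rate drop.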
Agents: the next layer
An LLM that just answers questions is a chatbot. An LLM that can take actions — search the web, call APIs, update databases, send emails — is an agent.
Agents are where AI products get genuinely powerful and genuinely dangerous. A recommendation chatbot that gives a bad suggestion wastes the user’s time. An agent that executes a bad suggestion — cancels the wrong order, sends the wrong email, updates the wrong record — causes real damage.
The PM’s job with agents is to design the control surface:
- What actions can the agent take autonomously? Low-risk, reversible actions (searching, drafting, summarizing) are safe to automate. High-risk, irreversible actions (sending money, deleting data, publishing content) need human approval.
- What happens when the agent is wrong? Every agent action needs an undo path or a confirmation step for high-stakes operations.
- How do you debug failures? Agent chains can be five or ten steps long. When the final output is wrong, you need to trace which step failed. Build observability from day one, not after the first production incident.
Building agent-based products requires a fundamentally different reliability mindset than traditional software. Traditional software fails predictably — the same input produces the same error. Agents fail probabilistically — the same input might work nine times and fail on the tenth. Your QA process, your error handling, and your user expectations all need to account for this.
Think about an AI feature you are building or considering. For each action the AI can take, classify it:
- Green — automate fully: Low risk, reversible, low cost of error. (Example: drafting a summary, suggesting search results)
- Yellow — automate with guardrails: Medium risk, needs validation or rate limiting. (Example: sending a notification, updating a user profile)
- Red — human in the loop required: High risk, irreversible, or high cost of error. (Example: processing a refund, publishing content, deleting records)
If you have more red actions than green, your product is not ready for full automation. Start with the green actions, prove reliability there, then graduate to yellow.
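The same traffic-light logic can live in code as a policy gate in front of every agent action. A minimal sketch, with illustrative action names and tiers:

```python
# Sketch: a policy gate in front of agent actions. The action names
# and tier assignments are illustrative; the pattern is the point.

GREEN = {"draft_summary", "suggest_results"}        # automate fully
YELLOW = {"send_notification", "update_profile"}    # automate with guardrails
RED = {"process_refund", "publish_content", "delete_record"}  # human approval

def authorize(action: str, human_approved: bool = False) -> bool:
    """Return True if the agent may execute this action right now."""
    if action in GREEN:
        return True
    if action in YELLOW:
        return True  # in practice: also validate inputs and rate-limit
    if action in RED:
        return human_approved  # never autonomous
    return False  # unknown actions are denied by default

assert authorize("draft_summary")
assert not authorize("process_refund")
assert authorize("process_refund", human_approved=True)
```

The deny-by-default branch at the end matters most: as the agent gains new tools, each one should be explicitly classified before it can run at all.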
What PMs get wrong about AI products
Mistake 1: Treating AI as a feature, not an architecture decision.
Adding “AI-powered” to your product is not like adding dark mode. It changes your cost structure (inference costs), your reliability model (probabilistic failures), your data requirements (you need training and evaluation data), and your testing process (you cannot write deterministic unit tests for model outputs). If you treat it as a feature toggle, you will be surprised by all of these downstream effects.
Mistake 2: Launching without an evaluation framework.
In traditional products, you ship and measure conversion rates. In AI products, you also need to measure output quality — and quality is harder to define. Is a summary “good”? Is a recommendation “relevant”? You need rubrics, evaluation datasets, and regular human review. Building this before launch is not optional.
Mistake 3: Ignoring data pipeline quality.
The sexiest part of AI is the model. The part that actually determines success is the data pipeline — how you collect, clean, label, store, and retrieve data. In my experience building AI products, 70% of quality issues trace back to data problems, not model problems. The PM who obsesses over data quality will outperform the PM who obsesses over model selection every single time.
Mistake 4: Using AI where a rule works.
If your business logic is “when order value exceeds 10,000 rupees, apply a 5% discount” — that is an if-statement, not an AI use case. Use AI for genuinely unstructured problems: understanding natural language, classifying images, generating content, predicting behavior from sparse signals. Using AI where deterministic logic works is more expensive, less reliable, and harder to debug.
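The discount rule in that example really is just a conditional: deterministic, testable, and free to run. A sketch of the same logic as plain code:

```python
# The discount rule from the example as ordinary business logic.
# No model, no inference cost, no probabilistic failures.

def discount(order_value_rupees: float) -> float:
    """Apply a 5% discount when order value exceeds 10,000 rupees."""
    if order_value_rupees > 10_000:
        return round(order_value_rupees * 0.05, 2)
    return 0.0

assert discount(12_000) == 600.0
assert discount(9_000) == 0.0
```

A useful litmus test: if you can write the acceptance criteria as exact assertions like these, you want deterministic code, not a model.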
The vocabulary you need in the room
You do not need to explain backpropagation. You do need to use these terms correctly when talking to your ML team:
| Term | What it means | When you will use it |
|---|---|---|
| Fine-tuning | Retraining a pre-trained model on your specific data | When off-the-shelf models do not understand your domain |
| Embedding | A numerical representation of text/images that captures meaning | When discussing search, recommendations, or RAG |
| Vector store | A database optimized for storing and searching embeddings | When designing RAG or semantic search features |
| Token | The unit LLMs process — roughly 3/4 of a word | When estimating costs and context window limits |
| Temperature | Controls randomness in model output (0 = deterministic, 1 = creative) | When tuning output style for your use case |
| Guardrails | Rules that constrain model behavior — content filters, format enforcement, topic boundaries | When designing safety and reliability features |
| Latency | Time between request and response | When users complain the AI is “slow” |
| Function calling | LLM ability to invoke external tools and APIs | When building agent features |
Test yourself
You are PM at a fintech startup in Bangalore. Your product helps small businesses manage invoices and GST compliance. The CEO wants to add an AI assistant that answers tax questions from users. You have a 3-person engineering team and a launch target of 8 weeks.
The CEO has seen competitors announce AI features and wants to move fast. Your ML engineer says she can integrate an LLM API in a week. How do you approach this?
Your path
You are PM at a 60-person Series B HRtech startup in Bengaluru (Darwinbox competitor, 200 SME clients). Your team wants to add an AI-powered attendance anomaly detection feature — flagging unusual patterns like a warehouse employee clocking in from a location 50 km away. Your engineering lead recommends GPT-4o for this because it handles reasoning well and context is complex. Your data team proposes Claude Haiku instead, at roughly 15x lower inference cost. The feature will process 50,000 attendance records daily at launch, scaling to 500,000 within 12 months.
The call: Which model do you pick, and how do you make the call without waiting three months for a benchmark study?
Where to go next
- Apply AI thinking to product strategy: AI for Product Strategy
- Build AI features the right way: Building AI Features
- Use AI tools in your own PM work: AI Tools for PMs
- Understand the ethical dimensions: AI Ethics for PMs
- Back to product fundamentals: Product Thinking