
building ai features

The gap between an AI demo and an AI product is the same gap between a prototype and a business. One works in controlled conditions. The other has to survive users.
Talvinder Singh, from a Pragmatic Leaders AI cohort

Across our last six AI-focused cohorts at Pragmatic Leaders, every PM has asked the same question within the first week: “Should we add AI to this?” The answer is almost always “it depends” — and what it depends on is whether you understand the building blocks well enough to make that call.

This page gives you those building blocks. Not the theory — the practical, decision-level understanding of prompts, RAG, agents, and fine-tuning that lets you evaluate engineering proposals, challenge vendor pitches, and spec AI features that actually work in production.

The four building blocks

Think of AI features as sitting on a spectrum of complexity and control:

| Building block | What it does | PM analogy | When to reach for it |
| --- | --- | --- | --- |
| Prompts | Instructs a foundation model to perform a task | Writing a brief for a freelancer | Single-turn tasks with clear inputs and outputs |
| RAG | Retrieves your data and feeds it to the model | Giving the freelancer a reference folder | Tasks requiring your proprietary data or current information |
| Agents | Chains multiple steps, tools, and decisions together | Hiring a contractor who manages their own workflow | Multi-step tasks where the path depends on intermediate results |
| Fine-tuning | Trains the model on your specific data | Training a full-time employee on your domain | Tasks requiring consistent style, domain language, or specialized behavior |

Most teams jump straight to agents because the demos look impressive. This is the AI equivalent of the solution trap. Start with prompts. Graduate to RAG when you need your own data. Consider agents when the task genuinely requires multi-step reasoning. Fine-tune only when the other three cannot get you to production quality.

Prompts: the foundation you cannot skip

A prompt is not “talking to AI.” A prompt is a specification. It defines the task, the constraints, the format, and the quality bar — just like a PRD defines a feature.

The difference between a prompt that works in a demo and one that works in production comes down to three things:

1. Specificity. “Summarize this document” fails at scale. “Extract the three most actionable recommendations from this document, each in one sentence, with the page number where the recommendation appears” succeeds. The more specific your instruction, the less variance in the output.

2. Structure. Give the model a format to follow. JSON schemas, markdown templates, numbered steps. When the output is structured, you can validate it programmatically. When it is freeform, you are shipping hope.

3. Constraints. Tell the model what NOT to do. “Do not invent information not present in the source text.” “Do not exceed 200 words.” “If the answer is not in the provided context, say ‘I don’t have enough information.’” Constraints are the guardrails that prevent your AI feature from hallucinating its way into a support ticket.
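The three properties above can be made concrete in a prompt template. This is a minimal sketch, not a prescribed format — the task, field names, and wording are illustrative:

```python
# A hypothetical production-style prompt template showing specificity
# (exact task and count), structure (a JSON schema to validate against),
# and constraints (explicit "do not" rules and a fallback behavior).

PROMPT_TEMPLATE = """You are extracting recommendations from a document.

Task: Extract the three most actionable recommendations.
Format: Return JSON with a "recommendations" list. Each item must have
  "text" (one sentence) and "page" (integer page number).
Constraints:
- Do not invent information not present in the source text.
- Do not exceed 200 words in total.
- If fewer than three recommendations exist, return only those found.
- If the answer is not in the provided document, return an empty list.

Document:
{document}
"""

def build_prompt(document: str) -> str:
    """Render the template with the source document filled in."""
    return PROMPT_TEMPLATE.format(document=document)
```

Notice that every line of the template is checkable: a reviewer (or a test) can verify the output against the format and constraints, which is exactly what “be helpful and accurate” does not allow.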

// scene:

Sprint review. The team is demoing an AI-powered customer support feature.

Engineer: “Here's the support bot. Ask it anything about our product.”

PM: “What's our refund policy for enterprise customers?”

Engineer: “See — it gave a perfect answer. Pulled it right from the docs.”

PM: “Now ask it about a competitor's pricing.”

Engineer: “...it made up competitor pricing that looks plausible but is completely wrong.”

PM: “That is a hallucination. In production, a customer sees that and we have a credibility problem. We need a constraint: if the question is outside our docs, it says 'I can only answer questions about our product.'”

The demo worked. The product did not. The difference was one missing constraint in the prompt.

// tension:

AI demos optimize for the happy path. Production requires you to design for every path.

As a PM, you do not need to write production prompts. But you need to review them the way you review copy — checking for specificity, structure, and constraints. If your engineering team shows you a prompt that says “be helpful and accurate,” push back. That is not a spec. That is a wish.

RAG: when the model needs your data

Foundation models know the internet. They do not know your product documentation, your internal policies, your customer data, or anything that happened after their training cutoff. RAG — Retrieval-Augmented Generation — solves this by fetching relevant information from your data and injecting it into the prompt before the model generates a response.

The architecture is straightforward:

  1. Index your data. Break documents into chunks, convert them to vector embeddings, store them in a vector database.
  2. Retrieve on query. When a user asks a question, convert their query to an embedding and find the most similar chunks.
  3. Generate with context. Feed the retrieved chunks to the model as context, along with the user’s question.
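The three steps above can be sketched end to end. Real systems use learned vector embeddings and a vector database; in this toy sketch, plain word-overlap scoring stands in for similarity search so the example stays self-contained, and all function names are illustrative:

```python
# Toy RAG pipeline: index -> retrieve -> generate-with-context.

def index(documents: list[str], chunk_size: int = 50) -> list[str]:
    """Step 1: break documents into fixed-size word chunks.
    (Production: embed each chunk and store it in a vector DB.)"""
    chunks = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Step 2: score chunks against the query and keep the best few.
    Word overlap is a stand-in for embedding similarity."""
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Step 3: inject the retrieved chunks as context for the model."""
    context = "\n---\n".join(retrieve(query, chunks))
    return (f"Answer using ONLY the context below. If the answer is not "
            f"in the context, say you don't have enough information.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

Note that `top_k` is small by default: as discussed next, retrieval precision matters more than volume.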

What PMs get wrong about RAG:

“More data is better.” It is not. Retrieving twenty irrelevant chunks is worse than retrieving three relevant ones. The model gets confused by noise. Your engineering team should be optimizing for retrieval precision, not recall.

“It eliminates hallucination.” It reduces hallucination. It does not eliminate it. The model can still hallucinate within the context it is given — misinterpreting a passage, combining information from two chunks incorrectly, or filling gaps with plausible-sounding fabrication. You still need output validation.

“It is a one-time setup.” Your data changes. Your documentation gets updated. Customer policies evolve. If your RAG pipeline does not re-index regularly, your AI feature is answering questions with stale information. Treat the index like a cache — it needs a refresh strategy.
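Treating the index like a cache can be as simple as tracking its age and re-indexing on every docs deploy. This is a minimal sketch; the threshold and the deploy hook are illustrative assumptions, not a specific product's pipeline:

```python
# Freshness guard for a RAG index: track when it was last built and
# re-index unconditionally whenever the docs change.
import time

class IndexFreshness:
    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self.last_indexed = None

    def mark_indexed(self):
        self.last_indexed = time.time()

    def is_stale(self) -> bool:
        """True if the index has never been built or is past its max age."""
        if self.last_indexed is None:
            return True
        return time.time() - self.last_indexed > self.max_age_seconds

def on_docs_deploy(freshness: IndexFreshness, reindex):
    """Hook to run on every docs deploy: re-index so the bot never
    answers from a knowledge base older than the latest docs."""
    reindex()
    freshness.mark_indexed()
```

The `is_stale` check is also a cheap monitoring signal: alert when the index age exceeds your stated freshness requirement instead of discovering it through a support escalation.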

// thread: #product-ai — Three weeks after launching a RAG-powered help center
Support Lead Customers are getting wrong answers about our new pricing tiers. The AI is quoting the old pricing page.
PM When was the knowledge base last re-indexed?
Engineer ...at launch. Three weeks ago.
PM We changed pricing two weeks ago. The AI doesn't know. We need automated re-indexing on every docs deploy, not a manual process.
Support Lead So for the last two weeks, the bot has been confidently wrong?

The PM question for RAG is not “should we use it?” — it is “what is our data freshness requirement, and can we meet it?” If your data changes weekly and your users expect real-time accuracy, you have an engineering problem to solve before you ship.

Agents: when one model call is not enough

An agent is a system that uses a foundation model to decide what to do next, takes an action, observes the result, and repeats until the task is complete. Instead of one prompt-and-response cycle, you get a loop.

The appeal is obvious. An agent can research a topic across multiple sources, draft a report, check it against guidelines, revise it, and format it for publication — all without human intervention at each step.

The risk is equally obvious. Every step in the loop is a place where the model can go wrong. And unlike a single prompt where a bad output is one bad output, an agent’s bad decision at step 3 compounds through steps 4, 5, and 6.

When agents make sense:

  • The task has multiple steps that depend on each other
  • The steps require different tools (search, calculation, API calls)
  • A human doing the task would need to make judgment calls along the way

When agents are overkill, or too risky:

  • The task can be accomplished with a single well-crafted prompt
  • The steps are predictable and do not branch based on intermediate results
  • The cost of a wrong intermediate step is high (financial transactions, medical advice, legal documents)

The PM’s job with agents is to define the guardrails, not the implementation. What tools can the agent access? What actions require human approval? What is the maximum number of steps before it should stop and ask for help? What happens when it gets stuck?
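Those guardrails can be specced concretely. This is a sketch of the control loop only — the tool names, step limit, and `decide` interface are hypothetical, and the model call behind `decide` is stubbed out:

```python
# Agent loop with PM-specced guardrails: an allowed tool list, actions
# gated behind human approval, and a hard step limit with escalation.

ALLOWED_TOOLS = {"search", "calculate", "fetch_api"}
NEEDS_APPROVAL = {"send_email", "issue_refund"}
MAX_STEPS = 8

def run_agent(decide, execute, request_approval):
    """decide(history) -> (tool, args), or ("done", result) to finish.
    execute(tool, args) runs a tool; request_approval asks a human."""
    history = []
    for _ in range(MAX_STEPS):
        tool, args = decide(history)
        if tool == "done":
            return args
        if tool not in ALLOWED_TOOLS | NEEDS_APPROVAL:
            raise ValueError(f"Agent requested unknown tool: {tool}")
        if tool in NEEDS_APPROVAL and not request_approval(tool, args):
            history.append((tool, "denied by human"))
            continue
        history.append((tool, execute(tool, args)))
    # Guardrail: stop and escalate instead of looping forever.
    return "escalate: step limit reached, handing off to a human"
```

Every guardrail in this sketch maps to a PM question from the paragraph above: the tool sets answer “what can it access,” the approval gate answers “what needs a human,” and `MAX_STEPS` answers “when does it stop and ask for help.”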

In India, I have seen teams build agent systems for tasks that a good prompt template and a database query could handle. The agent architecture added latency, cost, and failure modes — all for a feature that did not need multi-step reasoning. Always ask: does this task actually require an agent, or does it just look cool as one?

// exercise: · 10 min
Agent or not?

For each scenario below, decide whether you would use a simple prompt, RAG, or an agent. Write one sentence justifying your choice.

  1. Summarizing a customer call transcript into action items. The transcript is provided in full.
  2. Answering employee questions about the company leave policy. The policy document is 40 pages.
  3. Generating a weekly competitive intelligence report. Requires searching news, company blogs, and social media, then synthesizing findings.
  4. Translating product UI strings from English to Hindi. The strings are provided in a spreadsheet.
  5. Debugging a failed payment by checking three internal systems and suggesting a fix.

No framework will give you the “right” answer. The exercise is in your reasoning — what made you choose one approach over another?

Fine-tuning: the last resort that is sometimes the first choice

Fine-tuning means training a foundation model on your specific data to change its behavior. Unlike RAG, which gives the model information at query time, fine-tuning bakes the information into the model’s weights.

Fine-tune when:

  • You need a consistent voice or style that prompts cannot reliably reproduce
  • The task requires domain-specific reasoning (medical coding, legal clause identification, financial categorization)
  • You are making thousands of similar API calls and want to reduce cost by using a smaller, specialized model
  • Latency matters and you cannot afford the extra retrieval step of RAG

Do not fine-tune when:

  • Your data changes frequently (the fine-tuned model is frozen at training time)
  • You need source attribution (fine-tuned models cannot tell you where their knowledge came from)
  • A good prompt with examples achieves the same quality (test this first — always)
  • You do not have at least a few hundred high-quality training examples

The cost of fine-tuning is not just compute. It is maintenance. Every time your domain knowledge changes, you need to re-fine-tune. Every time the base model gets updated, you need to evaluate whether your fine-tune still works. You are taking on a training pipeline as a permanent operational cost.

For most product teams in India building their first AI features, fine-tuning is premature. Start with prompts. Add RAG. Consider agents. Fine-tune only when you have exhausted the other options and have the data quality to justify it.
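“Test a good prompt with examples first” usually means few-shot prompting: put a handful of worked examples in the prompt and see whether the model reproduces the style or categorization you were about to fine-tune for. A hypothetical sketch, using the expense-categorization task from the exercise later on this page (the categories and examples are invented):

```python
# Few-shot prompt: worked examples stand in for fine-tuning when the
# behavior you need is a consistent mapping, not new domain knowledge.

FEW_SHOT = """Categorize each expense for GST filing.

Example: "AWS invoice Rs 42,000" -> Category: Cloud services
Example: "Ola ride to client office Rs 350" -> Category: Travel
Example: "Zomato team lunch Rs 2,100" -> Category: Meals & entertainment

Expense: "{expense}" -> Category:"""

def few_shot_prompt(expense: str) -> str:
    """Render the few-shot template for one expense line item."""
    return FEW_SHOT.format(expense=expense)
```

If a prompt like this hits production quality on your evaluation set, you have avoided the training pipeline, the re-training cadence, and the frozen-knowledge problem entirely.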

The reliability gap

Here is the thing nobody tells you in vendor demos: AI features fail differently from traditional software. A REST API either returns the right data or throws an error. An AI feature can return confidently wrong data with no error code.

This means your quality strategy for AI features must include:

Output validation. Check the model’s output programmatically before showing it to users. Does the JSON parse? Are the required fields present? Is the response within expected length bounds? Does it contain any of your banned phrases?

Confidence thresholds. If the model’s confidence is below a threshold, route to a human instead of showing the output. The threshold is a product decision, not an engineering one. You decide the acceptable failure rate.

Feedback loops. Give users a way to flag bad outputs. Thumbs up/down, report buttons, correction flows. This data is how you improve the system over time — and how you detect regressions.

Evaluation sets. Maintain a set of test cases with known-good outputs. Run your AI feature against these every time you change a prompt, update the RAG index, or modify the agent flow. This is your regression suite. Without it, you are shipping blind.
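Output validation in particular is cheap to implement. A minimal sketch, assuming the model is asked to return JSON — the field names, length bound, and banned phrases are illustrative placeholders for whatever your spec requires:

```python
# Programmatic output validation: run before showing anything to a user.
# On failure, route to a human instead of shipping the output.
import json

REQUIRED_FIELDS = {"summary", "suggested_response"}
MAX_CHARS = 1200
BANNED_PHRASES = ["as an ai language model", "i cannot"]

def validate_output(raw: str):
    """Return (ok, reason) for a raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if len(raw) > MAX_CHARS:
        return False, "output exceeds length bound"
    text = " ".join(str(v) for v in data.values()).lower()
    if any(p in text for p in BANNED_PHRASES):
        return False, "output contains a banned phrase"
    return True, "ok"
```

The `reason` strings double as metrics: counting failures by reason tells you whether your prompt, your retrieval, or your model is the thing regressing.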

// learn the judgment

Your team has shipped an AI feature that summarizes customer support tickets and suggests responses for support agents. After 3 weeks in production, you see: CSAT is unchanged, average handle time dropped 18%, but three customers escalated after the AI suggested incorrect refund policies. Engineering says they can fix the specific refund cases with better RAG chunking. The support lead wants to roll back the feature.

The call: Do you roll back? What do you fix first?


Test yourself

// interactive:
The AI Feature Spec

You are the PM for a fintech app serving small businesses in India. The CEO wants to add an AI feature that reads bank statements (PDF uploads) and automatically categorizes expenses for GST filing. Your engineering team has proposed three approaches.

The engineering lead presents three options in the sprint planning meeting. You need to choose an approach and justify it.

Where to go next

  • Understand the strategic context: AI Product Strategy — when to build AI features vs. when to wait
  • Get the fundamentals right: AI Fundamentals — what PMs need to know about how models work
  • Navigate the ethics: AI Ethics for PMs — bias, fairness, and responsible AI development
  • Use AI in your own workflow: AI Tools for PMs — practical tools that make you faster today