The Gap Between Demos and Production
Prompt engineering advice online is full of clever tricks that work beautifully in a notebook and fall apart the moment real users start throwing unexpected inputs at them. After shipping a dozen LLM-powered features, here are the patterns I actually rely on.
1. Role + Task + Constraints + Format (RTCF)
The most reliable system prompt structure I've found:
You are [role].
Your task is [task].
Constraints:
- ·[constraint 1]
- ·[constraint 2]
Output format: [format specification]
This isn't glamorous, but it's predictable. Models internalize the four components independently, which means you can debug each one when output breaks.
2. Chain-of-Thought as Quality Control, Not Just Reasoning
Adding "Think step by step" doesn't just improve accuracy — it gives you an audit trail. In production, I often log the model's reasoning separately from its final output. When a user reports a bad answer, the chain of thought usually makes the failure mode obvious immediately.
3. Structured Output via Schema Specification
Instead of asking for JSON and hoping, I specify the schema explicitly:
Return a JSON object with exactly these fields:
- ·"summary": string (1-2 sentences)
- ·"confidence": number (0.0 to 1.0)
- ·"flags": string[] (empty array if none)
Pair this with schema validation (Zod works well) and retry logic. Expect a 2–5% malformed output rate and handle it gracefully.
4. Explicit Uncertainty Handling
Tell the model what to do when it doesn't know:
If you don't have enough information to answer confidently,
say exactly: "I don't have enough information to answer this."
Do not guess.
Without this, models hallucinate. With it, they surface uncertainty in a way your UI can handle.
5. Few-Shot Examples Are Underused
Most developers jump to fine-tuning when behavior isn't quite right. Before that, try 2–4 high-quality few-shot examples in your system prompt. This works surprisingly well for output format, tone, and edge case handling.
6. The One-Paragraph Sanity Check
Before deploying a prompt, I write one paragraph that describes the worst-case user behavior: what's the most adversarial, off-topic, or malformed input someone might send? Then I test against that. Prompts that look good in happy-path testing often fail badly here.
What Doesn't Work
- ·Prompt stuffing. Dumping everything you can think of into a system prompt. The model starts ignoring parts of it. Keep system prompts focused.
- ·Relying on implicit behavior. If you want the model to do something specific, say it explicitly. Every time.
- ·Prompt-as-magic. Prompts are code. Version them, test them, review them when you update the model.
Closing Thought
The best prompts I've written are boring. They're clear, explicit, and handle failure cases. The fancy ones are usually compensating for a design flaw somewhere else in the system.