
Prompt Engineering Patterns That Survived Six Months of Prod

The five prompting techniques that have actually held up across model upgrades, the four that I tried and dropped, and the eval discipline that lets me tell which is which.



By @ethandubois

March 10, 2026


Updated May 18, 2026


Prompt engineering goes through fashion cycles faster than almost any other corner of software engineering. The techniques that go viral on Twitter are usually demonstrated on a single example, the technique that wins this month often loses next month, and the model providers ship updates that quietly invalidate clever workarounds you spent a week building. After six months of running a few LLM-backed features in production and tracking what survives across model upgrades and what does not, I have a much shorter list of techniques I trust than the conventional discourse implies.

This article covers the five patterns that have held up, the four I tried and stopped using, and the eval discipline that lets me tell the difference. I will hedge on specific model names because the ground keeps shifting; the patterns themselves have been stable.

Why most prompt-engineering blog posts age badly

The typical prompt-engineering blog post has this shape: "I tried X technique, it worked great on this example, here is the prompt, you should use it too." Six months later, the model has been updated, the technique no longer makes a measurable difference (because the new model already does that thing internally), and the example turns out to have been a single anecdote that may or may not have generalized in the first place.

The two filters I now apply before adopting any prompting technique:

  1. Does it improve a measurement on a representative eval set, or just a hand-picked example? If a technique only shows up on cherry-picked cases, it does not survive contact with real traffic.
  2. Does it still work on the latest model? Some "clever" prompt patterns were workarounds for limitations that no longer exist. Chain-of-thought used to need explicit instructions on smaller models; on frontier models, much of that reasoning is built in.

What survives those filters is shorter than the typical "top 20 prompting techniques" article. Five patterns, in order of how much value they have given me.

Pattern 1: structured output, with schema

The single highest-value technique I have shipped. Tell the model exactly what shape the output should be, and (on supported providers) constrain it to that shape at decode time.

// asking for unstructured prose, then parsing it, is fragile
const messy = await llm.complete(`Extract the user's intent from: ${input}.
Reply with the intent.`);
// 80% of the time you get "The user's intent is X". 20% you get a polite preamble.

// asking for a schema-constrained response is robust
const schema = {
  type: 'object',
  properties: {
    intent: { type: 'string', enum: ['cancel', 'refund', 'change_address', 'other'] },
    confidence: { type: 'number', minimum: 0, maximum: 1 },
    rationale: { type: 'string' },
  },
  required: ['intent', 'confidence'],
};

const clean = await llm.complete({
  prompt: `Extract the user's intent from: ${input}`,
  responseFormat: { type: 'json_schema', schema },
});
// With schema-constrained decoding, the response is JSON that matches the schema. The provider enforces it.

Why this beats free-form prose by a wide margin: every parsing failure I had in earlier systems was the model adding a polite "Here is the result:" before the JSON, or appending "Let me know if you need anything else." after. Schema-constrained generation eliminates this entirely. Production code stops needing brittle string-cleanup logic, and the eval set's accuracy goes up because the parse step never silently drops a malformed response.

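This pattern appears under different names across providers (JSON mode, structured output, response schema, function calling), and the guarantees differ. Schema-constrained modes restrict the decode step to output that matches the schema; plain JSON mode typically guarantees only syntactically valid JSON, not conformance to your schema. Use the strictest mode your provider offers, and check its documentation for which guarantee you are getting.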
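Because the guarantees vary by provider and mode, a cheap defensive check on the response is still worth having. The following is a minimal sketch, not a full JSON Schema validator: `parseIntentResponse` and its error codes are hypothetical names, and the checks are hand-written to mirror the intent schema shown earlier in this section.

```javascript
// Hypothetical helper: validates a raw model response against the
// intent schema above. Hand-rolled checks, not a JSON Schema library.
const ALLOWED_INTENTS = ['cancel', 'refund', 'change_address', 'other'];

function parseIntentResponse(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    // Covers polite preambles, trailing chatter, and truncated output.
    return { ok: false, error: 'invalid_json' };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ok: false, error: 'not_an_object' };
  }
  if (!ALLOWED_INTENTS.includes(data.intent)) {
    return { ok: false, error: 'bad_intent' };
  }
  const c = data.confidence;
  if (typeof c !== 'number' || c < 0 || c > 1) {
    return { ok: false, error: 'bad_confidence' };
  }
  // rationale is optional in the schema, but must be a string if present.
  if (data.rationale !== undefined && typeof data.rationale !== 'string') {
    return { ok: false, error: 'bad_rationale' };
  }
  return { ok: true, value: data };
}
```

Returning an error code instead of throwing makes it easy to count each failure type in the eval run, so a malformed response shows up as a number rather than disappearing in a catch block.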

Pattern 2: few-shot examples, but only the hard cases

Few-shot prompting (showing the model a few input-output pairs as examples before asking the real question) helps consistently, but only when the examples are well-chosen.

The rule that has worked: include 3 to 5 examples, all from the difficult end of the input distribution. The model already handles the easy cases. The examples should demonstrate edge behavior:

  • Inputs the model would otherwise misclassify.
  • Inputs where the format of the output matters and is non-obvious.
  • Inputs that demonstrate the refusal behavior you want.

A worked example for a customer-support intent classifier:

Instead of three easy examples, three deliberately tricky ones
  Example 1 (boundary case)
    Input:  "I want to cancel my subscription, but only after my next payment."
    Output: { intent: "change_billing", confidence: 0.7,
              rationale: "User wants conditional cancellation, not immediate." }

  Example 2 (mixed intent, ambiguous)
    Input:  "Refund my last charge and update my card."
    Output: { intent: "refund", confidence: 0.6,
              rationale: "Two intents present; refund mentioned first and is
                          higher impact, but a follow-up should ask about the card." }

  Example 3 (refusal-style)
    Input:  "What is the weather in Paris?"
    Output: { intent: "other", confidence: 0.95,
              rationale: "Out-of-scope for billing support." }

Three well-chosen tricky examples beat ten easy ones. The model learns the boundary, not the center.

The failure mode to watch: examples that bias the output. If all your examples have confidence: 0.8, the model will tend to output 0.8 even when it is more or less certain. Vary the rationale and confidence in the examples to avoid pattern-locking.
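One way to keep the examples in code and guard against pattern-locking is sketched below. The example data is copied from the worked example above; the prompt layout and the helper names `buildFewShotPrompt` and `distinctConfidences` are assumptions for illustration, not a provider API.

```javascript
// Few-shot examples copied from the worked example above.
const examples = [
  {
    input: 'I want to cancel my subscription, but only after my next payment.',
    output: { intent: 'change_billing', confidence: 0.7,
              rationale: 'User wants conditional cancellation, not immediate.' },
  },
  {
    input: 'Refund my last charge and update my card.',
    output: { intent: 'refund', confidence: 0.6,
              rationale: 'Two intents present; refund mentioned first and is higher impact, but a follow-up should ask about the card.' },
  },
  {
    input: 'What is the weather in Paris?',
    output: { intent: 'other', confidence: 0.95,
              rationale: 'Out-of-scope for billing support.' },
  },
];

// Hypothetical helper: renders the examples, then the real input,
// leaving the final "Output:" open for the model to complete.
function buildFewShotPrompt(shots, input) {
  const rendered = shots.map((ex, i) =>
    `Example ${i + 1}\nInput: ${ex.input}\nOutput: ${JSON.stringify(ex.output)}`);
  return `${rendered.join('\n\n')}\n\nInput: ${input}\nOutput:`;
}

// Hypothetical guard: counts how many different confidence values the
// examples use. A result of 1 means every example shares one value.
function distinctConfidences(shots) {
  return new Set(shots.map((ex) => ex.output.confidence)).size;
}
```

A check such as `distinctConfidences(examples) > 1` can run in the same test suite as the eval, so a later edit that flattens every example to the same confidence is caught before it ships.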

Pattern 3: explicit refusal handling

Production systems need a way for the model to say "I do not have enough information" without inventing an answer. The pattern that works: explicit instructions and an example of refusal in the prompt.

System-prompt clause that has held up across model versions
  If the provided context does not contain the answer, respond exactly with:
  "I do not have enough information to answer that."
  Do not guess. Do not use your general knowledge. Use only the context above.

The specific phrasing matters more than I expected. "Try to use the context" did not work; "Use only the context" worked; "Use ONLY the context" with the emphasis worked slightly better. "Do not guess" is a redundant-feeling sentence that empirically reduces hallucination rates by a noticeable margin in evals.

The single biggest jump in faithfulness in any RAG system I have shipped came from adding this clause and a refusal example. When retrieval failed, the model went from inventing answers to refusing in roughly 90% of those cases. Refusing is the right behavior; the alternative is a confidently wrong answer.
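Because the clause asks for an exact sentence, the refusal rate is easy to measure. A minimal sketch, assuming the outputs come from eval cases where the context deliberately lacks the answer; `isRefusal` and `refusalRate` are hypothetical helper names.

```javascript
// The exact refusal sentence from the system-prompt clause above.
const REFUSAL = 'I do not have enough information to answer that.';

// Exact match after trimming whitespace. A paraphrased refusal counts
// as a miss here, which is deliberate: the prompt asked for this sentence.
function isRefusal(output) {
  return output.trim() === REFUSAL;
}

// Fraction of outputs that are refusals. Intended to be run only on
// cases where the context does not contain the answer.
function refusalRate(outputs) {
  if (outputs.length === 0) return 0;
  return outputs.filter(isRefusal).length / outputs.length;
}
```

Running the same function on cases where the context does contain the answer gives the over-refusal rate, which is worth tracking alongside it so a stricter clause does not quietly make the model refuse answerable questions.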

Pattern 4: split the task into roles, in one prompt

For multi-step tasks, the pattern that has worked is to write the prompt as a small dialogue between roles, all in a single LLM call. The model produces both sides.

A concrete example for an extraction task with verification:

A single-call multi-role prompt structure I keep reusing
  ROLE 1 (Extractor): Read the document and produce a JSON object with
  fields {topic, date, action_required}.

  ROLE 2 (Verifier): Read the JSON from ROLE 1 and the original document.
  For each field, answer: is this supported by the document? Yes or no, and
  cite the exact sentence.

  ROLE 3 (Finalizer): If any field was not supported, replace it with null.
  Output the final JSON.

This is a one-call version of the multi-call "chain of agents" patterns that get a lot of attention. It is faster, cheaper, and almost as good as actually orchestrating three calls. The model produces all three role outputs in sequence; you parse out the final JSON and discard the intermediate reasoning.

When this beats actually splitting into three calls: when latency and cost budgets are tight. When it loses: when one of the roles needs different parameters (lower temperature, different model) than the others, or when you want to log each role's output for debugging. I default to single-call multi-role; I escalate to actual multi-call orchestration only when the eval shows the single-call version has a measurable accuracy gap.
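Parsing out the final JSON is the one fiddly part of the single-call version. The sketch below assumes the model labels each section with its role name (for example a line starting with `ROLE 3`); that labeling is an assumption about the output, and `extractFinalJson` is a hypothetical helper.

```javascript
// Hypothetical helper: returns the first JSON object that appears after
// the last occurrence of the marker, or null if none can be parsed.
function extractFinalJson(text, marker = 'ROLE 3') {
  const start = text.lastIndexOf(marker);
  if (start === -1) return null;
  const open = text.indexOf('{', start);
  if (open === -1) return null;

  // Walk forward to the matching closing brace, ignoring braces that
  // appear inside JSON strings.
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = open; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}') {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(open, i + 1));
        } catch (err) {
          return null; // balanced braces but not valid JSON
        }
      }
    }
  }
  return null; // unbalanced braces, e.g. a truncated response
}
```

A `null` result is a signal to retry or fall back, not something to paper over. One known limitation of this sketch: if the final JSON itself contains the marker text inside a string, the search starts in the wrong place.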

Pattern 5: the failure_reason field

A pattern from a teammate that I have adopted everywhere. Add a failure_reason field to the response schema that the model fills in when it could not complete the task confidently.

const schema = {
  type: 'object',
  properties: {
    intent: { type: ['string', 'null'] },
    failure_reason: { type: ['string', 'null'] },
  },
  required: ['intent', 'failure_reason'],
};

The instruction: if the model can complete the task, intent is filled and failure_reason is null. If it cannot, intent is null and failure_reason explains why.

This is the simplest mechanism I have found for getting the model to actually express uncertainty. Without it, the model tends to fabricate a confident answer. With it, the model has a structured way to say "I am not sure because the input was ambiguous on dimension X". The downstream code can route uncertain cases to a human or to a fallback prompt.

The payoff is in the production logs. A weekly aggregation of failure_reason values is a near-perfect bug list: "the input was multilingual", "the user mentioned three intents", "the date format was unparseable". Each one becomes a candidate for a few-shot example or a separate handler.
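The routing and the weekly aggregation can both be sketched in a few lines. This assumes responses are already parsed objects matching the schema above; `routeResponse` and `countFailureReasons` are hypothetical names, and the `handle` and `escalate` actions stand in for whatever the downstream system actually does.

```javascript
// Hypothetical router: a filled intent with a null failure_reason is
// handled normally; everything else goes to a human or a fallback prompt.
function routeResponse(response) {
  if (response.intent !== null && response.failure_reason === null) {
    return { action: 'handle', intent: response.intent };
  }
  return { action: 'escalate', reason: response.failure_reason ?? 'unspecified' };
}

// Hypothetical aggregation: counts failure_reason values, most frequent
// first. Only trims and lowercases, so it groups exact repeats only.
function countFailureReasons(responses) {
  const counts = new Map();
  for (const r of responses) {
    if (typeof r.failure_reason !== 'string') continue;
    const key = r.failure_reason.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Free-text reasons rarely repeat word for word, so in practice the counts need some manual bucketing, or a short list of allowed reason codes added to the schema, before they read as a bug list.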

Four patterns I have stopped using

"Take a deep breath and think step by step." Worked on smaller models in 2023. Frontier models internalized the technique; explicit invocation rarely changes the eval scores anymore. I have removed all such phrases from my prompts and seen no measurable regression.

Role-playing as a domain expert. "You are a senior physician with 30 years of experience." In my evals, this almost always made things worse, because it nudged the model toward authoritative-sounding answers without actually changing the underlying knowledge. I now describe the task plainly and let the model's actual capabilities do the work.

Threats, bribes, and emotional stakes. "This is very important to me" or "Lives depend on this". Funny, and occasionally it moves the score by a tiny margin on a hand-picked example, but it never moved my eval set in a measurable way. Stopped using.

Long, formal, multi-paragraph instructions. Length adds tokens, latency, and cost without proportional gains. The shortest instruction that still does the job has won every test I have run. "Answer using only the provided context" beats a four-paragraph version of the same idea, both in token count and in faithfulness scores.

How I actually pick a technique

The process that has worked, in five steps:

  1. Build a small eval set (50 to 200 representative cases) with expected outputs.
  2. Run the baseline prompt; record the metric (correctness, faithfulness, refusal rate, whatever matters).
  3. Apply the technique under consideration. Run again.
  4. Compare. If the gain is below 2 to 3 percentage points, do not adopt; the technique is in the noise.
  5. Re-run the eval after every model upgrade. Drop techniques whose value disappeared.
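Steps 2 to 4 can be sketched as a small comparison function. One caveat worth making concrete: on an eval set of 100 cases at 80% accuracy, the standard error of a single accuracy measurement is about 4 percentage points (the square root of 0.8 × 0.2 / 100), so a gain of 2 to 3 points on a small set can itself be sampling noise. The sketch below is an assumption-laden illustration, not a full statistical test: the helper names are hypothetical, each result is a per-case pass/fail boolean, and the two-standard-error cutoff is a rough rule of thumb.

```javascript
// Fraction of eval cases that passed. results is an array of booleans.
function accuracy(results) {
  return results.filter(Boolean).length / results.length;
}

// Standard error of a proportion p measured on n cases.
function standardError(p, n) {
  return Math.sqrt((p * (1 - p)) / n);
}

// Hypothetical comparison of a baseline run and a candidate run on the
// same eval set. Treats the runs as independent, which is usually
// conservative when both are scored on the same cases.
function compareRuns(baseline, candidate) {
  const n = baseline.length;
  if (n === 0 || candidate.length !== n) {
    throw new Error('runs must cover the same non-empty eval set');
  }
  const pBase = accuracy(baseline);
  const pCand = accuracy(candidate);
  const gain = pCand - pBase;
  const seDiff = Math.sqrt(standardError(pBase, n) ** 2 + standardError(pCand, n) ** 2);
  // Rough rule of thumb: require the gain to exceed two standard errors.
  return { pBase, pCand, gain, seDiff, clearlyAboveNoise: gain > 2 * seDiff };
}
```

If `clearlyAboveNoise` is false for a gain that still looks interesting, the options are a larger eval set or repeated runs, rather than adopting the technique on the strength of one comparison.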

This is laborious. It is also the only discipline I have found that actually distinguishes techniques that work from techniques that look like they work on a demo. The teams I have worked with who skip step 1 end up adopting prompts because they read about them on a blog, then quietly removing them six months later when they realize nothing got better.

What I would tell a team building their first prompt

The single most useful sentence I can leave with a team starting their first LLM-backed feature: build the eval set on day one, write the simplest possible prompt that works, and only adopt clever techniques when the eval set says they help. The reverse order (start with a clever prompt, build the eval set later) is how teams end up with a 600-token system prompt that everybody is afraid to change because nobody knows which clauses are doing useful work. The boring prompt with an eval set will outperform the elaborate prompt without one, every time. The clever techniques are useful; they are useful in service of measurement, not as a substitute for it.
