Prompt Regression Testing: A Real‑World Step‑by‑Step Guide for QA Engineers
The Total Guide from 0-100
Audience: QA/QE engineers who ship software and want a concrete, repeatable workflow for testing LLM prompts like any other production surface.
Outcome: You’ll implement a minimal, production‑ready prompt regression suite (golden tests + oracle + CI gate) using a realistic customer‑service example. Copy/paste the snippets into a repo and adapt.

Why prompt regression testing?

Prompts are code. They change, drift, and regress. Without a regression suite you’re flying blind: a wording tweak to “be more friendly” can silently flip a decision, break JSON, or leak PII. The fix is the same playbook you already know: golden tests + an oracle + CI gates.

This post walks you end‑to‑end through a real workflow: we’ll test a Refund Eligibility Assistant for an e‑commerce app.


The real‑world scenario: Refund Eligibility Assistant

Goal: Given a free‑form customer message and structured order data, the assistant must output a strict JSON decision that the app can enforce.

Business policy (simplified):

  • Refunds allowed within 30 days if unopened.
  • 31–60 days: store_credit if unopened; otherwise deny.
  • >60 days: deny.
  • VIP customers (loyalty_tier ≥ 3) get one exception refund per year even if 31–60 days.
  • If item is defective on arrival (DOA) with evidence, allow refund regardless of days.

Required output (contract): JSON with decision ∈ {"refund","store_credit","deny"}, reason, and policy_rule id.


Step 1 - Write the production prompt (baseline)

Create a system prompt that:

  • Defines the JSON contract.
  • Lists the policy rules with IDs.
  • Demands valid JSON only, no prose.
  • Asks the model to choose exactly one decision and reference the rule.

prompts/base.prompt.md

You are RefundBot. Decide refund eligibility.


**Output:** Return ONLY a JSON object: {"decision": "refund|store_credit|deny", "reason": string, "policy_rule": "R#"}
No markdown. No commentary. If uncertain, choose the safest policy‑compliant decision.


**Policy rules**
- R1: If days_since_delivery ≤ 30 and unopened → decision=refund.
- R2: If 31 ≤ days_since_delivery ≤ 60 and unopened → decision=store_credit.
- R3: If opened and days_since_delivery ≤ 60 → decision=deny.
- R4: If days_since_delivery > 60 → decision=deny.
- R5: If defective_on_arrival=true with evidence → decision=refund. R5 overrides R1–R4.
- R6: VIP customers (loyalty_tier ≥ 3) with no exception used this calendar year may receive one exception refund if 31–60 days and unopened. R6 overrides R2.


**Respond in JSON only.**

Step 2 - Define the oracle (how we decide right/wrong)

Your oracle is code that validates outputs against:

  1. Schema: The model must emit the required fields.
  2. Invariants: Policy rules must be applied correctly.
  3. Rubric (optional): If free‑text fields matter (e.g., tone), a lightweight LLM rubric can score them; we keep it optional for strict CI gates.

oracle/schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["decision", "reason", "policy_rule"],
  "additionalProperties": false,
  "properties": {
    "decision": {"type": "string", "enum": ["refund", "store_credit", "deny"]},
    "reason": {"type": "string", "minLength": 3},
    "policy_rule": {"type": "string", "pattern": "^R[1-9][0-9]*$"}
  }
}

Invariants to encode:

  • If days_since_delivery > 60 and not DOA → decision must be deny (R4).
  • If defective_on_arrival=true → decision must be refund (R5).
  • If 31–60 days, unopened, and not DOA → store_credit, or refund if an unused VIP exception applies (R6).
  • If opened, ≤ 60 days, and not DOA → deny (R3).

Step 3 - Create a golden dataset (start small, then grow)

Begin with 15–25 canonical cases that cover happy paths and edge cases. Store as JSONL; each line is a case.

data/golden.jsonl (excerpt)

{"id":"G01","input":{"message":"Package arrived last week, still sealed. I'd like a refund.","days_since_delivery":7,"unopened":true,"defective_on_arrival":false,"loyalty_tier":1,"vip_refunds_used":0},"expect":{"decision":"refund","policy_rule":"R1"}}
{"id":"G02","input":{"message":"It’s been 45 days but never opened.","days_since_delivery":45,"unopened":true,"defective_on_arrival":false,"loyalty_tier":1,"vip_refunds_used":0},"expect":{"decision":"store_credit","policy_rule":"R2"}}
{"id":"G03","input":{"message":"I opened it and tried it, didn’t like it.","days_since_delivery":12,"unopened":false,"defective_on_arrival":false,"loyalty_tier":1,"vip_refunds_used":0},"expect":{"decision":"deny","policy_rule":"R3"}}
{"id":"G04","input":{"message":"Item arrived broken, attaching photos.","days_since_delivery":80,"unopened":false,"defective_on_arrival":true,"loyalty_tier":1,"vip_refunds_used":0},"expect":{"decision":"refund","policy_rule":"R5"}}
{"id":"G05","input":{"message":"I’m VIP gold. 52 days since delivery, box unopened.","days_since_delivery":52,"unopened":true,"defective_on_arrival":false,"loyalty_tier":3,"vip_refunds_used":0},"expect":{"decision":"refund","policy_rule":"R6"}}
{"id":"G06","input":{"message":"I’m VIP, 52 days, unopened, used an exception earlier this year.","days_since_delivery":52,"unopened":true,"defective_on_arrival":false,"loyalty_tier":3,"vip_refunds_used":1},"expect":{"decision":"store_credit","policy_rule":"R2"}}
{"id":"G07","input":{"message":"Hi, it’s been 75 days.","days_since_delivery":75,"unopened":true,"defective_on_arrival":false,"loyalty_tier":2,"vip_refunds_used":0},"expect":{"decision":"deny","policy_rule":"R4"}}
Tip: Tag cases with coverage labels (e.g., tier: high_risk, policy: R6) to audit gaps later.
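
A quick audit script makes those gaps visible. Here is a minimal sketch that counts goldens per expected policy rule; the scripts/coverage.ts path is an assumption, and you can extend it to group by a tags field once you add one:

scripts/coverage.ts (sketch)

import fs from "fs";

// Count golden cases per expected policy rule, e.g. { R1: 1, R2: 2, ... }
const cases = fs.readFileSync("data/golden.jsonl", "utf8")
  .split("\n").filter(Boolean).map((l) => JSON.parse(l));

const byRule: Record<string, number> = {};
for (const c of cases) {
  byRule[c.expect.policy_rule] = (byRule[c.expect.policy_rule] ?? 0) + 1;
}
console.table(byRule);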

Step 4 - Make the prompt robust and deterministic

Small knobs matter:

  • Use a function‑call/JSON mode if your provider supports it. Otherwise, add strict formatting guards and a regex post‑validator.
  • Set temperature = 0–0.2 and top_p = 1 for regressions.
  • Provide example I/O for the JSON format in the prompt.
  • Add stop sequences after } to avoid trailing prose.

prompts/examples.md (append after rules)

Example output:
{"decision":"store_credit","reason":"Unopened and 45 days since delivery; per R2, issue store credit.","policy_rule":"R2"}

Step 5 - Build the evaluator (Node.js + TypeScript)

Folder layout

prompt-regression/
├─ prompts/
│  ├─ base.prompt.md
│  └─ examples.md
├─ data/
│  └─ golden.jsonl
├─ oracle/
│  ├─ schema.json
│  └─ invariants.ts
├─ evaluator/
│  ├─ judge.ts
│  ├─ run.ts
│  └─ report.ts
├─ package.json
└─ ci/
   └─ github-actions.yml

oracle/invariants.ts

export type CaseInput = {
  message: string;
  days_since_delivery: number;
  unopened: boolean;
  defective_on_arrival: boolean;
  loyalty_tier: number;
  vip_refunds_used: number;
};

export type ModelOut = {
  decision: "refund" | "store_credit" | "deny";
  reason: string;
  policy_rule: string;
};

export function checkInvariants(inp: CaseInput, out: ModelOut): string[] {
  const errs: string[] = [];
  // R5 overrides everything: defective-on-arrival must refund.
  if (inp.defective_on_arrival && out.decision !== "refund") {
    errs.push("DOA must refund (R5)");
  }
  // R4: beyond 60 days, deny — unless R5 applies.
  if (inp.days_since_delivery > 60 && !inp.defective_on_arrival && out.decision !== "deny") {
    errs.push(">60 days must deny (R4)");
  }
  // R3: opened within 60 days (and not DOA) must deny.
  if (inp.days_since_delivery <= 60 && !inp.unopened && !inp.defective_on_arrival && out.decision !== "deny") {
    errs.push("Opened ≤60 days must deny (R3)");
  }
  // R2/R6: 31–60 days, unopened, not DOA → store_credit, or refund on an unused VIP exception.
  if (inp.days_since_delivery >= 31 && inp.days_since_delivery <= 60 && inp.unopened && !inp.defective_on_arrival) {
    const vipException = inp.loyalty_tier >= 3 && inp.vip_refunds_used === 0;
    const expected = vipException ? "refund" : "store_credit";
    if (out.decision !== expected) {
      errs.push(`31–60 unopened → ${expected} (R${vipException ? 6 : 2})`);
    }
  }
  return errs;
}
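
Because the oracle is itself code, sanity-check it before trusting it: feed every golden's expected label through checkInvariants and confirm zero errors. A minimal sketch (scripts/selftest-oracle.ts is a hypothetical helper, not part of the layout above):

scripts/selftest-oracle.ts (sketch)

import fs from "fs";
import { checkInvariants } from "../oracle/invariants";

const cases = fs.readFileSync("data/golden.jsonl", "utf8")
  .split("\n").filter(Boolean).map((l) => JSON.parse(l));

for (const c of cases) {
  // Any error here means a golden and the oracle disagree;
  // fix one of them before letting CI enforce anything.
  const errs = checkInvariants(c.input, {
    decision: c.expect.decision,
    reason: "selftest",
    policy_rule: c.expect.policy_rule,
  });
  if (errs.length) console.error(`${c.id}: ${errs.join("; ")}`);
}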

evaluator/judge.ts (schema + invariants)

import Ajv from "ajv/dist/2020"; // 2020-12 dialect, matching the $schema in schema.json
import { checkInvariants, type CaseInput, type ModelOut } from "../oracle/invariants";
import schema from "../oracle/schema.json" assert { type: "json" };

const ajv = new Ajv({ allErrors: true, strict: false });
const validate = ajv.compile(schema);

export type Verdict = { pass: boolean; reason: string; out?: ModelOut };

export function judge(inp: CaseInput, raw: string): Verdict {
  // Extract the JSON object (robust to stray text before or after it)
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return { pass: false, reason: "No JSON object found" };

  let out: ModelOut;
  try { out = JSON.parse(match[0]); } catch { return { pass: false, reason: "Invalid JSON" }; }

  if (!validate(out)) {
    const msg = ajv.errorsText(validate.errors);
    return { pass: false, reason: `Schema fail: ${msg}`, out };
  }

  const inv = checkInvariants(inp, out);
  if (inv.length) return { pass: false, reason: inv.join("; "), out };

  return { pass: true, reason: `OK (${out.policy_rule})`, out };
}

evaluator/run.ts (executes test set)

import fs from "fs";
import path from "path";
import readline from "readline";
import { judge } from "./judge";

// Mock: replace with your model call (see the provider sketch below)
async function callModel(prompt: string): Promise<string> {
  // TODO: integrate your provider SDK, temperature=0.2, stop after '}'
  return "{\"decision\":\"store_credit\",\"reason\":\"Stub\",\"policy_rule\":\"R2\"}";
}

function buildPrompt(system: string, inp: any) {
  const user = `Customer message: ${inp.message}\nStructured data: ${JSON.stringify(inp)}`;
  return `${system}\n\n${user}`;
}

async function* readJsonl(file: string) {
  const rl = readline.createInterface({ input: fs.createReadStream(file), crlfDelay: Infinity });
  for await (const line of rl) if (line.trim()) yield JSON.parse(line);
}

(async () => {
  const system = fs.readFileSync(path.join("prompts", "base.prompt.md"), "utf8") +
    "\n\n" + fs.readFileSync(path.join("prompts", "examples.md"), "utf8");
  const file = path.join("data", "golden.jsonl");

  let pass = 0, fail = 0;
  const results: any[] = [];

  for await (const tc of readJsonl(file)) {
    const prompt = buildPrompt(system, tc.input);
    const raw = await callModel(prompt);
    const verdict = judge(tc.input, raw);

    // Compare the parsed decision (and rule) against the golden expectation,
    // rather than string-matching the raw output, which breaks on whitespace.
    const ok = verdict.pass &&
      (!tc.expect?.decision || verdict.out?.decision === tc.expect.decision) &&
      (!tc.expect?.policy_rule || verdict.out?.policy_rule === tc.expect.policy_rule);
    if (ok) pass++; else fail++;

    results.push({ id: tc.id, pass: ok, reason: verdict.reason, raw });
  }

  fs.writeFileSync("results.json", JSON.stringify(results, null, 2));
  console.log(`PASS: ${pass}, FAIL: ${fail}`);
  if (fail > 0) process.exit(1); // CI gate
})();

In production, replace the stubbed callModel with your provider, set temperature=0–0.2, and plumb API keys via env vars.
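
For example, here is a minimal sketch wiring callModel to the OpenAI Node SDK; the SDK choice, model id, and file path are assumptions, so adapt them to whatever provider you actually use:

evaluator/model.ts (sketch)

import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function callModel(prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder id; pin an exact model version for reproducible regressions
    temperature: 0,
    top_p: 1,
    response_format: { type: "json_object" }, // JSON mode, where supported
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0]?.message?.content ?? "";
}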


Step 6 - Run locally

npm init -y
npm i ajv
npm i -D typescript ts-node @types/node
npx tsc --init
npx ts-node evaluator/run.ts

You should see a summary (PASS/FAIL) and a results.json you can archive as a CI artifact.
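
For illustration, a results.json entry has this shape (the reason string comes straight from the judge; raw is whatever the model returned):

[
  {
    "id": "G02",
    "pass": true,
    "reason": "OK (R2)",
    "raw": "{\"decision\":\"store_credit\",\"reason\":\"Unopened and 45 days since delivery; per R2.\",\"policy_rule\":\"R2\"}"
  }
]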


Step 7 - Wire it into CI with a release gate

ci/github-actions.yml

name: Prompt Regression
on:
  push:
    branches: [ main ]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci || npm i
      - run: npx ts-node evaluator/run.ts

The run.ts exits non‑zero if any golden fails, so PRs that regress the prompt are blocked. Add a nightly workflow that runs against a larger random sample of real production messages (with PII safely redacted) to catch drift.
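
That nightly run can reuse the same evaluator with a schedule trigger. A sketch, assuming a larger data/nightly.jsonl sample and a hypothetical DATASET switch you would wire into run.ts:

ci/nightly.yml (sketch)

name: Prompt Drift (nightly)
on:
  schedule:
    - cron: "0 3 * * *" # every night at 03:00 UTC
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci || npm i
      # DATASET is hypothetical; read it in run.ts instead of the hard-coded golden path.
      - run: DATASET=data/nightly.jsonl npx ts-node evaluator/run.ts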


Step 8 - Grow coverage with a scenario factory

Once your core suite is green, scale up thoughtfully:

  • Templating: Generate paraphrases ("refund pls", "would like my money back"), typos, and varied punctuation.
  • Metamorphic transforms: Preserve the label while changing surface features (swap “7 days” → “a week”, change case, add benign chit‑chat).
  • Real‑world replay: Sample anonymized, consented production messages and attach structured order data.
  • Risk‑tier quotas: Ensure N cases per policy rule (R1–R6) and per edge condition (e.g., 59–61 days).

Keep the oracle the same; only expand inputs.
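
A tiny metamorphic generator can look like the sketch below; the transforms are hand-rolled examples, and real suites often pull paraphrases from a model or a corpus instead:

factory/metamorphic.ts (sketch)

type Golden = { id: string; input: { message: string; [k: string]: any }; expect: any };

// Label-preserving surface transforms: structured fields and the expected
// decision never change, only the customer's wording does.
const transforms: Array<[string, (msg: string) => string]> = [
  ["lowercase", (m) => m.toLowerCase()],
  ["chitchat", (m) => `Hope you're having a great day! ${m}`],
  ["terse", (m) => m.replace(/\brefund\b/i, "refund pls")],
];

export function expand(cases: Golden[]): Golden[] {
  return cases.flatMap((c) =>
    transforms.map(([name, fn], i) => ({
      ...c,
      id: `${c.id}-M${i + 1}-${name}`,
      input: { ...c.input, message: fn(c.input.message) },
    }))
  );
}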


Step 9 - Reporting & metrics for decision‑makers

Track and trend:

  • CFR (Critical Failure Rate): % of tests where the decision violates policy.
  • SFR (Schema Failure Rate): % of outputs that are not valid JSON/contract.
  • Rule coverage: How many cases per policy rule.
  • Latency & cost: Avg time and tokens per decision at regression settings.

Export a simple CSV/JSON and chart it in your existing dashboards.
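
The report.ts file from the Step 5 layout is a natural home for this. A sketch that derives SFR and CFR from results.json, keyed off the judge's failure strings (adjust the patterns if you reword those messages):

evaluator/report.ts (sketch)

import fs from "fs";

type Result = { id: string; pass: boolean; reason: string; raw: string };
const results: Result[] = JSON.parse(fs.readFileSync("results.json", "utf8"));

const total = results.length;
// Schema failures: contract violations (bad/missing JSON or schema errors).
const schemaFails = results.filter(
  (r) => !r.pass && /Schema fail|Invalid JSON|No JSON/.test(r.reason)
).length;
// Everything else that failed is a policy (critical) failure.
const policyFails = results.filter((r) => !r.pass).length - schemaFails;

console.log(`SFR: ${((schemaFails / total) * 100).toFixed(1)}%`);
console.log(`CFR: ${((policyFails / total) * 100).toFixed(1)}%`);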


Step 10 - Updating goldens safely

Treat golden changes like code changes:

  • Two‑step flow: first open a PR with evidence (logs, policy link) that the expected label is wrong; reviewers approve; then merge.
  • Quarantine flaky cases: If a test is nondeterministic due to upstream noise, skip with reason and file an issue—don’t silently change expected labels.
  • Version policies: If business rules change, bump a policy version (policy_v2) and migrate cases with a script (see the sketch below) so history remains explainable.
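
That migration script can stay tiny. A sketch, where the policy_version field and the rule remap table are assumptions you would adapt to the actual policy change:

scripts/migrate-policy.ts (sketch)

import fs from "fs";

// Hypothetical remap: old rule id → new rule id under policy_v2.
const remap: Record<string, string> = { R6: "R7" };

const lines = fs.readFileSync("data/golden.jsonl", "utf8").split("\n").filter(Boolean);
const migrated = lines.map((line) => {
  const c = JSON.parse(line);
  c.policy_version = "policy_v2";
  c.expect.policy_rule = remap[c.expect.policy_rule] ?? c.expect.policy_rule;
  return JSON.stringify(c);
});

fs.writeFileSync("data/golden.v2.jsonl", migrated.join("\n") + "\n");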

Step 11 - Common pitfalls & fixes

  • Free‑form output: Enforce JSON with examples, schema validation, and stop sequences. If your provider has JSON mode, use it.
  • High temperature: Set temp ≤ 0.2 for regressions; use higher only in exploration/creative contexts.
  • Oracle too weak: Encode invariants in code; don’t rely solely on string match or another LLM’s opinion.
  • Ambiguous policies: If humans disagree, the model will too. Clarify rules before writing tests.
  • Data leakage: Never include PII in goldens; sanitize replays; store datasets in a restricted repo.

What you’ve built

  • A prompt contract (system prompt + JSON schema).
  • A golden dataset that exercises critical policy edges.
  • A deterministic evaluator with schema + invariants.
  • A CI gate that blocks regressions.

From here, you can scale to 5k–50k cases with a scenario factory, add an LLM‑judge rubric for subjective fields (tone, empathy), and wire a nightly continuous evaluation against production samples. Same playbook, bigger surface area.


Appendix A - Optional LLM rubric for free‑text fields

If you need to enforce tone or safety, you can add a second, narrow model that grades the reason text on a 0–2 scale against a tiny rubric (a call sketch follows the rubric). Fail the case if the grade is below 2, keep the check advisory in CI, or enforce hard thresholds only on high‑risk tiers.

Rubric sketch:

  • 2 = Clear, cites rule (R#), no empathy errors, no PII.
  • 1 = Vague or missing rule reference.
  • 0 = Hallucinated policy, rude, or unsafe.
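
A sketch of the grader call, reusing the provider wrapper sketched in Step 5; the rubric wording and the zero-grade fallback are illustrative choices:

evaluator/rubric.ts (sketch)

import { callModel } from "./model"; // the provider wrapper sketched earlier

const RUBRIC = `Grade the refund-bot "reason" text from 0–2:
2 = clear, cites a rule (R#), no empathy errors, no PII.
1 = vague or missing rule reference.
0 = hallucinated policy, rude, or unsafe.
Return ONLY a JSON object: {"grade": 0|1|2}`;

export async function gradeReason(reason: string): Promise<number> {
  const raw = await callModel(`${RUBRIC}\n\nReason: ${reason}`);
  try {
    const { grade } = JSON.parse(raw);
    return typeof grade === "number" ? grade : 0;
  } catch {
    return 0; // unparseable grades count as failures
  }
}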

Appendix B - Minimal package.json

{
  "type": "module",
  "dependencies": {
    "ajv": "^8.12.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "ts-node": "^10.9.2",
    "typescript": "^5.4.0"
  }
}

Steal this and ship. Your QA instincts already map to LLM systems; prompt regression testing just gives you the harness to enforce them every time you change a word.

