Building a Scenario Factory to Power Your AI Oracle

When you start testing AI agents at enterprise scale, you quickly run into a problem: the scenarios never end. Real users don’t stick to happy paths: they wander, improvise, and sometimes break your system in ways you never imagined.
If you’re building an oracle (a system that can automatically judge whether 50,000+ agent outputs are correct, safe, or aligned), you need raw material. You need scenarios. Not a few hundred hand-crafted test cases, but a factory that can reliably produce tens of thousands of diverse, meaningful, and evolving situations.
That’s what a Scenario Factory is: an engine for generating, curating, and managing test cases at scale. In this post, we’ll go deep into what it takes to design one, why it’s essential for AI agent testing, and how to put the pieces together.
Why a Scenario Factory?
Traditional QA practices rely on regression suites: a fixed set of scripted tests run after every change. For classic web apps or APIs, this works fine. But agents and LLM-based systems break this model.
- Output space explosion: An agent can respond in thousands of valid ways to the same prompt.
- Open-world variability: Inputs aren’t deterministic; users bring slang, typos, and edge cases.
- Continuous drift: Models evolve, datasets shift, regulations tighten; what was “correct” last month may not be correct now.
The oracle gives you a way to automatically score outputs. The scenario factory ensures you’re feeding the oracle enough coverage, diversity, and challenge to make those scores meaningful.
Think of it like building a gym for your agent: the oracle is the judge, the scenario factory supplies the weights.
Principles of a Scenario Factory
Before diving into architecture, let’s anchor on principles:
- Coverage over Completeness: You’ll never cover everything. Aim for broad slices of behavior across features, domains, and personas.
- Reproducibility: Each scenario must be reconstructable. If a failure happens, you need to replay it exactly.
- Evolvability: Scenarios aren’t static. They must evolve as products, models, and user bases change.
- Balance (Synthetic + Real): Use synthetic generators for breadth, but ground them in real-world data and user logs.
- Automation First: Manual crafting is fine for seeds, but scale comes from pipelines.
Core Components
A robust scenario factory usually has five layers:
1. Scenario Seeds
The starting point. Seeds are minimal examples that define the essence of a user situation.
- Example seed for a travel agent: “Book a one-way ticket from NYC to LA for tomorrow.”
- Example seed for a healthcare chatbot: “I need to refill my blood pressure medication.”
Seeds are curated by humans (domain experts, QA, compliance) to ensure grounding. They form the DNA of the scenario space.
2. Generators
Where scale comes from. Generators take seeds and expand them into many variations.
Techniques:
- Parametric generation: Vary key fields (cities, dates, medication names).
- Paraphrasing with LLMs: Rewrite in different tones, styles, or with typos.
- Persona conditioning: Apply user personas (elderly user, teenager, non-native speaker).
- Adversarial mutation: Inject edge cases (nonsense, injections, conflicting requests).
Each generator must log its randomization seeds for reproducibility.
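As a minimal sketch, here’s what parametric generation plus a simple adversarial mutation might look like, with the randomization seed logged for replay. The template, field pools, and mutation strategy are illustrative assumptions, not a prescribed implementation:

```python
import itertools
import random

# Hypothetical seed template and value pools -- illustrative only.
TEMPLATE = "Book a one-way ticket from {origin} to {dest} for {when}."
ORIGINS = ["NYC", "Boston", "Chicago"]
DESTS = ["LA", "Denver", "Miami"]
WHENS = ["tomorrow", "next Friday", "December 3rd"]

def parametric_variants():
    """Expand the seed template over the cartesian product of field values."""
    for origin, dest, when in itertools.product(ORIGINS, DESTS, WHENS):
        yield TEMPLATE.format(origin=origin, dest=dest, when=when)

def adversarial_mutation(text, rng):
    """Inject a simple edge case (a stuttered character) as a stand-in
    for richer mutations: typos, injections, conflicting requests."""
    i = rng.randrange(len(text))
    return text[:i] + text[i] * 3 + text[i:]

# Log the randomization seed so every run is reproducible.
RUN_SEED = 42
rng = random.Random(RUN_SEED)
scenarios = list(parametric_variants())
scenarios += [adversarial_mutation(s, rng) for s in scenarios[:5]]
print(f"seed={RUN_SEED} generated={len(scenarios)}")
```

Even this toy version turns one seed into dozens of variants; real generators layer LLM paraphrasing and persona conditioning on top of the same pattern.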
3. Scenario Metadata
Every scenario must carry metadata for traceability and slicing.
Typical fields:
- Feature / capability being tested
- Persona / demographic
- Difficulty level (easy, ambiguous, adversarial)
- Compliance tags (HIPAA, PCI, safety-critical)
- Source (synthetic vs. real-world log)
This metadata lets you build balanced evaluation sets, e.g. 20% adversarial, 30% critical path, and so on.
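A lightweight way to enforce this is a typed record plus a sampler that draws by difficulty mix. The field names and the mix format here are assumptions for illustration:

```python
from dataclasses import dataclass, field
import random

@dataclass
class Scenario:
    """Illustrative metadata schema; field names are assumptions."""
    text: str
    feature: str                 # capability being tested
    persona: str                 # user persona / demographic
    difficulty: str              # easy | ambiguous | adversarial
    compliance_tags: list = field(default_factory=list)
    source: str = "synthetic"    # synthetic | real_log

def balanced_sample(pool, n, mix, rng):
    """Draw ~n scenarios matching a difficulty mix,
    e.g. {"adversarial": 0.2, "easy": 0.8}."""
    out = []
    for difficulty, share in mix.items():
        bucket = [s for s in pool if s.difficulty == difficulty]
        out += rng.sample(bucket, min(int(n * share), len(bucket)))
    return out
```

The same sampler works for any metadata axis (persona, compliance tag), which is what makes the tagging effort pay off.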
4. Scenario Warehouse
A versioned store for all scenarios. Think of it as your GitHub for test inputs.
Requirements:
- Version control (every commit stores scenarios + metadata).
- Ability to shard into evaluation sets (nightly regression, adversarial stress test, compliance audit).
- Auditability: who added what, when, and why.
Some teams store scenarios in JSONL with schema validation; others use databases with APIs. The key is queryability and immutability of history.
5. Scenario Lifecycle Management
Factories don’t just generate; they curate.
Processes to include:
- Scenario reviews: Human-in-the-loop curation of seeds, adversarial findings, and compliance cases.
- Archiving & retirement: Old scenarios may lose relevance; tag and retire them.
- Evolution hooks: Auto-generate new scenarios when product features change or new regulations land.
Integrating with the Oracle
The scenario factory isn’t useful unless it connects to your oracle pipeline. Integration looks like this:
- Scenario → Execution: The agent is run against each scenario, producing outputs.
- Execution → Oracle: The oracle evaluates outputs against expectations (schemas, invariants, LLM judges).
- Oracle → Metrics: Scores are aggregated into coverage, pass rate, critical failure rate (CFR), and safety failure rate (SFR).
- Metrics → CI/CD Gates: Code ships only if metrics meet thresholds.
In other words:
- Scenario factory supplies the inputs.
- Oracle supplies the judgments.
- Together, they form your evaluation loop.
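The metrics-to-gate step can be sketched as a small aggregation over oracle verdicts. The result shape and the default thresholds here are assumptions, you would tune them per release channel:

```python
def evaluate_release(results, max_cfr=0.0, max_sfr=0.0, min_pass=0.95):
    """Aggregate oracle verdicts into CI/CD gate metrics.

    `results` is assumed to be a list of dicts like
    {"passed": bool, "critical": bool, "safety": bool}.
    """
    total = len(results)
    pass_rate = sum(r["passed"] for r in results) / total
    # CFR/SFR count only failures on critical or safety-tagged scenarios.
    cfr = sum((not r["passed"]) and r["critical"] for r in results) / total
    sfr = sum((not r["passed"]) and r["safety"] for r in results) / total
    ship = pass_rate >= min_pass and cfr <= max_cfr and sfr <= max_sfr
    return {"pass_rate": pass_rate, "cfr": cfr, "sfr": sfr, "ship": ship}
```

Note the asymmetry in the defaults: a few ordinary failures are tolerable, but a single critical or safety failure blocks the release.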
Building the Factory: Step by Step
Here’s a pragmatic roadmap:
Step 1. Define Your Seeds
Start small. Hand-craft 50–100 seeds that reflect your most critical user flows and compliance requirements. These become the “golden set.”
Step 2. Build Generators
Use templates + LLM-based paraphrasers to expand seeds. Add noise, personas, and edge cases. Automate generation into pipelines.
Step 3. Store in a Warehouse
Pick a durable format (JSONL + S3, or a small database). Add schema validation and enforce metadata fields.
Step 4. Connect to the Oracle
Pipe generated scenarios into your oracle evaluation pipeline. Run nightly or weekly at first.
Step 5. Add Lifecycle Processes
Formalize how new seeds get added, how adversarial cases are integrated, and how compliance scenarios are kept up to date.
Step 6. Scale
Move from hundreds → thousands → tens of thousands of scenarios. Add dashboards to track coverage and trend lines.
Patterns and Best Practices
- Gold sets vs. Silver sets:
- Gold: Hand-curated, reviewed, stable (used for regression gates).
- Silver: Generated at scale, noisier (used for monitoring trends).
- Bias auditing: Include demographic diversity in persona generation to detect fairness issues.
- Mutation testing: Borrowed from software testing; slightly perturb scenarios to test resilience.
- Continuous refresh: Log production traffic, sample interesting real-world inputs, feed back into the factory.
- Feedback loops: Failures from the oracle automatically become new seeds for the factory.
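That last feedback loop can be a one-function hook: anything the oracle fails gets promoted back into the seed pool. The `(scenario, verdict)` shape and the `failure_feedback` source tag are illustrative assumptions:

```python
def failures_to_seeds(results, seed_pool):
    """Feedback hook sketch: promote every oracle failure into a new seed.

    `results` is assumed to be a list of (scenario_dict, verdict_dict)
    pairs, where the verdict carries a boolean "passed" field.
    """
    new_seeds = []
    for scenario, verdict in results:
        if not verdict["passed"]:
            # Tag provenance so the warehouse can slice on it later.
            seed = dict(scenario, source="failure_feedback")
            seed_pool.append(seed)   # re-enters generation on the next run
            new_seeds.append(seed)
    return new_seeds
```

Run after each evaluation cycle, this keeps the factory biased toward the regions of input space where the agent is currently weakest.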
Example: Travel Booking Agent
Imagine you’re testing a travel booking agent.
Seed:
“Book a round-trip flight from SF to Paris next month.”
Generated Scenarios:
- Parametric: Change cities, dates, trip types.
- Paraphrased: “Can you get me a flight out of San Fran to Paris in about a month?”
- Persona: Elderly traveler with accessibility needs.
- Adversarial: “Book me a flight to PArisssss tomorrow!!!!”
Each scenario is tagged (feature: booking, persona: elderly, difficulty: adversarial). Stored in the warehouse, then executed against the oracle to validate correctness, price accuracy, and compliance.
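Concretely, the adversarial variant might land in the warehouse as a record like this. Every field name, the seed id, and the generator reference are hypothetical, shown only to make the traceability chain tangible:

```python
# One hypothetical warehouse record for the adversarial variant above.
record = {
    "text": "Book me a flight to PArisssss tomorrow!!!!",
    "feature": "booking",
    "persona": "default",
    "difficulty": "adversarial",
    "compliance_tags": [],
    "source": "synthetic",
    "seed_id": "travel-001",            # traces back to the original seed
    "generator": "adversarial_mutation",  # which pipeline produced it
    "generator_seed": 42,               # randomization seed, for exact replay
}
```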
Why It Matters
Without a scenario factory:
- Your oracle starves. You test the same 200 prompts forever.
- Drift and regressions slip by unnoticed.
- Compliance auditors ask for evidence, and you scramble.
With a scenario factory:
- You run 50,000+ evaluations nightly with full traceability.
- Failures automatically surface new seeds, keeping your suite fresh.
- You can tell your CTO (and auditors): “Every release was tested against critical paths, adversarial cases, and compliance scenarios: here are the metrics.”
It’s not just QA. It’s risk management, compliance, and product velocity all in one.
The Future is Now
The future of QA isn’t about armies of testers clicking buttons; it’s about factories and oracles. The factory creates scenarios; the oracle judges them. Together, they let teams ship agents at scale with confidence.
If you’re building an agent platform today, don’t wait until you hit a wall of regressions and outages. Start small: craft seeds, build your first generator, and set up a warehouse. Over time, your scenario factory will become as critical to your engineering pipeline as CI/CD itself.
In a world where agents interact with millions of users in unpredictable ways, the scenario factory is how you stay ahead. It’s not optional; it’s the foundation of trustworthy AI systems.