Icarus: Build Your Own AI QA Test Agent

Most test engineers know the pain of keeping automation in sync with a fast-moving product. Locators change, flows shift, features evolve, and suddenly your test suite is brittle.

That’s where Icarus comes in.

Icarus is an AI-powered test agent I built that converts annotated screenshots into structured test cases using OpenAI models such as GPT-4o or GPT-5. It’s designed to slot into a self-healing pipeline, so test coverage stays alive even as the product changes.

This post walks you through the idea, architecture, implementation, and full project layout so you can build your own version.


Why I Built Icarus

Traditional automation struggles with two recurring problems:

  1. Manual test case creation is slow. Every new feature needs dozens of edge cases documented.
  2. Automation breaks silently. A renamed button or new flow can invalidate dozens of tests overnight.

The vision for Icarus is simple:

  • Let humans mark what matters (e.g., clickable elements in a screenshot).
  • Let AI generate structured test cases from that context.
  • Feed those cases back into your suite, where a self-healing pipeline can decide whether to patch, flag, or escalate.

The Icarus Workflow

Icarus runs in three stages:

1. Screenshots + Annotations

  • Capture screenshots of your app flows.
  • Use red boxes to mark interactive elements (buttons, fields, toggles).
  • Organize them by module and flow (e.g., awv/01-login.png).

2. Generate Test Cases

  • A Python script (generate.py) sends each screenshot and a structured system prompt into GPT-4o (or GPT-4.1 mini / GPT-5).
  • The model returns structured manual test cases in JSON or CSV.
{
  "module": "awv",
  "flow": "login",
  "step": "User clicks Login button",
  "expected": "System navigates to dashboard"
}

Output from generate.py in JSON
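Since the model can return JSON but many test management tools import CSV, a small conversion step covers both formats. This is a sketch; the field names follow the JSON example above:

```python
import csv
import io
import json

def cases_to_csv(json_cases: str) -> str:
    """Convert a JSON test case (or a list of them) to CSV text."""
    data = json.loads(json_cases)
    rows = data if isinstance(data, list) else [data]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["module", "flow", "step", "expected"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(cases_to_csv('{"module": "awv", "flow": "login", '
                   '"step": "User clicks Login button", '
                   '"expected": "System navigates to dashboard"}'))
```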

3. Integration Into Self-Healing Pipeline

  • Output test cases flow into your test management tool (Allure, Zephyr, Xray, etc.).
  • When automation fails, Icarus can:
    • Patch: Suggest updated selectors.
    • Flag: Mark a test as unstable.
    • Recommend: Propose new cases for uncovered flows.
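The patch/flag/recommend step can be sketched as a simple triage function. Everything here (the Failure fields, the flake-rate threshold) is illustrative, not Icarus’s actual logic:

```python
from dataclasses import dataclass

@dataclass
class Failure:
    test_id: str
    selector_found: bool  # did a near-match selector exist in the new DOM?
    flake_rate: float     # historical failure rate for this test
    covered: bool         # does any test cover this flow at all?

def triage(failure: Failure) -> str:
    """Illustrative triage rules for a failed automated test."""
    if failure.selector_found:
        return "patch"      # suggest the updated selector in a review-gated PR
    if failure.flake_rate > 0.2:
        return "flag"       # mark the test unstable for human attention
    if not failure.covered:
        return "recommend"  # propose new cases for the uncovered flow
    return "flag"

print(triage(Failure("login-001", selector_found=True, flake_rate=0.0, covered=True)))  # → "patch"
```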

Example in Action

You capture 01-login.png with the Login button boxed.

GPT-4o returns:

  • Verify Login button is visible
  • Verify clicking Login navigates to Roster Page

Next sprint, the button label changes from “Login” → “Sign In.”

Your automation fails. Icarus flags the change and proposes:

Update the locator from button[name="Login"] to button[name="Sign In"].

That’s the self-healing loop in action. 🚨🚨🚨 Do NOT auto-apply these patches to your repo. Instead, gate them behind a PR that a human reviews first.


Minimal Example (for the “aha” moment)

Here’s the simplest possible generate.py. Two fixes over the naive version: the legacy openai.ChatCompletion API was removed in the 1.x SDK, and a vision model needs the image itself, not just its path:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("dataset/screenshots/awv/01-login.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

cases = client.chat.completions.create(
    model="gpt-4o",  # or gpt-4.1-mini for cheaper runs; gpt-5 for text-only prompts
    messages=[
        {"role": "system", "content": system_prompt},  # system_prompt defined elsewhere
        {"role": "user", "content": [{"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"}}]},
    ],
)

print(cases.choices[0].message.content)

💡 This shows how little glue you need to get screenshots → test cases.


The Real-World generate.py

In reality, Icarus uses a larger, configurable script:

import argparse, os, openai, yaml
from scripts.utils import parse_test_case, save_output, validate_screenshot_path

# Load config
with open(os.path.join(os.path.dirname(__file__), "..", "config.yaml")) as f:
    config = yaml.safe_load(f)

MODEL = config["model"]
TEMPERATURE = config["temperature"]
MAX_TOKENS = config["max_tokens"]

# Load prompts
with open(config["prompt_template_path"]) as f:
    prompt_template = f.read()
with open(config["system_prompt_path"]) as f:
    system_prompt = f.read()

client = openai.OpenAI()

def generate_test_case(screenshot_path, title):
    prompt = prompt_template.replace("{screenshot}", screenshot_path).replace("{title}", title)
    response = client.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate test case from screenshot")
    parser.add_argument("--screenshot", required=True)
    parser.add_argument("--title", required=True)
    args = parser.parse_args()

    if not validate_screenshot_path(args.screenshot, config["screenshot_dir"]):
        print("❌ Screenshot not found")
        raise SystemExit(1)

    result = generate_test_case(args.screenshot, args.title)
    print("\n✅ Generated Test Case:\n", result)
    filepath = save_output(result, config["output_dir"], args.title)
    print(f"\n📝 Saved to: {filepath}")
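The script imports helpers from scripts/utils.py that aren’t shown above. Here is a minimal sketch of what validate_screenshot_path and save_output might look like (parse_test_case depends on your output format, so it’s omitted; these are assumptions, not the actual helpers):

```python
import os
import re

def validate_screenshot_path(screenshot: str, screenshot_dir: str) -> bool:
    """Check that the requested screenshot exists under the configured directory."""
    return os.path.isfile(os.path.join(screenshot_dir, screenshot))

def save_output(content: str, output_dir: str, title: str) -> str:
    """Write the generated test case to a Markdown file named after its title."""
    os.makedirs(output_dir, exist_ok=True)
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    filepath = os.path.join(output_dir, f"{slug}.md")
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
    return filepath
```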

Full Project Layout

Here’s the recommended directory structure:

icarus/
├── config.yaml                 # Model, tokens, directories
├── generate.py                 # Core script
├── scripts/
│   ├── utils.py                 # parse/save/validation helpers
│   └── openai_finetune.py       # optional fine-tuning script
├── prompts/
│   ├── system_message.txt       # canonical system prompt
│   └── prompt_template.txt      # structured test case template
├── dataset/
│   ├── screenshots/             # annotated UI screenshots
│   ├── train_data.jsonl         # seed data for fine-tuning
│   └── feedback_loop.jsonl      # corrections & human feedback
├── test_cases/                  # generated cases in JSON/CSV
└── output/                      # exported test case files

Supporting Files

  • config.yaml — centralizes model + token settings. Example:
model: gpt-4o
temperature: 0
max_tokens: 1000
prompt_template_path: prompts/prompt_template.txt
system_prompt_path: prompts/system_message.txt
screenshot_dir: dataset/screenshots
output_dir: test_cases
  • system_message.txt — controls Icarus’ “personality.” Example:
You are a senior QA engineer who specializes in writing detailed manual test cases from UI screenshots.

Your job is to:
- Analyze a screenshot of a web or mobile UI
- Interpret the visible elements (inputs, buttons, labels, messages)
- Write a clear, concise, and structured manual test case
- Use consistent formatting so the test case can be easily copied into a test management tool

Use the following format:
- Precondition
- Numbered Steps
- Expected Result

Only include what can be observed from the screenshot. Do not make assumptions about functionality that isn’t visible.
  • prompt_template.txt — defines the user message. Example:
You are a senior QA engineer reviewing a UI screen.

Screenshot: {screenshot}
Test Case Title: {title}

Write a **manual test case** in the following format:

**Precondition:**
State the screen the user is on or any setup required.

**Steps:**
1. Start each step with a clear user action.
2. Use simple, direct language.
3. Cover all visible elements, including form inputs, buttons, modals, etc.

**Expected Result:**
Describe what should happen after the last step.

Use Markdown-style formatting. Avoid assumptions that aren't visible in the screenshot.
  • train_data.jsonl / feedback_loop.jsonl — store examples and corrections for fine-tuning.
  • openai_finetune.py — lets you train a specialist model for your UI flows.
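Fine-tuning data for chat models is plain JSONL, one conversation per line. A sketch of how a seed record for train_data.jsonl could be built (the example content is illustrative):

```python
import json
import os

# One training example: the same system prompt Icarus uses, a user turn
# referencing the screenshot, and the human-approved test case as the answer.
record = {
    "messages": [
        {"role": "system",
         "content": "You are a senior QA engineer who writes manual test cases from UI screenshots."},
        {"role": "user",
         "content": "Screenshot: awv/01-login.png\nTest Case Title: Verify login"},
        {"role": "assistant",
         "content": "**Precondition:** User is on the login page.\n"
                    "**Steps:**\n1. Click the Login button.\n"
                    "**Expected Result:** User lands on the dashboard."},
    ]
}

# Append one JSON object per line, as the fine-tuning API expects.
os.makedirs("dataset", exist_ok=True)
with open("dataset/train_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```

The same shape works for feedback_loop.jsonl: when a human corrects a generated case, store the corrected version as the assistant turn.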

Why This Matters

Icarus isn’t “AI magic.” It’s a practical loop:
visual context → structured cases → automation → self-healing.

The benefits:

  • Less churn from brittle tests.
  • Higher coverage of new flows.
  • QA engineers focus on exploration & strategy, not locator babysitting.

Don't Wait, Start Now

You don’t need to wait for a vendor to sell you “AI-powered testing.” You can build your own.

Icarus started as a weekend experiment. Now it’s evolving into a core piece of my self-healing pipeline, and it writes all our test cases.

👉 If you’re serious about AI in QA, try annotating screenshots, run them through GPT-4o or GPT-5, and see what happens.

The point isn’t to replace humans; it’s to free us from brittle, repetitive maintenance so we can test like humans again.


👉 Want more posts like this? Subscribe to the blog or follow me on LinkedIn to get the next one straight to your inbox.