360° Test Intelligence Playbook for QE Directors

Blunt truth: If you don’t instrument usage and wire it into CI, you’re guessing. This playbook turns production reality into test strategy, automation ROI, and faster, safer releases.
0) Outcomes (What “Good” Looks Like)
- Automate the right things: High‑usage, high‑criticality, deterministic flows.
- Explore the right things: Low‑usage but high‑risk, volatile, or novel areas.
- Shrink risk fast: Top‑N usage × low‑coverage gaps closed each sprint.
- Cut waste: Quarantine and fix flakes within an SLA; stop rerunning noise.
- Prove value: Dashboards link user behavior → test coverage → defects → cycle time.
1) Architecture at a Glance
Prod (web/API/mobile) ──> Event/Trace Logs (Mixpanel/Amplitude/OTel/APM/NGINX)
                                │
                                ▼
                Usage Warehouse (BigQuery/Snowflake/Redshift)
                                │
                                ▼
     Test Meta (repo) + Results (CI) + Coverage (JaCoCo/nyc/coverage.py)
                                │
                                ▼
                  Join & Score (dbt/SQL/Python batch)
                                │
                                ▼
               Dashboards (Grafana/Looker/Metabase/Sheets)
No Mixpanel? Use API gateway/access logs (NGINX/ALB/CloudFront/Kong), APM traces (Datadog/New Relic), or server telemetry via OpenTelemetry. You do not need a paid analytics tool to start.
2) Instrumentation: Minimum Viable Signals
- Define canonical events (5–15 to start): Login Started/Completed, Checkout Started/Completed, Search Performed, Profile Updated, File Uploaded.
- Attach context: device, browser, app_version, plan, customer_id/tier, locale, experiment_id.
. - Backstop with access logs: ensure every critical API has structured logs (status, latency, route name, user/tenant).
- Opt‑in privacy: hash PII; never store secrets in events; follow DPA/SOC2.
Rule: If a flow is not observable in prod, it’s not critical enough to gate releases.
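To make the context and privacy bullets concrete, here is a minimal payload sketch in Python; emit_event, hash_pii, and the salt handling are hypothetical illustrations, not any vendor's SDK.
import hashlib, json, time

SALT = "rotate-me-quarterly"  # hypothetical salt; keep it in a secret manager, not in code

def hash_pii(value: str) -> str:
    # One-way hash so events can be joined per user without storing raw PII
    return hashlib.sha256((SALT + value).encode()).hexdigest()

def emit_event(name: str, user_email: str, **context) -> str:
    # Hypothetical emitter: in practice this call would go to Mixpanel/Amplitude/OTel
    payload = {
        "event": name,                      # canonical name, e.g. "Checkout Completed"
        "customer_id": hash_pii(user_email),
        "ts": int(time.time()),
        **context,                          # device, browser, app_version, plan, locale, ...
    }
    return json.dumps(payload)

print(emit_event("Checkout Completed", "ada@example.com",
                 device="mobile", app_version="4.2.1", plan="enterprise", locale="de-DE"))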
3) Test Taxonomy & Tagging (Source of Truth)
Keep it small, stable, and enforced.
Tags (required unless noted):
- flow: primary user journey (signup, checkout, search) (required)
- feature: subsystem (payments, profile)
- risk: critical|high|medium|low (required)
- mixpanel: event name or ID (Checkout Completed), or event: if not using Mixpanel
- persona: enterprise_admin, smb_user, api_client
- owner: team or person (team_payments, qe_thomas)
- platform: web, api, ios, android
- component: code area (svc-billing, ui-cart)
Playwright TypeScript (title tags + annotations)
import { test, expect } from '@playwright/test';

const meta = { flow: 'checkout', feature: 'payments', risk: 'critical', mixpanel: 'Checkout Completed', persona: 'smb_user' };
const tags = (t: Record<string, string>) => Object.entries(t).map(([k, v]) => `[${k}:${v}]`).join('');

test(`${tags(meta)} user completes checkout`, async ({ page }, testInfo) => {
  // Mirror the title tags as annotations so reporters (JSON/Allure) can read them
  for (const [k, v] of Object.entries(meta)) {
    testInfo.annotations.push({ type: k, description: String(v) });
  }
  // test body
});
Run subsets
npx playwright test --grep "(?=.*flow:checkout)(?=.*risk:critical)"   # AND two tags via look-ahead regex
Pytest (markers)
# pytest.ini
[pytest]
markers =
    flow(name): primary user flow
    feature(name): functional slice
    mixpanel(event): linked analytics event
    risk(level): critical|high|medium|low
    persona(name): persona
    owner(name): owner
    platform(name): runtime
    component(name): code area
import pytest
@pytest.mark.flow("checkout")
@pytest.mark.feature("payments")
@pytest.mark.mixpanel("Checkout Completed")
@pytest.mark.risk("critical")
@pytest.mark.persona("smb_user")
def test_checkout_happy_path(page):
    ...
Enforce in CI (pytest)
# conftest.py
def pytest_collection_modifyitems(config, items):
    missing = []
    for it in items:
        names = {m.name for m in it.own_markers}
        if 'flow' not in names or 'risk' not in names:
            missing.append(it.nodeid)
    if missing:
        raise SystemExit(f"Tests missing required tags: {missing}")
4) Data Model (Join Keys)
- Usage: flow, event (Mixpanel or access-log route name), platform, persona, version
- Tests: test_id, flow, risk, owner, component
- Coverage: file, component, %_covered
- Results: test_id, pass/fail, duration_ms, flaky (boolean), timestamp, sha
- Defects: flow, component, severity, created_at
Unify flow names across analytics, tests, and defects. Maintain a small flow_aliases map to normalize.
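A minimal sketch of that normalization, assuming a hand-maintained alias map (the names below are illustrative):
# Map every producer's spelling of a flow onto one canonical name
FLOW_ALIASES = {
    "checkout_v2": "checkout",
    "purchase": "checkout",
    "Checkout Completed": "checkout",
    "sign-in": "login",
}

def normalize_flow(raw: str) -> str:
    # Fall back to a lowercased, underscored form if no alias is registered
    return FLOW_ALIASES.get(raw, raw.strip().lower().replace(" ", "_"))

assert normalize_flow("purchase") == "checkout"
assert normalize_flow("Search Performed") == "search_performed"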
5) Scoring: What to Automate vs Explore
Risk Score R (0–100) per flow/component:
R = w1*Usage + w2*BusinessCriticality + w3*FailureImpact + w4*DefectDensity + w5*ChangeRate
    + w6*(1 - Coverage) + w7*PlatformDominance + w8*CustomerTierWeight - w9*Determinism
- Usage: normalized event rate (p95 over 7–14 days).
- BusinessCriticality: static (checkout/login=1.0; marketing page=0.2).
- FailureImpact: revenue/support/SLA blast radius.
- DefectDensity: defects per KLOC or per week.
- ChangeRate: commits/touches over last 30 days.
- Coverage: line/branch/function coverage blended (cap at 90%).
- PlatformDominance: e.g., if 70% Safari usage → weight Safari tests.
- CustomerTierWeight: enterprise or regulated tenants get higher weight.
- Determinism: how stable is the surface (lower determinism → less automation now).
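A minimal scoring sketch following the formula above; the weights and the expectation that inputs are pre-normalized to 0..1 are illustrative defaults, not calibrated values:
WEIGHTS = dict(usage=25, criticality=15, impact=15, defects=10, change=10,
               coverage_gap=15, platform=5, tier=5, determinism=10)  # illustrative

def risk_score(f: dict) -> float:
    # All inputs are expected to be pre-normalized to 0..1
    r = (WEIGHTS["usage"] * f["usage"]
         + WEIGHTS["criticality"] * f["criticality"]
         + WEIGHTS["impact"] * f["impact"]
         + WEIGHTS["defects"] * f["defect_density"]
         + WEIGHTS["change"] * f["change_rate"]
         + WEIGHTS["coverage_gap"] * (1 - min(f["coverage"], 0.9))   # cap coverage at 90%
         + WEIGHTS["platform"] * f["platform_dominance"]
         + WEIGHTS["tier"] * f["customer_tier"]
         - WEIGHTS["determinism"] * f["determinism"])
    return max(0.0, min(100.0, r))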
Decision Matrix
- Automate now: R ≥ 70 and the surface is deterministic (stable DOM/APIs, mockable data).
- Automate partial + explore: 50 ≤ R < 70, or moderately volatile.
- Explore only: R < 50, or highly volatile, one-off admin tooling.
- Don't invest: low R; sunset candidates.
Calibrate weights quarterly; publish the thresholds so product/eng can predict gate expectations.
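Those thresholds translate into a small decision helper; this sketch uses the cut-offs above plus an illustrative floor for "don't invest":
def decide(r: float, deterministic: bool) -> str:
    # Mirrors the decision matrix: automate, split, explore, or skip
    if r >= 70 and deterministic:
        return "automate now"
    if r >= 50:
        return "automate partial + explore"
    if r >= 20:  # illustrative floor; below it the flow is a sunset candidate
        return "explore only"
    return "don't invest"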
6) Pipelines: How to Get the Data
6.1 Usage (3 paths)
A) Mixpanel/Amplitude API
- Nightly job pulls Top-N events per flow, platform, persona (last 14d).
- Store to usage_daily(flow, event, count, platform, persona, date).
B) Access/APM Logs (no analytics tool)
- Parse NGINX/ALB logs and map route → flow via a lookup table (see the sketch after this list).
- Datadog/New Relic traces: aggregate spans by service, route, status.
C) Product DB (last resort)
- Derive funnels from state transitions (e.g., orders(status)), but expect lag and noise.
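For path B, a minimal route→flow aggregation sketch; the lookup table and the parsed-line shape are illustrative, and it pairs with the access-log regex in the appendix:
from collections import Counter

# Hand-maintained lookup: API route -> canonical flow
ROUTE_TO_FLOW = {
    "/api/v1/checkout": "checkout",
    "/api/v1/login": "login",
    "/api/v1/search": "search",
}

def usage_from_routes(parsed_lines):
    # parsed_lines: iterable of dicts like {"route": "/api/v1/checkout", "status": "200"}
    counts = Counter()
    for line in parsed_lines:
        flow = ROUTE_TO_FLOW.get(line["route"])
        if flow and line["status"].startswith("2"):  # count successful calls only
            counts[flow] += 1
    return counts  # feeds usage_daily(flow, event, count, ...)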
6.2 Coverage
- JS/TS: nyc (Istanbul) JSON + per-file coverage.
- Python: coverage.py XML.
- Java: JaCoCo XML.
- Normalize to coverage(component, file, pct_lines, pct_branches, sha, date).
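For the Python path, a normalization sketch assuming coverage.py's Cobertura-style coverage.xml; the component mapping from file paths is a naive illustration:
import xml.etree.ElementTree as ET

def coverage_rows(xml_path: str, sha: str, date: str):
    # `coverage xml` emits <class filename=... line-rate=... branch-rate=...> elements
    rows = []
    for cls in ET.parse(xml_path).getroot().iter("class"):
        filename = cls.get("filename")
        rows.append({
            "component": filename.split("/")[0],      # naive: top-level dir as component
            "file": filename,
            "pct_lines": float(cls.get("line-rate", 0)) * 100,
            "pct_branches": float(cls.get("branch-rate", 0)) * 100,
            "sha": sha,
            "date": date,
        })
    return rows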
6.3 Test Results
- Playwright JSON report or JUnit XML → results(test_id, pass, duration_ms, flaky, date, sha).
- A test is flaky if it goes failed→passed without a code change, or fails intermittently (< X% of the last 10 runs).
6.4 Defects
- Pull from Jira/Linear: defects(flow, component, severity, created/resolved).
6.5 Join Job
- Use dbt or a Python batch to compute per‑flow scores and the Top‑N Gaps:
- High Usage × Low Coverage
- High Usage × High Defect Rate
- High Revenue × Any Risk
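If dbt is not in place yet, a plain Python pass over the two tables is enough to start; the table shapes below are illustrative and follow the section 4 model:
def top_gaps(usage, coverage, top_n=10, min_coverage=0.8):
    # usage: {flow: event_count}; coverage: {flow: pct_lines as 0..1}
    ranked = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [
        {"flow": flow, "usage": cnt, "coverage": coverage.get(flow, 0.0)}
        for flow, cnt in ranked
        if coverage.get(flow, 0.0) < min_coverage   # High Usage × Low Coverage
    ]

print(top_gaps({"checkout": 90_000, "search": 40_000, "profile": 900},
               {"checkout": 0.55, "search": 0.92}))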
7) CI Integration & Quality Gates
Pre‑merge (fast)
- Run critical + high risk tests for Top‑N flows (from last 14d usage).
- Fail if: pass-rate < 99%, new flakes are introduced, or coverage drops on touched components.
Nightly (broad)
- Rotate flows by usage bucket; run cross‑browser matrix aligned with platform dominance.
Weekly (deep)
- Full suite + accessibility + performance budgets + contract tests.
Gates
- Block merge if a test lacks required tags (flow, risk).
- Block release if Top-N flow coverage < 80% or the flake budget is exceeded.
Sharding & Stability
- Deterministic test ordering; timing‑aware sharding; per‑shard parallelism; JUnit/Allure artifacts per shard; flake reruns off by default.
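A pre-release gate can be a plain script in the pipeline. This sketch assumes the three inputs are computed upstream and mirrors the thresholds above:
import sys

def release_gate(pass_rate: float, flake_rate: float, top_n_coverage: float) -> list[str]:
    # Returns the list of violated gates; an empty list means the release may proceed
    violations = []
    if pass_rate < 0.99:
        violations.append(f"pass rate {pass_rate:.2%} < 99%")
    if flake_rate > 0.02:
        violations.append(f"flake budget exceeded: {flake_rate:.2%} > 2%")
    if top_n_coverage < 0.80:
        violations.append(f"Top-N flow coverage {top_n_coverage:.0%} < 80%")
    return violations

if __name__ == "__main__":
    problems = release_gate(pass_rate=0.995, flake_rate=0.03, top_n_coverage=0.82)
    if problems:
        sys.exit("Blocked: " + "; ".join(problems))  # non-zero exit fails the CI job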
8) Flake Management
- Detect: a statistical flake detector (failures across unrelated SHAs; sketched below) labels tests as @flaky temporarily.
- Quarantine lane: a stable pipeline runs quarantined tests in parallel but cannot block merges; the dashboard shows the debt.
- SLA: critical flakes fixed in 48h, others in 7d; after an SLA breach, escalate to the owner's EM.
- Root causes: async waits, test data races, network timeouts, shared state, clock, 3rd-party services.
- Fix patterns: strict locators, await expect with realistic timeouts, idempotent fixtures, hermetic data.
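A minimal detector sketch for the "failures across unrelated SHAs" rule; the result shape and thresholds are illustrative:
from collections import defaultdict

def detect_flaky(results, min_runs=10, low=0.1, high=0.9):
    # results: iterable of dicts {"test_id": ..., "sha": ..., "passed": bool}
    by_test = defaultdict(list)
    for r in results:
        by_test[r["test_id"]].append(r)
    flaky = set()
    for test_id, runs in by_test.items():
        if len(runs) < min_runs:
            continue
        fail_rate = sum(not r["passed"] for r in runs) / len(runs)
        failing_shas = {r["sha"] for r in runs if not r["passed"]}
        # Intermittent failures spread across several SHAs look like flake, not regression
        if low <= fail_rate <= high and len(failing_shas) > 1:
            flaky.add(test_id)
    return flaky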
9) Dashboards (what to show)
Exec Tab
- Release readiness: pass‑rate, flake budget, Top‑N flow coverage, time‑to‑green, median CI time.
- Risk heatmap by flow × platform.
QE Lead Tab
- Top‑N Usage × Low Coverage (click → create Jira ticket template).
- Flake trends by owner/component.
- Slowest 20 tests; impact if fixed.
Engineer Tab
- My changes: tests impacted, coverage delta, historical failures.
- Ownership: tests with owner:me|myteam that are failing or flaky.
SLOs
- ≤ 2% flake rate in blocking pipelines.
- ≥ 95% stability for critical flows.
- ≥ 80% coverage on Top-N flows.
10) Example Queries & Jobs
Top flows last 14d (Mixpanel)
SELECT flow, platform, SUM(events) AS cnt
FROM mixpanel_daily
WHERE date >= CURRENT_DATE - INTERVAL '14 DAY'
GROUP BY 1,2
ORDER BY cnt DESC
LIMIT 20;
Coverage gap
SELECT u.flow, u.platform, u.cnt, COALESCE(c.pct_lines,0) AS coverage
FROM top_usage u
LEFT JOIN coverage_by_flow c USING (flow)
WHERE COALESCE(c.pct_lines,0) < 0.8
ORDER BY u.cnt DESC;
Flake rate
-- Intermittent failures over a window approximate flakiness (always-failing tests are excluded by the 0.9 cap)
SELECT test_id,
       SUM(CASE WHEN status='fail' THEN 1 ELSE 0 END)::float / COUNT(*) AS flake_rate
FROM ci_results
WHERE date >= CURRENT_DATE - INTERVAL '14 DAY'
GROUP BY test_id
HAVING COUNT(*) >= 5
   AND SUM(CASE WHEN status='fail' THEN 1 ELSE 0 END)::float / COUNT(*) BETWEEN 0.1 AND 0.9;
11) Templates
Jira ticket: Coverage Gap
Title: Automate {flow} happy path on {platform}
Desc: {flow} is Top‑N by usage (p95 {usage}/day) with coverage {coverage}.
AC:
- Playwright e2e covers start→complete
- Deterministic locators, hermetic data
- Cross‑browser parity as per platform mix
- Allure labels: flow, risk, owner
Release Gate Policy (snippet)
- Block release if any Top‑N flow fails in pre‑release run.
- Block release if flake budget > 2% over rolling 7 days.
- Block release if coverage on Top‑N flows < 80%.
12) People & Process
- Single owner per flow (QE/Dev co‑ownership). No orphaned flows.
- Weekly risk review: update Top‑N from usage; pick 3 gap tickets.
- Exploratory charters driven by anomalies (rage clicks, drop‑offs, support spikes).
- Security/Compliance: validate event schemas; quarterly privacy review.
13) Tooling Options (pick 1 per category to start)
- Usage: Mixpanel / Amplitude / Heap / (DIY: NGINX + BigQuery) / OTel + Datadog.
- Coverage: Istanbul/nyc (JS), coverage.py (Py), JaCoCo (JVM).
- Results: JUnit XML + Allure / Playwright JSON report.
- Dashboards: Grafana / Looker / Metabase / Google Sheets (MVP).
- Pipelines: dbt / Airflow / GitHub Actions nightly / Python cron.
14) 30‑60‑90 Rollout
Days 0–30 (MVP)
- Define the flow catalog. Tag the top 10 tests with flow + risk.
- Add one usage source (Mixpanel or access logs). Nightly Top-N export.
- Parse CI JUnit + coverage into a single table. Ship the first dashboard.
Days 31–60
- Enforce tag lint in CI. Add flake detector + quarantine lane.
- Establish release gates for Top‑N flows.
- Automate 5 highest R flows end‑to‑end.
Days 61–90
- Expand to platform/persona splits. Add defect join.
- Publish quarterly weights & thresholds. Socialize with PM/EMs.
- Make Top‑N coverage an OKR.
15) FAQ / Reality Checks
- Do we need Mixpanel? Helpful, not mandatory. Access/APM logs + OTel traces get you 80%.
- 100% automation? Wasteful. Optimize for risk reduction per engineer hour.
- Flakes are inevitable. Measure, quarantine fast, and burn them down weekly.
- Coverage % is a proxy. Combine with usage and defect density or it will lie to you.
16) Appendix: Useful Snippets
Playwright JSON → DB (Node, sketch)
import fs from 'fs';
const report = JSON.parse(fs.readFileSync('playwright-report/output.json','utf8'));
// Map to rows {test_id, status, duration_ms, annotations}
Mixpanel Export (Python, sketch)
import requests, datetime as dt
start = (dt.date.today()-dt.timedelta(days=14)).isoformat()
# hit JQL/Export API for event counts by name/platform/persona
Access Log Parser (Python, sketch)
import re
rx = re.compile(r'"(GET|POST) (?P<route>[^ ]+) [^"]+" (?P<status>\d{3})')
# map route->flow via lookup; aggregate counts per day
A Few Last Words...
This is not a tooling project; it's governance + data + discipline. Instrument usage, tag tests, join the data, score risk, enforce gates, and iterate. Do this, and your automation budget will go where your customers actually live; your releases will tell the truth before your users do.