Case Study
Designing a three-agent AI system for semiconductor fleet diagnostics, and why the hardest problem was teaching it to say "I don't know."
The Problem
Sarah is a field service engineer (FSE) with 7 years of experience. Her shift starts at 8 AM. She opens her inbox to 47 alerts that fired overnight. She doesn't know which are critical, which are noise, or which five are the same root cause repeating. She spends 45 minutes manually triaging before she even walks to a tool.
Meanwhile, production waits. That gap between having data and knowing what to do with it was what I set out to solve.
Discovery
12 interviews across four roles, a week embedded in the fab, 468,000 alarm records analyzed, and a fragmented tool landscape audited.
V1 Foundation
V1 was a unified fleet monitoring platform with health-gradient tiles, a Query Builder replacing a 4-6 hour data team cycle with self-serve access, and drill-down from Tool to Error. Shipped to four enterprise customers.
V1 solved data access. But it didn't solve the harder problem.
The Pivot
Real results, but GlobalFoundries Dresden was still seeing 210 interrupts per week. Engineers had visibility. They were still drowning. The problem wasn't data access. It was cognitive overload.
The design question: How do you bring AI into a workflow where a wrong answer costs millions in damaged silicon, without undermining the engineer's expertise? Our PM wanted full automation. I pushed back: 8 of 12 interviewees explicitly rejected autonomous decision-making. They wanted an advisor, not an autopilot. I built advisor prototypes. Customers validated them. Leadership invested.
The System
I led the end-to-end design: research, interaction design, prototyping, and validation. The AI/ML engineers built the models; I defined what they should optimize for and how their outputs surface to users.
Three agents, intentionally separated so each agent's reasoning is independently traceable. The design principle: the AI's authority scales with its confidence, and human overrides strengthen the model rather than simply bypassing it.
The same platform serves all three personas through a role dropdown in the context bar. Same data, different information architecture:
Service Requests follow the 8D methodology (the manufacturing standard for root-cause problem solving). The agent pre-populates the problem statement from the Diagnose screen. D7 (Prevent) triggers the Learning Agent's fleet-wide update.
What should be configurable versus fixed is a design position. I designed the Configuration screen to give operators control over agent behavior without requiring engineering changes:
Design decision: the 65% threshold. Below it, no hypothesis shown (State 3). Above it, diagnosis with uncertainty markers. Configurable per subsystem because a lamp failure and a wafer-in-chamber situation have different risk profiles.
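A minimal sketch of this gating rule, assuming confidence is a value between 0 and 1 and that a score exactly at the threshold is shown rather than withheld. The names `display_for`, `Display`, and the `wafer_handling` override value are hypothetical; only the 65% default and the per-subsystem configurability come from the design.

```python
from enum import Enum


class Display(Enum):
    WITHHOLD = "withhold"  # State 3: no hypothesis shown
    DIAGNOSIS_WITH_UNCERTAINTY = "diagnosis_with_uncertainty"


DEFAULT_THRESHOLD = 0.65  # the 65% default described above

# Hypothetical per-subsystem override: a riskier subsystem demands more confidence.
SUBSYSTEM_THRESHOLDS = {"wafer_handling": 0.80}


def display_for(confidence, subsystem, thresholds=SUBSYSTEM_THRESHOLDS):
    """Decide whether a diagnosis is surfaced for a given subsystem."""
    threshold = thresholds.get(subsystem, DEFAULT_THRESHOLD)
    if confidence < threshold:
        return Display.WITHHOLD
    return Display.DIAGNOSIS_WITH_UNCERTAINTY


print(display_for(0.70, "lamp"))            # uses the 65% default
print(display_for(0.70, "wafer_handling"))  # uses the stricter override
```

The same 70% score surfaces a diagnosis for the lamp and withholds it for wafer handling, which is the point of making the threshold configurable per subsystem.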
Alert Intelligence
The Monitor Agent uses ISA-18.2 temporal correlation (the alarm management standard) to compress raw signals:
The FSE at shift start needs one thing: "what happened overnight, what's urgent, where do I start?" The Monitor Agent answers that in 10 seconds with a structured briefing and 24-hour timeline.
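One simple way to picture temporal correlation is a sliding time window per tool: alarms that follow each other closely on the same tool collapse into one incident led by its first-out alarm. The sketch below is an illustration under that assumption, not the Monitor Agent's actual logic; the `Alarm` and `Incident` types, the `correlate` function, and the 300-second window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    tool: str
    code: str
    t: float  # seconds since the start of the observation period


@dataclass
class Incident:
    tool: str
    first_out: Alarm  # the earliest alarm in the group
    followers: list = field(default_factory=list)


def correlate(alarms, window_s=300.0):
    """Group alarms per tool when each follows the previous one within window_s."""
    incidents = []
    open_by_tool = {}  # tool -> (open incident, time of its latest alarm)
    for alarm in sorted(alarms, key=lambda a: a.t):
        current = open_by_tool.get(alarm.tool)
        if current is not None and alarm.t - current[1] <= window_s:
            current[0].followers.append(alarm)
            open_by_tool[alarm.tool] = (current[0], alarm.t)
        else:
            incident = Incident(alarm.tool, alarm)
            incidents.append(incident)
            open_by_tool[alarm.tool] = (incident, alarm.t)
    return incidents


raw = [
    Alarm("A", "PRESSURE_LOW", 0), Alarm("A", "FLOW_FAULT", 60),
    Alarm("A", "RECIPE_ABORT", 120), Alarm("B", "LAMP_HOURS", 30),
    Alarm("A", "PRESSURE_LOW", 1000),
]
incidents = correlate(raw)
print(len(raw), "alarms ->", len(incidents), "incidents")
```

A production system would also correlate across tools and alarm classes; this sketch only shows how a cascade on one tool can be briefed as a single item.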
Designing for Uncertainty
From Toshiba repair logs, every incident follows: occurrence → response → repair start → repair complete → return to normal. I mapped these stages to five confidence states, each requiring a fundamentally different UI.
"Accept" isn't the default. The FSE must scroll through the evidence cascade first. Acceptance is informed, not automatic.
Design time: 2 days
If the agent showed a 30% guess, the engineer would anchor to it. By showing nothing, the engineer approaches fresh. A wrong diagnosis means replacing the wrong part while the actual failure continues damaging wafers.
Design time: 2 weeks
Two of four data channels are unavailable. "Accept" is disabled. The FSE sees exactly which channels need restoration.
Not "disagree" or "provide feedback." Structured fields: 15 root cause options, resolution, repair time, parts, and 7 categories for why the agent was wrong. Data the Learning Agent can act on.
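The difference between a comment box and structured fields can be sketched as a record type plus validation. The field names below mirror the list above, but the `OverrideRecord` type, the `validate` helper, and the placeholder option values are hypothetical; the actual 15 root-cause options and 7 error categories are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideRecord:
    root_cause: str             # one of the configured root-cause options
    resolution: str
    repair_minutes: int
    parts: tuple                # part identifiers used in the repair
    agent_error_category: str   # why the agent was wrong


def validate(record, root_causes, error_categories):
    """Return the names of fields that fail validation (empty list if valid)."""
    problems = []
    if record.root_cause not in root_causes:
        problems.append("root_cause")
    if record.agent_error_category not in error_categories:
        problems.append("agent_error_category")
    if record.repair_minutes < 0:
        problems.append("repair_minutes")
    return problems


# Placeholder option sets for illustration only.
ROOT_CAUSES = {"cause_a", "cause_b"}
ERROR_CATEGORIES = {"category_a", "category_b"}

record = OverrideRecord("cause_a", "replaced part", 90, ("part-001",), "category_b")
print(validate(record, ROOT_CAUSES, ERROR_CATEGORIES))  # empty list means valid
```

Because every field is constrained, corrections from different engineers can be counted and compared directly instead of being parsed from free text.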
The Learning Loop
In every KB system I studied, corrections are unstructured feedback. The system doesn't learn. I designed override as input.
Real example: From patterns across overrides, the Learning Agent adjusted the lamp threshold from 4,000h to 3,800h across all 47 tools. One FSE's correction improved preventive maintenance for the entire fleet.
Guardrails: Three safeguards prevent bad corrections from cascading fleet-wide: concordance thresholds, configurable staging windows, and contradiction detection.
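How the three safeguards might compose can be sketched as a single gate in front of any fleet-wide change. This is one interpretation, not the shipped logic: the `Correction` type, the `propagation_decision` function, the minimum of 3 concordant overrides, and the 72-hour staging window are all assumed values for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    parameter: str   # e.g. a maintenance threshold name
    direction: int   # +1 to raise the parameter, -1 to lower it


def propagation_decision(corrections, hours_in_staging,
                         min_concordant=3, staging_hours=72.0):
    """Return 'propagate' only when all three safeguards pass."""
    # Contradiction detection: overrides pulling in opposite directions block rollout.
    if len({c.direction for c in corrections}) > 1:
        return "blocked: contradiction"
    # Concordance threshold: enough overrides must agree before acting.
    if len(corrections) < min_concordant:
        return "hold: below concordance threshold"
    # Staging window: the change waits before reaching the whole fleet.
    if hours_in_staging < staging_hours:
        return "hold: staging window open"
    return "propagate"


lower = Correction("lamp_hours_threshold", -1)
print(propagation_decision([lower, lower, lower], hours_in_staging=80))
```

Ordering the checks this way reports the most serious condition first: a contradiction is surfaced even when the staging window has also not elapsed.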
Query Builder
In V1, engineers manually constructed boolean queries across four data channels (Autotest, FDC, Health Index, Metrology). In V2, the engineer types a natural-language question. The agent translates it into structured, editable field chips, with each parameter individually adjustable. A "View SQL" toggle shows the raw query. One sentence replaces four manual conditions.
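The chip-to-SQL half of that flow can be sketched as below, assuming the natural-language step has already produced the chips. The `Chip` type, the `to_sql` function, the `fleet_events` table, and the column names are hypothetical; the sketch emits a parameterized query so that edited chip values never get spliced into the SQL text.

```python
from dataclasses import dataclass

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass
class Chip:
    channel: str   # e.g. "FDC" or "Health Index"
    column: str    # hypothetical column name in the query table
    op: str
    value: object


def to_sql(chips, table="fleet_events"):
    """Render chips as a parameterized SQL string plus its parameter list."""
    clauses, params = [], []
    for chip in chips:
        if chip.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {chip.op}")
        if not chip.column.isidentifier():
            raise ValueError(f"invalid column name: {chip.column}")
        clauses.append(f"{chip.column} {chip.op} ?")
        params.append(chip.value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params


chips = [
    Chip("FDC", "chamber_pressure", ">", 5.0),
    Chip("Health Index", "health_index", "<", 70),
]
sql, params = to_sql(chips)
print(sql)
print(params)
```

Editing a chip changes one entry in the list, and the "View SQL" text can be regenerated from the same structure, so the chips and the raw query cannot drift apart.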
Results
| Metric | Before | After |
|---|---|---|
| Defect resolution | 6 months | 2-3 days |
| Triage time | 45 minutes | Under 2 minutes |
| GF Dresden interrupts | 210 / week | 50 / week |
| Efficiency | Baseline | 30% improvement |
| Agent accuracy | N/A | 90%+ top-1 precision |
| Alarm fatigue | 85% ignore rate | Eliminated |
| Pre-sales impact | | 25% conversion · 4 customers |
Tested with 8 FSEs and 2 PMs; 80% positive. Key refinement: override path streamlined to be accessible from any state.
Sarah's Monday morning now starts with 3 priorities instead of 400+ alarms. She resolves two before walking to the fab floor.
Methodology: Top-1 precision against 200+ resolved SRs. We tracked precision over recall because a withheld diagnosis (State 3) is a designed outcome, not a failure.
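The metric choice can be made concrete with a short sketch: withheld diagnoses are excluded from the precision denominator and reported separately as coverage. The function names and the sample cases are hypothetical, and a predicted value of `None` stands in for a State 3 outcome.

```python
def top1_precision(cases):
    """cases: iterable of (predicted, actual); predicted is None when withheld."""
    shown = [(p, a) for p, a in cases if p is not None]
    if not shown:
        return None  # undefined when nothing was surfaced
    correct = sum(1 for p, a in shown if p == a)
    return correct / len(shown)


def coverage(cases):
    """Fraction of cases where the agent surfaced any diagnosis."""
    cases = list(cases)
    if not cases:
        return None
    return sum(1 for p, _ in cases if p is not None) / len(cases)


sample = [("lamp", "lamp"), ("lamp", "pump"), (None, "valve"), ("pump", "pump")]
print(top1_precision(sample))  # 2 correct out of 3 shown
print(coverage(sample))        # 3 shown out of 4 cases
```

Reporting coverage alongside precision keeps the trade-off visible: a higher threshold raises precision partly by withholding more often.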
Failure Modes
Designing for failure shaped more of this product than designing for success. Each failure mode was stress-tested during shadow deployment before any recommendation surfaced to FSEs.
The evidence cascade shows first-out alarm, downstream signals, and match percentages. The confidence score is context, not a command. Override is always accessible.
The three-agent separation makes this traceable. Each agent logs independently; the Apps Engineer can audit the full chain.
Two additional failure modes (Monitor suppression, Learning propagation) were stress-tested with corresponding detection metrics.
Reflections
400+ to 47, but the FSE has no visibility into what was filtered. I'd add a "353 alarms rationalized" view. Transparency about what the AI removed is as critical as what it shows.
At 5,000+ tools across 12 fabs, the flat tile grid breaks down. I'd move to a fab → zone → bay hierarchy with aggregated health scores.
Designed and validated for cleanroom constraints: WCAG AA contrast throughout, color-blind safe encoding (text labels and directional arrows alongside color, never color alone), 44px touch targets for gloved interaction, ARIA semantics validated with the accessibility team, and monospace signal names sized for arm's length readability.
Five states acknowledge the agent isn't always right. Override gives the FSE authority. The agent recommends, never commands.
Structured corrections enable retraining. A comment field gives text. Structured fields give data the Learning Agent can act on.
Agent cards use identical styling to every other card. No glowing borders. The AI is a tool, not a feature demo.
State 3 prevents anchoring. State 4 prevents premature commitment. State 5 captures knowledge. The happy path is obvious; the edge cases are where decisions matter.