Case Study

Fleet Intelligence AI

Designing a three-agent AI system for semiconductor fleet diagnostics, and why the hardest problem was teaching it to say "I don't know."

Role
Lead Product Designer
Company
Onto Innovation
Domain
Semiconductor Manufacturing
Shipped To
Micron, Intel, GlobalFoundries, Toshiba
Scope
9 interactive screens · 3 role views · V1 + V2
Team
1 UX Director, 2 Product Managers, AI/ML Engineers, Data Engineers, Software Engineers

The Problem

Sarah's Monday morning

Sarah is a field service engineer (FSE) with 7 years of experience. Her shift starts at 8 AM. She opens her inbox to 47 alerts that fired overnight. She doesn't know which are critical, which are noise, or which five are the same root cause repeating. She spends 45 minutes manually triaging before she even walks to a tool.

Meanwhile, production waits. That gap between having data and knowing what to do with it was what I set out to solve.

Equipment across 4 zones: 217
Interrupts / week from one tool: 1,116
Alarm ignore rate: 85% (ISA-18.2 flags >10%)
Annual waste from inaction: $1.2M

Discovery

Going forensic

12 interviews across four roles, a week embedded in the fab, 468,000 alarm records analyzed, and a fragmented tool landscape audited.

Three users, one platform

Field Service Engineer · Sarah Chen
"Which tool do I fix first?"
Diagnose Agent: tool-level investigation with evidence
Fleet Manager · Rene Schmidt
"Are we meeting SLA across zones?"
Monitor Agent: fleet grouping and zone SLA
Apps Engineer · James Okafor
"Where is the system failing and why?"
Learning Agent: override patterns, accuracy, KB health

V1 Foundation

Building the data layer first

V1 was a unified fleet monitoring platform with health-gradient tiles, a Query Builder replacing a 4-6 hour data team cycle with self-serve access, and drill-down from Tool to Error. Shipped to four enterprise customers.

Defect resolution: 6 mo → 2-3 days
Query time: 4-6 hrs → <5 min
Efficiency improvement: 30%
Pre-sales conversion: 25%

V1 solved data access. But it didn't solve the harder problem.

The Pivot

V1 shipped. It wasn't enough.

The results were real, but GlobalFoundries Dresden was still seeing 210 interrupts per week. Engineers had visibility. They were still drowning. The problem wasn't data access. It was cognitive overload.

The design question: how do you bring AI into a workflow where a wrong answer costs millions in damaged silicon, without undermining the engineer's expertise? Our PM wanted full automation. I pushed back: 8 of 12 interviewees explicitly rejected autonomous decision-making. They wanted an advisor, not an autopilot. I built advisor prototypes, customers validated them, and leadership invested.

The System

Three agents, nine screens, three role views

I led the end-to-end design: research, interaction design, prototyping, and validation. The AI/ML engineers built the models; I defined what they should optimize for and how their outputs surface to users.

Three agents, intentionally separated so each agent's reasoning is independently traceable. The design principle: the AI's authority scales with its confidence, and human overrides strengthen the model rather than override it.

Feedback loop:
📡 Monitor: watches signals, groups correlated alarms into situations (400+ → 47)
🔬 Diagnose: investigates tools, scores confidence, shows evidence cascade (5 states)
🧠 Learning: captures corrections, updates knowledge base, improves fleet-wide (1 fix → 47 tools)

The full product: 9 screens, 3 workflows

Fleet Intelligence · Micron F10 Singapore · Onto Innovation
Agent active · Ask Agent
Navigation: Fleet Overview · Alerts (47) · Diagnose · Service Requests · Assignment · Maintenance · Parts & Inventory · Query Builder · Knowledge Base · Reports · Onboard Equipment · Configuration
User: Sarah Chen · FSE · Micron F10
My Assigned: 3 of 217 fleet
Active Diagnoses: 2 (1 high · 1 medium)
Fleet Alerts: 47 (+18 vs yesterday)
Fleet Uptime: 94.2% (-1.3pp this week)
Fleet MTBI: 127h (+11h vs Q4)
Parts Pending: 1 (Xe Lamp · S-C17)
Monitor Agent: 50 errors overnight · 13 tools · 3 priorities
P1 · L-A01 · Down · 4.2h · Trigger board lockup
P2 · S-C17 · Critical · 4.0h · Lamp thermal degrade
P3 · L-B09 · Warning · 2.1h · Turret position drift
Tool Summary · 30 of 217 need attention
L-A01 · Litho-A · DOWN
L-A02 · Litho-A · DOWN
S-C17 · SE-C · CRITICAL
L-A17 · Litho-A · ALARM
L-B09 · Litho-B · WARNING
L-A04 · Litho-A · CAUTION
L-B06 · Litho-B · FAIR
S-C03 · SE-C · GOOD
L-B22 · Litho-B · HEALTHY
Root Cause Summary
1. Trigger board lockup · 19 alerts
2. Turret position error · 13 alerts
3. Xe lamp thermal degradation · 4 alerts
FSE Dashboard: 12 navigation items across 4 workflow groups. Only 30 problem tools shown; 187 healthy tools are invisible by design.

Role-based views from one dropdown

The same platform serves all three personas through a role dropdown in the context bar. Same data, different information architecture:

Fleet Overview · Fleet Manager View
User: Ellen Dong · Fleet Manager
Zone Performance vs SLA
Litho-A: 91.2% · Target: 95% · ▼ 3.8pp
Litho-B: 95.4% · Target: 95% · ▲ 0.4pp
SE-C: 92.8% · Target: 95% · ▼ 2.2pp
BTF-D: 98.7% · Target: 95% · ▲ 3.7pp
Technician Utilization
Sarah Chen: 3 active
Kevin Wong: 1 active
Ya Ching Chang: 1 active
Escalation Tracker
Trigger Board Cluster: L3 recommended
Turret FW Fleet-wide: Awaiting approval
Fleet Manager sees zone SLA, technician utilization, 8D stage distribution, and escalation tracker, with no tool-level detail.
Fleet Overview · Apps Engineer View
User: James Okafor · Apps Eng
Agent Performance · 30 Days
Accuracy: 94% (+3% vs prior)
FSE Overrides: 23 this period
Fleet Updates: 12 from Learning Agent
KB Cases: 1,250 total captured
Override Analysis
Root cause mismatch: 12 / 23
Severity overestimate: 6 / 23
Missing secondary cause: 3 / 23
Historical case mismatch: 2 / 23
Knowledge Base Coverage
Lamp: 92%
Trigger Board: 78%
Turret: 71%
Stage: 54%
Network: 38%
Apps Engineer view: this is effectively an alignment monitoring dashboard. Override analysis categorizes WHY the AI fails. KB coverage shows WHERE it has gaps.

Service Requests: 8D lifecycle

Service Requests follow the 8D methodology (the manufacturing standard for root-cause problem solving). The agent pre-populates the problem statement from the Diagnose screen. D7 (Prevent) triggers the Learning Agent's fleet-wide update.

SR #151204 · Xe Lamp Thermal Degradation · S-C17
D1 Team → D2 Problem → D3 Contain → D4 Root Cause → D5 Corrective → D6 Validate → D7 Prevent → D8 Review
8D lifecycle tracker: D2 auto-populated from Diagnose. D7 triggers Learning Agent fleet-wide update. D8 creates Knowledge Base entry.
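The stage hooks described above can be pictured as a small lookup keyed by 8D stage. This is a minimal sketch, assuming the hooks fire when a stage is completed; the hook names (`learning_agent.fleet_wide_update`, `knowledge_base.create_entry`) are hypothetical labels, not the product's actual interfaces.

```python
from enum import IntEnum
from typing import Optional, Tuple


class Stage(IntEnum):
    D1_TEAM = 1
    D2_PROBLEM = 2
    D3_CONTAIN = 3
    D4_ROOT_CAUSE = 4
    D5_CORRECTIVE = 5
    D6_VALIDATE = 6
    D7_PREVENT = 7
    D8_REVIEW = 8


# Hypothetical hook names; only the D7 and D8 behaviors come from the case study.
HOOKS = {
    Stage.D7_PREVENT: "learning_agent.fleet_wide_update",
    Stage.D8_REVIEW: "knowledge_base.create_entry",
}


def complete_stage(stage: Stage) -> Tuple[Optional[Stage], Optional[str]]:
    """Return (next stage, or None once D8 closes; hook fired by completing `stage`)."""
    next_stage = Stage(stage + 1) if stage < Stage.D8_REVIEW else None
    return next_stage, HOOKS.get(stage)
```

Keeping the hooks in one table makes it easy to audit which stages have side effects outside the service request itself.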

Configuration: AI governance controls

What should be configurable versus fixed is a design position. I designed the Configuration screen to give operators control over agent behavior without requiring engineering changes:

Monitor Agent Settings
Alarm rationalization
Grouping sensitivity: Medium (45-min correlation window)
Predicted alert horizon: 72 hours
Diagnose Agent Settings
Minimum confidence threshold: 65%
Cross-fab case matching
Learning Agent Settings
Fleet-wide updates
Override review period: 24h review
KB auto-capture
Configuration: each agent's behavior is tunable. The 65% confidence threshold determines when State 3 (no hypothesis) activates. The override review period (24h) is a safety valve for the Learning Agent.

Design decision: the 65% threshold. Below it, no hypothesis shown (State 3). Above it, diagnosis with uncertainty markers. Configurable per subsystem because a lamp failure and a wafer-in-chamber situation have different risk profiles.
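A per-subsystem threshold can be thought of as a fleet default plus a handful of overrides. The sketch below is illustrative only: the class, the field names, and the 80% value for a `wafer_handling` subsystem are assumptions, not the shipped configuration schema. Only the 65% default comes from the Configuration screen.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DiagnoseSettings:
    # 65% is the default shown on the Configuration screen.
    default_min_confidence: float = 0.65
    # Hypothetical per-subsystem overrides; keys and values are illustrative.
    per_subsystem: Dict[str, float] = field(default_factory=dict)

    def threshold_for(self, subsystem: str) -> float:
        """Minimum confidence required before a hypothesis is shown."""
        return self.per_subsystem.get(subsystem, self.default_min_confidence)


# A higher-risk subsystem gets a stricter (assumed) threshold.
settings = DiagnoseSettings(per_subsystem={"wafer_handling": 0.80})
```

A lookup with a default keeps the common case simple while letting operators tighten the bar where a wrong hypothesis is more expensive.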

Alert Intelligence

From 400+ alarms to 3 priorities

The Monitor Agent uses ISA-18.2 temporal correlation (the alarm management standard) to compress raw signals:

400+ raw alarms → 47 grouped → 8 situations → 3 priorities (ISA-18.2 correlation)

The FSE at shift start needs one thing: "what happened overnight, what's urgent, where do I start?" The Monitor Agent answers that in 10 seconds with a structured briefing and 24-hour timeline.
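The first compression step, raw alarms to grouped alerts, can be sketched as time-window grouping. This assumes alarms on the same tool belong to one group when each fires within the 45-minute correlation window of the previous one. The `Alarm` fields and the gap-based rule are assumptions for illustration; the Monitor Agent's actual correlation logic is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Alarm:
    tool: str
    code: str
    fired_at: datetime


def group_alarms(alarms: List[Alarm],
                 window: timedelta = timedelta(minutes=45)) -> List[List[Alarm]]:
    """Group alarms per tool; a gap longer than `window` starts a new group."""
    groups: List[List[Alarm]] = []
    latest_by_tool: Dict[str, List[Alarm]] = {}
    for alarm in sorted(alarms, key=lambda a: a.fired_at):
        group = latest_by_tool.get(alarm.tool)
        if group is not None and alarm.fired_at - group[-1].fired_at <= window:
            group.append(alarm)
        else:
            group = [alarm]
            groups.append(group)
            latest_by_tool[alarm.tool] = group
    return groups
```

Widening or narrowing `window` corresponds to the "Grouping sensitivity" setting: a longer window merges more alarms into fewer groups.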

Alerts · Priority Order · Agent View
Filters: Trigger 4 · Turret 9 · Lamp 3 · Focus 2
Priority Order · 19 tools
L-A01 · Down · 4.2h · Trigger
L-A02 · Down · 3.8h · Trigger
S-C17 · Critical · 4.0h · Lamp
L-B04 · Warning · 2.6h · Trigger
L-B09 · Warning · 2.1h · Turret
Monitor Agent Briefing
Tools down: 5
Downtime: 23.4h
Degraded: 12
Predicted: 3
Recommendation: Start with L-A01 (trigger board, 4.2h down). Highest production impact. Historical match at 94%.
8 Root Cause Situations
Trigger board lockup · 19 alerts
Turret position error · 13 alerts
Xe lamp thermal · 4 alerts
+ 5 more situations
Agent Accuracy: 94% over last 30 days
Alerts screen: split panel with filter pills (root cause groups), priority-sorted tool list, and Monitor Agent briefing with production impact KPIs and situation summary.

Designing for Uncertainty

Five confidence states

From Toshiba repair logs, every incident follows: occurrence → response → repair start → repair complete → return to normal. I mapped these stages to five confidence states, each requiring a fundamentally different UI.

State 1 · High Confidence

Shows hypothesis + evidence cascade

94% Confidence · High · 6 of 7 cases matched
Xe Lamp Thermal Degradation
Lamp hours at 4,012h (threshold: 4,000h). Spectral intensity dropped 12% in 48h. Matches Case #1247 (97%).
Accept Diagnosis · Override

"Accept" isn't the default. The FSE must scroll through the evidence cascade first. Acceptance is informed, not automatic.

Design time: 2 days
State 3 · Insufficient Data · My most important design decision

Shows NO hypothesis. Prevents anchoring bias.

⚠ Insufficient Pattern Match · FSE Assessment Required
Agent has insufficient data to form a hypothesis. Closest match: 31% (below threshold). Manual assessment recommended.
Raw Signals · Unfiltered
No first-out alarm identified
Closest match: 31%
FSE Assessment
Describe what you observe at the tool...
Submit Assessment

If the agent showed a 30% guess, the engineer would anchor to it. By showing nothing, the engineer approaches fresh. A wrong diagnosis means replacing the wrong part while the actual failure continues damaging wafers.

Design time: 2 weeks
State 4 · Data Blocked

Diagnosis blocked: missing data channels

☁ Diagnosis Blocked: Missing Data
Autotest: Active
FDC: Stale · 6h behind
Health Index: Active
Metrology: Disconnected
Request Data Sync · Escalate to IT

Two of four data channels are unavailable. "Accept" is disabled. The FSE sees exactly which channels need restoration.
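The choice between showing a hypothesis, withholding it (State 3), and blocking diagnosis (State 4) can be sketched as a small decision function. Two assumptions are made here and are not confirmed by the product: any non-active channel blocks diagnosis, and the blocked check takes precedence over the confidence check. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Dict


class DiagnoseState(Enum):
    SHOW_HYPOTHESIS = "hypothesis with evidence cascade"
    INSUFFICIENT_DATA = "State 3: no hypothesis shown"
    DATA_BLOCKED = "State 4: diagnosis blocked"


def resolve_state(best_match: float,
                  channels: Dict[str, str],
                  threshold: float = 0.65) -> DiagnoseState:
    """Pick which screen to render for a tool under investigation."""
    # Assumed precedence: a stale or disconnected feed blocks diagnosis
    # before any confidence score is considered.
    if any(status != "active" for status in channels.values()):
        return DiagnoseState.DATA_BLOCKED
    if best_match < threshold:
        return DiagnoseState.INSUFFICIENT_DATA
    return DiagnoseState.SHOW_HYPOTHESIS
```

Checking data availability first means a confident-looking score computed from stale inputs never reaches the engineer.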

State 5 · Override

Agent was wrong: structured correction feeds the Learning Agent

✏ Override Agent Diagnosis
Your correction updates the KB and improves diagnoses across 47 similar tools.
Actual Root Cause: Xe Lamp Failure ▾
Resolution Applied: Replaced Xe lamp assembly and recalibrated spectral baseline.
Repair Time: 42 min
Parts Used: Xe Lamp Assembly (1)
Why Was the Agent Wrong? Root cause mismatch ▾
Submit Correction → Knowledge Base

Not "disagree" or "provide feedback." Structured fields: 15 root cause options, resolution, repair time, parts, and 7 categories for why the agent was wrong. Data the Learning Agent can act on.

"State 1 took 2 days. State 3 took 2 weeks. The edge cases aren't exceptions to the design; they are the design."
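The structured override form maps naturally onto a typed record. The sketch below uses the example values from the State 5 form. The class and field names are assumptions, and only the four "why wrong" categories visible in the Override Analysis panel are listed; the other three of the seven are not named in this case study.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import List


class WhyWrong(Enum):
    # Four of the seven categories; the rest are not listed in the source.
    ROOT_CAUSE_MISMATCH = "Root cause mismatch"
    SEVERITY_OVERESTIMATE = "Severity overestimate"
    MISSING_SECONDARY_CAUSE = "Missing secondary cause"
    HISTORICAL_CASE_MISMATCH = "Historical case mismatch"


@dataclass
class OverrideCorrection:
    actual_root_cause: str
    resolution: str
    repair_minutes: int
    parts_used: List[str]
    why_wrong: WhyWrong


correction = OverrideCorrection(
    actual_root_cause="Xe Lamp Failure",
    resolution="Replaced Xe lamp assembly and recalibrated spectral baseline.",
    repair_minutes=42,
    parts_used=["Xe Lamp Assembly"],
    why_wrong=WhyWrong.ROOT_CAUSE_MISMATCH,
)
record = asdict(correction)  # plain dict, ready to store or aggregate
```

Because every field is typed, corrections can be counted and grouped (for example, by `why_wrong`) without parsing free text.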
Try it live:
1. You're viewing State 1 · High Confidence (94%): click Accept Diagnosis to see the Plan of Action
2. Click State 3 · Low and notice that no hypothesis is shown (prevents anchoring bias)
3. Click State 4 · Blocked to see channel stoplights (Active/Stale/Disconnected)
4. Click State 5 · Correction to open the structured override form (15 root causes, 7 "why wrong" categories)
5. Click Run Autotest (right panel) and watch the 12-point diagnostic run live
Diagnose Screen · Interactive Prototype · 5 confidence states · clickable
S-C17 · Atlas II · Xe Lamp Failure Investigation
SE Zone C · Micron F10 · Assigned: Sarah Chen · SR #151204
FSE · Apps Eng · TPS · L3
Unscheduled Down
94% Confidence · High · 6 of 7 historical cases matched
Xe Lamp Thermal Degradation
Lamp at 3,890h of 4,000h rated life. SE_CLTC_TEMP rising steadily over 7 days, autotest_intensity declining. Pattern matches 6 historical cases, all resolved by lamp replacement. Expected repair: 35 to 45 min.
Plan of Action · Pre-populated from 6 matched cases
1. Pause tool · complete current wafer step
2. Replace Xe Lamp · P/N 4710-SE · 3 in stock, Micron F10 cage
3. Run 12-point Autotest · verify SE_CLTC_TEMP below 31°C
4. Run 3 QC wafers · confirm measurement within spec
5. Return to production · update Stoplight to green · close SR #151204
Evidence · First-Out Alarm + Cascade
Primary trigger identified · 7 downstream
First-Out Alarm · Primary Trigger
SE_CLTC_TEMP: 34.2°C · baseline 30.0°C · +4.2°C ▲ · 8.4σ
Crossed 3σ four days ago · Accelerating
[Chart: SE_CLTC_TEMP · 7-day trend · Baseline: 30.0°C · 30°C to 34.2°C · Apr 15 to Today]
DOWNSTREAM
├─ autotest_intensity: 82% · base 95% · −13pp ▼ · consequence
├─ sw_log_warnings: 23/day · base 2/day · +21 ▲ · consequence
├─ Focus_drift: 0.8µm · base 0.2µm · +0.6µm ▲ · downstream
├─ Stage_Wedge_Z: 0.02mm · base 0.01mm · +0.01 ▲ · secondary
UNRELATED · Normal range
Network_Latency: 12ms · base 12ms · 0 ● · normal
Dashed line = baseline · FDC real-time + Autotest daily
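The 8.4σ readout on the first-out alarm is a deviation from baseline expressed in standard deviations. The displayed numbers imply a baseline standard deviation of about 0.5°C (4.2°C divided by 8.4). That value is inferred here for illustration, not stated on the screen.

```python
def sigma_deviation(reading: float, baseline: float, sigma: float) -> float:
    """Deviation from baseline in units of standard deviation (a z-score)."""
    return (reading - baseline) / sigma


# sigma=0.5 is inferred from the displayed +4.2°C and 8.4σ, not a documented value.
z = sigma_deviation(reading=34.2, baseline=30.0, sigma=0.5)
crossed_3_sigma = z > 3
```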
Historical Precedent · 6 of 7 Matched
Ranked by embedding similarity
How similarity is calculated
Three comparison vectors:

Signal pattern · 7-day trend shape
Error codes · codes fired and sequence
Tool model · equipment type and config

97% = nearly identical signals, same error codes, same tool model.
97% · SR #150422 · S-C17 · Lamp replacement at 3,200h
Kevin Wong · 38 min · Mar 2, 2026 · Same tool, same failure mode
Signal · Error codes · Tool model
94% · SR #151089 · S-C06 · Lamp thermal at 3,920h
Ya Ching Chang · 42 min · Jan 2026 · Different tool, same model Atlas II
91% · SR #149066 · QATSL11 · IDE controller + lamp overheat
TPS escalation · 1,339h downtime · Feb 2019 · Toshiba Y5 · Required IDE sandwich replacement
89% · SR #130344 · QATSL05 · Halogen lamp focus error
FSE heard noise from Y-stage · 72.5h downtime · Mar 2018 · Toshiba Y5
86% · GF-275562 · AMI1400 · Pressure alarm after lamp thermal event
Rene Schmidt · WW19-24 · GF Dresden · Shielded cable replacement resolved
82% · QATSG05 · Halogen lamp QC fail
Toshiba Y5 · 78h downtime · Atlas II+ · Lamp replacement + optics cleaning
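The ranking above can be approximated as cosine similarity computed per comparison vector and then combined. This sketch assumes equal weights across the three vectors and takes embeddings as plain lists of floats; the actual embedding model and weighting are not described in this case study, and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def case_similarity(current: Dict[str, List[float]],
                    past: Dict[str, List[float]]) -> float:
    """Equal-weight average of cosine similarity over the three comparison vectors."""
    keys = ("signal_pattern", "error_codes", "tool_model")
    return sum(cosine(current[k], past[k]) for k in keys) / len(keys)
```

Scoring each vector separately also makes the result explainable: the UI can show which of the three dimensions contributed to a match.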
Diagnose Agent
High confidence. SE_CLTC_TEMP is the first-out alarm, and all 7 downstream signals trace to lamp thermal degradation. The lamp has been replaced twice before on S-C17 at similar hours. Avg resolution across 6 cases: 42 min. Not an ANALYSISENGEER software crash. Recommend lamp replacement.
Stoplight Chart · S-C17
Daily tracking · Owner: Rene Schmidt
Xe Lamp replacement · Open
SR #151204 · Pending FSE action · Opened today · D1 · D3 · D8
Auto Focus optics cal · On track
SR #151089 · D4 confirmed · Ya Ching Chang
Stage alignment · Closed
SR #150891 · D8 closed · Sarah Chen
Remote Actions
Lamp subsystem · Atlas II · S-C17
Connected to S-C17 via PLC · Tool state: Down · Safe to execute
Run Autotest: 12-point diagnostic · ~15 min
Pause Tool: Complete current wafer step, then idle
Trigger Calibration: Optics + stage recal · Tool must be idle
Restart Lamp Controller: Soft restart lamp subsystem · No wafer impact
Tool Context
Lamp Hours: 3,890h / 4,000h
Last PM: Apr 10, 2026
MTBI (30d): 142h
Fleet MTBI (4wk avg): 123h
Last Lamp Replace: Mar 2, 2026 at 3,200h
Total SRs (Quarter): 3
Data Freshness: FDC real-time
[Chart: MTBI · 4-week rolling avg · S-C17 vs Fleet · W14 to W17 · S-C17 142h · Fleet 123h]
FSE Notes
Document your observations

The Learning Loop

One correction improves 47 tools

In every KB system I studied, corrections are unstructured feedback. The system doesn't learn. I designed override as input.

FSE overrides diagnosis → Learning Agent captures structured correction → KB + thresholds updated fleet-wide → Fewer overrides over time

Real example: From patterns across overrides, the Learning Agent adjusted the lamp threshold from 4,000h to 3,800h across all 47 tools. One FSE's correction improved preventive maintenance for the entire fleet.

Guardrails: Three safeguards prevent bad corrections from cascading fleet-wide: concordance thresholds, configurable staging windows, and contradiction detection.
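One way the three safeguards could combine is as a single promotion gate. This is a sketch under stated assumptions, not the shipped logic: the `Correction` type, the function name, and the `min_concordant` value of 3 are hypothetical. The 24-hour staging default mirrors the override review period on the Configuration screen.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Correction:
    root_cause: str
    submitted_at: datetime


def ready_for_fleet(corrections: List[Correction],
                    now: datetime,
                    min_concordant: int = 3,  # hypothetical concordance threshold
                    staging: timedelta = timedelta(hours=24)) -> bool:
    """Promote a change fleet-wide only if all three guardrails pass."""
    if not corrections:
        return False
    # Contradiction detection: every correction must name the same root cause.
    if len({c.root_cause for c in corrections}) > 1:
        return False
    # Concordance threshold: enough independent corrections must agree.
    if len(corrections) < min_concordant:
        return False
    # Staging window: the newest correction must have aged past the review period.
    newest = max(c.submitted_at for c in corrections)
    return now - newest >= staging
```

Each check fails closed, so a single disagreeing or too-recent correction keeps the change out of the fleet until a person reviews it.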

Query Builder

Query Builder V2: natural language meets structured editing

In V1, engineers manually constructed boolean queries across four data channels (Autotest, FDC, Health Index, Metrology). In V2, the engineer types a natural-language question. The agent translates it into structured, editable field chips, with each parameter individually adjustable. A "View SQL" toggle shows the raw query. One sentence replaces four manual conditions.
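The editable field chips can be pictured as small typed filters that compile to a parameterized query. In this sketch the table and column names (`fdc_readings`, `tool_zone`, `value_deg`, and so on) are hypothetical; only the example values (Litho tools, turret position drift, above 0.05°, 14 days) come from the prototype's sample query.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

ALLOWED_OPERATORS = {"=", ">", "<", ">=", "<=", "LIKE"}


@dataclass
class Chip:
    column: str    # assumed to come from a fixed, trusted list of fields
    operator: str
    value: Any     # user-editable; always passed as a bound parameter


def to_sql(table: str, chips: List[Chip]) -> Tuple[str, List[Any]]:
    """Compile chips into a parameterized query; values are never inlined."""
    clauses, params = [], []
    for chip in chips:
        if chip.operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {chip.operator}")
        clauses.append(f"{chip.column} {chip.operator} ?")
        params.append(chip.value)
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params


chips = [
    Chip("tool_zone", "LIKE", "Litho-%"),
    Chip("signal", "=", "turret_position_drift"),
    Chip("value_deg", ">", 0.05),
    Chip("age_days", "<=", 14),
]
sql, params = to_sql("fdc_readings", chips)
```

Editing one chip changes one bound parameter, which is what lets the "View SQL" toggle stay in sync with the chips.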

Try it live:
1. Click Run Query and watch the agent reason through your question step by step
2. Click View SQL to see the raw query the agent generated
3. Expand a row (click on L-B09) to drill into readings and the sparkline chart
4. Switch to the Chart tab for the SPC trace, bar comparison, and configurable chart playground
5. Toggle the V1: Boolean Builder tab to see what the same query looked like before AI
Query Builder
4 data channels · 12,847 signals · 217 tools
Query Agent
Reading query... identifying "Litho tools", "turret position drift", "above 0.05°", "14 days"
Selecting channels: FDC (turret position), Health Index (tool model), Autotest (calibration)
Building query... 9 tools matched across Litho-A and Litho-B

Results

That $1.2M cost of inaction: this is the response.

Metric | Before | After
Defect resolution | 6 months | 2-3 days
Triage time | 45 minutes | Under 2 minutes
GF Dresden interrupts | 210 / week | 50 / week
Efficiency | Baseline | 30% improvement
Agent accuracy | N/A | 90%+ top-1 precision
Alarm fatigue | 85% ignore rate | Eliminated
Pre-sales impact | | 25% conversion · 4 customers

Tested with 8 FSEs and 2 PMs: 80% positive. Key refinement: override path streamlined to be accessible from any state.

Sarah's Monday morning now starts with 3 priorities instead of 400+ alarms. She resolves two before walking to the fab floor.

Methodology: Top-1 precision against 200+ resolved SRs. We tracked precision over recall because a withheld diagnosis (State 3) is a designed outcome, not a failure.
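The precision-over-recall choice can be made concrete: a withheld diagnosis is excluded from the denominator rather than counted as a miss. A minimal sketch, assuming each resolved SR is recorded as a pair of (shown root cause or `None` if withheld, confirmed root cause); the function name and data shape are illustrative.

```python
from typing import List, Optional, Tuple


def top1_precision(results: List[Tuple[Optional[str], str]]) -> float:
    """Fraction of shown top-1 diagnoses that matched the confirmed root cause.

    Withheld diagnoses (predicted is None, i.e. State 3) are excluded,
    so declining to guess never lowers precision.
    """
    shown = [(predicted, actual) for predicted, actual in results if predicted is not None]
    if not shown:
        return 0.0
    return sum(predicted == actual for predicted, actual in shown) / len(shown)
```

Under a recall metric the withheld case would count against the agent, which would reward low-confidence guessing, the opposite of what State 3 is designed to do.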

Failure Modes

What happens when the AI is wrong

Designing for failure shaped more of this product than designing for success. Each failure mode was stress-tested during shadow deployment before any recommendation surfaced to FSEs.

Diagnose Agent: confidently wrong at 94%

What if it shows high confidence for the wrong root cause?

The evidence cascade shows first-out alarm, downstream signals, and match percentages. The confidence score is context, not a command. Override is always accessible.

Cross-agent failure: cascading errors

What if Monitor groups incorrectly, causing Diagnose to match the wrong pattern?

The three-agent separation makes this traceable. Each agent logs independently; the Apps Engineer can audit the full chain.

Two additional failure modes (Monitor suppression, Learning propagation) were stress-tested with corresponding detection metrics.

Reflections

"Designing for AI failure is harder than designing for AI success."

What I'd do differently

Suppressed alarm transparency

The Monitor Agent reduces 400+ alarms to 47 grouped alerts, but the FSE has no visibility into what was filtered. I'd add a "353 alarms rationalized" view. Transparency about what the AI removed is as critical as what it shows.

Scalability beyond 217 tools

At 5,000+ tools across 12 fabs, the flat tile grid breaks down. I'd move to a fab → zone → bay hierarchy with aggregated health scores.

Accessibility in a fab environment

Designed and validated for cleanroom constraints: WCAG AA contrast throughout, color-blind safe encoding (text labels and directional arrows alongside color, never color alone), 44px touch targets for gloved interaction, ARIA semantics validated with the accessibility team, and monospace signal names sized for arm's length readability.

Design principles

Trust through transparency

Five states acknowledge the agent isn't always right. Override gives the FSE authority. The agent recommends; it never commands.

Override as input, not feedback

Structured corrections enable retraining. A comment field gives text. Structured fields give data the Learning Agent can act on.

AI features feel native

Agent cards use identical styling to every other card. No glowing borders. The AI is a tool, not a feature demo.

The edge cases are the design

State 3 prevents anchoring. State 4 prevents premature commitment. State 5 captures knowledge. The happy path is obvious; the edge cases are where decisions matter.