OEE, Little's Law, cost of ownership, and the ROI translation that determines whether your model gets deployed.
Layer 5.1
Overall Equipment Effectiveness
A data scientist presents a model that improves etch yield by 0.5%. The Fab Director asks: "Will this slow down the tool?" If the answer is yes, the model may be rejected despite the yield gain. In a fab running at capacity, time is the ultimate currency. A tool that produces perfect wafers but runs at 80% of its rated throughput is less valuable than a tool that produces wafers with 1% more scrap but runs at full speed. Understanding this tradeoff is what separates a data scientist who ships models from one who builds impressive demos that never reach production.
The OEE Equation
OEE measures the percentage of manufacturing time that is truly productive. It is the product of three factors: Availability, Performance, and Quality.
1. Availability
The fraction of scheduled time that the tool is physically ready to process wafers.
ML impact: An ML model that generates false positive FDC alarms directly reduces Availability. Every spurious alarm halts the tool, triggers an engineering investigation, and burns planned uptime.
2. Performance
The fraction of theoretical maximum speed the tool achieves while running.
ML impact: An ML model that requires an extra 2 seconds per wafer for cloud inference directly reduces Performance. At 200 wafers per day, that is 400 seconds of lost throughput daily - on a tool that may cost $10,000 per hour to run.
3. Quality
The fraction of processed wafers that pass specification without rework.
ML impact: An ML model that detects excursions early directly increases Quality. This is the component where good ML models create value - but it must outweigh the Availability and Performance costs.
The Tradeoff Calculus
If an ML model improves Quality by 1% but requires an extra measurement step that reduces Performance by 2%, the OEE decreases: because OEE is a product, the net factor is 1.01 x 0.98 = 0.9898, a loss of about 1%. In a sold-out fab where every wafer slot is immediately filled by demand, a drop in OEE means a drop in factory revenue. The only exception: if the tool is not the factory bottleneck, Performance losses at that tool do not constrain factory output. The model's deployment priority and latency requirements are therefore determined by whether the tool is a bottleneck tool.
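The tradeoff arithmetic can be checked in a few lines. This is a minimal sketch, not a fab-standard calculator: the baseline Availability, Performance, and Quality values are hypothetical, and the 1% and 2% changes are treated as relative changes to each factor, which is one reading of the text.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """OEE is the product of the three factors, each given as a fraction."""
    return availability * performance * quality


# Hypothetical baseline, chosen only for illustration.
baseline = {"availability": 0.90, "performance": 0.95, "quality": 0.98}

# Model effect, applied as relative changes (an assumption):
# Quality improves by 1%, Performance drops by 2%.
with_model = {
    "availability": baseline["availability"],
    "performance": baseline["performance"] * 0.98,
    "quality": baseline["quality"] * 1.01,
}

oee_before = oee(**baseline)
oee_after = oee(**with_model)
relative_change = oee_after / oee_before - 1.0

print(f"OEE before: {oee_before:.4f}")
print(f"OEE after:  {oee_after:.4f}")
print(f"Relative change: {relative_change:+.2%}")
```

Because the factors multiply, the relative change does not depend on the baseline: any starting point gives the same net loss for this pair of changes.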
Layer 5.2
Little's Law and WIP Management
Work-In-Progress (WIP) is the inventory of unfinished wafers on the fab floor. Cycle time is the total time from blank silicon to finished product. Throughput is the number of wafers completing the process per day. These three quantities are related by Little's Law:
WIP = Throughput x Cycle Time
e.g., 2,000 wafers in progress = 100 wafers/day x 20 days cycle time
If you want to increase Throughput without increasing Cycle Time, you must run the factory more efficiently. If you simply increase wafer starts (increasing WIP) without clearing bottlenecks, Cycle Time will increase proportionally and Throughput will flatline at the bottleneck's capacity. Adding WIP to a constrained factory does not increase output - it only increases the financial exposure tied up in partially processed silicon.
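A short sketch of this relationship, using the numbers from the example above. It assumes the simplest possible picture: throughput is pinned at the bottleneck's capacity of 100 wafers/day, so any extra WIP shows up purely as longer cycle time. The function and constant names are illustrative.

```python
def cycle_time_days(wip_wafers: float, throughput_per_day: float) -> float:
    """Little's Law rearranged: Cycle Time = WIP / Throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip_wafers / throughput_per_day


# Throughput cannot exceed what the bottleneck can process (from the example).
BOTTLENECK_CAPACITY_PER_DAY = 100.0

for wip in (2000, 2500, 3000):
    ct = cycle_time_days(wip, BOTTLENECK_CAPACITY_PER_DAY)
    print(f"WIP {wip} wafers -> cycle time {ct:.0f} days "
          f"at {BOTTLENECK_CAPACITY_PER_DAY:.0f} wafers/day")
```

Raising WIP from 2,000 to 3,000 wafers leaves output at 100 wafers/day and stretches cycle time from 20 to 30 days.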
The Bottleneck Tool
In every fab, one tool group defines the maximum throughput of the entire factory. Typically this is the most expensive equipment: the advanced lithography scanners. An hour of downtime on a non-bottleneck tool (like a wet bench with excess capacity) costs nothing in lost factory output, provided the bottleneck is never starved: the WIP simply accumulates in the tool's queue, and the tool uses its spare capacity to catch up once it is back online. An hour of downtime on the bottleneck scanner costs the fab one hour of output at the scanner's rated wafer-per-hour rate.
Deployment Rule
Models deployed on bottleneck tools have absolute priority for IT resources, the lowest tolerance for inference latency, and require the highest threshold of proof that they will not impact Performance or Availability. A model that reduces scanner throughput by 0.5% may cost more in lost output than it saves in yield improvement. Run the OEE arithmetic before proposing deployment on a bottleneck.
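One way to run that arithmetic is to ask how much yield gain is needed just to break even on a given throughput loss. The sketch below assumes good-die output is proportional to throughput times yield, which is a simplification; the 0.3% yield gain is a hypothetical value used only to show a losing case.

```python
def net_output_change(throughput_loss: float, yield_gain: float) -> float:
    """Relative change in good output, assuming output ~ throughput * yield."""
    return (1.0 - throughput_loss) * (1.0 + yield_gain) - 1.0


def break_even_yield_gain(throughput_loss: float) -> float:
    """Relative yield gain at which good output is unchanged.

    Solves (1 - throughput_loss) * (1 + gain) = 1 for gain.
    """
    return throughput_loss / (1.0 - throughput_loss)


scanner_throughput_loss = 0.005   # 0.5%, from the text
hypothetical_yield_gain = 0.003   # 0.3%, illustrative only

needed = break_even_yield_gain(scanner_throughput_loss)
net = net_output_change(scanner_throughput_loss, hypothetical_yield_gain)

print(f"Break-even yield gain: {needed:.4%}")
print(f"Net output change with a 0.3% yield gain: {net:+.4%}")
```

Under these assumptions the model has to deliver slightly more than 0.5% relative yield gain before it is worth deploying on the bottleneck at all.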
Layer 5.3
Cost of Ownership and the ROI Translation
A data science project is ultimately evaluated by its financial return on investment. Fab accounting translates engineering metrics into dollars using Cost of Ownership (CoO) models. Understanding this translation is what allows a data scientist to size up a project before building it and to present results in terms that matter to the people who fund the work.
Scrap vs. Yield Loss
Scrap
A wafer physically destroyed or deemed unrecoverable mid-process. The financial loss is the cumulative manufacturing cost up to that step.
A lot scrapped at metal 1 (step 350 of 500) loses approximately 70% of the wafer's total manufacturing cost, assuming cost accrues roughly evenly per step - all the processing cost incurred up to that point, with zero revenue recovered.
Yield Loss
A wafer completes the entire process, but some percentage of dice fail at electrical test. The financial loss is the opportunity cost: dice that were manufactured but cannot be sold.
A 5% yield loss on a wafer with 400 dice at $200 per die is 20 dice, or $4,000 per wafer in lost revenue, with the full cost of the completed manufacturing process already sunk.
An ML model that catches a fault early saves the fab the cost of processing a doomed wafer through the remaining steps. This is the financial justification for Virtual Metrology and in-line FDC: the earlier you detect a bad wafer, the less manufacturing cost is sunk into it before it is scrapped. A model that detects the fault at step 50 saves 450 steps of processing cost. A model that detects it at step 490 saves only 10 steps - but still prevents the wafer from shipping as a defective product.
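The scrap and yield-loss figures above can be reproduced with a small sketch. It assumes manufacturing cost accrues evenly across the 500 steps, which is the simplification behind the 70% figure; real cost-per-step curves are not uniform. All function names are illustrative.

```python
TOTAL_STEPS = 500  # process length used in the examples above


def sunk_cost_fraction(scrap_step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Fraction of total processing cost already spent when scrapped.

    Assumes cost accrues evenly per step (a simplification).
    """
    return scrap_step / total_steps


def steps_avoided(detect_step: int, total_steps: int = TOTAL_STEPS) -> int:
    """Processing steps not spent on a doomed wafer caught at detect_step."""
    return total_steps - detect_step


def yield_loss_revenue(dice_per_wafer: int, price_per_die: float,
                       yield_loss_fraction: float) -> float:
    """Revenue lost per wafer to dice that fail electrical test."""
    return dice_per_wafer * price_per_die * yield_loss_fraction


print(f"Scrap at step 350: {sunk_cost_fraction(350):.0%} of cost sunk")
print(f"Caught at step 50:  {steps_avoided(50)} steps avoided")
print(f"Caught at step 490: {steps_avoided(490)} steps avoided")
print(f"5% yield loss: ${yield_loss_revenue(400, 200.0, 0.05):,.0f} per wafer")
```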
The ROI Translation Matrix
Every data science metric has a direct translation to a fab engineering metric, and every fab engineering metric has a direct translation to a financial value. This matrix is the language you must speak when presenting to a Fab Director.
| Data Science Metric | Fab Engineering Metric | Financial Value (ROI) |
|---|---|---|
| Model Precision improves by 10% | False Alarm Rate drops by 5% | Recovered tool Availability - hours per week of unneeded engineering investigation eliminated |
| Model Recall improves by 10% | Undetected Excursions drop by 2 per year | Avoided Scrap and Yield Loss - each caught excursion saves processing cost of the remaining steps |
| Inference Latency drops by 500ms | Wafer-to-Wafer time drops by 500ms | Gained Throughput on bottleneck tools - at $10K/hr, 500ms x 200 wafers/day = 100 s/day, or about $280/day |
| R-squared improves by 0.05 | Process Cpk improves by 0.1 | Higher-bin selling price - tighter VM predictions enable tighter recipe control, shifting die distribution toward faster bins |
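The latency row is the easiest one to verify by hand, so it makes a useful sanity check on any translation. The sketch below values recovered tool time at the hourly tool cost, using the 200 wafers/day and $10,000/hr figures from this section; valuing time at tool cost rather than at lost wafer revenue is an assumption of this sketch.

```python
SECONDS_PER_HOUR = 3600


def daily_time_value(seconds_per_wafer: float, wafers_per_day: int,
                     tool_cost_per_hour: float) -> float:
    """Dollar value per day of tool time gained (or lost) per wafer."""
    seconds_per_day = seconds_per_wafer * wafers_per_day
    return seconds_per_day / SECONDS_PER_HOUR * tool_cost_per_hour


# 500 ms saved per wafer: 100 s/day of recovered tool time.
latency_gain = daily_time_value(0.5, 200, 10_000)

# 2 s added per wafer for cloud inference: 400 s/day of lost tool time.
cloud_penalty = daily_time_value(2.0, 200, 10_000)

print(f"500 ms saved per wafer: ${latency_gain:,.0f} per day")
print(f"2 s added per wafer:    ${cloud_penalty:,.0f} per day")
```

With these inputs, 500 ms per wafer is worth roughly $278 per day and a 2-second penalty costs roughly $1,111 per day.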
Layer 5.4
The Business Case Template
Before writing a line of code for a new predictive model, a fab data scientist must draft a business case. If the business case cannot show a positive ROI, the model should not be built regardless of how interesting the math is. The template has five required sections:
1. Problem statement
What specific failure mode, yield loss, or inefficiency does this model address? State it in engineering terms, not data science terms. "Predict endpoint to within 50ms" not "build a 1D CNN."
2. Current state baseline
What is the current false alarm rate, yield, or OEE at the target tool? What is the financial cost of the current state per year? This is your denominator.
3. Target state
What specific improvement in the engineering metric is realistic? Base this on the theoretical ceiling (Gauge R&R for VM, information-theoretic limits for classification) not aspirational accuracy. What OEE impact (positive or negative) will deploying the model have?
4. ROI calculation
Translate the delta between current state and target state into dollars per year using the Translation Matrix. Subtract the model development cost (data scientist time, infrastructure, validation) and the ongoing maintenance cost. Show the payback period.
5. Risk and counterfactual
What happens if the model fails silently? What is the degradation path (from Layer 2.1)? What A/B test or hold-out group will you use to prove that the improvement is caused by the model rather than by a concurrent change? How long will the validation period run before full deployment?
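The ROI calculation in section 4 reduces to a few lines once the inputs are in dollars. The sketch below is one simple way to compute payback period; the dollar amounts are hypothetical placeholders, and a real Cost of Ownership model would include more terms.

```python
from typing import Optional


def annual_net_benefit(gross_benefit_per_year: float,
                       maintenance_per_year: float) -> float:
    """Yearly dollars gained after paying ongoing maintenance."""
    return gross_benefit_per_year - maintenance_per_year


def payback_period_years(development_cost: float,
                         gross_benefit_per_year: float,
                         maintenance_per_year: float) -> Optional[float]:
    """Years until cumulative net benefit covers development cost.

    Returns None when the model never pays back.
    """
    net = annual_net_benefit(gross_benefit_per_year, maintenance_per_year)
    if net <= 0:
        return None
    return development_cost / net


# Hypothetical inputs for illustration only.
example = payback_period_years(
    development_cost=300_000,
    gross_benefit_per_year=500_000,
    maintenance_per_year=100_000,
)
print(f"Payback period: {example:.2f} years")
```

A result of None is the signal described above: the business case does not close, and the model should not be built.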
Appendix C
Field Guide to Common Excursions
Six rapid-reference scenarios. Each maps symptoms to root cause to the theory section that prevents recurrence. Use this when you have a live excursion and need to know where to look.
Scenario 1
The Clock Skew That Destroyed a Lot
$2.5M (50 wafers at metal 1)
Symptoms
Model predicted normal etch depth. Physical measurement showed 40% over-etch. Post-hoc analysis shows the model trained on Chamber ETCH-08 was applied to wafers from Chamber ETCH-09. Both chambers run identical recipes.
Investigation
The MES dispatch log shows wafers were correctly routed. The FDC trace data shows normal chamber pressure. The model inputs look correct. The issue is the timestamp join. ETCH-08's tool controller clock was 3.2 seconds fast relative to the FDC server. ETCH-09 was 1.8 seconds slow.
Root cause
Tool controller CMOS battery degradation. Both clocks drifted independently over 18 months since last NTP sync.
Prevention
See Layer 3.1: Timestamp Alignment Strategies. Implement cross-correlation alignment using the RF power spike as a physics anchor. Do not rely on declared timestamps for multi-chamber models.
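A minimal sketch of cross-correlation alignment on synthetic data. The traces, the 10 Hz sample rate, and the rectangular RF power pulse are all made up for illustration; the second trace is shifted by 32 samples to mimic a 3.2-second clock offset. The function name `best_lag` is hypothetical, and a production pipeline would likely use a vectorized library routine instead of nested loops.

```python
def best_lag(reference, shifted, max_lag):
    """Lag (in samples) that maximizes the cross-correlation.

    A positive result means events appear later in `shifted`
    than in `reference`.
    """
    ref_mean = sum(reference) / len(reference)
    sh_mean = sum(shifted) / len(shifted)
    ref = [x - ref_mean for x in reference]
    sh = [x - sh_mean for x in shifted]
    n = len(ref)

    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ref[i] * sh[j]
        if score > best_score:
            best, best_score = lag, score
    return best


SAMPLE_RATE_HZ = 10  # assumed sample rate for this synthetic example

# RF power pulse as seen by the reference clock (e.g. the FDC server).
fdc_trace = [0.0] * 50 + [500.0] * 100 + [0.0] * 150
# Same pulse, stamped 32 samples (3.2 s) later by a skewed clock.
tool_trace = [0.0] * 82 + [500.0] * 100 + [0.0] * 118

lag_samples = best_lag(fdc_trace, tool_trace, max_lag=60)
offset_seconds = lag_samples / SAMPLE_RATE_HZ
print(f"Estimated offset: {offset_seconds:.1f} s")
```

The estimated offset is then subtracted from the tool's timestamps before joining, so the join is anchored to a physical event rather than to either clock.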
Scenario 2
The Phantom Micro-Arc
$4.5M (entire lot scrapped after subsequent implant step)
Symptoms
Transistor gate oxide leakage fails on 12% of dice across a 25-wafer lot. FDC models for RF parameters show zero alarms during the gate etch step.
Investigation
The data scientist checks the 1Hz trace data from the historian. No anomalies. The process engineer pulls the raw 100Hz Interface A data. A 40-millisecond spike in RF_Reflected_Power is clearly visible on every failing wafer.
Root cause
Feature extraction erased the physics. The FDC pipeline was configured to calculate the Step Mean and Step StdDev of the RF trace. A 40ms spike in a 60-second step does not move the mean enough to trigger a statistical alarm.
Prevention
See Layer 3.3: FDC Architecture. Always include max(), min(), and 99th_percentile() in feature extraction for RF and plasma parameters. Averages hide transients. Transients kill transistors.
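The effect is easy to reproduce on a synthetic trace. The sketch below builds a noise-free 60-second step at 100 Hz with a 4-sample (40 ms) spike; the baseline and spike amplitudes are hypothetical, and the percentile uses a simple nearest-rank rule rather than any particular FDC vendor's definition.

```python
SAMPLE_RATE_HZ = 100
STEP_SECONDS = 60
BASELINE_W = 5.0    # hypothetical reflected-power baseline
SPIKE_W = 50.0      # hypothetical micro-arc amplitude
SPIKE_SAMPLES = 4   # 40 ms at 100 Hz


def make_trace(with_spike: bool) -> list:
    """Flat synthetic RF_Reflected_Power trace, optionally with one spike."""
    trace = [BASELINE_W] * (SAMPLE_RATE_HZ * STEP_SECONDS)
    if with_spike:
        start = len(trace) // 2
        for i in range(start, start + SPIKE_SAMPLES):
            trace[i] = SPIKE_W
    return trace


def summarize(trace: list) -> dict:
    """Step-level features: mean, max, and a nearest-rank 99th percentile."""
    ordered = sorted(trace)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    return {"mean": sum(trace) / len(trace), "max": max(trace), "p99": p99}


normal = summarize(make_trace(with_spike=False))
arcing = summarize(make_trace(with_spike=True))

print("normal:", normal)
print("arcing:", arcing)
```

In this synthetic case the mean moves from 5.00 to about 5.03 while max jumps from 5 to 50. The 99th percentile also stays at baseline here, because four samples are well under 1% of the 6,000 in the step, so for a transient this short it is max() that exposes the event.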
Scenario 3
The Confidence Interval Catastrophe
$4.5M (lot scrapped after implant, VM model accepted zero-film wafer)
Symptoms
Virtual Metrology model deployed to predict CMP thickness. For three months, it performed flawlessly. On Tuesday, it predicted 45.0nm (perfectly on target) for a lot that actually had 0.0nm of film deposited due to an upstream CVD failure.
Investigation
The CVD tool precursor gas line was empty. The MFC reported 0.0 sccm flow. The VM model had never seen 0.0 sccm in its training data (minimum was 42.0 sccm). A random forest cannot extrapolate outside its training domain. It routed the anomalous input down a leaf that happened to predict 45.0nm.
Root cause
Deploying a regressor without an Out-Of-Distribution detector.
Prevention
See Layer 3.4: The Physics of Ground Truth. Never deploy a VM model without a parallel Mahalanobis distance calculator measuring the Reliability Index. If the input feature vector is outside the training envelope, the model must output UNKNOWN and force a physical measurement.
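A minimal sketch of such a gate, restricted to two features so the covariance inverse can be written out by hand. The training rows (flow in sccm, pressure in mTorr), the threshold of 3.0, and the stand-in model are all hypothetical; only the 42.0 sccm training minimum and the 0.0 sccm failure case come from the scenario. A real Reliability Index would cover the full feature vector.

```python
import math


def mean_and_cov_2d(rows):
    """Sample mean and covariance (sxx, sxy, syy) of 2-feature rows."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def gated_predict(features, model, mean, cov, threshold):
    """Return the prediction, or "UNKNOWN" outside the training envelope."""
    if mahalanobis_2d(features, mean, cov) > threshold:
        return "UNKNOWN"  # caller must force a physical measurement
    return model(features)


# Hypothetical training envelope: (precursor flow sccm, pressure mTorr).
TRAINING = [(42.0, 100.0), (45.0, 102.0), (47.0, 101.0),
            (50.0, 104.0), (44.0, 99.0), (48.0, 103.0)]
MEAN, COV = mean_and_cov_2d(TRAINING)
THRESHOLD = 3.0  # hypothetical cut-off; tune on held-out data


def stand_in_model(features):
    """Placeholder regressor that always predicts on-target thickness."""
    return 45.0


print(gated_predict((45.0, 101.0), stand_in_model, MEAN, COV, THRESHOLD))
print(gated_predict((0.0, 101.0), stand_in_model, MEAN, COV, THRESHOLD))
```

The gate sits in front of the regressor, so an empty gas line is rejected before the model has a chance to return a plausible-looking number.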
Scenario 4
The Recipe Revision Trap
$180K (wasted metrology time clearing false alarms)
Symptoms
An anomaly detection model trained to monitor an anneal furnace starts generating 100% false positives immediately after a scheduled maintenance window.
Investigation
The data scientist checks the model inputs. The temperature profile looks normal. The flow rates look normal. However, the process engineer reveals they updated the recipe from ANNEAL_V2 to ANNEAL_V3 during the PM. The only change was a 5-second extension to a stabilization step.
Root cause
The ML pipeline used Step_Duration as a feature. The 5-second shift altered the dynamic time warping alignment of the entire trace, making a normal V3 trace look like a catastrophic anomaly against the V2 baseline.
Prevention
See Layer 3.6: Fab MLOps. Model inference pipelines must include an explicit check: IF incoming_recipe_rev != trained_recipe_rev THEN suspend_model(). Never assume a minor process tweak is mathematically minor to an algorithm.
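The check above can be sketched as a small guard in front of inference. The class and function names are hypothetical, and the choice to keep the model suspended until someone explicitly re-enables it is a design assumption of this sketch, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    """Deployment state for one model (illustrative structure)."""
    trained_recipe_rev: str
    suspended: bool = False


def inference_allowed(state: ModelState, incoming_recipe_rev: str) -> bool:
    """Suspend the model on a recipe revision mismatch.

    Returns True only when inference may proceed. Suspension is sticky:
    once tripped, it stays on until the state is reset after revalidation.
    """
    if incoming_recipe_rev != state.trained_recipe_rev:
        state.suspended = True
    return not state.suspended


anneal_model = ModelState(trained_recipe_rev="ANNEAL_V2")

print(inference_allowed(anneal_model, "ANNEAL_V2"))  # matching revision
print(inference_allowed(anneal_model, "ANNEAL_V3"))  # mismatch trips the guard
print(anneal_model.suspended)
```

Keeping the suspension sticky means a later wafer on the old revision cannot silently re-enable a model that has not been reviewed.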
Scenario 5
The Coordinate Mismatch
$850K (lost tool availability during unnecessary teardown)
Symptoms
A spatial clustering algorithm flags a dense "donut" pattern of defects on the wafer map. The equipment engineer pulls the chamber apart looking for a showerhead blockage. Nothing is found. Yield is perfectly normal.
Investigation
The data scientist overlays the spatial defect map with the die layout. The "defects" perfectly align with the scribe lines - the empty streets between dice where test structures are printed.
Root cause
The inspection tool origin (0,0) was calibrated to the center of the wafer. The ML pipeline assumed (0,0) was the bottom-left notch. A simple affine translation error mapped benign scribe-line test structures into the active die area.
Prevention
See Layer 4.4: Spatial Statistics and Yield Geometry. Always establish physical anchors (notch and center) before applying any ML to spatial data. Run a known "golden wafer" with deliberate perimeter marks through the entire pipeline to verify matrix transformations.
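A golden-wafer check can be as simple as pushing a few marks with known positions through the transform and comparing the results. The sketch below assumes a 300 mm wafer, a center-origin inspection frame, and a pipeline frame whose origin is the bottom-left corner of the wafer's bounding box with the notch at 6 o'clock; all of these, and the mark positions, are illustrative assumptions.

```python
WAFER_RADIUS_MM = 150.0  # assumes a 300 mm wafer


def center_to_bottom_left(x_mm, y_mm, radius_mm=WAFER_RADIUS_MM):
    """Translate center-origin coordinates into a bottom-left-origin frame."""
    return (x_mm + radius_mm, y_mm + radius_mm)


def no_transform(x_mm, y_mm):
    """The bug: treating center-origin coordinates as bottom-left-origin."""
    return (x_mm, y_mm)


# Golden-wafer marks: (inspection-tool coords, expected pipeline coords).
GOLDEN_MARKS = [
    ((0.0, 0.0), (150.0, 150.0)),     # wafer center
    ((0.0, -150.0), (150.0, 0.0)),    # notch, assumed at 6 o'clock
    ((150.0, 0.0), (300.0, 150.0)),   # right edge
]


def pipeline_is_aligned(transform, tolerance_mm=0.01):
    """True if every golden mark lands where it physically is."""
    for (x, y), (expected_x, expected_y) in GOLDEN_MARKS:
        got_x, got_y = transform(x, y)
        if abs(got_x - expected_x) > tolerance_mm:
            return False
        if abs(got_y - expected_y) > tolerance_mm:
            return False
    return True


print(pipeline_is_aligned(center_to_bottom_left))
print(pipeline_is_aligned(no_transform))
```

Running this check whenever an inspection tool or pipeline stage changes catches origin mismatches before any clustering result reaches an equipment engineer.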
Scenario 6
The Physics Model Failure (Correlation vs. Causation)
$0 in scrap - but catastrophic, unrecoverable loss of credibility with the executive yield team
Symptoms
Yield on Chamber ETCH-04 dropped to 85%. A new pressure-compensation model was deployed. Over three days, yield recovered to 92%. The data science team claimed a $2M annualized win.
Investigation
The Fab Director asked to see the control group data. A Difference-in-Differences analysis against Chamber ETCH-05 (which received no new model) revealed that ETCH-05 yield also recovered from 86% to 91% during the same three-day window.
Root cause
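Confounding variables and regression to the mean. A chemical supplier had shipped a slightly degraded batch of photoresist affecting all chambers. When a fresh batch arrived, yield recovered universally. The control chamber recovered 5 of the 7 points on its own, and the remaining 2 points cannot be credited to the model from a three-day window on a single control chamber. The claimed $2M win had no causal support.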
Prevention
See Layer 3.7: Causal Inference. Never report an ROI without a counterfactual. Always use a synthetic control, a hold-out chamber, or an A/B split for Difference-in-Differences analysis. Do not take credit for improvements that would have happened without your model.
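The Difference-in-Differences estimate for this scenario is a single subtraction. The sketch below uses the yield figures from the scenario and gives a point estimate only; it says nothing about uncertainty, which would need more chambers or a longer window.

```python
def difference_in_differences(treated_before: float, treated_after: float,
                              control_before: float, control_after: float) -> float:
    """Change in the treated unit minus change in the control unit."""
    return (treated_after - treated_before) - (control_after - control_before)


# Yield in percentage points, from the scenario.
etch04_before, etch04_after = 85, 92   # received the new model
etch05_before, etch05_after = 86, 91   # hold-out chamber, no model

naive_gain = etch04_after - etch04_before
did_estimate = difference_in_differences(
    etch04_before, etch04_after, etch05_before, etch05_after
)

print(f"Naive before/after gain: {naive_gain} points")
print(f"Difference-in-Differences estimate: {did_estimate} points")
```

The naive comparison credits the model with 7 points; against the hold-out chamber the estimate shrinks to 2 points, and even that figure is a single noisy comparison rather than proof of effect.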