Layer 6

The ML Arsenal

The algorithms used in production fabs, their failure modes, and the seven physically coupled failure patterns that end data science careers.

Layer 6.1

Tree Ensembles (XGBoost, LightGBM, Random Forest)

Primary fab applications

Virtual Metrology (VM) and Fault Detection and Classification (FDC). These models predict film thickness, critical dimension (CD), or binary pass/fail from tabular equipment trace data: pressures, temperatures, gas flows, and RF matching network readings.

Why tree ensembles win in fabs

Fabs generate massively wide, highly collinear tabular data. A modern etch chamber logs 200+ sensor traces at 1 Hz. Tree ensembles tolerate this collinearity without special handling, require no feature scaling, and offer a level of interpretability that helps with audit and compliance reviews. A process engineer who needs to understand why the model flagged a lot can follow a decision path. They cannot follow a gradient descent update rule.

Tuning XGBoost for fab reality

XGBoost's out-of-the-box defaults will fail in a fab environment. Three parameters require deliberate overrides:

Parameter	Standard Default	Fab Reality	Why It Matters
scale_pos_weight	1	50 to 100	FDC failures are severely imbalanced (on the order of 1:100 fault-to-normal). With default weighting the model tends to predict "normal" on almost every wafer. A common starting point is the ratio of normal to fault examples.
max_depth	6	3 to 4	Limits overfitting to high-frequency sensor noise. Deep trees memorize PM-cycle artifacts as features.
subsample	1.0	0.8	Row subsampling adds regularization, which makes the ensemble less likely to lock onto the most recent PM interval when the chamber drifts mechanically between PM cycles.
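As a minimal sketch of the overrides in the table, assuming binary FDC labels where 1 marks a fault: the helper below derives `scale_pos_weight` from the observed class ratio and fills in the shallower depth and row subsampling. The function name `fab_xgb_params` and the sample labels are illustrative; the dictionary keys are XGBoost's own parameter names.

```python
def fab_xgb_params(labels, max_depth=4, subsample=0.8):
    """Build XGBoost overrides from binary FDC labels (1 = fault, 0 = normal)."""
    n_fault = sum(1 for y in labels if y == 1)
    n_normal = len(labels) - n_fault
    if n_fault == 0:
        raise ValueError("no fault examples: cannot derive a class weight")
    return {
        # Ratio of negatives to positives, a common starting point for this knob.
        "scale_pos_weight": n_normal / n_fault,
        "max_depth": max_depth,
        "subsample": subsample,
    }


labels = [1] * 10 + [0] * 1000  # illustrative 1:100 imbalance
params = fab_xgb_params(labels)
# With xgboost installed, the dict could be passed as xgboost.XGBClassifier(**params).
```

The values are starting points to validate on held-out PM intervals, not settings to copy blindly.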

The Hardware Trap: The Christmas Tree Graph

Tree ensembles cannot extrapolate. They are piecewise constant functions with hard boundaries at their training data min/max. If a chamber component ages (say, a mass flow controller drifts slowly higher than anything in your two-year training history), XGBoost does not gracefully extrapolate. It clips the prediction at the value associated with the last leaf node it saw near that boundary. The model will confidently predict a plausible value while the actual process is in an uncharted regime.
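The clipping behavior can be reproduced with a toy example. The sketch below is not XGBoost. It is a deliberately tiny, hand-written 1-D regression tree (hypothetical `fit_tree` and `predict` helpers) trained on a made-up linear relationship between gas flow and film thickness. It shows the same property: once the input passes the largest training value, the prediction stays pinned to the last leaf.

```python
from statistics import mean


def fit_tree(xs, ys, depth=3, min_leaf=2):
    """Fit a tiny 1-D regression tree. Returns a float leaf or (threshold, left, right)."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        return mean(ys)
    pairs = sorted(zip(xs, ys))
    best = None  # (sum of squared errors, split index)
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return mean(ys)
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    left_x, left_y = zip(*pairs[:i])
    right_x, right_y = zip(*pairs[i:])
    return (threshold,
            fit_tree(left_x, left_y, depth - 1, min_leaf),
            fit_tree(right_x, right_y, depth - 1, min_leaf))


def predict(tree, x):
    """Walk from the root to a leaf; the leaf value is the prediction."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


# Made-up training history: flow never exceeded 100, thickness rises linearly.
flows = list(range(0, 101, 5))
thickness = [2.0 * f + 10.0 for f in flows]
tree = fit_tree(flows, thickness)

at_edge = predict(tree, 100)  # largest flow seen in training
beyond = predict(tree, 150)   # drifted MFC, never seen in training
```

Here `beyond` equals `at_edge` even though the true relationship keeps rising, which is the silent failure the OOD check is meant to catch.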

The OOD Blast Shield

Tree ensembles must be paired with Out-of-Distribution detection. The pattern is to run an Isolation Forest (or Mahalanobis distance check) on every incoming feature vector before passing it to the prediction model. If the input is anomalous relative to the training distribution, the model returns an explicit safety signal rather than a confident wrong answer:

Fab defensive pattern: OOD blast shield

1. Score the incoming feature vector with an Isolation Forest trained on normal production data

2. If anomaly_score exceeds NOVELTY_THRESHOLD: trigger equipment alarm, return "UNSAFE_TO_PREDICT", route wafer to physical metrology

3. Only if input is within the training envelope: pass to XGBoost, return the prediction
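The three steps can be sketched without any third-party dependency by swapping in the Mahalanobis distance check that the text names as an alternative to the Isolation Forest. Everything below is illustrative: `MahalanobisGate`, `guarded_predict`, the threshold of 4.0, the synthetic training data, and the stand-in `model` function are assumptions, not part of any production system.

```python
import random
from statistics import mean

UNSAFE = "UNSAFE_TO_PREDICT"


def _solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


class MahalanobisGate:
    """Distance of a feature vector from the training distribution.

    Assumes a non-singular covariance matrix (no perfectly collinear features).
    """

    def __init__(self, train_rows, threshold):
        n, d = len(train_rows), len(train_rows[0])
        self.mu = [mean(r[j] for r in train_rows) for j in range(d)]
        self.cov = [[sum((r[i] - self.mu[i]) * (r[j] - self.mu[j]) for r in train_rows) / (n - 1)
                     for j in range(d)] for i in range(d)]
        self.threshold = threshold

    def distance(self, x):
        diff = [xi - mi for xi, mi in zip(x, self.mu)]
        z = _solve(self.cov, diff)  # z = cov^-1 @ diff
        return sum(di * zi for di, zi in zip(diff, z)) ** 0.5


def guarded_predict(gate, model, x):
    """Steps 1 to 3: score, refuse if novel, otherwise predict."""
    if gate.distance(x) > gate.threshold:
        # A real pipeline would also raise the equipment alarm and
        # route the wafer to physical metrology here.
        return UNSAFE
    return model(x)


rng = random.Random(0)
train = [[100 + rng.gauss(0, 1), 350 + rng.gauss(0, 5)] for _ in range(500)]
gate = MahalanobisGate(train, threshold=4.0)
model = lambda x: 0.5 * x[0] + 0.1 * x[1]  # stand-in for the XGBoost model

inside = guarded_predict(gate, model, [100.0, 350.0])
outside = guarded_predict(gate, model, [120.0, 350.0])  # far outside training range
```

The design point is that the gate returns a distinct sentinel rather than a number, so downstream code cannot mistake a refusal for a prediction.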

Warning condition requiring mandatory action

Warning Condition	Required Action	Physical Rationale
Model confidence dropped from 99% to 65%	STOP. Inspect manually.	OOD detected. Do not trust the prediction. Unseen mechanical state.

Theory Library

→ Gradient Boosting / XGBoost → LightGBM → Random Forest → SHAP and Explainable AI

Layer 6.2

Geostatistics (Kriging, Gaussian Processes)

Primary fab applications

Spatial interpolation of wafer metrology maps, yield surface modeling, and defect density estimation across the wafer plane.

Why spatial models win where tabular models fail

Wafers are physical circles. If a die at coordinates (X: 12, Y: 14) has a defect, the die at (X: 12, Y: 15) has a non-random probability of having the same defect due to physical phenomena: a CMP scratch, a lithography defocus ring, or a material droplet. Standard regression and tree ensembles treat every die as an independent observation. This independence assumption is physically wrong and throws away the spatial correlation signal that is often the most informative diagnostic.

The Hardware Trap: The Edge Effect

Kriging with a single isotropic variogram assumes a stationary field: the correlation between two points depends only on the distance between them, not on where they sit on the wafer or in which direction they are separated. The wafer edge breaks this assumption. A cluster of defects at the exact center indicates a showerhead gas distribution issue (radial symmetry). A ring of defects at the wafer edge indicates focus ring wear. These physically distinct patterns have completely different spatial correlation structures. Fitting a single isotropic variogram to the entire wafer blends these patterns and produces systematically wrong interpolations at both the center and the edge.

Survival Strategy: Universal Kriging

Use Universal Kriging with drift terms for known physical patterns: radial basis functions for center-edge gradients, sinusoidal terms for reticle-pitch periodic patterns, and linear terms for stage scan direction gradients. Or explicitly segment the wafer into zones (center, middle, edge) before modeling, fitting separate variograms per zone. The Bevel Mask pattern:

Fab defensive pattern: bevel mask

Before any spatial analysis: compute Euclidean distance from each die to the wafer center. If distance exceeds wafer_radius - exclusion_mm, set the die value to null. This prevents edge dies from biasing the variogram fit, the interpolation, and the yield model. The exclusion zone is typically 3 mm for advanced-node logic, 5 mm for memory.
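A minimal sketch of the bevel mask, assuming die coordinates are given in millimeters relative to the wafer center and a 300 mm wafer (150 mm radius). The function name `apply_bevel_mask` and the tuple layout are illustrative.

```python
import math


def apply_bevel_mask(dies, wafer_radius_mm=150.0, exclusion_mm=3.0):
    """Return a copy of `dies` with values nulled inside the edge exclusion zone.

    Each die is (x_mm, y_mm, value) with coordinates relative to the wafer center.
    """
    limit = wafer_radius_mm - exclusion_mm
    masked = []
    for x, y, value in dies:
        r = math.hypot(x, y)  # Euclidean distance to wafer center
        masked.append((x, y, value if r <= limit else None))
    return masked


dies = [
    (0.0, 0.0, 1.0),      # center die: kept
    (147.0, 0.0, 2.0),    # exactly on the limit: kept
    (148.0, 0.0, 3.0),    # inside the exclusion zone: nulled
    (104.5, 104.5, 4.0),  # diagonal die, r is about 147.8 mm: nulled
]
masked = apply_bevel_mask(dies)
```

Nulling rather than dropping the die keeps the wafer map shape intact, so downstream code can still index by die position.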

Theory Library

→ Kriging / Gaussian Process Regression → DBSCAN / Spatial Clustering → CNN for Wafer Maps

Layer 6.3

Survival Analysis (Kaplan-Meier, Cox PH, Competing Risks)

Primary fab applications

Equipment health monitoring, consumable life prediction (CMP pads, focus rings, quartz liners), and field reliability modeling for shipped devices.

Standard classification asks "Will it fail today? Yes or No." This is useless for maintenance scheduling. Survival analysis asks: "What is the probability this pump survives another 400 hours?" Fabs run on maintenance windows. Evaluating risk profiles over time rather than binary predictions enables optimized preventive scheduling: replace the component at roughly 70% of its characteristic life, before the wear-out region, accepting the loss of the last 30% of nominal life in exchange for avoiding an unplanned failure, rather than replacing far earlier on a fixed calendar.
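The "another 400 hours" question is a conditional survival probability. As a worked sketch, assume the component's life follows a Weibull distribution with shape $\beta$ and scale $\eta$, so that $S(t) = \exp(-(t/\eta)^\beta)$ and $P(T > t + \Delta \mid T > t) = S(t + \Delta) / S(t)$. The parameter values below are invented for illustration; in practice they would be fitted from failure and censoring records.

```python
import math


def weibull_survival(t_hours, shape, scale_hours):
    """S(t) = exp(-(t / scale)^shape)."""
    return math.exp(-((t_hours / scale_hours) ** shape))


def conditional_survival(age_hours, horizon_hours, shape, scale_hours):
    """P(survive another `horizon_hours` | already survived `age_hours`)."""
    return (weibull_survival(age_hours + horizon_hours, shape, scale_hours)
            / weibull_survival(age_hours, shape, scale_hours))


SHAPE, SCALE = 2.5, 5000.0  # hypothetical wear-out behavior (shape > 1)

young_pump = conditional_survival(500.0, 400.0, SHAPE, SCALE)
old_pump = conditional_survival(4000.0, 400.0, SHAPE, SCALE)
```

With a shape above 1 the hazard rises with age, so `old_pump` comes out lower than `young_pump` for the same 400-hour window. That difference is what a binary "fails today" classifier cannot express.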

Theory Library

→ Survival Analysis / Weibull Distribution → Extreme Value Theory

Layer 6.5

Fab ML Failure Mode Taxonomy

In software ML, failure modes are primarily statistical: the model encounters distribution shift, accuracy degrades, and you retrain. In fab ML, failure modes are physically coupled to hardware states, manufacturing events, and data infrastructure constraints. Misdiagnosing a physically coupled failure as a statistical one and applying a statistical fix does not merely fail to solve the problem; it can actively make it worse.

The taxonomy below classifies the seven failure modes that account for the majority of production fab ML incidents. Each has a distinct physical trigger, a detectable data signature, and a mitigation strategy that addresses the root cause rather than the symptom.

Seasoning Drift

response: online learning

Physical trigger

Polymer and byproduct buildup on chamber walls shifts baseline sensor readings over time. The chamber's effective electrical and chemical state changes continuously between PM events.

Data signature

Slow, monotonic drift in sensor means. Typically 0.1 to 0.5 sigma per week for a well-maintained chamber. Too slow to trigger standard SPC alarms but accumulates to a statistically significant shift over the PM interval.

Why it is dangerous

A model trained at the start of a PM interval makes systematically biased predictions by the end of that interval. If the model is recalibrated using a rolling baseline that adapts to the drift, the drift becomes invisible to both the model and the monitoring system.

Mitigation

Implement a "chamber age" feature: wafers processed since the last PM event. This gives the model explicit visibility into seasoning state. Use piecewise linear detrending within PM intervals rather than global normalization. Anchor your PSI baseline to the post-PM steady state, not a rolling window.
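Two pieces of this mitigation are easy to sketch: the chamber-age counter and a PSI computed against a fixed post-PM baseline. The helper names, the equal-width binning, and the `eps` floor are assumptions for illustration. The input flags are assumed to mark the first wafer after each PM, and the baseline is assumed to contain more than one distinct value.

```python
import math


def chamber_age(first_post_pm_flags):
    """Wafers processed since the last PM, per wafer in run order.

    The sequence is assumed to start right after a PM if the first flag is False.
    """
    age, ages = -1, []
    for is_first_post_pm in first_post_pm_flags:
        age = 0 if is_first_post_pm else age + 1
        ages.append(age)
    return ages


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against a fixed `baseline`.

    Bin edges come from the baseline only, so the reference does not move
    with the drift. Out-of-range values are clamped to the end bins.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


ages = chamber_age([True, False, False, True, False])

post_pm_baseline = [i / 100 for i in range(100)]        # made-up steady-state readings
late_interval = [v + 0.5 for v in post_pm_baseline]     # same sensor after seasoning
drift_score = psi(post_pm_baseline, late_interval)
```

Because the baseline is captured once after the chamber stabilizes, accumulated seasoning shows up as a growing PSI instead of being absorbed by a moving reference.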

PM Reset Shock

response: change-point detection

Physical trigger

After a preventive maintenance event, the chamber walls are chemically pristine: no seasoning layer, different surface energy, changed plasma ignition behavior. The chamber state at wafer 1 post-PM is further from the training distribution than any other operating state.

Data signature

Step change in the sensor distribution immediately after the PM timestamp. Shift magnitude typically 2 to 5 sigma, affecting multiple correlated parameters simultaneously (RF impedance, etch rate, gas consumption all shift together).

Why it is dangerous

PM events are scheduled and predictable, but a naive monitoring system fires alarms on every post-PM wafer. Engineers who see repeated false alarms on the first wafer after every PM will disable the alarm. When a real excursion occurs on a post-PM wafer, it goes undetected.

Mitigation

Create a binary IS_POST_PM feature and exclude the first 5 to 10 wafers post-PM from training data for steady-state models. Maintain a separate "PM response" model trained specifically on post-PM data. Update PSI baselines only after the chamber has stabilized.
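A minimal sketch of the split, assuming each wafer record already carries a `wafers_since_pm` count (a hypothetical field name) and using 10 wafers as the settling window from the upper end of the range above:

```python
def split_by_pm_phase(rows, settle_wafers=10):
    """Tag each wafer with IS_POST_PM and separate the two training sets.

    Returns (steady_state_rows, post_pm_rows). Input rows are not mutated.
    """
    steady, post_pm = [], []
    for row in rows:
        tagged = dict(row, IS_POST_PM=int(row["wafers_since_pm"] < settle_wafers))
        (post_pm if tagged["IS_POST_PM"] else steady).append(tagged)
    return steady, post_pm


rows = [
    {"wafer": "W01", "wafers_since_pm": 0},
    {"wafer": "W02", "wafers_since_pm": 9},
    {"wafer": "W03", "wafers_since_pm": 10},
    {"wafer": "W04", "wafers_since_pm": 250},
]
steady_rows, post_pm_rows = split_by_pm_phase(rows)
```

The steady-state model trains on `steady_rows` only, while `post_pm_rows` feeds the separate PM-response model.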

Ghost Excursion

response: input validation

Physical trigger

Sensor recalibration, communications timeout, or ADC overflow produces a physically impossible reading that appears as a spike in the data. The sensor returns to baseline immediately after the event.

Data signature

Single-sample spike to a value at or beyond the sensor's physical limits (9999.9, -999.0, or 0.0 for a parameter that cannot be zero) followed by immediate return to the pre-event baseline. Spike duration is exactly one sample period.

Why it is dangerous

A model that computes step maximum as a feature will incorporate the ghost spike into the feature value and generate a false alarm. If the spike makes it into the training set, the model will learn a spurious correlation between the sensor fault and whatever process outcome followed.

Mitigation

Implement physics guardrails: reject any feature value that exceeds the hardware specification limits for that sensor. Values outside the physical range are sensor faults, not process events. Apply this filter before feature extraction, not after. Document the valid range for every parameter from the tool qualification datasheet.
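The ordering matters: the range check has to run on raw samples before any aggregate is computed. A minimal sketch follows, with invented sensor names and limits standing in for the values that would come from the tool qualification datasheet.

```python
# Hypothetical valid ranges; real limits come from the tool qualification datasheet.
VALID_RANGE = {
    "chamber_pressure_mtorr": (0.1, 1000.0),  # cannot physically read zero
    "rf_forward_power_w": (0.0, 3000.0),
}


def clean_trace(sensor, samples, valid_range=VALID_RANGE):
    """Drop samples outside the sensor's physical range (sensor faults, not process events)."""
    lo, hi = valid_range[sensor]
    return [s for s in samples if lo <= s <= hi]


def step_max(sensor, samples):
    """Step maximum computed only on physically plausible samples."""
    cleaned = clean_trace(sensor, samples)
    return max(cleaned) if cleaned else None


trace = [50.1, 50.3, 9999.9, 50.2, -999.0, 0.0]
naive_feature = max(trace)                                  # polluted by the ghost spike
guarded_feature = step_max("chamber_pressure_mtorr", trace)
```

Returning `None` for a fully invalid trace forces the caller to handle the missing feature explicitly instead of receiving a fabricated value.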

Reticle Confusion

response: change-point detection

Physical trigger

Training data is lot-averaged across multiple reticles, or a model is trained on lots using one reticle and deployed on lots using a different reticle. Reticle-specific defect patterns appear in one reticle but not others.

Data signature

High training accuracy combined with near-zero accuracy on new reticle releases. The failure appears as a sudden model degradation event that correlates with a reticle change, not a process change.

Why it is dangerous

The model has not failed. It is correctly predicting outcomes for the reticle it was trained on. It has never seen the new reticle and has no way to signal this. The prediction output looks normal; only the actual yield reveals the problem weeks later.

Mitigation

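Always build train/test splits around RETICLE_ID: evaluate per reticle, and hold at least one reticle out entirely so the test set measures behavior on a reticle the model has never seen. When a new reticle is introduced, treat the first production lots as a hold-out set and do not retrain until at least 50 lots with the new reticle have accumulated.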
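One reading of the RETICLE_ID guidance is a grouped split in which whole reticles are held out, combined with a simple gate on the 50-lot rule. The sketch below assumes each record carries `RETICLE_ID` and a `LOT_ID` column; the latter name is an assumption, as are the helper names.

```python
def split_by_reticle(rows, holdout_reticles):
    """Hold out entire reticles so no reticle appears in both train and test."""
    holdout = set(holdout_reticles)
    train = [r for r in rows if r["RETICLE_ID"] not in holdout]
    test = [r for r in rows if r["RETICLE_ID"] in holdout]
    return train, test


def ready_to_retrain(rows, reticle_id, min_lots=50):
    """True once enough distinct lots have run on the new reticle."""
    lots = {r["LOT_ID"] for r in rows if r["RETICLE_ID"] == reticle_id}
    return len(lots) >= min_lots


rows = (
    [{"RETICLE_ID": "R-A", "LOT_ID": f"A{i}"} for i in range(60)]
    + [{"RETICLE_ID": "R-B", "LOT_ID": f"B{i}"} for i in range(12)]
)
train_rows, test_rows = split_by_reticle(rows, holdout_reticles=["R-B"])
can_retrain_b = ready_to_retrain(rows, "R-B")
```

A random row-level split would leak each reticle's signature into both sets and hide exactly the failure described above.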

Q-Time Violation Leakage

response: input validation

Physical trigger

Wafers that exceed their queue-time limit between process steps are scrapped before reaching electrical test. Their process telemetry is recorded in the FDC historian but their final yield label is scrap (zero yield), not a measurement of actual process quality.

Data signature

Training data contains wafers with label zero that were scrapped for logistical reasons rather than process reasons. The model learns to associate certain telemetry patterns with zero yield when those patterns actually correlate with scheduling events.

Why it is dangerous

This is not a model accuracy problem: the model is correctly learning from its training data. The training data is wrong. A model trained on Q-time violations generates false alarms whenever production scheduling is stressed, which tends to happen precisely when the fab is running at highest capacity.

Mitigation

Filter training data on SCRAP_FLAG = 0 before label generation, or filter on SCRAP_REASON NOT IN ('Q-TIME', 'QUEUE_TIME', 'HOLD_EXPIRED'). Implement real-time Q-time tracking in the inference pipeline and exclude in-flight wafers approaching their limit from model scope.
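The SCRAP_REASON filter can be sketched on plain records as follows. The reason codes are the ones listed above; the record layout and the `PARTICLE_EXCURSION` code are invented for the example.

```python
LOGISTICAL_SCRAP = {"Q-TIME", "QUEUE_TIME", "HOLD_EXPIRED"}


def drop_logistical_scrap(rows):
    """Remove wafers scrapped for scheduling reasons before labels are generated.

    Wafers with no scrap reason (None) and wafers scrapped for process
    reasons are kept. Note that a SQL `NOT IN` would silently drop rows
    whose SCRAP_REASON is NULL unless that case is handled explicitly.
    """
    return [r for r in rows if r.get("SCRAP_REASON") not in LOGISTICAL_SCRAP]


rows = [
    {"wafer": "W01", "SCRAP_REASON": None, "yield": 0.93},
    {"wafer": "W02", "SCRAP_REASON": "Q-TIME", "yield": 0.0},
    {"wafer": "W03", "SCRAP_REASON": "PARTICLE_EXCURSION", "yield": 0.0},
    {"wafer": "W04", "SCRAP_REASON": "HOLD_EXPIRED", "yield": 0.0},
]
training_rows = drop_logistical_scrap(rows)
```

The process-related scrap on `W03` stays in the training set because its zero yield does reflect process quality.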

Air-Gap Timeout

response: input validation

Physical trigger

Model inference latency exceeds the tool's SECS/GEM heartbeat timeout. The tool controller interprets the non-response as a communication failure and initiates an emergency stop sequence.

Data signature

The inference API returns a 504 Gateway Timeout. The tool logs an emergency stop event with error code referencing the APC communication timeout. Wafer processing is interrupted mid-recipe.

Why it is dangerous

A mid-recipe emergency stop leaves the wafer in an undefined process state. The downstream impact extends beyond the single interrupted wafer to the lot queue behind it, which may also incur Q-time violations during the tool recovery period.

Mitigation

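Export the model to an optimized inference runtime (for example ONNX) and, where the model type supports it, quantize to INT8 before deployment. Profile inference latency on the actual production hardware at the 99.9th percentile, not the mean. Size the latency budget so that the 99.9th-percentile latency plus a 20% margin still fits inside the tool's heartbeat timeout. Implement a hard timeout handler that returns the last known-good prediction, flagged as stale, rather than raising an exception.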
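A sketch of the two runtime pieces: a tail-latency budget check and a hard timeout wrapper that falls back to the last known-good prediction. The class and function names, the nearest-rank percentile, and the timing values are assumptions for illustration. The thread-based timeout shown here does not cancel a stuck call, so a production version would likely need process-level isolation.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fits_heartbeat(latencies_ms, heartbeat_ms, margin=0.20):
    """True if p99.9 latency plus margin stays within the heartbeat timeout."""
    p999 = percentile_nearest_rank(latencies_ms, 99.9)
    return p999 * (1 + margin) <= heartbeat_ms


class GuardedModel:
    """Wrap a predict function with a hard timeout and a stale-value fallback."""

    def __init__(self, predict_fn, timeout_s, initial_prediction):
        self._predict_fn = predict_fn
        self._timeout_s = timeout_s
        self.last_good = initial_prediction
        # One worker: a call that never returns would block later calls.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def predict(self, features):
        """Return (prediction, is_fresh). Never raises on timeout."""
        future = self._executor.submit(self._predict_fn, features)
        try:
            self.last_good = future.result(timeout=self._timeout_s)
            return self.last_good, True
        except FutureTimeout:
            return self.last_good, False


def toy_model(features):
    time.sleep(features["delay_s"])  # simulate slow inference
    return features["value"]


guard = GuardedModel(toy_model, timeout_s=0.2, initial_prediction=None)
fresh = guard.predict({"delay_s": 0.0, "value": 1.5})
stale = guard.predict({"delay_s": 1.0, "value": 9.9})

latencies_ms = [10.0] * 990 + [80.0] * 10  # mean looks fine, the tail does not
```

The `is_fresh` flag lets the caller decide whether a stale value is acceptable for the current recipe step. The latency sample illustrates why the mean (about 10.7 ms here) is the wrong number to compare against the heartbeat.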

Metrology Blindness (Delayed Labels)

response: online learning

Physical trigger

Wafers are processed at step A, but the measurement that serves as the ground truth label is not taken until step D, 4 to 48 hours later. During that window, the process may drift without the model receiving any feedback signal.

Data signature

Model confidence remains high while actual yield degrades. The degradation is only discovered when the delayed label arrives. By that time, hundreds of additional wafers may have been processed under the drifted conditions.

Why it is dangerous

This is the most structurally difficult failure mode because it cannot be solved by better modeling. A model that is confident when it should be uncertain is more dangerous than an uncertain model, because it prevents human intervention. The longer the label delay, the more wafers are at risk during any given drift episode.

Mitigation

Implement "open loop" alarms that trigger on elapsed time or wafer count, not on model output. If the inference pipeline has not received a ground-truth label for more than X hours or Y wafers, automatically degrade the model's confidence scores and notify the engineering team. X and Y are set based on the maximum acceptable drift accumulation, not on data availability convenience.
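A minimal sketch of such an open-loop alarm, assuming time is tracked in hours and that "degrading" confidence means capping it at a fixed ceiling. The class name, the cap of 0.5, and the thresholds are illustrative choices. Notification of the engineering team is left as a comment.

```python
from dataclasses import dataclass


@dataclass
class LabelWatchdog:
    """Tracks how long the model has run without ground-truth feedback."""
    max_hours: float          # X in the text
    max_wafers: int           # Y in the text
    last_label_time_h: float = 0.0
    wafers_since_label: int = 0

    def on_wafer(self):
        """Call once per wafer scored by the model."""
        self.wafers_since_label += 1

    def on_label(self, now_h):
        """Call when a delayed metrology label arrives."""
        self.last_label_time_h = now_h
        self.wafers_since_label = 0

    def is_blind(self, now_h):
        """Depends only on elapsed time and wafer count, never on model output."""
        return (now_h - self.last_label_time_h > self.max_hours
                or self.wafers_since_label > self.max_wafers)

    def adjust(self, confidence, now_h, degraded_cap=0.5):
        """Cap the reported confidence while the model is running blind."""
        if self.is_blind(now_h):
            # A real pipeline would also notify the engineering team here.
            return min(confidence, degraded_cap)
        return confidence


dog = LabelWatchdog(max_hours=12.0, max_wafers=200)
for _ in range(50):
    dog.on_wafer()
early = dog.adjust(0.99, now_h=6.0)    # within both limits
late = dog.adjust(0.99, now_h=13.0)    # label is overdue by time
dog.on_label(now_h=13.5)
recovered = dog.adjust(0.99, now_h=14.0)
```

Because `is_blind` never looks at the model's own output, a confidently wrong model cannot suppress the alarm.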

Critical distinction

Drift vs. Shift: 8 Diagnostic Signals

The distinction between drift and shift is the most consequential diagnostic decision in fab ML operations. Applying online learning to a shift event teaches the model that broken hardware is normal. Applying change-point detection to drift generates proliferating short-lived models that lose the stable reference point needed for fault detection. Getting it wrong in either direction is operationally costly.

Drift

Gradual and continuous. Seasoning accumulates, metrology blindness extends, Q-time violations correlate with slowly evolving scheduling patterns. Handled with online learning, baselines anchored to the post-PM steady state, and chamber-age features. The key property: the model can adapt if the adaptation is monitored for physical plausibility.

response: online learning

Shift

Abrupt and discrete. PM events reset the chamber state, new reticles introduce new defect patterns, communication timeouts cause instantaneous failures. Handled with change-point detection, model versioning, and hard boundaries. The key property: the pre-shift model is no longer valid post-shift and must not be adapted: it must be replaced.

response: change-point detection

These seven failure modes split into two categories based on the temporal character of the underlying physical change. The 8-signal diagnostic table provides the observational tests that distinguish them without requiring physical investigation:

Signal	Drift (gradual) -> online learning	Shift (abrupt) -> change-point detection
PSI change rate	Gradual: <0.01/day, accumulates over weeks	Sudden: >0.25 within 24 hours
Sensor correlation	All sensors drift together (correlated)	One or few sensors step; others stable
MES maintenance log	No entry: no scheduled work	PM ticket, recipe change, or part replacement logged
Timing pattern	Weeks to months, cyclical (PM interval)	Hours to days, abrupt and irregular
Yield impact	Gradual, graceful degradation	Sudden step drop: >10% in 48 hours
CUSUM behavior	Slow, steady accumulation over many samples	Immediate alarm on first post-event wafer
Chamber scope	Same pattern across entire chamber fleet	Usually single chamber affected
Self-correction	Never self-corrects: requires intervention	May partially self-correct post-PM seasoning
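The CUSUM row of the table can be illustrated with a one-sided tabular CUSUM on standardized residuals. The reference value `k = 0.5` and decision interval `h = 4.0` are conventional textbook choices used here as assumptions, and both input series are synthetic.

```python
def first_cusum_alarm(z_scores, k=0.5, h=4.0):
    """Index of the first upper-CUSUM alarm, or None if it never fires.

    z_scores are residuals in units of sigma; only upward changes are tracked.
    """
    s = 0.0
    for i, z in enumerate(z_scores):
        s = max(0.0, s + z - k)  # accumulate only the excess above k
        if s > h:
            return i
    return None


stable = [0.0] * 20
shift = stable + [5.0] * 10                         # abrupt 5-sigma step at index 20
drift = stable + [0.1 * i for i in range(1, 41)]    # slow ramp starting at index 20

shift_alarm = first_cusum_alarm(shift)
drift_alarm = first_cusum_alarm(drift)
```

The step fires on the first post-event sample, while the ramp needs many samples of accumulation before crossing `h`. That is the behavioral difference the table uses to separate shift from drift.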

Theory Library

→ Covariate Shift → Changepoint Detection → Fixed vs Rolling Baseline
