Systems and Control
SECS/GEM, Run-to-Run control, FDC, Virtual Metrology, Design of Experiments, Fab MLOps, and Causal Inference.
Systems and State Machines
The fab is a distributed real-time system. The MES (Manufacturing Execution System) tracks lot genealogy and dispatches wafers. Equipment controllers run embedded state machines that govern when a tool accepts a wafer, reports data, and responds to commands. SECS/GEM is the protocol connecting them, a standard older than HTTP but still the lingua franca of every modern fab floor.
SECS/GEM: The Language of Equipment
SECS (Semiconductor Equipment Communications Standard) and GEM (Generic Equipment Model) define how factory hosts communicate with tools. Together they are not a request-response REST API. They form a session-based, binary protocol with strict timing constraints and arcane error handling, designed for equipment controllers from the 1980s that needed to communicate over RS-232 serial lines.
The protocol stack
SECS operates at two transport layers:
- SECS-I: RS-232 serial communication. Legacy, limited to 9600 baud, still present on older metrology tools and some ion implant systems installed before 2005.
- HSMS: High-Speed SECS Message Services. TCP/IP over Ethernet, conventionally on port 5000 (the port is configurable per tool). The modern standard for most tools installed after 2000.
Above the transport layer, SECS-II defines the message format. Messages are identified by Stream (S) and Function (F) numbers. The notation SnFm identifies a specific transaction. Key messages every fab data scientist encounters:
- S1F1: Are You There? (host to tool, connection check)
- S1F2: I Am Here (tool response)
- S2F41: Host Command Send (remote commands such as recipe select and process start)
- S6F11: Event Report Send (Stream 6, Function 11: async notification from tool when a state change occurs)
- S5F1: Alarm Report Send (equipment alarm notification)
- S9F1: Unrecognized Device ID (error: bad device ID in message header)
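To make the SnFm notation concrete, the sketch below decodes the stream and function numbers from a 10-byte HSMS message header. The byte layout follows the commonly documented HSMS header (session ID, W-bit plus stream, function, PType, SType, system bytes). The function name `parse_hsms_header` and the sample bytes are illustrative assumptions, not taken from any particular tool.

```python
import struct

def parse_hsms_header(header: bytes) -> dict:
    """Decode a 10-byte HSMS message header (layout assumed as described above)."""
    if len(header) != 10:
        raise ValueError("HSMS header must be exactly 10 bytes")
    # >H = 2-byte session ID, 4 x B = single bytes, I = 4-byte system bytes
    session_id, byte2, function, ptype, stype, system_bytes = struct.unpack(
        ">HBBBBI", header
    )
    return {
        "session_id": session_id,
        "reply_expected": bool(byte2 & 0x80),  # W-bit is the top bit of byte 2
        "stream": byte2 & 0x7F,                # lower 7 bits carry the stream
        "function": function,
        "ptype": ptype,
        "stype": stype,                        # 0 indicates a data message
        "system_bytes": system_bytes,          # transaction identifier
    }

# Hypothetical S1F1 ("Are You There?") header with the W-bit set
sample = bytes([0x00, 0x01, 0x81, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x2A])
msg = parse_hsms_header(sample)
print(f"S{msg['stream']}F{msg['function']}", msg["reply_expected"])
```

A real host would read the 4-byte length prefix first and then hand the next 10 bytes to a parser like this one; the message body that follows uses the SECS-II item encoding, which is not shown here.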
The GEM state machine
GEM defines a control state model for equipment (offline and online, with local and remote sub-states), and a separate processing state machine governs the execution sequence:
IDLE -> READY -> PROCESSING -> COMPLETE -> IDLE
State transitions are triggered by S2F41 commands from the host or by operator actions at the tool panel. Data collection is state-dependent: a tool only fires Process_Start and Process_End collection events when in the PROCESSING state. Silent gaps in your historian data often correspond to a tool that was IDLE or in a maintenance state, not a data collection failure.
Data collection events (CEIDs)
Collection Event IDs (CEIDs) are the named triggers that cause a tool to fire an S6F11 report. Every fab configures its own CEIDs, but the universal ones are: Step_Start (wafer enters the process step), Step_End (wafer exits the process step), Process_Start (plasma strikes, gas flows begin), and Process_End (plasma extinguishes, gases purged). These four CEIDs are the primary anchors for joining sensor trace data to MES lot context.
The MES Integration Layer
The MES lot history and the SECS/GEM event data are stored in separate databases with separate clocks. Joining them is the foundational operation of every fab data science project.
- MES lot history: which tool processed each wafer, at what time, with what recipe.
- SECS/GEM event data: process parameters logged during each run, sampled at 1 Hz from the GEM interface or 100 Hz from Interface A.
- Metrology data: CD, thickness, and overlay measurements from inspection tools, available hours to days after the process step.
- Electrical test data: binning results and parametric measurements from wafer sort, available 45 to 90 days after the process step.
The Great Fab Join
The canonical join pattern aligns sensor data to lot context using the Step_Start and Step_End timestamps:
```sql
SELECT
    m.lot_id,
    m.wafer_id,
    m.step_name,
    m.tool_id,
    m.start_time,
    s.parameter_id,
    s.value,
    s.timestamp
FROM mes_lot_history m
LEFT JOIN secs_trace_data s
    ON m.tool_id = s.tool_id
    AND s.timestamp BETWEEN m.start_time AND m.end_time
WHERE m.step_name = 'ETCH_MAIN'
    AND m.lot_id = 'L2026031501'
ORDER BY s.timestamp;
```
The timestamp alignment problem
Tool clocks drift. The MES server runs on UTC. The tool controller runs on local time. The historian server runs on a third clock. The drift between these can reach several minutes over a month of operation without NTP synchronization. A naive join on exact timestamps will misalign sensor data to the wrong wafer.
```python
import pandas as pd


def align_tool_telemetry(high_freq_df, low_freq_df):
    # Sort both dataframes by timestamp (mandatory for merge_asof)
    high_freq_df = high_freq_df.sort_values('timestamp')
    low_freq_df = low_freq_df.sort_values('timestamp')
    # merge_asof joins each high-freq row to the nearest
    # preceding low-freq row - no exact match required
    aligned = pd.merge_asof(
        high_freq_df,
        low_freq_df,
        on='timestamp',
        tolerance=pd.Timedelta('2s'),  # max allowed clock skew
        direction='backward'
    )
    return aligned
```
Run-to-Run Control Architecture
Modern fabs implement Run-to-Run (R2R) control: adjusting recipe parameters wafer-by-wafer based on metrology feedback. The closed-loop architecture requires synchronized timing between the metrology result, the recipe adjustment calculation, and the next wafer's process start. If the MES routes Wafer N+1 to the tool faster than the R2R loop can close, the adjustment is applied to the wrong wafer.
To break the metrology latency bottleneck, fabs use Virtual Metrology: predicting the metrology result from process data immediately at Step_End, then feeding that prediction into the R2R controller without waiting for physical measurement. The associated risk: if the VM model drifts, the R2R controller chases its predictions and amplifies process variation rather than reducing it.
FDC Architecture
Fault Detection and Classification (FDC) systems monitor equipment trace data for anomalies. They are distinct from R2R: R2R corrects processes within normal variation; FDC stops them when they go wrong. The two systems operate in parallel on overlapping data streams.
Traditional SPC uses univariate Shewhart charts: if any single parameter exceeds 3-sigma from its mean, stop the tool. This approach misses faults where multiple parameters drift in a correlated way that keeps each within its individual control limits but moves the process into an operating region it never occupies when healthy.
Run-to-Run Control Mathematics
Statistical process control tells you when a process has gone wrong. Run-to-Run control tells the tool what to do differently on the next wafer so the process does not go wrong again. R2R controllers are commonly deployed on lithography scanners, etch chambers, and CMP modules. They run at production line speed, meaning the controller must compute the next recipe adjustment in the seconds between wafer completions.
The EWMA Controller
The Exponentially Weighted Moving Average (EWMA) controller is the standard R2R algorithm across the semiconductor industry. It weights recent observations more heavily than older ones, making it adaptive to slowly drifting processes without requiring a formal drift model.
Let Y_N be the metrology result for wafer N, and Y_target be the specification target. The recipe adjustment applied before wafer N+1 is:
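One common textbook form of this update assumes a linear process model Y = a + b*u, where u is the recipe setting, b is a known process gain, and a is a slowly drifting intercept (the symbols a, b, and u are assumptions, not defined in the text). The controller filters the observed intercept, a_N = lambda * (Y_N - b*u_N) + (1 - lambda) * a_(N-1), and then solves for the next setting, u_(N+1) = (Y_target - a_N) / b. A minimal sketch under those assumptions:

```python
def ewma_r2r_step(y_measured, u_applied, a_prev, lam, gain, y_target):
    """One EWMA run-to-run update.

    Assumes the process follows Y = a + gain * u, where the intercept `a`
    drifts slowly and `gain` is known from a prior characterization.
    """
    # Blend the newly observed intercept with the previous estimate
    a_new = lam * (y_measured - gain * u_applied) + (1.0 - lam) * a_prev
    # Choose the next recipe setting so the predicted output hits target
    u_next = (y_target - a_new) / gain
    return a_new, u_next

# Hypothetical example: the true intercept is 12.0, the controller starts at 10.0
true_a, gain, y_target, lam = 12.0, 2.0, 50.0, 0.3
a_hat = 10.0
u = (y_target - a_hat) / gain
history = []
for _ in range(30):
    y = true_a + gain * u          # noise-free process for clarity
    history.append(y)
    a_hat, u = ewma_r2r_step(y, u, a_hat, lam, gain, y_target)
```

In this noise-free example the output error shrinks by a factor of (1 - lambda) on every run, which is the behavior the lambda tuning discussion trades off against noise amplification.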
Tuning the lambda parameter
The choice of lambda governs the tradeoff between responsiveness and noise rejection. High lambda (0.7-1.0): fast response to real process shifts, but also amplifies metrology noise into unnecessary recipe adjustments. Low lambda (0.1-0.3): smooth corrections that reject noise, but slow to respond to actual drift.
The optimal lambda under minimum variance criteria depends on the ratio of process noise to metrology noise. In practice, fabs start with lambda = 0.3 and tune upward if the process drifts faster than the controller tracks it.
Double EWMA for Drift Compensation
A standard EWMA controller corrects the current level of error but does not explicitly model drift. If the process is drifting (for example, a CMP pad glazing over 200 wafers), the EWMA will always be chasing the target from behind. Double EWMA adds a second filter that tracks the drift rate explicitly, allowing the controller to predict and pre-compensate for future drift.
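The lag described above can be reproduced in a few lines. The sketch below uses one widely cited predictor-corrector form of double EWMA (a level filter plus a drift-rate filter); the linear process model, the gain of 1.0, the drift of 0.05 units per wafer, and both filter weights are illustrative assumptions rather than values from any real CMP process.

```python
def simulate(n_runs, drift, lam1, lam2, gain=1.0, target=100.0):
    """Return the output error of the final run for a linearly drifting process.

    Setting lam2 = 0 disables the drift filter, which reduces the
    controller to a single EWMA.
    """
    a_hat, d_hat = 0.0, 0.0
    u = (target - a_hat - d_hat) / gain
    y = target
    for n in range(1, n_runs + 1):
        intercept = drift * n                 # hypothetical linear drift
        y = intercept + gain * u              # noise-free process for clarity
        obs = y - gain * u                    # observed intercept this run
        a_prev = a_hat
        a_hat = lam1 * obs + (1 - lam1) * (a_prev + d_hat)    # level filter
        d_hat = lam2 * (obs - a_prev) + (1 - lam2) * d_hat    # drift-rate filter
        u = (target - a_hat - d_hat) / gain   # pre-compensate the next run
    return y - target

single_err = simulate(200, drift=0.05, lam1=0.3, lam2=0.0)
double_err = simulate(200, drift=0.05, lam1=0.3, lam2=0.3)
```

With these settings the single EWMA settles to a constant offset of drift / lambda (about 0.167 units here), while the double EWMA drives the steady-state error toward zero.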
PID Control and Where ML Fits
More sophisticated R2R systems implement PID (Proportional-Integral-Derivative) control, often extended with a feedforward term. The proportional term responds to the current error. The integral term accumulates the history of errors and corrects for persistent bias. The derivative term responds to the rate of change of error, providing predictive correction. The feedforward term is where ML augments classical control: a VM model predicts the incoming wafer's state (for example, the incoming film thickness from CVD), and the controller pre-adjusts the recipe before the wafer arrives rather than waiting for the error to accumulate.
```python
# Anti-windup: prevent integral from accumulating
# when the controller output is saturated
integral = integral + Ki * error
integral = max(delta_min, min(delta_max, integral))
delta = Kp * error + integral
```
The Metrology Delay Problem
The sampled-data EWMA controller faces a fundamental challenge when metrology is not available on every wafer. If only every 5th wafer is measured, the controller must estimate what happened to the 4 unmeasured wafers. The standard approach is to hold the last known correction constant for the unmeasured wafers, then update when the next measurement arrives. This is the minimum-variance solution under the assumption that the process is stationary between measurements.
Virtual metrology solves this by predicting a measurement for every wafer, giving the controller a full observation stream even when physical metrology samples only 20% of wafers.
FDC Architecture and Multivariate Statistics
A Fault Detection and Classification system monitors process data in real time and stops the tool when the data indicates an out-of-control condition. The phrase "real time" is doing significant work in that sentence: the FDC system must evaluate each incoming trace data record, compare it to a reference model, and decide whether to alarm, all within the inter-sample interval. At 100 Hz sampling, that is a 10 ms budget per evaluation cycle.
Univariate SPC: The Baseline
The Shewhart control chart monitors a single parameter against fixed control limits derived from its historical distribution. If the current observation x_N exceeds the mean +/- 3 sigma, an alarm fires. Under the assumption of normality, this occurs with probability 0.0027 when the process is in control, a false alarm rate of 2.7 per 1,000 observations.
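The 0.0027 figure follows directly from the normal tail probability, and its reciprocal gives the average number of in-control observations between false alarms. A short check using only the standard library (the function name is illustrative):

```python
import math

def two_sided_tail_probability(k_sigma: float) -> float:
    """P(|Z| > k_sigma) for a standard normal Z."""
    return math.erfc(k_sigma / math.sqrt(2.0))

false_alarm_prob = two_sided_tail_probability(3.0)
# Expected number of in-control observations between false alarms
average_run_length = 1.0 / false_alarm_prob
print(f"{false_alarm_prob:.4f}", round(average_run_length))
```

This prints a probability of 0.0027 and an average run length of about 370 observations, which is why a 3-sigma chart on a single parameter sampled once per wafer produces a false alarm roughly every 370 wafers.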
The CUSUM chart
Shewhart charts detect large, sudden shifts efficiently but detect small, persistent shifts poorly, because each observation is evaluated independently with no memory of previous values. The Cumulative Sum (CUSUM) chart accumulates evidence of persistent drift:
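A standard formulation is the two-sided tabular CUSUM, which keeps one running sum for upward evidence and one for downward evidence and alarms when either exceeds a decision interval. The sketch below assumes the textbook parameters k = 0.5 and h = 5 (both in sigma units) and a synthetic data series; none of these values come from a specific fab.

```python
def tabular_cusum(values, target, sigma, k=0.5, h=5.0):
    """Return the index of the first CUSUM alarm, or None if no alarm fires.

    k (slack) and h (decision interval) are in units of sigma; the
    defaults are common textbook choices, not fab-specific settings.
    """
    c_plus, c_minus = 0.0, 0.0
    for i, x in enumerate(values):
        z = (x - target) / sigma
        c_plus = max(0.0, c_plus + z - k)     # accumulates upward evidence
        c_minus = max(0.0, c_minus - z - k)   # accumulates downward evidence
        if c_plus > h or c_minus > h:
            return i
    return None

# Hypothetical data: 20 on-target wafers, then a persistent +1 sigma shift
data = [100.0] * 20 + [101.0] * 30
alarm_index = tabular_cusum(data, target=100.0, sigma=1.0)
# A 3-sigma Shewhart rule applied to the same series never fires
shewhart_alarms = [i for i, x in enumerate(data) if abs(x - 100.0) > 3.0]
```

In this noise-free example the CUSUM alarms on the 11th shifted wafer (index 30), while no single observation ever crosses the 3-sigma Shewhart limit.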
For slow drifts characteristic of chamber aging (pad glazing, quartz erosion, polymer buildup), CUSUM and EWMA charts detect the shift in 10 to 30 wafers. The Shewhart chart requires 43 to 155 wafers for the same shift magnitude. In a fab processing 500 wafers per day, the difference is catching a drifting chamber in 1 hour versus 6 hours, and the financial difference is proportional to the scrap produced in that window.
Multivariate SPC: Hotelling's T-squared
Multivariate SPC uses PCA decomposition of the trace data to define two monitoring statistics that together cover the full parameter space:
- T-squared statistic: Measures deviation in the principal component subspace: the directions of normal process variation. High T-squared means the process is at an unusual operating point but still behaving in a correlated way.
- Q-statistic (Squared Prediction Error): Measures deviation in the residual subspace: where unusual sensor correlations appear. High Q means the normal correlation structure between sensors has broken down, which is characteristic of sensor failures or novel fault modes not present in training data.
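The two statistics can be computed from a PCA model in a few lines of NumPy. The sketch below is a minimal illustration on synthetic data: three hypothetical sensors driven by one latent factor, a single retained component, and no control limits (which in practice come from the reference distribution).

```python
import numpy as np

def fit_pca_monitor(X_ref, n_components):
    """Fit a PCA monitoring model on reference (known-good) data."""
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0, ddof=1)
    Z = (X_ref - mean) / std
    # SVD of the scaled data gives loadings and per-component variances
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[:n_components].T                 # (n_sensors, n_components)
    variances = (s[:n_components] ** 2) / (len(Z) - 1)
    return mean, std, loadings, variances

def t2_and_q(x, mean, std, loadings, variances):
    """Return (T-squared, Q) for one observation vector x."""
    z = (x - mean) / std
    scores = z @ loadings
    t2 = float(np.sum(scores ** 2 / variances))    # distance inside the model plane
    residual = z - scores @ loadings.T
    q = float(residual @ residual)                 # squared distance off the plane
    return t2, q

rng = np.random.default_rng(0)
latent = rng.normal(size=500)
# Three hypothetical sensors that normally move together
X_ref = latent[:, None] * np.ones(3) + 0.1 * rng.normal(size=(500, 3))
model = fit_pca_monitor(X_ref, n_components=1)
mean, std = model[0], model[1]

# All sensors move together by 4 sigma: unusual operating point, intact correlation
t2_shift, q_shift = t2_and_q(mean + 4 * std, *model)
# Sensor 1 rises while sensor 2 falls: the correlation structure breaks
t2_break, q_break = t2_and_q(mean + std * np.array([2.0, -2.0, 0.0]), *model)
```

The first test point produces a large T-squared with a near-zero Q, and the second produces the opposite pattern, matching the two fault signatures described in the list above.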
Model retraining triggers
FDC models are trained on reference wafers processed under known-good conditions. They require retraining after every chamber preventive maintenance (PM) event or process change. The formal triggers for FDC model retraining are:
- A planned PM event changes the physical chamber state (new quartz liner, new focus ring, new showerhead).
- A process recipe change shifts the nominal operating setpoints by more than the normal run-to-run variation.
- The false alarm rate on the production floor exceeds an established threshold (typically 2 to 5 per shift).
- The Population Stability Index (PSI) of the incoming trace data distribution exceeds 0.2 against the Phase I reference distribution.
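The distribution-shift trigger in the last item can be computed per parameter as a Population Stability Index. The sketch below uses a common convention (decile bins taken from the reference data and a small floor to avoid log of zero); the bin count, the floor, and the synthetic samples are assumptions for illustration.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one parameter.

    Bin edges are quantiles of the reference data; eps avoids log(0) for
    empty bins. Both choices are common conventions, assumed here.
    """
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    # searchsorted assigns each value a bin index in 0 .. n_bins - 1
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
psi_same = population_stability_index(reference, rng.normal(0.0, 1.0, 5000))
psi_shifted = population_stability_index(reference, rng.normal(1.0, 1.0, 5000))
```

A fresh sample from the same distribution yields a PSI close to zero, while a one-sigma mean shift lands well above the 0.2 retraining threshold.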
The Physics of Ground Truth
In software data science, the target variable is the thing you want to predict. Its values are recorded by the system and treated as fact. In a semiconductor fab, the target variable is a physical measurement produced by an instrument that vibrates, drifts, and degrades. A CD-SEM measurement of trench width is not a fact: it is an estimate with a known noise floor below which no model can improve.
A VM model trained without understanding the noise structure of its labels will set unachievable accuracy targets, report false improvements from hyperparameter tuning that is actually fitting label noise, and produce models that perform worse in production than in evaluation.
How a CD-SEM measures
A CD-SEM (Critical Dimension Scanning Electron Microscope) scans a focused electron beam across the wafer. When the beam crosses the edge of a trench, the number of secondary electrons emitted toward the detector rises sharply. The tool generates a waveform from the detector signal and applies an algorithm to locate the peaks, which correspond to the trench edges. The distance between peaks is reported as the critical dimension.
This measurement process has three sources of noise: electron scattering is stochastic (each individual electron arrives at a slightly different position even from an identical beam); photoresist shrinkage occurs because the electron beam deposits energy in the material it measures, physically shrinking the resist by 0.3 to 0.8 nm per measurement; and algorithmic noise comes from the peak-finding algorithm applied to a noisy waveform.
Gauge R&R and the R-squared ceiling
Gauge R&R (Repeatability and Reproducibility) quantifies how much of the observed variance in your Y column comes from the measurement process rather than from real process variation. The maximum achievable R-squared for a VM model is:
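A commonly used form of this ceiling, assuming the measurement noise is additive and independent of the true value, is R-squared_max = 1 - sigma_gauge^2 / sigma_total^2, where sigma_gauge is the Gauge R&R standard deviation and sigma_total is the standard deviation of the observed labels. The simulation below checks that bound with synthetic numbers (a 1.0 nm process sigma and a 0.5 nm gauge sigma, both hypothetical).

```python
import numpy as np

def r2_ceiling(sigma_gauge, sigma_total):
    """Upper bound on R-squared when labels carry independent gauge noise."""
    return 1.0 - (sigma_gauge ** 2) / (sigma_total ** 2)

rng = np.random.default_rng(0)
true_cd = rng.normal(30.0, 1.0, size=100_000)                    # real process variation
measured_cd = true_cd + rng.normal(0.0, 0.5, size=true_cd.size)  # gauge noise added

# A "perfect" model predicts the true CD exactly, yet is scored against noisy labels
ss_res = np.sum((measured_cd - true_cd) ** 2)
ss_tot = np.sum((measured_cd - measured_cd.mean()) ** 2)
r2_perfect_model = 1.0 - ss_res / ss_tot

ceiling = r2_ceiling(0.5, np.sqrt(1.0 ** 2 + 0.5 ** 2))
```

Here the ceiling is 0.8, and even a model that knows the true CD scores about 0.8 against the measured labels. Any reported R-squared above that value on this data would indicate fitting to label noise.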
Fleet matching and the bimodal Y problem
Fabs operate fleets of metrology tools. Because electrons interact with magnetic lenses uniquely on every machine, Tool A will report systematically different values from Tool B for the same physical feature. Equipment engineers calibrate to minimize this inter-tool offset, but a residual offset typically remains.
If training data contains labels from both Tool A and Tool B, the target variable has a bimodal distribution: two peaks separated by the inter-tool offset. A model trained on this data will struggle to converge because identical sensor traces from the process tool map to two different label values depending on which metrology tool made the measurement. The fix is to stratify by metrology tool and train separate models, or to apply a bias correction to align all labels to a single reference tool before training.
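The bias-correction option can be sketched with pandas. The example assumes every metrology tool measured a comparable wafer population, so that a difference in mean reflects tool bias rather than process differences; in practice the offset is better estimated from shared reference wafers. The column names and values are hypothetical.

```python
import pandas as pd

def align_to_reference_tool(df, reference_tool,
                            tool_col="metrology_tool", value_col="cd_nm"):
    """Subtract each tool's mean offset relative to the reference tool."""
    tool_means = df.groupby(tool_col)[value_col].mean()
    offsets = tool_means - tool_means[reference_tool]
    corrected = df.copy()
    # Map each row's tool to its offset and remove it from the label
    corrected[value_col + "_aligned"] = df[value_col] - df[tool_col].map(offsets)
    return corrected, offsets

labels = pd.DataFrame({
    "metrology_tool": ["A", "A", "A", "B", "B", "B"],
    "cd_nm": [30.0, 30.2, 29.8, 30.6, 30.8, 30.4],
})
aligned, offsets = align_to_reference_tool(labels, reference_tool="A")
aligned_means = aligned.groupby("metrology_tool")["cd_nm_aligned"].mean()
```

After alignment both tools share the same mean, so the bimodal structure in the labels collapses to a single peak before training.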
Design of Experiments
Observational data records what the factory has already done. If the fab has run the plasma etch chamber between 400 W and 410 W for three years, the historical dataset contains no information about what happens at 450 W. A model trained on that dataset cannot predict the yield at 450 W because it has never observed that operating point. To answer questions outside the historical operating range, the process must be deliberately perturbed.
In software, perturbation is cheap. In the fab, it costs physical silicon. A full factorial experiment over 5 parameters at 3 levels each requires 3^5 = 243 wafer runs. At $50,000 per advanced-node wafer, that is a $12.15 million experiment. Designed experiments reduce that cost by running a carefully selected fraction of all possible combinations while still estimating the effects of interest.
Full factorial designs
The simplest designed experiment tests k factors at two levels: low (-1) and high (+1). The full 2^k factorial runs every combination. For k = 3 factors, this is 8 runs, mapping to the eight corners of a cube in three-dimensional parameter space. The orthogonality of the design matrix guarantees that main effects and two-factor interactions can be estimated independently.
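The orthogonality claim is easy to verify numerically. The sketch below builds the 2^3 design in coded units, appends the three two-factor interaction columns, and checks that every pair of columns has a zero dot product.

```python
import itertools
import numpy as np

def full_factorial(k):
    """All 2^k combinations of coded levels -1 and +1, one row per run."""
    return np.array(list(itertools.product([-1, 1], repeat=k)))

design = full_factorial(3)                       # 8 runs x 3 factors
# Append the three two-factor interaction columns: AB, AC, BC
interactions = np.column_stack([
    design[:, i] * design[:, j]
    for i, j in itertools.combinations(range(3), 2)
])
model_matrix = np.column_stack([design, interactions])
# Off-diagonal entries are zero exactly when the columns are orthogonal
gram = model_matrix.T @ model_matrix
```

The Gram matrix comes out as 8 times the identity, which is what allows each main effect and interaction to be estimated independently of the others.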
Fractional factorials and aliasing
For k = 7 factors, a full factorial requires 128 runs. A 2^(7-3) fractional factorial selects a strategically chosen subset of 16 runs by setting certain factor columns equal to products of other factor columns. With well-chosen generators this design is Resolution IV, which guarantees that no main effect is aliased with any other main effect or with any two-factor interaction, but some two-factor interactions are aliased with each other.
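A 16-run design for seven factors can be constructed from a full factorial in four base factors plus three generated columns. The sketch below uses one common choice of generators (E = ABC, F = BCD, G = ACD), which is an assumption here rather than something the text specifies, and verifies both the Resolution IV property and an example of aliasing between two-factor interactions.

```python
import itertools
import numpy as np

# Full factorial in the four base factors A, B, C, D
base = np.array(list(itertools.product([-1, 1], repeat=4)))
A, B, C, D = base.T
# Generators (one common choice, assumed here): E = ABC, F = BCD, G = ACD
E, F, G = A * B * C, B * C * D, A * C * D
design = np.column_stack([A, B, C, D, E, F, G])   # 16 runs x 7 factors

pairs = list(itertools.combinations(range(7), 2))
two_factor = {p: design[:, p[0]] * design[:, p[1]] for p in pairs}

# Resolution IV check: every main effect is orthogonal to every
# two-factor interaction column
main_clear = all(
    design[:, m] @ col == 0 for m in range(7) for col in two_factor.values()
)
# Aliasing check: AB and CE are the same column, so their effects
# cannot be separated by this design
ab_aliased_with_ce = np.array_equal(two_factor[(0, 1)], two_factor[(2, 4)])
```

Both checks pass: main effects are clear of two-factor interactions, while AB and CE are indistinguishable. Resolving that pair would require follow-up runs or process knowledge about which interaction is physically plausible.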
Center points and curvature
A 2^k factorial maps the corners of a hypercube and fits a model that is linear in all factors. Adding center point runs (all factors at the midpoint between their low and high levels) tests for curvature. If the average response at the center points differs significantly from the average at the corner points, the true response surface is curved and a linear model will not adequately represent it. Center points cost 3 to 5 additional wafer runs and provide a pure error estimate for the ANOVA F-test.
Blocking for chamber memory
Statistical design theory requires random run order to prevent time-based confounding. In plasma chambers, the walls carry memory of the previous chemistry, so a fully randomized order lets residue from one run bias the next. A block variable representing chamber condition groupings (before vs. after wet-clean, high-polymer vs. low-polymer chemistry) allows the ANOVA to remove chamber memory effects from the experimental error, recovering the true effect estimates for the factors of interest.
OFAT and why it fails
One-factor-at-a-time (OFAT) experimentation changes one factor while holding all others constant. It cannot detect interaction effects because it never tests combinations of factor levels. In a process with a negative interaction between RF power and chamber pressure, OFAT will find that increasing RF power improves yield at the standard pressure. It will not find that this improvement reverses at elevated pressure. The split-lot experiment that tests both factors simultaneously will find the interaction and identify the correct operating point.
Fab MLOps and Data Pipelines
A 1D CNN trained to detect plasma micro-arcs may achieve 99% accuracy in offline evaluation. If it takes 1.5 seconds to run on the tool controller hardware, it will never catch a micro-arc. Micro-arcs occur in 50 milliseconds. For any model whose output triggers a real-time process decision, the latency of inference on the production hardware is not a deployment detail: it is a primary design constraint that must be validated before any other evaluation metric is considered.
The Python overhead problem
Python is not a low-latency inference environment. When a scikit-learn or PyTorch model predicts on a single sample in Python, the framework dispatches to its underlying C library, allocates memory, executes the math, and converts the result back into a Python object. This overhead can run 20 to 50 ms on a modern laptop and 80 to 200 ms on a 2012-era dual-core tool controller, regardless of how computationally simple the model is.
ONNX export
ONNX (Open Neural Network Exchange) serializes a trained model as a static, language-agnostic computational graph with weights embedded. The ONNX Runtime executes this graph in C++ without any Python interpreter involvement. On the same hardware where Python inference takes 50 ms, ONNX Runtime typically runs in 2 to 8 ms.
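```python
import torch

# `model` (a trained torch.nn.Module in eval mode) and `dummy_input`
# (a tensor with a representative input shape) are assumed to exist.
torch.onnx.export(
    model,
    dummy_input,
    "endpoint_detector.onnx",
    input_names=["trace_features"],
    output_names=["endpoint_probability"],
    opset_version=17,  # match the version on your prod server
    dynamic_axes={
        "trace_features": {0: "batch_size", 1: "sequence_len"}
    },
)
```
Quantization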
Post-training quantization maps FP32 weights to INT8 through a scale factor and zero-point offset per layer. The model shrinks to 25% of its original size. Modern CPUs execute INT8 arithmetic through SIMD vectorization that processes 4 to 8 values simultaneously, making INT8 inference 3 to 6 times faster than FP32 on CPU hardware. The accuracy cost on well-calibrated models is typically less than 1% on the metrics that matter for production.
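The scale and zero-point mapping can be illustrated with NumPy alone. The sketch below applies asymmetric quantization to a single weight tensor; it is a simplified stand-in for what a quantization toolkit does per layer, assumes the tensor is not constant (so the scale is non-zero), and uses synthetic weights.

```python
import numpy as np

def quantize_int8(weights):
    """Affine (asymmetric) quantization of one FP32 tensor to INT8."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0                 # FP32 units per INT8 step
    zero_point = int(round(-128 - w_min / scale))   # INT8 code that dequantizes to 0.0
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000).astype(np.float32)   # hypothetical layer weights
q, scale, zero_point = quantize_int8(weights)
max_abs_error = float(np.max(np.abs(dequantize(q, scale, zero_point) - weights)))
size_ratio = q.nbytes / weights.nbytes
```

The INT8 tensor occupies one quarter of the FP32 storage, and the reconstruction error for each weight stays within one quantization step.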
ONNX export pitfalls
The first class of failure involves opset version mismatches. ONNX Runtime on the production server is fixed at whatever version the IT team approved during the original stack qualification, which may be one to three years old. Exporting with a higher opset version than the runtime supports produces a model that runs without error during development and then fails to load or crashes on the production server.
The rule: determine the ONNX Runtime version on the production server before writing any training code, find the maximum opset version that runtime supports, and set opset_version to that value at export. Do not use the highest opset available in your development environment.
Latency tiers
| Tier | Location | Budget | Use cases |
|---|---|---|---|
| Edge | Tool controller IPC | < 200 ms | Endpoint detection, micro-arc interlocks, real-time FDC |
| Fog | APC server on fab floor | < 2 s | Run-to-run control, between-wafer recipe adjustment |
| Core | MES / VM server | < 30 s | Virtual metrology, lot-level yield prediction |
| Batch | Data warehouse | Minutes to hours | Root cause analysis, DoE analysis, historical trending |
The Fab IT Landscape
The air gap between the fab equipment network and the public internet is a physical security measure protecting process recipes worth billions in R&D. It is not a firewall rule that can be granted exceptions. There is no pip install on the production server. There is no Docker registry on the equipment network. Deployment means physically carrying media across the gap, through an antivirus kiosk, and submitting a change control ticket that takes 2 to 6 weeks to approve.
The Data Pipeline Architecture
Every ML model in the fab depends on a data pipeline that moves sensor readings from the tool controller to a place where a model can act on them, at a latency consistent with the control decision the model is making. The pipeline architecture determines the model's real-world performance ceiling regardless of the model's offline evaluation metrics.
Fab data infrastructure is organized into five tiers, each with different latency, storage, and access characteristics. The Fog tier's time-series historian is the primary data source for FDC, R2R, and VM models. Historians are specialized databases built for append-only, timestamp-indexed, high-throughput write workloads. A chamber producing 50 parameters at 100 Hz generates 432 million rows per day. A bay of 40 chambers generates 140 GB per day. General-purpose relational databases are not suitable for this role.
```sql
-- Canonical historian schema (narrow and tall)
CREATE TABLE trace_data (
    tool_id      VARCHAR(32) NOT NULL,
    parameter_id VARCHAR(64) NOT NULL,
    timestamp_ns BIGINT      NOT NULL,  -- UTC nanoseconds since epoch
    value        FLOAT(32)   NOT NULL,
    quality      TINYINT     NOT NULL,  -- 0=good, 1=uncertain, 2=bad
    PRIMARY KEY (tool_id, parameter_id, timestamp_ns)
);
```
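The row counts quoted above follow from simple arithmetic, and working backward from the 140 GB figure shows how compact the stored rows must be. The bytes-per-row result below is an inference from the quoted numbers, not a figure stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def rows_per_day(n_parameters, sample_rate_hz):
    """Rows written per day by one chamber in a narrow (one value per row) schema."""
    return n_parameters * sample_rate_hz * SECONDS_PER_DAY

chamber_rows = rows_per_day(50, 100)     # one chamber: 50 parameters at 100 Hz
bay_rows = 40 * chamber_rows             # a bay of 40 chambers
# Implied average storage per row if the bay lands at 140 GB per day
implied_bytes_per_row = 140e9 / bay_rows
```

One chamber produces 432 million rows per day and the bay about 17.3 billion, so 140 GB per day works out to roughly 8 bytes per row. That is far less than the raw width of the schema's columns, which suggests the figure assumes the compression and columnar encoding that historians typically apply.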
Causal Inference in Sequential Manufacturing
A yield engineering team spent three months training gradient boosted models on tool usage data, trying to identify which etch chamber was causing a 4% yield loss. The model consistently flagged Chamber 14 with high feature importance. The team scheduled a teardown and found nothing wrong.
Chamber 14 was flagged because it ran immediately before Chamber 22 in the standard lot routing sequence. Chamber 22 had a contaminated focus ring. The correlation was real. The causal relationship was not.
Why correlation fails in the fab
In a 300 mm fab, every wafer passes through 500 to 1,000 process steps. At each step, a specific tool processes the wafer. Tool assignment is not random: lots are routed by queue time, tool availability, and recipe compatibility. Tools that process the same lots tend to be co-scheduled. Their usage patterns are correlated through routing, not through physics.
If Chamber 14 and Chamber 22 both appear frequently in the process history of failing lots, a standard feature importance calculation cannot separate their contributions. It will distribute predictive credit between them based on their correlation structure, not based on which chamber's physics caused the yield loss.
The causal graph and do-calculus
Causal inference requires specifying which variables cause which other variables. This specification is the causal graph, a directed acyclic graph (DAG) where an edge from A to B means A causes B. In the killer tool problem, Chamber 22 causes yield loss, and fab routing causes both Chamber 14 usage and Chamber 22 usage. Routing is therefore a confounder, and Chamber 14 is a spurious correlate rather than a cause: it shares a common cause (routing) with Chamber 22 but has no direct causal path to yield.
The do-calculus formalizes the difference between observing a variable and intervening on it:
- P(Y | X1 = 1): the probability of yield given that Chamber 14 was used. This is observational: it includes all the routing correlation.
- P(Y | do(X1 = 1)): the probability of yield if you intervened to force Chamber 14 usage for all lots. This is causal: it isolates the direct effect of Chamber 14 on yield, removing routing correlation.
These two quantities are equal only when X1 is not confounded by any common cause with Y. In a fab, they are almost never equal because routing is a common cause of tool usage patterns across all chambers.
Adjustment and backdoor paths
A backdoor path between treatment X and outcome Y is any path that begins with an arrow pointing into X, typically running through a common cause of X and Y. In the Chamber 14 example, the backdoor path is: Chamber 14 <- Routing -> Chamber 22 -> Yield. Conditioning on routing closes this path and allows unbiased estimation of Chamber 14's causal effect.
The backdoor criterion states that a set of variables Z is sufficient for causal identification of the effect of X on Y if Z blocks all backdoor paths from X to Y and Z contains no descendants of X. In the fab, the routing schedule often satisfies the backdoor criterion for individual tool effects, making propensity score methods and stratified analysis feasible.
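The effect of closing the backdoor path can be seen in a small simulation of the Chamber 14 and Chamber 22 story. Everything below is synthetic: a binary routing variable, the usage probabilities, and the 4-point yield loss are assumptions chosen to mirror the example, and stratifying on routing is the simplest form of backdoor adjustment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Routing is the common cause: one route sends lots through both chambers more often
route = rng.random(n) < 0.5
ch14 = rng.random(n) < np.where(route, 0.9, 0.1)
ch22 = rng.random(n) < np.where(route, 0.9, 0.1)
# Only Chamber 22 affects yield in this synthetic fab (4-point loss)
yield_pct = 95.0 - 4.0 * ch22 + rng.normal(0.0, 1.0, n)

def mean_diff(mask):
    """Mean yield with Chamber 14 minus mean yield without it, within mask."""
    return yield_pct[mask & ch14].mean() - yield_pct[mask & ~ch14].mean()

# Observational contrast: mixes in the routing correlation
naive_effect = mean_diff(np.ones(n, dtype=bool))
# Backdoor adjustment: estimate within each routing stratum,
# then weight each stratum by its share of lots
adjusted_effect = sum(
    mean_diff(route == r) * np.mean(route == r) for r in (False, True)
)
```

The naive contrast blames Chamber 14 for roughly 2.6 points of yield loss, while the routing-adjusted estimate is close to zero, which is the true effect in this simulation.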
Causal inference vs. causal discovery
Causal inference assumes the causal graph is known or specified by domain experts, and estimates the causal effect of specific tools on yield within that graph. The graph is an input. Causal discovery treats the graph as an output. Constraint-based algorithms such as the PC algorithm return a Markov equivalence class, a set of graphs that are statistically indistinguishable given the data, and the single graph returned by a score-based method such as NOTEARS is only one member of such a class. Multiple graphs in the equivalence class may have different causal interpretations.
Propensity score methods require that all relevant confounders are observed and correctly specified. In a fab with complex routing logic, this assumption is untestable. Causal conclusions from observational tool usage data are always tentative. The appropriate output of causal analysis is a ranked list of candidate causal tools with confidence intervals, not a definitive identification; that conclusion comes from a targeted split-lot experiment on the top candidates.