The Fab Data Scientist's Survival Guide
Seven layers from silicon physics to production ML
This document is written for data scientists transitioning from software domains. Throughout each chapter you will find four types of margin notes. Experienced fab engineers can read the main text and skip the notes. The notes exist for everyone else.
Silicon Is Not Software
In 2019, a data science team at a logic foundry was testing an automated endpoint detection model for a plasma etch chamber. The model predicted process completion 200 milliseconds late. In that window, the plasma drilled through the target silicon layer, through the etch stop, and into the transistor junction below it. The 25-wafer lot was scrapped at $2.5 million. The chamber needed a 14-hour wet-clean before it could run again. The downstream lithography scanner sat idle waiting for material.
The model's endpoint prediction was accurate. Delivering it 200 milliseconds late destroyed the lot.
In semiconductor manufacturing, experiments run on physical silicon at $50,000 per wafer. Ground truth for a yield prediction arrives 60 days after the prediction, when the wafers finally reach electrical test. The machine the deployed model runs on is often an industrial PC from 2012, air-gapped from the public internet.
The Constraints
Physics bounds the optimizer.
Mathematical optimization has no concept of material failure. A response surface model may correctly identify that yield peaks at a chuck temperature of 650°C. The fluoroelastomer O-rings sealing the vacuum chamber melt at 600°C. Every optimization problem in the fab must be bounded by hardware survivability limits as hard constraints in the formulation. A model that finds a physically impossible optimum will be deployed at some point and will fail when it is.
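A minimal sketch of treating a hardware limit as a hard constraint follows. The quadratic response surface, the candidate grid, and the 50°C safety margin are illustrative assumptions; only the 650°C optimum and the 600°C seal limit come from the text above.

```python
# Hypothetical response surface: predicted yield peaks at 650 C (illustrative only).
def predicted_yield(chuck_temp_c):
    return 0.95 - 1e-5 * (chuck_temp_c - 650.0) ** 2

SEAL_LIMIT_C = 600.0       # O-ring failure temperature quoted in the text
SAFETY_MARGIN_C = 50.0     # assumed engineering margin, not a real spec
MAX_SAFE_TEMP_C = SEAL_LIMIT_C - SAFETY_MARGIN_C

def best_setpoint(candidates, upper_bound):
    """Return the highest-yield candidate that respects the hard upper bound."""
    feasible = [t for t in candidates if t <= upper_bound]
    if not feasible:
        raise ValueError("no candidate satisfies the hardware limit")
    return max(feasible, key=predicted_yield)

candidates = [float(t) for t in range(400, 701, 10)]
unconstrained = max(candidates, key=predicted_yield)      # ignores the hardware
constrained = best_setpoint(candidates, MAX_SAFE_TEMP_C)  # respects the hardware
print(unconstrained, constrained)
```

The constraint is applied by filtering the search space before scoring, so an infeasible setpoint can never be returned, however attractive the model finds it.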
Experiments cost $50,000 per wafer.
Testing five parameters at three levels requires 243 wafer runs. At $50,000 per advanced-node wafer, that is a $12 million grid search. Fractional factorial designs, response surface methodology, and Bayesian optimization are standard practice in the fab because the physics is expensive to perturb. The math exists to minimize the number of wafers spent finding an answer.
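The arithmetic behind the full grid can be checked in a few lines. The level names are placeholders; the wafer cost is the figure quoted above.

```python
from itertools import product

WAFER_COST_USD = 50_000           # per-wafer cost quoted in the text
levels = ["low", "mid", "high"]   # placeholder names for three levels

# Full factorial: every combination of 3 levels across 5 parameters.
full_grid = list(product(levels, repeat=5))
n_runs = len(full_grid)
grid_cost_usd = n_runs * WAFER_COST_USD
print(n_runs, grid_cost_usd)
```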
Sensor readings are not ground truth.
A mass flow controller reading 0.0 sccm while the plasma is still striking means the sensor is unplugged. Tool clocks drift by minutes. Databases repeat frozen values for days when thermocouples fail. The primary data stream from legacy SECS/GEM protocols cannot sample faster than 1 Hz. Plasma micro-arcs that destroy transistors happen in 50 milliseconds and leave no trace in 1 Hz data.
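Two of these failure signatures can be screened for before any modeling. The sketch below is one possible approach, not a standard FDC routine; the run length of 5 samples and the 50 W "plasma on" threshold are assumptions to tune per sensor.

```python
def flag_frozen(values, min_run=5):
    """Indices that belong to runs of >= min_run identical consecutive readings."""
    flagged = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the list or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                flagged.extend(range(start, i))
            start = i
    return flagged

def flag_dead_flow(flow_sccm, rf_power_w, power_on_w=50.0):
    """Indices where flow reads exactly 0.0 while RF power indicates plasma is on."""
    return [i for i, (flow, power) in enumerate(zip(flow_sccm, rf_power_w))
            if flow == 0.0 and power > power_on_w]

frozen = flag_frozen([1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0])
dead = flag_dead_flow([0.0, 20.0, 0.0], [500.0, 500.0, 0.0])
print(frozen, dead)
```

Exact repetition is the useful signal here because a live analog sensor almost always carries some noise. Neither check can recover events faster than the sampling rate.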
Deployment targets are air-gapped industrial computers.
The tool controller attached to the etch chamber is often a machine from 2012: dual-core processor, 2 GB of RAM, no GPU, no internet connection, Python 3.7. A PyTorch model that runs in 80 milliseconds on a development laptop may take 1.5 seconds on that hardware. If the plasma event the model monitors happens in 50 milliseconds, 1.5 seconds is a missed event.
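A latency check along these lines is only meaningful when run on the target hardware or a faithful mirror of it. In this sketch, `dummy_predict` is a stand-in for the real inference call and the trial count is arbitrary.

```python
import time

def worst_case_latency_ms(predict, sample, n_trials=200):
    """Time repeated calls and return the slowest one in milliseconds."""
    worst = 0.0
    for _ in range(n_trials):
        t0 = time.perf_counter()
        predict(sample)
        worst = max(worst, (time.perf_counter() - t0) * 1000.0)
    return worst

def fits_budget(latency_ms, budget_ms):
    """True when the measured latency is within the event-duration budget."""
    return latency_ms <= budget_ms

# Stand-in for a real model; replace with the deployed inference call.
def dummy_predict(x):
    return sum(x) / len(x)

latency_ms = worst_case_latency_ms(dummy_predict, [1.0] * 64)
print(fits_budget(latency_ms, budget_ms=50.0))
```

The worst case is reported instead of the mean because a single slow inference is enough to miss a 50 millisecond event.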
How This Document Is Organized
The manual is structured in seven layers. Each layer represents a domain of knowledge that a production fab data scientist must hold simultaneously.
The Fab Rosetta Stone
In a morning yield meeting, a lead process engineer might say: "The FDC on the PVD caught an MFC fault, we need to check the SAH wafer before Q-Time expires or we scrap the WIP." Every word in that sentence refers to a specific physical system, financial consequence, or time constraint. This glossary gives you the minimum vocabulary to function in that conversation.
Pre-Flight: Physics for Computer Scientists
Scale
Fab data spans roughly eight to nine orders of magnitude in length scale, from a 300 mm wafer down to features near 1 nm. A 300 mm wafer holds dice measured in millimeters; each die contains transistors measured in nanometers. At the 3 nm node, the transistor gate is approximately 15 silicon atoms wide. The defects that kill yield are often measured in the single-digit nanometer range on substrates hundreds of millimeters across.
The conversions that matter most: 1 μm = 1,000 nm, and 1 mm = 1,000,000 nm. A defect reported at 0.08 microns is 80 nm, roughly the width of a human hair divided by 1,000.
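Keeping these conversions in named helpers avoids silent unit mix-ups when defect sizes arrive in microns and wafer coordinates in millimeters. The function names are illustrative.

```python
NM_PER_UM = 1_000
NM_PER_MM = 1_000_000

def um_to_nm(microns):
    """Convert micrometers to nanometers."""
    return microns * NM_PER_UM

def mm_to_nm(millimeters):
    """Convert millimeters to nanometers."""
    return millimeters * NM_PER_MM

defect_nm = um_to_nm(0.08)   # the 0.08 micron defect from the text
wafer_nm = mm_to_nm(300)     # diameter of a 300 mm wafer in nanometers
print(round(defect_nm), wafer_nm)
```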
The six core processes
The four sensor units you will encounter daily
| Unit | Measures | Typical range | Data quality note |
|---|---|---|---|
| sccm | Gas flow (standard cubic centimeters per minute) | 0 to 500 | 0.0 may mean off or dead sensor; context determines which |
| mTorr | Chamber pressure (1 Torr = 133 Pa) | 5 to 200 | Correlated with etch rate; rising pressure indicates vacuum degradation |
| W | RF power | 200 to 5,000 | Sudden drop to 0 W mid-step means plasma extinction |
| °C | Temperature | -20 to 600 | Thermocouple readings lag physical temperature by seconds; pyrometers are sensitive to surface emissivity changes |
Day 0: Your First Query
You have completed the security training. You have received your read-only credentials for the data warehouse. The equipment engineer who sponsored your access has sent you a one-line email: "Check the ETCH01_PRESSURE table for yesterday." You open the query interface.
SELECT TIMESTAMP, WAFER_ID, STEP_ID, VAR_01, VAR_02, VAR_03, VAR_07, VAR_12 FROM FDC_HISTORIAN.ET.CH01_PRESSURE_01HZ WHERE TIMESTAMP >= '2026-03-29 00:00:00' ORDER BY TIMESTAMP DESC;
The query returns 47,000 rows. You recognize TIMESTAMP and WAFER_ID. You have no idea what VAR_01 through VAR_12 are. You have four immediate questions.
What are VAR_01 through VAR_50?
They are sensor readings. The FDC historian stores them as generic columns because the table schema was defined in 2008 and the vendor charges $50,000 to add named columns. VAR_01 is Chamber Pressure in millitorr. VAR_02 is RF Forward Power in watts. VAR_07 is OES Intensity at 520 nm. VAR_12 is MFC_1 flow in sccm. This mapping exists in a spreadsheet on a shared drive that only two people know about.
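One way to stop depending on that spreadsheet is to commit the mapping to code next to the analysis. The four entries below are the ones given above; the snake_case names with unit suffixes are a naming assumption, and unmapped columns are deliberately left untouched.

```python
# Mapping taken from the text; target names are an illustrative convention.
VAR_MAP = {
    "VAR_01": "chamber_pressure_mtorr",
    "VAR_02": "rf_forward_power_w",
    "VAR_07": "oes_intensity_520nm",
    "VAR_12": "mfc1_flow_sccm",
}

def rename_columns(row, mapping=VAR_MAP):
    """Rename known VAR_xx keys; keep unknown keys so nothing is silently dropped."""
    return {mapping.get(key, key): value for key, value in row.items()}

row = {"WAFER_ID": "W01", "VAR_01": 15.2, "VAR_03": 1.0, "VAR_12": -9999.0}
renamed = rename_columns(row)
print(renamed)
```

Leaving `VAR_03` unrenamed keeps it visibly undocumented until someone confirms its meaning.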
Why is VAR_12 returning -9999.0?
-9999.0 is the FDC historian's sentinel value for "sensor offline" or "communication timeout." It does not mean the MFC_1 flow was negative. If you ingest this table into pandas and run df.mean() without replacing -9999.0 with NaN, your features will be corrupted by a physically impossible value.
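The effect is easy to demonstrate with plain Python. The sample readings are invented for illustration. In pandas, the equivalent step would be something like `df.replace(-9999.0, float("nan"))` before aggregating, since `mean()` skips NaN by default.

```python
import math

SENTINEL = -9999.0   # historian code for "sensor offline", per the text

def clean(values, sentinel=SENTINEL):
    """Replace sentinel readings with NaN so they cannot enter an aggregate."""
    return [math.nan if v == sentinel else v for v in values]

def nanmean(values):
    """Mean over non-NaN values; NaN if nothing valid remains."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

readings = [20.0, 20.0, -9999.0, 20.0]         # invented sample data
naive_mean = sum(readings) / len(readings)      # corrupted by the sentinel
clean_mean = nanmean(clean(readings))           # sentinel excluded
print(naive_mean, clean_mean)
```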
Why are there three rows per second?
The table name ends in _01HZ. This is a SECS/GEM trace sampled at 1 Hz. The three rows per second may be from three different wafers running simultaneously in different chambers on the same tool. If you need the 100 Hz Interface A data for endpoint detection, that table is named ET.CH01_PRESSURE_100HZ and contains 100x more rows.
Where is the ground truth?
There is no Y column in this table. The FDC historian contains process parameters (X). The metrology results (Y) are in the MES METROLOGY_RESULTS table, joined by WAFER_ID and STEP_ID. For the wafer that ran an hour ago, the CD measurement will not exist yet. It will exist in 4 to 8 hours when the wafer reaches the metrology tool. For final electrical test data, it will exist in 45 to 90 days.
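The join described above behaves like a left join in which recent wafers have no label yet. This sketch uses plain dictionaries; `CD_NM` is a hypothetical label column name, and the rows are invented.

```python
def attach_labels(fdc_rows, metrology_rows, label_key="CD_NM"):
    """Attach metrology Y to FDC X on (WAFER_ID, STEP_ID); None when not yet measured."""
    labels = {(m["WAFER_ID"], m["STEP_ID"]): m[label_key] for m in metrology_rows}
    joined = []
    for row in fdc_rows:
        out = dict(row)
        out[label_key] = labels.get((row["WAFER_ID"], row["STEP_ID"]))
        joined.append(out)
    return joined

fdc = [
    {"WAFER_ID": "W01", "STEP_ID": 3, "VAR_01": 15.1},
    {"WAFER_ID": "W02", "STEP_ID": 3, "VAR_01": 15.3},  # ran an hour ago
]
metrology = [{"WAFER_ID": "W01", "STEP_ID": 3, "CD_NM": 24.8}]

joined = attach_labels(fdc, metrology)
labeled = [r for r in joined if r["CD_NM"] is not None]
print(len(joined), len(labeled))
```

In practice the 1 Hz trace has many rows per wafer and step, so the trace is usually aggregated to one row per (WAFER_ID, STEP_ID) before a join like this.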
Identify which VAR_XX columns in your target chamber are process sensors, which are safety interlocks, and which are currently returning -9999.0. Document the physical meaning of every column you intend to use. This is not optional. It is the foundation of every subsequent analysis.
The Mindset Shift
This section is for data scientists entering semiconductor manufacturing from consumer internet, fintech, healthcare, or other software domains. The technical skills transfer. The intuitions about what those skills are for, and what the consequences of using them wrong look like, do not transfer without deliberate recalibration.
The items below are not hypothetical. Each one represents a failure mode that recurs regularly in fabs when engineers who are competent in their original domain apply their existing mental models without adjusting them to the physical and financial context.
| Software DS assumption | Fab DS reality |
|---|---|
| Data is abundant and cheap. I can always collect more if my sample size is insufficient. | Experiments cost $50,000 per wafer. A 243-run full factorial design is $12.15 million. You will receive 47 samples for a rare defect mode and you must make it work. |
| Ground truth arrives in milliseconds (click) to days (credit default). I can validate my model quickly. | Ground truth for yield arrives about 60 days later, when wafers reach electrical test. Your model will operate for two months before you know if it works. |
| I can query production databases directly for feature extraction. | The production FDC historian is on an air-gapped network. Querying it requires a change control ticket. Your development environment will never directly touch production data. |
| Feature drift means retrain the model with fresh data. | Feature drift might mean a $2M chamber component has degraded. Retraining masks the hardware problem. Investigate first. |
| A 200ms latency budget is generous for real-time inference. | The plasma micro-arc you are detecting lasts 50ms. Your model has 50ms to detect it and trigger the interlock. |
| My model runs in a container on a cloud instance with 32 cores and 128GB RAM. | Your model runs on a dual-core industrial PC from 2012 with 2GB RAM and no GPU. It must infer in 5ms. |
| Drift detection: PSI > 0.2 means significant shift. | A recipe change from 15mTorr to 45mTorr produces PSI = 0.8 immediately. Monitor deviation from setpoint, not absolute value distribution. |
| Model confidence interval: 95% is a statistical luxury. | Model confidence interval determines whether physical scrap is produced. It is an operational necessity. |
| Feature engineering: try thousands of combinations and let the algorithm select. | Each feature must be physically interpretable to a process engineer with 20 years of experience. If you cannot explain what the feature measures, the model will not be trusted. |
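
The drift-detection row in the table can be illustrated with a small Population Stability Index calculation. This is a sketch under stated assumptions: the pressure samples, bin edges, and epsilon floor are invented, so the PSI magnitudes will not match any figure quoted in the table. The point is only the contrast between monitoring raw values and monitoring deviation from setpoint.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index over fixed bin edges (out-of-range values go to end bins)."""
    def fractions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            idx = 0
            while idx < len(counts) - 1 and v >= edges[idx + 1]:
                idx += 1
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

before = [14.8, 15.0, 15.1, 15.2, 14.9, 15.0]   # invented, setpoint 15 mTorr
after = [44.9, 45.0, 45.1, 45.2, 44.8, 45.0]    # invented, setpoint 45 mTorr

raw_psi = psi(before, after, edges=[0, 10, 20, 30, 40, 50])

dev_before = [p - 15.0 for p in before]
dev_after = [p - 45.0 for p in after]
dev_psi = psi(dev_before, dev_after, edges=[-0.5, -0.15, -0.05, 0.05, 0.15, 0.5])
print(raw_psi > 0.2, dev_psi > 0.2)
```

On raw pressure the recipe change trips the 0.2 threshold even though the chamber is healthy. On deviation from setpoint the two periods are indistinguishable, which is the behavior you want from a hardware monitor.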
On data collection
A data scientist trained in software learns that more data is almost always better and that the cost of collecting more data is primarily engineering time. This is correct in most software domains.
In the fab, collecting more data requires running wafers through a process step under conditions that differ from the approved Process of Record. This is a wafer split experiment. Each wafer in the split costs $50,000. A five-factor experiment at three levels per factor requires, at minimum, a compact response surface design such as a face-centered central composite built on a Resolution V half-fraction: 16 factorial runs, 10 axial runs, and about 6 center points, for roughly 32 runs. That is $1.6M in wafer cost before accounting for the engineering time, the MES coordination, and the Yield Review Board approval process, which takes 2 to 4 weeks for a process change request.
The correct mental model: treat data collection as a capital expenditure, not an operating expense. Before requesting an experimental split lot, write out the specific hypothesis being tested, the minimum detectable effect size required to act on the result, and the sample size justified by that effect size. If you cannot specify those three things, you are not ready to request the experiment.
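The link between effect size and sample size can be made concrete with the usual normal-approximation formula for a two-arm comparison, n per arm = 2((z_{1-α/2} + z_{power})σ/δ)². The sketch below assumes a two-sided test, equal variance in both arms, and example values for σ and δ that are not from any real process. `statistics.NormalDist` needs Python 3.8 or later, so this belongs in the analysis environment, not on the tool controller.

```python
import math
from statistics import NormalDist

WAFER_COST_USD = 50_000   # per-wafer cost quoted in the text

def wafers_per_arm(sigma, min_effect, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / min_effect) ** 2)

# Assumed example: wafer-to-wafer sigma of 1.0 nm, smallest shift worth acting on 1.0 nm.
n = wafers_per_arm(sigma=1.0, min_effect=1.0)
split_cost_usd = 2 * n * WAFER_COST_USD
print(n, split_cost_usd)
```

Doubling the minimum detectable effect cuts the required wafers by roughly a factor of four, which is why the effect size has to be agreed before the split is requested.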
On ground truth latency
Most ML applications receive ground truth within seconds to weeks. For etch depth, film thickness, or transistor parameter prediction, ground truth arrives when the wafer reaches the relevant metrology step, hours to days after the step being predicted. For final electrical test, the delay is 45 to 90 days of fab processing.
During those 45 to 90 days, the model may be processing thousands of wafers. If the model has a systematic bias that was not caught in offline evaluation, the financial consequences accumulate silently until electrical test. The appropriate response is not to wish for faster ground truth. The appropriate response is input distribution monitoring: if the distribution of the model's input features shifts in a way that was not present in training data, the model is operating outside its validated range and the affected wafers should be routed to physical metrology.
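The simplest form of input monitoring is a per-feature range check against the training data. This is a deliberately crude sketch: a min/max envelope ignores correlations between features, and the feature names and values are invented.

```python
def training_envelope(train_rows):
    """Per-feature (min, max) observed in the training data."""
    names = train_rows[0].keys()
    return {n: (min(r[n] for r in train_rows), max(r[n] for r in train_rows))
            for n in names}

def outside_envelope(row, envelope):
    """Names of features whose value falls outside the training range."""
    return [n for n, (lo, hi) in envelope.items() if not lo <= row[n] <= hi]

def route(row, envelope):
    """Send the wafer to physical metrology when any input is out of range."""
    return "physical_metrology" if outside_envelope(row, envelope) else "accept_prediction"

train = [
    {"pressure_mtorr": 14.8, "rf_power_w": 495.0},
    {"pressure_mtorr": 15.2, "rf_power_w": 505.0},
]
envelope = training_envelope(train)
print(route({"pressure_mtorr": 15.0, "rf_power_w": 500.0}, envelope))
print(route({"pressure_mtorr": 15.0, "rf_power_w": 450.0}, envelope))
```

A NaN input fails the range comparison and is routed to metrology, which is the conservative behavior.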
On deployment infrastructure
In software production environments, model deployment means containerizing a Python process and pushing it to a Kubernetes cluster. In the fab, model deployment means packaging a serialized model file, normalization constants, feature extraction code, and a lineage manifest, transferring the package through physical media across an air gap, passing it through an antivirus kiosk, submitting an IT change management ticket, waiting 2 to 6 weeks for change control approval, and installing the package on a machine running Python 3.7 with no package manager access.
There is no workaround for this process. Design for the production stack from the start. Export to ONNX before the model is considered complete. Validate the exported model on a mirror of the production environment before submitting the change control ticket. The time from "model is done" to "model is running in production" is measured in weeks, not minutes.
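A lineage manifest can be as simple as a JSON file of checksums that is generated on the development side and re-verified after the package crosses the air gap. The file names, byte contents, and manifest layout below are hypothetical. Only the Python standard library is used, so the verification half can run on the Python 3.7 target.

```python
import hashlib
import json

def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(artifacts, python_version="3.7"):
    """artifacts: dict of filename -> bytes. Returns a JSON manifest string."""
    entries = {name: {"sha256": sha256_hex(blob), "bytes": len(blob)}
               for name, blob in artifacts.items()}
    return json.dumps({"python": python_version, "artifacts": entries},
                      indent=2, sort_keys=True)

def verify(artifacts, manifest_json):
    """True only if the file set and every checksum match the manifest."""
    recorded = json.loads(manifest_json)["artifacts"]
    if set(recorded) != set(artifacts):
        return False
    return all(sha256_hex(blob) == recorded[name]["sha256"]
               for name, blob in artifacts.items())

package = {   # hypothetical package contents
    "model.onnx": b"\x08\x07fake-model-bytes",
    "norm_constants.json": b'{"mean": 15.0, "std": 0.2}',
    "features.py": b"def extract(rows): ...\n",
}
manifest = build_manifest(package)
print(verify(package, manifest))
```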
On feature drift
In software ML, input distribution shift typically means user behavior has changed. Retraining on recent data is the standard response. In the fab, input distribution shift may indicate behavior change (new recipe, new product) or it may indicate hardware degradation. A 10% drift in RF Forward Power distribution may mean recipes were updated, or it may mean the RF match network capacitor is wearing out. Retraining the model to treat the drifted distribution as normal masks a hardware failure that will eventually cause a yield excursion.
The correct response to input distribution shift in a fab ML system is root cause investigation. If the drift comes from a recipe or product change, it is expected: document it and update the baseline. If it comes from hardware degradation, do not retrain. Escalate to equipment engineering for inspection.
On model confidence
In software applications, model confidence intervals are primarily used for user experience design. In virtual metrology, model confidence determines whether a wafer is routed to physical metrology (adding 4 to 8 hours of cycle time) or accepted based on the VM prediction alone. If the model accepts a wafer with a low-confidence prediction that turns out to be wrong, the wafer proceeds to the next processing step, accumulates additional value, and eventually fails electrical test. By then the root cause is 90 days and hundreds of process steps in the past.
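One possible routing rule accepts the virtual metrology prediction only when its entire interval sits inside the specification window. The spec limits, interval half-widths, and return labels below are assumptions for illustration; how the interval is produced is left to the model.

```python
def route_wafer(prediction, half_width, lower_spec, upper_spec):
    """Accept the VM result only if the whole prediction interval is within spec."""
    lower = prediction - half_width
    upper = prediction + half_width
    if lower >= lower_spec and upper <= upper_spec:
        return "accept_vm"
    return "physical_metrology"

# Assumed spec window of 48 to 52 nm on a critical dimension.
print(route_wafer(50.0, 1.0, 48.0, 52.0))   # tight interval, centered
print(route_wafer(50.0, 3.0, 48.0, 52.0))   # same prediction, low confidence
print(route_wafer(51.5, 1.0, 48.0, 52.0))   # confident but near the limit
```

The second and third cases show that a wide interval and a prediction near a spec limit both cost the 4 to 8 hours of metrology time, which is the intended trade against scrap.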
On physical interpretability
In software ML, a feature that the algorithm selects but that you cannot explain is acceptable if the validation metrics justify it. In the fab, a model with unexplained features will not be trusted by the process engineering team and will not change the Process of Record. The process engineers who own the tool have typically spent 10 to 20 years understanding its physics. A model that says "this combination of features predicts yield" without being able to explain the physical mechanism is not a model; it is a black box that no experienced engineer will stake their process on.
Every feature in a production fab ML model should have a physical interpretation that can be stated in one sentence referencing a specific mechanism. "RF_Forward_Power_mean reflects the average energy delivered to the plasma during the main etch step, which controls ion bombardment flux and therefore etch rate." That is an acceptable feature justification. "The algorithm selected it because it correlates with yield in cross-validation" is not.
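A feature like RF_Forward_Power_mean stays interpretable when its extraction code is as explicit as its justification. The sketch assumes RF Forward Power lives in VAR_02, as in the Day 0 mapping; the `MAIN_ETCH` step label and the sample rows are hypothetical.

```python
SENTINEL = -9999.0   # historian "sensor offline" code

def rf_forward_power_mean(rows, main_etch_step):
    """Mean RF forward power (W) over main-etch samples, skipping sentinel readings."""
    samples = [r["VAR_02"] for r in rows
               if r["STEP_ID"] == main_etch_step and r["VAR_02"] != SENTINEL]
    if not samples:
        return None   # no valid samples: let the caller decide how to handle it
    return sum(samples) / len(samples)

rows = [   # invented trace rows for one wafer
    {"STEP_ID": "STABILIZE", "VAR_02": 100.0},
    {"STEP_ID": "MAIN_ETCH", "VAR_02": 500.0},
    {"STEP_ID": "MAIN_ETCH", "VAR_02": -9999.0},
    {"STEP_ID": "MAIN_ETCH", "VAR_02": 510.0},
]
feature = rf_forward_power_mean(rows, main_etch_step="MAIN_ETCH")
print(feature)
```

Restricting the mean to the main etch step is what ties the number to the stated mechanism; averaging over the whole recipe would blend in stabilization and purge steps.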
Before writing your first production model, estimate the dollar cost of a false positive and a false negative for the specific decision it will make. In ad targeting, a false negative costs approximately $0.005. In etch endpoint detection, a false negative costs approximately $2.5 million. That difference determines everything: the acceptable false negative rate, the appropriate confidence threshold, the cost of additional metrology, and the value of the model relative to the status quo.
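Under a simple expected-cost model, raising an alarm is worthwhile whenever p × C_FN exceeds (1 − p) × C_FP, which gives a probability threshold of C_FP / (C_FP + C_FN). The false negative costs below are the ones quoted above; both false positive costs are assumptions chosen only to show the scale of the difference.

```python
def alarm_threshold(cost_fp, cost_fn):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_fp / (cost_fp + cost_fn)

# False negative costs from the text; false positive costs are assumed.
etch_threshold = alarm_threshold(cost_fp=5_000.0, cost_fn=2_500_000.0)
ad_threshold = alarm_threshold(cost_fp=0.001, cost_fn=0.005)
print(round(etch_threshold, 4), round(ad_threshold, 4))
```

With these assumed numbers the etch model should act on a fault probability of roughly 0.2 percent, while the ad model can wait until about 17 percent. The same classifier needs very different operating points in the two settings.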