Physical Processes
Vendor data taxonomy, unit process physics, device physics, spatial statistics, survival analysis, physics-informed ML, advanced packaging.
Vendor-Specific Data Taxonomy
Generic process knowledge tells you that plasma etch requires RF power, chamber pressure, and gas flow. It does not tell you that on a Lam Research etcher the critical parameter is RF_Impedance_Real, while on an Applied Materials etcher it is Load_Coil_Position, and on a Tokyo Electron tool the same physical concept is logged as PRES_CHAM in a completely different file format. A data scientist building models across a heterogeneous fab fleet must translate physical failure mechanisms into vendor-specific column names.
The ASML Scanner: TwinScan architecture
ASML TwinScan systems have two stages: while Stage A is exposing Wafer 1 under the lens, Stage B is measuring the topography and alignment marks of Wafer 2. The measurement data from Stage B is used to compute leveling and alignment corrections applied to the next exposure. The predictive modeling consequence: the Leveling_Sensor_Z and Alignment_Error_X/Y arrays logged for a wafer were collected before that wafer was exposed, which makes them well suited as feedforward features for predicting overlay on the upcoming exposure. They are not diagnostic of the exposure that just completed on the other stage.
The alignment polynomial
Overlay data logged by the scanner's internal metrology represents the residual error after the scanner's corrections were applied. It does not represent the raw incoming error. To model the true physical variation of the upstream process, you must reconstruct the uncorrected alignment signal by adding back the fitted polynomial corrections logged in the recipe execution report. Skipping this step produces a model trained on the scanner's correction residuals rather than on the physical wafer distortion.
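The add-back step can be sketched as follows. This is a minimal illustration, not the scanner's actual correction model: it assumes a linear correction with translation, magnification, and rotation terms, and the coefficient names (tx, ty, mag_x, mag_y, rot), their units (nm and nm/mm), and the sign convention are all hypothetical. The real terms must come from the recipe execution report.

```python
def linear_correction(x_mm, y_mm, coeffs):
    """Evaluate a hypothetical linear correction model at wafer position (x, y).

    Returns the correction (dx, dy) in nm. The term set is an assumption.
    """
    dx = coeffs["tx"] + coeffs["mag_x"] * x_mm - coeffs["rot"] * y_mm
    dy = coeffs["ty"] + coeffs["mag_y"] * y_mm + coeffs["rot"] * x_mm
    return dx, dy


def reconstruct_raw(residual_xy_nm, x_mm, y_mm, coeffs):
    """Add the fitted correction back onto the logged residual."""
    cx, cy = linear_correction(x_mm, y_mm, coeffs)
    return residual_xy_nm[0] + cx, residual_xy_nm[1] + cy


# Illustrative values only.
coeffs = {"tx": 2.0, "ty": -1.0, "mag_x": 0.1, "mag_y": 0.2, "rot": 0.05}
raw_x, raw_y = reconstruct_raw((0.5, -0.5), x_mm=100.0, y_mm=50.0, coeffs=coeffs)
```

The model is then trained on raw_x and raw_y rather than on the logged residuals.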
The Lam Research Plasma Chamber: RF subsystem
Lam Research tools produce the richest telemetry in the fab. The RF subsystem alone carries more diagnostic information than most other sensors combined. RF data from Lam tools represents complex impedance. The parameter set must be treated as a complex number Z = R + jX, or equivalently in polar form as magnitude and phase angle. Analyzing only the real or only the imaginary component discards half the diagnostic information available.
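A minimal sketch of the polar conversion, assuming the real and imaginary parts are available as two floats per sample. The function name and the sample values are illustrative, not Lam column names.

```python
import cmath
import math


def impedance_polar(r_ohm, x_ohm):
    """Return (magnitude in ohms, phase in degrees) for Z = R + jX."""
    magnitude, phase_rad = cmath.polar(complex(r_ohm, x_ohm))
    return magnitude, math.degrees(phase_rad)


mag, phase_deg = impedance_polar(3.0, 4.0)
```

Keeping both outputs as features preserves the information that is lost when only R or only X is analyzed.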
Lam tools log Mass Flow Controller data with a thermal lag of 0.5 to 2.0 seconds between the commanded setpoint and the logged actual value. The delta between setpoint and actual during a step transition is a more predictive feature than either value alone. This transient deviation predicts chamber pressure transients during recipe step transitions, which drive etch rate non-uniformity through the microloading mechanism.
The ESC (Electrostatic Chuck) helium backpressure
The electrostatic chuck holds the wafer by electrostatic force, and helium backside pressure (He_BP) fills the gap between the wafer and the chuck surface. This gas is not a process chemical: it is the thermal coupling agent between the wafer and the cooled pedestal. A drop in He_BP means the wafer is decoupled thermally from the chuck, causing temperature excursions that affect etch rate and selectivity. Monitoring the slope of He_BP over consecutive wafers predicts ESC seal degradation 50 or more wafers before a catastrophic drop.
- Normal operation: 10 to 20 Torr
- Incipient leak: below 5 Torr; wafer clamping degraded, thermal coupling compromised
- Catastrophic loss: 0 Torr; wafer likely dropped or shattered inside the chamber
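The slope monitor and the pressure bands above can be sketched like this. It is a minimal illustration: the function names are hypothetical, the slope is an ordinary least-squares fit against wafer index, and readings between 5 and 10 Torr or above 20 Torr are left unclassified because the bands above do not cover them.

```python
def he_bp_slope(pressures_torr):
    """Least-squares slope of He_BP versus wafer index, in Torr per wafer."""
    n = len(pressures_torr)
    if n < 2:
        raise ValueError("need at least two wafers")
    mean_x = (n - 1) / 2
    mean_y = sum(pressures_torr) / n
    sxy = sum((i - mean_x) * (p - mean_y) for i, p in enumerate(pressures_torr))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def he_bp_state(pressure_torr):
    """Classify a single reading using the bands listed above."""
    if pressure_torr <= 0.0:
        return "catastrophic"
    if pressure_torr < 5.0:
        return "incipient_leak"
    if 10.0 <= pressure_torr <= 20.0:
        return "normal"
    return "unclassified"  # not covered by the documented bands


# Synthetic drift: 0.02 Torr lost per wafer over 50 wafers.
readings = [15.0 - 0.02 * i for i in range(50)]
slope = he_bp_slope(readings)
```

Every reading in this synthetic run is still in the normal band, yet the negative slope is already visible, which is the point of trending the slope rather than alarming on the level.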
Applied Materials: throttle valve and PVD target health
Applied Materials CVD tools maintain chamber pressure by controlling a throttle valve on the exhaust line. Throttle_Valve_Position is reported as a percentage from 0 to 100. As chamber walls season with deposited polymer, the conductance of the exhaust path decreases and the valve must open further to maintain the same pressure setpoint. Trending this value over 500 to 1,000 wafers reveals chamber coating progression and predicts flaking risk before particle counts spike.
PVD tools log Sputter_Voltage and Sputter_Current. The derived impedance (V/I) tracks the health of the sputter target. As a target erodes through its useful life, the impedance drifts in a characteristic pattern. A sudden step change in impedance indicates target perforation: the plasma has etched through the thinnest part of the target and is now sputtering the backing plate material into the film.
Tokyo Electron: showerhead gap and cell architecture
TEL etchers log Showerhead_Gap: the physical distance in millimeters between the gas distribution plate and the wafer. This gap adjusts dynamically to tune plasma uniformity. When joining TEL etch data to Lam etch data for fleet-level analysis, the showerhead gap has no direct analogue on the Lam tool and must be excluded from the shared feature set or handled with a tool-type indicator variable.
TEL batch furnaces use a cell ID architecture where multiple wafer lots share a single process tube. SECS-II collection events fire per cell, not per wafer. A single "Batch Complete" CEID contains data for 100 or more wafers, but the MES typically records only the first wafer ID as the lot identifier. Joining on wafer ID alone will assign the batch telemetry to one wafer and leave the rest with null features.
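One way to avoid the null-feature problem is to join on a batch key instead of the wafer ID. The sketch below assumes a batch-to-wafer membership record is available from the MES; the field names (wafer_id, batch_id, tube_temp_c) are hypothetical.

```python
def join_on_batch(wafer_rows, batch_telemetry):
    """Attach batch-level telemetry to every wafer via batch_id, not wafer_id."""
    joined = []
    for row in wafer_rows:
        features = batch_telemetry.get(row["batch_id"], {})
        joined.append({**row, **features})
    return joined


wafer_rows = [
    {"wafer_id": "W01", "batch_id": "B7"},
    {"wafer_id": "W02", "batch_id": "B7"},
    {"wafer_id": "W03", "batch_id": "B8"},
]
batch_telemetry = {"B7": {"tube_temp_c": 850.0}}
joined = join_on_batch(wafer_rows, batch_telemetry)
```

Both wafers in batch B7 receive the same telemetry; W03 has no batch record and stays without the feature, which makes the gap visible instead of silently assigning another batch's data.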
The Golden Schema protocol
In practice, data from these four vendors arrives in incompatible formats: different timestamp representations (ASML uses big-endian byte order; TEL uses JST; Lam uses UTC milliseconds; AMAT uses floating-point epoch seconds), different numeric encodings, and different naming conventions for physically identical parameters.
Chamber pressure, for example: Lam uses CHAMBER_PRESSURE, AMAT uses ChamberPr or CPRESS depending on tool generation, TEL uses PRES_CHAM. A Master Parameter Mapping Table stored in YAML or SQL maps these vendor-specific names to canonical names before feature extraction. Without this table, any script that references a specific parameter name will silently produce null features on data from other vendors.
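A minimal in-memory version of such a mapping table might look like the following. The vendor keys ("LAM", "AMAT", "TEL") and the canonical name chamber_pressure are assumptions made for the example; in practice the table would be loaded from the YAML or SQL store. The lookup raises on unmapped names so that a missing entry fails loudly instead of producing null features.

```python
PARAMETER_MAP = {
    ("LAM", "CHAMBER_PRESSURE"): "chamber_pressure",
    ("AMAT", "ChamberPr"): "chamber_pressure",
    ("AMAT", "CPRESS"): "chamber_pressure",
    ("TEL", "PRES_CHAM"): "chamber_pressure",
}


def to_canonical(vendor, record):
    """Rename vendor-specific parameter names to canonical names.

    Raises KeyError if any parameter has no mapping for this vendor.
    """
    out = {}
    unmapped = []
    for name, value in record.items():
        canonical = PARAMETER_MAP.get((vendor, name))
        if canonical is None:
            unmapped.append(name)
        else:
            out[canonical] = value
    if unmapped:
        raise KeyError(f"unmapped {vendor} parameters: {sorted(unmapped)}")
    return out


tel_row = to_canonical("TEL", {"PRES_CHAM": 12.5})
amat_row = to_canonical("AMAT", {"CPRESS": 11.0})
```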
A data scientist wrote a Python ingestion script for a newly installed ASML scanner, parsing the binary timestamp field using standard Intel little-endian byte order. ASML uses big-endian byte order. Reversing the byte order of a 64-bit Unix timestamp produces a garbled value bearing no meaningful relationship to wall-clock time. Some records appeared to have occurred decades in the past.
The model trained on this data learned correlations between lithography parameters and etch outcomes that had no physical basis. It ran in shadow mode for three weeks before an engineer noticed the wafer-level timestamps were inconsistent with the lot-level MES timeline.
Cost: $2.4M in misrouted wafers and six weeks of model revalidation. Fix: all ingestion scripts now include an explicit endianness validation step that reads a known-good reference timestamp from the file header and checks it against the MES lot start time before parsing the remainder of the file.
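The validation step can be sketched as below. This is an illustration under stated assumptions, not the fab's actual script: it assumes the header reference timestamp is a signed 64-bit integer count of Unix seconds, and the one-day tolerance is arbitrary.

```python
import struct


def validate_byte_order(header_ts_bytes, mes_lot_start_s, tolerance_s=86_400):
    """Return the struct byte-order prefix whose decoding agrees with the MES.

    '>' is big-endian, '<' is little-endian. Raises ValueError when neither
    decoding lands within tolerance of the MES lot start time.
    """
    for prefix in (">", "<"):
        (decoded,) = struct.unpack(prefix + "q", header_ts_bytes)
        if abs(decoded - mes_lot_start_s) <= tolerance_s:
            return prefix
    raise ValueError("header timestamp disagrees with MES lot start time")


mes_lot_start = 1_700_000_000
header = struct.pack(">q", mes_lot_start + 120)  # written big-endian
byte_order = validate_byte_order(header, mes_lot_start)
```

The returned prefix is then used for every subsequent struct.unpack call on the file, so a wrong assumption about byte order stops ingestion at the header instead of corrupting every record.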
Unit Processes: Physics and Failure Modes
Lithography: the pattern transfer bottleneck
Lithography is the optical projection of a mask pattern onto photoresist. It is the most capital-intensive step in the fab and the source of the most measurement-intensive data science problems. The minimum printable feature size is governed by the Rayleigh criterion:
CD = k1 * lambda / NA
where CD is the critical dimension, lambda is wavelength (13.5 nm for EUV, 193 nm for DUV), NA is numerical aperture (0.33 to 0.55 for EUV), and k1 is a process-dependent factor approaching its theoretical limit of 0.25 for advanced production. A 1% dose error creates a k1-dependent CD error that propagates directly into transistor electrical parameters.
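As a worked example, the Rayleigh criterion in its standard form, CD = k1 * lambda / NA, can be evaluated at the EUV values quoted above. The function name is illustrative.

```python
def rayleigh_cd_nm(k1, wavelength_nm, na):
    """Minimum printable critical dimension in nm: CD = k1 * lambda / NA."""
    if na <= 0:
        raise ValueError("numerical aperture must be positive")
    return k1 * wavelength_nm / na


cd_euv_033 = rayleigh_cd_nm(0.25, 13.5, 0.33)  # about 10.2 nm
cd_euv_055 = rayleigh_cd_nm(0.25, 13.5, 0.55)  # about 6.1 nm
```

At the k1 limit, raising NA from 0.33 to 0.55 shrinks the minimum CD by the ratio of the two apertures.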
EUV stochastics: the photon shot noise problem
At EUV wavelengths, each exposure uses relatively few photons. The photon count N per voxel follows Poisson statistics with standard deviation equal to the square root of N. At low dose values, this variance produces random variations in where the photoresist is exposed, manifesting as line edge roughness, bridging defects, and broken features that cannot be predicted from scanner telemetry alone because they arise from quantum variance. What can be predicted is the probability of failure: on wafers with low dose settings, bridging defects occur at rates that follow the EUV shot noise model.
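The scaling is easy to see numerically. Because the standard deviation of a Poisson count is sqrt(N), the relative dose noise per voxel is 1 / sqrt(N). The photon counts below are illustrative, not measured values.

```python
import math


def relative_shot_noise(photon_count):
    """Relative standard deviation of a Poisson count: sqrt(N) / N."""
    if photon_count <= 0:
        raise ValueError("photon count must be positive")
    return 1.0 / math.sqrt(photon_count)


noise_low_dose = relative_shot_noise(100)      # 10% dose noise
noise_high_dose = relative_shot_noise(10_000)  # 1% dose noise
```

Cutting the photon count by a factor of 100 raises the relative noise by a factor of 10, which is why low-dose settings carry a higher stochastic defect probability.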
Overlay and the spatial frequency domain
Overlay error is not random noise. It has spatial structure: wafer-scale distortions from thermal expansion, field-scale distortions from lens heating, and shot-scale distortions from stage vibration. Treating raw X and Y overlay values as independent scalars discards this structure. Spatial models must evaluate overlay signatures as polynomials mapped to the reticle field coordinates. High-order aberrations including coma and astigmatism create overlay signatures that vary quadratically across the exposure slit. Averaging overlay across a field erases the signature.
Plasma etch: the ion-chemistry balance
Plasma etching removes material through three mechanisms operating simultaneously: physical sputtering from ion bombardment (directional but poor selectivity), chemical etching from reactive neutrals (isotropic but highly selective), and ion-assisted chemical etching where ion bombardment activates surface sites for chemical reaction at rates 10 to 100 times faster than thermal chemistry alone. The ratio of ion flux to neutral radical flux controls where the process operates. RF power controls ion energy and flux. Pressure controls mean free path: low pressure increases ion directionality but reduces radical density. Gas chemistry determines radical species and selectivity.
Key telemetry: RF_Forward_Power_W, Reflected_Power_W, Chamber_Pressure_mTorr, Gas_Flow_sccm, OES_Endpoint_Intensity, DC_Bias_V, He_Backpressure_Torr
Aspect ratio dependent etching
Etch rate decreases as feature depth increases, a phenomenon called aspect ratio dependent etching (ARDE) or RIE lag. At an aspect ratio of 10:1 (feature depth ten times its width), the etch rate at the bottom of the feature is typically 30 to 50% lower than the etch rate on an open flat surface. A virtual metrology (VM) model trained on shallow-feature etch data will overpredict etch depth for deep features if aspect ratio is not included as a covariate, because it has only seen the faster shallow-feature rate. The target aspect ratio is typically specified in the recipe parameters and must be extracted and included as a feature.
OES endpoint detection
Optical emission spectroscopy monitors the light emitted by the plasma. Specific atomic transitions produce spectral lines at known wavelengths: silicon at 288 nm, silicon fluoride at 440 nm, carbon monoxide at 483 nm from resist erosion. As the etch clears through the target film, the emission spectrum changes: the signal from the target material drops and the signal from the underlying material rises. This transition defines the endpoint.
The detection problem has two failure modes: a false trigger where a transient plasma instability spikes the emission intensity before the etch is complete, and a missed endpoint where the etch area is small relative to the total chamber area so the spectral change is too subtle to detect above background noise. Both failure modes are addressed by adaptive thresholding algorithms that track the rolling baseline of the OES signal and detect relative changes rather than absolute threshold crossings.
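A minimal sketch of relative-change endpoint detection against a rolling baseline follows. The window length, the 15% relative drop, and the requirement of three consecutive confirming samples are all illustrative assumptions; the confirmation count is one simple way to reject single-sample transients, not necessarily what a production algorithm uses.

```python
from collections import deque


def detect_endpoint(signal, window=20, rel_drop=0.15, confirm=3):
    """Return the index where the signal first stays at least rel_drop below
    its rolling-mean baseline for `confirm` consecutive samples, else None."""
    baseline = deque(maxlen=window)
    run = 0
    for i, value in enumerate(signal):
        if len(baseline) == window:
            mean = sum(baseline) / window
            if mean > 0 and (mean - value) / mean >= rel_drop:
                run += 1
                if run >= confirm:
                    return i - confirm + 1
                continue  # keep suspect samples out of the baseline
            run = 0
        baseline.append(value)
    return None


# Synthetic trace: one-sample transient at index 30, true drop from index 41.
trace = [100.0] * 30 + [60.0] + [100.0] * 10 + [70.0] * 10
endpoint_index = detect_endpoint(trace)
```

The single-sample dip is ignored and the sustained drop is reported at its first sample. Because the test is relative to the baseline, the same settings apply whether the absolute intensity is high or low.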
Ion implantation: beam physics
Ion implantation fires a beam of ionized dopant atoms into the silicon crystal lattice at a controlled depth and concentration. The dopant species is generated as a plasma in the ion source chamber. A mass analyzer magnet deflects the ion beam through a known radius; only ions with the correct charge-to-mass ratio exit the analyzer, providing isotopic purity. The mass-analyzed beam is accelerated to the target energy, which determines implant depth, then electrostatically scanned to cover the wafer surface uniformly.
Key telemetry: Beam_Current_mA, Beam_Energy_keV, Dose_cm2, Tilt_Angle_deg, Twist_Angle_deg, Wafer_Temp_C, Beam_Uniformity_pct
Ion range and straggle
When an energetic ion enters a solid, it loses energy through nuclear stopping (elastic collisions with lattice atoms) and electronic stopping (inelastic interactions with the electron cloud). The resulting doping profile is approximately Gaussian centered at the projected range Rp with standard deviation equal to the straggle dRp. For practical reference: a 10 keV boron implant into silicon has Rp approximately 30 nm with straggle approximately 9 nm; a 100 keV boron implant has Rp approximately 300 nm.
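The Gaussian approximation can be written out directly: N(x) = dose / (sqrt(2 * pi) * dRp) * exp(-(x - Rp)^2 / (2 * dRp^2)). The sketch below uses the 10 keV boron values quoted above; the dose of 1e15 ions per cm^2 is an illustrative assumption.

```python
import math

NM_TO_CM = 1e-7


def implant_concentration(depth_nm, dose_cm2, rp_nm, straggle_nm):
    """Dopant concentration (cm^-3) at a given depth, Gaussian approximation."""
    sigma_cm = straggle_nm * NM_TO_CM
    peak = dose_cm2 / (math.sqrt(2.0 * math.pi) * sigma_cm)
    z = (depth_nm - rp_nm) / straggle_nm
    return peak * math.exp(-0.5 * z * z)


# 10 keV boron: Rp about 30 nm, straggle about 9 nm (values from the text).
peak_conc = implant_concentration(30.0, 1e15, rp_nm=30.0, straggle_nm=9.0)
one_sigma_conc = implant_concentration(39.0, 1e15, rp_nm=30.0, straggle_nm=9.0)
```

The concentration one straggle away from the projected range is exp(-0.5), about 61%, of the peak value.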
Channeling, anneal, and dose uniformity
Silicon is crystalline with open channels between atomic planes. An ion beam aligned with a channel travels significantly deeper than theory predicts for amorphous material. Implanters apply a deliberate tilt angle (typically 7 degrees from wafer normal) and twist angle (typically 22 degrees azimuthal) to misalign the beam with all major crystal planes simultaneously. Tilt_Angle_deg and Twist_Angle_deg parameters must be included as features in any implant VM model.
The thermal anneal activates implanted dopants and repairs crystal damage. Anneal telemetry (ramp rate, peak temperature, cooling rate from the RTP tool's pyrometer) is causally connected to the electrical results of the implant step even though it happens in a different tool hours later. Models predicting electrical results from implant data must include the anneal step parameters as features or they will show systematic residuals that correlate with anneal conditions.
The dose integrator compensates for beam current drift by adjusting scan speed. This is mathematically correct for the total dose but not for all downstream outcomes: high beam current generates heat through ion-lattice collisions. If the wafer reaches approximately 150 degrees Celsius during a high-dose implant, dynamic annealing occurs and the damage profile differs from a low-temperature implant at the same nominal dose. A VM model predicting sheet resistance after implant should include Beam_Current_mA as a feature even though the dose feedback system nominally controls it out.
Deposition: CVD, ALD, and PVD
Chemical vapor deposition reacts gas-phase precursors at the wafer surface to produce a solid film. At low temperatures, the reaction rate at the surface limits deposition (reaction-rate-limited regime): deposition rate depends strongly on temperature and gas composition. At high temperatures, deposition rate is limited by how quickly precursor molecules diffuse to the surface (mass-transport-limited regime): rate depends on gas flow, pressure, and reactor geometry rather than temperature. A model spanning the regime boundary with a single linear fit will have systematic residuals near the transition.
Atomic layer deposition alternates between two precursor pulses separated by purge steps. The first precursor adsorbs and saturates the surface; the second reacts with the adsorbed layer to deposit at most one monolayer per cycle. This self-limiting behavior provides thickness control to fractions of a nanometer. It holds only within a temperature window called the ALD window: below the lower bound, the precursor does not react; above the upper bound, it decomposes. The ALD_Pulse_Duration_ms and ALD_Purge_Duration_ms parameters are the most informative telemetry for detecting ALD window violations. Insufficient pulse duration (under-dosing) means the surface is not saturated and growth-per-cycle falls below nominal. Insufficient purge duration leaves residual precursor that reacts with the second pulse in a non-self-limiting way.
Physical vapor deposition ejects atoms from a solid target by plasma sputtering. The target erodes non-uniformly: thinnest at regions of highest ion flux determined by the magnetron geometry. As the target erodes, the sputter rate changes and the impedance drifts in a characteristic pattern. A sudden step change in DC impedance deviating from the expected gradual drift indicates target perforation: the plasma has etched through the target to the backing plate and is depositing backing plate metal into the film stack.
CMP: the Preston equation and failure modes
CMP removal rate is described by the Preston equation:
RR = Kp * P * v
where RR is the removal rate, Kp is the Preston coefficient (materials and chemistry constant), P is the applied pressure at the pad-wafer interface, and v is the relative velocity between pad and wafer surfaces. The relationship is linear at moderate pressures and velocities. It breaks down at high pressure (pad deformation changes contact geometry) and at low velocity (slurry starvation and pad adhesion dominate).
Pad glazing occurs when the asperity tips wear smooth and the slurry's abrasive particles no longer make effective contact with the wafer surface. The macroscopic removal rate decreases, but more importantly, the contact becomes spatially non-uniform. The pad conditioner (a diamond-tipped disk that periodically roughens the pad surface) is the primary control for managing glazing. Pad_Conditioner_Force_N and the conditioning time-per-wafer are the key inputs for predicting pad wear state.
CMP endpoint is detected either by in-situ optical interferometry (laser through a window in the platen measuring film thickness via interference) or by motor current (friction) monitoring. In copper damascene CMP, two secondary effects produce feature-scale non-uniformity: dishing, where the center of wide copper lines polishes faster than the edges (soft copper relative to surrounding dielectric), and erosion, where dense arrays polish faster than isolated lines due to different effective hardness. Both increase with over-polish time. A VM model predicting post-CMP thickness must include the accumulated endpoint signal integrated over the polish cycle, not just the nominal target thickness.
A memory manufacturer ran a CMP module for 11 days between pad changes without a process alarm. Wafer bow measurements before the CMP step had been drifting upward for the entire period. The module team's SPC charts monitored within-wafer thickness uniformity only - not bow. The bow drift indicated that film stress was accumulating, which changed the pad-wafer contact geometry and caused systematic center-to-edge removal rate non-uniformity.
Including incoming wafer bow as a feature in the CMP VM model would have detected the drift 4 days earlier and prevented 3 days of yield loss.
Device Physics Primer
Process parameters determine geometry. Geometry determines electrical behavior. A 2 nm variation in etch depth does not matter geometrically at the scale of a transistor; it matters because it shifts threshold voltage (Vth) by 50 mV, which increases subthreshold leakage by an order of magnitude, which bins the chip from the high-performance tier to the low-performance tier, which changes its selling price by $30 per die. Understanding this chain is what connects your etch VM model to a dollar value.
The Transistor Architecture Evolution
Classical MOSFETs placed the gate electrode on a flat silicon surface. At nodes above 22 nm, this geometry provided sufficient electrostatic control. As gate lengths shrank toward 22 nm, the gate progressively lost control of the channel and short-channel effects caused Vth to become length-dependent. At 22 nm and below, the industry moved to FinFETs: the channel is a vertical silicon fin wrapped on three sides by the gate. This three-sided gate restores the electrostatic control that was lost in planar geometry.
At the 3 nm node and below, FinFETs hit their scaling limit. Gate-All-Around (GAA) architecture replaces vertical fins with horizontal silicon nanosheets stacked vertically, with the gate material completely surrounding each sheet on all four sides. The channel region is grown as alternating layers of silicon and silicon-germanium (SiGe); a selective isotropic etch removes the SiGe without etching the silicon, leaving suspended silicon nanosheets surrounded by gate material.
Threshold voltage and drive current tradeoffs
Vth is the voltage required to turn the transistor on. It depends on gate oxide thickness, channel doping, and gate work function. Variation in Vth is the primary driver of bin splits: high Vth means slow switching (bins to low frequency); low Vth means fast switching but high leakage (bins to high power tier). In modern high-k/metal-gate processes, the effective work function is tuned by depositing ultra-thin metal layers using ALD. The thickness of these layers, controlled to fractions of a nanometer, determines Vth. VM models predicting speed bins must include ALD pulse telemetry as inputs.
The fundamental tradeoff: increasing speed by lowering Vth causes an exponential increase in leakage. Fabs deliberately manufacture "skew" lots during process development - running intentional variations in gate length or oxide thickness - to construct the empirical Ion/Ioff tradeoff curve. This curve defines the Pareto frontier of what is achievable at a given technology node.
Subthreshold leakage and exponential sensitivity
Even when Vgs = 0, a small current flows between source and drain due to carrier diffusion in weak inversion. This subthreshold current scales exponentially with threshold voltage. A 1 nm variation in gate oxide thickness shifts Vth by approximately 30 mV. Because the subthreshold current is exponential in Vth, the effect on leakage is multiplicative rather than additive: at a subthreshold swing of 60 to 100 mV per decade, a 30 mV shift changes leakage by roughly 2x to 3x, and larger Vth excursions compound into order-of-magnitude leakage variation across dies.
Gate oxide tunneling and junction leakage
At 3 nm nodes, the gate dielectric is only a few nanometers thick. Electrons tunnel through it quantum mechanically. Tunneling current density scales exponentially with the electric field across the oxide. ALD telemetry, specifically the pulse decay time constants that track film thickness per cycle, is the process signal most directly connected to EOT (equivalent oxide thickness) variation and therefore to gate tunneling leakage.
Reverse-biased p-n junctions leak through trap-assisted tunneling at crystal defects near the junction. This leakage is sensitive to junction abruptness: a 2 nm smearing of the doping profile from excessive thermal budget during anneal increases junction leakage by approximately 10x. Anneal ramp rate and peak temperature telemetry from the RTP tool must be included in junction leakage prediction models.
The yield-performance tradeoff
No single process condition simultaneously optimizes yield, drive current, leakage, and speed. Aggressive scaling improves performance but reduces yield through tighter tolerance stacks. Conservative conditions improve yield but produce slow chips that bin into low-margin categories. The data science role is building Pareto frontiers that map process parameter distributions to the joint distribution of yield and electrical performance bins. Optimizing yield in isolation moves the process toward conditions that maximize the fraction of working chips but may systematically shift performance distributions toward slower bins.
Electrical test parameters are not independent. A process shift that increases Vth simultaneously reduces Ion and reduces Ioff. A univariate SPC chart on Ion will flag this as degradation. A multivariate chart on the joint (Ion, Ioff) distribution will correctly classify it as a Vth shift - which has a different root cause and different corrective action than a process that reduces Ion while holding Ioff constant.
Spatial Statistics and Yield Geometry
Optical defect inspection tools report the (x, y) coordinates of every candidate defect on the wafer, along with a size estimate and a classification from the tool's built-in ADC classifier. The total count of defects per wafer is only weakly predictive of yield. The spatial distribution of those defects is highly predictive: a 100-defect scratch that kills 3 dice has a completely different yield impact than 100 uniformly scattered particles that kill 100 dice.
Defect spatial patterns and their root causes
The negative binomial yield model
Yield is the fraction of dice on a processed wafer that pass electrical test. The simplest model assumes killer defects land at random across the wafer. Under that assumption, the defect count follows a Poisson distribution and yield is Y = exp(-D0 * A), where D0 is the defect density (defects per unit area) and A is the die area. This Poisson yield model assumes spatial independence. In physical fabs, defects cluster: a scratch concentrates 50 defects into a 2 mm track, leaving the rest of the wafer clean. The Poisson model systematically underestimates the yield of large dice in clustered defect environments.
To account for clustering, defect density is modeled as a gamma-distributed random variable across the wafer, leading to the negative binomial distribution for defect counts and the negative binomial yield model:
Y = (1 + D0 * A / alpha)^(-alpha)
where alpha is the cluster parameter. As alpha approaches infinity, clustering disappears and the formula converges to the Poisson model. For typical logic fabs, alpha is between 1.0 and 3.0. The negative binomial model correctly predicts that die shrinks improve yield non-linearly in clustered defect environments.
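The two models can be compared numerically using the Poisson form Y = exp(-D0 * A) and the negative binomial form Y = (1 + D0 * A / alpha)^(-alpha). The defect density and die area below are illustrative values.

```python
import math


def poisson_yield(d0, area):
    """Poisson yield model: Y = exp(-D0 * A)."""
    return math.exp(-d0 * area)


def negative_binomial_yield(d0, area, alpha):
    """Negative binomial yield model: Y = (1 + D0 * A / alpha) ** (-alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (1.0 + d0 * area / alpha) ** (-alpha)


d0, area = 0.5, 1.0  # defects per cm^2, die area in cm^2 (illustrative)
y_poisson = poisson_yield(d0, area)                       # about 0.607
y_clustered = negative_binomial_yield(d0, area, 2.0)      # 0.64
y_unclustered = negative_binomial_yield(d0, area, 1e6)    # close to Poisson
```

With alpha = 2 the clustered model predicts a higher yield than Poisson for the same D0 and A, and with a very large alpha the two agree, matching the limiting behavior described above.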
Coordinate system mismatches
Defect inspection tools, electrical testers, and lithography scanners each report wafer coordinates in different systems with different units and different origin definitions. The notch (a V-shaped indent at the wafer edge) is the physical anchor. Every tool's pre-aligner orients the notch to a standard position before the wafer enters the process chamber. But KLA inspection tools align the notch to the bottom (6 o'clock) while electrical testers align it to the top (12 o'clock), inverting the Y-axis. A defect at Y = +80 mm in the KLA system appears at approximately die row -67 in the tester coordinate system.
The transformation must account for: the Y-axis inversion between KLA and tester systems, the different origin definitions (KLA uses wafer center; testers use bottom-left die), and the scaling from millimeters to die indices using the die pitch. The parameters (scaling, rotation, translation) are tool-specific and must be determined empirically by matching known reference patterns, then stored per tool model in the ETL pipeline.
Spatial autocorrelation
Standard clustering algorithms like K-means work poorly on wafer maps because they assume clusters are spherical and require discrete group assignment. Defect clusters are often irregular (curved scratches, ring patterns) and exist on a background of unclustered random defects. The K-function quantifies the degree of spatial clustering across different length scales without parametric assumptions:
L(r) = sqrt(K(r) / pi) - r
where K(r) is Ripley's K-function evaluated at radius r. L(r) equals zero under complete spatial randomness, is positive when defects cluster at scale r, and is negative when defects are more regularly spaced than random. Plotting L(r) against r with simulation envelopes shows at which spatial scales clustering is statistically significant. The standard workflow: use the K-function to confirm clustering before running DBSCAN. If L(r) is within the random envelope at all scales, DBSCAN will find no meaningful clusters regardless of parameter settings.
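A naive estimate of this statistic, using L(r) = sqrt(K(r) / pi) - r, is sketched below. It deliberately omits edge correction and simulation envelopes, both of which a real analysis needs, and it treats the study region as a plain area value rather than a circular wafer. The point sets are synthetic.

```python
import math


def ripley_l(points, r, area):
    """Naive L(r) = sqrt(K(r) / pi) - r with no edge correction.

    K(r) is estimated as area * (ordered pairs within r) / (n * (n - 1)).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    close_pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= r:
                close_pairs += 1
    k = area * (2 * close_pairs) / (n * (n - 1))
    return math.sqrt(k / math.pi) - r


area = 100.0 * 100.0
scratch = [(50.0 + 0.1 * i, 50.0) for i in range(10)]            # tight cluster
grid = [(5.0 + 10.0 * c, 5.0 + 10.0 * r) for c in range(10) for r in range(10)]

l_scratch = ripley_l(scratch, r=5.0, area=area)  # strongly positive
l_grid = ripley_l(grid, r=5.0, area=area)        # negative: regular spacing
```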
Moran's I for reticle repeat detection
Reticle defects produce a spatially periodic pattern with low local density but strong spatial autocorrelation at the reticle pitch. Moran's I measures global spatial autocorrelation using a spatial weight matrix built at the reticle field pitch. A significant positive Moran's I computed at this scale is the quantitative signature of a reticle repeat defect - even when total defect counts are within SPC limits.
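A minimal sketch of Moran's I on a die grid follows, using the standard form I = (n / W) * sum_ij w_ij * (x_i - mean) * (x_j - mean) / sum_i (x_i - mean)^2. The weight matrix here is an assumption chosen for illustration: w_ij = 1 when two dice are offset by exactly one reticle field pitch in X or Y, and 0 otherwise. The 8 by 8 die grid and the 4 by 4 field size are synthetic, and no significance test is included.

```python
def morans_i(values, pitch_cols, pitch_rows):
    """Moran's I for {(col, row): value}, neighbors at one field pitch."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {key: v - mean for key, v in values.items()}
    denom = sum(d * d for d in dev.values())
    if denom == 0:
        raise ValueError("all values are identical")
    offsets = [(pitch_cols, 0), (-pitch_cols, 0), (0, pitch_rows), (0, -pitch_rows)]
    num = 0.0
    weight_sum = 0
    for (col, row), d in dev.items():
        for dc, dr in offsets:
            neighbor = (col + dc, row + dr)
            if neighbor in dev:
                num += d * dev[neighbor]
                weight_sum += 1
    if weight_sum == 0:
        raise ValueError("no neighbor pairs at this pitch")
    return (n / weight_sum) * num / denom


# One defective die at the same in-field position of every 4 x 4 field.
repeat_map = {
    (c, r): 1 if (c % 4 == 1 and r % 4 == 2) else 0
    for c in range(8)
    for r in range(8)
}
i_reticle = morans_i(repeat_map, pitch_cols=4, pitch_rows=4)
```

Only 4 of the 64 dice carry a defect, so the count is low, but every die matches its neighbor one field away and the statistic reaches its maximum for this weight matrix.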
A memory fab ran a 2% systematic yield loss for six months. Total defect counts were within SPC limits. A Moran's I calculation on the defect grid, using a weight matrix at the reticle field pitch for the patterning layer in question, obtained I = 0.72 - well above the 95% simulation envelope of 0.08. The defect was on the reticle, printing at the same position in every exposure shot across every wafer.
The reticle was replaced. Yield recovered to 97.5% within one lot. Six months of yield loss had not been detected by any univariate SPC chart because the absolute defect count per wafer was never high enough to trigger an alarm. The spatial structure was the signal.
Survival Analysis and Reliability Statistics
Predicting when a machine component will fail or a device will wear out in the field is a time-to-event problem. Standard regression methods fail for time-to-event data because they cannot handle right-censoring: components that are still functioning at the end of the observation period have not failed, but their lifetime data is not missing - it is known to exceed the observation period. Discarding censored observations (the instinct of most bootcamp-trained data scientists) introduces severe selection bias toward early failures.
The Weibull distribution
The Weibull distribution models time-to-failure of a component. Its shape parameter beta determines which part of the "bathtub curve" governs the failure mode:
- beta less than 1: Decreasing hazard rate over time. Characteristic of "infant mortality" where early failures are driven by manufacturing defects that escaped screening. Corrective action: improve incoming inspection or burn-in testing.
- beta equals 1: Constant hazard rate (exponential distribution). Characteristic of random, memoryless failure events. Corrective action: accept as random and manage through redundancy.
- beta greater than 1: Increasing hazard rate over time. Characteristic of wear-out mechanisms (metal fatigue, dielectric breakdown, abrasive wear). Corrective action: schedule preventive replacement before the wear-out region is reached.
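The three regimes follow directly from the Weibull hazard function h(t) = (beta / eta) * (t / eta)^(beta - 1). The scale parameter eta is not named in the text above and is introduced here as part of the standard two-parameter form; the numeric values are illustrative.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta, and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


eta = 1000.0  # characteristic life in hours (illustrative)
infant = [weibull_hazard(t, 0.5, eta) for t in (10.0, 100.0)]   # decreasing
random_ = [weibull_hazard(t, 1.0, eta) for t in (10.0, 100.0)]  # constant
wearout = [weibull_hazard(t, 3.0, eta) for t in (10.0, 100.0)]  # increasing
```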
Kaplan-Meier estimator
The Kaplan-Meier estimator is a non-parametric statistic that estimates the survival function S(t) from censored lifetime data without assuming a parametric form. It computes the probability of surviving past time t by evaluating the product of conditional survival probabilities at each observed failure event. For predicting equipment failure (for example, CMP pad life), computing KM curves stratified by consumable lot identifies variations in consumable quality before they manifest as outright process faults.
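The product of conditional survival probabilities can be sketched in a few lines. This is a bare estimator without confidence intervals or stratification, and the pad lifetimes are made-up values; an observed flag of False marks a right-censored unit that was still working when observation stopped.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct failure time.

    observed[i] is True for a failure and False for a right-censored unit.
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        leaving = 0
        while i < len(events) and events[i][0] == t:
            failures += 1 if events[i][1] else 0
            leaving += 1
            i += 1
        if failures:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored units leave the risk set without a step
    return curve


# Pad lifetimes in days; two pads were still in service when last observed.
durations = [5, 8, 8, 12, 15]
observed = [True, False, True, True, False]
km_curve = kaplan_meier(durations, observed)
```

The censored pads never produce a step in the curve, but they stay in the denominator until they leave the risk set, which is exactly the information that is lost when censored rows are discarded.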
Cox proportional hazards model
The Cox model assesses the impact of multiple covariates on survival without specifying a parametric baseline hazard. For equipment health modeling, covariates include chamber age, cycle count, process gas exposure hours, and PM history. The Cox model provides hazard ratios that quantify how each covariate changes the instantaneous risk of failure relative to a reference. A hazard ratio of 2.3 for "HF exposure hours" means that each one-unit increase in that covariate, in whatever units it is scaled for the model, multiplies the instantaneous failure rate by 2.3, holding all other covariates constant. The reading "twice the exposure, 2.3 times the hazard" applies only if the covariate enters the model as log2 of exposure hours.
Physics-Informed Machine Learning
A pure data-driven model trained on CMP removal rate data will fit the training distribution well. It will fail when the polishing pad changes, when the slurry lot changes, or when the incoming wafer film thickness moves outside the training range. It fails because it learned the mapping from sensor readings to outcomes without encoding the physical law that governs the mapping. A physics-informed model encodes the law and learns only the deviation from it.
The residual modeling pattern
The most common structure in fab ML: the physics model captures the main effect using first-principles equations; the ML model learns the systematic deviation between the physics prediction and the observed data. The ML model is trained on residuals from the physics model, not on the raw target variable.
For CMP, the physics term is the Preston equation (RR = Kp * P * v). This explains roughly 60% of removal rate variance under normal operating conditions. A gradient boosted tree trained on the residuals captures slurry chemistry effects, pad conditioning state, wafer bow contributions, and process history that the Preston equation does not model. At operating conditions the model has not seen, the physics term provides a physically consistent baseline. Tree ensembles do not shrink toward zero on their own outside the training range (they hold the nearest leaf value), so graceful degradation toward the physics model requires the residual term to be regularized or gated so that its contribution falls toward zero when it cannot extrapolate confidently.
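The structure of the residual pattern is sketched below with the Preston term RR = Kp * P * v as the physics model. To keep the example dependency-free, a one-feature least-squares line stands in for the gradient boosted tree; that substitution, the Preston coefficient, the pad_hours feature, and the run data are all illustrative assumptions.

```python
KP = 2.0  # hypothetical Preston coefficient, arbitrary units


def preston_rr(pressure, velocity, kp=KP):
    """Physics baseline: RR = Kp * P * v."""
    return kp * pressure * velocity


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (stand-in residual model)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b * mean_x, b


# Synthetic runs: (pressure, velocity, pad_hours, observed removal rate).
runs = [
    (3.0, 1.0, 0.0, 6.0),
    (3.0, 1.5, 10.0, 8.5),
    (4.0, 1.0, 20.0, 7.0),
    (4.0, 1.5, 30.0, 10.5),
]

# Train the residual model on (observed - physics), not on the raw target.
pad_hours = [run[2] for run in runs]
residuals = [run[3] - preston_rr(run[0], run[1]) for run in runs]
intercept, slope = fit_line(pad_hours, residuals)


def predict_rr(pressure, velocity, hours):
    """Physics prediction plus learned residual correction."""
    return preston_rr(pressure, velocity) + intercept + slope * hours


prediction = predict_rr(5.0, 1.0, 40.0)
```

Replacing fit_line with a boosted-tree regressor keeps the same three steps: compute the physics prediction, fit the learner to the residuals, and add the two at inference time.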
Feature engineering from physical equations
A second form of physics-informed ML uses physical equations to construct features. Etch uniformity models include pattern density features derived from layout geometry, motivated by the microloading effect: etch rate in isolated features differs from etch rate in dense arrays because neutral radical density is depleted in high-pattern-density regions. The microloading correction factor is a function of the local pattern density within the mean free path distance, which is a derived physical feature, not a raw sensor reading.
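As a rough illustration of a derived pattern density feature, the sketch below computes the fraction of patterned cells in a square window around a site. The binary layout grid, the function name, and the use of a fixed window radius as a proxy for the interaction distance are all assumptions for the example.

```python
def local_pattern_density(layout, row, col, radius):
    """Fraction of patterned cells (value 1) in a square window
    of +/- radius cells around (row, col), clipped at the grid edge."""
    rows, cols = len(layout), len(layout[0])
    r0, r1 = max(0, row - radius), min(rows, row + radius + 1)
    c0, c1 = max(0, col - radius), min(cols, col + radius + 1)
    window = [layout[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(window) / len(window)

# Hypothetical rasterized layout: dense block top-left, isolated feature bottom-right.
layout = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

dense_site = local_pattern_density(layout, 0, 0, radius=1)
isolated_site = local_pattern_density(layout, 3, 3, radius=1)
```

The resulting density value, not the raw layout, is what enters the etch uniformity model as a feature.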
Simulator calibration
TCAD simulators model device physics from first principles. Their parameters (material coefficients, reaction rate constants) are not precisely known and must be fitted to experimental data. A Gaussian Process surrogate trained on simulator outputs at sampled input points provides both a prediction and a confidence interval. A calibrated simulator is accurate near calibration data and increasingly uncertain as conditions move away from calibration points. The GP surrogate quantifies this uncertainty explicitly: a wide confidence interval means the simulator has not been calibrated in that region.
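The behavior of the surrogate's uncertainty can be sketched with a tiny one-dimensional Gaussian Process written in plain Python. This is an illustration under assumptions, not a calibration workflow: the RBF kernel, the zero-mean prior, the length scale, and the three sampled "simulator outputs" are all hypothetical.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_predict(x_train, y_train, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_new, xi, length_scale) for xi in x_train]
    alpha = solve(K, y_train)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(x_new, x_new, length_scale) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Hypothetical sampled simulator inputs and outputs.
x_train = [1.0, 2.0, 3.0]
y_train = [0.8, 0.9, 0.5]

mean_near, std_near = gp_predict(x_train, y_train, 2.0)  # at a sampled point
mean_far, std_far = gp_predict(x_train, y_train, 8.0)    # far from all samples
```

At a sampled point the standard deviation collapses toward the noise level; far from the samples it returns to the prior value of about 1, which is the widening interval the text describes.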
Physical constraints as regularization
Data-driven models trained without physical constraints can produce predictions that violate conservation laws or physical monotonicity relationships. Monotonicity constraints in gradient boosted tree libraries (XGBoost, LightGBM) force the prediction to be monotonically non-decreasing or non-increasing with respect to specified features. For etch models, etch rate must be monotonically increasing with RF power (at fixed pressure) and monotonically increasing with pressure (at fixed power) in the ion-flux-limited regime. Encoding this as a constraint prevents the model from learning spurious inversions in sparse data regions.
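Both libraries take the constraint as a per-feature direction vector through a `monotone_constraints` parameter (+1 increasing, -1 decreasing, 0 unconstrained). The sketch below only builds the parameter dictionaries; the feature names and the helper function are hypothetical, and no model is trained here.

```python
# Hypothetical feature order for an etch rate model.
FEATURES = ["rf_power_w", "pressure_mtorr", "gas_flow_sccm"]

# Physical prior from the text: etch rate rises with power and with pressure.
DIRECTIONS = {"rf_power_w": 1, "pressure_mtorr": 1}

def monotone_vector(features, directions):
    """One entry per feature, in column order; 0 means unconstrained."""
    return [directions.get(name, 0) for name in features]

constraint = monotone_vector(FEATURES, DIRECTIONS)

# LightGBM accepts the constraint as a list of ints.
lgbm_params = {
    "objective": "regression",
    "monotone_constraints": constraint,
}

# XGBoost accepts the same information as a string such as "(1,1,0)".
xgb_params = {
    "objective": "reg:squarederror",
    "monotone_constraints": "(" + ",".join(str(c) for c in constraint) + ")",
}
```

Building the vector from named directions rather than typing it by hand keeps the constraint aligned with the column order when features are added or reordered.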
Advanced and Emerging Methods
Federated Learning Across Fab Sites
A foundry customer running the same product at three fab sites wants a shared yield model that draws on data from all three locations. Sharing raw wafer data across sites violates customer NDAs: the process recipes, defect signatures, and tool fingerprints in that data represent billions of dollars in IP. Federated learning allows each site to train a local model on its own data and share only the model weights, not the underlying data.
In federated averaging (FedAvg), each site trains a local model for E epochs, then uploads only its updated model weights to a central aggregation server. The server computes a weighted average of the incoming models (weighted by each site's sample count) and distributes the averaged model back to all sites. The primary technical challenge is non-IID data: Site A may run process recipes that Site B does not use, producing fundamentally different data distributions. When data distributions diverge, standard FedAvg converges slowly or settles on a compromise that fits no site well, rather than approaching any site's optimal model.
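The server-side aggregation step is small enough to sketch directly. Weights are shown as flat lists of floats for simplicity; the site weights and sample counts are made-up values.

```python
def fed_avg(site_weights, site_counts):
    """Sample-count-weighted average of per-site weight vectors."""
    total = sum(site_counts)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_counts)) / total
        for i in range(n_params)
    ]

# Hypothetical local models after one round of training at three sites.
site_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
site_counts = [100, 300, 600]  # wafers (samples) per site

global_weights = fed_avg(site_weights, site_counts)
```

The site with 600 samples dominates the average, which is exactly where non-IID data hurts: a small site running a different recipe barely moves the global model.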
Graph Neural Networks for Layout-Dependent Effects
Standard tabular ML treats each transistor as an independent sample. This is physically wrong at advanced nodes: the electrical behavior of a transistor depends not just on its own process parameters but on the geometry and processing of its neighbors within 1 to 2 microns. These layout-dependent effects (LDE) - stress from adjacent structures, well proximity effects, pattern density effects - are not captured by any feature derived from the transistor's own parameters alone.
The layout is represented as a graph where each node is a device and each edge encodes a spatial relationship. Graph construction involves three parts: proximity edges (devices within a threshold radius, typically 0.5 to 2 microns), edge features (distance, intervening material type), and node features (device-level process parameters plus post-process metrology). A graph neural network updates each node's representation by aggregating information from its neighbors through a message-passing framework. After L layers, each node encodes information from its L-hop neighborhood. Two layers are typically sufficient for LDE mechanisms.
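The L-hop behavior can be seen in a stripped-down sketch with scalar node features and mean aggregation. A real GNN layer would use learned weight matrices, edge features, and a nonlinearity; those are omitted here, and the device coordinates and the fixed self-weight are illustrative assumptions.

```python
import math

def build_edges(positions, radius_um):
    """Proximity edges: neighbor lists for devices within radius_um."""
    neighbors = {i: [] for i in range(len(positions))}
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i < j and math.hypot(xi - xj, yi - yj) <= radius_um:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def message_pass(features, neighbors, self_weight=0.5):
    """One layer: blend each node's feature with the mean of its neighbors."""
    updated = []
    for i, h in enumerate(features):
        nbrs = neighbors[i]
        if nbrs:
            agg = sum(features[j] for j in nbrs) / len(nbrs)
            updated.append(self_weight * h + (1 - self_weight) * agg)
        else:
            updated.append(h)  # isolated device keeps its own feature
    return updated

# Three devices in a row 1 um apart, plus one isolated device (coordinates in um).
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
neighbors = build_edges(positions, radius_um=1.5)

h0 = [1.0, 0.0, 0.0, 0.0]          # only device 0 carries a signal
h1 = message_pass(h0, neighbors)   # signal reaches device 1 (1 hop)
h2 = message_pass(h1, neighbors)   # signal reaches device 2 (2 hops)
```

Device 2 is not a direct neighbor of device 0, so it sees nothing after one layer and picks up the signal only after the second.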
Why standard ML fails here
A random forest trained on per-transistor process parameters will show systematic residuals that correlate with local layout density. Dense SRAM arrays will consistently show higher Vth than predicted; isolated devices will show lower Vth. The residuals are not noise: they are the signal that the model cannot capture because it was given only local features. The GNN provides the neighborhood context that makes these residuals explainable and predictable.
Test, Binning, and Reliability Screening
Wafer sort (electrical test) is the fab's ground truth. It is the first time the physical geometry created by 500 process steps is powered up and measured electrically. The output is the die bin map, which determines not only whether a die is sold but how much it sells for. A die that passes all static limits but bins to the low-performance category due to high subthreshold leakage may sell for $40 instead of $120.
The statistical challenge at wafer sort is distinguishing normal process variation from latent reliability defects. A die with a leakage current of 15 nA might be a healthy fast-corner device, or it might contain a crystal defect that will cause it to short-circuit after 1,000 hours of operation in the field. Both dies pass the static datasheet limit. One is a warranty return; one is not. Part Average Testing is the statistical screen that separates them.
Part Average Testing
PAT identifies dice that pass all static datasheet limits but are statistical outliers relative to their own lot population. The physical assumption is that a die standing multiple standard deviations away from its lot neighbors contains a localized manufacturing defect that makes it untrustworthy for long-term reliability, even if its measured values are within the absolute specification.
Static PAT limits are set during product qualification based on historical distributions and remain fixed for the product lifetime. They protect against lots that have shifted entirely off-center. The static limit is set at k standard deviations from the historical mean, where k is chosen to achieve a target outlier removal rate (the AEC-Q001 guideline for Part Average Testing uses k = 6).
Dynamic PAT limits are calculated per lot using robust statistics (median and interquartile range) to avoid distortion by the outliers they are trying to detect. A die failing Dynamic PAT is scrapped even if it passes the customer specification.
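A minimal sketch of per-lot dynamic limits, assuming the common convention of median as the robust center and IQR / 1.35 as the robust sigma (the IQR of a normal distribution is about 1.35 standard deviations). The leakage values and the choice of k = 6 are illustrative.

```python
from statistics import median, quantiles

def dynamic_pat_limits(values, k=6.0):
    """Per-lot limits: median +/- k * robust sigma, with sigma = IQR / 1.35."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    robust_sigma = (q3 - q1) / 1.35
    center = median(values)
    return center - k * robust_sigma, center + k * robust_sigma

# Hypothetical leakage readings (nA) for one lot; every die passes a 20 nA spec.
lot_leakage_na = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 10.0, 15.0]

lower, upper = dynamic_pat_limits(lot_leakage_na)
pat_outliers = [x for x in lot_leakage_na if not lower <= x <= upper]
```

The 15 nA die is within specification but far outside its own lot's distribution. Because the limits come from the median and IQR, that die does not widen the limits that are used to reject it.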
Multivariate PAT (Mahalanobis distance)
Univariate PAT tests one parameter at a time. It cannot catch a die that sits comfortably inside the lot distribution on drive current and comfortably inside it on leakage, but whose combination of the two (for example, high drive with low leakage) is physically implausible given the known process constraints. Multivariate PAT uses Mahalanobis distance to measure how far a die is from the lot centroid in the joint parameter space, accounting for correlations between parameters.
A die with high Mahalanobis distance is an outlier in the correlation structure of the parameters. Implementing this requires a robust covariance estimator (like Minimum Covariance Determinant) because the standard sample covariance matrix is highly sensitive to the outliers it is trying to detect - a circular dependency that the robust estimator breaks by fitting to the majority of the data rather than the full sample.
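The effect of the correlation structure can be sketched for two parameters. The center and covariance below are assumed to have already been produced by a robust estimator; the values, the 0.9 correlation, and the standardized units are hypothetical.

```python
import math

def mahalanobis_2d(point, center, cov):
    """Mahalanobis distance for two parameters, using the closed-form
    inverse of a 2x2 covariance matrix."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Standardized (drive current, leakage); assumed strongly positively correlated.
center = (0.0, 0.0)
cov = ((1.0, 0.9), (0.9, 1.0))

consistent_die = (1.5, 1.5)     # fast and leaky: follows the correlation
inconsistent_die = (1.5, -1.5)  # fast but unusually low leakage

d_consistent = mahalanobis_2d(consistent_die, center, cov)
d_inconsistent = mahalanobis_2d(inconsistent_die, center, cov)
```

Both dice are only 1.5 sigma from the mean on each parameter, so a univariate screen passes both. The joint distance separates them because the second die contradicts the correlation between drive and leakage.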
Advanced Packaging and Heterogeneous Integration
Moore's Law scaling of transistors in two dimensions is slowing. The industry response is "More than Moore": stacking dice vertically and connecting them with high-density interconnects. Advanced packaging moves the yield challenge from the single-die level to the multi-die module level, introducing combinatorial optimization problems that do not appear in monolithic chip manufacturing.
Through-Silicon Vias (TSVs)
A TSV is a vertical electrical connection passing completely through a silicon wafer or die. Manufacturing TSVs requires deep reactive ion etching (DRIE) to drill the hole, followed by copper electroplating to fill it. The primary yield limiters are incomplete copper fill (leaving voids that increase resistance and fail reliability screening) and copper protrusion after anneal (creating short circuits to adjacent metal layers).
A single die may have 10,000 TSVs. Without redundant vias, if any one TSV fails, the entire stacked assembly fails. The TSV failure rate must be below 1 in 10^7 to achieve acceptable module yield: at that rate, and assuming independent failures, 10,000 TSVs cost about 0.1% of yield per die. Models predicting TSV failure must operate in the extreme tails of the distribution - the domain of extreme value theory and generalized Pareto distributions, not Gaussian statistics.
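The arithmetic behind the failure-rate requirement can be checked directly, assuming independent TSV failures and no redundancy. The function names and the comparison rate of 1 in 10^5 are illustrative.

```python
def tsv_limited_yield(fail_prob, tsv_count):
    """Probability that every TSV on a die works, assuming independence."""
    return (1.0 - fail_prob) ** tsv_count

def max_fail_prob(target_yield, tsv_count):
    """Largest per-TSV failure probability that still meets target_yield."""
    return 1.0 - target_yield ** (1.0 / tsv_count)

yield_at_1e7 = tsv_limited_yield(1e-7, 10_000)  # 1 failure in 10^7
yield_at_1e5 = tsv_limited_yield(1e-5, 10_000)  # 1 failure in 10^5
budget = max_fail_prob(0.999, 10_000)           # rate needed for 99.9% yield
```

A per-TSV rate of 1 in 10^5, which sounds small, already costs roughly one die in ten. Stacking several such dice multiplies the individual yields together.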
Hybrid bonding
Hybrid bonding connects two dice face-to-face by mating copper pads to copper pads and oxide to oxide simultaneously, without solder bumps. This requires atomic-level surface flatness. The CMP process preparing the surfaces is the most critical mechanical step in the assembly process. Data scientists working on hybrid bonding yield combine CMP telemetry with acoustic microscopy images: SAM (scanning acoustic microscopy) detects voids at the bonding interface. The modeling task is predicting the probability of void formation based on incoming wafer bow and the CMP polish profile.
The Known Good Die matching problem
When stacking an expensive GPU die with four expensive HBM (High Bandwidth Memory) dice, the final assembly is only as good as its weakest component. If a fast GPU is stacked with slow memory, the module must be binned as slow, wasting the value of the fast GPU. The Known Good Die matching algorithm optimizes the pairing of dice from different wafers into modules to maximize the total value of the shipped assemblies.
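A toy version of the matching problem shows why pairing order matters. The sketch assumes integer speed grades, a module grade equal to its slowest die, and a made-up price table; sorting both pools before pairing is a simple heuristic for this value function, not a description of any production matching algorithm.

```python
# Hypothetical selling price per module speed grade (3 = fast, 1 = slow).
MODULE_PRICE = {3: 120, 2: 80, 1: 40}

def build_modules(gpu_grades, hbm_grades, hbm_per_module=4):
    """Pair dice in the order given; a module's grade is its slowest die."""
    n_modules = min(len(gpu_grades), len(hbm_grades) // hbm_per_module)
    grades = []
    for m in range(n_modules):
        stack = hbm_grades[m * hbm_per_module:(m + 1) * hbm_per_module]
        grades.append(min([gpu_grades[m]] + list(stack)))
    return grades

def total_value(module_grades):
    return sum(MODULE_PRICE[g] for g in module_grades)

gpus = [3, 1]                    # one fast GPU, one slow GPU
hbms = [3, 1, 3, 1, 3, 1, 3, 1]  # fast and slow memory interleaved

# Baseline: stack dice in arrival order.
arrival_value = total_value(build_modules(gpus, hbms))

# Heuristic: sort both pools so fast dice are stacked with fast dice.
matched_value = total_value(
    build_modules(sorted(gpus, reverse=True), sorted(hbms, reverse=True))
)
```

In arrival order every module contains a slow die and bins as slow. After sorting, the fast GPU gets four fast memory dice and the same inventory is worth twice as much under this price table.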