YieldOpsAcademy
Interview Prep Manual
Part 3 requires full access.
Unlock $149
Part 3

The ML Arsenal

What works in production on 2012 hardware with 50ms latency budgets and skeptical process engineers. Not what is state-of-the-art in research papers.

Chapter 3.1

Virtual Metrology: Predicting Y Before Y Arrives

Virtual Metrology is the signature ML application in semiconductor manufacturing. You predict metrology results (film thickness, critical dimension, defect count) from process sensor data before the physical measurement completes. Metrology takes 4 to 8 hours per wafer; VM predicts in milliseconds. Ground truth arrives 60 to 90 days later. Wrong predictions accumulate silently.

The fundamental rule
VM without OOD detection is more dangerous than no VM. No model routes all wafers to physical metrology, which is slow and expensive but always accurate. VM without OOD routes some wafers to a model that is fast, cheap, and sometimes catastrophically wrong.

The four mandatory components

VM architecture: every component is mandatory
Process Tool (Level 1)
    Sensors: RF power, pressure, OES, gas flow
         |
         | raw 100 Hz trace data
         v
Feature Extraction (Level 2)
    Domain-driven, physically interpretable
    Every feature has a one-sentence physical explanation
         |
         | feature vector
         v
Model Core (XGBoost + SHAP)            Reliability Index (Autoencoder)
    Predicts: thickness, CD, defect     Measures: OOD distance
    Interprets: SHAP per prediction     Trained on commissioning data only
         |                                       |
         +---------------+-----------------------+
                         |
                         v
                  Decision Router
                  Low RI  -> accept VM prediction
                  High RI -> abstain, route to physical metrology
                  Sensor alarm -> suspend model immediately

Component 1: Feature extraction

Every feature must have a one-sentence physical interpretation. If you cannot explain what a feature measures in terms of plasma physics, tool mechanics, or process chemistry, it should not be in the model.

FeaturePhysical InterpretationComputation
rf_power_meanAverage energy delivered to plasma, controlling ion bombardment flux and etch ratemean(RF_Forward_Power) over main etch step
chamber_pressure_slopeRate of pressure change, indicating gas consumption rate or potential leaknp.polyfit(time, pressure, 1)[0]
oes_520nm_max_derivativeMaximum rate of change in silicon emission, early endpoint signalmax(np.diff(OES_520nm))
match_cap_positionTuning capacitor position at step end, leading indicator of wear and arc riskmain_etch["match_capacitor_a"].iloc[-1]
Feature extraction implementation
def extract_vm_features(fdc_trace: pd.DataFrame,
                        mes_context: dict) -> pd.Series:
    """Extract VM features. Every feature has physical interpretation."""
    step = fdc_trace[
        (fdc_trace['timestamp'] >= mes_context['step_start']) &
        (fdc_trace['timestamp'] <= mes_context['step_end'])
    ]
    t = (step['timestamp'] - step['timestamp'].iloc[0]).dt.total_seconds()

    return pd.Series({
        'rf_power_mean':       step['rf_forward_power'].mean(),
        'rf_power_std':        step['rf_forward_power'].std(),
        'pressure_slope':      np.polyfit(t, step['chamber_pressure'], 1)[0],
        'oes_520nm_max_deriv': np.diff(step['oes_520nm']).max(),
        'match_cap_position':  step['match_capacitor_a'].iloc[-1],
        'step_duration':       t.iloc[-1],  # Recipe compliance check
    })

Component 2: Model core

Gradient-boosted trees with SHAP. Not neural networks (black box, too slow). Not random forest (slower inference, less consistent importance). GBTs are the production standard for tabular sensor data.

XGBoost + SHAP for interpretable VM prediction
import xgboost as xgb
import shap

model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=4,           # Shallow: faster inference, more interpretable
    learning_rate=0.1,
    subsample=0.8,         # Handles PM-cycle concept drift
    colsample_bytree=0.8,
    objective='reg:squarederror',
    random_state=42
)
model.fit(X_train, y_train)

# SHAP explanation for process engineer (required, not optional)
def explain_prediction(model, X_instance, feature_names):
    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(X_instance)
    top3 = np.argsort(np.abs(sv[0]))[-3:][::-1]
    parts = [f"{feature_names[i]}={X_instance.iloc[0,i]:.2f} "
             f"({'increases' if sv[0][i] > 0 else 'decreases'}) prediction"
             for i in top3]
    return "Driven by: " + "; ".join(parts)

Component 3: Reliability Index (OOD detection)

Autoencoder Reliability Index
from tensorflow.keras import layers, Input, Model

def build_ri_autoencoder(input_dim: int, encoding_dim: int = 8):
    """Reliability Index = reconstruction MSE. High RI = OOD = abstain."""
    x = inp = Input(shape=(input_dim,))
    x = layers.Dense(16, activation='relu')(x)
    x = layers.Dense(encoding_dim, activation='relu')(x)
    x = layers.Dense(16, activation='relu')(x)
    x = layers.Dense(input_dim, activation='linear')(x)
    ae = Model(inp, x)
    ae.compile(optimizer='adam', loss='mse')
    return ae

# Train ONLY on commissioning data (pm_sequence = 0)
ri_model = build_ri_autoencoder(input_dim=X_commissioning.shape[1])
ri_model.fit(X_commissioning, X_commissioning,
             epochs=50, batch_size=32, validation_split=0.1, verbose=0)

def compute_ri(autoencoder, X):
    return np.mean((X - autoencoder.predict(X, verbose=0))**2, axis=1)

# Threshold: commissioning mean + 3 sigma
ri_train = compute_ri(ri_model, X_commissioning)
ri_threshold = ri_train.mean() + 3 * ri_train.std()

Component 4: Cost-calibrated decision router

Calibrate abstention threshold to business cost, not statistics
def calibrate_abstention_threshold(ri_val, y_true, y_pred,
                                   cost_fn=2.5e6, cost_fp=500,
                                   tolerance=2.0):
    """
    cost_fn: false negative ($2.5M per ghost excursion)
    cost_fp: false positive ($500 per unnecessary physical metrology)
    Ratio 5000:1 justifies aggressive abstention.
    """
    thresholds = np.linspace(ri_val.min(), ri_val.max(), 100)
    costs = []
    for thresh in thresholds:
        fn = ((ri_val <= thresh) & (np.abs(y_pred - y_true) > tolerance)).sum()
        fp = ((ri_val > thresh) & (np.abs(y_pred - y_true) <= tolerance)).sum()
        costs.append(fn * cost_fn + fp * cost_fp)
    idx = np.argmin(costs)
    return thresholds[idx], costs[idx]

optimal_thresh, min_cost = calibrate_abstention_threshold(ri_val, y_true, y_pred)
print(f"Optimal threshold: {optimal_thresh:.3f}  Expected annual cost: ${min_cost:,.0f}")
Chapter 3.2

Fault Detection and Classification (FDC)

VM predicts quality. FDC detects when something is wrong right now, stopping bad wafers before processing completes. Use the hierarchy: start with univariate SPC, add multivariate only when individual sensors miss the excursion, add model-based only when both miss it.

1
Univariate SPC
Individual sensors, Shewhart/EWMA/CUSUM. Fast, interpretable, high false positive rate for correlated excursions.
2
Multivariate SPC
Hotelling T-squared on correlated sensors. Catches joint excursions that no individual chart sees.
3
Model-based FDC
Predict one sensor from others; monitor the residual. Catches complex dependencies.
4
Spatial FDC
Defect pattern recognition (Archetype 7). Reticle issues and chuck fingerprints invisible to sensor monitoring.
Univariate FDC: online SPC with chart selection
class UnivariateFDC:
    """Per-sensor SPC. fit() on commissioning, update() per new observation."""

    def __init__(self, chart_type='ewma', **params):
        self.chart_type = chart_type
        self.params = params

    def fit(self, data: pd.Series):
        if self.chart_type == 'shewhart':
            self.center, s = data.mean(), data.std()
            self.ucl, self.lcl = self.center + 3*s, self.center - 3*s
        elif self.chart_type == 'ewma':
            self.lam = self.params.get('lam', 0.2)
            self.center = data.mean()
            s_ewma = data.std() * (self.lam / (2 - self.lam))**0.5
            self.ucl, self.lcl = self.center + 3*s_ewma, self.center - 3*s_ewma
            self.ewma_val = self.center
        elif self.chart_type == 'cusum':
            self.target, self.sigma = data.mean(), data.std()
            self.h = self.params.get('h', 4)
            self.k = self.params.get('k', 0.5)
            self.C_plus = self.C_minus = 0

    def update(self, value: float) -> bool:
        if self.chart_type == 'shewhart':
            return value > self.ucl or value < self.lcl
        elif self.chart_type == 'ewma':
            self.ewma_val = self.lam * value + (1 - self.lam) * self.ewma_val
            return self.ewma_val > self.ucl or self.ewma_val < self.lcl
        elif self.chart_type == 'cusum':
            z = (value - self.target) / self.sigma
            self.C_plus  = max(0, self.C_plus  + z - self.k)
            self.C_minus = max(0, self.C_minus - z - self.k)
            return self.C_plus > self.h or self.C_minus > self.h
Model-based FDC: residual monitoring
class ModelBasedFDC:
    """Predict target sensor from correlated sensors. Alarm on large residuals."""

    def __init__(self, target_sensor: str, predictor_sensors: list):
        import xgboost as xgb
        self.target = target_sensor
        self.predictors = predictor_sensors
        self.model = xgb.XGBRegressor(n_estimators=50, max_depth=3)

    def fit(self, data: pd.DataFrame):
        self.model.fit(data[self.predictors], data[self.target])
        residuals = data[self.target] - self.model.predict(data[self.predictors])
        mu, sigma = residuals.mean(), residuals.std()
        self.ucl, self.lcl = mu + 3*sigma, mu - 3*sigma

    def update(self, obs: pd.Series) -> dict:
        y_pred = self.model.predict(obs[self.predictors].values.reshape(1, -1))[0]
        residual = obs[self.target] - y_pred
        return {'predicted': y_pred, 'actual': obs[self.target],
                'residual': residual,
                'alarm': residual > self.ucl or residual < self.lcl}
Chapter 3.3

The Model Selection Decision Tree

For production-depth treatment of each algorithm including tuning parameters, hardware traps, and defensive patterns, see Field Manual Layer 6: The ML Arsenal.

Latency and interpretability constraints determine your algorithm. This is not a preference; it is a derivation. Start with the constraint, end with the algorithm.

What is the latency requirement?
|
+-- < 50ms (real-time control, tool floor)
|   |
|   +-- Interpretability required -> Linear regression, EWMA controller
|   |   Examples: R2R control, endpoint detection
|   |   Why: coefficients map to physical parameters engineers control
|   |
|   +-- Safety interlock (no ML allowed) -> Hardware PLC logic only
|       Why: ML inference can fail; hardwired interlocks cannot
|
+-- < 2 seconds (APC server, between wafers)
|   |
|   +-- Interpretability required -> Gradient-boosted trees + SHAP
|   |   Examples: Virtual Metrology, FDC classification
|   |   Why: SHAP gives per-prediction explanation in physical terms
|   |
|   +-- Interpretability minimal -> Small quantized CNN (ONNX INT8)
|       Examples: CD-SEM image analysis, optical defect inspection
|       Why: only use CNNs where images are the input; never for sensors
|
+-- Hours to days (enterprise analytics, offline)
    |
    +-- Interpretability required -> XGBoost with SHAP, Random Forest
    |   Examples: fleet-wide yield prediction, root cause attribution
    |
    +-- Research only -> Deep learning, transformers
        Not production: requires GPU, massive data, no interpretability
Latency BudgetInterpretabilityAlgorithmWhen to Use
< 50msRequiredLinear, EWMAR2R control, endpoint detection
< 50msN/A (safety)PLC hardwiredSafety interlocks only
< 2sRequiredGBT + SHAPVirtual Metrology, FDC
< 2sMinimalCNN (ONNX INT8)Image metrology only
HoursRequiredXGBoost + SHAPFleet analysis, yield prediction
HoursNot requiredDeep learningResearch only, not production
Chapter 3.4

Edge Deployment: From Notebook to Production

See also: Theory Library: Edge Deployment / ONNX for the full implementation and environment-specific constraints.

The air gap changes everything. Your development environment never touches production. Deployment is physical media transfer, change control approval, and 2 to 6 weeks of waiting. Design for the production stack from day one.

The deployment pipeline
Development (Your Laptop / Server)
    Train model (XGBoost, sklearn, PyTorch)
    Validate: walk-forward, per-PM, OOD detection
    Export to production format:
        Trees:       XGBoost native binary or JSON
        Neural nets: ONNX (torch.onnx.export)
        Linear:      Coefficients as JSON
    Build inference wrapper (pure Python, stdlib only)
    Package: model + normalization params + feature extractors
    Self-test script: known input -> known output
         |
         | USB drive or secure courier
         v
Air Gap
         |
         v
Production IT
    Antivirus scan
    Code review (no network calls, no obfuscation, no pip installs)
    Change control ticket submission
    Approval: 2 to 6 weeks
         |
         v
Staging (Mirror Hardware: identical 2012 dual-core PC)
    Latency: p99 < 50ms for 10,000 inferences
    Memory:  < 500MB RSS
    Output:  bit-exact match to development (within 1e-6)
         |
         v
Production (Tool Controller)
    Shadow mode: log predictions, do not act on them (first 100 wafers)
    Validation lot: physical metrology confirms VM accuracy
    Active mode: predictions used for routing decisions
ONNX export and latency validation
import torch, onnx, onnxruntime as ort, time, numpy as np

# Export to ONNX (after training any PyTorch model)
dummy = torch.randn(1, input_dim)
torch.onnx.export(
    model, dummy, "vm_model.onnx",
    export_params=True, opset_version=11,
    do_constant_folding=True,
    input_names=['features'], output_names=['prediction'],
    dynamic_axes={'features': {0: 'batch'}, 'prediction': {0: 'batch'}}
)

# Validate the export
onnx.checker.check_model(onnx.load("vm_model.onnx"))

# Validate latency on TARGET hardware (not your laptop)
def validate_latency(model_path, input_shape, n=10_000, target_p99_ms=50):
    session = ort.InferenceSession(model_path)
    latencies = []
    for _ in range(n):
        x = np.random.randn(*input_shape).astype(np.float32)
        t0 = time.perf_counter()
        session.run(None, {'features': x})
        latencies.append((time.perf_counter() - t0) * 1000)
    p99 = np.percentile(latencies, 99)
    if p99 > target_p99_ms:
        raise RuntimeError(f"p99={p99:.1f}ms exceeds {target_p99_ms}ms target")
    print(f"p50={np.percentile(latencies,50):.1f}ms  p99={p99:.1f}ms  max={max(latencies):.1f}ms")

The nine-point deployment checklist

1
Latency p99
METHODtime.perf_counter() loop, 10,000 inferences on mirror hardware
PASS< 50ms
FAILMissed arcs, scrapped lots
2
Memory footprint
METHODpsutil.Process().memory_info()
PASS< 500MB RSS
FAILTool controller crash, production stop
3
Zero external dependencies
METHODStatic analysis, no pip installs
PASSstdlib + your model binary only
FAILDeployment rejection, 4-week delay
4
No network calls
METHODCode review, network namespace test
PASSZero socket() calls
FAILSecurity violation, audit failure
5
Numerical stability
METHODFuzz with -9999.0, NaN, Inf, 1e38
PASSNo crash, no NaN output
FAILSilent model failure, wrong predictions
6
Bit-exact reproducibility
METHODSame input on dev and prod hardware
PASSIdentical output within 1e-6
FAILUnexplained drift, loss of trust
7
Fail-safe wiring
METHODLogic audit with safety engineer
PASSModel output ANDed with hardware interlock
FAILModel bypasses safety, injury risk
8
Rollback tested
METHODDeploy previous version, measure time
PASS< 5 minutes to prior model
FAILExtended downtime on failure
9
Shadow mode validation
METHODLog predictions vs physical metrology for 100 wafers
PASSAccuracy within 2% of development evaluation
FAILModel degradation goes undetected
Chapter 3.5

Validation: Fab-Specific Methods

Standard k-fold cross-validation gives you an optimistic score that will not match production. Three validation strategies replace it.

Walk-Forward Validator (sklearn-compatible)

from sklearn.model_selection import BaseCrossValidator
import numpy as np

class WalkForwardValidator(BaseCrossValidator):
    """
    sklearn-compatible walk-forward validator.
    Train on [0:t], test on [t:t+test_size]. Never shuffle.

    Plug into cross_val_score() as a drop-in replacement for KFold.
    """
    def __init__(self, n_splits=5, test_size=None):
        self.n_splits = n_splits
        self.test_size = test_size

    def split(self, X, y=None, groups=None):
        n = len(X)
        ts = self.test_size or n // (self.n_splits + 1)
        idx = np.arange(n)
        for i in range(1, self.n_splits + 1):
            s = i * ts
            yield idx[:s], idx[s:s + ts]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Usage: exact same API as KFold
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y,
                         cv=WalkForwardValidator(n_splits=5),
                         scoring='neg_mean_squared_error')

Per-PM cross-validation

def per_pm_cv(df, model, features, target, pm_col='pm_sequence'):
    """
    Validate within PM periods AND across PM boundaries.

    within_pm score: can the model interpolate within a maintenance window?
    across_pm score: does the model generalize across PM resets?

    If across_pm << within_pm, the model has learned PM-specific artifacts
    that will not transfer to new PM periods in production.
    """
    within, across = [], []
    periods = sorted(df[pm_col].unique())

    for pm in periods:
        d = df[df[pm_col] == pm]
        if len(d) < 100: continue
        mid = len(d) // 2
        model.fit(d.iloc[:mid][features], d.iloc[:mid][target])
        within.append(model.score(d.iloc[mid:][features], d.iloc[mid:][target]))

    for i in range(len(periods) - 1):
        train = df[df[pm_col] == periods[i]]
        test  = df[df[pm_col] == periods[i+1]]
        model.fit(train[features], train[target])
        across.append(model.score(test[features], test[target]))

    print(f"Within-PM: {np.mean(within):.3f} +/- {np.std(within):.3f}")
    print(f"Across-PM: {np.mean(across):.3f} +/- {np.std(across):.3f}")
    # Gap > 0.1 = model not generalizing across maintenance events
    return {'within_pm': within, 'across_pm': across}

Cost-calibrated scoring

def cost_calibrated_score(y_true, y_pred, ri_scores,
                          cost_fn=2.5e6, cost_fp=500,
                          tolerance=2.0):
    """
    Report total business cost, not accuracy.

    For VM: false negative cost >> false positive cost (ratio ~5000:1).
    Optimizing accuracy alone produces a model that is statistically
    impressive but operationally wrong.
    """
    thresholds = np.linspace(ri_scores.min(), ri_scores.max(), 100)
    best_cost, best_thresh = np.inf, None

    for thresh in thresholds:
        fn = ((ri_scores <= thresh) & (np.abs(y_pred - y_true) > tolerance)).sum()
        fp = ((ri_scores > thresh)  & (np.abs(y_pred - y_true) <= tolerance)).sum()
        cost = fn * cost_fn + fp * cost_fp
        if cost < best_cost:
            best_cost, best_thresh = cost, thresh

    abstain_rate = (ri_scores > best_thresh).mean()
    print(f"Optimal RI threshold: {best_thresh:.3f}")
    print(f"Abstention rate: {abstain_rate:.1%}")
    print(f"Expected annual cost at this threshold: ${best_cost * 365 / len(y_true):,.0f}")
    return best_thresh
Part 3 Summary

ML Arsenal at a Glance

ChapterCore SystemKey Implementation
3.1 VMPredict metrology before it arrivesFeature extraction with physical interpretation + XGBoost + SHAP + Autoencoder RI + cost-calibrated router
3.2 FDCReal-time fault detection in layersUnivariate SPC (Shewhart/EWMA/CUSUM) then T-squared then model-based residual monitoring
3.3 Model SelectionConstraints choose the algorithmDecision tree: latency constraint > interpretability requirement > specific algorithm
3.4 Edge DeploymentProduction on constrained hardwareONNX export, p99 latency on mirror hardware, nine-point checklist, shadow mode first
3.5 ValidationFab-specific validation methodsWalkForwardValidator (sklearn-compatible), per-PM CV, cost-calibrated threshold; never standard KFold
<- Part 2: Technical FoundationPart 4: Interview Simulation