The Technical Foundation
Working code for live interviews. The five SQL join patterns, six Python data engineering patterns, the SPC decision tree, Nyquist for sensor data, and the ISA-95 hierarchy.
SQL for Fab Data: The Five Join Patterns
Fab data lives in relational databases: Oracle, SQL Server, PostgreSQL, or specialized historians. Standard SQL fails because of three unusual properties:
Pattern 1: Tolerance Join (merge_asof)
Problem: Join FDC sensor data to MES lot context. Both have timestamps. They never match exactly due to clock drift, jitter, and different sampling rates.
-- NEVER DO THIS SELECT f.*, m.lot_id FROM fdc_data f JOIN mes_events m ON f.timestamp = m.step_start; -- Timestamps never equal. Clock drift of even 1 second drops all rows.
import pandas as pd
# Both DataFrames must be sorted by timestamp first
fdc_sorted = fdc_data.sort_values('timestamp')
mes_sorted = mes_events.sort_values('step_start')
joined = pd.merge_asof(
fdc_sorted,
mes_sorted,
left_on='timestamp',
right_on='step_start',
by='tool_id', # Never join across tools
tolerance=pd.Timedelta('5 seconds'), # Physically plausible clock drift
direction='nearest' # or 'backward' for causal joins
)
# Validate join quality
orphan_pct = joined['lot_id'].isna().mean() * 100
print(f"Orphan FDC rows: {orphan_pct:.1f}%")
# >5% orphans = systematic alignment problem, investigate before proceedingPattern 2: Range Join (BETWEEN)
Problem: Attribute FDC rows to process steps. A step has a duration from step_start to step_end, and multiple FDC rows fall within this range.
WITH fdc_labeled AS (
SELECT
f.*,
m.lot_id,
m.recipe_id,
m.step_name,
EXTRACT(EPOCH FROM (f.timestamp - m.step_start)) AS offset_seconds
FROM fdc_data f
LEFT JOIN mes_events m
ON f.tool_id = m.tool_id
AND f.timestamp BETWEEN m.step_start AND m.step_end
)
SELECT *
FROM fdc_labeled
WHERE ABS(offset_seconds) < 300 -- Reject joins offset >5 min (likely wrong tool mapping)
OR offset_seconds IS NULL; -- Keep orphans for investigationPattern 3: PM-Aware Partition (Window Functions)
Problem: Compute per-PM-period statistics. PM events reset tool state and shift the distribution. Global statistics blend pre-PM and post-PM data, destroying the signal you need for drift detection.
WITH pm_periods AS (
SELECT
tool_id,
pm_timestamp AS period_start,
LEAD(pm_timestamp) OVER (PARTITION BY tool_id ORDER BY pm_timestamp)
AS period_end,
ROW_NUMBER() OVER (PARTITION BY tool_id ORDER BY pm_timestamp) - 1
AS pm_sequence -- 0 = commissioning
FROM pm_log
),
fdc_with_period AS (
SELECT
f.*,
p.pm_sequence,
f.timestamp - p.period_start AS time_since_pm
FROM fdc_data f
JOIN pm_periods p
ON f.tool_id = p.tool_id
AND f.timestamp >= p.period_start
AND (f.timestamp < p.period_end OR p.period_end IS NULL)
)
SELECT
tool_id,
pm_sequence,
AVG(chamber_pressure) AS pressure_mean,
STDDEV(chamber_pressure) AS pressure_std,
AVG(chamber_pressure) - LAG(AVG(chamber_pressure)) OVER (
PARTITION BY tool_id ORDER BY pm_sequence
) AS pressure_shift
FROM fdc_with_period
GROUP BY tool_id, pm_sequence;
-- pm_sequence = 0 is your golden baseline. All others compared to it.Pattern 4: Gap Detection (LAG / diff)
Problem: Detect communication timeouts, frozen historians, and missing data. SECS/GEM reports at regular intervals; gaps indicate failure.
WITH timediffs AS (
SELECT
tool_id,
sensor_id,
timestamp,
timestamp - LAG(timestamp) OVER (
PARTITION BY tool_id, sensor_id
ORDER BY timestamp
) AS time_since_last
FROM fdc_data
)
SELECT *
FROM timediffs
WHERE time_since_last > INTERVAL '5 seconds';
-- Expected 1 Hz = 1 second between rows. Gap >5s = SECS/GEM timeout or network failure.fdc_data['time_since_last'] = (
fdc_data
.groupby(['tool_id', 'sensor_id'])['timestamp']
.diff()
)
timeouts = fdc_data[fdc_data['time_since_last'] > pd.Timedelta('5 seconds')]
print(f"Timeout events: {len(timeouts)}")
print(timeouts[['tool_id', 'sensor_id', 'timestamp', 'time_since_last']].head())Pattern 5: Deduplication (ROW_NUMBER)
Problem: Duplicate CEIDs from SECS retry logic, network duplicates, or historian bugs. Same wafer, same step, multiple start events.
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY tool_id, lot_id, step_name
ORDER BY timestamp
) AS rn
FROM mes_events
WHERE ceid IN ('Step_Start', 'Step_End')
)
SELECT * FROM deduped WHERE rn = 1;
-- rn = 1 keeps the first (earliest) occurrence per tool/lot/step.
-- Use ORDER BY timestamp DESC to keep the most recent (if retries are corrections).| Pattern | Function | When to Use | Validation Check |
|---|---|---|---|
| Tolerance join | merge_asof() | FDC to MES with clock drift | Orphan %, cross-correlation lag |
| Range join | BETWEEN | Event attribution to step duration | Offset from step start |
| PM-aware partition | Window functions | Per-PM statistics, golden baseline | Period sequence continuity |
| Gap detection | LAG() / diff() | SECS/GEM timeout detection | Gap distribution vs. expected interval |
| Deduplication | ROW_NUMBER() | SECS retry logic, duplicate CEIDs | Count before/after, verify uniqueness |
Python for Fab Data: Six Pandas Patterns
Standard Pandas patterns fail on fab data because of sentinel values, irregular sampling, and physical constraints. Here are the six patterns that work in production.
Pattern 1: Sentinel Handling
Problem: Missing data encoded as -9999.0, 9999.0, or 0.0, not NaN. Standard statistics include sentinels as valid values.
# WRONG: includes -9999.0 as a valid pressure reading
df['chamber_pressure'].mean()
# Correct: map sentinels to NaN first
import numpy as np
SENTINELS = [-9999.0, 9999.0, -999.0, 999.0]
df_clean = df.replace(SENTINELS, np.nan)
df_clean['chamber_pressure'].mean() # Now excludes sentinels
# Ambiguous 0.0: cross-reference with recipe step to diagnose
df['mfc_status'] = np.where(
(df['mfc_flow'] == 0.0) & (df['recipe_step'] == 'gas_flow_on'),
'BROKEN', # 0.0 sccm during active gas flow = MFC failure
'NORMAL'
)| Historian Vendor | Sentinel Value | Meaning |
|---|---|---|
| Historian A | -9999.0 | Communication timeout |
| Historian B | 9999.0 | Out of range (over scale) |
| Historian C | -999.0 | Sensor offline |
| Legacy tool | 0.0 | Ambiguous: off or broken? Cross-reference recipe step. |
Pattern 2: Time-Based Windows (Irregular Sampling)
Problem: SECS/GEM data has gaps from timeouts and PM events. Index-based rolling windows treat a 60-second gap the same as a 1-second gap.
# WRONG: 60-row window includes gaps as valid samples
df.rolling(window=60).mean()
# Correct: time-based window respects actual clock time
df_time = df.set_index('timestamp')
# '60s' window = exactly 60 seconds of actual time
df_rolling = df_time.rolling(window='60s', min_periods=10).mean()
# min_periods=10: require at least 10 samples, reject sparse windows
# Resample to regular grid for downstream ML (e.g., CNNs that expect fixed length)
df_1hz = df_time.resample('1s').mean()
# NaN where no data in that second - forward-fill only if you understand the physicsPattern 3: Robust Scaling Per-PM Period
Problem: PM events create step changes in sensor distributions. A global StandardScaler mixes pre-PM and post-PM data, destroying the signal.
# WRONG: global scaler mixes pre-PM and post-PM distributions from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # Correct: fit on commissioning data only (pm_sequence = 0) from sklearn.preprocessing import RobustScaler commissioning = df[df['pm_sequence'] == 0] scaler = RobustScaler(quantile_range=(5, 95)) # Resistant to arc spikes scaler.fit(commissioning[feature_cols]) # Transform all data using the commissioning distribution as reference X_scaled = scaler.transform(df[feature_cols]) # RobustScaler uses median/IQR instead of mean/std: resistant to the # arc spikes and sentinel values that would corrupt StandardScaler
Pattern 4: Frozen Sensor Detection
Problem: A sensor reporting identical values for 60+ seconds is broken, not stable. Zero variance is a hardware failure signature (Archetype 6: The Frozen Thermocouple).
def detect_frozen(series: pd.Series, window: str = '60s',
threshold: float = None) -> pd.Series:
"""
Returns boolean mask: True where sensor is frozen (variance = 0).
Physical reality: every analog sensor measuring a real process has
non-zero variance from thermal noise and process variation.
Zero variance for 60+ seconds = broken circuit, not stability.
"""
rolling_var = series.rolling(window=window, min_periods=10).var()
if threshold is None:
# Calibrate from the first 10% of data (assumed healthy)
healthy = rolling_var.dropna().iloc[:len(rolling_var) // 10]
threshold = healthy.quantile(0.01) # 1st percentile of healthy variance
return rolling_var < threshold
# Usage: flag frozen readings before they reach any model
frozen_mask = detect_frozen(df['temperature'], window='60s')
df.loc[frozen_mask, 'temperature'] = np.nan # Replace with NaN
df.loc[frozen_mask, 'quality_flag'] = 'FROZEN_SENSOR' # Log the reasonPattern 5: EWMA Controller
Problem: Etch depth drifts 0.3 nm/day. An EWMA run-to-run controller adjusts the recipe after every measurement to maintain target. Stability requires 0 < G * lambda < 2.
class EWMAController:
"""
EWMA Run-to-Run controller.
Update equation: u[n] = u[n-1] + G * lam * (y_target - y[n])
Stable when: 0 < G * lam < 2
Conservative: lam = 0.3 (slow to react, robust to noise)
Aggressive: lam = 0.8 (fast response, amplifies noise)
"""
def __init__(self, G: float, lam: float, u_initial: float, y_target: float):
self.G, self.lam = G, lam
self.u, self.y_target = u_initial, y_target
self.history = []
if not (0 < G * lam < 2):
raise ValueError(f"Unstable controller: G*lam = {G*lam:.2f}, must be in (0, 2)")
def update(self, y_measured: float) -> float:
error = self.y_target - y_measured
self.u = self.u + self.G * self.lam * error
self.history.append({'y': y_measured, 'error': error, 'u': self.u})
return self.u
# Usage: etch depth control
controller = EWMAController(
G=0.5, # Process gain: 1 nm etch time change -> 0.5 nm CD change
lam=0.3, # Conservative smoothing
u_initial=60.0, # Initial etch time (seconds)
y_target=45.0 # Target CD (nm)
)
for y in measured_cds:
u_new = controller.update(y)
print(f"Measured: {y:.2f} nm -> New etch time: {u_new:.2f} s")Pattern 6: Walk-Forward Validation
Problem: Standard k-fold cross-validation shuffles time, putting future wafers in training. In production you can never train on future data. Walk-forward validation respects the arrow of time.
def walk_forward_splits(df, n_splits=5, test_size=None):
"""
Train on [0:t], validate on [t:t+test_size]. Never shuffle.
Each split adds more training data (expanding window).
This matches production: you retrain periodically with all
historical data, then evaluate on the next unseen period.
"""
n = len(df)
test_size = test_size or n // (n_splits + 1)
for i in range(1, n_splits + 1):
split = i * test_size
train = df.iloc[:split]
test = df.iloc[split:split + test_size]
assert train.index.max() < test.index.min(), "Leakage detected"
yield train, test
# PM-aware extension: also test cross-PM generalization
def walk_forward_pm_aware(df, pm_col='pm_sequence'):
"""Within-PM splits + cross-PM generalization test."""
for pm in sorted(df[pm_col].unique()):
pm_data = df[df[pm_col] == pm]
mid = len(pm_data) // 2
yield pm_data.iloc[:mid], pm_data.iloc[mid:]
# Hardest test: train on PM 0 (commissioning), test on PM 1
yield df[df[pm_col] == 0], df[df[pm_col] == 1]Statistical Process Control: The Decision Tree
SPC is the monitoring layer that catches excursions before they become scrap. The wrong chart misses the failure mode. Start with the decision tree, not with "add more charts."
What is the failure mode?
|
+-- Sudden step change (shift) -----------> Shewhart 3-sigma chart
| |
| +-- Single parameter ---------------> X-bar / R-chart
| |
| +-- Multiple correlated parameters -> Hotelling T-squared
| (catches joint excursions each individual chart misses)
|
+-- Small persistent drift --------------> CUSUM (Cumulative Sum)
| |
| +-- Drift direction known ----------> One-sided CUSUM
| |
| +-- Autocorrelated data ------------> EWMA control chart
|
+-- Periodic / seasonal -----------------> Decompose, then Shewhart on residuals
|
+-- Unknown -----------------------------> Start with CUSUM or EWMA
(Shewhart misses shifts smaller than 1.5-sigma; CUSUM catches them)Shewhart Chart
def shewhart_chart(data: pd.Series, L: float = 3.0) -> dict:
"""3-sigma Shewhart control chart. Fast, interpretable, misses small shifts."""
center = data.mean()
sigma = data.std()
ucl, lcl = center + L * sigma, center - L * sigma
signals = (data > ucl) | (data < lcl)
return {'center': center, 'ucl': ucl, 'lcl': lcl, 'signals': signals}
# Limitation: a 1-sigma drift can persist for hundreds of samples before signaling.
# Use CUSUM for consumable degradation (CMP pads, electrodes, focus rings).CUSUM Chart
def cusum_chart(data: pd.Series, target: float = None,
h: float = 4.0, k: float = 0.5) -> dict:
"""
Cumulative Sum control chart. Detects small persistent shifts.
h: decision interval (alarm threshold), typically 4 to 5
k: reference value (slack), typically 0.5
C+ accumulates upward deviations, C- accumulates downward.
Reset to 0 when negative (prevents history from ancient events).
"""
if target is None:
target = data.mean()
sigma = data.std()
z = (data - target) / sigma
C_plus = pd.Series(0.0, index=data.index)
C_minus = pd.Series(0.0, index=data.index)
C_plus.iloc[0] = max(0, z.iloc[0] - k)
C_minus.iloc[0] = max(0, -z.iloc[0] - k)
for i in range(1, len(data)):
C_plus.iloc[i] = max(0, C_plus.iloc[i-1] + z.iloc[i] - k)
C_minus.iloc[i] = max(0, C_minus.iloc[i-1] - z.iloc[i] - k)
return {'C_plus': C_plus, 'C_minus': C_minus,
'signals': (C_plus > h) | (C_minus > h)}EWMA Control Chart
def ewma_chart(data: pd.Series, lam: float = 0.2, L: float = 3.0) -> dict:
"""
EWMA control chart. For autocorrelated data.
Small lam (0.1) = smooth, slow to detect.
Large lam (0.4) = responsive, noisier.
Asymptotic control limits (valid after ~20 observations):
sigma_ewma = sigma * sqrt(lam / (2 - lam))
"""
ewma = data.ewm(alpha=lam, adjust=False).mean()
sigma_ewma = data.std() * (lam / (2 - lam)) ** 0.5
center = data.mean()
ucl, lcl = center + L * sigma_ewma, center - L * sigma_ewma
return {'ewma': ewma, 'ucl': ucl, 'lcl': lcl,
'signals': (ewma > ucl) | (ewma < lcl)}Hotelling T-Squared (Multivariate)
from scipy.stats import f as f_dist
def hotelling_t2(data: pd.DataFrame, alpha: float = 0.001) -> dict:
"""
Hotelling T-squared multivariate SPC.
Use when: multiple correlated sensors (pressure, temperature, flow)
where individual charts show nothing but a joint excursion is real.
T2 = (x - mu)' * Sigma_inv * (x - mu)
"""
n, p = data.shape
mu = data.mean().values
cov = data.cov().values
cov_inv = np.linalg.inv(cov)
T2 = pd.Series(
[(data.iloc[i].values - mu) @ cov_inv @ (data.iloc[i].values - mu)
for i in range(n)],
index=data.index
)
# F-distribution control limit
ucl = (p * (n-1) * (n+1)) / (n * (n-p)) * f_dist.ppf(1 - alpha, p, n-p)
return {'T2': T2, 'ucl': ucl, 'signals': T2 > ucl}Sampling and Signal Processing
Physical constraints on data collection. Not all signals can be detected from all data, regardless of which algorithm you use.
The ISA-95 Hierarchy
Where data lives, who owns it, and how it flows. Every data science project in a fab operates within this hierarchy. Knowing which level your data comes from determines what joins are possible, what latency you have, and where your model can be deployed.
Level 4: Enterprise
ERP, SAP, financial planning, corporate reporting
Data: quarterly yield summaries, cost accounting, capacity plans
Latency: days to weeks
|
| business objectives
v
Level 3: Manufacturing Operations
MES (Manufacturing Execution System): lot tracking, routing, recipes
Data: wafer location, step history, metrology results, lot genealogy
Latency: seconds to minutes
|
| lot scheduling, routing, recipes
v
Level 2: Manufacturing Control
FDC historian, APC server, VM system, R2R controllers
Data: 1 Hz SECS/GEM sensor streams, alarm logs, recipe adjustments
Latency: milliseconds to seconds
|
| real-time control, fault detection
v
Level 1: Basic Control
PLC, tool controller, safety interlocks, SECS/GEM gateway
Data: raw sensor I/O at 100 Hz to 10 kHz via Interface A
Latency: microseconds to milliseconds
|
| sensor I/O, actuator commands
v
Level 0: Process
Chamber, plasma, wafer, physicsTechnical Foundation at a Glance
| Chapter | Core Skills | Key Patterns |
|---|---|---|
| 2.1 SQL | Five join patterns for temporal, drifting, asynchronous fab data | merge_asof, BETWEEN, window functions, LAG/diff, ROW_NUMBER |
| 2.2 Python | Pandas patterns for sentinels, irregular sampling, frozen sensors, R2R control, walk-forward validation | replace(), rolling("60s"), RobustScaler, variance threshold, EWMA class, split generators |
| 2.3 SPC | Decision tree: pick the right chart before writing code | Shewhart, CUSUM, EWMA control chart, Hotelling T-squared |
| 2.4 Sampling | Nyquist, aliasing, jitter, quantization as physical constraints | Frequency calculations, time-based windows, tolerance joins |
| 2.5 ISA-95 | Hierarchy navigation, air gap understanding, deployment constraints | Level 0 to 4 architecture, Interface A, ONNX deployment target |