Part 2

The Technical Foundation

Working code for live interviews. The five SQL join patterns, six Python data engineering patterns, the SPC decision tree, Nyquist for sensor data, and the ISA-95 hierarchy.

Chapter 2.1

SQL for Fab Data: The Five Join Patterns

Canonical reference implementations with full conceptual treatment are in Theory Library: SQL for Fab Data. This section focuses on interview execution: what to announce before writing, what the interviewer is testing for with each pattern, and the audit steps that separate strong candidates.

Fab data lives in relational databases: Oracle, SQL Server, PostgreSQL, or specialized historians. Standard SQL fails because of three unusual properties:

1

Time is the primary key, but clocks drift. Tool A and Tool B report the same time differently.

2

Events are asynchronous. FDC reports continuously; MES reports at step boundaries; metrology reports when complete.

3

Relationships are temporal, not static. A wafer exists in FDC at time T1, in MES at T2, in metrology at T3. The "same wafer" is defined by temporal overlap, not ID equality.

Pattern 1: Tolerance Join (merge_asof)

Problem: Join FDC sensor data to MES lot context. Both have timestamps. They never match exactly due to clock drift, jitter, and different sampling rates.

Wrong: exact equality join (returns almost nothing)

-- NEVER DO THIS
SELECT f.*, m.lot_id
FROM fdc_data f
JOIN mes_events m ON f.timestamp = m.step_start;
-- Timestamps never equal. Clock drift of even 1 second drops all rows.

Correct: tolerance-based join with merge_asof

import pandas as pd

# Both DataFrames must be sorted by timestamp first
fdc_sorted = fdc_data.sort_values('timestamp')
mes_sorted = mes_events.sort_values('step_start')

joined = pd.merge_asof(
    fdc_sorted,
    mes_sorted,
    left_on='timestamp',
    right_on='step_start',
    by='tool_id',                          # Never join across tools
    tolerance=pd.Timedelta('5 seconds'),   # Physically plausible clock drift
    direction='nearest'                    # or 'backward' for causal joins
)

# Validate join quality
orphan_pct = joined['lot_id'].isna().mean() * 100
print(f"Orphan FDC rows: {orphan_pct:.1f}%")
# >5% orphans = systematic alignment problem, investigate before proceeding

Pattern 2: Range Join (BETWEEN)

Problem: Attribute FDC rows to process steps. A step has a duration from step_start to step_end, and multiple FDC rows fall within this range.

SQL: attribute sensor rows to their process step

WITH fdc_labeled AS (
  SELECT
    f.*,
    m.lot_id,
    m.recipe_id,
    m.step_name,
    EXTRACT(EPOCH FROM (f.timestamp - m.step_start)) AS offset_seconds
  FROM fdc_data f
  LEFT JOIN mes_events m
    ON f.tool_id = m.tool_id
    AND f.timestamp BETWEEN m.step_start AND m.step_end
)
SELECT *
FROM fdc_labeled
WHERE ABS(offset_seconds) < 300   -- Reject joins offset >5 min (likely wrong tool mapping)
   OR offset_seconds IS NULL;      -- Keep orphans for investigation

Pattern 3: PM-Aware Partition (Window Functions)

Problem: Compute per-PM-period statistics. PM events reset tool state and shift the distribution. Global statistics blend pre-PM and post-PM data, destroying the signal you need for drift detection.

SQL: compute per-PM statistics with commissioning baseline

WITH pm_periods AS (
  SELECT
    tool_id,
    pm_timestamp                                              AS period_start,
    LEAD(pm_timestamp) OVER (PARTITION BY tool_id ORDER BY pm_timestamp)
                                                              AS period_end,
    ROW_NUMBER() OVER (PARTITION BY tool_id ORDER BY pm_timestamp) - 1
                                                              AS pm_sequence  -- 0 = commissioning
  FROM pm_log
),
fdc_with_period AS (
  SELECT
    f.*,
    p.pm_sequence,
    f.timestamp - p.period_start AS time_since_pm
  FROM fdc_data f
  JOIN pm_periods p
    ON f.tool_id = p.tool_id
    AND f.timestamp >= p.period_start
    AND (f.timestamp < p.period_end OR p.period_end IS NULL)
)
SELECT
  tool_id,
  pm_sequence,
  AVG(chamber_pressure)                                                AS pressure_mean,
  STDDEV(chamber_pressure)                                             AS pressure_std,
  AVG(chamber_pressure) - LAG(AVG(chamber_pressure)) OVER (
    PARTITION BY tool_id ORDER BY pm_sequence
  )                                                                    AS pressure_shift
FROM fdc_with_period
GROUP BY tool_id, pm_sequence;
-- pm_sequence = 0 is your golden baseline. All others compared to it.

Pattern 4: Gap Detection (LAG / diff)

Problem: Detect communication timeouts, frozen historians, and missing data. SECS/GEM reports at regular intervals; gaps indicate failure.

SQL: flag rows with timestamp gaps exceeding expected interval

WITH timediffs AS (
  SELECT
    tool_id,
    sensor_id,
    timestamp,
    timestamp - LAG(timestamp) OVER (
      PARTITION BY tool_id, sensor_id
      ORDER BY timestamp
    ) AS time_since_last
  FROM fdc_data
)
SELECT *
FROM timediffs
WHERE time_since_last > INTERVAL '5 seconds';
-- Expected 1 Hz = 1 second between rows. Gap >5s = SECS/GEM timeout or network failure.

Python equivalent

fdc_data['time_since_last'] = (
    fdc_data
    .groupby(['tool_id', 'sensor_id'])['timestamp']
    .diff()
)

timeouts = fdc_data[fdc_data['time_since_last'] > pd.Timedelta('5 seconds')]
print(f"Timeout events: {len(timeouts)}")
print(timeouts[['tool_id', 'sensor_id', 'timestamp', 'time_since_last']].head())

Pattern 5: Deduplication (ROW_NUMBER)

Problem: Duplicate CEIDs from SECS retry logic, network duplicates, or historian bugs. Same wafer, same step, multiple start events.

SQL: deduplicate SECS/GEM events, keep first occurrence

WITH deduped AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY tool_id, lot_id, step_name
      ORDER BY timestamp
    ) AS rn
  FROM mes_events
  WHERE ceid IN ('Step_Start', 'Step_End')
)
SELECT * FROM deduped WHERE rn = 1;
-- rn = 1 keeps the first (earliest) occurrence per tool/lot/step.
-- Use ORDER BY timestamp DESC to keep the most recent (if retries are corrections).

Five patterns: quick reference

Pattern	Function	When to Use	Validation Check
Tolerance join	merge_asof()	FDC to MES with clock drift	Orphan %, cross-correlation lag
Range join	BETWEEN	Event attribution to step duration	Offset from step start
PM-aware partition	Window functions	Per-PM statistics, golden baseline	Period sequence continuity
Gap detection	LAG() / diff()	SECS/GEM timeout detection	Gap distribution vs. expected interval
Deduplication	ROW_NUMBER()	SECS retry logic, duplicate CEIDs	Count before/after, verify uniqueness

Chapter 2.2

Python for Fab Data: Six Pandas Patterns

Standard Pandas patterns fail on fab data because of sentinel values, irregular sampling, and physical constraints. Here are the six patterns that work in production.

Pattern 1: Sentinel Handling

Problem: Missing data encoded as -9999.0, 9999.0, or 0.0, not NaN. Standard statistics include sentinels as valid values.

Wrong and correct sentinel handling

# WRONG: includes -9999.0 as a valid pressure reading
df['chamber_pressure'].mean()

# Correct: map sentinels to NaN first
import numpy as np

SENTINELS = [-9999.0, 9999.0, -999.0, 999.0]
df_clean = df.replace(SENTINELS, np.nan)
df_clean['chamber_pressure'].mean()  # Now excludes sentinels

# Ambiguous 0.0: cross-reference with recipe step to diagnose
df['mfc_status'] = np.where(
    (df['mfc_flow'] == 0.0) & (df['recipe_step'] == 'gas_flow_on'),
    'BROKEN',   # 0.0 sccm during active gas flow = MFC failure
    'NORMAL'
)

Historian Vendor	Sentinel Value	Meaning
Historian A	-9999.0	Communication timeout
Historian B	9999.0	Out of range (over scale)
Historian C	-999.0	Sensor offline
Legacy tool	0.0	Ambiguous: off or broken? Cross-reference recipe step.

Pattern 2: Time-Based Windows (Irregular Sampling)

Problem: SECS/GEM data has gaps from timeouts and PM events. Index-based rolling windows treat a 60-second gap the same as a 1-second gap.

Wrong and correct rolling window

# WRONG: 60-row window includes gaps as valid samples
df.rolling(window=60).mean()

# Correct: time-based window respects actual clock time
df_time = df.set_index('timestamp')

# '60s' window = exactly 60 seconds of actual time
df_rolling = df_time.rolling(window='60s', min_periods=10).mean()
# min_periods=10: require at least 10 samples, reject sparse windows

# Resample to regular grid for downstream ML (e.g., CNNs that expect fixed length)
df_1hz = df_time.resample('1s').mean()
# NaN where no data in that second - forward-fill only if you understand the physics

Pattern 3: Robust Scaling Per-PM Period

Problem: PM events create step changes in sensor distributions. A global StandardScaler mixes pre-PM and post-PM data, destroying the signal.

Wrong and correct scaling

# WRONG: global scaler mixes pre-PM and post-PM distributions
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Correct: fit on commissioning data only (pm_sequence = 0)
from sklearn.preprocessing import RobustScaler

commissioning = df[df['pm_sequence'] == 0]
scaler = RobustScaler(quantile_range=(5, 95))  # Resistant to arc spikes
scaler.fit(commissioning[feature_cols])

# Transform all data using the commissioning distribution as reference
X_scaled = scaler.transform(df[feature_cols])
# RobustScaler uses median/IQR instead of mean/std: resistant to the
# arc spikes and sentinel values that would corrupt StandardScaler

Pattern 4: Frozen Sensor Detection

Problem: A sensor reporting identical values for 60+ seconds is broken, not stable. Zero variance is a hardware failure signature (Archetype 6: The Frozen Thermocouple).

Frozen sensor detector: rolling variance below healthy minimum

def detect_frozen(series: pd.Series, window: str = '60s',
                  threshold: float = None) -> pd.Series:
    """
    Returns boolean mask: True where sensor is frozen (variance = 0).

    Physical reality: every analog sensor measuring a real process has
    non-zero variance from thermal noise and process variation.
    Zero variance for 60+ seconds = broken circuit, not stability.
    """
    rolling_var = series.rolling(window=window, min_periods=10).var()

    if threshold is None:
        # Calibrate from the first 10% of data (assumed healthy)
        healthy = rolling_var.dropna().iloc[:len(rolling_var) // 10]
        threshold = healthy.quantile(0.01)  # 1st percentile of healthy variance

    return rolling_var < threshold

# Usage: flag frozen readings before they reach any model
frozen_mask = detect_frozen(df['temperature'], window='60s')
df.loc[frozen_mask, 'temperature'] = np.nan          # Replace with NaN
df.loc[frozen_mask, 'quality_flag'] = 'FROZEN_SENSOR' # Log the reason

Pattern 5: EWMA Controller

Problem: Etch depth drifts 0.3 nm/day. An EWMA run-to-run controller adjusts the recipe after every measurement to maintain target. Stability requires 0 < G * lambda < 2.

EWMA Run-to-Run controller class

class EWMAController:
    """
    EWMA Run-to-Run controller.

    Update equation: u[n] = u[n-1] + G * lam * (y_target - y[n])

    Stable when: 0 < G * lam < 2
    Conservative: lam = 0.3 (slow to react, robust to noise)
    Aggressive:   lam = 0.8 (fast response, amplifies noise)
    """
    def __init__(self, G: float, lam: float, u_initial: float, y_target: float):
        self.G, self.lam = G, lam
        self.u, self.y_target = u_initial, y_target
        self.history = []
        if not (0 < G * lam < 2):
            raise ValueError(f"Unstable controller: G*lam = {G*lam:.2f}, must be in (0, 2)")

    def update(self, y_measured: float) -> float:
        error = self.y_target - y_measured
        self.u = self.u + self.G * self.lam * error
        self.history.append({'y': y_measured, 'error': error, 'u': self.u})
        return self.u

# Usage: etch depth control
controller = EWMAController(
    G=0.5,       # Process gain: 1 nm etch time change -> 0.5 nm CD change
    lam=0.3,     # Conservative smoothing
    u_initial=60.0,  # Initial etch time (seconds)
    y_target=45.0    # Target CD (nm)
)

for y in measured_cds:
    u_new = controller.update(y)
    print(f"Measured: {y:.2f} nm  ->  New etch time: {u_new:.2f} s")

Pattern 6: Walk-Forward Validation

Problem: Standard k-fold cross-validation shuffles time, putting future wafers in training. In production you can never train on future data. Walk-forward validation respects the arrow of time.

Walk-forward splits: never shuffle, never leak future data

def walk_forward_splits(df, n_splits=5, test_size=None):
    """
    Train on [0:t], validate on [t:t+test_size]. Never shuffle.

    Each split adds more training data (expanding window).
    This matches production: you retrain periodically with all
    historical data, then evaluate on the next unseen period.
    """
    n = len(df)
    test_size = test_size or n // (n_splits + 1)

    for i in range(1, n_splits + 1):
        split = i * test_size
        train = df.iloc[:split]
        test = df.iloc[split:split + test_size]
        assert train.index.max() < test.index.min(), "Leakage detected"
        yield train, test

# PM-aware extension: also test cross-PM generalization
def walk_forward_pm_aware(df, pm_col='pm_sequence'):
    """Within-PM splits + cross-PM generalization test."""
    for pm in sorted(df[pm_col].unique()):
        pm_data = df[df[pm_col] == pm]
        mid = len(pm_data) // 2
        yield pm_data.iloc[:mid], pm_data.iloc[mid:]
    # Hardest test: train on PM 0 (commissioning), test on PM 1
    yield df[df[pm_col] == 0], df[df[pm_col] == 1]

Chapter 2.3

Statistical Process Control: The Decision Tree

SPC is the monitoring layer that catches excursions before they become scrap. The wrong chart misses the failure mode. Start with the decision tree, not with "add more charts."

SPC decision tree: pick the chart before writing code

What is the failure mode?
|
+-- Sudden step change (shift) -----------> Shewhart 3-sigma chart
|   |
|   +-- Single parameter ---------------> X-bar / R-chart
|   |
|   +-- Multiple correlated parameters -> Hotelling T-squared
|       (catches joint excursions each individual chart misses)
|
+-- Small persistent drift --------------> CUSUM (Cumulative Sum)
|   |
|   +-- Drift direction known ----------> One-sided CUSUM
|   |
|   +-- Autocorrelated data ------------> EWMA control chart
|
+-- Periodic / seasonal -----------------> Decompose, then Shewhart on residuals
|
+-- Unknown -----------------------------> Start with CUSUM or EWMA
    (Shewhart misses shifts smaller than 1.5-sigma; CUSUM catches them)

Shewhart Chart

def shewhart_chart(data: pd.Series, L: float = 3.0) -> dict:
    """3-sigma Shewhart control chart. Fast, interpretable, misses small shifts."""
    center = data.mean()
    sigma  = data.std()
    ucl, lcl = center + L * sigma, center - L * sigma
    signals = (data > ucl) | (data < lcl)
    return {'center': center, 'ucl': ucl, 'lcl': lcl, 'signals': signals}

# Limitation: a 1-sigma drift can persist for hundreds of samples before signaling.
# Use CUSUM for consumable degradation (CMP pads, electrodes, focus rings).

CUSUM Chart

def cusum_chart(data: pd.Series, target: float = None,
               h: float = 4.0, k: float = 0.5) -> dict:
    """
    Cumulative Sum control chart. Detects small persistent shifts.

    h: decision interval (alarm threshold), typically 4 to 5
    k: reference value (slack), typically 0.5

    C+ accumulates upward deviations, C- accumulates downward.
    Reset to 0 when negative (prevents history from ancient events).
    """
    if target is None:
        target = data.mean()
    sigma = data.std()
    z = (data - target) / sigma

    C_plus  = pd.Series(0.0, index=data.index)
    C_minus = pd.Series(0.0, index=data.index)
    C_plus.iloc[0]  = max(0, z.iloc[0] - k)
    C_minus.iloc[0] = max(0, -z.iloc[0] - k)

    for i in range(1, len(data)):
        C_plus.iloc[i]  = max(0, C_plus.iloc[i-1]  + z.iloc[i] - k)
        C_minus.iloc[i] = max(0, C_minus.iloc[i-1] - z.iloc[i] - k)

    return {'C_plus': C_plus, 'C_minus': C_minus,
            'signals': (C_plus > h) | (C_minus > h)}

EWMA Control Chart

def ewma_chart(data: pd.Series, lam: float = 0.2, L: float = 3.0) -> dict:
    """
    EWMA control chart. For autocorrelated data.
    Small lam (0.1) = smooth, slow to detect.
    Large lam (0.4) = responsive, noisier.

    Asymptotic control limits (valid after ~20 observations):
    sigma_ewma = sigma * sqrt(lam / (2 - lam))
    """
    ewma = data.ewm(alpha=lam, adjust=False).mean()
    sigma_ewma = data.std() * (lam / (2 - lam)) ** 0.5
    center = data.mean()
    ucl, lcl = center + L * sigma_ewma, center - L * sigma_ewma
    return {'ewma': ewma, 'ucl': ucl, 'lcl': lcl,
            'signals': (ewma > ucl) | (ewma < lcl)}

Hotelling T-Squared (Multivariate)

from scipy.stats import f as f_dist

def hotelling_t2(data: pd.DataFrame, alpha: float = 0.001) -> dict:
    """
    Hotelling T-squared multivariate SPC.

    Use when: multiple correlated sensors (pressure, temperature, flow)
    where individual charts show nothing but a joint excursion is real.

    T2 = (x - mu)' * Sigma_inv * (x - mu)
    """
    n, p  = data.shape
    mu    = data.mean().values
    cov   = data.cov().values
    cov_inv = np.linalg.inv(cov)

    T2 = pd.Series(
        [(data.iloc[i].values - mu) @ cov_inv @ (data.iloc[i].values - mu)
         for i in range(n)],
        index=data.index
    )
    # F-distribution control limit
    ucl = (p * (n-1) * (n+1)) / (n * (n-p)) * f_dist.ppf(1 - alpha, p, n-p)
    return {'T2': T2, 'ucl': ucl, 'signals': T2 > ucl}

Chapter 2.4

Sampling and Signal Processing

Physical constraints on data collection. Not all signals can be detected from all data, regardless of which algorithm you use.

Nyquist-Shannon Sampling Theorem

To detect a signal with maximum frequency f_max, you must sample at f_s > 2 * f_max. A plasma micro-arc lasts 50ms, which means its frequency content is approximately 20 Hz. Minimum sampling: 40 Hz. Interface A at 100 Hz provides comfortable margin. The 1 Hz SECS/GEM historian samples at 1/20th the required rate. The 50ms arc is not hard to detect at 1 Hz; it is mathematically invisible. No algorithm can find what was never recorded.

Interview soundbite

"We were sampling 20x too slowly. The arc was not just hard to detect from 1 Hz data. It was mathematically impossible to detect. This is not a modeling problem. It is a physics constraint on the data collection pipeline."

Aliasing

When f_s < 2 * f_max, high-frequency signals fold into the low-frequency range. Example: 100 Hz plasma oscillations sampled at 1 Hz appear as near-zero-Hz drift in the data. Misinterpreted as process shift, when the actual cause is high-frequency plasma instability. The fix is an anti-aliasing low-pass filter before digitization, cutting off above f_s/2. Not always present in legacy SECS/GEM systems.

Jitter

Timestamp variance of plus or minus 50ms is common in SECS/GEM due to software polling, network latency, and OS scheduling. "1 Hz" sampling is actually irregular, with intervals varying from 900 to 1100ms. This affects spectral analysis and period detection. The fix: always use time-based windows (rolling("60s")) not index-based windows (rolling(60)), and tolerance-based joins not equality joins.

Quantization

ADC resolution of 12 bits is common, giving 4096 discrete levels. For 0 to 1000W RF power, that is approximately 0.25W resolution. Signals smaller than the quantization floor are invisible even at correct sampling rate. Arc detection needs sufficient signal-to-noise ratio, not just correct sampling rate.

Chapter 2.5

The ISA-95 Hierarchy

Where data lives, who owns it, and how it flows. Every data science project in a fab operates within this hierarchy. Knowing which level your data comes from determines what joins are possible, what latency you have, and where your model can be deployed.

Level 4: Enterprise
    ERP, SAP, financial planning, corporate reporting
    Data: quarterly yield summaries, cost accounting, capacity plans
    Latency: days to weeks
         |
         | business objectives
         v
Level 3: Manufacturing Operations
    MES (Manufacturing Execution System): lot tracking, routing, recipes
    Data: wafer location, step history, metrology results, lot genealogy
    Latency: seconds to minutes
         |
         | lot scheduling, routing, recipes
         v
Level 2: Manufacturing Control
    FDC historian, APC server, VM system, R2R controllers
    Data: 1 Hz SECS/GEM sensor streams, alarm logs, recipe adjustments
    Latency: milliseconds to seconds
         |
         | real-time control, fault detection
         v
Level 1: Basic Control
    PLC, tool controller, safety interlocks, SECS/GEM gateway
    Data: raw sensor I/O at 100 Hz to 10 kHz via Interface A
    Latency: microseconds to milliseconds
         |
         | sensor I/O, actuator commands
         v
Level 0: Process
    Chamber, plasma, wafer, physics

Level 2 and 3: Your daily work

FDC sensor data (Level 2) joined to MES lot context (Level 3). This is the Great Fab Join: every feature engineering project starts here. The five SQL patterns in Chapter 2.1 are specifically for this join.

Level 1: High-frequency data for endpoint detection

Interface A at 100 Hz to 10 kHz. Separate read-only Ethernet port from the SECS/GEM control port. This is where you get the data to detect 50ms plasma arcs. Not universally available; ask what percentage of the fleet has Interface A before designing any real-time endpoint detection system.

Level 1 deployment: the air gap

Your ONNX model that runs on the Level 1 tool controller is physically separated from your Level 2 development environment by an air gap. No network connectivity. Deployment requires physical media transfer, change control approval, and 2 to 6 weeks of process. Design for ONNX from day one. Never assume cloud connectivity from the tool floor.

Level 3 to 4: Where you justify the project

Yield summaries, WIP at risk, cost of quality. When you present ROI to a Fab Director, you are presenting Level 4 numbers (dollars) derived from Level 2 and 3 data (sensor readings and lot outcomes). The ROI Translation Matrix from Part 0 is the bridge between these layers.

Part 2 Summary

Technical Foundation at a Glance

Chapter	Core Skills	Key Patterns
2.1 SQL	Five join patterns for temporal, drifting, asynchronous fab data	merge_asof, BETWEEN, window functions, LAG/diff, ROW_NUMBER
2.2 Python	Pandas patterns for sentinels, irregular sampling, frozen sensors, R2R control, walk-forward validation	replace(), rolling("60s"), RobustScaler, variance threshold, EWMA class, split generators
2.3 SPC	Decision tree: pick the right chart before writing code	Shewhart, CUSUM, EWMA control chart, Hotelling T-squared
2.4 Sampling	Nyquist, aliasing, jitter, quantization as physical constraints	Frequency calculations, time-based windows, tolerance joins
2.5 ISA-95	Hierarchy navigation, air gap understanding, deployment constraints	Level 0 to 4 architecture, Interface A, ONNX deployment target

<- Part 1: Eight Archetypes Part 3: ML Arsenal