Knowing the material is not enough. Practice the specific formats you will encounter under pressure: live SQL, coding challenge, behavioral storytelling, hostile cross-examination, and system design whiteboard.
Chapter 4.1
The Technical Screen: SQL Live-Coding
Shared editor or whiteboard
45 to 60 minutes, 1 to 3 problems
The interviewer observes your process, not just your final answer. They want to see: do you audit timezone handling before writing the join? Do you use LEFT JOIN? Do you validate join quality? Do you explain your reasoning as you type?
-Do you handle time and clock drift correctly?
-Do you validate join quality (orphan rate, cross-correlation lag)?
-Do you catch edge cases (sentinels, duplicates, PM boundaries)?
-Do you explain the physical reasoning behind each decision?
Practice Problem 1
Tolerance-Based Join
PROMPT
"We have FDC sensor data and MES lot records. Write a query to attribute each FDC row to the correct lot and step. The FDC historian uses UTC. MES uses local time with DST. Clocks drift by up to 30 seconds."
Think aloud: "First I need to handle timezone and drift. I'll convert both to UTC at ingestion. For the join, exact equality fails due to clock drift, so I'll use a tolerance window."
WITH fdc_utc AS (
SELECT tool_id,
timestamp AT TIME ZONE 'UTC' AS ts_utc,
sensor_id, sensor_value
FROM fdc_data
),
mes_utc AS (
SELECT tool_id, lot_id, step_name,
step_start AT TIME ZONE 'UTC' AS start_utc,
step_end AT TIME ZONE 'UTC' AS end_utc
FROM mes_events
WHERE ceid IN ('Step_Start', 'Step_End')
),
joined AS (
SELECT f.*, m.lot_id, m.step_name,
EXTRACT(EPOCH FROM (f.ts_utc - m.start_utc)) AS offset_seconds
FROM fdc_utc f
LEFT JOIN mes_utc m -- LEFT JOIN: keep orphan FDC rows
ON f.tool_id = m.tool_id
AND f.ts_utc BETWEEN m.start_utc - INTERVAL '30 seconds'
AND m.end_utc + INTERVAL '30 seconds'
)
SELECT tool_id, sensor_id, ts_utc, lot_id, step_name, offset_seconds
FROM joined
WHERE ABS(offset_seconds) < 30 -- Within clock drift tolerance
OR lot_id IS NULL -- Keep orphans for investigation
ORDER BY tool_id, ts_utc;
Follow-up: "Why LEFT JOIN not INNER JOIN?"
"I want to see orphan FDC rows: sensor data with no matching MES record. High orphan rate indicates systematic join failure: timezone bug, wrong tool mapping, or missing MES data. INNER JOIN hides this."
Follow-up: "How do you validate the join worked correctly?"
"Three checks: (1) orphan percentage should be below 5%; (2) offset_seconds distribution should center near zero; (3) cross-correlation of step-start events between systems should peak at lag zero, not 3600 seconds, which would indicate a DST bug."
Practice Problem 2
Gap Detection and Frozen Sensor
PROMPT
"Write a query to detect frozen sensors (zero variance over 60-second windows) and communication gaps (expected 1 Hz data but gap exceeds 5 seconds)."
WITH sensor_windows AS (
SELECT
tool_id, sensor_id, timestamp, sensor_value,
timestamp - LAG(timestamp) OVER (
PARTITION BY tool_id, sensor_id ORDER BY timestamp
) AS time_gap,
VAR_SAMP(sensor_value) OVER (
PARTITION BY tool_id, sensor_id ORDER BY timestamp
ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
) AS rolling_variance,
COUNT(*) OVER (
PARTITION BY tool_id, sensor_id ORDER BY timestamp
ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
) AS window_count
FROM fdc_data
WHERE sensor_value NOT IN (-9999.0, 9999.0, -999.0) -- exclude sentinels first
)
SELECT
tool_id, sensor_id, timestamp, sensor_value,
time_gap, rolling_variance,
CASE
WHEN time_gap > INTERVAL '5 seconds' THEN 'COMMUNICATION_TIMEOUT'
WHEN rolling_variance = 0 AND window_count = 60 THEN 'FROZEN_SENSOR'
ELSE 'NORMAL'
END AS data_quality_flag
FROM sensor_windows
WHERE data_quality_flag != 'NORMAL'
ORDER BY tool_id, sensor_id, timestamp;
-- Key points: sentinels excluded before variance, window must be full (60 rows),
-- CASE distinguishes multiple failure modes for different escalation paths
Practice Problem 3
Per-PM Statistics with Commissioning Baseline
PROMPT
"Compute mean and standard deviation for chamber pressure separately per PM period. Flag periods where the mean deviates more than 10% from the commissioning baseline."
WITH pm_periods AS (
SELECT tool_id,
pm_timestamp AS period_start,
COALESCE(LEAD(pm_timestamp) OVER (PARTITION BY tool_id ORDER BY pm_timestamp),
'2099-12-31'::timestamp) AS period_end,
ROW_NUMBER() OVER (PARTITION BY tool_id ORDER BY pm_timestamp) - 1
AS pm_sequence -- 0 = commissioning (golden baseline)
FROM pm_log
),
fdc_with_pm AS (
SELECT f.*, p.pm_sequence
FROM fdc_data f
JOIN pm_periods p ON f.tool_id = p.tool_id
AND f.timestamp >= p.period_start
AND f.timestamp < p.period_end
),
commissioning AS (
SELECT tool_id, AVG(chamber_pressure) AS baseline_mean
FROM fdc_with_pm WHERE pm_sequence = 0
GROUP BY tool_id
),
per_pm AS (
SELECT f.tool_id, f.pm_sequence,
AVG(f.chamber_pressure) AS pressure_mean,
STDDEV(f.chamber_pressure) AS pressure_std,
c.baseline_mean
FROM fdc_with_pm f JOIN commissioning c USING (tool_id)
GROUP BY f.tool_id, f.pm_sequence, c.baseline_mean
)
SELECT *, ABS(pressure_mean - baseline_mean) / baseline_mean AS deviation,
CASE WHEN ABS(pressure_mean - baseline_mean) / baseline_mean > 0.10
THEN 'FLAGGED' ELSE 'ACCEPTABLE' END AS status
FROM per_pm ORDER BY tool_id, pm_sequence;
Chapter 4.2
The Coding Challenge: Python Implementation
Shared IDE or take-home
60 to 90 minutes, one substantial problem
Common problem types: endpoint detection from a time series, EWMA controller, feature extraction with validation, OOD detection system. The evaluation criteria are always: causal processing (no future data), robust to noise, O(1) per-update inference, and clean typed output.
PROMPT
"Implement real-time endpoint detection for plasma etch. Input: 100 Hz OES intensity at 520nm. Output: endpoint index when SiO2 clears to Si. Constraints: causal processing (no future data), inference under 10ms, robust to noise."
Full solution: EndpointDetector class
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class EndpointResult:
endpoint_idx: Optional[int]
confidence: float
algorithm: str # state for debugging: monitoring / rapid_drop_warning / confirmed_drop
class EndpointDetector:
"""
Real-time endpoint detection for plasma etch.
Physics: OES 520nm (silicon emission) rises as SiO2 clears.
We detect the drop in the SiO2 signal when silicon is exposed.
Causal: uses only past and current data (backward difference, causal smoothing).
O(1) per update: buffers grow linearly but processing is constant-time.
"""
def __init__(self,
window_smooth: int = 10, # 100ms at 100Hz
threshold_relative: float = 0.15, # 15% drop = endpoint
min_step_duration: int = 500, # ignore first 5 seconds
confirmation_count: int = 3): # require 3 consecutive confirms
self.window_smooth = window_smooth
self.threshold_relative = threshold_relative
self.min_step_duration = min_step_duration
self.confirmation_count = confirmation_count
self.buffer: list = []
self.endpoint_confirmed = 0
self.found_endpoint: Optional[int] = None
def _causal_smooth(self, data: np.ndarray) -> np.ndarray:
"""Causal moving average. Pads with first value to maintain length."""
if len(data) < self.window_smooth:
return data
kernel = np.ones(self.window_smooth) / self.window_smooth
smoothed = np.convolve(data, kernel, mode='valid')
pad = np.full(self.window_smooth - 1, data[0])
return np.concatenate([pad, smoothed])
def update(self, timestamp: float, oes_intensity: float) -> EndpointResult:
"""Process one new OES sample. Call at 100Hz."""
self.buffer.append(oes_intensity)
idx = len(self.buffer) - 1
if idx < self.min_step_duration:
return EndpointResult(None, 0.0, "insufficient_data")
if self.found_endpoint is not None:
return EndpointResult(self.found_endpoint, 1.0, "already_confirmed")
smoothed = self._causal_smooth(np.array(self.buffer))
current = smoothed[-1]
derivative = smoothed[-1] - smoothed[-2] # backward difference
baseline = smoothed[:self.min_step_duration].max()
relative_drop = (baseline - current) / baseline if baseline > 0 else 0.0
if relative_drop > self.threshold_relative and derivative < 0:
self.endpoint_confirmed += 1
if self.endpoint_confirmed >= self.confirmation_count:
self.found_endpoint = idx
return EndpointResult(idx,
min(relative_drop / self.threshold_relative, 1.0),
"confirmed_drop")
else:
self.endpoint_confirmed = max(0, self.endpoint_confirmed - 1)
if derivative < -0.05 * baseline:
return EndpointResult(None, 0.5, "rapid_drop_warning")
return EndpointResult(None, 0.0, "monitoring")
# Validation with synthetic data
def generate_oes(n=1000, endpoint_at=600, noise=0.02):
sig = np.ones(n) + np.random.normal(0, noise, n)
if endpoint_at < n:
drop = np.exp(-np.linspace(0, 5, n - endpoint_at))
sig[endpoint_at:] = 0.3 + 0.7 * drop[0] * (1 - drop)
return sig
if __name__ == "__main__":
oes = generate_oes(endpoint_at=600)
detector = EndpointDetector()
for i, intensity in enumerate(oes):
result = detector.update(i / 100.0, intensity)
if result.endpoint_idx is not None:
error_ms = (result.endpoint_idx - 600) * 10 # 10ms per sample at 100Hz
print(f"Detected at sample {result.endpoint_idx} | error: {error_ms:.0f}ms")
break
✓Causal processing
backward difference derivative, forward pad in smoothing
✓Robust to noise
smoothing window + confirmation_count before signaling
✓O(1) per update
buffer grows but _causal_smooth is constant for each call
✓Typed output
dataclass with endpoint_idx, confidence, and state label
Chapter 4.3
The Behavioral Round: Story Structure
STAR format expected
30 to 45 minutes, 3 to 5 questions
Your story bank is the Eight Archetypes from Part 1, adapted to STAR (Situation, Task, Action, Result). The stories work even if you have no direct fab experience: "I have studied this failure mode in depth. If I were building that system today, I would..."
Story Template: The Averaged Arc
Question: "Walk me through an anomaly you caught in production data."
SITUATION
I was investigating a yield excursion at a logic foundry. Electrical test flagged 25 wafers with 94% gate oxide leakage failure. The FDC historian showed all sensors normal: step-means within limits, no alarms.
TASK
Find the root cause and prevent recurrence. The catch: the failure happened 8 hours earlier and standard monitoring missed it completely.
ACTION
I suspected a sampling rate problem. The failure mode was plasma micro-arcs lasting 50 milliseconds. I pulled 100 Hz Interface A data and found 19 of 25 wafers had RF impedance spikes: 40% increase in 10ms, back to baseline in 50ms. The standard 1 Hz historian was sampling 20x too slowly. The arc was mathematically invisible at that rate. Root cause: match network capacitor worn to 97% of tuning range. I implemented 99th percentile and max derivative features for the high-frequency stream, and added CUSUM monitoring on capacitor position against the commissioning baseline.
RESULT
Caught the next degradation at 5% deviation, not 94% yield loss. Prevention instead of firefighting. Key insight: domain knowledge drives feature engineering, not algorithm selection. No ML model would discover 99th percentile features from 1 Hz step-mean data.
Story Template: The Ghost Excursion
Question: "How do you validate a model when ground truth arrives 60 days late?"
SITUATION
Our Virtual Metrology model for CVD film thickness had run successfully for six months, replacing physical metrology on 80% of wafers. One night, the silane mass flow controller failed. Flow dropped to 0.0 sccm. No reactive gas, no film deposition.
TASK
The model needed to handle sensor failures gracefully. Standard regression has no abstention mechanism.
ACTION
I analyzed the failure: the model predicted 44.6 to 45.4 nm for all 812 affected wafers. Perfectly normal predictions, confidently wrong. The input was 47 standard deviations from training mean, but the model extrapolated silently. I implemented an autoencoder-based Reliability Index: trained on commissioning data, it measures reconstruction error. High error flags out-of-distribution inputs. MFC fault codes now trigger automatic model suspension and routing to physical metrology.
RESULT
Zero ghost excursions since implementation. The model abstains rather than guessing. The business accepted a 15% abstention rate because one avoided excursion saves $40M or more in accumulated processing. Abstention threshold is calibrated to cost, not accuracy.
"Why this industry?" , The only answer that works
WEAK
"I think semiconductors are interesting and I want to work on hard problems."
STRONG
"I have worked on recommendation systems where a false negative costs $0.005 in missed ad revenue. In semiconductor manufacturing, a false negative in endpoint detection costs $2.5 million. That constraint, where physical reality and financial reality are immediate and unforgiving, forces a discipline I find compelling. You cannot A/B test your way out. You cannot collect more data cheaply. You have to be right the first time, with incomplete information, and prove it with physics. That is why I am here."
Chapter 4.4
The Hostile Review: Defending Your Work
Skeptical process engineer or hiring manager
20 to 30 minutes
This format tests communication, confidence, and flexibility. The correct posture is not defensive and not capitulating. Acknowledge the concern, show you have thought it through, and explain the cost reasoning. Every response should end with the financial justification.
ATTACK
"Your model abstained on 15% of wafers. That is too expensive."
YOUR RESPONSE
The 15% abstention rate is calibrated to false negative cost, not chosen arbitrarily. One missed excursion processes 812 wafers with no film. At $50K per wafer plus accumulated processing, that is $40M or more in loss. Fifteen percent abstention at $500 per physical metrology run is $60K per 800 wafers. The break-even is one prevented excursion per 667 lots. We have prevented three in two years. The abstention rate is conservative, not excessive.
ATTACK
"Why not just use a neural network? They are more accurate."
YOUR RESPONSE
Neural networks fail three constraints here. First, interpretability: process engineers with 20 years of tool experience will not trust predictions they cannot connect to physical mechanisms. SHAP on gradient-boosted trees gives them that. Second, latency: even small quantized networks are 10x slower than XGBoost on 2012 dual-core hardware; we are budgeted to 10ms, not 100ms. Third, edge deployment: ONNX export of GBTs is deterministic and bit-exact; neural network quantization introduces variance that can fail the deployment checklist. We use CNNs for image-based metrology where they genuinely win. For tabular sensor data, GBTs are the production standard.
ATTACK
"Your CUSUM caught the drift but we got false alarms. Should we widen the limits?"
YOUR RESPONSE
Widening limits trades false alarms for missed detections, which is the Boiling Frog problem in another form. Before widening, I would check the baseline. If we are comparing to a rolling window, the baseline adapts to drift and we start missing gradual degradation. The fix is not wider limits; it is anchoring to the fixed commissioning baseline. We can also add a dead band: small deviations accumulate toward the threshold, large deviations trigger immediately. But the first question is always: are we comparing to a fixed reference or to yesterday?
ATTACK
"This seems overcomplicated. Why not just use simple threshold alarms?"
YOUR RESPONSE
Simple thresholds work well for single sensors with well-understood physics, and we use them there. They miss multivariate excursions. The Invisible Reticle Killer ran for six months with every univariate chart green. The pattern was spatial and periodic, visible only in Moran's I at the reticle field scale. Similarly, Hotelling T-squared catches joint excursions where individual sensors each look normal. The rule is: use the simplest chart that catches the failure mode. Start with Shewhart. Add CUSUM when you need to catch slow drift. Add T-squared when multiple correlated sensors matter. The complexity matches the failure mode, not the analyst's preferences.
Chapter 4.5
System Design: Whiteboard Architecture
Whiteboard or virtual draw tool
45 minutes, one open-ended prompt
Common prompts: "Design a Virtual Metrology system for a new etch chamber," "Design fleet-wide monitoring across 100 tools," "Design a system to detect reticle defects from inspection data." Answer every one with the same five-layer framework.
Layer 1
Data Ingestion
What to cover: Sources, formats, validation
Key decisions: SECS/GEM vs. Interface A, sentinel handling, timestamp alignment with merge_asof, orphan rate monitoring
Layer 2
Feature Engineering
What to cover: What features, how computed, storage
Key decisions: Domain-driven, physically interpretable, per-PM statistics, RobustScaler on commissioning data
Key decisions: ONNX export, p99 latency on mirror hardware, air-gap transfer, 2-week change control, shadow mode first
Layer 5
Operations
What to cover: Alerting, escalation, continuous validation
Key decisions: Abstention routing to metrology, PM-aware retraining gates, per-PM model validation, no automatic retraining on hardware degradation
Worked example: Virtual Metrology for a new etch chamber
Layer 1: Data Ingestion
"Two streams: SECS/GEM at 1 Hz for recipe context and alarms, Interface A at 100 Hz for endpoint-relevant sensors. Ingestion handles UTC conversion, sentinel replacement (-9999.0 to NaN), and timestamp alignment with 5-second tolerance using merge_asof. Validation: cross-correlation of step-start events between systems, orphan rate below 5%."
Layer 2: Feature Engineering
"Domain-driven features with one-sentence physical interpretations: RF power mean (energy delivery to plasma), pressure slope (gas consumption dynamics), OES 520nm max derivative (endpoint sensitivity), match network capacitor position (wear indicator). Computed per step, stored with lineage. Per-PM scaling: RobustScaler fit on commissioning data only."
Layer 3: Model Core
"Gradient-boosted trees for latency and interpretability. SHAP for per-prediction explanation to process engineers. Mandatory: autoencoder Reliability Index for OOD detection trained on commissioning data. Abstention threshold calibrated to $2.5M false negative cost vs. $500 false positive cost: ratio of 5000:1 justifies aggressive abstention."
Layer 4: Deployment
"ONNX export to edge controller. Target: p99 inference under 10ms on dual-core 2012 hardware. Zero external dependencies: static binary, no Docker, no network calls. Air-gap transfer via physical media. Change control: 2-week approval. Shadow mode first: log predictions for 100 wafers without acting. Active mode only after validation lot confirms accuracy within 2% of development evaluation."
Layer 5: Operations
"High Reliability Index or any sensor fault triggers automatic routing to physical metrology and pages the equipment engineer. Per-PM model validation: accuracy compared to commissioning or investigation triggered. No automatic retraining when hardware degradation causes a distribution shift; that is an engineering problem, not a data problem. Retraining only for approved recipe changes with a new commissioning baseline."
Part 4 Summary
Interview Simulation at a Glance
Format
Preparation
Key Skills
SQL Live-Coding
3 practice problems cold, validate with orphan rate and cross-correlation
Tolerance joins, PM-aware windows, gap detection, LEFT JOIN reflex
Python Challenge
EndpointDetector from memory, also EWMA controller and frozen sensor detector