Part 4

The Interview Simulation

Knowing the material is not enough. Practice the specific formats you will encounter under pressure: live SQL, coding challenge, behavioral storytelling, hostile cross-examination, and system design whiteboard.

Chapter 4.1

The Technical Screen: SQL Live-Coding

Shared editor or whiteboard

45 to 60 minutes, 1 to 3 problems

The interviewer observes your process, not just your final answer. They want to see: do you audit timezone handling before writing the join? Do you use LEFT JOIN? Do you validate join quality? Do you explain your reasoning as you type?

-Do you handle time and clock drift correctly?

-Do you validate join quality (orphan rate, cross-correlation lag)?

-Do you catch edge cases (sentinels, duplicates, PM boundaries)?

-Do you explain the physical reasoning behind each decision?

Practice Problem 1

Tolerance-Based Join

PROMPT

"We have FDC sensor data and MES lot records. Write a query to attribute each FDC row to the correct lot and step. The FDC historian uses UTC. MES uses local time with DST. Clocks drift by up to 30 seconds."

Think aloud: "First I need to handle timezone and drift. I'll convert both to UTC at ingestion. For the join, exact equality fails due to clock drift, so I'll use a tolerance window."

WITH fdc_utc AS (
  SELECT tool_id,
         timestamp AT TIME ZONE 'UTC' AS ts_utc,
         sensor_id, sensor_value
  FROM fdc_data
),
mes_utc AS (
  SELECT tool_id, lot_id, step_name,
         step_start AT TIME ZONE 'UTC' AS start_utc,
         step_end   AT TIME ZONE 'UTC' AS end_utc
  FROM mes_events
  WHERE ceid IN ('Step_Start', 'Step_End')
),
joined AS (
  SELECT f.*, m.lot_id, m.step_name,
         EXTRACT(EPOCH FROM (f.ts_utc - m.start_utc)) AS offset_seconds
  FROM fdc_utc f
  LEFT JOIN mes_utc m               -- LEFT JOIN: keep orphan FDC rows
    ON f.tool_id = m.tool_id
   AND f.ts_utc BETWEEN m.start_utc - INTERVAL '30 seconds'
                    AND m.end_utc   + INTERVAL '30 seconds'
)
SELECT tool_id, sensor_id, ts_utc, lot_id, step_name, offset_seconds
FROM joined
WHERE ABS(offset_seconds) < 30   -- Within clock drift tolerance
   OR lot_id IS NULL              -- Keep orphans for investigation
ORDER BY tool_id, ts_utc;

Follow-up: "Why LEFT JOIN not INNER JOIN?"

"I want to see orphan FDC rows: sensor data with no matching MES record. High orphan rate indicates systematic join failure: timezone bug, wrong tool mapping, or missing MES data. INNER JOIN hides this."

Follow-up: "How do you validate the join worked correctly?"

"Three checks: (1) orphan percentage should be below 5%; (2) offset_seconds distribution should center near zero; (3) cross-correlation of step-start events between systems should peak at lag zero, not 3600 seconds, which would indicate a DST bug."

Practice Problem 2

Gap Detection and Frozen Sensor

PROMPT

"Write a query to detect frozen sensors (zero variance over 60-second windows) and communication gaps (expected 1 Hz data but gap exceeds 5 seconds)."

WITH sensor_windows AS (
  SELECT
    tool_id, sensor_id, timestamp, sensor_value,
    timestamp - LAG(timestamp) OVER (
      PARTITION BY tool_id, sensor_id ORDER BY timestamp
    ) AS time_gap,
    VAR_SAMP(sensor_value) OVER (
      PARTITION BY tool_id, sensor_id ORDER BY timestamp
      ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
    ) AS rolling_variance,
    COUNT(*) OVER (
      PARTITION BY tool_id, sensor_id ORDER BY timestamp
      ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
    ) AS window_count
  FROM fdc_data
  WHERE sensor_value NOT IN (-9999.0, 9999.0, -999.0)  -- exclude sentinels first
)
SELECT
  tool_id, sensor_id, timestamp, sensor_value,
  time_gap, rolling_variance,
  CASE
    WHEN time_gap > INTERVAL '5 seconds' THEN 'COMMUNICATION_TIMEOUT'
    WHEN rolling_variance = 0 AND window_count = 60  THEN 'FROZEN_SENSOR'
    ELSE 'NORMAL'
  END AS data_quality_flag
FROM sensor_windows
WHERE data_quality_flag != 'NORMAL'
ORDER BY tool_id, sensor_id, timestamp;
-- Key points: sentinels excluded before variance, window must be full (60 rows),
-- CASE distinguishes multiple failure modes for different escalation paths

Practice Problem 3

Per-PM Statistics with Commissioning Baseline

PROMPT

"Compute mean and standard deviation for chamber pressure separately per PM period. Flag periods where the mean deviates more than 10% from the commissioning baseline."

WITH pm_periods AS (
  SELECT tool_id,
         pm_timestamp AS period_start,
         COALESCE(LEAD(pm_timestamp) OVER (PARTITION BY tool_id ORDER BY pm_timestamp),
                  '2099-12-31'::timestamp) AS period_end,
         ROW_NUMBER() OVER (PARTITION BY tool_id ORDER BY pm_timestamp) - 1
           AS pm_sequence   -- 0 = commissioning (golden baseline)
  FROM pm_log
),
fdc_with_pm AS (
  SELECT f.*, p.pm_sequence
  FROM fdc_data f
  JOIN pm_periods p ON f.tool_id = p.tool_id
    AND f.timestamp >= p.period_start
    AND f.timestamp <  p.period_end
),
commissioning AS (
  SELECT tool_id, AVG(chamber_pressure) AS baseline_mean
  FROM fdc_with_pm WHERE pm_sequence = 0
  GROUP BY tool_id
),
per_pm AS (
  SELECT f.tool_id, f.pm_sequence,
         AVG(f.chamber_pressure) AS pressure_mean,
         STDDEV(f.chamber_pressure) AS pressure_std,
         c.baseline_mean
  FROM fdc_with_pm f JOIN commissioning c USING (tool_id)
  GROUP BY f.tool_id, f.pm_sequence, c.baseline_mean
)
SELECT *, ABS(pressure_mean - baseline_mean) / baseline_mean AS deviation,
       CASE WHEN ABS(pressure_mean - baseline_mean) / baseline_mean > 0.10
            THEN 'FLAGGED' ELSE 'ACCEPTABLE' END AS status
FROM per_pm ORDER BY tool_id, pm_sequence;

Chapter 4.2

The Coding Challenge: Python Implementation

Shared IDE or take-home

60 to 90 minutes, one substantial problem

Common problem types: endpoint detection from a time series, EWMA controller, feature extraction with validation, OOD detection system. The evaluation criteria are always: causal processing (no future data), robust to noise, O(1) per-update inference, and clean typed output.

PROMPT

"Implement real-time endpoint detection for plasma etch. Input: 100 Hz OES intensity at 520nm. Output: endpoint index when SiO2 clears to Si. Constraints: causal processing (no future data), inference under 10ms, robust to noise."

Full solution: EndpointDetector class

import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class EndpointResult:
    endpoint_idx: Optional[int]
    confidence: float
    algorithm: str  # state for debugging: monitoring / rapid_drop_warning / confirmed_drop

class EndpointDetector:
    """
    Real-time endpoint detection for plasma etch.

    Physics: OES 520nm (silicon emission) rises as SiO2 clears.
    We detect the drop in the SiO2 signal when silicon is exposed.

    Causal: uses only past and current data (backward difference, causal smoothing).
    O(1) per update: buffers grow linearly but processing is constant-time.
    """
    def __init__(self,
                 window_smooth: int = 10,          # 100ms at 100Hz
                 threshold_relative: float = 0.15, # 15% drop = endpoint
                 min_step_duration: int = 500,      # ignore first 5 seconds
                 confirmation_count: int = 3):      # require 3 consecutive confirms
        self.window_smooth = window_smooth
        self.threshold_relative = threshold_relative
        self.min_step_duration = min_step_duration
        self.confirmation_count = confirmation_count
        self.buffer: list = []
        self.endpoint_confirmed = 0
        self.found_endpoint: Optional[int] = None

    def _causal_smooth(self, data: np.ndarray) -> np.ndarray:
        """Causal moving average. Pads with first value to maintain length."""
        if len(data) < self.window_smooth:
            return data
        kernel = np.ones(self.window_smooth) / self.window_smooth
        smoothed = np.convolve(data, kernel, mode='valid')
        pad = np.full(self.window_smooth - 1, data[0])
        return np.concatenate([pad, smoothed])

    def update(self, timestamp: float, oes_intensity: float) -> EndpointResult:
        """Process one new OES sample. Call at 100Hz."""
        self.buffer.append(oes_intensity)
        idx = len(self.buffer) - 1

        if idx < self.min_step_duration:
            return EndpointResult(None, 0.0, "insufficient_data")
        if self.found_endpoint is not None:
            return EndpointResult(self.found_endpoint, 1.0, "already_confirmed")

        smoothed = self._causal_smooth(np.array(self.buffer))
        current = smoothed[-1]
        derivative = smoothed[-1] - smoothed[-2]  # backward difference
        baseline = smoothed[:self.min_step_duration].max()

        relative_drop = (baseline - current) / baseline if baseline > 0 else 0.0

        if relative_drop > self.threshold_relative and derivative < 0:
            self.endpoint_confirmed += 1
            if self.endpoint_confirmed >= self.confirmation_count:
                self.found_endpoint = idx
                return EndpointResult(idx,
                                      min(relative_drop / self.threshold_relative, 1.0),
                                      "confirmed_drop")
        else:
            self.endpoint_confirmed = max(0, self.endpoint_confirmed - 1)

        if derivative < -0.05 * baseline:
            return EndpointResult(None, 0.5, "rapid_drop_warning")
        return EndpointResult(None, 0.0, "monitoring")

# Validation with synthetic data
def generate_oes(n=1000, endpoint_at=600, noise=0.02):
    sig = np.ones(n) + np.random.normal(0, noise, n)
    if endpoint_at < n:
        drop = np.exp(-np.linspace(0, 5, n - endpoint_at))
        sig[endpoint_at:] = 0.3 + 0.7 * drop[0] * (1 - drop)
    return sig

if __name__ == "__main__":
    oes = generate_oes(endpoint_at=600)
    detector = EndpointDetector()
    for i, intensity in enumerate(oes):
        result = detector.update(i / 100.0, intensity)
        if result.endpoint_idx is not None:
            error_ms = (result.endpoint_idx - 600) * 10  # 10ms per sample at 100Hz
            print(f"Detected at sample {result.endpoint_idx} | error: {error_ms:.0f}ms")
            break

✓Causal processing

backward difference derivative, forward pad in smoothing

✓Robust to noise

smoothing window + confirmation_count before signaling

✓O(1) per update

buffer grows but _causal_smooth is constant for each call

✓Typed output

dataclass with endpoint_idx, confidence, and state label

Chapter 4.3

The Behavioral Round: Story Structure

STAR format expected

30 to 45 minutes, 3 to 5 questions

Your story bank is the Eight Archetypes from Part 1, adapted to STAR (Situation, Task, Action, Result). The stories work even if you have no direct fab experience: "I have studied this failure mode in depth. If I were building that system today, I would..."

Story Template: The Averaged Arc

Question: "Walk me through an anomaly you caught in production data."

SITUATION

I was investigating a yield excursion at a logic foundry. Electrical test flagged 25 wafers with 94% gate oxide leakage failure. The FDC historian showed all sensors normal: step-means within limits, no alarms.

TASK

Find the root cause and prevent recurrence. The catch: the failure happened 8 hours earlier and standard monitoring missed it completely.

ACTION

I suspected a sampling rate problem. The failure mode was plasma micro-arcs lasting 50 milliseconds. I pulled 100 Hz Interface A data and found 19 of 25 wafers had RF impedance spikes: 40% increase in 10ms, back to baseline in 50ms. The standard 1 Hz historian was sampling 20x too slowly. The arc was mathematically invisible at that rate. Root cause: match network capacitor worn to 97% of tuning range. I implemented 99th percentile and max derivative features for the high-frequency stream, and added CUSUM monitoring on capacitor position against the commissioning baseline.

RESULT

Caught the next degradation at 5% deviation, not 94% yield loss. Prevention instead of firefighting. Key insight: domain knowledge drives feature engineering, not algorithm selection. No ML model would discover 99th percentile features from 1 Hz step-mean data.

Story Template: The Ghost Excursion

Question: "How do you validate a model when ground truth arrives 60 days late?"

SITUATION

Our Virtual Metrology model for CVD film thickness had run successfully for six months, replacing physical metrology on 80% of wafers. One night, the silane mass flow controller failed. Flow dropped to 0.0 sccm. No reactive gas, no film deposition.

TASK

The model needed to handle sensor failures gracefully. Standard regression has no abstention mechanism.

ACTION

I analyzed the failure: the model predicted 44.6 to 45.4 nm for all 812 affected wafers. Perfectly normal predictions, confidently wrong. The input was 47 standard deviations from training mean, but the model extrapolated silently. I implemented an autoencoder-based Reliability Index: trained on commissioning data, it measures reconstruction error. High error flags out-of-distribution inputs. MFC fault codes now trigger automatic model suspension and routing to physical metrology.

RESULT

Zero ghost excursions since implementation. The model abstains rather than guessing. The business accepted a 15% abstention rate because one avoided excursion saves $40M or more in accumulated processing. Abstention threshold is calibrated to cost, not accuracy.

"Why this industry?" , The only answer that works

WEAK

"I think semiconductors are interesting and I want to work on hard problems."

STRONG

"I have worked on recommendation systems where a false negative costs $0.005 in missed ad revenue. In semiconductor manufacturing, a false negative in endpoint detection costs $2.5 million. That constraint, where physical reality and financial reality are immediate and unforgiving, forces a discipline I find compelling. You cannot A/B test your way out. You cannot collect more data cheaply. You have to be right the first time, with incomplete information, and prove it with physics. That is why I am here."

Chapter 4.4

The Hostile Review: Defending Your Work

Skeptical process engineer or hiring manager

20 to 30 minutes

This format tests communication, confidence, and flexibility. The correct posture is not defensive and not capitulating. Acknowledge the concern, show you have thought it through, and explain the cost reasoning. Every response should end with the financial justification.

ATTACK

"Your model abstained on 15% of wafers. That is too expensive."

YOUR RESPONSE

The 15% abstention rate is calibrated to false negative cost, not chosen arbitrarily. One missed excursion processes 812 wafers with no film. At $50K per wafer plus accumulated processing, that is $40M or more in loss. Fifteen percent abstention at $500 per physical metrology run is $60K per 800 wafers. The break-even is one prevented excursion per 667 lots. We have prevented three in two years. The abstention rate is conservative, not excessive.

ATTACK

"Why not just use a neural network? They are more accurate."

YOUR RESPONSE

Neural networks fail three constraints here. First, interpretability: process engineers with 20 years of tool experience will not trust predictions they cannot connect to physical mechanisms. SHAP on gradient-boosted trees gives them that. Second, latency: even small quantized networks are 10x slower than XGBoost on 2012 dual-core hardware; we are budgeted to 10ms, not 100ms. Third, edge deployment: ONNX export of GBTs is deterministic and bit-exact; neural network quantization introduces variance that can fail the deployment checklist. We use CNNs for image-based metrology where they genuinely win. For tabular sensor data, GBTs are the production standard.

ATTACK

"Your CUSUM caught the drift but we got false alarms. Should we widen the limits?"

YOUR RESPONSE

Widening limits trades false alarms for missed detections, which is the Boiling Frog problem in another form. Before widening, I would check the baseline. If we are comparing to a rolling window, the baseline adapts to drift and we start missing gradual degradation. The fix is not wider limits; it is anchoring to the fixed commissioning baseline. We can also add a dead band: small deviations accumulate toward the threshold, large deviations trigger immediately. But the first question is always: are we comparing to a fixed reference or to yesterday?

ATTACK

"This seems overcomplicated. Why not just use simple threshold alarms?"

YOUR RESPONSE

Simple thresholds work well for single sensors with well-understood physics, and we use them there. They miss multivariate excursions. The Invisible Reticle Killer ran for six months with every univariate chart green. The pattern was spatial and periodic, visible only in Moran's I at the reticle field scale. Similarly, Hotelling T-squared catches joint excursions where individual sensors each look normal. The rule is: use the simplest chart that catches the failure mode. Start with Shewhart. Add CUSUM when you need to catch slow drift. Add T-squared when multiple correlated sensors matter. The complexity matches the failure mode, not the analyst's preferences.

Chapter 4.5

System Design: Whiteboard Architecture

Whiteboard or virtual draw tool

45 minutes, one open-ended prompt

Common prompts: "Design a Virtual Metrology system for a new etch chamber," "Design fleet-wide monitoring across 100 tools," "Design a system to detect reticle defects from inspection data." Answer every one with the same five-layer framework.

Layer 1

Data Ingestion

What to cover: Sources, formats, validation

Key decisions: SECS/GEM vs. Interface A, sentinel handling, timestamp alignment with merge_asof, orphan rate monitoring

Layer 2

Feature Engineering

What to cover: What features, how computed, storage

Key decisions: Domain-driven, physically interpretable, per-PM statistics, RobustScaler on commissioning data

Layer 3

Model Core

What to cover: Algorithm, training, validation

Key decisions: Latency constraint first, interpretability requirement second, OOD detection mandatory, cost-calibrated abstention threshold

Layer 4

Deployment

What to cover: Edge vs. cloud, format, monitoring

Key decisions: ONNX export, p99 latency on mirror hardware, air-gap transfer, 2-week change control, shadow mode first

Layer 5

Operations

What to cover: Alerting, escalation, continuous validation

Key decisions: Abstention routing to metrology, PM-aware retraining gates, per-PM model validation, no automatic retraining on hardware degradation

Worked example: Virtual Metrology for a new etch chamber

Layer 1: Data Ingestion

"Two streams: SECS/GEM at 1 Hz for recipe context and alarms, Interface A at 100 Hz for endpoint-relevant sensors. Ingestion handles UTC conversion, sentinel replacement (-9999.0 to NaN), and timestamp alignment with 5-second tolerance using merge_asof. Validation: cross-correlation of step-start events between systems, orphan rate below 5%."

Layer 2: Feature Engineering

"Domain-driven features with one-sentence physical interpretations: RF power mean (energy delivery to plasma), pressure slope (gas consumption dynamics), OES 520nm max derivative (endpoint sensitivity), match network capacitor position (wear indicator). Computed per step, stored with lineage. Per-PM scaling: RobustScaler fit on commissioning data only."

Layer 3: Model Core

"Gradient-boosted trees for latency and interpretability. SHAP for per-prediction explanation to process engineers. Mandatory: autoencoder Reliability Index for OOD detection trained on commissioning data. Abstention threshold calibrated to $2.5M false negative cost vs. $500 false positive cost: ratio of 5000:1 justifies aggressive abstention."

Layer 4: Deployment

"ONNX export to edge controller. Target: p99 inference under 10ms on dual-core 2012 hardware. Zero external dependencies: static binary, no Docker, no network calls. Air-gap transfer via physical media. Change control: 2-week approval. Shadow mode first: log predictions for 100 wafers without acting. Active mode only after validation lot confirms accuracy within 2% of development evaluation."

Layer 5: Operations

"High Reliability Index or any sensor fault triggers automatic routing to physical metrology and pages the equipment engineer. Per-PM model validation: accuracy compared to commissioning or investigation triggered. No automatic retraining when hardware degradation causes a distribution shift; that is an engineering problem, not a data problem. Retraining only for approved recipe changes with a new commissioning baseline."

Part 4 Summary

Interview Simulation at a Glance

Format	Preparation	Key Skills
SQL Live-Coding	3 practice problems cold, validate with orphan rate and cross-correlation	Tolerance joins, PM-aware windows, gap detection, LEFT JOIN reflex
Python Challenge	EndpointDetector from memory, also EWMA controller and frozen sensor detector	Causal processing, O(1) per-update, typed output, physics comments
Behavioral Round	8 STAR stories from archetypes, "Why this industry?" answer memorized	Specific numbers, structured narrative, business impact in the Result
Hostile Review	4 attack-response pairs practiced aloud at least twice	Cost reasoning, principled flexibility, never capitulate without thinking
System Design	5-layer framework applied to VM, then adapted to fleet monitoring	Architecture justified by physical constraints at every layer

<- Part 3: ML Arsenal Part 5: Company Deep Dives