Reference
Data schemas, mathematical derivations, the contextual glossary, checkpoint solutions, and the integrated capstone excursion.
Data Schema Reference
SECS/GEM message structures
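An HSMS message begins with a 4-byte length field, followed by a 10-byte header and then the data body. Field layout, with byte offsets counted from the start of the message:
| Bytes | Field | Type | Notes |
|---|---|---|---|
| 0-3 | Message length | UInt32, big-endian | Length of header + data body in bytes; the 4 length bytes themselves are not counted |
| 4-5 | Session ID (Device ID) | UInt16, big-endian | Tool identifier in the lower 15 bits; high bit is 0 for data messages |
| 6 | Stream byte | UInt8 | High bit = W-bit (reply expected); lower 7 bits = Stream number (S) |
| 7 | Function byte | UInt8 | Function number (F) |
| 8 | PType | UInt8 | Presentation type; 0 = SECS-II encoding |
| 9 | SType | UInt8 | Session type; 0 = data message, nonzero = control message (Select, Linktest, Separate, ...) |
| 10-13 | System bytes | UInt32, big-endian | Transaction identifier; a reply carries the same value as its primary message |
The direction bit (R-bit) and the block number belong to the SECS-I block header used on serial links; they are not part of the HSMS header.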
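As a minimal sketch of how the stream and function bytes in the table are interpreted, the following splits the W-bit from the stream number and formats the usual SxFy label. The function names are illustrative and not taken from any SECS/GEM library.

```python
from typing import Tuple


def decode_stream_function(stream_byte: int, function_byte: int) -> Tuple[bool, int, int]:
    """Split the stream byte into W-bit and stream number; pass the function number through."""
    if not (0 <= stream_byte <= 0xFF and 0 <= function_byte <= 0xFF):
        raise ValueError("header bytes must be in the range 0-255")
    w_bit = bool(stream_byte & 0x80)  # high bit: sender expects a reply
    stream = stream_byte & 0x7F       # lower 7 bits: stream number S
    return w_bit, stream, function_byte


def message_name(stream_byte: int, function_byte: int) -> str:
    """Format the pair as the conventional SxFy label, with ' W' appended when the W-bit is set."""
    w_bit, stream, function = decode_stream_function(stream_byte, function_byte)
    return f"S{stream}F{function}" + (" W" if w_bit else "")
```

For example, a stream byte of 0x86 with function byte 11 decodes to S6F11 with the W-bit set.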
Key message types
S1F1 - Are You There? (host to tool, connection check)
S1F2 - I Am Here (tool response)
S6F11 - Event Report Send (Stream 6, Function 11)
    Body: DATAID (UInt32) | CEID (UInt32) | RPT list
    RPT list: [(RPTID, [SV values]), ...]
    Note: tool-initiated async notification; no reply expected unless W-bit set
S2F41 - Host Command Send (recipe download, process start)
    Body: RCMD (string) | CPLIST [(CPNAME, CPVAL), ...]
S5F1 - Alarm Report Send (equipment alarm notification)
    Body: ALCD (alarm code) | ALID (alarm ID) | ALTX (alarm text)
S9F1 - Unrecognized Device ID (error: bad device ID in message header)
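To make the S6F11 body layout concrete, here is a small sketch that flattens an already-decoded body into a dictionary keyed by report ID. It assumes SECS-II decoding has been done elsewhere and that the body arrives as nested Python lists; the function name and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def flatten_event_report(body: List[Any]) -> Dict[str, Any]:
    """Turn a decoded S6F11 body [DATAID, CEID, [[RPTID, [SV, ...]], ...]] into a flat dict."""
    data_id, ceid, reports = body
    return {
        "data_id": data_id,
        "ceid": ceid,
        # One entry per report: RPTID -> list of status variable values
        "reports": {rpt_id: list(values) for rpt_id, values in reports},
    }


# Hypothetical event: CEID 2001 with a single report carrying two status values.
event = flatten_event_report([17, 2001, [[100, [1250.0, 35.2]]]])
```

Keeping the report IDs as dictionary keys makes it straightforward to join event data against the report definitions configured on the tool.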
Interface A (EDA) XML schema fragment
<!-- Simplified EDA data collection plan response -->
<DMTGetDataCollectionPlanResponse>
  <DCP>
    <DCPName>ETCH_MAIN_100HZ</DCPName>
    <Frequency unit="Hz">100</Frequency>
    <Parameters>
      <Parameter>
        <ParameterID>RF_Forward_Power</ParameterID>
        <Units>W</Units>
        <DataType>Float32</DataType>
      </Parameter>
      <Parameter>
        <ParameterID>Chamber_Pressure</ParameterID>
        <Units>mTorr</Units>
        <DataType>Float32</DataType>
      </Parameter>
    </Parameters>
    <Trigger>
      <EventType>RecipeStepStart</EventType>
      <CEID>2001</CEID>
    </Trigger>
  </DCP>
</DMTGetDataCollectionPlanResponse>
Mathematical Derivations
Partial derivatives and Lagrange multipliers
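Lagrange multipliers handle constrained optimization: finding the maximum of f(x) subject to a constraint g(x) = 0. The method introduces a multiplier lambda and solves the system gradient(f) = lambda * gradient(g) together with g(x) = 0. In the CMP R2R context, this appears when the recipe optimizer must maximize removal rate uniformity across the wafer subject to a total material removal constraint. The Lagrangian is:
L(x, lambda) = f(x) - lambda * g(x)
Setting the gradient of L with respect to x to zero recovers gradient(f) = lambda * gradient(g), and setting the derivative with respect to lambda to zero recovers the constraint g(x) = 0.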
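A small worked sketch of the method, under stated assumptions: the uniformity objective is stood in for by f(r) = -sum((r_i - t_i)^2), where t_i is a hypothetical per-zone removal target, and the constraint is g(r) = sum(r_i) - total_removal = 0. This quadratic form is chosen because it has a closed-form solution; it is not claimed to be the objective a production R2R optimizer uses.

```python
from typing import List, Tuple


def constrained_removal(targets: List[float], total_removal: float) -> Tuple[List[float], float]:
    """Maximize f(r) = -sum((r_i - t_i)^2) subject to g(r) = sum(r_i) - total_removal = 0.

    With L = f - lam * g, stationarity gives -2 * (r_i - t_i) = lam for every zone,
    so r_i = t_i - lam / 2. Substituting into g(r) = 0 fixes lam.
    """
    n = len(targets)
    lam = 2.0 * (sum(targets) - total_removal) / n
    removal = [t - lam / 2.0 for t in targets]
    return removal, lam


# Hypothetical per-zone removal targets (nm) and total removal budget (nm).
targets = [310.0, 300.0, 290.0, 320.0]
removal, lam = constrained_removal(targets, total_removal=1200.0)
```

The targets sum to 1220 nm against a 1200 nm budget, so every zone gives up the same 5 nm: the multiplier is the common value of the gradient of f across zones, which is what gradient(f) = lambda * gradient(g) requires when gradient(g) is a vector of ones.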
Complex numbers for RF impedance
RF impedance is a complex quantity: Z = R + jX where R is resistance (real part, in Ohms) and X is reactance (imaginary part, in Ohms). The magnitude is |Z| = sqrt(R^2 + X^2) and the phase angle is theta = arctan(X/R). Plasma etch chambers have capacitive reactance (X < 0) during normal operation; inductive behavior (X > 0) indicates a plasma instability or match network failure.
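The magnitude and phase formulas above map directly onto Python's built-in complex type. The sketch below is illustrative only; the function name and the sample values of R and X are assumptions.

```python
import cmath
import math


def impedance_summary(resistance_ohm: float, reactance_ohm: float) -> dict:
    """Return magnitude, phase, and reactive character of Z = R + jX."""
    z = complex(resistance_ohm, reactance_ohm)
    # cmath.polar computes the phase with atan2, so it stays defined when R = 0
    magnitude, phase_rad = cmath.polar(z)
    if reactance_ohm < 0:
        character = "capacitive"
    elif reactance_ohm > 0:
        character = "inductive"
    else:
        character = "resistive"
    return {
        "magnitude_ohm": magnitude,
        "phase_deg": math.degrees(phase_rad),
        "character": character,
    }


# Hypothetical load: R = 3 Ohm, X = -4 Ohm (capacitive).
summary = impedance_summary(3.0, -4.0)
```

For R > 0 the phase agrees with arctan(X/R); using the two-argument form avoids a division by zero for a purely reactive load.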
The Contextual Glossary
Definitions organized by where the term lives in the ISA-95 network hierarchy. Cross-references indicate which layers use each term in analytical context.
Checkpoint Solutions
Solution: Layer 3.3 Checkpoint, FDC fault signature interpretation
PCA model with k=8 components retaining 88% variance across 40 sensors. Wafer 1,247 shows a high Q alarm with 78% of contribution from RF_Impedance_Real and RF_Impedance_Imag. T-squared = 2.3, well below the UCL of 22.4. Interpret the physical meaning.
High Q indicates the observation cannot be well-reconstructed from the retained 8 principal components. Something changed in the sensor correlation structure that the PCA model's retained components do not describe. Low T-squared means the observation is not unusual in the directions the model knows about (the principal component subspace): it is unusual in the directions the model discarded (the residual subspace). The fault is in the correlation structure, not in the magnitude of known variation.
RF_Impedance_Real and RF_Impedance_Imag are the real and imaginary parts of the complex plasma load impedance. In normal operation they are strongly correlated: as chamber chemistry changes, both components shift in a predictable ratio determined by the plasma physics. If RF_Impedance_Real shifts without a corresponding shift in RF_Impedance_Imag (or vice versa), the complex impedance has decoupled from normal behavior. This decoupling is not captured in the retained principal components (which describe the normal correlated variation) but is captured in the Q-statistic (which describes the residual from that normal structure).
The decoupling of the real and imaginary components of RF impedance, with no shift in the overall operating mode (low T-squared), is most consistent with a change in the match network's tuning behavior. Specifically: a capacitor in the match network that is beginning to fail will shift one component of the impedance independently of the other, breaking the normal correlation. Recommended action: request a match network capacitor diagnostic on the tool that processed Wafer 1,247 before the next lot starts.
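The high-Q, low-T-squared pattern in this solution can be reproduced with a toy model. The sketch below assumes two scaled sensors and a single retained component along the direction in which they normally move together; the loadings, eigenvalue, and observations are invented for illustration and are far smaller than the 40-sensor, 8-component model in the checkpoint.

```python
import math
from typing import List, Tuple


def t2_and_q(x: List[float], loadings: List[List[float]],
             eigenvalues: List[float]) -> Tuple[float, float, List[float]]:
    """Return (T-squared, Q, per-variable Q contributions) for one scaled observation.

    loadings: retained orthonormal component vectors; eigenvalues: their score variances.
    """
    scores = [sum(p_j * x_j for p_j, x_j in zip(p, x)) for p in loadings]
    t2 = sum(t * t / ev for t, ev in zip(scores, eigenvalues))
    # Reconstruct the observation from the retained components only
    reconstruction = [sum(t * p[j] for t, p in zip(scores, loadings)) for j in range(len(x))]
    residual = [x_j - xh_j for x_j, xh_j in zip(x, reconstruction)]
    contributions = [r * r for r in residual]
    return t2, sum(contributions), contributions


s = 1.0 / math.sqrt(2.0)
pc1 = [s, s]  # hypothetical loading: both impedance components move together

# Correlated shift: lies along the retained component, so Q is near zero.
correlated = t2_and_q([2.0, 2.0], [pc1], [4.0])
# Decoupled shift: orthogonal to the retained component, so T-squared is zero and Q is large.
decoupled = t2_and_q([1.5, -1.5], [pc1], [4.0])
```

The decoupled observation has a score of zero on the retained component, so all of its variation lands in the residual, split evenly between the two variables in the contribution list.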
Solution: Layer 4.8 Checkpoint, die bin revenue calculation
Three bins: H (800 dies, +$40 premium over S), S (6,700 dies, nominal), L (2,500 dies, -$15 discount vs. S). Compute revenue impact vs. baseline policy of assigning all dies to S-bin product. Identify the most costly misclassification cell.
Without any binning model, all 10,000 dies sell at S-bin price. The actual population contains 800 H-bin dies that could earn +$40 each and 2,500 L-bin dies that should be discounted by $15 each. The baseline policy treats every die as S-bin, so it captures no premium on H-bin dies and concedes no discount on L-bin dies: L-bin dies ship at S-bin price, which leaves the L-bin quality signal unused and exposes the fab to warranty claims on those dies.
A perfect binning model captures the H-bin premium: 800 x $40 = $32,000 per wafer lot in additional revenue. It also identifies L-bin dies so they are not shipped as S-bin (avoiding warranty claims). In price terms alone, the most costly misclassification cell is H-bin dies classified as S-bin (false negatives on the H-bin class): each such die costs $40 in lost premium revenue. L-bin dies shipped as S-bin look cheaper in price terms, but each risks a customer warranty return costing far more than the $15 price difference, so this cell can become the most costly once warranty exposure is counted.
The asymmetric cost structure means that optimizing plain classification accuracy is the wrong objective. The threshold for calling a die H-bin should be set conservatively (high confidence required) because incorrectly shipping an L-bin die as H-bin to a premium customer costs far more than missing a few H-bin premiums. When selecting the decision threshold, use a cost-weighted confusion matrix in which a false H-bin call carries a penalty of max(warranty_cost, $40).
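The arithmetic and the cost-weighted confusion matrix can be sketched as follows. The bin counts and price deltas come from the problem statement; the warranty cost, the cost assigned to each off-diagonal cell, and the sample confusion matrix are hypothetical placeholders that a real analysis would replace with fab-specific figures.

```python
PREMIUM_H = 40.0       # $ per die over S-bin price (problem statement)
DISCOUNT_L = 15.0      # $ per die below S-bin price (problem statement)
DIES = {"H": 800, "S": 6700, "L": 2500}
WARRANTY_COST = 500.0  # hypothetical cost of one warranty return

# Premium captured by a perfect binning model relative to the all-S baseline
h_premium_captured = DIES["H"] * PREMIUM_H

# COST[true_bin][assigned_bin]: assumed penalty per die, zero on the diagonal
COST = {
    "H": {"H": 0.0, "S": PREMIUM_H, "L": PREMIUM_H + DISCOUNT_L},
    "S": {"H": max(WARRANTY_COST, PREMIUM_H), "S": 0.0, "L": DISCOUNT_L},
    "L": {"H": max(WARRANTY_COST, PREMIUM_H), "S": max(WARRANTY_COST, DISCOUNT_L), "L": 0.0},
}


def misclassification_cost(confusion: dict) -> float:
    """Sum count * penalty over every cell of confusion[true_bin][assigned_bin]."""
    return sum(count * COST[true_bin][assigned]
               for true_bin, row in confusion.items()
               for assigned, count in row.items())


# Hypothetical classifier output; each row sums to the true bin count.
sample_confusion = {
    "H": {"H": 700, "S": 100, "L": 0},
    "S": {"H": 10, "S": 6600, "L": 90},
    "L": {"H": 0, "S": 50, "L": 2450},
}
sample_cost = misclassification_cost(sample_confusion)
```

With these placeholder penalties, the 50 L-bin dies shipped as S-bin account for more cost than the 100 missed H-bin dies, which is the asymmetry the threshold choice has to respect.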
An Integrated Yield Excursion
The following scenario is a composite of documented excursion patterns from advanced logic fabs. Details have been modified. At each phase, identify which layer of knowledge the response team was drawing on and which gaps caused the initial delay.
A 3 nm logic fab is manufacturing a high-performance AI processor in production ramp, currently at 74% yield with a target of 88% for volume production in seven months. FDC, R2R, and VM models are deployed across all critical process steps. On Tuesday morning the weekly yield report shows a drop from 74% to 41% across all lots completed in the previous 36 hours.
600 wafers affected. At $50,000 per wafer, $30M of product is at risk. Every FDC chart is green. Every R2R controller shows normal adjustment history.
The first action is not to query the sensor database. It is to look at the spatial distribution of failing dice. Pulling end-of-line electrical test data and applying the affine coordinate transformation from Layer 4.4 (converting tester die indices to wafer-level millimeter coordinates, corrected for notch orientation), then running DBSCAN, reveals a consistent ring pattern at 95 to 110 mm from wafer center across all affected lots. A ring at this radius is the spatial signature of a rotating mechanical tool - specifically CMP. The suspect list narrows from 180 tools to the CMP modules in 20 minutes.
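A minimal sketch of the coordinate step described above, assuming the affine map is a translate, scale, and rotate of tester die indices. The parameter names, die pitch, and center indices are hypothetical, and the Layer 4.4 transformation may include terms not shown here. The DBSCAN clustering step is omitted; a simple radial band test stands in for it.

```python
import math
from typing import Tuple


def die_to_wafer_mm(col: int, row: int, pitch_x_mm: float, pitch_y_mm: float,
                    center_col: float, center_row: float,
                    notch_deg: float = 0.0) -> Tuple[float, float]:
    """Map tester die indices to wafer-centered millimeter coordinates.

    Translate so the wafer center is the origin, scale by die pitch,
    then rotate by the notch orientation angle.
    """
    x = (col - center_col) * pitch_x_mm
    y = (row - center_row) * pitch_y_mm
    theta = math.radians(notch_deg)
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def in_ring(x_mm: float, y_mm: float,
            r_inner_mm: float = 95.0, r_outer_mm: float = 110.0) -> bool:
    """True when the die center falls inside the radial band of interest."""
    return r_inner_mm <= math.hypot(x_mm, y_mm) <= r_outer_mm


# Hypothetical die 10 columns right of center on a 10 mm pitch: radius 100 mm.
x_mm, y_mm = die_to_wafer_mm(20, 10, 10.0, 10.0, center_col=10, center_row=10)
```

Because the ring test depends only on radius, it is insensitive to the notch rotation; the rotation matters when comparing angular position across wafers loaded at different orientations.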
The MES lot history is queried joining all 600 affected wafers to their CMP step assignments. Chamber CMP-07 processed 78% of the failing wafers during a continuous 14-hour window on Sunday night. The FDC historian for CMP-07 is pulled. Standard SPC charts show green throughout. Removal rate mean is within 3-sigma. Within-wafer uniformity is within spec. None of the monitored parameters explain a 74% to 41% yield drop. This is the point where less experienced engineers return to the sensor database and look harder. It is the wrong move. The monitored parameters have already returned negative. The missing information is in a parameter that was not monitored.
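The tool-commonality query described above reduces to counting, per chamber, how many failing wafers passed through it. The sketch below assumes the MES extract has already been reduced to a wafer-to-chamber mapping for the CMP step; the data structure, function name, and sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def chamber_share(failing_wafer_ids: Iterable[str],
                  cmp_history: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (chamber, share of failing wafers) pairs, most common first.

    cmp_history maps wafer ID to the chamber that ran its CMP step.
    Wafers missing from the history are skipped.
    """
    counts = Counter(cmp_history[w] for w in failing_wafer_ids if w in cmp_history)
    total = sum(counts.values())
    return [(chamber, n / total) for chamber, n in counts.most_common()]


# Hypothetical extract: 7 of 10 failing wafers ran on CMP-07.
history = {f"W{i}": ("CMP-07" if i < 7 else "CMP-03") for i in range(10)}
shares = chamber_share(history.keys(), history)
```

A share well above what the chamber's normal load split would predict is the signal; the comparison against the expected split is left out of this sketch.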
The equipment engineer opens CMP-07. The wafer carrier shows a 1.4 mm offset from nominal center position on the carrier rotation axis - a bearing wear signature. A bearing wear offset changes the radial velocity distribution at the wafer surface: the pad contact velocity at radius r from wafer center is no longer axially symmetric. Regions that align with the direction of the offset experience higher contact velocity and therefore higher removal rate (Preston equation). The increased removal rate at 95 to 110 mm produces excessive copper dishing in the interconnect layer, which causes open circuits in the dense wiring at that radius. The ring location at 95 to 110 mm is consistent with the geometric calculation to two significant figures. The FDC charts were green because motor current change from a 1.4 mm offset is approximately 2%, within the normal variance of the monitored signal at the configured alarm threshold.
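The Preston equation referenced above is usually written as removal rate = K_p * P * V, where K_p is the Preston coefficient, P the applied pressure, and V the relative pad-to-wafer velocity. The sketch below only illustrates the proportionality; the coefficient, pressure, and velocity values are hypothetical and do not reproduce the geometric calculation for the 1.4 mm offset.

```python
def preston_removal_rate(k_p: float, pressure: float, velocity: float) -> float:
    """Preston equation: removal rate is proportional to pressure times relative velocity."""
    return k_p * pressure * velocity


# Hypothetical operating point and a region seeing 5% higher contact velocity.
nominal = preston_removal_rate(k_p=1.0e-13, pressure=20.0e3, velocity=1.0)
elevated = preston_removal_rate(k_p=1.0e-13, pressure=20.0e3, velocity=1.05)
rate_ratio = elevated / nominal
```

At fixed pressure, a given fractional increase in local contact velocity produces the same fractional increase in removal rate, which is why a velocity asymmetry translates directly into a removal asymmetry.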
Of the 600 affected wafers, the spatial yield model from Phase 1 estimates which dies are likely functional. The ring at 95 to 110 mm covers approximately 22% of the die area. If yield degradation is spatially contained to that ring, inner dies (radius less than 90 mm) and outer dies (radius greater than 115 mm) may still be functional. The lot genealogy tree (Layer 2.4) is used to check automotive customer allocations. Six lots totaling 150 wafers are AEC-Q100 automotive-allocated. AEC-Q100 does not permit known-compromised process history even if final electrical test passes - the lot history itself fails traceability requirements. Decision: continue processing all 600 wafers to final test. The 150 automotive wafers represent $7.5M in unrecoverable scrap. The 450 non-automotive wafers recover approximately $16.8M of the original $22.5M at risk.
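The dollar figures and the ring coverage can be checked with a few lines of arithmetic. The wafer counts and the $50,000 wafer value come from the scenario; the 300 mm wafer diameter is an assumption not stated in the text, and the area calculation ignores edge exclusion and die granularity.

```python
WAFER_VALUE_USD = 50_000


def annulus_fraction(r_inner_mm: float, r_outer_mm: float,
                     wafer_radius_mm: float = 150.0) -> float:
    """Fraction of wafer area between two radii (pi cancels in the ratio)."""
    return (r_outer_mm ** 2 - r_inner_mm ** 2) / wafer_radius_mm ** 2


automotive_scrap_usd = 150 * WAFER_VALUE_USD       # automotive-allocated wafers
non_auto_at_risk_usd = 450 * WAFER_VALUE_USD       # remaining wafers
recovered_fraction = 16_800_000 / non_auto_at_risk_usd

ring_fraction = annulus_fraction(95.0, 110.0)      # the ring itself, about 0.137
banded_fraction = annulus_fraction(90.0, 115.0)    # ring plus the 90/115 mm cutoffs, about 0.228
```

Under the 300 mm assumption, the band between the 90 mm and 115 mm cutoffs comes to roughly 23% of wafer area, close to the quoted coverage figure, while the 95 to 110 mm ring alone is nearer 14%.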
The corrective action configures daily CUSUM monitoring on carrier motor current slope for all CMP tools. The CUSUM is anchored to a rolling 30-day baseline updated only after bearing inspection and qualification events - not after every maintenance event - to avoid the baseline contamination problem from Layer 3.3. The VM model for CMP removal rate uniformity is retrained to include carrier motor current as a feature, so future drift of this parameter appears in the model's uncertainty output before the physical effect reaches the wafer.
The 14-hour window during which roughly 470 wafers processed through the failing chamber occurred on Sunday night. The on-call engineer who received the FDC alert at 2 AM had reviewed the standard SPC charts, found nothing, and cleared the alert as a false alarm. The engineer was correct that the SPC charts showed nothing unusual. The error was not diagnostic; it was architectural: the monitoring system had no visibility into the physical mechanism that was failing.
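The CUSUM monitoring described in the corrective action can be sketched as a one-sided tabular CUSUM on standardized daily values. The reference value k, the decision interval h, the baseline statistics, and the sample series are all hypothetical; the production configuration would derive them from the qualified baseline.

```python
from typing import Optional, Sequence


def cusum_upper(values: Sequence[float], baseline_mean: float, baseline_sigma: float,
                k: float = 0.5, h: float = 5.0) -> Optional[int]:
    """Return the index of the first upward CUSUM alarm, or None if no alarm occurs.

    Each value is standardized against the baseline; the statistic accumulates
    deviations above k sigma and resets to zero when it would go negative.
    """
    s = 0.0
    for i, v in enumerate(values):
        z = (v - baseline_mean) / baseline_sigma
        s = max(0.0, s + z - k)
        if s > h:
            return i
    return None


# Hypothetical daily motor-current slope, already in baseline sigma units.
drifting = [0.1, -0.2, 0.3, 1.0, 1.2, 1.5, 1.8, 2.5]
stable = [0.2, -0.1, 0.4, -0.3, 0.1]
alarm_index = cusum_upper(drifting, baseline_mean=0.0, baseline_sigma=1.0)
```

In the drifting series no single point exceeds 3 sigma, so a Shewhart-style chart would stay green, yet the accumulated statistic crosses the decision interval on the last point. That is the property that makes CUSUM suited to slow bearing wear.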
You have completed all seven layers. The algorithms, schemas, and failure modes in this manual were chosen because they appear repeatedly in production fab data science roles. The cases are composites of real incidents.
The gap between knowing this material and being effective in a fab is approximately one production excursion. The excursion will teach you things this manual cannot: the pace of a live investigation, the politics of presenting a root cause to a Fab Director, and the specific way your fab's data infrastructure diverges from the canonical patterns described here.
The manual closes the knowledge gap so you can survive the first excursion long enough to learn from it.