Layer 2

Compliance and Safety

ISO 26262, AEC-Q100, SEMI S2/S8, and the data integrity requirements of regulated manufacturing.

Layer 2.1

ISO 26262 and Functional Safety

ISO 26262 is the international standard for functional safety in road vehicles. It governs how electrical and electronic systems must be designed, validated, and deployed to ensure that a malfunction does not cause unreasonable risk to the driver, passengers, or other road users. Its relevance to semiconductor manufacturing data science is indirect but consequential: the chips you are building ML models to improve may implement safety-critical functions in autonomous vehicles, advanced driver assistance systems, or powertrain control units.

A data scientist building yield models for a fab that supplies automotive customers does not need to be an ISO 26262 engineer. They need to understand what the standard requires well enough to know which of their modeling decisions have compliance implications and which do not.

ASIL Ratings and What They Demand from ML

ISO 26262 classifies automotive safety functions by Automotive Safety Integrity Level (ASIL), from A (lowest risk) to D (highest risk). The classification depends on three factors: severity (how bad is the worst-case harm if this function fails), exposure (how often is the vehicle in a situation where this function's failure could cause harm), and controllability (can the driver recover if this function fails).
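The three-factor classification can be sketched as a small lookup. This is a minimal illustration, not a substitute for the standard: it assumes the usual class indices S1 to S3, E1 to E4, and C1 to C3, and uses the commonly cited shortcut that the sum of the three indices reproduces the standard's determination table (a sum of 10 gives ASIL D, 9 gives C, 8 gives B, 7 gives A, and anything lower is QM, meaning quality management only). The function name `asil` is hypothetical; check any result against the table in ISO 26262 itself.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S (1-3), E (1-4), C (1-3) class indices to an ASIL letter or 'QM'.

    Assumption: the sum-of-indices shortcut stands in for the standard's table.
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S1-S3, E1-E4, C1-C3")
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


print(asil(3, 4, 3))  # worst severity, highest exposure, hardest to control -> D
print(asil(1, 1, 1))  # -> QM
```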

For a chip that implements an ASIL D safety function, ISO 26262 requires that the probability of a hardware failure causing a safety violation be below 10⁻⁸ failures per hour of operation. The standard does not specify how the semiconductor is manufactured, but it does require that the chip manufacturer provide evidence of the manufacturing quality and process capability used to produce the device.

The soft interlock vs. hard interlock distinction

In a fab context, ML models that influence process control are soft interlocks: they run in software, can be overridden by an operator, and sit on the process control PLC network. Physical relays, emergency stop circuits, and the facility's life safety PLC are hard interlocks: they are hardwired, trip independently of software state, and cannot be disabled by any software command.

The practical implication: an ML model can implement a soft interlock that triggers a process hold when it detects an anomaly. It cannot implement a hard interlock that protects people from toxic gas exposure or physical injury. The boundary between these two categories is the physical relay and the life safety PLC, not the sophistication of the ML model.

HIRA: Hazard Identification and Risk Assessment

Before deploying any ML model that influences process control decisions, a Hazard Identification and Risk Assessment (HIRA) should identify every way the model could produce a harmful outcome, estimate the severity and likelihood of each harm, and determine whether the existing control architecture provides adequate protection.

A HIRA for a virtual metrology (VM) model deployed in a run-to-run (R2R) control loop would examine failure modes including:

Overconfident out-of-distribution (OOD) prediction: the model predicts a normal thickness on a wafer that has no film. Harm: work in progress (WIP) continues through further processing steps, accumulating value on a defective substrate.
Systematic bias from distribution shift: the model underpredicts thickness for a new slurry lot. Harm: R2R controller adjusts process in the wrong direction for the entire lot before the bias is detected.
Delayed ground truth: the model is trained on data from a period before a chemistry change and does not reflect the current process. Harm: predictions are systematically wrong for weeks before electrical test reveals the problem.
Feature set includes safety-classified sensor: the model learns to recommend process conditions correlated with lower exhaust flow. Harm: regulatory exposure, potential safety incident.

The Dead Man's Switch: graceful degradation hierarchy

Any ML system deployed in a process control context must have a defined degradation path for every failure mode. The hierarchy is:

1. Full ML control

Model operating within validated input range, reliability index (RI) above threshold, recent ground truth available.

2. Constrained ML control

Model operating but one or more conditions degraded. Recipe adjustments capped at ±X% from the Process of Record (POR).

3. Fall back to POR

Model confidence too low or input OOD. Tool runs the fixed Process of Record without ML adjustment.

4. Hold and alert

Model or upstream data pipeline failure detected. Lot placed on hold pending engineering review.
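The four levels above can be sketched as a mode selector that always picks the most conservative level any condition demands. This is a minimal sketch under stated assumptions: the `ModelHealth` fields, the thresholds `ri_full` and `ri_min`, and the helper `clamp_adjustment` are hypothetical names and values, not part of any standard or of a specific APC product.

```python
from dataclasses import dataclass
from enum import IntEnum


class ControlMode(IntEnum):
    FULL_ML = 1
    CONSTRAINED_ML = 2
    FALLBACK_POR = 3
    HOLD_AND_ALERT = 4


@dataclass
class ModelHealth:
    pipeline_ok: bool               # model and upstream data pipeline responding
    input_in_validated_range: bool  # False means the input is OOD
    reliability_index: float        # RI reported alongside the prediction
    ground_truth_recent: bool       # metrology labels arrived recently enough


def select_mode(h: ModelHealth, ri_full: float = 0.85, ri_min: float = 0.60) -> ControlMode:
    """Check the most severe condition first so the safest mode wins."""
    if not h.pipeline_ok:
        return ControlMode.HOLD_AND_ALERT
    if not h.input_in_validated_range or h.reliability_index < ri_min:
        return ControlMode.FALLBACK_POR
    if h.reliability_index < ri_full or not h.ground_truth_recent:
        return ControlMode.CONSTRAINED_ML
    return ControlMode.FULL_ML


def clamp_adjustment(proposed: float, por: float, cap_fraction: float) -> float:
    """Constrained mode: keep a recipe setpoint within +/- cap_fraction of POR."""
    lo, hi = sorted((por * (1 - cap_fraction), por * (1 + cap_fraction)))
    return min(max(proposed, lo), hi)
```

Ordering the checks from most to least severe means a pipeline failure can never be masked by a healthy-looking reliability index.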

Layer 2.2

AEC-Q100 and Zero-Defect Manufacturing

AEC-Q100 is the qualification standard published by the Automotive Electronics Council for integrated circuits used in automotive applications. Where ISO 26262 governs the system-level safety requirements that a chip must satisfy, AEC-Q100 governs the testing and qualification process the chip manufacturer must execute to demonstrate that the device is reliable enough for automotive use.

Zero-defect manufacturing is not a statistical claim that every part produced is defect-free. It is a commitment that the rate of defective parts reaching the customer is below a target so low that, for practical purposes, it is treated as zero. For automotive-grade parts, this target is typically below 1 DPPM (defective parts per million shipped).

Statistical Guard-Banding in Automotive Context

Guard-banding in an AEC-Q100 context is not a design choice made by the data scientist. It is a contractual obligation derived from the customer's ASIL requirements, the product's specified operating lifetime (typically 15 years at 125°C for automotive), and the known aging mechanisms for the specific device technology.

The guard-band calculation

For a parametric specification with a nominal limit L, the guard-banded test limit L_GB tightens the limit by three delta terms: the measurement uncertainty margin, the temperature drift margin, and the aging drift margin. Each term is computed from device characterization data.
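As a rough illustration of the calculation described above, the sketch below tightens a two-sided specification by the three margins. It assumes the margins add linearly (a worst-case stack); a given customer agreement may combine them differently. The function and parameter names (`guard_banded_limits`, `d_meas`, `d_temp`, `d_aging`) are hypothetical.

```python
def guard_banded_limits(lower: float, upper: float,
                        d_meas: float, d_temp: float, d_aging: float):
    """Tighten spec limits [lower, upper] by the sum of the three margins.

    Assumption: margins are non-negative and stack linearly.
    """
    if min(d_meas, d_temp, d_aging) < 0:
        raise ValueError("margins must be non-negative")
    margin = d_meas + d_temp + d_aging
    lo_gb, hi_gb = lower + margin, upper - margin
    if lo_gb >= hi_gb:
        raise ValueError("guard band consumes the entire specification window")
    return lo_gb, hi_gb


# Example: a 1.0 V to 2.0 V spec window with 0.02, 0.05, and 0.03 V margins
# is tightened by 0.10 V on each side.
lo_gb, hi_gb = guard_banded_limits(1.0, 2.0, d_meas=0.02, d_temp=0.05, d_aging=0.03)
print(round(lo_gb, 3), round(hi_gb, 3))
```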

Models used to predict parametric test outcomes must be validated against the guard-banded test limit, not the specification limit. A model that correctly predicts whether a die passes the specification limit but incorrectly predicts whether it passes the guard-banded limit is not usable for automotive disposition decisions.

Part Average Testing: The Statistics

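PAT under AEC-Q100, whose procedure is detailed in the companion guideline AEC-Q001 (Guidelines for Part Average Testing), uses two types of limits depending on the statistical properties of the parameter being screened.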

Static PAT limits

Static limits are set once during device characterization and remain fixed for the product lifetime. They are appropriate when the parameter distribution is stable across lots and time, and when there is a well-established historical correlation between statistical extremes of the parameter and field reliability failures. The static limit is set at k times the historical standard deviation from the historical mean, where k is chosen to achieve a target outlier removal rate. AEC-Q100 Revision H recommends k = 6 for most parametric tests at Grade 1 (automotive, -40°C to 125°C).
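A minimal sketch of the static limit calculation follows. It uses the plain sample mean and standard deviation from Python's `statistics` module; production PAT flows often substitute robust estimators, which is not shown here. The function names are hypothetical.

```python
import statistics


def static_pat_limits(historical, k: float = 6.0):
    """Return (lower, upper) limits at k historical sigmas from the historical mean."""
    mu = statistics.fmean(historical)
    sigma = statistics.stdev(historical)
    return mu - k * sigma, mu + k * sigma


def pat_outliers(values, limits):
    """Return the indices of the values that fall outside the limits."""
    lo, hi = limits
    return [i for i, v in enumerate(values) if v < lo or v > hi]


# Limits come from characterization data once; every later lot is screened
# against the same fixed pair.
limits = static_pat_limits([10.0, 10.1, 9.9, 10.0, 10.2, 9.8])
print(pat_outliers([10.0, 11.0, 9.0], limits))
```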

Dynamic PAT limits

Dynamic limits are computed from the lot-specific distribution and are appropriate when lot-to-lot process variation is significant. A lot that ran during a period of elevated process instability will have wider natural spread, and a die at the 98th percentile of that lot's distribution is not necessarily more suspicious than a die at the 85th percentile of a tighter lot.

Dynamic limit calculation requires a minimum of 30 samples for the mean and standard deviation estimates to be statistically reliable. AEC-Q100 specifies that for lot sizes below 30 the static limits apply. The dynamic limit also requires a minimum lot-level standard deviation floor: if the computed sigma is below the floor, the floor value is used to prevent the limits from collapsing to the mean on an unusually tight lot.
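The two safeguards described above (the 30-sample minimum and the sigma floor) can be sketched as follows. The function name, the `sigma_floor` parameter, and the default `k` are assumptions for illustration; the floor value itself would come from device characterization.

```python
import statistics

MIN_LOT_SAMPLES = 30  # below this, lot statistics are not trusted


def dynamic_pat_limits(lot_values, static_limits, sigma_floor: float, k: float = 6.0):
    """Lot-specific limits, falling back to the static limits for small lots."""
    if len(lot_values) < MIN_LOT_SAMPLES:
        return static_limits
    mu = statistics.fmean(lot_values)
    # The floor keeps the limits from collapsing onto the mean on a very tight lot.
    sigma = max(statistics.stdev(lot_values), sigma_floor)
    return mu - k * sigma, mu + k * sigma
```

For example, a lot of 30 identical readings has a computed sigma of zero, so the limits are set entirely by the floor rather than rejecting every die that differs from the mean by any amount.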

Multivariate PAT limits under AEC-Q100

AEC-Q100 Revision H introduced multivariate PAT as a recommended practice for devices with strong inter-parameter correlations. The Mahalanobis distance approach is the standard implementation. The key AEC-Q100 requirement is that the reference population used to estimate the covariance matrix must be derived from confirmed-good production data: it cannot include any lots that subsequently failed reliability screening.
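To make the Mahalanobis approach concrete, here is a minimal two-parameter sketch in pure Python. It assumes the reference lists contain only confirmed-good production data, as required above; the function names are hypothetical, and a real implementation would handle more than two parameters with a linear algebra library.

```python
import math


def fit_reference(xs, ys):
    """Estimate the means and the 2x2 sample covariance of a reference population."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return mx, my, sxx, syy, sxy


def mahalanobis_2d(x, y, ref):
    """Distance of (x, y) from the reference mean, scaled by the covariance."""
    mx, my, sxx, syy, sxy = ref
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x - mx, y - my
    # Quadratic form using the closed-form inverse of a 2x2 symmetric matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


ref = fit_reference([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
# Both points sit equally far from the mean on each axis, but the second one
# breaks the positive correlation and so scores a much larger distance.
print(mahalanobis_2d(4, 4, ref), mahalanobis_2d(4, 2, ref))
```

This is why multivariate screening catches dies that pass every univariate limit: each parameter is individually unremarkable, but the combination is one the good population never produces.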

When a lot is rejected under multivariate PAT, the standard requires documentation of which parameters contributed most to the rejection via contribution plots, a physical failure analysis on a sample of rejected dies to confirm the test is catching genuine reliability risks, and a review of whether the rejection is from a new failure mode not previously characterized.

Outlier Identification Algorithms

Tukey fences

Tukey fences identify outliers based on the interquartile range (IQR = Q3 - Q1) without assuming a normal distribution: a value is flagged if it falls below Q1 - k × IQR or above Q3 + k × IQR. With k = 1.5 this identifies mild outliers; k = 3.0 identifies extreme outliers. The IQR-based approach is robust to the heavy tails and skewed distributions that are common in parametric test data near physical limits. Subthreshold leakage, for example, follows a log-normal distribution and has a long right tail.
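A short sketch of the fence calculation, using `statistics.quantiles` from the Python standard library. Quartile conventions differ between tools; this example picks the `"inclusive"` method as an assumption, so results may differ slightly from other software on small samples.

```python
import statistics


def tukey_fences(values, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k: float = 1.5):
    """Return the values that fall outside the fences."""
    lo, hi = tukey_fences(values, k)
    return [v for v in values if v < lo or v > hi]


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(tukey_outliers(readings))
```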

Lot genealogy screening

AEC-Q100 requires traceability of every shipped device back to the wafer lot, the process equipment used for each critical step, and the operator or automated system that released each lot at each step. This genealogy enables field failure analysis: when a device fails in the field, the manufacturer must be able to identify every other device from the same manufacturing history.

The data science implication: any model used for lot-level disposition must operate on data that is itself traceable to the lot genealogy record. A model that uses features computed from an anonymized or aggregated dataset cannot be used to make lot-level accept/reject decisions under AEC-Q100, because the traceability chain requires that the specific input data for each decision be recoverable.

Layer 2.3

SEMI S2/S8 Safety Standards

SEMI S2 is the environmental, health, and safety guideline for semiconductor manufacturing equipment. It specifies the design requirements for equipment ventilation, toxic gas management, fire suppression, and electrical safety. SEMI S8 covers ergonomic design. Together they define the baseline safety requirements that every piece of equipment installed in a SEMI-compliant fab must meet.

For data scientists, the most operationally important part of SEMI S2 is the framework that governs which sensors on the fab floor can be used as model inputs and which cannot. Getting this wrong exposes workers to safety risk and exposes the fab to regulatory action.

The Two-PLC Architecture

Every process tool in a SEMI S2-compliant fab has two physically separate control systems.

The process PLC executes the process recipe: it commands gas flows, sets temperatures, controls RF power, and adjusts the throttle valve to maintain chamber pressure. This is where ML models are deployed. APC systems connect to the process PLC via SECS/GEM (SEMI Equipment Communications Standard / Generic Equipment Model) and adjust recipe parameters between wafer runs.

The life safety PLC monitors toxic gas exhaust ventilation flow rates, facility fire suppression system status, seismic sensors, and emergency stop circuits. It is hardwired to trip relays that shut down gas supply lines and trigger facility alarms independently of any software state on the process PLC or any network connected to it.

These two systems communicate in only one direction: if the life safety PLC trips, it sends a hardware signal to the process PLC to halt processing. The process PLC cannot send commands to the life safety PLC under any circumstances.

The Exclusion List

The following gases require categorical exclusion of their safety monitoring sensors from any ML feature set. The exclusion applies to sensors monitoring gas concentration in the workspace or exhaust stream for safety purposes. It does not apply to process gas flow controllers for the same chemicals, which measure the amount being intentionally introduced into the chamber.

Safety sensor exclusion list

HF (hydrogen fluoride)

Cl2 (chlorine)

F2 (fluorine)

NH3 (ammonia)

AsH3 (arsine)

PH3 (phosphine)

SiH4 (silane)

WF6 (tungsten hexafluoride)

BCl3 (boron trichloride)

Toxic exhaust flow sensors for any of the above

IDLH = Immediately Dangerous to Life or Health (NIOSH). PEL = Permissible Exposure Limit (OSHA). Any sensor tied to a safety interlock for these gases is excluded regardless of predictive value.

Why the correlation is real and why it does not matter

The exhaust flow sensors for these gases correlate with process outcomes through a legitimate physical mechanism. Exhaust flow determines how quickly process byproducts are removed from the chamber. Chamber pressure and chemistry state both influence process outcomes and are also influenced by exhaust dynamics. The correlation is real.

The correlation being real does not change the exclusion requirement. A sensor that is physically connected to a safety interlock system cannot be used as a model input regardless of its predictive value. The reason is not that the correlation is spurious: it is that a model that learns to recommend conditions correlated with a specific exhaust flow state may inadvertently move the process toward conditions that stress the safety system.

The SEMI S2 Sensor Audit

The sensor audit is the formal process that establishes which parameters in the FDC historian are permissible as ML model inputs. It must be completed before any data is pulled for model development, not after. Discovering a safety-classified sensor in the feature set of a nearly complete model requires rebuilding the model from scratch.

The audit requires two participants: the data scientist building the model and the facility's equipment safety engineer. The safety engineer is the authority; the data scientist cannot self-certify this step.

Schema-level enforcement

Relying solely on human audit is an insufficient long-term control. The schema-level fix is to add a safety_class attribute to every parameter record in the FDC historian:

ALTER TABLE trace_data
  ADD COLUMN safety_class TINYINT NOT NULL DEFAULT 2;
-- 0 = process sensor, permissible for ML
-- 1 = environmental monitor, case-by-case review required
-- 2 = safety-critical, excluded from all ML features

CREATE VIEW ml_permissible_params AS
  SELECT * FROM trace_data WHERE safety_class = 0;

Any parameter that has not been through the classification process defaults to safety_class = 2 (excluded). This is the precautionary principle applied to data schema design. The cost of incorrectly excluding a process sensor is that the model must be rebuilt after the parameter is correctly classified. The cost of incorrectly including a safety sensor is a potential worker safety incident and regulatory action.
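The same default-deny rule can be mirrored on the pipeline side, so that a feature list is checked even when data arrives by a path that bypasses the database view. This is a hedged sketch: the function name and the sensor names in the example are hypothetical, and the class codes follow the 0/1/2 scheme defined above.

```python
PROCESS, ENVIRONMENTAL, SAFETY_CRITICAL = 0, 1, 2


def permissible_features(requested, safety_class_by_param):
    """Split requested features into (allowed, rejected).

    Parameters missing from the classification map are treated as
    safety-critical, matching the schema default of safety_class = 2.
    """
    allowed, rejected = [], []
    for name in requested:
        cls = safety_class_by_param.get(name, SAFETY_CRITICAL)
        (allowed if cls == PROCESS else rejected).append(name)
    return allowed, rejected


classes = {"rf_forward_power": PROCESS, "fab_humidity": ENVIRONMENTAL,
           "hf_exhaust_flow": SAFETY_CRITICAL}
allowed, rejected = permissible_features(
    ["rf_forward_power", "fab_humidity", "hf_exhaust_flow", "new_sensor"], classes)
print(allowed)
print(rejected)
```

Class 1 parameters are rejected here as well, since case-by-case review is a human decision; a reviewed parameter would be admitted by reclassifying it in the historian, not by loosening the filter.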

Layer 2.4

Traceability and Data Integrity

Traceability in semiconductor manufacturing is the ability to reconstruct the complete manufacturing history of any specific device: every process step, every tool, every recipe, every metrology measurement, and every disposition decision that the device passed through between silicon substrate and shipping box. When a device fails in the field in an automotive application, the manufacturer has hours to identify every other device from the same manufacturing history that might have the same defect.

For data scientists, traceability means that every automated decision made by a model must be auditable: the specific input data, the model version, the prediction, and the action taken must be recoverable from records stored separately from the model itself and retained for the required period.

ALCOA Principles

ALCOA is a data integrity framework originally developed for pharmaceutical manufacturing and adopted by automotive semiconductor quality systems. The five principles define what "good data" means in a regulated manufacturing context: Attributable (every data point is linked to the person or system that created it), Legible (the record can be read and interpreted by a person unfamiliar with the original system), Contemporaneous (the record was created at the time the event occurred, not reconstructed later), Original (the record is the first capture of the measurement, not a copy or transcription), and Accurate (the record reflects the true value of what was measured).

What ALCOA requires of ML pipelines

An ML model used in a lot disposition decision is part of the manufacturing record for that lot. The minimum required log entry for any model-based lot disposition decision is:

{
  "lot_id": "L2026031501",
  "wafer_ids": ["W01", "W02", ..., "W25"],
  "decision_timestamp_utc": "2026-03-15T14:32:17.441Z",
  "model_id": "vm_cmp_thickness_v3.2.1",
  "model_checksum_md5": "a3f8b2c1d4e5f6a7b8c9d0e1f2a3b4c5",
  "feature_schema_version": "v4.1",
  "reliability_index_mean": 0.91,
  "reliability_index_min": 0.84,
  "predicted_thickness_nm": [44.8, 45.1, ...],
  "disposition": "ACCEPT",
  "operator_override": false
}
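One way to keep incomplete records out of the manufacturing record is to refuse to serialize them. The sketch below assumes the field names from the example entry above are the required set; the function names are hypothetical, and where the record is ultimately stored is outside its scope.

```python
import json

REQUIRED_FIELDS = {
    "lot_id", "wafer_ids", "decision_timestamp_utc", "model_id",
    "model_checksum_md5", "feature_schema_version", "reliability_index_mean",
    "reliability_index_min", "predicted_thickness_nm", "disposition",
    "operator_override",
}


def missing_fields(record: dict) -> set:
    """Return the required fields that are absent from the record."""
    return REQUIRED_FIELDS - set(record)


def serialize_record(record: dict) -> str:
    """Serialize a disposition record, refusing to write an incomplete one."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"incomplete disposition record, missing: {sorted(missing)}")
    # Sorted keys and fixed separators give a stable byte representation,
    # which matters if the serialized record is itself hashed later.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```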

Lot genealogy trees

Wafer lots in the fab have parent-child relationships. When a lot is split for an experiment, the child lots inherit the process history of the parent up to the split point. When lots from different products are processed on the same tool within the same time window, they share a portion of the tool's contamination history. The genealogy tree maintained by the MES captures these relationships.

A yield model that treats lot history as independent samples will train on data where some lots share process exposure through the genealogy tree. Treating shared-exposure lots as independent inflates the effective sample size and underestimates the uncertainty of predictions for lots with unusual genealogies.
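A common mitigation is to split data by genealogy group so that related lots never straddle the train/validation boundary. The sketch below handles only parent-child splits; shared tool exposure across unrelated lots would need an additional grouping key. The `parent_of` mapping (child lot ID to parent lot ID) and the function names are hypothetical stand-ins for whatever the MES exposes.

```python
def root_lot(lot_id, parent_of):
    """Walk child -> parent links until reaching a lot with no recorded parent."""
    seen = set()
    while lot_id in parent_of and lot_id not in seen:
        seen.add(lot_id)  # guards against a malformed, cyclic genealogy
        lot_id = parent_of[lot_id]
    return lot_id


def grouped_folds(lot_ids, parent_of, n_folds):
    """Assign lots to folds so that lots sharing a root never span two folds."""
    groups = {}
    for lot in lot_ids:
        groups.setdefault(root_lot(lot, parent_of), []).append(lot)
    folds = [[] for _ in range(n_folds)]
    for i, root in enumerate(sorted(groups)):
        folds[i % n_folds].extend(groups[root])
    return folds


parent_of = {"A.1": "A", "A.2": "A", "A.1.1": "A.1"}
print(grouped_folds(["A.1", "A.2", "A.1.1", "B", "C"], parent_of, n_folds=2))
```

Counting groups rather than lots also gives a more honest picture of how many independent observations the model was actually trained on.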

Data Provenance and the Audit Hash

While timestamps track when events occurred, cryptographic hashes track what data existed at each transformation stage. An MD5 hash is a 128-bit fingerprint generated from an arbitrary block of data. Change even a single bit (for example, a timestamp rounded from 14:32:00.001 to 14:32:00.000, or a pressure reading of 15.0 rewritten as 15.00) and the resulting hash is completely different.

Complete data provenance requires a Four-Layer Chain of Custody:

Raw Data Hash: MD5 of the exact bytes extracted from the FDC historian, before any cleaning or transformation.
Cleaned Data Hash: MD5 of the dataset after programmatic cleaning: sentinel replacement, outlier flagging, timestamp alignment.
Training Set Hash: MD5 of the exact rows and columns used to fit the model, after any filtering or feature engineering.
Model Artifact Hash: MD5 of the serialized model file (the ONNX or pickle file deployed to production).

Any break in this chain invalidates the entire pipeline. An investigator can start with the production model hash and verify backward through training and cleaning to the raw extraction, reconstructing exactly what data the model observed.
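The four-layer chain can be sketched with `hashlib` from the Python standard library. The stage names and function names below are hypothetical; in practice each artifact would be read from disk as bytes rather than held in memory.

```python
import hashlib

STAGES = ("raw", "cleaned", "training_set", "model_artifact")


def md5_hex(data: bytes) -> str:
    """Return the MD5 fingerprint of a block of bytes as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def build_manifest(artifacts: dict) -> dict:
    """Record one hash per stage at the time the pipeline runs."""
    return {stage: md5_hex(artifacts[stage]) for stage in STAGES}


def verify_chain(artifacts: dict, manifest: dict) -> list:
    """Return the stages whose current bytes no longer match the recorded hash."""
    return [s for s in STAGES if md5_hex(artifacts[s]) != manifest.get(s)]


artifacts = {"raw": b"t,p\n1,15.0\n", "cleaned": b"t,p\n1,15.0\n",
             "training_set": b"t,p\n1,15.0\n", "model_artifact": b"\x00model"}
manifest = build_manifest(artifacts)
artifacts["training_set"] = b"t,p\n1,15.00\n"  # a manual edit after the fact
print(verify_chain(artifacts, manifest))
```

MD5 is used here to match the text. It reliably exposes accidental or casual edits but is not collision-resistant against a deliberate attacker; `hashlib.sha256` is a drop-in replacement where that matters.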

The 60-Day Trace-back Problem

The defining characteristic of semiconductor manufacturing is extreme latency. A wafer processed through an etch chamber today will not reach end-of-line electrical test for up to 60 days. When a yield excursion is discovered, the data science team must execute a forensic trace-back through 500+ process steps immediately.

Case study: the manual edit

A data science team trained a virtual metrology model predicting etch depth with R² = 0.94, a substantial improvement over the previous 0.91 model. During the mandatory provenance audit, the Training Set Hash did not match the hash of the data extracted from the warehouse.

Investigation revealed that a data scientist had manually edited the CSV file to "correct" what they believed were timestamp errors, shifting 200 rows by three seconds.

This manual edit introduced a time-travel leak: the shifted timestamps aligned the endpoint detection labels with the stabilization phase of the previous wafer, creating an impossible correlation that boosted R² by 0.03. The model was predicting when the previous wafer ended, not when the current wafer reached endpoint.

The hash mismatch detected the manual edit. The deployment was cancelled. The "optimization" was a measurement artifact.

The Temporal Protocol: Three Timestamps

To guarantee forensic reconstructability, every inference logged by the model must permanently bind three distinct temporal states:

Process Timestamp: The exact millisecond the physical event occurred in the chamber, via the SECS/GEM (Stream 6, Function 11, S6F11) trace timestamp written by the tool controller at the moment of measurement.
Inference Timestamp: The exact millisecond the model generated its prediction based on that telemetry. In a real-time system, this is seconds after the Process Timestamp. In a batch system, it may be hours.
Ground Truth Timestamp: The eventual arrival of the metrology or E-Test label, appended weeks later when the wafer finally reaches physical measurement.

If an audit reveals that a model's Inference Timestamp occurred before the Process Timestamp of a specific input variable, the pipeline has suffered a catastrophic time-travel leak: the model was inadvertently granted access to future data.
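The audit check described above reduces to a timestamp comparison per input feature. A minimal sketch, assuming timestamps are logged as ISO 8601 UTC strings with a trailing `Z` (as in the log entry format used earlier); the function and feature names are hypothetical.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-03-15T14:32:17.441Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def time_travel_leaks(inference_ts: str, process_ts_by_feature: dict) -> list:
    """Return features whose Process Timestamp is later than the Inference Timestamp."""
    t_inf = parse_utc(inference_ts)
    return sorted(name for name, ts in process_ts_by_feature.items()
                  if parse_utc(ts) > t_inf)


leaks = time_travel_leaks(
    "2026-03-15T14:32:17.441Z",
    {"chamber_pressure": "2026-03-15T14:32:10.000Z",
     "endpoint_signal": "2026-03-15T14:32:20.000Z"},
)
print(leaks)
```

Running the same check at training time, with each label's timestamp in place of the Inference Timestamp, is one way to catch this class of leak before a model is deployed.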

Recipe, Consumable, and Hardware Lineage

A sensor reading of 45.0 mTorr has no intrinsic meaning without knowing the tool's state at that exact moment. A complete provenance record must also capture:

Recipe Revisions: A model trained on ETCH_POR_v2.1 will likely fail on v2.2, even if the raw sensor data looks identical. Recipe version must be logged with every inference.
Consumable Lot Tracking: A model predicting CMP defect rates must know that the polishing pad used was Lot #44821. Different pad lots from the same supplier have different surface characteristics.
Hardware Maintenance Cycles: An inference made one hour after a chamber wet-clean is in a fundamentally different process state than the same inference made 400 wafers into the run. The wafers-since-clean counter must be logged.

Data Retention Architecture for Automotive

A fab supplying automotive customers must retain sufficient records to support field failure investigation for the vehicle service lifetime plus a regulatory buffer: in practice, 20 to 25 years after the last device from a lot is shipped. Every feature vector, model prediction, and lot disposition decision that goes into an automotive lot release record must be archived in this system.

The archival format must be chosen for long-term accessibility. Appropriate for 20-year retention: plain-text CSV, JSON, or XML; documented binary formats (Parquet with schema metadata, HDF5 with full attribute documentation). Inappropriate for 20-year retention: proprietary binary formats tied to a specific software version, database dumps without schema documentation, formats that require a commercial license to read.

Case study: the unreadable archive

An automotive-grade chip supplier was asked by an OEM to provide complete manufacturing records for a specific lot shipped 11 years earlier, following a field failure investigation. The supplier retrieved the archived records and discovered that the metrology data had been stored in a proprietary binary format specific to a metrology tool vendor that had since been acquired and dissolved. No software capable of reading the format existed anymore.

The supplier was unable to produce the requested records and was found in breach of its quality agreement. The OEM initiated a recall of all vehicles containing devices from the affected lot based on a conservative assumption that the unverifiable lot might have the same defect as the returned device.

The fix: Archival format policy updated to require conversion to open, documented formats (CSV and JSON) at archive creation time, regardless of the native format of the source system.
