Technology Risk Assessment

The Failure That Was Not on Anyone’s List

Post-mortems on space-system failures often arrive at a discomforting conclusion. The component that failed was not, in the program’s documentation, considered a leading risk. The failure mode that manifested was not on any hazard register. The causal chain that produced the loss — a minor thermal excursion coinciding with a software state most operators did not know could occur, triggering a fault response that had been tested in isolation but never in that combination — was nowhere described before the event. The review board eventually reconstructs the chain, writes a cause-and-effect narrative, and adds the pattern to the next program’s training material. The program that lost the asset does not benefit from the lesson.

This is not a story about bad engineering. It is a story about the limits of what engineers can imagine about their own systems without a disciplined method for imagining systematically. Every complex space technology contains more failure modes than any individual can enumerate from memory, and the ones that matter most are often the least intuitive — low-probability, low-detectability, catastrophic-severity combinations that a casual risk review treats as negligible because the probability dominates the reading. Technology risk assessment is the method designed to surface these combinations before they surface themselves.

From Apollo-Era FMEA to Leveson’s Systems Safety

The method sits on three lineages that matured separately and now operate together. The oldest is Failure Mode and Effects Analysis, codified in US military practice as MIL-STD-1629 and internationally as IEC 60812, but with origins in mid-twentieth-century reliability engineering and substantial refinement during the Apollo program. FMEA’s contribution was the insistence that every element of a system be examined systematically for how it could fail, not merely whether it would work as intended. The shift from validation thinking (“does this work?”) to failure thinking (“how can this break?”) was the foundational move that made structured reliability analysis possible.

The second lineage is Fault Tree Analysis, developed in the early 1960s at Bell Labs for Minuteman missile launch control and generalized through the NASA Fault Tree Handbook. Where FMEA works bottom-up — starting from components and asking what happens when each fails — fault tree analysis works top-down, starting from a catastrophic outcome and asking what combinations of events could produce it. Fault trees identify single points of failure, where one element’s failure alone produces the top event, and common-cause failures, where a single underlying condition triggers multiple simultaneous failures. The two methods are complementary: FMEA surfaces failure modes, fault trees surface failure combinations.

The third lineage is the systems safety tradition elaborated by Nancy Leveson, whose Engineering a Safer World (2011) argued that traditional reliability analysis is inadequate for modern software-intensive systems, whose failures emerge from interactions rather than from component breakdowns. Leveson’s STPA (System-Theoretic Process Analysis) framework and the broader systems-safety discipline it represents pushed the field toward treating failure as a control-structure problem, not just a component problem. In parallel, NASA’s NPR 8000.4, the agency’s risk management procedural requirements, codified risk-informed decision making, so that risk analyses feed agency-level decisions rather than remaining artifacts of engineering review.

Technology risk assessment as practiced today combines all three. FMEA provides the bottom-up enumeration. Fault trees provide the top-down reconstruction. Systems-safety analysis covers the interaction and control-structure failures that neither enumerates. Bow-tie analysis, a more recent synthesis, provides the visualization that makes threat pathways, preventive barriers, consequence pathways, and mitigative barriers legible to decision-makers who are not themselves safety engineers.

| Lineage | Canonical reference | Analytical direction |
| --- | --- | --- |
| FMEA/FMECA | MIL-STD-1629, IEC 60812 (Apollo-era refinement) | Bottom-up: from components to failure modes |
| Fault Tree Analysis | Bell Labs (Minuteman, early 1960s); NASA Handbook | Top-down: from catastrophic outcome to causal combinations |
| Systems Safety | Leveson, Engineering a Safer World (2011); NASA NPR 8000.4 | Failure as control-structure problem, not component breakdown |

What Risk Assessment Sees That a Hazard List Does Not

The characteristic analytical gesture of technology risk assessment is its refusal to conflate probability with priority. A naive hazard review lists risks in order of likelihood and focuses attention on the ones most likely to occur. The method’s first correction is the insistence that priority depends on severity, probability, and detectability together. A low-probability failure mode that is both catastrophic and undetectable is a higher-priority risk than a higher-probability failure mode that is merely degrading and well-instrumented, because the first type of failure is exactly the class that program reviews miss and post-mortems reconstruct.

The remaining four moves give the method its structure.

Decomposition into assessable elements
A system-level risk summary hides the subsystem-level modes that actually drive exposure. The method requires the system to be broken into functional blocks, subsystems, and critical interfaces, with each element examined individually against a consistent set of failure categories: design limitations, manufacturing defects, environmental stresses, integration failures, operational errors, and wear-out. The output is a structured catalog of failure modes with explicit severity, probability, and detectability readings and with explicit cause-effect chains.
Single points and common causes
The second move identifies single points of failure and common-cause failures through fault tree analysis of the most severe end effects. Redundancy is the conventional answer to single-point-of-failure exposure, but redundancy is only effective to the extent that the redundant paths are truly independent. Common-cause failures defeat apparent redundancy by taking out multiple paths simultaneously: a shared power bus, a common qualification lot, a shared environmental exposure. Rigorous fault tree analysis exposes the dependencies that block diagrams miss.
Bow-tie translation
For the highest-priority risks, the analysis is redrawn as a bow-tie diagram with the hazard at the center. Threat pathways, the sequences of events that can lead to the hazard materializing, enter from the left. Preventive barriers stand between those pathways and the hazard, each with an integrity and independence rating. On the right, consequence pathways show what happens after the hazard occurs, and mitigative barriers stand between the hazard and the consequences. A barrier is rated not on whether it exists but on whether it is independent of other barriers and whether it degrades gracefully. A chain of six barriers that all depend on the same power system is one barrier with six labels.
Translation to decision
Raw FMEA tables are engineering artifacts, not strategic findings. For executive review, the hundreds or thousands of failure modes must be condensed into the risk narratives that drive decisions: dominant risks, acceptability judgments, recommended mitigations, residual risk after mitigation, and comparison against alternatives. The method's value at the strategic level depends on this translation being honest rather than reassuring.
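The independence point in the bow-tie step can be made concrete with a small sketch. It assumes each barrier is described only by the set of resources it depends on, and it groups barriers that share any dependency; the barrier and resource names are hypothetical, and the check is deliberately coarse (it says nothing about integrity or graceful degradation).

```python
def independent_barrier_groups(barriers):
    """Group barriers that share any dependency, directly or transitively.

    `barriers` maps a barrier name to the set of resources it depends on.
    Barriers in the same group can be defeated by one common cause, so each
    group counts as a single effective barrier.
    """
    groups = []  # list of (barrier_names, pooled_dependencies)
    for name, deps in barriers.items():
        names, pooled = {name}, set(deps)
        remaining = []
        for g_names, g_deps in groups:
            if pooled & g_deps:
                # Shared resource: fold the existing group into this one.
                names |= g_names
                pooled |= g_deps
            else:
                remaining.append((g_names, g_deps))
        groups = remaining + [(names, pooled)]
    return [sorted(n) for n, _ in groups]


# Hypothetical barrier set for illustration only.
barriers = {
    "pressure interlock": {"main power bus"},
    "valve position sensor": {"main power bus"},
    "software abort logic": {"main power bus", "flight computer"},
    "mechanical relief valve": set(),
}
groups = independent_barrier_groups(barriers)
print(len(barriers), "nominal barriers,", len(groups), "effective")
```

In this illustrative set, four labeled barriers collapse to two effective ones, because three of them fail together if the shared power bus is lost.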

A Refueling Interface, Read Through Three Lenses

Consider the assessment applied to an on-orbit refueling interface for a generic servicer. The decomposition yields several elements: the fluid coupling mechanism, the guidance-and-navigation system that achieves the required alignment, the thermal management of the transfer line, the cleanliness controls on both oxidizer and fuel paths, and the fault management logic that arbitrates aborts.

The bottom-up FMEA surfaces a familiar pattern. A fluid coupling leak during mate, caused by seal degradation from thermal cycling, produces a critical local effect and at least major mission consequences. The probability is occasional on current seal qualification evidence; detectability is medium, because leak detection depends on specific sensor geometry and carries acknowledged latency. The priority reads high. A misalignment greater than two degrees at docking, caused by GNC sensor drift and control-loop latency, produces major consequences; probability is remote, detectability is high, and priority reads medium.

The third entry is the one the method earns its keep on. Contamination of the oxidizer line, caused by residual manufacturing particulate in the coupling, produces catastrophic consequences: combustion-compatible material in an oxidizer path is the classical space-flight disaster. Probability is improbable on clean-room evidence; detectability is low, because the contamination would not manifest until the transfer was already under way. Naively, improbable × catastrophic reads as low priority. The method’s discipline produces the opposite reading: a low-probability, low-detectability, catastrophic mode is a dominant risk precisely because its improbability is the only defense against it and because no detection mechanism can catch it in time. The priority reads high.
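The three readings above can be reproduced with a minimal scoring sketch. The numeric scales, the thresholds, and the escalation rule are all assumptions chosen for illustration; they are not taken from MIL-STD-1629, IEC 60812, or any program's actual rubric. The leak is scored at its mission-level severity ("major").

```python
from dataclasses import dataclass

# Illustrative ordinal scales (assumptions, not taken from any standard).
SEVERITY = {"minor": 1, "major": 3, "critical": 4, "catastrophic": 5}
PROBABILITY = {"improbable": 1, "remote": 2, "occasional": 3,
               "probable": 4, "frequent": 5}
# Detectability is inverted: the harder a mode is to detect, the higher the score.
DETECTION_DIFFICULTY = {"high": 1, "medium": 3, "low": 5}


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: str
    probability: str
    detectability: str

    def naive_score(self) -> int:
        """Probability x severity only: the reading the text warns against."""
        return PROBABILITY[self.probability] * SEVERITY[self.severity]

    def priority_score(self) -> int:
        """Severity x probability x detection difficulty (RPN-style product)."""
        return self.naive_score() * DETECTION_DIFFICULTY[self.detectability]

    def priority(self) -> str:
        # Assumed escalation rule: a catastrophic mode that cannot be caught
        # in time is high priority however small its probability.
        if self.severity == "catastrophic" and self.detectability == "low":
            return "high"
        score = self.priority_score()
        if score >= 20:
            return "high"
        if score >= 5:
            return "medium"
        return "low"


modes = [
    FailureMode("coupling leak during mate", "major", "occasional", "medium"),
    FailureMode("docking misalignment > 2 deg", "major", "remote", "high"),
    FailureMode("oxidizer line contamination", "catastrophic", "improbable", "low"),
]

for m in sorted(modes, key=lambda m: m.priority_score(), reverse=True):
    print(f"{m.name:32s} naive={m.naive_score():2d} "
          f"score={m.priority_score():2d} priority={m.priority()}")
```

Under these assumed scales, the probability-times-severity reading ranks the contamination mode last of the three, while the three-factor reading places it alongside the coupling leak as high priority.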

The fault tree for the catastrophic top event — loss of servicer during refueling — reveals that several superficially distinct failure paths share the same underlying cleanliness-control process at integration. This is a common-cause vulnerability: the program’s apparent redundancy against oxidizer contamination depends on a single process whose failure would undermine all the redundant paths simultaneously. The bow-tie visualization completes the picture, showing that the primary preventive barrier against contamination is a qualification protocol whose independence from adjacent barriers is weaker than the block diagram suggested.
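A toy fault tree shows how a shared process surfaces as a single point of failure. The sketch below computes minimal cut sets for a small AND/OR tree; the tree structure and event names are hypothetical stand-ins for the cleanliness example, not the actual tree of any program.

```python
from itertools import product


# A node is either a basic-event name (str) or a gate: ("AND" | "OR", [children]).
def cut_sets(node):
    """Return the cut sets of a fault tree as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":
        # Any child's cut set alone triggers the gate.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal(sets):
    """Drop any cut set that strictly contains another one."""
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical top event: contamination reaches the oxidizer line only if
# both nominally redundant defenses fail. Each defense can fail on its own
# terms or through the shared cleanliness-control process.
tree = ("AND", [
    ("OR", ["coupling_inspection_missed", "cleanliness_process_lapse"]),
    ("OR", ["line_flush_ineffective", "cleanliness_process_lapse"]),
])

mcs = minimal(cut_sets(tree))
single_points = sorted(next(iter(s)) for s in mcs if len(s) == 1)
print(single_points)
```

The block diagram for this toy tree shows two defenses in parallel, yet the minimal cut sets contain a one-element set: the shared process alone produces the top event.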

The recommended mitigation follows from the analysis. A mandatory cleanliness verification protocol at integration — specifically designed as an independent barrier, not a dependent one — raises the detectability rating from low to high and reduces the residual risk materially. The protocol carries a schedule cost; the method’s strategic output is to make that cost visible against the risk it retires, so that the decision to absorb it is taken consciously rather than deferred.

The non-obvious insight, the one the method produces and a naive probability-ranked list would have missed, is that the program’s dominant technology risk is not the high-profile docking alignment problem that executive briefings focus on. It is the low-profile cleanliness problem that is rarely mentioned because its probability looks comfortable on paper. Rigorous risk assessment redirects attention from the headline hazard to the structural one.

Where It Earns Its Keep and Where It Falls Short

The method’s strength is the disciplined surfacing of failure modes that intuition systematically misses. FMEA’s bottom-up enumeration, fault tree analysis’s top-down reconstruction, and bow-tie’s barrier integrity reading together produce a risk profile that decision-makers can trust and challenge. For any technology whose failure consequences are measured in mission loss or safety hazard, the method is foundational.

Its weaknesses are equally structural. Quality depends on the completeness of failure mode identification, and unknown unknowns remain the greatest risk — FMEA cannot discover what the analyst does not imagine. Pairing with red-team analysis, which deliberately seeks to enumerate overlooked failure pathways, is how the blind-spot problem is addressed honestly. Probability estimation for novel technologies without flight heritage is speculative; the method works best on technologies with operational history and should flag explicitly when its probability readings are based on analysis rather than data.

The method is labor-intensive for complex systems, and strategic-level application requires disciplined scoping to critical subsystems rather than exhaustive coverage. It focuses on technical risks and does not address programmatic risks (schedule, budget, organizational), market risks, or regulatory risks — these require separate frameworks. It tends toward conservatism: systematically identifying failure modes can create a risk-averse bias that underweights the strategic cost of inaction or delay. Practitioners producing strategic analysis should acknowledge this bias explicitly, because “we found many risks, so recommend delay” is not a strategic finding; it is a finding plus a suppressed assumption about the cost of waiting.

Fault tree analysis assumes static architecture and struggles with adaptive or reconfigurable systems, where the control structure itself changes during operation. STPA and related systems-safety methods handle this case better. Classical fault trees should not be extended beyond the architectures they were designed for.

The library treats technology risk assessment as tightly coupled with neighboring methods. TRL scores contextualize probability estimates, because low-TRL elements carry higher intrinsic uncertainty in their failure-mode likelihood. Individual technology risks feed the broader strategic risk matrix as calibrated entries for the technology dimension. Risk profiles provide the risk-adjusted lens on technical benchmark comparisons, distinguishing well-understood from poorly characterized alternatives. Critical failure scenarios identified here supply trigger events for scenario planning downstream. And threat modeling, which examines adversarial threats rather than intrinsic failures, covers the complementary risk category that this method explicitly does not.
