Technology Risk Assessment
The Failure That Was Not on Anyone’s List
Post-mortems on space-system failures often arrive at a discomforting conclusion. The component that failed was not, in the program’s documentation, considered a leading risk. The failure mode that manifested was not on any hazard register. The causal chain that produced the loss — a minor thermal excursion coinciding with a software state most operators did not know could occur, triggering a fault response that had been tested in isolation but never in that combination — was nowhere described before the event. The review board eventually reconstructs the chain, writes a cause-and-effect narrative, and adds the pattern to the next program’s training material. The program that lost the asset does not benefit from the lesson.
This is not a story about bad engineering. It is a story about the limits of what engineers can imagine about their own systems without a disciplined method for imagining systematically. Every complex space technology contains more failure modes than any individual can enumerate from memory, and the ones that matter most are often the least intuitive — low-probability, low-detectability, catastrophic-severity combinations that a casual risk review treats as negligible because the probability dominates the reading. Technology risk assessment is the method designed to surface these combinations before they surface themselves.
From Apollo-Era FMEA to Leveson’s Systems Safety
The method sits on three lineages that matured separately and now operate together. The oldest is Failure Mode and Effects Analysis, codified in US military practice as MIL-STD-1629 and internationally as IEC 60812, but with origins in mid-twentieth-century reliability engineering and substantial refinement during the Apollo program. FMEA’s contribution was the insistence that every element of a system be examined systematically for how it could fail, not merely whether it would work as intended. The shift from validation thinking (“does this work?”) to failure thinking (“how can this break?”) was the foundational move that made structured reliability analysis possible.
The second lineage is Fault Tree Analysis, developed in the early 1960s at Bell Labs for Minuteman missile launch control and generalized through the NASA Fault Tree Handbook. Where FMEA works bottom-up — starting from components and asking what happens when each fails — fault tree analysis works top-down, starting from a catastrophic outcome and asking what combinations of events could produce it. Fault trees identify single points of failure, where one element’s failure alone produces the top event, and common-cause failures, where a single underlying condition triggers multiple simultaneous failures. The two methods are complementary: FMEA surfaces failure modes, fault trees surface failure combinations.
The third lineage is the systems safety tradition elaborated by Nancy Leveson, whose Engineering a Safer World (2011) argued that traditional reliability analysis is inadequate for modern software-intensive systems, whose failures emerge from interactions rather than from component breakdowns. Leveson’s STPA (System-Theoretic Process Analysis) framework and the broader systems-safety discipline it represents pushed the field toward thinking about failure as a control-structure problem, not just a component problem. In parallel, NASA’s NPR 8000.4, the agency’s risk management procedural requirements, codified risk-informed decision making so that risk analyses feed agency-level decisions rather than remaining artifacts of engineering review.
Technology risk assessment as practiced today combines all three. FMEA provides the bottom-up enumeration. Fault trees provide the top-down reconstruction. Bow-tie analysis — a more recent synthesis — provides the visualization that makes threat pathways, preventive barriers, consequence pathways, and mitigative barriers legible to decision-makers who are not themselves safety engineers.
| Lineage | Canonical reference | Analytical direction |
|---|---|---|
| FMEA/FMECA | MIL-STD-1629, IEC 60812 (Apollo-era refinement) | Bottom-up: from components to failure modes |
| Fault Tree Analysis | Bell Labs (Minuteman, early 1960s); NASA Handbook | Top-down: from catastrophic outcome to causal combinations |
| Systems Safety | Leveson, Engineering a Safer World (2011); NASA NPR 8000.4 | Failure as control-structure problem, not component breakdown |
What Risk Assessment Sees That a Hazard List Does Not
The characteristic analytical gesture of technology risk assessment is its refusal to conflate probability with priority. A naive hazard review lists risks in order of likelihood and focuses attention on the ones most likely to occur. The method’s first correction is the insistence that priority depends on severity, probability, and detectability together. A low-probability failure mode that is both catastrophic and undetectable is a higher-priority risk than a higher-probability failure mode that is merely degrading and well-instrumented, because the first type of failure is exactly the class that program reviews miss and post-mortems reconstruct.
The remaining four moves give the method its structure.
A Refueling Interface, Read Through Three Lenses
Consider the assessment applied to an on-orbit refueling interface for a generic servicer. The decomposition yields several elements: the fluid coupling mechanism, the guidance-and-navigation system that achieves the required alignment, the thermal management of the transfer line, the cleanliness controls on both oxidizer and fuel paths, and the fault management logic that arbitrates aborts.
The bottom-up FMEA surfaces a familiar pattern. A fluid coupling leak during mate, caused by seal degradation from thermal cycling, produces a critical local effect and at least major mission consequences. The probability is occasional on current seal qualification evidence; detectability is medium, because leak detection depends on specific sensor geometry and carries acknowledged latency. The priority reads high. A misalignment greater than two degrees at docking, caused by GNC sensor drift and control-loop latency, produces major consequences; probability is remote, detectability is high, and priority reads medium.
The third entry is where the method earns its keep. Contamination of the oxidizer line caused by residual manufacturing particulate in the coupling produces catastrophic consequences: combustion-compatible material in an oxidizer path is the classical space-flight disaster. Probability is improbable on clean-room evidence; detectability is low, because the contamination would not manifest until the transfer was already under way. Naively, improbable × catastrophic reads as low priority. The method’s discipline produces the opposite reading: a low-probability, low-detectability, catastrophic mode is a dominant risk precisely because its improbability is the only defense against it and because detection cannot catch it in time. The priority reads high.
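The three readings above can be sketched as a small scoring exercise. This is a minimal illustration, not the method's reference scoring: the ordinal scales, the multiplicative score, and the band cut-offs are all assumptions chosen for the example, and the leak is scored at its lower-bound mission severity ("major").

```python
from dataclasses import dataclass

# Ordinal scales are illustrative assumptions, not values from a standard.
SEVERITY = {"minor": 1, "major": 2, "critical": 3, "catastrophic": 4}
PROBABILITY = {"improbable": 1, "remote": 2, "occasional": 3,
               "probable": 4, "frequent": 5}
# Detectability is scored as detection difficulty: harder to detect scores higher.
DETECTION_DIFFICULTY = {"high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: str
    probability: str
    detectability: str


def naive_score(mode: FailureMode) -> int:
    """Severity x probability only, as in a two-axis risk matrix."""
    return SEVERITY[mode.severity] * PROBABILITY[mode.probability]


def spd_score(mode: FailureMode) -> int:
    """Severity x probability x detection difficulty."""
    return naive_score(mode) * DETECTION_DIFFICULTY[mode.detectability]


def priority(mode: FailureMode) -> str:
    """Band the three-factor score; the cut-offs are assumed for illustration."""
    score = spd_score(mode)
    if score >= 12:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


# The three entries from the refueling-interface FMEA discussed above.
MODES = [
    FailureMode("coupling leak during mate", "major", "occasional", "medium"),
    FailureMode("misalignment at docking", "major", "remote", "high"),
    FailureMode("oxidizer line contamination", "catastrophic", "improbable", "low"),
]

if __name__ == "__main__":
    for mode in MODES:
        print(f"{mode.name}: naive={naive_score(mode)}, "
              f"three-factor={spd_score(mode)}, priority={priority(mode)}")
```

On these assumed scales the two-factor score leaves contamination tied with misalignment at the bottom of the list, while the three-factor score lifts it level with the coupling leak. Many programs use lookup matrices or explicit escalation rules rather than a product; the point of the sketch is only that detectability has to enter the reading somewhere.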
The fault tree for the catastrophic top event — loss of servicer during refueling — reveals that several superficially distinct failure paths share the same underlying cleanliness-control process at integration. This is a common-cause vulnerability: the program’s apparent redundancy against oxidizer contamination depends on a single process whose failure would undermine all the redundant paths simultaneously. The bow-tie visualization completes the picture, showing that the primary preventive barrier against contamination is a qualification protocol whose independence from adjacent barriers is weaker than the block diagram suggested.
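The common-cause finding can be illustrated with a toy fault tree and its minimal cut sets. This is a sketch under stated assumptions: the tree shape, the two barriers, and every event name are hypothetical placeholders, not the program's actual architecture. The top event modeled here is contamination reaching the oxidizer line, which requires both barriers to fail.

```python
from itertools import product

# A node is either a basic-event name (str) or a (gate, children) tuple.
# All event names below are hypothetical placeholders.
TREE = ("AND", [
    ("OR", ["particulate_filter_fault", "cleanliness_process_lapse"]),
    ("OR", ["pre_mate_inspection_fault", "cleanliness_process_lapse"]),
])


def cut_sets(node):
    """Return every cut set of the node as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set is enough to trigger an OR gate.
        return set().union(*child_sets)
    if gate == "AND":
        # An AND gate needs one cut set from every child at the same time.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop any cut set that strictly contains another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


def single_points_of_failure(node):
    """Basic events that form a minimal cut set on their own."""
    return sorted(next(iter(s)) for s in minimal_cut_sets(node) if len(s) == 1)


if __name__ == "__main__":
    for cut in sorted(minimal_cut_sets(TREE), key=len):
        print(sorted(cut))
    print("single points of failure:", single_points_of_failure(TREE))
```

The block diagram of this toy system shows two barriers in series, which looks like redundancy. The minimal cut sets show that the shared cleanliness process defeats both barriers by itself, so it is a single point of failure despite the apparent redundancy.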
The recommended mitigation follows from the analysis. A mandatory cleanliness verification protocol at integration — specifically designed as an independent barrier, not a dependent one — raises the detectability rating from low to high and reduces the residual risk materially. The protocol carries a schedule cost; the method’s strategic output is to make that cost visible against the risk it retires, so that the decision to absorb it is taken consciously rather than deferred.
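The effect of the mitigation on the contamination entry can be sketched by re-scoring it with only the detectability rating changed. The ordinal scales and the multiplicative score are assumptions made for illustration; the resulting ratio is meaningful only on that assumed scale and is not a probability.

```python
# Illustrative ordinal scales (assumptions, not taken from a standard).
SEVERITY = {"minor": 1, "major": 2, "critical": 3, "catastrophic": 4}
PROBABILITY = {"improbable": 1, "remote": 2, "occasional": 3,
               "probable": 4, "frequent": 5}
# Detectability is scored as detection difficulty: harder to detect scores higher.
DETECTION_DIFFICULTY = {"high": 1, "medium": 2, "low": 3}


def risk_score(severity: str, probability: str, detectability: str) -> int:
    """Severity x probability x detection difficulty on the assumed scales."""
    return (SEVERITY[severity]
            * PROBABILITY[probability]
            * DETECTION_DIFFICULTY[detectability])


# Oxidizer contamination before and after an independent verification barrier:
# severity and probability are unchanged, only detectability moves.
before = risk_score("catastrophic", "improbable", "low")
after = risk_score("catastrophic", "improbable", "high")
reduction = 1 - after / before

if __name__ == "__main__":
    print(f"before={before}, after={after}, relative reduction={reduction:.0%}")
```

Setting a figure like this beside the protocol's schedule cost is one way to make the trade explicit, provided the score is read as an ordinal indicator and not as a measured risk.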
The non-obvious insight, the one the method produces and a naive probability-ranked list would have missed, is that the program’s dominant technology risk is not the high-profile docking alignment problem that executive briefings focus on. It is the low-profile cleanliness problem that is rarely mentioned because its probability looks comfortable on paper. Rigorous risk assessment redirects attention from the headline hazard to the structural one.
Where It Earns Its Keep and Where It Falls Short
The method’s strength is the disciplined surfacing of failure modes that intuition systematically misses. FMEA’s bottom-up enumeration, fault tree analysis’s top-down reconstruction, and bow-tie’s barrier integrity reading together produce a risk profile that decision-makers can trust and challenge. For any technology whose failure consequences are measured in mission loss or safety hazard, the method is foundational.
Its weaknesses are equally structural. Quality depends on the completeness of failure mode identification, and unknown unknowns remain the greatest risk — FMEA cannot discover what the analyst does not imagine. Pairing with red-team analysis, which deliberately seeks to enumerate overlooked failure pathways, is how the blind-spot problem is addressed honestly. Probability estimation for novel technologies without flight heritage is speculative; the method works best on technologies with operational history and should flag explicitly when its probability readings are based on analysis rather than data.
The method is labor-intensive for complex systems, and strategic-level application requires disciplined scoping to critical subsystems rather than exhaustive coverage. It focuses on technical risks and does not address programmatic risks (schedule, budget, organizational), market risks, or regulatory risks — these require separate frameworks. It tends toward conservatism: systematically identifying failure modes can create a risk-averse bias that underweights the strategic cost of inaction or delay. Practitioners producing strategic analysis should acknowledge this bias explicitly, because “we found many risks, so recommend delay” is not a strategic finding; it is a finding plus a suppressed assumption about the cost of waiting.
Fault tree analysis assumes a static architecture and struggles with adaptive or reconfigurable systems, where the control structure itself changes during operation. STPA and related systems-safety methods handle this case better. Classical fault trees should not be extended beyond the architectures they were designed for.
The library treats technology risk assessment as tightly coupled with neighbors. TRL scores contextualize probability estimates, because low-TRL elements carry higher intrinsic uncertainty in their failure-mode likelihood. Individual technology risks feed the broader strategic risk matrix as calibrated entries for the technology dimension. Risk profiles provide the risk-adjusted lens on technical benchmark comparisons, distinguishing well-understood from poorly-characterized alternatives. Critical failure scenarios identified here supply trigger events for scenario planning downstream. And threat modeling, which examines adversarial threats rather than intrinsic failures, covers the complementary risk category that this method explicitly does not.
For the Practitioner
Reach for technology risk assessment when technology reliability, availability, or safety is central to the strategic argument — design reviews, operational commitment decisions, procurement and investment decisions on risk-adjusted terms, post-failure analysis. Do not reach for it when the concern is adversarial exploitation (threat modeling carries that load) or when the concern is programmatic or market risk (separate frameworks apply).
Pair it with TRL assessment for probability calibration, with red-team analysis for blind-spot coverage, and with scenario planning to stress-test the risk profile against divergent futures. The operational version of the method, with its FMEA tables, fault trees, bow-tie diagrams, and explicit severity-probability-detectability discipline, remains the reference for practitioners who want the assessment to survive both engineering challenge and executive review.
spacepolicies.org