Resilience Analysis

The Wrong Question About Failure

Planners of space architectures spend most of their risk energy on the wrong question. They ask how to prevent failure — how to harden a satellite against radiation, how to secure a ground station against intrusion, how to reduce the probability of a supply-chain disruption. These are reasonable questions, and the discipline that answers them is mature. But they are not the decisive questions for critical space infrastructure.

The decisive questions are what happens after the failure that cannot be prevented: how badly does the system degrade when a ground station goes offline, how long until partial service returns, how does the loss of a single vendor cascade through the constellation, how does graceful degradation play out when two disruptions arrive within the same recovery window. These are resilience questions, not risk questions, and they have a different analytical shape.

Consider a GNSS augmentation service that has been hardened, audited, and insured against every plausible threat. Its preventive posture is excellent. A targeted cyber intrusion nonetheless succeeds against one of its three redundant ground stations, and the recovery team discovers that the supply chain for replacement hardware depends on a single vendor with an eighteen-month lead time. The operator was not culpably negligent in failing to prevent the intrusion; its failure was never having thought through what the eighteen months after the intrusion would look like. The preventive posture was thick, the resilience posture was thin, and the gap between them produced the service outage that mattered.

Resilience analysis is the discipline of the second question. It accepts that disruptions will happen — some predictable, some not — and examines how well the system performs once they have happened. For critical space services, whose failure propagates into terrestrial consequences, it is the analysis whose absence is most consistently felt after a bad event.

Ecology, Infrastructure, and Complex Systems

The intellectual roots of resilience thinking are older than space and older than security. Crawford Stanley Holling, writing in ecology in 1973, introduced a distinction that the field has lived with ever since: stability is the property of a system that returns quickly to equilibrium after disturbance, while resilience is the property of a system that absorbs disturbance and maintains its essential structure even when the equilibrium itself shifts. Ecological systems, Holling argued, can be unstable and resilient simultaneously, flipping between configurations without losing coherence. The insight disturbed an engineering tradition that had conflated the two concepts.

The critical-infrastructure community absorbed resilience thinking in the wake of the post-2001 security shocks, when it became clear that defensive postures against specific threats were producing systems that were brittle to unanticipated ones. The United States, through its Department of Homeland Security, produced a series of frameworks from the mid-2000s that placed resilience alongside protection as a co-equal objective. The European Union’s NIS Directive and its successors pushed in a similar direction. By the 2010s, resilience had become a standard category in critical-infrastructure analysis, with recognised subdomains (energy, transport, telecommunications) and a shared vocabulary.

Complex adaptive systems theory, drawing on work at the Santa Fe Institute and elsewhere, supplied the third strand. Systems with many interacting components, feedback loops, and emergent behaviour were shown to exhibit distinctive failure patterns: cascading propagation, phase transitions, brittle equilibria that hid fragility until a trigger arrived. The vocabulary of the field — tipping points, feedback loops, emergence — entered resilience practice and shaped the analytical questions that mature resilience analysis now asks.

For space systems, the intellectual convergence matters because space architectures are, in the relevant sense, all three things at once: they are ecosystems of interacting sensors, buses, links, and ground infrastructure; they are critical infrastructure whose degradation imposes costs far beyond the operator; and they are complex adaptive systems whose behaviour under stress is not always predictable from component-level specifications. Each of the three lineages supplies a necessary part of the analytical vocabulary.

The Characteristic Move

The characteristic move of resilience analysis, and what separates it from risk analysis, is to reverse the assumption about failure. Risk analysis asks how likely an adverse event is and what its magnitude would be, and from this produces a prevention priority. Resilience analysis accepts that the event will happen, not because every event is inevitable but because the portfolio of possible events is too large for prevention alone to cover, and asks what happens after.

The first analytical move is the definition of critical function. A resilience analysis begins with a statement of what the system must do to be considered operational. “GNSS augmentation” is not specific enough; “position accuracy within a defined tolerance, available to users in a defined region, with continuity-of-service guarantees” is specific enough. The performance threshold defines what “acceptable operation” means, and the analysis that follows is an assessment of how well the system maintains that threshold under stress. Analysts who skip this step produce resilience findings that float detached from consequences.
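
A critical-function statement of this kind can be written down as a small, checkable specification. The sketch below is only an illustration: the class name, the field names, and every numeric threshold are hypothetical placeholders, not values taken from any real augmentation service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalFunction:
    """Hypothetical performance threshold for one critical function."""
    name: str
    max_position_error_m: float  # accuracy tolerance (illustrative value)
    min_availability: float      # fraction of time the service must be usable
    max_outage_hours: float      # longest tolerated continuous outage

    def is_met(self, position_error_m: float, availability: float,
               longest_outage_hours: float) -> bool:
        """True only when observed performance satisfies every threshold."""
        return (position_error_m <= self.max_position_error_m
                and availability >= self.min_availability
                and longest_outage_hours <= self.max_outage_hours)


# All numbers below are invented for illustration.
gnss_aug = CriticalFunction("regional GNSS augmentation", 1.0, 0.999, 4.0)

nominal_ok = gnss_aug.is_met(0.6, 0.9995, 0.5)    # within every threshold
degraded_ok = gnss_aug.is_met(0.6, 0.9995, 36.0)  # outage exceeds tolerance
```

Writing the threshold as data rather than prose forces the analyst to commit to what "acceptable operation" means before any scenario is assessed.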

The second move is the scenario set. A resilience analysis does not evaluate the system against a single disruption; it evaluates against a portfolio of disruptions that exercise different vulnerabilities: kinetic disruption, cyber compromise, supply-chain disruption, space weather, regulatory change, market failure, and, critically, slow-onset chronic stresses such as debris accumulation or workforce attrition. The scenario set is deliberate and disciplined, not a laundry list. Each scenario is specified with enough detail that the system’s response can be assessed.
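
One way to keep a scenario set disciplined is to give it a checkable structure. The following is a minimal sketch under assumptions: the `Scenario` fields, the `Onset` tags, and the `min_categories` default are hypothetical choices made for illustration, not part of any established standard.

```python
from dataclasses import dataclass
from enum import Enum


class Onset(Enum):
    ACUTE = "acute"      # sudden disruption (attack, storm, outage)
    CHRONIC = "chronic"  # slow-onset stress (debris, attrition)


@dataclass(frozen=True)
class Scenario:
    name: str
    category: str     # e.g. "kinetic", "cyber", "supply-chain"
    onset: Onset
    description: str  # enough detail to assess the system's response


def check_scenario_set(scenarios, min_categories=3):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    categories = {s.category for s in scenarios}
    if len(categories) < min_categories:
        problems.append(f"only {len(categories)} distinct categories")
    if not any(s.onset is Onset.CHRONIC for s in scenarios):
        problems.append("no chronic (slow-onset) scenario")
    if any(not s.description.strip() for s in scenarios):
        problems.append("scenario without a usable description")
    return problems


narrow = [Scenario("ground-station intrusion", "cyber", Onset.ACUTE,
                   "One ground station compromised beyond remediation.")]
diverse = narrow + [
    Scenario("asset loss", "kinetic", Onset.ACUTE,
             "One space-segment asset destroyed."),
    Scenario("vendor disruption", "supply-chain", Onset.ACUTE,
             "Sole-source hardware vendor halts deliveries."),
    Scenario("workforce attrition", "workforce", Onset.CHRONIC,
             "Operator loses experienced staff over several years."),
]

narrow_problems = check_scenario_set(narrow)
diverse_problems = check_scenario_set(diverse)
```

The check does not judge whether the scenarios are the right ones; it only catches the two structural failures named above, a set that is too narrow and a set with no chronic stress.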

The third move is the decomposition into absorptive, adaptive, and recovery capacities. Each capacity is assessed independently for each scenario, and the three together form the resilience scorecard.

| Capacity | Question it answers | Typical indicators |
|---|---|---|
| Absorptive | Can the system withstand the initial impact without degrading function? | Redundancy, diversity, robustness, buffering margins |
| Adaptive | Can the system reconfigure under stress? | Flexibility in routing, decision speed, interoperability, graceful-degradation paths |
| Recovery | How quickly and completely does the system return to acceptable operation? | Reconstitution plans, supply-chain depth, recovery-time objectives |
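
The scorecard formed by these three capacities can be sketched as a mapping from scenario to per-capacity rating. This is an illustrative sketch only: the three-level ordinal scale, the worst-case aggregation rule, and the example ratings are all assumptions, not a prescribed scoring scheme.

```python
from enum import IntEnum


class Rating(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


CAPACITIES = ("absorptive", "adaptive", "recovery")


def weakest_capacity(scorecard):
    """Take the worst rating per capacity across scenarios,
    then return the capacity whose worst case is lowest."""
    worst = {c: min(row[c] for row in scorecard.values()) for c in CAPACITIES}
    return min(CAPACITIES, key=lambda c: worst[c]), worst


# Ratings are invented for illustration.
scorecard = {
    "cyber compromise of a ground station": {
        "absorptive": Rating.HIGH,
        "adaptive": Rating.MODERATE,
        "recovery": Rating.LOW,
    },
    "extended space weather event": {
        "absorptive": Rating.HIGH,
        "adaptive": Rating.HIGH,
        "recovery": Rating.MODERATE,
    },
}

weakest, worst_by_capacity = weakest_capacity(scorecard)
```

Aggregating by worst case rather than by average is a deliberate choice in this sketch: an average would let strong absorptive ratings mask a weak recovery rating.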

The fourth move is the failure propagation analysis. Space architectures contain single points of failure and cascading dependencies, and the analysis is incomplete until these are identified. A shared ground station whose loss degrades multiple services; a common software stack whose compromise propagates across a constellation; a sole-source vendor whose disruption creates an eighteen-month recovery gap — these are the hidden failure modes that component-level analysis tends to miss.
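
A first pass at finding single points of failure can be automated once dependencies are written down. The sketch below assumes a deliberately simple model: each service lists its requirements, each requirement is a set of interchangeable providers, and a service fails when any one requirement has no surviving provider. The component and service names are hypothetical.

```python
def failed_services(services, down):
    """Services for which at least one requirement has no surviving provider."""
    return {
        name
        for name, requirements in services.items()
        if any(all(p in down for p in providers) for providers in requirements)
    }


def single_points_of_failure(services):
    """Map each component whose sole loss fails a service to those services."""
    components = {
        p for reqs in services.values() for providers in reqs for p in providers
    }
    result = {}
    for component in sorted(components):
        hit = failed_services(services, {component})
        if hit:
            result[component] = sorted(hit)
    return result


# Hypothetical architecture: three redundant ground stations, one shared
# software stack, and one sole-source vendor for replacement hardware.
services = {
    "aviation_augmentation": [{"gs_a", "gs_b", "gs_c"}, {"nav_sw_stack"}],
    "maritime_augmentation": [{"gs_a", "gs_b", "gs_c"}, {"nav_sw_stack"}],
    "ground_station_repair": [{"vendor_x"}],
}

spofs = single_points_of_failure(services)
two_stations_down = failed_services(services, {"gs_a", "gs_b"})
```

In this toy model the redundant ground stations never appear as single points of failure, while the shared software stack and the sole-source vendor do, which is the pattern the paragraph above describes. A real analysis would also need to model partial degradation and time, which this sketch ignores.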

The fifth move is the comparative scorecard. Resilience scores are meaningful only in relation to alternatives or benchmarks. A single number — “this system has resilience score 7” — is not useful. A comparison — “this architecture has higher absorptive capacity than its predecessor but lower recovery capacity than a distributed alternative would” — is useful. The method is comparative by construction, and analysts who produce absolute scores have misapplied it.
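
The comparative construction can be made explicit by never reporting a rating on its own, only a verdict relative to a named baseline. A minimal sketch, assuming ordinal ratings from 1 (low) to 3 (high); the architectures and their ratings are invented for illustration.

```python
def compare(candidate, baseline):
    """Per-capacity verdict of the candidate relative to the baseline."""
    verdicts = {}
    for capacity, base in baseline.items():
        cand = candidate[capacity]
        if cand > base:
            verdicts[capacity] = "higher"
        elif cand < base:
            verdicts[capacity] = "lower"
        else:
            verdicts[capacity] = "equal"
    return verdicts


# Ordinal ratings, 1 = low, 3 = high (hypothetical values).
current = {"absorptive": 3, "adaptive": 2, "recovery": 1}
predecessor = {"absorptive": 2, "adaptive": 2, "recovery": 1}
distributed_alternative = {"absorptive": 3, "adaptive": 3, "recovery": 3}

vs_predecessor = compare(current, predecessor)
vs_alternative = compare(current, distributed_alternative)
```

The function returns words rather than numeric differences on purpose: ordinal ratings support "higher" and "lower" but not arithmetic.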

What distinguishes resilience analysis from neighbouring methods is the structural acceptance of failure combined with the decomposition into absorb, adapt, and recover capacities. Risk matrix assessment asks about probability and severity of events; threat modelling enumerates attack paths; disruption theory asks whether a different architecture would be better. Resilience analysis is the one method whose explicit question is how well the current architecture performs after the disruption has arrived.

The Method at Work: A Regional GNSS Augmentation System

Consider a regional GNSS augmentation service whose users include aviation, maritime, and precision-agriculture customers. The critical function is specified: position accuracy to a defined integrity level, continuously available within the service region, with a maximum tolerated outage duration. The scenario set is deliberately diverse: a kinetic disruption to a space-segment asset, a cyber compromise targeting a ground-segment element, a supply-chain interruption affecting a critical hardware replacement, an extended space weather event, and chronic workforce attrition in the operating authority.

The absorptive-capacity assessment produces favourable findings. The ground segment operates with triple redundancy across geographically distributed stations. The space segment maintains operational margin sufficient to survive the loss of one or two assets. The power and communications infrastructure is hardened. Initial impact of most scenarios is absorbed without visible service degradation.

The adaptive-capacity assessment is more mixed. The system’s failover between ground stations is manual rather than automated, producing a response lag measured in hours rather than minutes. Interoperability with allied augmentation services exists at a technical level but has not been exercised operationally, so the adaptive path is theoretical rather than rehearsed. The result is adaptive capacity that looks acceptable on paper and would likely prove uneven under pressure.

The recovery-capacity assessment surfaces the decisive finding. The critical hardware elements at the ground stations depend on a single vendor with a limited production line. Replacement lead time, in the scenario where a ground station is physically damaged or cyber-compromised beyond remediation, is measured in months rather than weeks. The service could absorb the initial event; it would struggle to adapt around it within the service-level window; and it would recover slowly because the supply chain that restores full capacity is brittle. The resilience scorecard reads: absorptive capacity high, adaptive capacity moderate, recovery capacity low.

The failure-propagation analysis confirms the concern. A scenario in which a cyber compromise disables one ground station and a concurrent supply-chain disruption delays replacement produces a cascading exposure: degraded service for a period long enough to impose material operational costs on aviation and maritime users, with knock-on regulatory and insurance consequences. The compound scenario is more likely than simple probability calculations suggest, because both disruptions can be triggered by the same upstream event — a sophisticated adversary might engineer both simultaneously.
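
The claim that the compound scenario is more likely than simple probability calculations suggest can be illustrated with a small common-cause model. The sketch assumes that a single upstream event, when it occurs, triggers both disruptions, and that otherwise each disruption occurs independently from its own causes. All probabilities are invented for illustration and carry no empirical weight.

```python
def marginal(p_common, p_own):
    """Probability of one disruption from either the common or its own cause."""
    return p_common + (1 - p_common) * p_own


def joint_naive(p_a, p_b):
    """Joint probability if the two disruptions were wrongly treated as independent."""
    return p_a * p_b


def joint_with_common_cause(p_common, p_a_own, p_b_own):
    """Joint probability when a shared upstream event triggers both disruptions."""
    return p_common + (1 - p_common) * p_a_own * p_b_own


# Hypothetical per-period probabilities.
p_common = 0.02   # upstream event (e.g. one adversary campaign) hitting both
p_cyber_own = 0.05
p_supply_own = 0.05

p_cyber = marginal(p_common, p_cyber_own)
p_supply = marginal(p_common, p_supply_own)

naive = joint_naive(p_cyber, p_supply)
true_joint = joint_with_common_cause(p_common, p_cyber_own, p_supply_own)
underestimate_factor = true_joint / naive
```

Under these assumed numbers the naive product understates the compound probability severalfold, and the gap widens as the independent causes become rarer relative to the common one.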

The analytical finding, and the value the method delivers, is that the system’s preventive posture is adequate and its resilience posture is structurally brittle. Further investment in hardening is subject to diminishing returns; investment in supply-chain diversification, automated failover, and exercised interoperability with allied services would produce much larger resilience gains per unit of resource. The recommendation is not “harden more” but “diversify and exercise.” That recommendation would not emerge from a traditional risk analysis, which is oriented toward prevention rather than performance-under-stress.

The scorecard itself becomes a shared analytical object. A downstream scenario-planning exercise consumes the finding about recovery weakness, branching on scenarios in which ground-station loss coincides with supply-chain disruption. A deterrence-escalation assessment references the same scorecard to ask whether the vulnerability creates an attractive target for an adversary seeking to impose disproportionate cost. A procurement review uses the scorecard as a decision criterion in choosing the next generation of ground-segment hardware. The resilience analysis is produced once and consumed repeatedly by methods whose questions it informs without duplicating.

Where It Holds, Where It Limps

Resilience analysis holds where the system is well-enough defined that its components, dependencies, and performance thresholds can be specified, and where the question is how it performs under disruption rather than whether disruption can be prevented. For critical space infrastructure whose degradation imposes broad costs, it is the analytical discipline that most reliably surfaces the brittleness hidden beneath preventive strength.

Its limits are significant.

Architecture detail required
The method requires detailed knowledge of system architecture, which may be unavailable for classified or proprietary systems. Resilience analyses of adversary systems are necessarily more speculative than analyses of one's own. State data gaps explicitly and scope findings to the information available.
Context-dependent
A system resilient against one class of disruption can be fragile against another. A resilience analysis that evaluates a single scenario generalises dangerously. The scenario set must be deliberately diverse, and a finding that the system is "resilient" without specifying against what is misleading.
Scores are comparative
A score of "high" means "high relative to the benchmark or alternative architecture under this scenario," not "high in an absolute sense." Analysts who present scores without this caveat invite complacency.
Complacency trap
A resilient-enough finding can be read as a finished verdict rather than as input to continuous improvement. Good practice re-runs the analysis periodically, especially when architectural changes or new threats emerge.
Not an architectural chooser
The method asks how the existing architecture performs; it does not ask whether a different architecture would be better. Disruption theory is the necessary complement: it asks which option is right, while resilience analysis evaluates how well the chosen option holds up under stress.
Chronic-stress blind spot
Resilience analysis instinctively reaches for acute disruption — kinetic attack, cyber intrusion — and underweights slow-onset pressures such as debris accumulation, market consolidation, or workforce erosion. These accumulate invisibly until a threshold is crossed; deliberately include at least one chronic scenario.

Resilience analysis pairs naturally with scenario planning (which uses the scorecard as a stress-test baseline), with deterrence-escalation analysis (which uses vulnerability findings as adversary-incentive inputs), with investment analysis (which uses the scorecard as a risk-adjustment factor), and with geopolitical risk frameworks (which supply the scenario inputs the method tests against).

A Note for the Practitioner