Building Resilient Systems for Critical Services

In 2022, when Hurricane Fiona devastated Puerto Rico, GiveDirectly did something remarkable. Using an AI system called "Delphi," they delivered emergency cash aid to 90% of applicants the same day they applied.

For organizations delivering critical services, this story offers inspiration and highlights a potential problem. AI can transform how quickly and effectively we serve communities in need. But what happens when the same disaster that creates urgent need also knocks out the infrastructure our AI systems depend on?

The Infrastructure Behind AI

AI systems rely on a foundation of technical and human infrastructure to do their work. A typical AI deployment depends on:

  • Servers and cloud computing resources

  • Data pipelines that feed information to and from the system

  • Network connectivity

  • Application programming interfaces (APIs) that allow different systems to communicate

  • Power grids and backup systems

  • Human oversight and intervention capabilities

Each component represents a potential failure point. When organizations discover these vulnerabilities, it's often at the worst possible moment: during a crisis when their services are needed urgently.

Building Systems That Degrade Gracefully

The National Institute of Standards and Technology (NIST) defines resilient AI systems as those that can "withstand unexpected adverse events or unexpected changes in their environment or use." For mission-driven organizations, this means building systems that bend without breaking.

Consider a crisis helpline using AI to route calls. Instead of an all-or-nothing approach, the system could operate in four progressively simpler modes:

  1. Primary mode: AI analyzes caller needs and routes them to the right counselor

  2. Degraded mode: Pre-programmed rules route calls based on user inputs (“press 1 for…”)

  3. Emergency mode: All calls go directly to the next available counselor

  4. Offline mode: Callers hear recorded information with callback options

Each level maintains the core mission—connecting people in crisis with help—even as capabilities decrease. But this approach requires planning, training, and implementation well in advance.
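As a rough illustration, here is what that fallback chain might look like in code. This is a minimal Python sketch; the AI client, rules engine, and queue objects are hypothetical placeholders rather than any particular product's API.

```python
# A minimal sketch of graceful degradation for call routing.
# The ai_client, rules_engine, and queue objects are hypothetical placeholders.
from enum import Enum


class Mode(Enum):
    PRIMARY = 1    # AI triage and routing
    DEGRADED = 2   # rule-based menu routing ("press 1 for...")
    EMERGENCY = 3  # everything goes to the next available counselor
    OFFLINE = 4    # recorded information with callback options


def route_call(call, ai_client, rules_engine, queue):
    """Try each mode in order, falling back when a layer is unavailable."""
    try:
        counselor = ai_client.triage(call.transcript)          # Primary mode
        return queue.assign(call, counselor), Mode.PRIMARY
    except Exception:
        pass
    try:
        counselor = rules_engine.route(call.menu_selection)    # Degraded mode
        return queue.assign(call, counselor), Mode.DEGRADED
    except Exception:
        pass
    if queue.has_available_counselor():                        # Emergency mode
        return queue.assign_next_available(call), Mode.EMERGENCY
    return queue.play_recording_and_offer_callback(call), Mode.OFFLINE
```

The point isn't the specific code; it's that each fallback path has to exist, be documented, and be rehearsed before the outage that forces you into it.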

Practical Strategies for Building Resilience

Create Offline Alternatives

For any AI-enhanced workflow, document what happens when the AI isn't available. Can workers override the system and do the work manually? Do they know how? Have they practiced? These questions need answers before an outage.

For organizations using large language models, something as simple as training staff to periodically download important conversations and project files can prevent work loss during outages.
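To make that concrete, here is a small Python sketch that saves an exported conversation to a dated local file. The folder name, file format, and example content are purely illustrative.

```python
# A minimal sketch: archive exported conversation text to dated local files
# so work survives a cloud outage. Paths and filenames are illustrative.
import json
from datetime import date
from pathlib import Path

BACKUP_DIR = Path("ai_conversation_backups")  # illustrative location


def archive_conversation(title: str, messages: list[dict]) -> Path:
    """Write one conversation to a dated JSON file and return its path."""
    BACKUP_DIR.mkdir(exist_ok=True)
    safe_title = "".join(c if c.isalnum() else "_" for c in title)
    out_path = BACKUP_DIR / f"{date.today()}_{safe_title}.json"
    out_path.write_text(json.dumps(messages, indent=2, ensure_ascii=False))
    return out_path


if __name__ == "__main__":
    archive_conversation(
        "grant proposal draft",
        [{"role": "user", "content": "Help me outline the budget section."}],
    )
```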

Diversify Your Vendors

While maintaining subscriptions to multiple AI services isn't always financially feasible, pay-per-use options can provide crucial backup access. For organizations using APIs, setting up protocols to switch between providers during outages—though technically complex—can mean the difference between service interruption and continuity.

Services like OpenRouter provide access to multiple AI models with pay-as-you-go pricing, offering a practical emergency backup option for organizations dependent on LLMs.
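As one way to picture provider failover, the rough Python sketch below uses the OpenAI client library and falls back to OpenRouter's OpenAI-compatible endpoint when the primary provider errors out. The model names, environment variables, and retry policy are illustrative assumptions, not recommendations.

```python
# A rough sketch of API-level failover: try the primary provider first, then
# fall back to OpenRouter's OpenAI-compatible endpoint. Model names and
# environment variables are illustrative.
import os

from openai import OpenAI, OpenAIError

primary = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
backup = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


def complete(prompt: str) -> str:
    """Return a completion from the primary provider, or the backup on failure."""
    for client, model in (
        (primary, "gpt-4o-mini"),                  # primary provider
        (backup, "anthropic/claude-3.5-sonnet"),   # pay-as-you-go backup
    ):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except OpenAIError:
            continue  # try the next provider
    raise RuntimeError("All configured AI providers are unavailable")
```

Even this simple pattern forces useful questions: which provider is "primary," who holds the backup API key, and how staff find out that a failover happened.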

Negotiate Clear Service Level Agreements

Your service level agreement (SLA) with AI vendors should specify more than just uptime percentages. Consider including:

  • Response time expectations

  • Error rate thresholds

  • Support response times during outages

  • Clear escalation procedures

  • Compensation or remedies for service failures

These agreements also reveal failure modes beyond simple outages. How would your services be affected if the AI system became slow rather than completely unavailable? Planning for partial failures is just as important as planning for complete outages.
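As a sketch of what watching for partial failures could look like, the snippet below times each AI call and flags the service as degraded when recent calls repeatedly breach a response-time threshold. The threshold and window size are illustrative and would come from your actual SLA.

```python
# A simple sketch of detecting partial failures: record the latency of each
# AI call and flag the service as degraded when recent calls keep breaching
# the SLA threshold. Threshold and window size are illustrative.
import time
from collections import deque

SLA_MAX_SECONDS = 5.0        # illustrative response-time expectation
WINDOW = 10                  # number of recent calls to consider
recent_latencies: deque[float] = deque(maxlen=WINDOW)


def timed_ai_call(ai_call, *args, **kwargs):
    """Run the AI call and record how long it took."""
    start = time.monotonic()
    result = ai_call(*args, **kwargs)
    recent_latencies.append(time.monotonic() - start)
    return result


def is_degraded() -> bool:
    """True when more than half of recent calls exceeded the SLA threshold."""
    if len(recent_latencies) < WINDOW:
        return False
    slow_calls = sum(1 for t in recent_latencies if t > SLA_MAX_SECONDS)
    return slow_calls > WINDOW // 2
```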

Monitor for Model Drift

AI models can become less accurate over time as the world changes around them: a phenomenon called "model drift." A benefits eligibility model trained before an economic crisis might make increasingly poor decisions as unemployment patterns shift.

Establish regular testing procedures with clear performance thresholds. When drift is detected, have alternative processes ready while the issue is addressed. This might mean temporarily reverting to manual review or using simpler rule-based systems.
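Here is a minimal sketch of what such a test could look like: score the model against recently reviewed, human-labeled cases and flag when accuracy slips below your baseline. The threshold and the model's predict interface are placeholders.

```python
# A minimal drift check: score the model on recently reviewed, human-labeled
# cases and flag when accuracy drops below a set threshold. The threshold and
# the model's predict() interface are placeholders.

ACCURACY_THRESHOLD = 0.90  # illustrative; set from your baseline performance


def check_for_drift(model, labeled_cases) -> bool:
    """Return True if performance on recent labeled cases falls below threshold."""
    if not labeled_cases:
        return False
    correct = sum(
        1 for case in labeled_cases if model.predict(case["input"]) == case["label"]
    )
    accuracy = correct / len(labeled_cases)
    if accuracy < ACCURACY_THRESHOLD:
        # In practice: alert staff and switch to manual review or rules-based backup.
        return True
    return False
```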

Prepare for Adversarial Attacks

AI systems face unique security challenges. Attackers can manipulate inputs to make systems behave incorrectly, hide malicious instructions in content the system processes, poison training data to introduce harmful patterns, or craft prompts that cause language models to ignore safety instructions (we can talk more about adversarial attacks in a future post if folks are interested). For organizations serving vulnerable populations or working in controversial areas, these risks multiply.

Work with security experts to understand your vulnerabilities. Train staff, especially those handling high-stakes decisions, to recognize anomalous AI behavior. Treat AI as a support tool, not an infallible oracle.
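As a deliberately simple illustration of where a "flag for human review" step might sit in a workflow, the sketch below screens user-submitted text for obvious prompt-injection phrasing. Pattern matching like this is not real protection on its own; it's a starting point for a conversation with your security experts.

```python
# An illustrative (and deliberately simple) screen for obvious prompt-injection
# phrasing in user-submitted text. Pattern matching alone is not adequate
# protection; it only shows where a human-review flag could fit.
import re

SUSPICIOUS_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard .* (rules|guidelines)",
    r"reveal (the |your )?(system prompt|hidden instructions)",
]


def flag_for_review(user_text: str) -> bool:
    """Return True when the text matches a known injection pattern."""
    lowered = user_text.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS)
```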

Planning for the Humans in the Loop

Technology resilience is only part of the equation. Human factors often determine whether a degraded system continues serving its mission or fails completely:

  • Training: Staff need to understand both normal operations and fallback procedures

  • Documentation: Emergency procedures need to be accessible even when primary systems are down

  • Communication: Clear channels for alerting staff and service users about degraded functionality

  • Decision authority: Designated people who can make quick decisions about switching to backup modes

The Path Forward

AI offers transformative potential for organizations serving critical needs, but realizing that potential requires planning in advance to mitigate the risks of relying on all that infrastructure.

Building resilient AI implementations requires acknowledging these dependencies from the start: designing systems that fail gracefully rather than catastrophically, maintaining human capabilities alongside automated ones, and testing not just whether systems work, but how they fail.

By planning for infrastructure dependencies and building systems designed to degrade gracefully, organizations can harness AI's benefits while maintaining their commitment to serving communities when they need it most urgently.

LLM disclosure:
I asked Claude Opus 4.1 to draft this based on a chapter from my upcoming book.
