Building Resilient Systems for Critical Services
In 2022, when Hurricane Fiona devastated Puerto Rico, GiveDirectly did something remarkable. Using an AI system called "Delphi," they delivered emergency cash aid to 90% of applicants the same day they applied.
For organizations delivering critical services, this story offers inspiration and highlights a potential problem. AI can transform how quickly and effectively we serve communities in need. But what happens when the same disaster that creates urgent need also knocks out the infrastructure our AI systems depend on?
The Infrastructure Behind AI
AI systems rely on a foundation of technical and human infrastructure to do their work. A typical AI deployment depends on:
Servers and cloud computing resources
Data pipelines that feed information to and from the system
Network connectivity
Application programming interfaces (APIs) that allow different systems to communicate
Power grids and backup systems
Human oversight and intervention capabilities
Each component represents a potential failure point. When organizations discover these vulnerabilities, it's often at the worst possible moment: during a crisis when their services are needed urgently.
Building Systems That Degrade Gracefully
The National Institute of Standards and Technology (NIST) defines resilient AI systems as those that can "withstand unexpected adverse events or unexpected changes in their environment or use." For mission-driven organizations, this means building systems that bend without breaking.
Consider a crisis helpline using AI to route calls. Instead of an all-or-nothing approach, the system could operate in four progressively simpler modes:
Primary mode: AI analyzes caller needs and routes them to the right counselor
Degraded mode: Pre-programmed rules route calls based on user inputs ("press 1 for...")
Emergency mode: All calls go directly to the next available counselor
Offline mode: Callers hear recorded information with callback options
Each level maintains the core mission—connecting people in crisis with help—even as capabilities decrease. But this approach requires planning, training, and implementation well in advance.
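To make the idea concrete, here is a minimal Python sketch of that kind of tiered fallback. Every helper function is a hypothetical stand-in for a real integration (your AI vendor, your phone system); the point is only the structure of trying each mode in order and dropping to a simpler one when a layer fails.

```python
# Minimal sketch of tiered fallback for an AI-assisted call router.
# All helpers are hypothetical stand-ins for real integrations.

import logging

def ai_route(call_id: str) -> str:
    # Stand-in for a call to your AI routing service.
    raise ConnectionError("AI service unreachable (simulated outage)")

def ivr_route(call_id: str) -> str:
    # Stand-in for a rule-based phone menu ("press 1 for ...").
    return f"{call_id}: routed by IVR menu"

def next_available_counselor(call_id: str) -> str:
    return f"{call_id}: connected to next available counselor"

def recorded_message(call_id: str) -> str:
    return f"{call_id}: played recorded information with callback option"

def route_call(call_id: str) -> str:
    """Try each mode in order, falling back when a layer is unavailable."""
    for handler in (ai_route, ivr_route, next_available_counselor, recorded_message):
        try:
            return handler(call_id)
        except Exception as exc:
            logging.warning("Mode %s failed (%s); falling back", handler.__name__, exc)
    raise RuntimeError("All routing modes failed")

print(route_call("demo-001"))
```

The same pattern applies beyond call routing: any AI-dependent step in a workflow can be wrapped in an ordered list of progressively simpler fallbacks.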
Practical Strategies for Building Resilience
Create Offline Alternatives
For any AI-enhanced workflow, document what happens when the AI isn't available. Can workers override the system and do the work manually? Do they know how? Have they practiced? These questions need answers before an outage.
For organizations using large language models, something as simple as training staff to periodically download important conversations and project files can prevent work loss during outages.
Diversify Your Vendors
While maintaining subscriptions to multiple AI services isn't always financially feasible, pay-per-use options can provide crucial backup access. For organizations using APIs, setting up protocols to switch between providers during outages—though technically complex—can mean the difference between service interruption and continuity.
Services like OpenRouter provide access to multiple AI models with pay-as-you-go pricing, offering a practical emergency backup option for organizations dependent on LLMs.
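As a hedged illustration of what provider switching can look like at the API level, the sketch below tries a primary provider first and falls back to OpenRouter's OpenAI-compatible endpoint when the call fails. The model names, environment variables, and the use of the openai Python client are assumptions to adapt to your own stack.

```python
# Rough sketch of failover between two OpenAI-compatible chat endpoints.
# Model names, environment variables, and provider order are illustrative assumptions.

import os
from openai import OpenAI  # pip install openai

PROVIDERS = [
    # (base_url, API-key environment variable, model name)
    ("https://api.openai.com/v1", "OPENAI_API_KEY", "gpt-4o-mini"),
    ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY", "anthropic/claude-3.5-sonnet"),
]

def ask_with_failover(prompt: str) -> str:
    last_error = None
    for base_url, key_var, model in PROVIDERS:
        try:
            client = OpenAI(base_url=base_url, api_key=os.environ[key_var], timeout=30)
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception as exc:  # missing key, network error, rate limit, outage
            last_error = exc
    raise RuntimeError(f"All providers failed; last error: {last_error}")

print(ask_with_failover("Summarize today's intake notes in three bullet points."))
```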
Negotiate Clear Service Level Agreements
Your service level agreement (SLA) with AI vendors should specify more than just uptime percentages. Consider including:
Response time expectations
Error rate thresholds
Support response times during outages
Clear escalation procedures
Compensation or remedies for service failures
Negotiating these agreements also surfaces failure modes beyond simple outages. How would your services be affected if the AI system became slow rather than completely unavailable? Planning for partial failures is just as important as planning for complete outages.
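One way to make those terms actionable, and to catch the "slow but not down" case, is to measure performance yourself. Here is a rough sketch of a rolling monitor for latency and error rate; the window size and thresholds are placeholder assumptions that should come from your actual SLA.

```python
# Sketch of a rolling latency / error-rate monitor for AI API calls.
# Window size and thresholds are illustrative; align them with your SLA.

import time
from collections import deque

class SlaMonitor:
    def __init__(self, window: int = 100, max_p95_seconds: float = 5.0,
                 max_error_rate: float = 0.05):
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)
        self.max_p95_seconds = max_p95_seconds
        self.max_error_rate = max_error_rate

    def record(self, func, *args, **kwargs):
        """Run an AI call, recording how long it took and whether it failed."""
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            self.errors.append(0)
            return result
        except Exception:
            self.errors.append(1)
            raise
        finally:
            self.latencies.append(time.monotonic() - start)

    def degraded(self) -> bool:
        """True if observed performance breaches the configured thresholds."""
        if not self.latencies:
            return False
        p95 = sorted(self.latencies)[int(0.95 * (len(self.latencies) - 1))]
        error_rate = sum(self.errors) / len(self.errors)
        return p95 > self.max_p95_seconds or error_rate > self.max_error_rate
```

When degraded() starts returning True, that is the signal to escalate with the vendor and to switch to one of your fallback modes.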
Monitor for Model Drift
AI models can become less accurate over time as the world changes around them: a phenomenon called "model drift." A benefits eligibility model trained before an economic crisis might make increasingly poor decisions as unemployment patterns shift.
Establish regular testing procedures with clear performance thresholds. When drift is detected, have alternative processes ready while the issue is addressed. This might mean temporarily reverting to manual review or using simpler rule-based systems.
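Sketched loosely, "regular testing with clear thresholds" can look like this: keep a labeled evaluation set that you refresh as conditions change, re-score the model on it on a schedule, and trigger the fallback process when accuracy drops below an agreed floor. The accuracy floor, the model's predict interface, and the notification hook below are all assumptions.

```python
# Sketch of a periodic drift check against a refreshed, labeled evaluation set.
# The accuracy floor, predict() interface, and alert hook are placeholder assumptions.

ACCURACY_FLOOR = 0.85

def notify_team(message: str) -> None:
    print(message)  # replace with your paging, email, or chat integration

def check_for_drift(model, eval_cases) -> bool:
    """eval_cases: list of (inputs, expected_label) pairs, refreshed regularly."""
    correct = sum(1 for inputs, expected in eval_cases
                  if model.predict(inputs) == expected)
    accuracy = correct / len(eval_cases)
    if accuracy < ACCURACY_FLOOR:
        # Time to switch to the documented fallback: manual review or rule-based logic.
        notify_team(f"Possible model drift: accuracy {accuracy:.1%} "
                    f"is below the {ACCURACY_FLOOR:.0%} floor")
        return True
    return False
```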
Prepare for Adversarial Attacks
AI systems face unique security challenges. Attackers can manipulate inputs to make systems behave incorrectly, plant prompts surreptitiously, poison training data to introduce harmful patterns, or craft prompts that cause language models to ignore safety instructions (we can talk more about adversarial attacks in a future post if folks are interested). For organizations serving vulnerable populations or working in controversial areas, these risks multiply.
Work with security experts to understand your vulnerabilities. Train staff, especially those handling high-stakes decisions, to recognize anomalous AI behavior. Treat AI as a support tool, not an infallible oracle.
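One concrete habit that supports this: never let model output flow straight into a high-stakes action. The rough sketch below routes anything anomalous (an unusually long response, a phrase that echoes a prompt-injection attempt, or a confidence score under a floor) to a human review queue. The specific checks and thresholds are illustrative assumptions, not a real defense on their own.

```python
# Sketch of a "human in the loop" guardrail for AI output in high-stakes flows.
# The specific checks and thresholds are illustrative assumptions.

SUSPICIOUS_PHRASES = ("ignore previous instructions", "disregard the above")

def needs_human_review(ai_output: str, confidence: float) -> bool:
    text = ai_output.lower()
    if confidence < 0.7:                      # model unsure of its own answer
        return True
    if any(phrase in text for phrase in SUSPICIOUS_PHRASES):
        return True                           # possible prompt-injection echo
    if len(ai_output) > 4000:                 # wildly atypical output length
        return True
    return False

def handle(ai_output: str, confidence: float) -> str:
    if needs_human_review(ai_output, confidence):
        return "queued for staff review"
    return "proceed with AI-assisted workflow"
```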
Planning for the Humans in the Loop
Technology resilience is only part of the equation. Human factors often determine whether a degraded system continues serving its mission or fails completely:
Training: Staff need to understand both normal operations and fallback procedures
Documentation: Emergency procedures need to be accessible even when primary systems are down
Communication: Clear channels for alerting staff and service users about degraded functionality
Decision authority: Designated people who can make quick decisions about switching to backup modes
The Path Forward
AI offers transformative potential for organizations serving critical needs, but realizing that potential requires planning ahead for the risks that come with depending on so much infrastructure.
Building resilient AI implementations requires acknowledging these dependencies from the start: designing systems that fail gracefully rather than catastrophically, maintaining human capabilities alongside automated ones, and testing not just whether systems work, but how they fail.
By planning for infrastructure dependencies and building systems designed to degrade gracefully, organizations can harness AI's benefits while maintaining their commitment to serving communities when they need it most urgently.
LLM disclosure:
I asked Claude Opus 4.1 to draft this based on a chapter from my upcoming book.

