Failure Mode Planning
Design for the failures you expect, not the happy path you hope for. Three categories: slow, broken, wrong.
How It Works
Failure mode planning means enumerating what can go wrong before production does it for you. Failures fall into three categories:
(1) Slow: a dependency degrades but does not fail (latency spikes, partial responses). Mitigate with timeouts, circuit breakers, and load shedding.
(2) Broken: a dependency returns errors or is unreachable. Mitigate with retries with backoff, fallback data sources, and graceful degradation.
(3) Wrong: a dependency returns bad data (corrupted, malicious, or stale). Mitigate with validation, checksumming, and poison-pill detection.
The anti-pattern is designing only for category (2) with a retry loop and calling the system resilient. In interviews, for every critical path in your design, explicitly name its slow, broken, and wrong failure modes and how you would handle each.
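As a rough illustration of the three categories, here is a minimal Python sketch with one defense per category: a timeout for slow, retry with exponential backoff for broken, and a checksum check for wrong. All function names (`call_with_timeout`, `retry_with_backoff`, `verify_checksum`, `fetch_resilient`) and default values are hypothetical, chosen for this example. It assumes the caller already knows the expected SHA-256 digest of the payload and has a fallback value (for example, cached data) to degrade to.

```python
import concurrent.futures
import hashlib
import time


def call_with_timeout(fn, seconds):
    """Slow: stop waiting after `seconds` instead of hanging on a degraded dependency."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        # Raises concurrent.futures.TimeoutError if fn has not finished in time.
        return pool.submit(fn).result(timeout=seconds)
    finally:
        # Do not block on a stuck worker; it finishes in the background.
        pool.shutdown(wait=False)


def retry_with_backoff(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Broken: retry connection errors, doubling the delay between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller fall back
            sleep(base_delay * 2 ** attempt)


def verify_checksum(body, expected_sha256):
    """Wrong: reject a payload whose SHA-256 digest does not match the expected one."""
    if hashlib.sha256(body).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    return body


def fetch_resilient(fetch, expected_sha256, fallback, timeout=1.0, sleep=time.sleep):
    """Apply all three defenses; degrade to `fallback` if any of them gives up."""
    try:
        body = retry_with_backoff(
            lambda: call_with_timeout(fetch, timeout), sleep=sleep
        )
        return verify_checksum(body, expected_sha256)
    except (ConnectionError, concurrent.futures.TimeoutError, ValueError):
        return fallback
```

In this sketch only connection errors are retried. A timeout goes straight to the fallback, on the assumption that retrying an already slow dependency mostly adds load to it. A production version would also need a circuit breaker and a real cancellation mechanism, since the timed-out call here keeps running in its worker thread.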
Real-World Example
AWS's Well-Architected Framework devotes a whole pillar, Reliability, to this kind of planning. Netflix's Chaos Monkey deliberately terminates production instances during business hours to force engineers to handle category-(2) failures in advance. Later tools in the same family go further: Chaos Gorilla simulates the loss of an entire availability zone, and Chaos Kong simulates the loss of an entire region. The cultural shift: if you haven't rehearsed a failure mode, you've assumed it away, and assumptions rarely survive production. A design review that asks "what happens when X is slow, broken, or returning bad data?" can catch problems that a code review of the happy path tends to miss.
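Rehearsal does not have to start in production. A minimal sketch, assuming a hypothetical test helper called `FaultInjector` (not part of Chaos Monkey or any real tool), wraps a dependency call and injects each of the three failure modes on demand so a test suite can exercise the handling code:

```python
import time


class FaultInjector:
    """Wraps a callable and injects slow / broken / wrong behavior (hypothetical helper)."""

    def __init__(self, fn, mode=None, delay=0.05, sleep=time.sleep):
        self.fn = fn
        self.mode = mode      # None, "slow", "broken", or "wrong"
        self.delay = delay    # seconds of latency added in "slow" mode
        self.sleep = sleep    # injectable so tests need not really wait

    def __call__(self, *args, **kwargs):
        if self.mode == "broken":
            # Simulate an unreachable dependency without calling it.
            raise ConnectionError("injected failure")
        if self.mode == "slow":
            # Simulate a latency spike before the real call.
            self.sleep(self.delay)
        result = self.fn(*args, **kwargs)
        if self.mode == "wrong":
            # Simulate bad data by replacing the real response.
            return {"corrupted": True}
        return result
```

A test would set `mode` to each value in turn and assert that the caller times out, falls back, or rejects the response as intended.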
Test Yourself
Scenario: You're designing a service that calls an external payment processor. Enumerate the slow, broken, and wrong failure modes and how you'd handle each.