Scaling strategy

Failure Mode Planning

Design for the failures you expect, not the happy path you hope for. Three categories: slow, broken, wrong.

How It Works

Failure mode planning means enumerating what can go wrong before production does it for you. Three categories:

(1) Slow: dependencies degrade but don't fail (latency spikes, partial responses). Mitigate with timeouts, circuit breakers, and load shedding.

(2) Broken: dependencies return errors or are unreachable. Mitigate with retries with backoff, fallback data sources, and graceful degradation.

(3) Wrong: dependencies return bad data (corrupted, malicious, stale). Mitigate with validation, checksumming, and poison-pill detection.

The anti-pattern is designing only for category (2) with a retry loop and calling it resilient. In interviews, for every critical path in your design, explicitly name its slow, broken, and wrong failure modes and how you'd handle each.
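To make the three categories concrete, here is a minimal Python sketch with one mitigation per category. It is an illustration under stated assumptions, not a production library: the names CircuitBreaker, CircuitOpenError, retry_with_backoff, and verify_payload are hypothetical, the wrapped call is assumed to enforce its own timeout and raise TimeoutError when it is too slow, and the sender is assumed to publish a SHA-256 digest alongside each payload.

```python
import hashlib
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    """Slow: stop calling a dependency that keeps timing out or failing."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry_with_backoff(fn, attempts=3, base_delay=0.1, sleep=time.sleep,
                       retry_on=(ConnectionError, TimeoutError)):
    """Broken: retry transient errors with exponential backoff, then give up."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller fall back or degrade
            sleep(base_delay * 2 ** attempt)


def verify_payload(body: bytes, expected_sha256: str) -> bytes:
    """Wrong: reject data whose checksum differs from the declared digest."""
    actual = hashlib.sha256(body).hexdigest()
    if actual != expected_sha256:
        raise ValueError("checksum mismatch; refusing to use payload")
    return body
```

The three helpers are deliberately separate because they answer different questions: whether to call at all, whether to call again, and whether to trust the answer. A real deployment would likely add random jitter to the backoff delay so that many clients do not retry in lockstep.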

Real-World Example

AWS covers this at length in the Reliability pillar of its Well-Architected Framework. Netflix, which runs on AWS, built Chaos Monkey, a tool that deliberately terminates production instances during business hours to force engineers to handle category-(2) failures in advance. Later Netflix tools widened the blast radius: Chaos Gorilla simulates the loss of an entire availability zone, and Chaos Kong simulates the loss of an entire region. The cultural shift: if you haven't rehearsed a failure mode, you've assumed it away, and assumptions rarely survive production. Design reviews that ask "what happens when X is slow, broken, or returning bad data?" catch failure modes that code review alone tends to miss.
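The same rehearsal idea can be tried at unit-test scale, long before anything is terminated in production. The sketch below is a hypothetical fault-injection wrapper, not how Chaos Monkey itself is implemented: FaultInjector and get_price are invented names, and the "wrong" response is an arbitrary malformed payload chosen for illustration.

```python
import random


class FaultInjector:
    """Wraps a dependency call and injects slow / broken / wrong behaviour."""

    def __init__(self, fn, mode=None, rate=1.0, rng=None):
        self.fn = fn
        self.mode = mode  # None, "slow", "broken", or "wrong"
        self.rate = rate  # fraction of calls that receive the fault
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.mode is None or self.rng.random() >= self.rate:
            return self.fn(*args, **kwargs)
        if self.mode == "slow":
            # Modelled as the timeout the caller would eventually hit.
            raise TimeoutError("injected: dependency exceeded its deadline")
        if self.mode == "broken":
            raise ConnectionError("injected: dependency unreachable")
        if self.mode == "wrong":
            return {"price": "not-a-number"}  # well-formed but invalid
        raise ValueError(f"unknown fault mode: {self.mode}")


def get_price(fetch):
    """Hypothetical caller: returns None (degraded) instead of crashing."""
    try:
        data = fetch()
    except (TimeoutError, ConnectionError):
        return None  # slow or broken: degrade gracefully
    price = data.get("price")
    if not isinstance(price, int) or price < 0:
        return None  # wrong: refuse to trust invalid data
    return price
```

Running the caller once per mode gives a quick check that each of the three categories has a handled path rather than an assumed one.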

Test Yourself

Scenario: You're designing a service that calls an external payment processor. Enumerate the slow, broken, and wrong failure modes and how you'd handle each.
