SPOF Detection
Find the single point of failure — especially the hidden ones nobody thinks about.
How It Works
A SPOF (single point of failure) is any component whose death takes the whole system down. Obvious SPOFs — like a single database or a single load balancer — get caught in design reviews. Hidden SPOFs are the dangerous ones: DNS, configuration services, secret stores, CI/CD pipelines, shared caches, leader nodes in a cluster, monitoring dashboards. The test: trace through each component in your design and ask "if this one thing dies, what else dies with it?" The answer should be a bounded failure domain (only this one feature breaks), not "everything."
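The "if this one thing dies, what else dies with it?" test can be sketched as a small script. This is a minimal illustration, not a real tool: the component names, the DEPENDS_ON map, and the FEATURES set are all hypothetical. It assumes each component lists its requirements as groups of interchangeable providers, and that a component stays up only while every group has at least one live member.

```python
from typing import Dict, List, Set

# Hypothetical design. Each component maps to a list of requirement groups;
# a group is a set of interchangeable providers (any one of them is enough).
DEPENDS_ON: Dict[str, List[Set[str]]] = {
    "checkout": [{"app-a", "app-b"}],
    "search": [{"app-a", "app-b"}, {"cache"}],
    "app-a": [{"db-primary", "db-replica"}, {"config-service"}],
    "app-b": [{"db-primary", "db-replica"}, {"config-service"}],
    "cache": [],
    "db-primary": [],
    "db-replica": [],
    "config-service": [{"dns"}],
    "dns": [],
}

# Hypothetical user-facing features: losing all of them means "everything" is down.
FEATURES: Set[str] = {"checkout", "search"}


def blast_radius(dead: str, deps: Dict[str, List[Set[str]]]) -> Set[str]:
    """Return every other component that fails once `dead` is gone."""
    failed = {dead}
    changed = True
    while changed:  # repeat until no new failures cascade
        changed = False
        for name, groups in deps.items():
            if name in failed:
                continue
            # A component fails when some group has no live member left.
            if any(group <= failed for group in groups):
                failed.add(name)
                changed = True
    return failed - {dead}


def find_spofs(deps: Dict[str, List[Set[str]]], features: Set[str]) -> List[str]:
    """Components whose death alone takes down every user-facing feature."""
    return sorted(
        name
        for name in deps
        if name not in features and features <= blast_radius(name, deps)
    )


print(sorted(blast_radius("cache", DEPENDS_ON)))  # ['search']
print(find_spofs(DEPENDS_ON, FEATURES))           # ['config-service', 'dns']
```

In this made-up design the redundant app servers and database pair have an empty blast radius, and the cache has a bounded one (only search breaks). The config service and DNS take down everything, which is the hidden-SPOF pattern described above.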
Real-World Example
The February 2017 AWS S3 outage in us-east-1 disrupted a large share of the internet, including many teams' status pages, incident response dashboards, and CI/CD pipelines, because those tools were themselves hosted on S3 or depended on it. AWS's own service health dashboard could not be updated for part of the incident for the same reason. The lesson: your disaster recovery tooling cannot depend on the thing you're recovering from. Netflix specifically runs its incident response tools on a stack fully independent of its production AWS region.
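One way to check that lesson against your own setup is to walk the transitive dependencies of each recovery tool and flag any that reach production. The sketch below assumes a hand-maintained inventory; TOOL_DEPS, PRODUCTION, and every name in them are hypothetical placeholders, not a description of any real company's stack.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical inventory: tool or service -> what it directly depends on.
TOOL_DEPS: Dict[str, Set[str]] = {
    "status-page": {"static-host"},
    "static-host": {"object-storage"},
    "pager": {"third-party-paging"},
    "runbooks": {"wiki"},
    "wiki": {"prod-database"},
    "object-storage": set(),
    "third-party-paging": set(),
    "prod-database": set(),
}

# Hypothetical set of systems you might be recovering from.
PRODUCTION: Set[str] = {"object-storage", "prod-database"}


def transitive_deps(name: str, deps: Dict[str, Set[str]]) -> Set[str]:
    """Everything `name` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for dep in deps.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def tools_at_risk(
    tools: Iterable[str], production: Set[str], deps: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each recovery tool to the production systems it would go down with."""
    report: Dict[str, List[str]] = {}
    for tool in tools:
        overlap = transitive_deps(tool, deps) & production
        if overlap:
            report[tool] = sorted(overlap)
    return report


print(tools_at_risk(["status-page", "pager", "runbooks"], PRODUCTION, TOOL_DEPS))
# {'status-page': ['object-storage'], 'runbooks': ['prod-database']}
```

Here the pager is safe because it reaches nothing in production, while the status page and the runbooks would disappear in exactly the incidents they exist for.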
Test Yourself
Scenario: A fintech company runs a mobile banking app. The architecture has multi-region active-active app servers, a multi-AZ Postgres with automatic failover, a Redis cluster with 3 replicas, and a CDN in front of static assets. The on-call lead says "we have no SPOFs." What hidden SPOFs are you most suspicious of, and how would you test them?