Bottleneck analysis

Tail Latency Reasoning


Reason about p95/p99, not just averages — tail latency is what users actually feel.

How It Works

Tail latency behaves very differently from median latency in composed systems. Three common amplifiers: service composition (a request that fans out to 10 microservices in parallel must wait for the slowest response, so the chance of hitting at least one service's tail grows with the fan-out, and the overall p99 ends up well beyond any single service's p99), shared-resource contention (garbage collection pauses, cold caches (see Caching), and network retries slow requests down in unpredictable bursts), and cold-path rarity (the 1% of requests that hit a cache miss plus a lock wait plus an index scan all at once). In interviews, always state latency targets as "p50 = X ms, p99 = Y ms"; a vague "fast on average" answer misses the tail entirely.
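The service-composition effect can be put into numbers with a short sketch. It assumes every downstream call is independent and exceeds its own p99 exactly 1% of the time, which real systems only approximate; the function name `p_any_slow` is made up for illustration.

```python
def p_any_slow(fanout: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent parallel calls
    lands in its slow tail (each call is slow with probability p_slow)."""
    return 1.0 - (1.0 - p_slow) ** fanout


for n in (1, 10, 100):
    share = p_any_slow(n)
    print(f"fan-out {n:>3}: {share:.1%} of requests wait on at least one slow call")
```

Under these assumptions a fan-out of 10 turns a 1-in-100 event per service into roughly a 1-in-10 event per request, which is why a per-service p99 says little about the p99 of the composed request.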

Real-World Example

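Google's "The Tail at Scale" paper (Dean and Barroso, 2013) showed how fan-out magnifies tail latency: if each leaf server is slow on only 1 request in 100, a request that must wait on 100 such servers is slow about 63% of the time. One mitigation the paper describes is hedged requests: send the query to one replica, and if no reply arrives within a short delay (the paper suggests the expected p95 latency), send the same query to a second replica, take whichever response returns first, and cancel the other. Deferring the hedge this way limits the extra load to about 5% while substantially shortening the tail; in one BigTable benchmark reported in the paper, hedging after a 10 ms delay cut p99.9 latency from 1,800 ms to 74 ms while sending about 2% more requests. Firing both copies immediately also shortens the tail, but it roughly doubles the load.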
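A hedged request can be sketched in a few lines of asyncio. This is a minimal illustration, not Google's implementation: `hedged_call`, `fake_replica`, and the delay values are hypothetical, and the sketch hedges at most once and does not handle a failed first attempt.

```python
import asyncio
from typing import Any, Callable, Coroutine, Sequence


async def hedged_call(
    attempts: Sequence[Callable[[], Coroutine[Any, Any, Any]]],
    hedge_delay: float,
) -> Any:
    """Start attempts[0]; if it is still running after hedge_delay seconds,
    start attempts[1] as well and return whichever finishes first."""
    primary = asyncio.create_task(attempts[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        # Common case: the primary answered in time, so no extra load is sent.
        return primary.result()

    backup = asyncio.create_task(attempts[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the loser so it stops consuming resources
    return done.pop().result()


async def fake_replica(name: str, delay: float) -> str:
    """Stand-in for a network call to a replica (hypothetical)."""
    await asyncio.sleep(delay)
    return name


async def demo() -> str:
    # The first replica is stuck in its tail (0.5 s); the second is healthy.
    attempts = [
        lambda: fake_replica("slow", 0.5),
        lambda: fake_replica("fast", 0.01),
    ]
    return await hedged_call(attempts, hedge_delay=0.05)


print(asyncio.run(demo()))  # prints "fast"
```

The hedge delay is the tuning knob: setting it near the observed p95 means only about 1 request in 20 triggers a second attempt, which is where the small load overhead comes from.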

Test Yourself

Scenario: A product detail page on an e-commerce site fans out to 8 microservices in parallel (catalog, pricing, inventory, reviews, recommendations, shipping, promotions, personalization). Each service has p50 = 20ms, p99 = 200ms. Page p99 is measured at ~1.1s. Explain the math and name the fix.
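One way to set up the arithmetic for this scenario is sketched below. It assumes the 8 calls are independent and that the page waits for all of them; the helper name `leaf_quantile_for_page` is invented for this example.

```python
def leaf_quantile_for_page(page_quantile: float, fanout: int) -> float:
    """Per-service quantile that every one of `fanout` independent parallel
    calls must stay under for the page to stay under `page_quantile`."""
    return page_quantile ** (1.0 / fanout)


FANOUT = 8

# Share of page loads where at least one service exceeds its own p99 (200 ms).
share_over_200ms = 1.0 - 0.99 ** FANOUT

# Per-service percentile that effectively determines the page p99.
leaf_q = leaf_quantile_for_page(0.99, FANOUT)

print(f"{share_over_200ms:.1%} of page loads include a call slower than 200 ms")
print(f"page p99 tracks each service's p{leaf_q * 100:.2f}")
```

Under these assumptions roughly 1 page load in 13 already exceeds 200 ms, and the page p99 is set by what each service does around its p99.87, a region the stated p99 = 200 ms figure says nothing about. A measured page p99 near 1.1 s is therefore consistent with services whose latency beyond p99 is far worse than 200 ms.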
