Building Resilient Systems: Anatomy of Failures

Part of the Reliability Under the Hood series

“If you haven’t experienced a production outage, you either don’t have enough users or you’re not monitoring closely enough.” - Google SRE Team

The Uncomfortable Truth

Last week, we talked about embracing failure. This week, let’s get specific: What actually fails in production?

Here’s a statistic that surprises most people:

70% of production outages are caused by application-level failures, not infrastructure. Yet most organizations spend 80% of their resilience budget on infrastructure: redundant servers, backup data centers, load balancers, and failover systems.

Key Insight: Application failures are 2.3× more common than infrastructure failures. Yet they’re often harder to detect and recover from.

The Failure Cascade: How One Error Becomes an Outage

Case Study: The ~$X(Million) Configuration Error

Sequence of events:

12:00 AM Deployment starts -> New feature released -> Config change (timeout mistakenly increased from 2s to 30s) -> Deploy method: rolling update

12:20 AM First signs of trouble -> API latency increases from 100ms to 2s -> Error rate climbs from 0.1% to 5% -> No alerts triggered (thresholds too high)

The flow below shows how the failure cascaded:

Result: Auth Service waits 30s for fraud check -> Thread pool exhausted (all threads waiting) -> New requests queue up -> Memory usage spikes -> Service crashes (boom!)

After 2 hours

Complete Outage: All authentication failing | No users can log in | Payment processing stopped | Customer support overwhelmed

Root Cause: A single timeout setting, changed from 2s to 30s, brought down the whole stack.
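To see why one timeout value can exhaust a thread pool, Little's Law gives a rough estimate: threads in use equals the request rate multiplied by the time each request holds a thread. The sketch below is only an illustration. The pool size (200 threads) and traffic (50 requests per second) are assumed figures, not numbers from the incident, and it models the worst case where every fraud-check call waits for the full timeout.

```python
import math


def threads_needed(requests_per_second: float, wait_seconds: float) -> int:
    """Little's Law: requests in flight = arrival rate x time each holds a thread."""
    return math.ceil(requests_per_second * wait_seconds)


def seconds_until_exhausted(pool_size: int,
                            requests_per_second: float,
                            wait_seconds: float):
    """Time until every thread is blocked, assuming each call waits the
    full timeout. Returns None if the pool can sustain the load."""
    if threads_needed(requests_per_second, wait_seconds) <= pool_size:
        return None
    # Threads are claimed at the arrival rate and none is released
    # before the pool runs out.
    return pool_size / requests_per_second


# Hypothetical service: 200 threads, 50 requests per second.
POOL_SIZE = 200
RPS = 50

if __name__ == "__main__":
    for timeout in (2, 30):
        print(f"timeout={timeout}s",
              "threads needed:", threads_needed(RPS, timeout),
              "exhausted after:", seconds_until_exhausted(POOL_SIZE, RPS, timeout))
```

Under these assumptions a 2s timeout needs at most 100 threads, which fits in the pool, while a 30s timeout needs 1,500 and the pool is fully blocked after about 4 seconds of a stalled dependency.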

So far we have seen the most common failure type across the system, and we explored one example, a configuration setting, to see how it can make the whole system unstable.

Let’s break down each failure category a bit more before we move on to building the failure detection system. This will not be as detailed as the example above, and it covers only the top application failures.

[Figure: Application Failure Types]

Early Detection: The 30-Second Rule

The Golden Rule: Detect failures in under 30 seconds

Why 30 seconds?

Automated mitigation can start immediately
Minimal customer impact
Faster recovery (MTTR, Mean Time To Recover, < 5 minutes)
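One way to get detection inside a 30-second budget is to evaluate the error rate over a sliding 30-second window and compare it with a multiple of the normal baseline instead of a fixed high threshold. The sketch below is a minimal illustration, not a production detector. The 0.1% baseline comes from the case study above, while the 10x multiplier, the minimum sample size, and the class and function names are assumptions made for this example.

```python
from collections import deque


class ErrorRateDetector:
    """Fires when the error rate over the last `window_seconds` exceeds
    `baseline_rate * multiplier` (all defaults are illustrative)."""

    def __init__(self, baseline_rate=0.001, multiplier=10.0,
                 window_seconds=30.0, min_requests=20):
        self.threshold = baseline_rate * multiplier
        self.window_seconds = window_seconds
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self.events = deque()             # (timestamp, is_error) pairs
        self.errors = 0

    def record(self, timestamp: float, is_error: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.events.append((timestamp, is_error))
        self.errors += int(is_error)
        # Evict outcomes that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_error = self.events.popleft()
            self.errors -= int(old_error)
        total = len(self.events)
        if total < self.min_requests:
            return False
        return self.errors / total > self.threshold


def first_alert_time(detector, start, duration, rps, error_every):
    """Simulate steady traffic where every `error_every`-th request fails
    (0 means no failures). Returns seconds from `start` to the first
    alert, or None if the detector never fires."""
    for i in range(int(duration * rps)):
        timestamp = start + i / rps
        is_error = error_every > 0 and (i + 1) % error_every == 0
        if detector.record(timestamp, is_error):
            return timestamp - start
    return None


if __name__ == "__main__":
    detector = ErrorRateDetector()
    # One healthy minute, then a 5% error rate as in the case study.
    print("healthy:", first_alert_time(detector, 0.0, 60.0, 10, 0))
    print("degraded:", first_alert_time(detector, 60.0, 60.0, 10, 20))
```

Because the threshold is relative to the baseline, the jump from 0.1% to 5% in the case study would trip this detector well inside the 30-second budget, where a fixed threshold set too high stays silent.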

[Figure: Different Metrics]

The Detection-to-Recovery Pipeline

When an issue occurs in production, this is what the incident response process looks like. Each stage is bound to an SLA and a response time.

[Figure: Detection to Recovery Pipeline]

Building Your Detection System: A Practical Guide

Based on the practice of SRE teams, the Golden Rule is: detect failures in under 30 seconds.
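A detection system of this kind usually starts from the four Golden Signals: latency, traffic, errors, and saturation. The sketch below shows one way to compute them for a single monitoring window. The `Request` record, the function name, and the use of thread pool utilisation as the saturation measure are assumptions chosen for illustration. In practice these values would typically come from a metrics system such as Prometheus.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, busy_threads, pool_size):
    """Summarise one monitoring window as the four Golden Signals."""
    latencies = sorted(r.latency_ms for r in requests)
    count = len(latencies)
    if count:
        # Nearest-rank 99th percentile latency.
        rank = max(1, math.ceil(count * 99 / 100))
        p99 = latencies[rank - 1]
        error_rate = sum(1 for r in requests if not r.ok) / count
    else:
        p99 = 0.0
        error_rate = 0.0
    return {
        "latency_p99_ms": p99,                        # Latency
        "traffic_rps": count / window_seconds,        # Traffic
        "error_rate": error_rate,                     # Errors
        "saturation": busy_threads / pool_size,       # Saturation
    }


if __name__ == "__main__":
    # Hypothetical 10-second window: 95 fast successes, 5 slow failures.
    window = [Request(100.0, True)] * 95 + [Request(2000.0, False)] * 5
    print(golden_signals(window, window_seconds=10,
                         busy_threads=150, pool_size=200))
```

Alerting on all four together helps catch the pattern from the case study, where latency and saturation climbed before the service stopped responding.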

[Figure: Detection System Guide]

Key Takeaways

70% of failures are application-level - Focus your monitoring there
Detect in < 30 seconds - The faster you detect, the faster you recover
Monitor business metrics first - Infrastructure can be “healthy” while business fails
Use the Golden Signals - Latency, Traffic, Errors, Saturation
Implement synthetic monitoring - Detect issues before customers do
Automate detection - Humans are slow, computers are fast
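Synthetic monitoring can be sketched as a scheduled probe that exercises a user journey, such as logging in, and alerts after several consecutive unhealthy runs. The example below is a simplified illustration. The probe is passed in as a plain callable so the sketch runs without a network, and the 2-second latency budget, the three-failure rule, and all function names are assumptions. Hosted services such as CloudWatch Synthetics or Datadog Synthetic Monitoring provide this as a managed feature.

```python
import time
from typing import Callable, Dict, List


def run_synthetic_check(probe: Callable[[], bool],
                        max_latency_s: float = 2.0,
                        clock: Callable[[], float] = time.monotonic) -> Dict:
    """Run one probe and judge it on both correctness and latency."""
    started = clock()
    error = None
    try:
        ok = bool(probe())
    except Exception as exc:  # a crashing probe counts as a failed check
        ok = False
        error = repr(exc)
    elapsed = clock() - started
    return {
        "healthy": ok and elapsed <= max_latency_s,
        "ok": ok,
        "latency_s": elapsed,
        "error": error,
    }


def should_alert(results: List[Dict], threshold: int = 3) -> bool:
    """Alert only when the most recent `threshold` checks were all
    unhealthy, so a single blip does not page anyone."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(r["healthy"] for r in recent)


if __name__ == "__main__":
    def fake_login_probe() -> bool:
        # Stand-in for a real login request against a test account.
        raise TimeoutError("auth service did not respond")

    history = [run_synthetic_check(fake_login_probe) for _ in range(3)]
    print("alert:", should_alert(history))
```

Run every 10 seconds, a check like this with a three-failure rule stays within the 30-second detection target even when no real user traffic is hitting the failing path.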

Next Week: Resilience Patterns

Now that we understand what fails and how to detect it, next week we’ll explore resilience patterns that prevent failures from becoming outages:

Circuit breakers (stop cascading failures)
Bulkheads (isolate resources)
Timeouts and retries (fail fast, try again)
Chaos engineering (test in production)
Building antifragile systems

Subscribe to get notified when Part 3 drops next week.

Your Turn

Question: What was the last production issue you missed in monitoring? How long did it take to detect? What would have helped you find it faster?

Share your story in the comments. Let’s learn from each other.

“You can’t fix what you can’t see. And you can’t see what you don’t measure.”

About This Series: Part 2 of 4 on building resilient systems for financial services and high-volume transaction processing.

Read Part 1: [Embracing Failure - The Foundation of Resilient Systems]

Read Part 2: [What actually fails - how to detect them?]

Coming soon Part 3: [Resiliency Patterns - prevent failures from becoming outages]

Coming soon Part 4: [Ultimate isolation pattern: Cell-Based Architecture]

Tags: #Monitoring #Observability #SRE #IncidentResponse #SystemResilience #CellBasedArchitecture #Resiliency #ApplicationResiliency #SystemResiliency

References

Golden Signals and Monitoring
Google SRE Book, Monitoring Distributed Systems - Definitive guide to the Four Golden Signals
Prometheus Documentation - Industry-standard time-series database for metrics
Grafana Documentation - Visualization and dashboards

Observability Platforms
OpenTelemetry - Open-source observability framework (metrics, logs, traces)
AWS X-Ray - Distributed tracing for AWS
Jaeger Tracing - Open-source distributed tracing

Failure Statistics
Uptime Institute: Outage Analysis - Annual report showing 70% application vs 30% infrastructure failures
Awesome Post-Mortems (GitHub) - Curated list of public post-mortems to learn from

Synthetic Monitoring
AWS CloudWatch Synthetics - Create canaries to monitor endpoints
Datadog Synthetic Monitoring - API and browser testing from multiple locations

Books
“Observability Engineering” by Charity Majors, Liz Fong-Jones, and George Miranda - Modern approach to observability
“Practical Monitoring” by Mike Julian - Effective monitoring strategies and anti-patterns
