Microservices and distributed systems are unreliable. Here's what to do about it

Jacob Simon

Consider all the unpredictable, real-world problems that occur over networks:

  • Hardware failures
  • Deployment issues
  • Network congestion
  • Transient errors
  • ... the list goes on!

Let's look at a simple real-world example and do the math.

Example

Imagine a simple blog website with a web server and a database.

In this scenario, we have two points of failure: if either the web server or the database goes down, the whole system becomes unavailable.

Assuming the two fail independently, our overall availability is the probability that both services are available at the same time:

Availability = (server uptime %) × (database uptime %)
Availability = 0.9999 × 0.9999 ≈ 99.98%

Notice how the availability of our system is lower than either of the individual components.
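
To put that number in more tangible terms, here is a minimal Python sketch that multiplies component uptimes and converts the result into expected downtime per year. It assumes the components fail independently and that both sit on the critical path. The function names are illustrative and do not come from any library.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def serial_availability(uptimes):
    """Availability of a system that needs every component up at once.

    Assumes component failures are independent of each other.
    """
    return prod(uptimes)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


blog = serial_availability([0.9999, 0.9999])  # web server, database
print(f"availability: {blog:.4%}")
print(f"downtime: {downtime_minutes_per_year(blog):.0f} min/year")
```

Under these assumptions, the blog is down for roughly 105 minutes per year, about twice the roughly 53 minutes of a single 99.99% component.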

Now what happens as our system scales and becomes even more complex?

Say we introduce more dependencies – new data stores, microservices, or SaaS integrations.

If every service sits on the critical path and has the same uptime, our availability drops exponentially with every new service we add:

Availability = (service uptime %) ^ (number of services)
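
A quick way to see the trend is to evaluate the formula for a growing number of services. This is a rough sketch that assumes every service has the same 99.99% uptime, fails independently, and is required for the system to work. The function name is made up for illustration.

```python
def availability(service_uptime, num_services):
    """Availability when every one of num_services must be up at once."""
    return service_uptime ** num_services


for n in (1, 2, 5, 10, 50):
    print(f"{n:>2} services: {availability(0.9999, n):.2%}")
```

With 50 such services, availability falls to about 99.50%, even though each individual service still offers "four nines".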

Complexity leads to unreliability. We must proactively work against this principle to build reliable products and services.

But how?

Here are four practical tips I recommend:

Evaluate your critical path

  • Make your product fault-tolerant so that it degrades gracefully when non-essential services are unavailable.
  • Code defensively by isolating errors with try/catch and logging them as warnings.
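
Here is one way the two points above might look in Python, where try/catch is spelled try/except. The recommendation service, `fetch_recommendations`, and `render_post` are all hypothetical names invented for this sketch. The idea is that the post itself is on the critical path, while recommendations are not.

```python
import logging

logger = logging.getLogger("blog")


def fetch_recommendations(post_id):
    """Hypothetical non-essential dependency; in this sketch it always fails."""
    raise ConnectionError("recommendation service unavailable")


def render_post(post_id, fetch=fetch_recommendations):
    """Build the page data, degrading gracefully if recommendations fail."""
    page = {"post_id": post_id, "body": f"Contents of post {post_id}"}
    try:
        page["recommendations"] = fetch(post_id)
    except Exception as exc:
        # Not on the critical path: log a warning and serve the page anyway.
        logger.warning("recommendations unavailable for post %s: %s", post_id, exc)
        page["recommendations"] = []
    return page


print(render_post(42))
```

Catching a broad `Exception` is a deliberate simplification here. In a real service you would probably catch only the errors your client library is known to raise.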

Try, try again.

Most API libraries support retries out of the box - use them!

  • Avoid the thundering herd problem by using exponential backoff with jitter, so that clients don't all retry at the same moment.
  • Consider whether your request is idempotent and can be safely retried.
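
If your library doesn't handle retries for you, a hand-rolled version could look something like the sketch below. It uses exponential backoff with "full jitter" (a random wait between zero and the current cap), which is one common variant rather than the only option. The function name, the default values, and the choice to retry only on `ConnectionError` are assumptions made for this example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with backoff and jitter.

    Only use this for idempotent operations: a retry may repeat a request
    that actually succeeded on the server.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential cap: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, cap] spreads out clients
            # that all failed at the same moment.
            sleep(random.uniform(0, cap))


attempts = {"count": 0}


def flaky_request():
    """Hypothetical request that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry_with_backoff(flaky_request, base_delay=0.01))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests.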

Reduce, reuse, recycle.

  • If your application doesn’t rely on real-time data, cache it instead and refresh when needed.
  • Fall back to cached data when the service becomes unavailable.
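
Both ideas can be combined in a small cache wrapper. The sketch below is a simplified, single-value, in-memory illustration and not a production cache. The class name `CachedValue` and its parameters are invented for this example. It refreshes the value once it is older than the time-to-live and keeps serving the last known value if the refresh fails.

```python
import time


class CachedValue:
    """Cache one value with a time-to-live, serving stale data if refresh fails."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means nothing is cached yet

    def get(self):
        now = self._clock()
        fresh = (self._fetched_at is not None
                 and now - self._fetched_at < self._ttl)
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached to fall back to
            # Upstream is down: keep serving the last known value.
        return self._value


popular_posts = CachedValue(lambda: ["post-1", "post-2"], ttl_seconds=60)
print(popular_posts.get())
```

Note that after a failed refresh this version tries the upstream service again on every call. Depending on your traffic, you may want to pair it with the retry and backoff ideas above.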

Scale horizontally

Finally, you can avoid single points of failure by scaling each component horizontally to create redundancy.

  • Database -> Replication with automatic failover
  • API -> Multiple servers with high-availability load balancer
  • Frontend -> Use CDN to distribute static assets
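
To see why redundancy helps, we can flip the earlier math around: a replicated component is only down when all of its replicas are down at once. The sketch below assumes replicas fail independently and that failover is instant and perfect, which real systems only approximate. The function name is illustrative.

```python
def redundant_availability(replica_uptime, replicas):
    """Availability of a component that is up if at least one replica is up.

    Assumes independent replica failures and instant, perfect failover.
    """
    return 1 - (1 - replica_uptime) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {redundant_availability(0.99, n):.4%}")
```

Under these idealized assumptions, two replicas at 99% each already reach about 99.99%. In practice, correlated failures and the load balancer itself will eat into that figure.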
