Consider all the unpredictable, real-world problems that can occur over a network.
Let's look at a simple real-world example and do the math.
Imagine a simple blog website with a web server and a database.
In this scenario, we have two points of failure: if either the web server or the database goes down, the whole system becomes unavailable.
Our overall availability is the probability that 𝗯𝗼𝘁𝗵 services are available at the same time. Assuming their failures are independent, that is the product of their individual uptimes:
Availability = (server uptime) 𝗑 (database uptime)
Availability = 99.99% 𝗑 99.99% ≈ 99.98%
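Here is the calculation above as a few lines of Python. It is a minimal sketch that assumes the two components fail independently. `serial_availability` is a hypothetical helper name, not part of any library.

```python
from math import prod


def serial_availability(uptimes: list[float]) -> float:
    """Availability of components that must ALL be up (in series).

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(uptimes)


# The blog example: one web server and one database, each at 99.99% uptime.
blog = serial_availability([0.9999, 0.9999])
print(f"{blog:.4%}")
```

Because every factor is below 1, each component you add in series can only pull the total down.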
Notice how the availability of our system is lower than either of the individual components.
Now what happens as our system scales and becomes even more complex?
Say we introduce more dependencies: new data stores, microservices, or SaaS integrations.
If every service has the same uptime, our availability drops exponentially with each new service we add:
Availability = (service uptime) ^ (number of services)
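The following sketch shows how quickly the formula above compounds. It tabulates availability and the implied yearly downtime for a growing number of services. It assumes every service has the same 99.99% uptime and fails independently. `chain_availability` and `HOURS_PER_YEAR` are illustrative names.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def chain_availability(uptime: float, n_services: int) -> float:
    """Availability when n independent services, each with the same uptime, must all be up."""
    return uptime ** n_services


for n in (1, 2, 5, 10, 20):
    a = chain_availability(0.9999, n)
    downtime_hours = (1 - a) * HOURS_PER_YEAR  # expected hours per year with at least one service down
    print(f"{n:>2} services: {a:.4%} available, ~{downtime_hours:.1f} h downtime/year")
```

With 20 such services, availability falls to roughly 99.80%, or about 17.5 hours of downtime per year. Each service on its own is down for less than an hour a year.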
Complexity leads to unreliability. We must proactively work against this tendency to build reliable products and services.
Here are four practical tips I recommend:
Most API client libraries support retries out of the box, so use them!
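Your library might not offer retries, or you might want to see what it is doing for you. Here is a minimal sketch of one common strategy: exponential backoff with full jitter. `call_with_retries` and its parameters are hypothetical. The exception types it retries on are an assumption (Python's built-in `ConnectionError` and `TimeoutError`), because real client libraries raise their own error types.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying transient failures with exponential backoff and full jitter.

    Hypothetical helper for illustration; retry_on is an assumed set of transient errors.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, backoff] so clients don't retry in lockstep.
            sleep(random.uniform(0, backoff))
    raise AssertionError("unreachable")  # the loop always returns or raises


# Usage: a call that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"


print(call_with_retries(flaky_request))  # prints "ok" on the third attempt
```

Only retry operations that are safe to repeat (idempotent), such as reads or requests that carry an idempotency key. Blindly retrying a non-idempotent write can apply it twice. The jitter spreads retries out, so many clients recovering from the same blip don't all hit the service at once.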
Finally, you can avoid single points of failure by scaling each component horizontally to create redundancy.
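Redundancy flips the math. A horizontally scaled tier with several interchangeable replicas is down only when all of its replicas are down at once. The sketch below assumes that replicas fail independently and that any single replica can serve all traffic. `parallel_availability` is a hypothetical helper, and the two-replica blog is an illustrative configuration, not a recommendation.

```python
from math import prod


def parallel_availability(uptime: float, replicas: int) -> float:
    """Availability of a tier that is up if ANY replica is up (independent failures assumed)."""
    return 1 - (1 - uptime) ** replicas


def serial_availability(uptimes: list[float]) -> float:
    """Availability when every tier must be up (independent failures assumed)."""
    return prod(uptimes)


# Hypothetical redundant blog: 2 web servers and 2 database replicas, each at 99.99%.
web = parallel_availability(0.9999, 2)  # 1 - 0.0001**2, i.e. about 0.99999999
db = parallel_availability(0.9999, 2)
redundant_blog = serial_availability([web, db])
print(f"{redundant_blog:.6%}")
```

Under those assumptions, doubling both tiers of the blog raises availability from about 99.98% to about 99.999998%. In practice, replicas often share failure modes (the same region, network, or bad deploy). The load balancer in front of them is also one more component in series. Real-world gains are therefore smaller.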