This post is part of a series that covers the concept of resilience in a distributed environment, and how to improve a system's handling of transient errors.
Wikipedia - In computer networking: "Resiliency is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation."
Large distributed systems are made up of many nodes, over many clustered layers that communicate in long chains of calls, either synchronously or asynchronously.
The more nodes involved in a system, the greater the probability of a failure. A data center with 100,000 servers and millions of users will experience more hardware failures than a data center with 10 servers and a few hundred users. So what happens when one of these nodes fails?
Our user gets this!
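To put rough numbers on the claim that more nodes mean a greater probability of failure, here is a minimal sketch. It assumes every node fails independently with the same probability over some period, which is a simplification, and the 0.1% figure is a made-up illustration rather than a measured rate. The function name is hypothetical.

```python
def probability_of_any_failure(node_count: int, per_node_failure_probability: float) -> float:
    """Chance that at least one of `node_count` independent nodes fails.

    Assumes nodes fail independently and with equal probability,
    which real data centers only approximate.
    """
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - per_node_failure_probability) ** node_count


# Hypothetical figure: each node has a 0.1% chance of failing in a given period.
for nodes in (10, 1_000, 100_000):
    chance = probability_of_any_failure(nodes, 0.001)
    print(f"{nodes:>7} nodes: {chance:.4f}")
```

Under these assumptions the chance of seeing at least one failure climbs quickly with the node count, so at large scale a failure somewhere in the system is better treated as the normal case than as an exception.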
So what kind of things can cause a transient error in our system?
These are all errors that can happen in a large system, but need not be instantly fatal to a given request. Using resilience techniques, we can ensure the system better handles these events.
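As a taste of what "need not be instantly fatal" can look like in practice, the sketch below retries an operation a few times with exponential backoff and jitter before giving up. It is only one common technique and not necessarily the approach the rest of this series takes. `TransientError` and `call_with_retries` are hypothetical names invented for this illustration, and the attempt count and delays are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay_seconds=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff.

    Re-raises the last TransientError once `max_attempts` is exhausted.
    Any other exception propagates immediately without a retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Double the delay on each attempt and add random jitter so that
            # many clients do not all retry at the same moment.
            delay = base_delay_seconds * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# Usage: an operation that fails twice, then succeeds on the third attempt.
attempts_seen = []


def flaky_operation():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise TransientError("temporary glitch")
    return "ok"


result = call_with_retries(flaky_operation, base_delay_seconds=0.01)
```

The caller sees a single slightly slower success instead of an error. Retrying is only appropriate for operations that are safe to repeat, so a real system would need to decide which errors count as transient.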
Some kinds of systems benefit much more from resiliency than others.
These systems all provide a feed of data, and interruptions result in connections to the client being severed. Where the cost of re-establishing connections is high, the lack of resilience is more pronounced.
This post has covered what resilience is and what symptoms you can expect from a non-resilient system. In my next post, I will go over how you can implement resilience in your system.