Designing Reliable Systems

We have already talked a bit about designing reliable systems before. Today, we’ll go over how to design services to meet requirements for availability,
durability, and scalability. We will also discuss how to implement fault-tolerant systems by avoiding single points of failure, correlated failures, and cascading failures. We will see how to avoid overload failures by using design patterns such as the circuit breaker and truncated exponential backoff.

When designing for reliability, consider these key performance metrics:

Availability: the percent of time a system is running and able to process requests.
  • Achieved with fault tolerance
  • Create backup systems
  • Use health checks
  • Use clear box metrics to count real traffic success and failure

Durability: the odds of losing data because of a hardware or system failure.
  • Achieved by replicating data in multiple zones
  • Do regular backups
  • Practice restoring from backups

Scalability: the ability of a system to continue to work as user load and data grow.
  • Monitor usage
  • Use capacity autoscaling to add and remove servers in response to changes in load
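
To make the availability metric concrete, here is a minimal sketch of counting real traffic successes and failures (the clear box metrics mentioned above) and deriving an availability percentage from them; the class and counter names are illustrative, not a specific library API:

```python
# Minimal sketch: derive availability from clear box request metrics.
# The AvailabilityTracker name and its counters are illustrative.

class AvailabilityTracker:
    def __init__(self):
        self.success_count = 0
        self.failure_count = 0

    def record(self, succeeded: bool) -> None:
        """Count every real request, successful or not."""
        if succeeded:
            self.success_count += 1
        else:
            self.failure_count += 1

    def availability(self) -> float:
        """Availability = successful requests / total requests, as a percentage."""
        total = self.success_count + self.failure_count
        return 100.0 if total == 0 else 100.0 * self.success_count / total


tracker = AvailabilityTracker()
for ok in [True, True, True, False, True]:  # pretend traffic
    tracker.record(ok)
print(f"availability: {tracker.availability():.1f}%")  # -> availability: 80.0%
```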

Now, let’s look at the most common problems that arise when building distributed systems and the ways to resolve them.

Single points of failure

Avoid single points of failure by replicating data and creating multiple virtual machine instances. It is important to define your unit of deployment and understand its capabilities. To avoid single points of failure, you should deploy two extra instances, or N + 2, to handle both failure and upgrades. These deployments should ideally be in different zones to mitigate zonal failures.

Consideration
  • Define your unit of deployment
  • N+2: Plan to have one unit out for upgrade or testing and survive another failing
  • Make sure that each unit can handle the extra load
  • Don’t make any single unit too large
  • Try to make units interchangeable stateless clones
Example

Consider 3 VMs that are load balanced to achieve N+2. If one is being upgraded and another fails, 50% of the available compute capacity is removed, which potentially doubles the load on the remaining instance and increases the chance of it failing too. This is where capacity planning and knowing the capability of your deployment unit are important. For ease of scaling, it is also good practice to make the deployment units interchangeable stateless clones.
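
To make the capacity math explicit, here is a small sketch of the N+2 check described above; the QPS figures and instance counts are illustrative assumptions, not recommendations:

```python
# Sketch of the N+2 capacity check described above (numbers are illustrative).

def surviving_capacity(total_instances: int, per_instance_qps: int,
                       units_down: int = 2) -> int:
    """Capacity left when `units_down` instances are out (one upgrading, one failed)."""
    return max(total_instances - units_down, 0) * per_instance_qps

peak_load_qps = 1000         # assumed peak traffic
per_instance_qps = 1200      # assumed capacity of one deployment unit
instances = 3                # N + 2 where N = 1

remaining = surviving_capacity(instances, per_instance_qps)
print(f"capacity with two units down: {remaining} QPS")
if remaining < peak_load_qps:
    print("Not enough headroom: add instances or use bigger units.")
```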

Correlated failures

It is also important to be aware of correlated failures. These occur when related items fail at the same time.

Consideration
  • If a single machine fails, all requests served by that machine fail.
  • If a top-of-rack switch fails, the entire rack fails.
  • If a zone or region is lost, all the resources in it fail.
  • Servers running the same software run into the same issues.
  • If a global configuration system fails, and multiple systems depend on it, they potentially fail too.
Example

At the simplest level, if a single machine fails, all requests served by that machine fail. At a hardware level, if a top-of-rack switch fails, the complete rack fails. At the cloud level, if a zone or region is lost, all the resources in it are unavailable. Servers running the same software suffer from the same issue: if there is a fault in the software, the servers may fail at around the same time.

Correlated failures can also apply to configuration data. If a global configuration system fails, and multiple systems depend on it, they potentially fail too. When we have a group of related items that could fail together, we refer to it as a failure or fault domain.

Avoid putting critical components in the same failure domain.
Avoid correlated failures

Several techniques can be used to avoid correlated failures. It is useful to be aware of failure domains; then servers can be decoupled using microservices distributed among multiple failure domains. To achieve this, you can divide business logic into services based on failure domains and deploy to multiple zones and/or regions.

  • Decouple servers and use microservices distributed among multiple failure domains.
  • Divide business logic into services based on failure domains.
  • Deploy to multiple zones and/or regions.
  • Split responsibility into components and spread over multiple processes.
  • Design independent, loosely coupled but collaborating services.

At a finer level of granularity, it is good to split responsibilities into components and spread these over multiple processes. This way a failure in one component will not affect other components. If all responsibilities are in one component, a failure of one responsibility has a high likelihood of causing all responsibilities to fail.

When you design microservices, your design should result in loosely coupled, independent but collaborating services. A failure in one service should not cause a failure in another service. It may cause a collaborating service to have reduced capacity or not be able to fully process its workflows, but the collaborating service remains in control and does not fail.
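
As a simple illustration of spreading interchangeable, stateless units across failure domains, the sketch below assigns replicas round-robin across zones so that no single zone holds all of them; the zone names and replica counts are made up:

```python
# Sketch: spread interchangeable, stateless replicas across failure domains
# (zones). Zone names and replica counts are illustrative.
from itertools import cycle

def spread_across_zones(replica_count: int, zones: list[str]) -> dict[str, list[str]]:
    """Round-robin replicas over zones so a single zone failure never takes them all."""
    placement: dict[str, list[str]] = {zone: [] for zone in zones}
    zone_cycle = cycle(zones)
    for i in range(replica_count):
        placement[next(zone_cycle)].append(f"replica-{i}")
    return placement

print(spread_across_zones(4, ["zone-a", "zone-b", "zone-c"]))
# {'zone-a': ['replica-0', 'replica-3'], 'zone-b': ['replica-1'], 'zone-c': ['replica-2']}
```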

Cascading failures

Cascading failures occur when the failure of one system causes others to be overloaded and subsequently fail, such as a message queue becoming overloaded because of a failing backend.

Example

Cascading failures occur when one system fails, causing others to be overloaded and subsequently fail. For example, a message queue could be overloaded because a backend fails and it cannot process messages placed on the queue.

Consider a Cloud Load Balancer distributing load across two backend servers, each able to handle a maximum of 1000 queries per second. The load balancer is currently sending 600 queries per second to each instance. If server B now fails, all 1200 queries per second have to be sent to server A alone. This is much higher than the specified maximum and could lead to a cascading failure.

Avoid cascading failures

Cascading failures can be handled with support from the deployment platform. For example, you can use health checks in Compute Engine or readiness and liveness probes in GKE to enable the detection and repair of unhealthy instances. You want to ensure that new instances start fast and ideally do not rely on other backends/systems to start up before they are ready.

  • Use health checks in Compute Engine or readiness and liveness probes in Kubernetes to detect and then repair unhealthy instances.
  • Ensure that new server instances start fast and ideally don’t rely on other backends/systems to start up.
This setup only works for stateless services.
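
To make the health check idea concrete, here is a minimal sketch of the kind of endpoint a Compute Engine health check or a GKE liveness/readiness probe could poll; the /healthz and /readyz paths and the readiness flag are assumptions, not a required convention:

```python
# Minimal sketch of health/readiness endpoints that a load balancer health check
# or Kubernetes probe could poll. Paths and the readiness flag are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

STARTUP_COMPLETE = False  # set to True once the instance has warmed up

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":                         # liveness: process is up
            self.send_response(200)
        elif self.path == "/readyz" and STARTUP_COMPLETE:   # readiness: can take traffic
            self.send_response(200)
        else:
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    # ... do any warm-up work (load config, open connections) here ...
    STARTUP_COMPLETE = True
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```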

Query of death overload

You also want to plan against the “query of death”, where a request made to a service causes a failure in the service. It is referred to as the query of death because the error manifests itself as overconsumption of resources, when in reality it is caused by an error in the business logic itself.

Problem

Business logic error shows up as overconsumption of resources, and the service overloads. This ‘query of death’ is any request to your system that can cause it to crash. A client may send a query of death, crash one instance of your service, and keep retrying, bringing further instances down.

Solution

Monitor query performance. Ensure that notification of these issues gets back to the developers.
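
One way to act on this is to wrap request handling so that slow or crashing requests are logged with enough detail for developers to track down the offending query. A minimal sketch, where the handle_request function and the one-second threshold are hypothetical:

```python
# Sketch: log slow or crashing requests so a "query of death" is visible to developers.
# handle_request() and the one-second threshold are hypothetical.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("query-monitor")

def monitored(handler):
    def wrapper(request):
        start = time.monotonic()
        try:
            return handler(request)
        except Exception:
            # The request that crashed the handler is the prime query-of-death suspect.
            log.exception("request crashed: %r", request)
            raise
        finally:
            elapsed = time.monotonic() - start
            if elapsed > 1.0:
                log.warning("slow request (%.2fs): %r", elapsed, request)
    return wrapper

@monitored
def handle_request(request):
    return f"processed {request}"

print(handle_request("search?q=ok"))
```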

Positive feedback cycle overload

You should also plan against positive feedback cycle overload failure, where a problem is caused by trying to prevent problems.

Example

This happens when you try to make the system more reliable by adding retries in the event of a failure. Instead of fixing the failure, this creates the potential for overload. You may actually be adding more load to an already overloaded system.

Avoid positive feedback overload

Implement correct retry logic.

Use one of these patterns to avoid positive feedback cycle overload: truncated exponential backoff or the circuit breaker.

Truncated exponential backoff pattern

Consideration

If a service invocation fails, try again:

  • Continue to retry, but wait a while between attempts.
  • Wait a little longer each time the request fails.
  • Set a maximum length of time and a maximum number of requests.
  • Eventually, give up.
Example
  • Request fails; wait 1 second + random_number_milliseconds and retry.
  • Request fails; wait 2 seconds + random_number_milliseconds and retry.
  • Request fails; wait 4 seconds + random_number_milliseconds and retry.
  • And so on, up to a maximum_backoff time.
  • Continue waiting and retrying up to some maximum number of retries.
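
Putting these steps together, here is a minimal sketch of truncated exponential backoff with random jitter; the call_service callable, the 32-second cap, and the retry limit are illustrative choices rather than a prescribed API:

```python
# Sketch of truncated exponential backoff with random jitter.
# call_service, MAX_BACKOFF_SECONDS and MAX_RETRIES are illustrative choices.
import random
import time

MAX_BACKOFF_SECONDS = 32
MAX_RETRIES = 6

def call_with_backoff(call_service):
    for attempt in range(MAX_RETRIES):
        try:
            return call_service()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # eventually, give up
            # Wait 1s, 2s, 4s, ... plus random jitter, truncated at the maximum.
            backoff = min(2 ** attempt, MAX_BACKOFF_SECONDS)
            time.sleep(backoff + random.random())  # jitter plays the role of random_number_milliseconds
```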

Circuit breaker pattern

We talked about implementing this pattern for microservices and Kubernetes infrastructure in one of our previous articles.

The circuit breaker pattern can also protect a service from too many retries. The pattern implements a solution for when a service is in a degraded state of operation. It is important because if a service is down or overloaded and all its clients are retrying, the extra requests actually make matters worse. The circuit breaker design pattern protects the service behind a proxy that monitors the service health. If the service is not deemed healthy by the circuit breaker, it will not forward requests to the service. When the service becomes operational again, the circuit breaker will begin feeding requests to it again in a controlled manner.

Consideration
  • Plan for degraded state operations.
  • If a service is down and all its clients are retrying, the increasing number of requests can make matters worse.
  • Protect the service behind a proxy that monitors service health (the circuit breaker).
  • If the service is not healthy, don’t forward requests to it.
  • If you are using GKE, leverage the Istio service mesh to automatically implement circuit breakers.
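
If your platform does not provide a circuit breaker for you, the core of the pattern can be sketched in a few lines; the failure threshold and cool-off period below are arbitrary illustrative values:

```python
# Minimal circuit breaker sketch: stop forwarding requests to an unhealthy
# service and let a trial request through after a cool-off period.
# The threshold and cool-off values are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooloff_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooloff_seconds = cooloff_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, service_call):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooloff_seconds:
                raise RuntimeError("circuit open: not forwarding request")
            # Cool-off elapsed: allow a trial request through (half-open state).
        try:
            result = service_call()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        # Success: close the circuit and reset the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result
```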

We covered how to deploy applications for high availability, durability, and scalability; described the most common design patterns for avoiding single points of failure, correlated failures, and cascading failures; and showed how to implement correct retry logic.

Please visit our #CyberTechTalk WIKI pages for much more information about designing reliable systems, monitoring, and information security.

If you still experience problems with system reliability, are not sure exactly how your system reacts to increased load, or are thinking about your disaster recovery or incident management process, we at BiLinkSoft provide and support fully managed and automated solutions to achieve operational excellence in:
  • Reliability-as-a-Service
  • Monitoring and Observability
  • Cloud adoption
  • Business Continuity and Disaster Recovery
  • Incident Management
  • Release Management
  • Security

Contact Us for a FREE evaluation.

Be ethical, save your privacy!
