Designing Reliable Systems

We have already touched on designing reliable systems. Today, we’ll go over how to design services to meet requirements for availability, durability, and scalability. We will also discuss how to implement fault-tolerant systems by avoiding single points of failure, correlated failures, and cascading failures. We will see how to avoid overload failures by using … Read more

What is reliability engineering?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that needs to ship continuously and the operations team that’s responsible for the reliability of the production environment. Site reliability engineering shifts the responsibility of production … Read more

System Reliability: implementing ‘golden metrics’

Before we start, let’s first consider what system reliability means. In simple terms, it is the probability that a product performs its intended function under stated conditions without failure for a given period of time. Achieving it requires, among other things, continuous monitoring of the state of the system. Why this is so important … Read more