Service Reliability Hierarchy

Site reliability engineering (SRE) is concerned with the stability and quality of service that an application offers after being made available to end users. To see how SRE achieves that goal, it helps to understand the methods and principles it borrows, which is why the service reliability hierarchy is so important to understand.

The Site Reliability Engineering (SRE) pyramid, also known as the ‘Dickerson pyramid’, provides a set of principles that an organization can use to define and improve reliability and to promote engineering excellence. The model is based on Maslow’s hierarchy of needs and was adapted for system reliability by Google’s SRE manager Mikey Dickerson.

The idea behind the SRE pyramid is that we can categorize the health of a service in a similar way to how Abraham Maslow categorized human needs. The most basic elements required by a service are at the bottom of the pyramid, and the elements get more advanced as we move further up. A level does not depend exclusively on the one below it, but when a level is fulfilled, the levels above it benefit.

Monitoring

Monitoring is at the very base of the pyramid because it is the most basic requirement for a functioning system. Without monitoring, you have no way to tell whether the service is even working; absent a thoughtfully designed monitoring infrastructure, you’re flying blind. Maybe everyone who tries to use the website gets an error, maybe not – but you want to be aware of problems before your users notice them.

At the most basic level, monitoring allows you to:

Alert on conditions that require attention.
Investigate and diagnose those issues.
Display information about the system visually.
Gain insight into trends in resource usage or service health for long-term planning.
Compare the behavior of the system before and after a change, or between two groups in an experiment.
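
As an illustration of the first item, alerting on a condition that requires attention, here is a minimal sketch of an error-rate check over a sliding window of recent requests. The class name, the window size, and the threshold are assumptions made for this example, not part of any particular monitoring product.

```python
from collections import deque


class ErrorRateAlert:
    """Signals when the error ratio over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.threshold = threshold
        # A deque with maxlen silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Store the outcome of one request (True = success)."""
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window (0.0 if empty)."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold


# Hypothetical traffic: 7 successes followed by 3 failures.
alert = ErrorRateAlert(window=10, threshold=0.2)
for ok in [True] * 7 + [False] * 3:
    alert.record(ok)
print(alert.error_rate(), alert.should_alert())
```

A real setup would usually express such a rule in the monitoring system itself rather than in application code, but the logic is the same: measure, compare against a threshold, and notify a human only when attention is needed.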

Monitoring should be built into every feature as a basic requirement. It tells an organization whether the system is working at all, and it lets the team catch potential issues before users notice them.

Incident Response

An effective incident response protocol ensures that you can successfully deal with a problem if you find one during the monitoring phase. Incident response can be difficult.

SREs don’t go on-call merely for the sake of it: rather, on-call support is a tool we use to achieve our larger mission and remain in touch with how distributed computing systems actually work (and fail!). If we could find a way to relieve ourselves of carrying a pager, we would. In Being On-Call, we explain how we balance on-call duties with our other responsibilities.
Google’s SRE

Once you’re aware that there is a problem, how do you make it go away? That doesn’t necessarily mean fixing it once and for all – maybe you can stop the bleeding by reducing the system’s precision or turning off some features temporarily, allowing it to gracefully degrade, or maybe you can direct traffic to another instance of the service that’s working properly. The details of the solution you choose to implement are necessarily specific to your service and your organization. Responding effectively to incidents, however, is something applicable to all teams.
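
One way to picture "stopping the bleeding" through graceful degradation is a fallback wrapper: try the full-quality path and, if it fails, serve a cheaper and less precise answer. This is only a sketch; `with_fallback`, `personalized_results`, and `cached_popular_results` are hypothetical names invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], degraded: Callable[[], T]) -> T:
    """Return primary() if it succeeds, otherwise the degraded() result."""
    try:
        return primary()
    except Exception:
        # Mitigate first; the root cause can be investigated afterwards.
        return degraded()


def personalized_results() -> list[str]:
    # Hypothetical expensive backend that is currently failing.
    raise TimeoutError("ranking backend unavailable")


def cached_popular_results() -> list[str]:
    # Hypothetical less precise but reliable answer.
    return ["popular-1", "popular-2"]


results = with_fallback(personalized_results, cached_popular_results)
print(results)
```

Catching every exception is a deliberate simplification here. In practice you would likely catch only the failures you expect (timeouts, connection errors) and record each fallback so that monitoring can show how often the degraded path is used.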

Without a proper procedure in place, it can be tempting to react immediately, without enough thought, in an attempt to deal with the problem as quickly as possible. Figuring out what’s wrong is the first step in troubleshooting.

Postmortem and Root-Cause Analysis

We aim to be alerted on and manually solve only new and exciting problems presented by our service; it’s woefully boring to “fix” the same issue over and over. In fact, this mindset is one of the key differentiators between the SRE philosophy and some more traditional operations-focused environments.

Postmortems and root-cause analysis are critical for identifying why a failure happened so that it can be avoided in the future. The goal is to prevent the same mistakes from recurring by thoroughly addressing the problem and taking appropriate corrective action the first time it happens.

Building a blameless postmortem culture is the first step in understanding what went wrong (and what went right!). Collaboration should be encouraged, and team members should be visibly rewarded for doing the right thing.

Testing

Once we understand what tends to go wrong, our next step is attempting to prevent it, because an ounce of prevention is worth a pound of cure. Test suites offer some assurance that our software isn’t making certain classes of errors before it’s released to production.

Testing software before releasing it is vital when it comes to reducing the number of errors present.

There are many different types of software tests, each with specific objectives and strategies:

Acceptance testing: Verifying whether the whole system works as intended.
Integration testing: Ensuring that software components or functions operate together.
Unit testing: Validating that each software unit performs as expected. A unit is the smallest testable component of an application.
Functional testing: Checking functions by emulating business scenarios, based on functional requirements. Black-box testing is a common way to verify functions.
Performance testing: Testing how the software performs under different workloads. Load testing, for example, is used to evaluate performance under real-life load conditions.
Regression testing: Checking whether new features break or degrade functionality. Sanity testing can be used to verify menus, functions and commands at the surface level, when there is no time for a full regression test.
Stress testing: Testing how much strain the system can take before it fails. Considered to be a type of non-functional testing.
Usability testing: Validating how well a customer can use a system or web application to complete a task.
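
To make the unit-testing entry concrete, here is a small sketch using Python's standard `unittest` module. The function `apply_discount` is a hypothetical unit invented for this example. The tests cover a typical case, a boundary case, and invalid input.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_leaves_price_unchanged(self):
        self.assertEqual(apply_discount(99.99, 0), 99.99)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 120)
```

Saved in a file, these tests can be run with `python -m unittest`. The same pattern scales up: integration and regression tests mostly differ in how much of the system each test exercises, not in how assertions are written.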

In each case, validating the base requirements is critical. Just as important, exploratory testing helps a tester or testing team uncover hard-to-predict scenarios and situations that can lead to software errors.

Capacity Planning

Capacity planning enables an organization to meet the changing demands for its products by allocating the correct resources.

At its core, capacity management must follow three basic principles in order to keep a service scalable, usable, and manageable:

Services must use their resources efficiently. Large services that require a considerable amount of resources are expensive to deploy and maintain.
Services must run reliably. Limiting resource capacity to improve service efficiency can put the service at risk of malfunctioning and suffering user-facing outages. There is a tradeoff between service efficiency and reliability.
Service growth must be anticipated. Adding resources to a service can take a long time and has real world limitations around deployment. This may involve buying and deploying new equipment or building new datacenters. It may also require increasing capacity for other software systems and infrastructure that are dependencies of the service.
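
The three principles can be combined into a rough back-of-the-envelope calculation: forecast the peak load, add headroom for reliability, and divide by what one instance can serve. The sketch below assumes steady compound monthly growth and a fixed per-instance throughput. Both are simplifying assumptions, and all parameter names are made up for the example.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     monthly_growth: float, months_ahead: int,
                     headroom: float = 0.3) -> int:
    """Estimate the instances required to serve forecast peak load with headroom."""
    # Anticipate growth: compound the current peak over the planning horizon.
    forecast_rps = peak_rps * (1 + monthly_growth) ** months_ahead
    # Reliability: keep spare capacity for spikes and instance failures.
    required_rps = forecast_rps * (1 + headroom)
    # Efficiency: provision no more than the rounded-up requirement.
    return math.ceil(required_rps / rps_per_instance)


# Hypothetical service: 1000 requests/s today, 100 requests/s per instance,
# 10% monthly growth, planning 6 months ahead with 30% headroom.
print(instances_needed(1000, 100, 0.10, 6))
```

Raising `headroom` buys reliability at the cost of efficiency, which is the tradeoff the second principle describes. Lengthening `months_ahead` reflects how long it takes to add resources.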

Development

Before development begins, it is important to do some research to make sure the product that you are creating doesn’t already exist. After doing some research, you can often find that an off-the-shelf product that meets your requirements already exists.

Ensuring that you are solving a problem that will be widely impactful, instead of focusing on fixing a pet peeve that has been bothering you, is critical.

If you do decide to build the product, you should take the time to figure out what has already been done elsewhere, and what other companies have done to address the problem you are trying to solve.

Product

Finally, having made our way up the reliability pyramid, we find ourselves at the point of having a workable product.

It’s really hard to design products by focus groups. A lot of times, people don’t know what they want until you show it to them.
Steve Jobs
