System Reliability Engineer checklist

Earlier I tried to describe who a system reliability engineer really is and what they are responsible for. Today I would like to focus on the set of skills and approaches that any SRE needs to follow.

To understand how your system really behaves, how reliable it is, and how best to adopt common SRE practices, I will list the top 10 practices that any reliability engineer should have on their checklist.

1. Incident Management

There must be a well-defined process for how incidents are managed.
Where are incidents reported/raised?
Who monitors the alerts at any given point in time?
Is there any team that gets the alerts before the SRE team and tries to handle the issue?
Is the process automated?
Do you need to manually open a ticket?
Do you need to go to an incident platform/page, or do you get an alert?
Do you schedule a meeting to talk about the incident?
Create an incident management flow. Be sure everyone is familiar with it.
Establish an incident management and response team.

2. Define SLO (Service Level Objective)

Describe the golden rules and define the most valuable indicators (SLIs)
Define critical objectives for those indicators (SLOs)
No two systems are identical. Try to understand your system's architecture first.
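To make the link between an SLI and an SLO concrete, here is a minimal sketch of the usual error budget arithmetic. The function names, the 30-day window, and the request-based availability SLI are illustrative assumptions, not something this checklist prescribes.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully (the SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    print(f"SLI: {sli:.4%}")
    print(f"Budget for a 99.9% SLO: {error_budget_minutes(0.999):.1f} min / 30 days")
    print(f"Budget remaining: {budget_remaining(sli, 0.999):.0%}")
```

A 99.9% target over 30 days leaves roughly 43 minutes of tolerated downtime, which is a more useful planning number than the percentage itself.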

3. Provisioning

How do you provision the infrastructure required for deploying the application? (Terraform, Pulumi, CloudFormation, …)
How do you install the application and its dependencies? (Container, Bash Script, Ansible, …)
How do you deploy the application or the application service? (k8s, cloud instances)
How do you configure the application? (k8s, Ansible, …)

4. Resiliency and disaster recovery

Is there a single point of failure?
Your app is able to withstand outages (usually by implementing multi-region or multi-cloud architectures)
Your app scales up and down in response to load changes
Resources are deployed across availability zones, regions, etc.
Is there a disaster recovery (DR) plan?
Does your team exercise DR regularly?

5. GitOps

Have you adopted GitOps?
How do you monitor Flux reconciliation?
Do you use ArgoCD dashboards and notifications?
Use Kustomize and Helm for manifest deployment.
Auto-prune (resources are deleted when their files/content are deleted from Git)
Self-heal (cluster state is corrected to match the Git state when manual changes are made to the cluster)
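The auto-prune and self-heal ideas can be illustrated with a toy reconciliation planner. This is only a sketch of the concept and not how Flux or ArgoCD are implemented: the `plan_reconcile` function and the dictionary representation of resources are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_reconcile(
    desired: Dict[str, dict],
    actual: Dict[str, dict],
    prune: bool = True,
) -> List[Tuple[str, str]]:
    """Compare Git state (desired) with cluster state (actual) and list the fixes."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))   # in Git, missing from the cluster
        elif actual[name] != spec:
            actions.append(("update", name))   # drifted, e.g. a manual change (self-heal)
    if prune:
        for name in sorted(actual):
            if name not in desired:
                actions.append(("delete", name))  # removed from Git (auto-prune)
    return actions


if __name__ == "__main__":
    git_state = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    cluster_state = {"web": {"replicas": 5}, "old-job": {"replicas": 1}}
    for action, resource in plan_reconcile(git_state, cluster_state):
        print(action, resource)
```

With pruning disabled, the leftover `old-job` resource would stay in the cluster, which is why prune is worth deciding on explicitly.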

6. Monitoring and alerting

Choose a monitoring solution: Prometheus/Grafana/Loki or the ELK stack
If you can afford it, consider going for ready-made monitoring solutions like Datadog, New Relic, …
Be aware of maintenance and how much time you are willing to invest in developing and maintaining a monitoring solution
Be sure that critical alerts are defined
The team gets notifications on critical alerts (Slack, phone, email, … perhaps all; whatever works best for the team)
Implement reactive monitoring – alerting
Consider automatic ticket creation in ITSM or firing a runbook on critical alerts
Continuously improve your monitoring. Establish a baseline and observe deviations.
Use dashboards
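As a small illustration of "establish a baseline and observe deviations", the sketch below flags samples that sit far from a baseline using a simple z-score. The function name, the threshold of 3 standard deviations, and the latency numbers are assumptions chosen for the example; real monitoring stacks offer more robust methods.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_deviations(
    baseline: Sequence[float],
    samples: Sequence[float],
    threshold: float = 3.0,
) -> List[float]:
    """Return samples more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two data points")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [x for x in samples if x != mu]
    return [x for x in samples if abs(x - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical p95 latency in milliseconds during a normal week.
    normal_latency = [100, 102, 98, 101, 99]
    print(find_deviations(normal_latency, [100, 103, 120]))
```

Only values that clearly break from the baseline should page someone; smaller wobbles belong on a dashboard.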

7. IaC and CI/CD

Choose one solution if possible.
Follow the DRY (Don’t Repeat Yourself) principle: make sure there is no code duplication
Readable code – use naming conventions, formatting
Use pull requests for infrastructure. Treat infrastructure code like your application code
Consider building in cost checks (e.g. test whether a change will raise the bill significantly if you are using a public cloud)
Make changes ONLY through new commits. No manual interaction.

8. Security

Do not store credentials in plain text
Use the principle of least privilege
Encrypt your data at rest and in transit
Use service accounts to connect to cloud resources
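One simple way to keep credentials out of source code is to read them from the environment at runtime and fail fast when they are missing. The helper names and the `DB_PASSWORD` variable below are hypothetical; in practice the value would typically be injected by a secret manager or your orchestrator.

```python
import os
from typing import Mapping, Optional


def get_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read a secret from the environment instead of hard-coding it."""
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the secret is safe to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    # A fake environment is passed in here only to keep the example self-contained.
    password = get_secret("DB_PASSWORD", {"DB_PASSWORD": "s3cr3t-value"})
    print("Loaded DB password:", mask(password))
```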

9. Leadership

Set goals. Define SLO/SLI/SLA
Set up KPIs and conduct regular meetings to monitor them.
100% reliability is not a good goal!
Is there an onboarding page for SREs joining the team?
Schedule 1:1 meetings with the team (probably with the manager or lead?)
Identify possible gaps.
Eliminate toil
Does the development team wait on SRE for infra-related operations?
Identify the SRE team's maturity and work on improving it
- Step 1: Operations: SRE is focused on resolving issues, dealing with requests
- Step 2: Automation: SRE is moving towards automation and self-service. Providing tooling, documentation, etc.
- Step 3: Product: SRE is focused on improving the product itself: reliability, performance, etc.
Keep learning ALL THE TIME

10. Post incident analysis

Hold a root cause analysis meeting after each incident
Note everything
Promote postmortem analysis and blameless culture

Many thanks to Arie Bregman for the awesome checklist. Many thanks to everyone who commented on LinkedIn. It really helps us grow and create more interesting content.

For more SRE content, please subscribe to our newsletter, follow us on Twitter, and check out our SRE posts if you haven't yet.

Save your privacy, be ethical!