How To Tell Your Team Is Actually Ready For Site Reliability Engineering

- Updated March 14, 2018

Site Reliability Engineering is the latest hotness.  But is it also just the latest hype-train?  Short answer: yes…but also, uh…it’s complicated.  Considering SRE?  Today, I talk about how to set yourself up for success.

 

What I’ve seen recently with regard to SRE is that a lot of organizations took their cues from Google, LinkedIn, Facebook, etc. and rushed to follow suit.  What needs consideration first is the specific shape, size, and culture of your team, which should determine if (and how) you carry out an SRE practice.  I’ve compiled a quick guide for how to tell if you should start building that SRE team.  But first…

What is Site Reliability Engineering?

Popularized by Google with their formative publication, Site Reliability Engineering, SRE (for short) is, simply put, the practice of taking already established concepts like Capacity Planning, Event Management, and Incident Management and applying Software Development principles to them.  What this ultimately means in practice is that the group of Engineers responsible for these kinds of Reliability concerns moves into the light and stops being “that team over in the corner that does the Ops things.”

The result of a highly functioning Site Reliability team is almost always a product more resilient to failures, a better balance of new feature development vs. tech debt reduction, and a more “woke” Operations team (woke because they’ve gotten more sleep at night because of less pager duty. duh.)

 

Do SRE: Get Woke AF?

Not necessarily. There are foundational questions that need answering before jumping into building SRE.  It has trade-offs that need consideration before you adopt it at your organization.

 

How’s Your DevOps Culture?

First, and maybe most important, is to spot-check your DevOps Culture.  In another post, I’ll do a deeper dive into how to do this.  For now, there are a few starter questions to ask yourself and your team in my article: What Exactly is DevOps and Why Should We Care?

A strong DevOps culture that promotes continuous improvement, tight feedback loops between Dev and Ops, and Systems Thinking is necessary to succeed at SRE. SRE flows out of DevOps and is a natural tactic to work within a broader DevOps strategy.  If it helps, DevOps is the strategy and SRE is a tactic, and tactics need Objectives to capture.

Understand What You Want To Achieve with SRE

Write down at least three goals you want to carry out with an SRE team. Make sure they have numbers associated with them.  These should also relate to things such as Capacity Planning or Incident Management.  For example: “Reduce SEV1 incidents by 50%” or “Support 20% more API requests.”

If this doesn’t come easy, it is not yet the right time to build out an SRE practice.
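If it helps to make “goals with numbers” concrete, here is a minimal Python sketch of how you might write them down and check progress.  Everything in it is an assumption for illustration: the Goal class, its field names, and the baseline and current figures are made up, not something from Google’s book or from a real team.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """One measurable SRE goal (hypothetical structure, for illustration only)."""
    name: str
    baseline: float  # value measured before the SRE effort starts
    target: float    # value you want to reach
    current: float   # latest measured value

    def progress(self) -> float:
        """Fraction of the baseline-to-target distance covered, clamped to 0..1."""
        span = self.target - self.baseline
        if span == 0:
            return 1.0  # nothing to change, so the goal is already met
        return max(0.0, min(1.0, (self.current - self.baseline) / span))


# Made-up numbers that mirror the two example goals in the text.
goals = [
    # "Reduce SEV1 incidents by 50%": 12 per quarter down to 6.
    Goal("SEV1 incidents per quarter", baseline=12, target=6, current=9),
    # "Support 20% more API requests": 1000 per second up to 1200.
    Goal("Peak API requests per second", baseline=1000, target=1200, current=1100),
]

for goal in goals:
    print(f"{goal.name}: {goal.progress():.0%} of the way to target")
```

The point of the sketch is only that each goal carries a baseline and a target, so “are we getting there?” has a numeric answer whether the number is supposed to go down (incidents) or up (capacity).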

Mistake: SRE Can’t Fix “My Team is Drowning In Incidents”

A common mistake I see is trying to build an SRE team because the existing team is indefinitely backlogged and disrupted by a stream of never-ending incidents.  This situation is an “Incident Spiral”.  An SRE team is not a good fix for it because SRE needs a stable foundation and practice to grow successfully.

Think of it this way:

  • DevOps culture can get you out of the Incident Spiral
  • SRE practice can prevent Incidents from getting out of control

If you are finding your teams struggling with Incidents, the only way to fix it is to stop the I Love Lucy chocolate factory, get feedback on incident patterns, prioritize, fix, and repeat.

With stability established, SRE can now prevent this situation from resurfacing.
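The “get feedback on incident patterns, prioritize” step can start very small.  Below is a rough Python sketch, assuming you can export your incidents with some kind of root-cause label.  The incident IDs, the cause categories, and the top_patterns helper are all hypothetical; swap in whatever your own tracker gives you.

```python
from collections import Counter

# Hypothetical incident log: (incident id, root-cause category).
incidents = [
    ("INC-1", "bad deploy"),
    ("INC-2", "disk full"),
    ("INC-3", "bad deploy"),
    ("INC-4", "expired certificate"),
    ("INC-5", "bad deploy"),
    ("INC-6", "disk full"),
]


def top_patterns(log, limit=3):
    """Return the most frequent root causes as (cause, count) pairs."""
    counts = Counter(cause for _, cause in log)
    return counts.most_common(limit)


# The most frequent cause is the first candidate to prioritize and fix.
for cause, count in top_patterns(incidents):
    print(f"{cause}: {count}")
```

Counting by cause is a deliberately crude way to prioritize.  It ignores severity and duration, but it is usually enough to show which fix stops the most pages first.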

Summary

  • SRE is a useful tactic for tackling issues of system stability, capacity, and incident management
  • Before jumping into the SRE practices described in Google’s Site Reliability Engineering book, check your team and situation
  • Make sure DevOps exists in your team’s culture
  • Have a clear understanding of what you want from an SRE team
  • Get your team out of an Incident Spiral before proceeding