Arctic SRE Adoption Framework Part 1

Posted by Vishnu Vardhan Chikoti

Posted on 20-Jun-2021 14:48:50

Introduction

We are all aware of the extent to which the COVID-19 pandemic has accelerated digital transformation. Many services that were previously in-person only are now also offered digitally, and those that were already digital have scaled multi-fold. Work, school, shopping, entertainment and conferences have all moved online. Organizations offering services through digital channels can see their customer base grow from several hundred to several thousand, and in some cases to a few million. As the world moves into a new normal of remote working and increased online shopping, education and entertainment, digital services need to be offered reliably. If the services are unreliable, there is a risk of losing customers to a competitor.

Given the growing focus on infrastructure and service/application reliability, more and more companies are adopting Site Reliability Engineering (SRE). Organizations would benefit from a framework for SRE transformation, just as Scrum, XP and Kanban exist for Agile transformation. Without such a framework, the transformation is challenging because organizations need to spend a lot of effort upfront understanding SRE concepts and planning before they begin their actual SRE transformation journey. Here is a high-level overview of a framework I created for SRE adoption: Arctic! I will be posting about Arctic in two parts.

Part 1

This is the current blog, which provides a high-level overview of the two pillars of Arctic.

Part 2

It will be released in a couple of weeks and will provide the following details about Arctic.

Supporting frameworks/concepts that can go hand-in-hand with Arctic. No one framework or concept will be successful on its own; it needs to be combined with related frameworks/concepts for successful results.
What to look for when hiring SREs - both in terms of personality types and skill sets.
A way to do the goal setting for the transformation.
Finally, a list of things that should NOT be done.

Ground rules

There are a few ground rules I made for this framework.

This framework will not introduce any new titles. Site Reliability Engineer (SRE) is already in use as a job title, as are related roles in the IC and leadership ladders.
This framework will not focus on “what is SRE” but on “how to adopt SRE”. There are quite a few great resources (books, articles and videos) for learning about SRE. Very recently, Google’s first SRE book turned 5 years old.
This framework will not focus on the decision-making process of whether an organization should implement SRE practices. It covers the “how to” part after that decision is taken.

The two pillars

Following are the two key elements or pillars under the Arctic framework.

Visibility
Accountability

Visibility

The first and foremost thing is Visibility. Right from the start, the organization needs visibility into where it currently stands on SRE practices. This matters from the start because organizations might already be using some of the practices naturally, and knowing the current state is necessary to make a roadmap to the target state.

Organizations can start with a focus on certain applications rather than the entire suite. Similar to Agile adoption, organizations can also take a phased approach to reach the target state of complete adoption. This will help ensure the organization as a whole shares a common culture in thinking about the reliability of infrastructure and applications.

Visibility is needed in several areas, which are listed below.

Adoption

A successful Site Reliability Engineering adoption relies on three main things - Practices, Tools and Policies/Procedures. Following is the list of the practices, tools and policies/procedures/standards that would need to be focused on. Detailed explanation of these practices, tools and policies/procedures/standards is out of scope for this blog.

Practices

Following are the SRE practices that will be required.

Monitoring
Observability
Defined SLOs
Measured SLIs and Error budgets
Incident response
Incident management
Blameless postmortems
Change management
Release management
Eliminating toil
Capacity planning
Infrastructure automation
Elastic environments
AIOps
ChatOps
Chaos Engineering
Security best practices
Compliance with regulatory standards
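
As a rough illustration of how a defined SLO, a measured SLI and an error budget relate to each other, here is a minimal Python sketch. The function name, the request-count based SLI and the returned field names are assumptions made for illustration; Arctic does not prescribe any particular way of calculating them.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize SLI and error budget usage for one measurement window.

    Assumption: the SLI is the fraction of good events (for example,
    successful requests) out of all events in the window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    sli = (total_events - bad_events) / total_events
    # The error budget is the number of bad events the SLO tolerates.
    allowed_bad_events = (1 - slo_target) * total_events
    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad_events,
        "budget_consumed": bad_events / allowed_bad_events,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests, 400 failed, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 400))
```

With these hypothetical numbers, roughly 40% of the error budget for the window is consumed and the SLO is still met.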

Tools and platforms

There are a number of tools that will be required for successful implementation of SRE. Following are the categories of the important tools. Examples of these tools are currently not covered in this overview blog.

CI/CD tools
Runtime platforms (on-prem or PaaS)
Infrastructure provisioning
Backup and recovery
Patch management
Monitoring
Observability
Alerting
Alert correlation
On-call management
Incident communications management
Ticketing
Self-healing
Change management
Chat applications
Natural Language Understanding (NLU)
Fault injection
Security Information and Event Management (SIEM)

Policies and procedures

Following are the policies and procedures that will be required.

Incident management
Change management
Error budget policy
SRE onboarding procedure
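
One possible way to get visibility into adoption status across the practices, tools and policies listed above is a simple scorecard. The Python sketch below is only an assumption of how that could look: the three status values, their scores and the function name are hypothetical and not part of Arctic.

```python
from collections import defaultdict

# Hypothetical status scale; the framework does not prescribe one.
STATUS_SCORE = {"not_started": 0.0, "in_progress": 0.5, "adopted": 1.0}


def adoption_summary(items):
    """Return the adoption percentage per category.

    items: iterable of (category, name, status) tuples, where status is
    one of the keys in STATUS_SCORE.
    """
    scores_by_category = defaultdict(list)
    for category, _name, status in items:
        scores_by_category[category].append(STATUS_SCORE[status])
    return {
        category: round(100 * sum(scores) / len(scores), 1)
        for category, scores in scores_by_category.items()
    }


if __name__ == "__main__":
    # Made-up sample statuses for a few of the listed items.
    sample = [
        ("practices", "Monitoring", "adopted"),
        ("practices", "Defined SLOs", "in_progress"),
        ("practices", "Chaos Engineering", "not_started"),
        ("tools", "Alerting", "adopted"),
        ("policies", "Error budget policy", "not_started"),
    ]
    print(adoption_summary(sample))
```

The same structure could be kept per application, so that a phased rollout shows which applications are ahead and which are behind.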

It is useful to look at the overall adoption status in terms of practices, tools and policies. The next important aspect where visibility is needed is the results of SRE adoption. After all the work, it will be important to showcase the benefits of SRE. The results of SRE adoption can be viewed through metrics.

Metrics

Following are some of the metrics that can be tracked to see the benefits from SRE.

Toil elimination
Reduction in Mean time to acknowledge (MTTA)
Reduction in Mean time to detect (MTTD)
Reduction in Mean time to recover (MTTR)
Reduction in Mean time to insight (MTTI)
Increase in Mean time between failures (MTBF)
Reduction in resolution time for incident postmortem action items
SLO breaches/error budget exhaustion
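
To make the time-based metrics above concrete, here is a minimal Python sketch that derives them from incident timestamps. The Incident fields and the exact definitions are assumptions for illustration (for example, MTTR is measured here from the start of the fault to recovery, and MTBF as the gap between one recovery and the next fault); organizations define these metrics in different ways.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; real incident tools expose different fields.
    started: datetime       # when the fault began
    detected: datetime      # when monitoring or a user reported it
    acknowledged: datetime  # when a responder acknowledged the alert
    resolved: datetime      # when the service recovered


def _mean_seconds(deltas):
    return mean([d.total_seconds() for d in deltas])


def incident_metrics(incidents):
    """Return MTTD, MTTA and MTTR in minutes and MTBF in hours."""
    ordered = sorted(incidents, key=lambda i: i.started)
    # Time between one recovery and the start of the next fault.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "mttd_min": _mean_seconds([i.detected - i.started for i in ordered]) / 60,
        "mtta_min": _mean_seconds([i.acknowledged - i.detected for i in ordered]) / 60,
        "mttr_min": _mean_seconds([i.resolved - i.started for i in ordered]) / 60,
        "mtbf_hours": _mean_seconds(gaps) / 3600 if gaps else None,
    }


if __name__ == "__main__":
    # Made-up incidents for demonstration only.
    sample = [
        Incident(datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 10, 5),
                 datetime(2021, 6, 1, 10, 8), datetime(2021, 6, 1, 10, 45)),
        Incident(datetime(2021, 6, 2, 14, 0), datetime(2021, 6, 2, 14, 15),
                 datetime(2021, 6, 2, 14, 16), datetime(2021, 6, 2, 14, 45)),
    ]
    print(incident_metrics(sample))
```

Comparing these numbers for periods before and after SRE adoption is one way to show the trend for each metric.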

Following are other benefits to track, as SRE not only results in reliable services but also helps with these other things, directly or indirectly.

Better utilized/planned infrastructure
Improved tech staff experience
Improved productivity of tech staff
Improved business launch experience
Improved reputation through reduced downtime and fewer complaints about poor experience on social media. Historic downtime and issues are also visible on some downtime tracking sites.

Accountability

Accountability for SRE functions can be set in an organization by deciding upfront how the SRE team(s) will be structured. There are a number of options to choose from. The structure can also evolve, starting one way and changing to another as SRE practices are adopted at scale in large organizations. Whichever structure is chosen, it is important to define clear roles and responsibilities (R&R) for each team so that things don't fall through the gaps.

Possible structures

Following are some of the ways the teams can be structured. One of the structures has to be chosen and roles and responsibilities clearly defined. Organizations can also move from one structure to another; however, roles and responsibilities need to be redefined when the structure changes.

Central SRE

In this case there will be a single SRE team that takes care of everything SRE. They drive the adoption of practices, bring in tools (open source, vendor or built in-house) and create the necessary policies and standards. The central team can also embed SREs into application/product teams to work closely with them and ensure the applications are built for reliability and ready for SRE support onboarding at go-live. Embedding SREs is done with a shift-left mindset so that there are no surprises or rework later.

The benefit of this approach is that the central team sets the standards and tools and also drives the adoption. There is a central focus and good standardization in the organization.

The challenge of this approach arises in larger organizations, where the central team can get overwhelmed coordinating with multiple teams and may become a bottleneck, primarily in making the right decisions during times of conflicting priorities. Efficient leadership and sufficient resourcing are required for a central SRE team to succeed at scale.

Split by Function

The SRE responsibilities can be split across multiple teams based on function, with the central SRE team retaining overall visibility, setting the standards and driving the SRE agenda forward in the organization. Following are some of the functions that can be handled outside the central SRE team.

Infrastructure SRE
Data SRE
SRE Tools team (a team that focuses on bringing in vendor or open source tools, or building internal tools, to be used across the organization)

The benefit of this approach is that teams are separated primarily by the skill set required. While it's a great idea to hire SREs who know it all, or to train SREs to that level, it might be difficult in larger organizations to find enough SREs who reach that level.

The challenge of this approach is the risk that each team becomes siloed and drifts away from the core SRE focus, especially when resources meant for SRE end up being reused for something else. This is where the visibility aspect will play a role.

Federated product SRE teams

There can be further variation on how product/application teams meet the reliability needs. Instead of embedding SREs from the central SRE team, SRE teams can be federated and spread across departments/product teams.

This again will be beneficial in large organizations, as it provides focus on specific products/applications and lets these teams have tools or platforms that suit them better than the ones brought in generically for use across the organization.

There are two challenges with this approach. One is that teams might end up with duplicate tools or platforms that do the exact same thing. I have seen this, and most of you reading this might have seen it in large organizations. It may be because someone in the federated SRE team is familiar with a different tool or platform, or because of a general bias towards one over the other based on what the federated team already knew, or because of cost, or simply a tendency to try something new that looks cooler than the other. The other challenge, again, is that this approach may end up creating silos. It is important for the federated team to still adopt the practices, policies and procedures set up by the central SRE team even if they drift away to tools other than the ones standardized for the organization. This is again where the visibility aspect will play a role. Also, federation should only be done where needed and to the extent needed; everything else should stay with the central team or the teams that are split by function.

Once the structure is formed, the next step is to have clear roles and responsibilities. Following are the important questions to be clear on.

Who in the product teams will be the decision maker for onboarding to SRE? If an organization thinks of adopting SRE for a subset of products, who will decide the selection criteria?
If it's not an embedded or federated SRE model (that is, SREs are not part of the product teams), who will be responsible for onboarding to SRE and meeting the necessary requirements once the decision is taken?
Who in the product teams, or maybe from project management, will inform SRE about new business launches? It may be a launch into a new market with existing products, or a new product that is deeply integrated with existing products and can increase traffic to them. This information needs to be passed on for proactive preparation.
Who from senior leadership will be able to help resolve any conflicts that might arise during the transformation or at a later stage if any team/department drifts away from the agreed roles and responsibilities?

I would love to hear any feedback/thoughts/comments to improve this framework. You can comment here, use the Contact Us menu option in the footer, or send them to vvcofjs@gmail.com.

About the author

Vishnu Vardhan Chikoti is a co-author of the book "Hands-on Site Reliability Engineering". He is a technology leader with diverse experience in the areas of Application and Database design and development, Micro-services & Micro-frontends, DevOps, Site Reliability Engineering and Machine Learning.