Posted on 16-Jul-2021 19:45:06
This is part 2 of the two-part series about the SRE adoption framework - Arctic. Part 1 was published last month and can be accessed from the link below. If you have not read part 1, I recommend reading it before this part. This part covers the following.
Other frameworks and concepts.
What to look for when hiring SREs - both in terms of personality types and skill sets.
A way to do the goal setting for the transformation.
Finally, a list of things that should NOT be done.
Nothing in this world survives on its own or solves every problem, and frameworks are no different: no one framework can cover it all. It is a connected world where multiple pieces should be logically picked and combined for successful results. Following are the frameworks/concepts that go hand-in-hand with Arctic. A detailed explanation of each of these is outside the scope of this overview blog.
As described in the part 1 blog, there are different ways SRE team(s) can be structured. Whichever way they are structured, SREs need to perform manual operational tasks, develop automation scripts and develop internal tools/frameworks. While it is suggested that SREs spend not more than 50 percent of their time on support/operational/interrupt work, it is important to look at how that 50% is distributed. Instead of splitting that time within a day, it is beneficial to do the operational/automation work and the development work for new tools in rotations. An alternative is to have one team for operational and automation work and a separate team purely for building internal tools. The distribution of this work between SRE team(s) can also be based on the talent pool available in the location. It is not easy to find people who have all three skills and are willing to do all three types of work.
Also, note that it is NOT recommended to separate manual operational work and automation work between different teams. That ends up in the traditional state of having an operations team that is separate from the automation team.
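As a rough illustration of the 50 percent guideline above, the sketch below computes the share of a team's time that went into support/operational/interrupt work over a period. The category names, the hours and the function names are assumptions made up for this example; they are not defined by Arctic.

```python
from typing import Dict

# Assumed category names for support/operational/interrupt work.
OPERATIONAL = {"support", "operations", "interrupts"}


def operational_share(hours_by_category: Dict[str, float]) -> float:
    """Return the fraction of total hours spent on operational work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPERATIONAL)
    return ops / total


def exceeds_cap(hours_by_category: Dict[str, float], cap: float = 0.5) -> bool:
    """True when operational work takes more than the cap (50% by default)."""
    return operational_share(hours_by_category) > cap


# Hypothetical hours logged by one team in one rotation.
sprint_hours = {"support": 30, "interrupts": 20, "automation": 25, "tooling": 25}
print(operational_share(sprint_hours))
print(exceeds_cap(sprint_hours))
```

Tracking this per rotation rather than per day fits the suggestion above of doing operational and development work in rotations.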
Following are the agile frameworks based on the type of work.
The Scrum framework is suitable for SRE teams developing internal tools. They can adopt it to iteratively deliver potentially shippable product increments (PSPIs) at the end of each sprint.
Kanban is about continuous flow of work and work-in-progress (WIP) limits. It is suitable for SRE teams handling both manual operational work and automation work. For reference, one of my earlier blogs gives an overview of Kanban.
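To make the idea of WIP limits concrete, here is a minimal sketch of a board that refuses to pull a new item into a column that is already at its limit. The class, column names and ticket names are hypothetical and only illustrate the concept; real Kanban tools offer much more.

```python
from collections import defaultdict


class KanbanBoard:
    """A toy board that enforces per-column WIP limits."""

    def __init__(self, wip_limits):
        # Columns without an entry in wip_limits are treated as unlimited.
        self.wip_limits = dict(wip_limits)
        self.columns = defaultdict(list)

    def can_pull(self, column):
        limit = self.wip_limits.get(column)
        return limit is None or len(self.columns[column]) < limit

    def add(self, column, item):
        """Add an item to a column; return False if the WIP limit is hit."""
        if not self.can_pull(column):
            return False
        self.columns[column].append(item)
        return True

    def move(self, item, src, dst):
        """Move an item between columns; return False if not allowed."""
        if item not in self.columns[src] or not self.can_pull(dst):
            return False
        self.columns[src].remove(item)
        self.columns[dst].append(item)
        return True


board = KanbanBoard({"in_progress": 2})
for ticket in ["rotate-certs", "patch-db", "automate-failover"]:
    board.add("backlog", ticket)

moved = [board.move(t, "backlog", "in_progress") for t in list(board.columns["backlog"])]
print(moved)
```

The third move is rejected until one of the two items in progress is finished, which is what keeps work flowing instead of piling up.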
SRE leadership will need visibility into the productivity of developers and SREs. The recently introduced framework called SPACE can be used for this. While there are other HR surveys to understand how satisfied employees are, the point here is to have visibility into the developer/SRE experience. The five letters in SPACE stand for the following.
Satisfaction and well-being
Performance
Activity
Communication and collaboration
Efficiency and flow
There are a number of useful concepts for SRE team(s) to understand. The concepts are spread across the following areas.
In addition to the concepts and technologies that SREs know through their core work, the following technical concepts will also need to be understood. SREs can use these during discussions with application teams and in the tools that they themselves build.
12 factor application principles
Use of resilience libraries like Hystrix or Resilience4j
Test driven development
Behavior driven development
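One pattern that resilience libraries such as Hystrix and Resilience4j provide is the circuit breaker. The sketch below is a minimal Python illustration of that pattern, not the API of either library. The class name, parameters and the injectable `clock` are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open after a failed trial call) the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap a remote call as `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function, and handle `CircuitOpenError` with a fallback instead of waiting on a dependency that is already failing.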
Design thinking is specifically useful when SREs decide to build a tool in-house. When I say design thinking, I mean the following.
Looking at the human desirability/technical feasibility/business viability.
Iterating over the empathize, define, ideate, prototype and test cycle.
This will be especially useful to validate the decision of building something new. The tool may be based on a totally new, innovative idea where no such tool exists in the market, or it may be based on a decision to build an in-house tool to avoid vendor costs. As techies, we get excited about building something new and cool. However, from an organization's perspective, the tool needs to add value and be built so that it actually gets used later.
And for tools where there are UIs, especially for the dashboards that are built by SREs, it is good to build the wireframes, get feedback and iteratively improve those. The UI design can follow the initial wireframes and again can be iteratively improved. The concepts of Design Sprint or Sprint Zero can also be used to get the initial prototypes.
While at a startup a few years ago, I learnt two interesting concepts - one around building new products and the other around building new e-commerce or O2O commerce platforms. The first is based on the book Crossing the Chasm. The other is the chicken and egg problem, which is explained shortly.
Crossing the Chasm is based on the innovation adoption lifecycle by Everett M Rogers. In the book, the author Geoffrey A Moore explains the chasm that exists between the initial adopters (the two groups of innovators and early adopters) and the majority, which includes the other three groups (early majority, late majority and laggards). While the book focuses on selling products externally, the same can be applied internally when new tools get built and adopted across the organization. A similar chasm exists even for tools built inside the organization. It is important to meet the needs of all the groups within the organization for successful adoption. Successful adoption of tools across the organization helps in standardization.
The chicken and egg problem has to be solved when the success of what is being built depends on producers and consumers coming together on a single platform/product in the right order and at the right time. The success also depends on bringing in the right content and producers for a targeted consumer or, the other way around, finding the right consumers for the producers. An example of this in the SRE world is where we try to build knowledge bases, best practice guides, etc. Without sufficient content, those who are interested in learning something new will not come forward as they might not find what they need. So sufficient content, starting with the most useful content, should be made available for the knowledge bases to be used successfully.
Similarly, certain organizations encourage InnerSource, a term for building open source within organizations. Depending on the way services are built and deployed, there might be a need to provide standardized scripts, libraries or tools. SRE team(s) can build these commonly useful things across teams and make them part of the organization's InnerSource. For InnerSource to be successful, SREs can start by developing something that will be useful for most teams.
There is another book from Geoffrey A Moore titled Zone to Win. Zone to Win is about how to organize an enterprise into four zones - Performance, Productivity, Incubation and Transformation. It is good to understand this concept to see how newer revenue generating products can be started and moved into the mainstream. Not every organization follows the zone concept, but it is useful to know.
In the book, the IT organization that supports the core businesses (in the Performance zone) exists in the Productivity zone. Critical revenue impacting services exist in the Performance zone, and SRE team(s) fall within the Productivity zone, supporting these critical services in the Performance zone.
The two other zones are the Incubation zone and the Transformation zone. The Incubation zone is where POCs and experimentation are done on newer products or technologies. Through the Transformation zone, these are transformed into revenue generating business lines or other digital transformation and then merged into the Performance zone. SREs can play a good role as these products move from incubation to performance through transformation. With a shift-left mindset, SREs can be embedded into product teams in these zones to build the appropriate levels of reliability into the services. The level of reliability becomes more and more important for the organization as the products move from one zone to the other.
In certain cases, the SRE team(s) can themselves incubate new technologies to reduce operational cost and/or improve reliability further. In this case, the incubated ideas are tried and tested within the Productivity zone.
As with any transformation program, different types of personalities are required for a successful transformation. While technical skills are required, identifying different personality types with the required skill sets is also important.
There are different ways to identify personalities, for example the Predictive Index (PI) test or the DISC profile test. Another way is based on the experience of working or interacting with someone. Also, in the book Surrounded by Idiots, author Thomas Erikson describes four human behaviors and says that most people have one or, more generally, a combination of the four behaviors.
A similar analysis can be applied while hiring candidates to determine their behavior. For example, going by the book, people with the Yellow behavior are persuaders and are good at bringing others together.
Within the organization itself, as SREs work with various stakeholders across the organization, it will be good for the SRE leaders interacting across the organization to understand the type of personality they are interacting with and communicate effectively.
The technical skills of SREs are spread across the tool set mentioned in part 1.
There are different ways to set goals, like SMART goals, OKRs, etc. I personally like the Salesforce way of goal setting in the form of V2MOMs. While SMART goals are good, V2MOMs are broader, and what I like most is the way the vision can be stated and obstacles can be highlighted. Having a broad idea of the overall vision and thinking about what obstacles the transformation program may face is of utmost importance for the successful adoption of SRE across the organization, or across the targeted part of the organization to start with. For reference, V2MOMs are made up of five parts: Vision, Values, Methods, Obstacles and Measures.
If writing in the format of all five is not easy at all levels, at least the last three should be considered.
While most of part 1 and part 2 is about what can be done for the adoption of SRE, following are certain things to keep in mind and avoid doing.
No bandwagon bias. Use the right tool that serves the purpose that you want to achieve. Do not force fit something just because someone else is using it. Always remember you don't need a hammer to fix a screw; all you need is a screwdriver.
Do not over-engineer solutions. Stay focused on reaching the level that is needed to meet the quality and SLO requirements.
Do not mix in traditional policies that contradict the automation done by SREs. For example, in order to perform fully automated releases, certain changes in services that clear automated validation and tests should be allowed to happen without any manual approval in between. The policy of a manual change approval should not be forced on fully automated changes where a service and its tests are built to a level that supports automated releases.
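The approval policy in the last point can be sketched as a simple rule. This is only an illustration under assumed names; the `Change` fields and the `needs_manual_approval` function are hypothetical and do not refer to any particular change management tool.

```python
from dataclasses import dataclass


@dataclass
class Change:
    service: str
    fully_automated: bool      # assumed flag: service and tests support automated releases
    validations_passed: bool   # assumed flag: automated validation and tests are green


def needs_manual_approval(change: Change) -> bool:
    """Skip manual approval only for fully automated changes that passed validation."""
    return not (change.fully_automated and change.validations_passed)


print(needs_manual_approval(Change("checkout", True, True)))
print(needs_manual_approval(Change("billing", False, True)))
```

Anything that is not fully automated, or that fails its automated checks, still goes through the traditional approval path.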
I would love to hear any feedback/thoughts/comments to improve this framework. You can comment here, use the Contact Us menu option in the footer or send an email to email@example.com.
Vishnu Vardhan Chikoti is a co-author of the book "Hands-on Site Reliability Engineering". He is a technology leader with diverse experience in the areas of Application and Database design and development, Micro-services & Micro-frontends, DevOps, Site Reliability Engineering and Machine Learning.