Sponsored Content by Broadcom
Have you heard about site reliability engineers (SREs)? This is an evolving role that is becoming more vital within IT, including on the mainframe.
Site reliability engineering started at Google more than a decade ago. It focuses on applying a software engineering mindset to system administration topics. An SRE’s mission is to protect, provide for, and progress the software and systems behind all mainframe services. Availability, latency, performance, and capacity are metrics which are always on an SRE’s radar. They treat operations as a software problem. And there are other key concepts we will talk about, including system knowledge, monitoring, and automation.
SREs cover a broad range of topics. They get granular when they need to, drilling down into problems to the bits and bytes, while also staying high level and seeing the bigger picture when it comes to things like capacity and architecture.
SRE Principles
SREs follow some key principles:
- Automate, helping to reduce Mean Time to Resolution (MTTR) and risks.
- Eliminate toil, reducing manual and repetitive activities (see details in the following section).
- Cap toil: SREs aim to spend no more than 50% of their time on toil, leaving at least 50% for improvements.
- Define and monitor Service Level Agreements (SLAs) and Service Level Objectives (SLOs), assuring target levels for service reliability.
- Define an error budget, which is the tool an SRE uses to balance service reliability with the pace of innovation.
- Hold blameless postmortems, encouraging lessons learned and learning from failures, where the focus is on solving the problem and improving the service, not pointing fingers.
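To make the error budget principle concrete, here is a minimal sketch in Python. It assumes an availability SLO measured over a rolling 30-day window; the function names, the window length, and the 99.9% target used below are illustrative assumptions, not values from any specific product or organization.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by the SLO over the window, in minutes."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes of downtime. A team might decide, for example, to slow down risky changes once `budget_remaining` approaches zero and to move faster while plenty of budget is left.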
At some point, we need to embrace risks for the sake of innovation; however, at the end of the day, organizations are looking for uninterrupted quality of service, fewer fire drills, prioritization, and more efficient problem solving.
If organizations have a mainframe as part of their IT infrastructure, they have one of the most reliable platforms in the world. So, how do we balance that tension between innovation and reliability? To start, the mainframe is already modern, and new innovative capabilities are continuously being released that enable organizations to run their mainframe efficiently and securely for a more reliable hybrid IT environment.
What Is Toil?
Let’s explore some of the different SRE principles from a mainframe perspective. One aspect an SRE tackles, which can apply to the mainframe, is “eliminating toil.”
The SRE Book defines “toil” as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly as a service grows.”
When we are talking about the mainframe, I would say the three key categories of toil are:
- Manual
A clear example is when a subject matter expert (SME) needs to take manual action to resolve a problem. A mainframe scenario would be a started task that hangs because of a memory problem, which you then recycle manually.
- Repetitive
This is when you need to execute the same task repeatedly, as with checklists. For example, checking for write-to-operator-with-reply requests (WTORs), contention, SMF, online started tasks, or availability.
- Automatable
This happens when there is a workflow of activities, often procedurally oriented, that could be automated. For example, you have a script that is triggered manually to resolve a problem. That’s great, but it means the process is only partially automated. In eliminating toil, we are aiming for a fully automated process with no human intervention.
Toil is also often tactical and reactive, such as when you’re distracted by a problem requiring your immediate attention. Toil typically lacks enduring value, meaning the activity may provide a temporary quick fix, but does not provide real improvement to the process or service. And when operational work grows as fast as the size of your underlying infrastructure, and you cannot escape from basic work, there are probably opportunities to eliminate toil.
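As an illustration of turning a repetitive checklist into something a scheduler can run unattended, here is a minimal Python sketch. Everything in it is hypothetical: the check functions are stubs with placeholder values, and the task names `TASKA` and `TASKB` are invented. A real implementation would query your own monitoring or automation tooling instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Each check returns (ok, detail). The check bodies below are hypothetical
# stubs; they do not call any real system interface.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def check_outstanding_wtors() -> Tuple[bool, str]:
    outstanding = 0  # placeholder value, not real system data
    return outstanding == 0, f"{outstanding} outstanding WTORs"


def check_started_tasks() -> Tuple[bool, str]:
    expected = {"TASKA", "TASKB"}  # hypothetical task names
    active = {"TASKA"}             # placeholder value, not real system data
    missing = sorted(expected - active)
    detail = f"missing: {missing}" if missing else "all active"
    return not missing, detail


def run_checklist(checks: Dict[str, Check]) -> List[CheckResult]:
    """Run every check and record the outcome; a crashing check counts as failed."""
    results = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            ok, detail = False, f"check could not run: {exc}"
        results.append(CheckResult(name, ok, detail))
    return results


CHECKLIST: Dict[str, Check] = {
    "wtors": check_outstanding_wtors,
    "started_tasks": check_started_tasks,
}

# Only failed checks need attention; passing checks need no human eyes.
failures = [r for r in run_checklist(CHECKLIST) if not r.ok]
```

Running the checklist on a schedule removes the repetitive part. The step that fully eliminates the toil would be wiring each failure to an automated remediation rather than to a person.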
Basic SRE Skills
- Topology
SREs should have a holistic view of their environment’s topology, know how resources are connected, and understand all the key applications. Having senior expertise is valuable, but having a comprehensive understanding of the overall environment is crucial.
- Monitoring
Critical alert monitoring is key, and being able to correlate events is even better. SREs can leverage machine learning capabilities to help with this task. Historical insights are equally important, as they allow you to analyze trends and use data for continuous improvements.
- Automation
It’s not only important to automate repetitive tasks and reduce toil, but it’s also necessary to support standardization, which will enable fast repair of problems and save time for in-house teams.
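To show what correlating events can mean at its simplest, here is a small Python sketch that groups alerts arriving close together in time into one candidate incident. It is a deliberately basic, rule-based stand-in (no machine learning), and the alert format, function name, and 60-second window are assumptions made for illustration.

```python
from typing import List, Tuple

Alert = Tuple[float, str]  # (timestamp in seconds, message)


def correlate_by_time(alerts: List[Alert],
                      window_seconds: float = 60.0) -> List[List[Alert]]:
    """Group alerts so that each alert is within window_seconds of the previous one in its group."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        # Compare against the most recent alert in the current group.
        if groups and alert[0] - groups[-1][-1][0] <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups
```

For example, alerts at 0, 30, and 200 seconds would form two groups: the first two together, and the third on its own. Real correlation usually also considers topology, such as which resources depend on which, and not only timing.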
SRE is a large subject, with varying opinions across the IT industry.
You might finish reading this blog and conclude that you already practice most of this, so you may ask yourself, “Am I an SRE?” Or you might decide to become an SRE or learn the principles in order to expand your skillset.
I don’t think the SRE role will be deployed equally across all organizations, but SRE concepts are essential. You need to weigh the benefits of SRE against how the role would fit into your organization and then determine how to adapt it to your reality.
And finally, is applying the SRE role to the mainframe any different from any other area in IT? The answer is no, not at all. The mainframe is as modern as any other platform, and it’s the most reliable one, so the SRE role is definitely a perfect fit for it. Want to learn more? Check out this blog.