Enterprise site reliability engineers (SREs) are tasked with ensuring the resilience of enterprise digital services by balancing reliability, incident response, and automation throughout the platform. SRE roles can also be platform-specific. At SHARE Washington, D.C., Angelika Heinrich, product manager at Broadcom, outlined the role and its responsibilities within enterprises during her session, “A Day in the Life of an Enterprise Site Reliability Engineer.” Mainframes are held up as the gold standard in platform reliability and scalability, and SREs are integral to their success because they are “the bridge between business strategy and the IT team.”
Roles & Responsibilities
SREs use their software engineering skills and specific platform experience to perform root cause analyses and use their visibility into applications to identify and improve system reliability. “Overall, the main goals for SREs are to ensure application uptime, scalability, and efficiency of services,” says Heinrich. “They also look to reduce operational toil through automation, which can include determining if there is a script that can be used instead.”
According to Heinrich, SREs’ main responsibilities are:
- Monitor and improve service level indicators (SLIs) to meet service level agreements (SLAs) and service level objectives (SLOs).
- Respond to outages and minimize downtime.
- Ensure infrastructure supports traffic and scaling requirements, and work with subject matter experts to automate manual tasks.
- Learn from incidents and improve processes (e.g., post-mortem of an incident).
- Enable application teams to leverage observability practices (i.e., give the teams the skills they need to apply those practices and prevent incidents).
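The first responsibility above, tracking service level indicators against objectives, can be illustrated with a small calculation. The sketch below is not from Heinrich's session. It assumes a request-based availability indicator and uses the common "error budget" framing (the share of allowed failures not yet consumed). The function name, field names, and sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured availability, 0.0 to 1.0
    slo_target: float        # objective, e.g. 0.999
    budget_remaining: float  # fraction of error budget left (negative if overspent)


def availability_report(good_requests: int, total_requests: int,
                        slo_target: float) -> SloReport:
    """Compare a request-based availability indicator against an objective."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    # Failures the objective tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli=sli, slo_target=slo_target,
                     budget_remaining=budget_remaining)


# Hypothetical window: 1,000,000 requests, 500 failures, 99.9% objective.
report = availability_report(good_requests=999_500,
                             total_requests=1_000_000,
                             slo_target=0.999)
print(f"SLI={report.sli:.4%}, budget remaining={report.budget_remaining:.0%}")
```

A shrinking or negative budget is one possible signal that reliability work should take priority over new changes. How that trade-off is made is a team decision rather than something the calculation settles.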
Enterprise SREs often take part in daily stand-ups or on-call rotations to review the day’s incidents, documenting the steps taken and any outcomes. During these reviews, they participate in or drive post-incident reviews and guide changes to avoid future incidents. Additionally, SREs partner with teams to remove blockers and increase visibility, which reduces incidents and improves mean time to restore, all within a blame-free environment. “Ideally, SREs should be embedded in DevOps and application teams for reliability,” she adds.
Strategies for Success
Among the strategies SREs lean on is enterprise observability, which relies on monitoring applications to determine how they operate. SREs look to ensure applications rely on standards and implement tools that achieve enterprise goals in the most efficient way. “Optimization and insights drive innovation and achieve cost savings,” says Heinrich. To achieve optimization, applications should be well-defined, use tools and standards required for those applications, adhere to compliance and security protocols, and employ automation as needed.
According to Heinrich, SREs strive to achieve the business goal of improved customer experience and the observability outcome of reduced mean time to restore an application. “In other words, how do we know something is wrong more quickly and how can it be restored?” she explains. To achieve operational efficiency and cost reductions, SREs and the IT team need to adopt automated detection to reduce manual troubleshooting, which, Heinrich adds, “will increase business revenue by improving application reliability and reducing downtime.”
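Mean time to restore is simple to compute once incidents are recorded consistently. The following is a minimal sketch, assuming each incident is stored as a pair of detection and restoration timestamps. The function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average the restore duration over (detected_at, restored_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [restored - detected for detected, restored in incidents]
    if any(d < timedelta(0) for d in durations):
        raise ValueError("restored_at must not precede detected_at")
    # timedelta supports addition and division by an integer.
    return sum(durations, timedelta(0)) / len(durations)


# Hypothetical incidents lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 23, 15)),
]
mttr = mean_time_to_restore(incidents)
print(f"Mean time to restore: {mttr}")
```

Measuring from detection rather than from the start of the fault is a choice made here for simplicity. Tracking time to detect separately would capture the "how do we know something is wrong more quickly" half of the question.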
Nitty-Gritty of Application Improvements
On the mainframe platform, tools such as Datadog or Dynatrace can monitor applications by visualizing, alerting on, and analyzing potential issues or errors. Another option is OpenTelemetry, which can use data and traces from multiple tools, such as Datadog and Elastic. According to Heinrich, “You can use proprietary solutions and open-source solutions to improve extensibility. These tools can also be used to create the traces.”
SREs need to ensure that confidential data, passwords, and other sensitive information are removed from telemetry before using a third-party solution. These applications should also be run in a test environment before being moved to production. Heinrich explains, “Telemetry data from specific business applications may not be permitted in SaaS/Cloud platforms. It is a stop in the collector before being reported out.”
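The idea of scrubbing telemetry before it leaves the enterprise can be sketched in plain Python. This is an illustration only, not the API of any particular collector or vendor. The key list, the digit-run pattern, and the sample record are assumptions chosen for the example, and a real deployment would apply its own data classification rules.

```python
import re

# Hypothetical attribute names treated as sensitive (compared case-insensitively).
SENSITIVE_KEYS = {"password", "passwd", "secret", "token",
                  "authorization", "api_key"}
# Hypothetical pattern: digit runs long enough to resemble account or card numbers.
DIGIT_RUN = re.compile(r"\b\d{12,19}\b")
REDACTED = "[REDACTED]"


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of a telemetry record with sensitive values masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = REDACTED                   # mask the whole value
        elif isinstance(value, dict):
            cleaned[key] = redact_attributes(value)   # recurse into nested attributes
        elif isinstance(value, str):
            cleaned[key] = DIGIT_RUN.sub(REDACTED, value)  # mask embedded numbers
        else:
            cleaned[key] = value
    return cleaned


record = {
    "service": "payments-api",
    "password": "hunter2",
    "message": "card 4111111111111111 declined",
    "http": {"Authorization": "Bearer abc123", "status": 402},
}
safe_record = redact_attributes(record)
print(safe_record)
```

The function returns a new dictionary instead of mutating its input, so the original record stays available for local debugging while only the scrubbed copy is exported.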
To implement observability strategies, SREs should start with the least complex business application to identify what might be blocking the agent’s access. Then, they need to create a system of record that outlines the roadmap of the application’s processes. “Agents will attach running programs to a user interface, web service, gateway, API services, messaging, and system of record to create logs, traces, and metrics without changing the applications themselves,” Heinrich explains. “The visualization will enable the identification of errors and be routed to application teams for fixes.”
On the operational side of an application, “DevOps can enable telemetry to provide SREs with a look at what pipelines could be automated to gain quick wins from,” she says. For instance, she points out that when Java applications are in garbage collection, “applications can be paused and stop taking transactions, but by deploying z/OS Connect, transactions can continue.” Above all, applications that have high numbers of dependencies or are critical to the enterprise should be prioritized for availability and reliability.
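One way to make that prioritization concrete is to rank an application inventory by criticality and dependency count. The sketch below assumes such an inventory exists. The `App` fields, the tie-breaking order, and the sample applications are hypothetical and not something the session specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    dependency_count: int     # number of dependencies the application has
    business_critical: bool   # flagged as critical to the enterprise


def prioritize(apps):
    """Order apps: critical first, then by most dependencies, then by name."""
    return sorted(
        apps,
        key=lambda app: (not app.business_critical, -app.dependency_count, app.name),
    )


# Hypothetical inventory.
inventory = [
    App("reporting", dependency_count=3, business_critical=False),
    App("core-banking", dependency_count=12, business_critical=True),
    App("customer-portal", dependency_count=20, business_critical=False),
    App("payments", dependency_count=8, business_critical=True),
]
ranked = [app.name for app in prioritize(inventory)]
print(ranked)
```

Placing criticality ahead of dependency count is a design choice for this example. A team could just as reasonably combine the two into a weighted score.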
Automation and Care Improve Reliability
Discerning the cause of outages related to security issues will require access or control capabilities to detect authentication and access control vulnerabilities, particularly with regard to brute force attacks. “However, if SREs and IT teams are able to create additional attributes to describe which teams own an application or the appropriate subject matter expert (SME), this additional information in telemetry can accelerate root-cause analysis and resolution,” Heinrich explains. “This eliminates the need for manual review of error logs and helps speed up the process of identifying errors and fixing them.”
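The ownership attributes Heinrich describes can be pictured as a lookup applied to each telemetry event before errors are routed. The following is a minimal sketch under stated assumptions: the registry, the `owner.team` and `owner.sme` attribute names, the fallback triage team, and the sample events are all hypothetical.

```python
from collections import defaultdict

# Hypothetical registry mapping services to owning team and subject matter expert.
OWNERSHIP = {
    "payments-api": {"owner.team": "payments", "owner.sme": "sme-payments@example.com"},
    "claims-batch": {"owner.team": "claims", "owner.sme": "sme-claims@example.com"},
}
# Fallback when a service has no registered owner.
DEFAULT_OWNER = {"owner.team": "sre-triage", "owner.sme": "unassigned"}


def enrich(event: dict) -> dict:
    """Return a copy of the event with ownership attributes attached."""
    owner = OWNERSHIP.get(event.get("service"), DEFAULT_OWNER)
    return {**event, **owner}


def route_errors(events):
    """Group error events by owning team so each team sees only its own."""
    queues = defaultdict(list)
    for event in events:
        enriched = enrich(event)
        if enriched.get("severity") == "error":
            queues[enriched["owner.team"]].append(enriched)
    return dict(queues)


events = [
    {"service": "payments-api", "severity": "error", "message": "timeout"},
    {"service": "claims-batch", "severity": "info", "message": "job started"},
    {"service": "legacy-ledger", "severity": "error", "message": "abend"},
]
queues = route_errors(events)
print(sorted(queues))
```

Sending unowned services to a default triage queue keeps errors from being dropped silently, and it also surfaces gaps in the ownership registry.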
SREs need an understanding of enterprise strategy and business goals in order to engage IT teams in application performance review, incident response, and the adoption of tools and techniques to reduce downtime and improve performance and customer satisfaction. With their knowledge of business needs and strategy and through their relationships with IT teams, SREs can improve application reliability on the mainframe, ensure security and governance are top of mind, and lead teams to greater efficiency and reliability.
Serena Agusto-Cox has more than 20 years of editorial experience and six years of experience writing about mainframe and information technology. She interviews and crafts forward-looking and engaging technical updates related to the mainframe ecosystem, highlights the experiences of thought-leaders in the community, and shares important updates to technical education and training.