During SHARE Orlando in March 2024, I had the privilege of speaking to a packed room of mainframers, all interested in exploring how the mainframe can adapt to the evolving demands of modern observability. The central question was clear: How can we integrate mainframe performance data into tools commonly used by site reliability engineers (SREs) to enable enterprise-wide observability?
Why is this such a pressing question? As the foundation for most business-critical enterprise services, the mainframe can’t be left out of an enterprise-wide observability strategy. For too long, information silos have hampered SREs and operations teams (ITOps), preventing them from working efficiently across both mainframe and distributed systems. This disconnect between the two worlds has been a significant barrier to comprehensive monitoring, and it’s time for that to change.
Why Enterprise Observability Matters
Evolving regulatory and compliance requirements and the increased use of machine learning have driven the need for more inclusive observability of hybrid IT stacks. However, as architectures become more complex, the challenge of understanding, monitoring, and maintaining them grows as well. This is where enterprise observability comes into play.
To gain a comprehensive view of the entire IT landscape, organizations must collect traces, metrics, and logs across all layers — distributed systems, cloud, and mainframe. By following the flow of business transactions with traces, monitoring key performance indicators through metrics, and capturing critical events with logs, SREs and ITOps teams can ensure business applications are functioning optimally. This holistic view — through integrated observability across the entire IT stack — enables enterprises to meet performance expectations and maintain their competitive edge.
Key Insights for Optimizing Observability
Thanks to the feedback attendees SHARE’d with us during the conference and throughout the last year, I have gathered a set of key action items and best practices, which I’d like to pass along. These insights reflect our ongoing efforts to refine and optimize enterprise-wide observability, with a special focus on integrating the mainframe into observability solutions.
- Tracing is just the beginning: Tracing remains one of the most powerful starting points for observability, as it enables end users to follow the flow of business transactions and service interactions across different environments. Service graphs are often derived from traces to visualize these relationships. However, for the daily tasks of an SRE, the attributes attached to traces provide the essential context.
- Contextual insights enable faster root cause analysis: To gain a deeper understanding of the observed systems — especially when something goes wrong — it’s crucial for the traces to carry a well-curated set of attributes. For a CICS transaction, attributes such as the transaction name, region name, connection type, and network details provide the crucial context for understanding the transaction’s behavior. These attributes are also a valuable source of information for machine learning tools that model system behavior over time. By detecting anomalies in system interactions, these tools empower SREs with the contextual insights necessary to quickly pinpoint and resolve issues.
- Metrics and logs complete the picture: While tracing provides insights into service interactions, integrating metrics and logs is essential for comprehensive observability. Using system events and application error logs along with key metrics — such as zIIP and CP time spent on z/OS and its subsystems, transaction rates, and web request durations from services like CICS and z/OS Connect — complements the insights gained from traces. With all three signals combined, teams can track performance over time, spot trends, and address potential issues before they evolve into major incidents.
- Teams can choose the right tool for the task with OpenTelemetry: The key to making all of this work seamlessly across such a diverse ecosystem is standardization. By adopting the OpenTelemetry standard, organizations ensure that all monitoring signals are consistent and interoperable within a hybrid IT stack. This enables the integration of a wide range of tools, from open-source solutions like Jaeger and Prometheus to enterprise-grade platforms such as DX APM, Datadog, Dynatrace, Elastic, Splunk, and Grafana. With OpenTelemetry, teams can confidently choose the most suitable tool for each task, ensuring both flexibility and efficiency.
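One common way to realize this tool independence is an OpenTelemetry Collector that receives OTLP data and forwards it to the backend of choice. The fragment below is an illustrative sketch of such a pipeline; the endpoint is a placeholder, and the exporter choice is an assumption, not a recommendation of any one backend:

```yaml
# Minimal OpenTelemetry Collector pipeline (illustrative sketch).
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: https://observability.example.com:4318  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```

Because every backend listed above can ingest OTLP, swapping tools means changing the exporter section, not re-instrumenting the mainframe or distributed applications.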
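To make the role of trace attributes concrete, here is a minimal, stdlib-only Python sketch of an attribute-enriched span. The `Span` class and the attribute keys (e.g. `cics.transaction.name`) are hypothetical illustrations of the idea, not the OpenTelemetry SDK or an official semantic convention:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Span:
    """Simplified stand-in for an attribute-enriched trace span."""
    name: str
    duration_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)

# A CICS transaction span carrying the contextual attributes described
# above (attribute keys and values are invented for illustration).
span = Span(
    name="cics.transaction",
    duration_ms=12.4,
    attributes={
        "cics.transaction.name": "PAYR",
        "cics.region.name": "CICSPRD1",
        "cics.connection.type": "IPIC",
        "network.peer.address": "10.0.0.15",
    },
)

def root_cause_context(span: Span) -> str:
    """Render the attributes an SRE would pivot on during triage."""
    return ", ".join(f"{k}={v}" for k, v in sorted(span.attributes.items()))

print(root_cause_context(span))
```

In a real deployment these attributes would be set on OpenTelemetry spans by the instrumentation, so any OTLP-compatible backend can filter and group transactions by region, connection type, or network peer during root cause analysis.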
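As a sketch of how the signals complement each other, the following Python snippet derives simple metrics (transaction rates, average zIIP and CP time, request-duration spread) from per-transaction samples. The sample data and its field layout are invented for illustration:

```python
import statistics
from collections import Counter

# Hypothetical per-transaction measurements:
# (transaction name, zIIP time ms, CP time ms, web request duration ms)
samples = [
    ("PAYR", 3.1, 1.2, 14.0),
    ("PAYR", 2.9, 1.4, 12.5),
    ("INQY", 0.8, 0.5, 6.2),
    ("PAYR", 3.3, 1.1, 55.0),  # outlier worth a closer look
]

# Metric 1: transaction count by name for this observation window.
rates = Counter(name for name, *_ in samples)

# Metric 2: mean CPU split between zIIP and general-purpose (CP) engines.
ziip_mean = statistics.mean(z for _, z, _, _ in samples)
cp_mean = statistics.mean(c for _, _, c, _ in samples)

# Metric 3: request-duration spread; a max far above the median hints
# at an outlier that the corresponding trace and logs can explain.
durations = sorted(d for *_, d in samples)
median_ms, max_ms = statistics.median(durations), durations[-1]

print(rates, round(ziip_mean, 2), round(cp_mean, 2), median_ms, max_ms)
```

The point is the workflow, not the arithmetic: metrics reveal the trend and the outlier, the matching trace shows where the time went, and logs supply the error context.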
Overall, the more integrated the observability across the entire IT stack, the better equipped SREs and ITOps are to perform their roles efficiently. Additionally, a notable benefit that consistently emerged in discussions I’ve had with observability teams is that this holistic approach — besides its technical advantages — also enhances cross-team collaboration and communication between mainframe and distributed teams.
Get Involved: What’s Next in Mainframe Telemetry?
Kai Uwe Kirsch is a software engineer at Broadcom.