On the evening of February 3, 2023, as a freight train approached East Palestine, Ohio, a trackside sensor alerted the crew that one of the train's wheel bearings had reached a critical temperature. The engineer immediately attempted to stop the train, but the corrective action came too late. Dozens of cars, some carrying hazardous materials, derailed and caught fire.
Later analysis by the National Transportation Safety Board (NTSB) revealed that a combination of sparse sensor coverage and the policies for interpreting that sensor data was a likely contributor to the derailment. The trackside sensor that alerted the crew reported a critical temperature, but too late. Another trackside sensor, 19 miles earlier, had also detected an elevated wheel bearing temperature, but that reading did not exceed the train operator's alarm thresholds for notifying the crew.
Hindsight is 20/20, and it is easy to recognize trends in signals like these after the fact. It is tempting to conclude that this derailment was preventable and that misinterpretation of the hot bearing detector data was primarily to blame. In real time, however, discerning when a signal or metric is problematic is often quite difficult.
Stewarding Complex Systems as Mainframers
As people who need to be aware of problems, and who are often responsible for crafting and maintaining the systems that detect and report those problems, I think we can all understand the positions of both the train engineer and the railroad operators in this case.
Most of us are responsible for complex systems and applications (perhaps not freight trains) that cause very real consequences when they fail: credit card processing, utility operations, network integrity, loan approvals, insurance quotes, train wheel bearing temperatures, and so on. Stewards of these systems, just like railroad engineers, need to be able to determine whether they are healthy and, if not, why not. Assessing the health of these systems often involves processing and analyzing log, metric, and trace data.
In IT operations, we have been analyzing this data for many years now in the form of alerts and dashboards. Mainframe, distributed, and hybrid cloud “spaces” each have tools and methodologies that facilitate deep visibility into systems and applications. In recent years, there has been an explosion of tools in the distributed and cloud spaces. However, as applications cross these “space” boundaries, it becomes increasingly difficult to maintain end-to-end visibility without requiring people to use many different tools and without drawing upon diverse sets of skills. In short, we are suffering from fractured tools, increasing application complexity, and a lack of end-to-end visibility. That is particularly true when hybrid applications include the mainframe (more on the mainframe in a bit).
OpenTelemetry as One Solution
OpenTelemetry is an open-source project of the Cloud Native Computing Foundation (CNCF) that seeks to wrap specifications, conventions, and methodology around processing log, metric, and trace data (aka telemetry signals) to better facilitate visibility into computing system health. The standards-based processing allows participating vendors to plug into these streams of telemetry signals and provide the visibility we need to keep the systems on which critical applications run healthy. The high-level processing model for OpenTelemetry looks like this:
[Diagram: OpenTelemetry high-level processing model. Source: https://opentelemetry.io/docs/specs/otel/logs/]
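To make the signals themselves concrete, here is a minimal, standard-library-only sketch of a structured log record shaped loosely like the OpenTelemetry log data model. The field names follow the spec's vocabulary, but the emitter itself is my simplification for illustration; real producers would use an OpenTelemetry SDK, and the wheel-bearing attributes are invented for the example.

```python
import json
import time

def otel_style_log_record(body, severity_text="INFO", attributes=None):
    """Build a dict loosely shaped like the OpenTelemetry log data model.

    A real emitter would use an OpenTelemetry SDK; this sketch only shows
    the kind of structured record that flows through the pipeline.
    """
    return {
        "timeUnixNano": time.time_ns(),   # when the event occurred
        "severityText": severity_text,    # e.g. INFO, WARN, ERROR
        "body": body,                     # the log message itself
        "attributes": attributes or {},   # structured key/value context
    }

# Hypothetical record a trackside-style sensor pipeline might emit.
record = otel_style_log_record(
    "wheel bearing temperature above threshold",
    severity_text="WARN",
    attributes={"sensor.milepost": 49.8, "temp.fahrenheit": 253},
)
print(json.dumps(record, indent=2))
```

Because the record is structured rather than free text, back-ends can filter and alert on individual attributes instead of parsing message strings.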
The technical details of OpenTelemetry can be found across the internet, but at a high level, OpenTelemetry provides customers increased choice, increased agility, reduced vendor lock-in, shorter time to value, and better visibility.
Visibility Into the Mainframe
That sounds great! But, what about the mainframe?
The mainframe is an extremely visible platform; I would argue it offers some of the deepest visibility of any system on the planet. We have logs (SYSLOG, OPSLOG, log streams, LOGREC, and more), hundreds of different SMF record types communicating events and metrics, dumps, GTF traces, component traces, SYSTRACE, and still more. The mainframe also has many commercially available products that add to that already rich set of telemetry. The mainframe is a great place for a nerd like me. But for application owners outside the mainframe ecosystem, it can be a daunting place when their applications rely on components running on the mainframe.
It’s Time for the Mainframe to Embrace OpenTelemetry
The OpenTelemetry project has made a lot of progress in a few short years, with nearly 1000 companies contributing to the project so far. Although it is still in the early innings, this is an opportune time for the mainframe ecosystem to embrace OpenTelemetry.
For example, at Broadcom's Mainframe Software Division, we have begun working toward that goal. First, we are participating in several of the OpenTelemetry specification sub-workgroups to engage with the community and bring our mainframe point of view to the table. Other mainframe vendors are likely participating as well.
In addition, Broadcom has been experimenting with OpenTelemetry integrations in-house, and the results so far are very promising. We have demonstrated some of these integrations at recent conferences. At the SHARE conference in March 2023, we demonstrated the ability of our SYSVIEW Option for CICS to recognize and propagate OpenTelemetry-instrumented trace signals for CICS transactions. We also demonstrated the ability to export metrics from SYSVIEW, Vantage, NetMaster, and SYSVIEW for DB2. Finally, we demonstrated exporting the resulting trace spans and metrics to multiple back-ends via OpenTelemetry. We expect that other mainframe vendors are performing similar demonstrations of OpenTelemetry integration with their own products.
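What "propagating trace signals" means at the wire level comes down to the W3C Trace Context `traceparent` header, which OpenTelemetry uses to carry a trace across process boundaries. The stdlib-only sketch below (independent of any Broadcom product; the helper names are mine, not from any SDK) formats and parses that header so the pieces of the format are visible:

```python
import os
import re

def make_traceparent(trace_id=None, parent_span_id=None, sampled=True):
    """Format a W3C Trace Context 'traceparent' header:
    version(00) - trace-id(16 bytes hex) - parent-id(8 bytes hex) - flags.
    """
    trace_id = trace_id or os.urandom(16).hex()          # 32 hex chars
    parent_span_id = parent_span_id or os.urandom(8).hex()  # 16 hex chars
    flags = "01" if sampled else "00"                    # sampled bit
    return f"00-{trace_id}-{parent_span_id}-{flags}"

def parse_traceparent(header):
    """Split a traceparent header back into its fields. Minimal for
    illustration: does not reject the all-zero IDs the spec forbids."""
    m = re.fullmatch(
        r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header
    )
    if not m:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_span_id, flags = m.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": flags == "01",
    }

print(make_traceparent())
```

A middleware component that "recognizes and propagates" a trace simply reads this header from an inbound request, creates a child span, and writes a new header (same trace-id, its own span-id) on any outbound call.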
Embracing this standard will help bridge the mainframe to the rest of the IT ecosystem by connecting its telemetry to the more than 30 vendors that support OpenTelemetry.
Again, just like in the distributed and hybrid cloud spaces, this translates into increased choice, increased agility, reduced vendor lock-in, shorter time to value, and better visibility for mainframe customers and applications.
An additional benefit for the mainframe is that, by embracing OpenTelemetry, we can send telemetry data to multiple back-ends simultaneously without increasing CPU consumption or network bandwidth on the mainframe: the data leaves the platform once and is fanned out downstream. This is a key benefit, since the choice of back-ends (Splunk, Datadog, and so on) is often dictated by the application or line of business, and many large customers run more than a few concurrently.
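The usual way to get that fan-out is an off-platform OpenTelemetry Collector: the mainframe sends one OTLP stream, and the Collector's pipelines duplicate it to each back-end. The configuration below is a hedged sketch of that shape; the endpoint URLs and the `backend_a`/`backend_b` names are illustrative assumptions, not a tested deployment.

```yaml
receivers:
  otlp:                      # single OTLP stream from the mainframe
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                  # batch before export to reduce overhead

exporters:
  otlphttp/backend_a:        # hypothetical first back-end
    endpoint: https://backend-a.example.com/otlp
  otlphttp/backend_b:        # hypothetical second back-end
    endpoint: https://backend-b.example.com/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/backend_a, otlphttp/backend_b]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/backend_a, otlphttp/backend_b]
```

Swapping, adding, or removing a back-end is then a Collector configuration change, with no change to what the mainframe itself emits.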
Greg Shriver is a Software Architect, AIOps + Automation, for Broadcom.