By Joseph Gentile
To quote management guru Peter Drucker, “You can't manage what you don't measure.” While this saying may have originated from the desire to manage revenue or business efficiency, it also applies to your shop’s transaction processing systems. For z/OS, the role of system measurement is filled by SMF, which actually stands for System Management Facilities (Google it!). While SMF is a bit arcane, it is an incredibly powerful tool for measuring the performance of just about anything running on z/OS.
Who would have thought it would be a good idea for an operating system to have a monolithic log and recording infrastructure for entries representing a broad continuum of different things, from system IPLs to Java Virtual Machines? But because of this radical thinking in the early days of System/360, SMF records are automatically generated at regular intervals and by specific events, and they have the capability of recording every job, command, logon, transaction, network connection, and disk I/O, as well as overall system performance. SMF has a global interval which defines the cadence of interval records. The SMF interval can be as little as one minute but is typically set to 15 minutes. Interval records enable compute resources and workloads to be measured.
Since interval records are synchronized, data from different record types can be compared on the same timescale, which enables Performance Engineers to measure every layer of the HW/SW stack. If there is a performance problem, they can construct a timeline to drive problem determination. Data from relevant event-based records can be added to the timeline, adding context. And for as long as the data is kept, you can turn back the clock and see what was going on in the system at any point in the past. The challenges then become: How do you know what you’re looking for? And how do you reduce the data quickly enough to make using the data practical?
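To make the synchronized-interval idea concrete, here is a minimal Python sketch of building such a timeline. It assumes the SMF records have already been decoded into simple `(timestamp, source, metric, value)` tuples. That tuple layout, the source labels, and the function names are hypothetical and chosen only for illustration; they are not an SMF record format or an IBM API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)  # assumed SMF interval; set to match your system

def interval_start(ts, interval=INTERVAL):
    """Round a timestamp down to the start of the interval that contains it."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    return midnight + (elapsed // interval) * interval

def build_timeline(records, interval=INTERVAL):
    """Group decoded records from different sources onto one shared timescale."""
    timeline = defaultdict(dict)
    for ts, source, metric, value in records:
        timeline[interval_start(ts, interval)][f"{source}.{metric}"] = value
    return dict(sorted(timeline.items()))

# Hypothetical, already-decoded samples from two different record sources.
samples = [
    (datetime(2024, 1, 1, 10, 15, 0), "cpu_activity", "busy_pct", 71.0),
    (datetime(2024, 1, 1, 10, 15, 2), "db2_stats", "getpages", 120000),
    (datetime(2024, 1, 1, 10, 30, 0), "cpu_activity", "busy_pct", 93.5),
    (datetime(2024, 1, 1, 10, 30, 1), "db2_stats", "getpages", 410000),
]

for start, metrics in build_timeline(samples).items():
    print(start.isoformat(), metrics)
```

Because every source is bucketed by the same interval boundary, each row of the timeline shows what the different layers were doing during the same window, and event-based records could be attached to the row whose interval contains their timestamp.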
In the fifty-plus years since SMF was introduced, the volume of data has exploded as technology has changed. LPARs and Sysplex increased the number of systems, and software changes expanded the number of record types. While there is no substitute for knowledge and experience, machine learning approaches like anomaly detection can help narrow your search, so you may not need to be as much of a Db2 expert to know which detailed metrics appear to be significant to a change in performance. But this does not address the data reduction problem.
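As a rough illustration of the anomaly-detection idea mentioned above, the sketch below flags interval values that sit far outside a trailing baseline. It uses a simple z-score, which is a stand-in for whatever technique a real product might apply, and the metric values, window size, and threshold are all made-up assumptions.

```python
from statistics import mean, stdev

def flag_anomalies(values, window=8, threshold=3.0):
    """Return indexes of values that deviate from the trailing window
    by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: a z-score cannot be computed, so skip
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical CPU-busy percentages, one value per SMF interval.
cpu_busy = [50, 52, 49, 51, 50, 53, 48, 51, 50, 95, 52]
print(flag_anomalies(cpu_busy))  # prints [9]
```

A flagged index points to an interval worth a closer look. It does not explain the cause, which still depends on the detailed records for that window.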
Automation or scripting can streamline the process of reducing the data, but SMF data is typically dumped from the system only once per day, so analysis cannot begin until the next day at the earliest. There are exceptions: real-time monitors, for example, and some applications that include SMF exit routines to get access to SMF records as they are generated. Recently, IBM introduced the SMF Real-Time Interface, which allows programs to retrieve records of specific SMF record types as they are generated, without the need to code an SMF exit routine. The real-time interface fits naturally with the concept of streaming SMF data directly to a data reduction application server, either on or off z/OS.
While the approaches to SMF data analysis have changed since its inception, SMF has provided a strong foundation for making z/OS ideal for performance analysis. Not only does it provide a uniform recording platform for all types of data, but it has been widely adopted by z/OS components, middleware products, and ISVs, adding transparency to each layer of the mainframe HW/SW stack. As its name implies, System Management Facilities helps you manage your z/OS environment by allowing you to measure it.