In 2007, Gary King wrote a short paper to answer the questions “Can any single workload be run at 100% CPU busy?” and “Is there a growth in CPU time per transaction at higher utilization?”. Unfortunately, the answers are “No” and “Yes”.
In this article, I will walk through some of the more recent techniques and technologies that enable the IBM Z platform to perform very well at high utilization, despite these challenging realities.
Operating at high CPU utilization is just one of the requirements of a highly efficient system. It must also be the case that the lion’s share of the utilization is accomplishing useful work, rather than being consumed by overhead. Further, the available capacity must be devoted first to the most important work, so that it achieves its goals, rather than being doled out helter-skelter, perhaps leaving the important work with insufficient capacity for the current demand.
Since a homogeneous workload cannot be efficiently run at 100% CPU busy, we can now change the question to: “How can a platform efficiently run multiple diverse workloads?” The primary technology that is used to run multiple workloads on a single processor is virtualization. IBM Z supports powerful, multi-tier virtualization with PR/SM™ in the hardware and z/VM® as a software product.
When running a single workload, based on simple queuing theory principles, there will be times when the processor is idle and other times when even 100% is not enough. Static partitioning of the CPU resource, as was done on some other platforms, is not an optimal solution. It simply leads to having N partitions that suffer from the same problem — too much or too little CPU capacity for the immediate workload demand.
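To make that queueing argument concrete, here is a minimal numeric sketch (mine, not from King’s paper) using the textbook M/M/1 model, in which mean response time equals the service time divided by (1 − utilization):

```c
#include <stdio.h>

int main(void)
{
    const double utils[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };
    const int n = sizeof utils / sizeof utils[0];

    for (int i = 0; i < n; i++) {
        double rho = utils[i];
        /* M/M/1: mean response time = service time / (1 - utilization) */
        double stretch = 1.0 / (1.0 - rho);
        printf("at %2.0f%% busy, response time is %5.1fx the service time\n",
               rho * 100.0, stretch);
    }
    return 0;
}
```

Running it shows response time stretching to 2x the service time at 50% busy, 10x at 90%, and 100x at 99%, which is why a single workload necessarily leaves idle time long before it can reach 100% busy.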
The PR/SM hypervisor, from the beginning, has made the unconsumed part of any partition’s share of CPU capacity available to other partitions that have sufficient work to benefit from it. This redistribution of unused share obviously goes a long way toward efficiently running multiple workloads by leveling off the peaks and valleys of the individual partitions. However, it creates a new, more subtle problem.
In order to use the extra capacity that another partition couldn’t use, a partition needs to have one or more extra logical processors beyond what it needed to consume its allotted share. If PR/SM is distributing the share equally across the logical processors (called horizontal CPU management), there is an increase in PR/SM dispatching overhead and a reduction in cache efficiency.
In a multitasking environment, there is the overhead of switching the use of a processor from one execution thread to another. This can happen frequently due to time-slicing or because the running thread is waiting for an event, like the completion of an I/O operation. The entire status of the running thread must be saved and that of the thread about to run must be restored. With more logical processors, there are more threads to dispatch and “undispatch”. For the PR/SM hypervisor, this means saving and restoring the entire architected status of a logical CPU. Also, with more logical processors, the probability of being redispatched on a processor that retains some of the thread’s cache footprint is reduced, leading to greater cache delay. PR/SM, along with z/OS®, cooperatively addresses this problem with a technology called “HiperDispatch”.
Here is a sketch of how HiperDispatch works to address the overheads caused by defining more logical processors than are needed to consume the CPU share of a partition. First, HiperDispatch calculates how many processors are needed to consume the defined share. This might be something like 4.5 logical processors. PR/SM and z/OS agree on four processors that will be considered “vertical high” processors. PR/SM treats vertical high processors as quasi-dedicated to the partition. That is, as long as z/OS keeps them busy and does not enter a wait state on them, they remain dedicated to the partition. This reduces the PR/SM dispatching overhead for these processors to nearly zero. It also provides a stable topology for z/OS to implement thread-level affinities so that threads can more frequently be redispatched on a logical processor that retains some of their cached data.
In our example, another logical processor is allotted the remaining half CPU of share and is designated as a “vertical medium” processor. This processor can absorb up to a half CPU’s worth of extra share that was unused by the other partitions. If an installation wants to prepare for even more workload variability, it can define even more logical processors to enable the use of even more extra share. In our example, if the partition has seven logical processors defined, there will be four vertical high processors, one vertical medium processor, and the remaining two processors will have no guaranteed CPU share (and are called “vertical low” processors). Unless additional capacity becomes available, vertical low processors remain in a “parked” state and have no work dispatched on them. There is no overhead caused by parked vertical low processors.
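A simplified sketch of that share-to-polarization mapping, assuming the 4.5-CPU share and seven defined logical processors from the example (the real PR/SM and z/OS rules cover more cases than this):

```c
#include <stdio.h>

int main(void)
{
    double share = 4.5;   /* guaranteed share of the partition, in CPUs */
    int defined = 7;      /* logical processors defined for the partition */

    int high   = (int)share;                /* whole CPUs of share: vertical high */
    int medium = (share > high) ? 1 : 0;    /* fractional remainder: one vertical medium */
    int low    = defined - high - medium;   /* no guaranteed share: vertical low */

    printf("vertical high:   %d (quasi-dedicated)\n", high);
    printf("vertical medium: %d (carries the 0.5 CPU remainder)\n", medium);
    printf("vertical low:    %d (parked until white space appears)\n", low);
    return 0;
}
```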
Periodically, z/OS asks PR/SM whether there is any capacity that is not being used by the partitions to which it is allocated. This is called “white space”. If there is more white space than can be consumed by the vertical medium processor, z/OS may bring in one of the vertical low processors and start dispatching work to it. This is called “unparking”. Later, if insufficient white space is available, z/OS will stop dispatching to the unparked vertical low processor and park it again.
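The following hedged sketch illustrates that periodic park/unpark decision. The white-space query and the park/unpark operations are stubs standing in for the real PR/SM interfaces, and the actual z/OS algorithm weighs many more inputs than shown here:

```c
#include <stdio.h>

static double white_space = 1.8;   /* stub: unused physical capacity, in CPUs */

static double query_white_space(void) { return white_space; }
static void unpark(int lp) { printf("unparking vertical low %d\n", lp); }
static void park(int lp)   { printf("parking vertical low %d\n", lp); }

/* Run periodically: use vertical lows only when white space remains
 * after the vertical medium's appetite is satisfied. */
static void adjust_vertical_lows(int parked[], int n_low, double medium_need)
{
    double spare = query_white_space() - medium_need;

    for (int lp = 0; lp < n_low; lp++) {
        if (parked[lp] && spare >= 1.0) {
            unpark(lp);           /* a whole CPU of white space is available */
            parked[lp] = 0;
            spare -= 1.0;
        } else if (!parked[lp] && spare < 1.0) {
            park(lp);             /* white space dried up; stop using it */
            parked[lp] = 1;
        } else if (!parked[lp]) {
            spare -= 1.0;         /* already consuming a CPU of white space */
        }
    }
}

int main(void)
{
    int parked[2] = { 1, 1 };               /* two vertical lows, both parked */
    adjust_vertical_lows(parked, 2, 0.5);   /* medium can absorb 0.5 extra CPU */
    return 0;
}
```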
This is all very good so far. We are achieving high utilization without introducing additional dispatching and cache damage overhead. But, we have introduced a new, yet more subtle, problem. Now, z/OS has no control over the availability of white space. The white space that it is consuming with a vertical low processor can disappear at any moment. Without the white space to power it, a vertical low processor is not dispatched by PR/SM. When it stops being dispatched by PR/SM, whatever work was running on that vertical low processor stops running. It is stranded and, theoretically, there is no limit to how long it may remain stranded. This can be very bad if it happens to an important piece of work. Fortunately, this too is addressed by PR/SM and z/OS with an architectural enhancement called “Warning Track”.
Warning Track is quite simple. Like the warning track in the outfield of a baseball field, it warns you that you are about to hit a wall. The wall, in our case, is that a logical processor is about to be undispatched from the physical processor by PR/SM. This could be bad if, as above, it’s a vertical low processor that may not get dispatched again by PR/SM for an undetermined amount of time. However, to free the running work from the logical processor being undispatched, PR/SM signals z/OS with an external interrupt, as a warning. In response, z/OS saves the status of the current unit of work and undispatches it so that it can now be redispatched on some other processor. This keeps the work from being stranded. Another advantage of reducing the probability of stranding work on stalled, vertical low processors is that an operating system can be more aggressive when considering whether to unpark a vertical low processor to consume some of the available white space. For more discussion on Warning Track, check out this IBM Destination z article.
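Conceptually, the z/OS side of Warning Track behaves like the handler sketched below. The function names are invented for illustration; the real mechanism is an external interrupt presented by PR/SM and a z/OS interrupt handler that must respond within a short grace period:

```c
#include <stdio.h>

static void save_work_unit_status(void) { puts("saving registers/PSW of current work"); }
static void requeue_work_unit(void)     { puts("requeueing work for another logical CPU"); }
static void acknowledge_to_prsm(void)   { puts("telling PR/SM we yielded in time"); }

/* Invoked when PR/SM warns that this logical processor is about to
 * lose its physical processor, possibly for a long time if it is a
 * vertical low.  Get the current work off of it, fast. */
static void warning_track_handler(void)
{
    save_work_unit_status();   /* capture the unit of work's full status */
    requeue_work_unit();       /* so a running logical CPU can redispatch it */
    acknowledge_to_prsm();     /* respond within the warning grace period */
}

int main(void) { warning_track_handler(); return 0; }
```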
Even with all of this, it is possible for the processor to run at less than 100% CPU busy. Nonetheless, in combination, PR/SM partitioning, HiperDispatch, and Warning Track enable IBM Z to excel at achieving high utilization — the first efficiency metric listed above.
The second metric listed was using the CPU capacity to do useful work rather than wasting it on unproductive overhead activities.
As systems increase in capacity based on employing more CPUs, each of which is faster than the preceding generation, two phenomena put a drag on the amount of productive work that can be accomplished. The first is called “the large system effect”. As systems increase in size, they process more in-flight units of work. This means z/OS must traverse larger queues and manage larger data structures. When the work capacity of the system doubles, there are twice as many data items to visit twice as often, so the overhead increases by a factor of four while the compute power only doubles. This negative effect must be attacked with improved algorithms. For z/OS, these include “true ready queue”, “reduced pre-emption”, and restructuring the work element queues. Although they go a long way in enabling z/OS to maintain high efficiency as systems get larger and larger, none of these inventions is particularly new, so we will say no more about them here.
The other drag on efficiency is called “the MP (multiprocessor) effect”, which grows as the number of CPUs managed by a z/OS image increases and is caused by increased interprocessor communication. A major source of this overhead is serialization. For example, if a control block structure is to be updated, it may be necessary to prevent any other processor from viewing the structure while it is in a transient state. If the process needing to access the structure cannot be suspended, then serialization is achieved by the use of a “spin lock”. It is called a spin lock because a process that needs to acquire it just executes a tight loop until the lock can be obtained. The time spent spinning is 100% unproductive and should be minimized.
One approach under PR/SM is to yield the remainder of the logical processor’s timeslice so that PR/SM can dispatch some other logical processor. Since the trip through PR/SM itself is an expensive overhead, z/OS used to spin for a while before yielding. If the lock becomes available soon enough, this approach is a winner. However, if yielding becomes necessary after spinning, then it’s a loser — and losing is the likely case if the CPU holding the lock is not currently dispatched by PR/SM.
To reduce this spin overhead, z/OS now uses a z/Architecture extension to the Signal Processor (SIGP) instruction called “SIGP Sense Running Status” to find out if the lock holder is currently running. If it is not, z/OS immediately yields the remainder of the timeslice; otherwise, it continues spinning until either the lock is freed or the holder is no longer running. On today’s systems with tens of logical processors, this significantly reduces the wasted capacity spent spinning for locks.
Another improvement in this area is a capability that can be called “directed yield”. Rather than just yielding the timeslice and letting PR/SM pick some logical processor to dispatch, z/OS can now select a preferred logical processor that it wants PR/SM to dispatch. Of course, z/OS selects the logical processor that currently holds the lock so it can finish what it was doing and release the lock.
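Putting the last few paragraphs together, the acquisition strategy looks roughly like the sketch below. The sense_running() and directed_yield() functions are illustrative stand-ins for the SIGP Sense Running Status order and the PR/SM yield interface; the lock itself is an ordinary compare-and-swap spin lock:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_FREE (-1)

typedef struct { atomic_int holder; } spinlock_t;   /* holds logical CPU id of owner */

extern bool sense_running(int cpu);   /* stand-in for SIGP Sense Running Status */
extern void directed_yield(int cpu);  /* stand-in: give our timeslice to that CPU */

void acquire(spinlock_t *lk, int my_cpu)
{
    for (;;) {
        int expected = LOCK_FREE;
        if (atomic_compare_exchange_strong(&lk->holder, &expected, my_cpu))
            return;                    /* lock obtained */

        int holder = expected;         /* a failed CAS leaves the owner's id here */
        if (!sense_running(holder))
            directed_yield(holder);    /* owner is not dispatched: spinning is
                                          futile, so hand it our timeslice */
        /* otherwise keep spinning: a running owner should release soon */
    }
}

void release(spinlock_t *lk)
{
    atomic_store(&lk->holder, LOCK_FREE);
}
```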
A better solution than efficient lock management would be to eliminate the need for a lock altogether. This can be done effectively, although not completely, with an architectural enhancement called “transactional execution” (also called “transactional memory”), which was introduced with the zEC12. Without getting into too much detail, what transactional execution allows z/OS to do is to update control structures in such a way that serialization can be done without holding a lock, so long as no other CPU is concurrently reading/writing the exact same part of the structure. The use of transactional execution eliminates false contention — that is, serializing in cases when no serialization is actually required. There is overhead when serializing with a lock, even if there is no contention and therefore no spinning. Whenever the lock is obtained, the lockword in storage is modified. Modifying the lockword causes it to be expelled from the caches of the other CPUs so that they will experience a cache-miss delay when next they access the lockword. Using transactional execution does not modify the lockword, so it is beneficial even when there is no contention.
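For illustration, here is what lock-free serialization with transactional execution can look like in C, using GCC’s s390 hardware-transactional-memory built-ins (compiled with -march=zEC12 -mhtm). The counter update and the retry limit are illustrative, and the lock-based fallback routine is assumed to exist elsewhere:

```c
#include <htmintrin.h>   /* GCC s390: _HTM_TBEGIN_STARTED */

extern void locked_update(long *counter);   /* conventional lock-based path, not shown */

void update_counter(long *counter)
{
    for (int tries = 0; tries < 10; tries++) {
        if (__builtin_tbegin(0) == _HTM_TBEGIN_STARTED) {
            (*counter)++;        /* executes atomically; no lockword is written,
                                    so other CPUs' cache lines stay undisturbed */
            __builtin_tend();
            return;
        }
        /* aborted: another CPU touched the same storage, or some other
           disturbance occurred; retry a few times */
    }
    locked_update(counter);      /* persistent contention: fall back to a real lock */
}
```

Note that a real exploiter must always provide a non-transactional fallback path, since the architecture does not guarantee that a (nonconstrained) transaction will ever complete.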
Thus far, we see that IBM Z has sophisticated virtualization and other features that enable not only high levels of CPU capacity utilization but also more efficient use of that capacity. Both enable more useful work to be done per unit of time for any given hardware configuration. But, we have not yet addressed the third leg of efficiency as we have defined it. That’s up to the operating system, which in our case is z/OS.
Decades back, IBM started on the road to managing system capacity based on workload characteristics and business importance. This was manifested in the System Resource Manager (SRM). SRM was effective, but it required a human performance analyst to set about 60 parameters that would then guide system behavior. Those settings were not dynamically adjusted as conditions changed, and it was not unusual for an installation to be using specifications that were years old. These deficiencies were addressed with Workload Manager (WLM).
WLM relieves an installation of having to set the individual performance parameters. Instead, the installation can attribute business importance and performance goals to the diverse workloads running on a system. WLM dynamically adjusts internal performance parameters to ensure that the most important workloads continue to achieve the stated goals and distributes the remaining capacity to other workloads based on their relative importance and goals. While ensuring that the most important workloads achieve their goals, WLM does not waste capacity by allowing those workloads to overachieve their goal while starving less important workloads. Internal performance parameters are adjusted in real time, based on statistical observations, so that resource utilization is constantly optimized according to installation policy.
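At the heart of this adjustment loop is the notion of a performance index, the ratio of achieved results to the stated goal. The performance index is real WLM terminology; the code below is only an illustrative sketch with invented numbers:

```c
#include <stdio.h>

/* For a response-time goal: PI = achieved response time / goal.
 * PI below 1.0 means overachieving; PI above 1.0 means missing the goal. */
static double performance_index(double achieved_ms, double goal_ms)
{
    return achieved_ms / goal_ms;
}

int main(void)
{
    double pi = performance_index(450.0, 500.0);   /* e.g., online transactions */

    if (pi > 1.0)
        printf("PI %.2f: missing goal; candidate to receive resources\n", pi);
    else
        printf("PI %.2f: meeting goal; candidate to donate spare resources\n", pi);
    return 0;
}
```

In WLM’s terms, workloads missing their goals become receivers and overachieving workloads become donors, which is how capacity flows to where the policy says it matters most.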
While the mainframe hardware, along with built-in virtualization, enables high utilization, it is WLM that enables high utilization without impacting performance goals. It is WLM that allows diverse workloads—with different demands on the system and different performance goals—to coexist in a system, with each achieving its respective goals up to the point where utilization approaches 100%. More importantly, when the system is running full tilt and that is still insufficient to satisfy all the current demand, WLM makes sure that it is the workloads with the highest business importance that get serviced. The less important workloads may experience some delay until the demand reduces to be within the system’s capacity.
However, WLM will not completely starve the low-importance workloads. It is necessary that all work maintain some forward progress because even low-importance work can hold important resources. For example, a batch job with no specific performance goal (called “discretionary” in WLM terminology) can hold an important DB2 latch. If that job completely stalls while holding that latch, high-importance transactional work may need that latch and back up behind the stalled job that holds it. Depending upon the importance of the lock or latch involved, the negative impact can be dramatic. WLM prevents this problem with a process called “trickle”. An installation can specify a percentage of the capacity that z/OS can give “out of priority order” to lower-importance work to prevent these damaging stalls.
There are certainly other IBM mainframe technologies that further contribute to its ability to operate very efficiently at very high utilization. One underlying technology that plays a part in many of the items discussed above is the use of millicode in the hardware platform. The name millicode is given to the high-level microcode used to implement or augment many of the IBM Z architectural facilities. Unlike traditional microcode, millicode has no communication latency between the main CPU and a microcode engine because millicode runs on the same CPU as application code. In addition to implementing the published z/Architecture, the IBM Z CPU provides a millicode mode, which implements all the z/Architecture hardware instructions plus additional instructions available only in millicode mode. There is also another set of registers so that there is no need to save and restore the application program or operating system registers when entering and exiting a millicode routine.
As an example that illustrates the efficacy of this millicode layer, we can look at the SIGP Sense Running Status facility mentioned above. The SIGP instruction is used to perform a number of functions in a multiprocessor environment. It is implemented in millicode. In order to answer the important question of “Is that processor running?”, a new function was added to the SIGP repertoire. When z/OS issues a SIGP Sense Running Status, a zero-latency transition is made into millicode mode. Within a few dozen instructions, a millicode routine examines PR/SM control structures and returns the answer to z/OS. A query directed at PR/SM itself would have had a pathlength of several thousand instructions. This is just one example of how functions can be efficiently implemented in millicode.
There’s another relatively new technology that doesn’t just enable the mainframe to run at high utilization, but can be said to enable it to run at greater than 100%. That technology is Simultaneous Multithreading (SMT). Currently, SMT is supported on IBM Z only for specialty engines, but as more and more work — like Java, XML, DRDA, and database utilities — has become specialty-engine eligible, such work now accounts for a growing percentage of many important workloads.
The way that SMT works is to enable a single processor core to appear to be multiple instruction processors, which can simultaneously process multiple instruction threads. For the IBM Z processor, the number of threads is currently two. An SMT core can thus execute more instructions at a time and increase the total instructions processed per unit of time.
An SMT core has additional instruction execution pipelines, but not all of the processor’s resources are duplicated, so the throughput is not doubled. For z/OS on the mainframe, the increase is around 15% to 25%, depending on workload. Still, that is a very substantial increase in work capability and, coupled with all the other innovations, z/OS on IBM Z can effectively use that added instruction processing capability to accomplish more real work.
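As a back-of-the-envelope illustration, assuming a 20% gain (inside the 15% to 25% range quoted above):

```c
#include <stdio.h>

int main(void)
{
    int cores = 8;            /* specialty-engine cores with SMT2 enabled */
    double smt_gain = 0.20;   /* assumed gain within the quoted 15-25% range */

    double capacity = cores * (1.0 + smt_gain);
    printf("%d SMT2 cores deliver roughly the work of %.1f single-thread cores\n",
           cores, capacity);
    return 0;
}
```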
We could go on to discuss these other innovations:
- Chip frequencies of greater than 5 gigahertz
- Independent I/O channels that operate in parallel and free the CPUs to do more work
- High Performance FICON (zHPF) to greatly speed up disk I/O
- Parallel Sysplex, offering nearly unlimited capacity with nearly 100% availability
- Intelligent Resource Director
But I think we have already shown why you can efficiently run z/OS workloads on IBM Z at very high utilization without fear of response-time delays for critical transactional and batch work. So, to the question posed by the title of this article, the answer is that constant innovation, enhancement, and optimization on every level of the mainframe platform make it possible.
Bob Rogers worked on mainframe system software for 43 years at IBM before retiring as a Distinguished Engineer in 2012. He started with IBM as a computer operator in 1969. After receiving a B.A. in Mathematics from Marist College in 1971, he became a computer programmer at the Poughkeepsie Programming Center, where he worked on the OS/370 operating system. Bob continued to work on mainframe operating system development for his entire career at IBM. He contributed to the transitions to XA-370 and ESA/370, and was lead software designer for the transition to the 64-bit z/Architecture. He implemented the support for single z/OS images with more than 16 CPUs and was a lead designer of the z/OS support for the zAAP and zIIP specialty engines. Today's z/OS implements dozens of his design ideas. His last assignment before retiring was to foster greater synergy between System z hardware and software. Bob has been a popular speaker at SHARE, the System z Technical University, and other venues for many years. Bob rejoined IBM part-time in 2015 as a member of the z/VM development team, where he continues working on mainframe operating system software.