If you haven’t heard, the reliability of IBM Z has improved from seven nines to eight nines.* This is a big deal, and it didn’t happen by accident. It takes a lot of innovation and engineering to add another nine to the end of that string. That sounds all fancy, but what it means is that we have reduced the downtime of our systems from about three seconds per year down to less than half a second per year. During the development of the IBM z16 and LinuxONE IV systems, we kicked off a project known as the Quality Initiative, focused on moving the needle in four key areas: firmware delivery, automation, tooling, and serviceability testing. As the IBM Z and LinuxONE Test Architect, it was my responsibility to ensure that our test and quality assurance organization was a key contributor to this effort.
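For a back-of-the-envelope sense of what those nines mean, here is a quick sketch of the arithmetic (the exact figures depend on how you count a year):

```python
# Back-of-the-envelope conversion from availability "nines" to downtime per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_seconds(availability: float) -> float:
    """Expected downtime per year, in seconds, for a given availability level."""
    return SECONDS_PER_YEAR * (1.0 - availability)

print(f"Seven nines: {downtime_seconds(0.9999999):.2f} s/year")   # ~3.16 seconds
print(f"Eight nines: {downtime_seconds(0.99999999):.2f} s/year")  # ~0.32 seconds
```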
The firmware driver for z16 and LinuxONE consists of 160 different code streams that must work together for the system to run. It’s a real challenge to make sure deliveries are consistent, have been unit tested, and have clearly defined dependencies on other code packages. On top of all that, the firmware development team must build a new driver daily for our system test environments, picking up all the latest fixes and features committed the day before. Stabilizing the hardware and the firmware driver is mission number one of the final system test phase, and the quicker we achieve that goal, the faster we can move on to testing new features and functions. One habit we observed in our test squads was a hesitance to move to the latest drivers, stemming from a lack of confidence in the quality and stability of each new build.
To combat this, we took a two-pronged approach. First, we added system-level testing to the CI/CD driver build process. These tests were divided into three tiers, each increasing in complexity: simple, good-path functional verification sat in the lowest tier, while error-injection scenarios with hardware recovery actions sat in the highest. Every day, when a new driver was built, this battery of tests ran against it and the results were displayed on a dashboard. That gave the teams better insight into what was or wasn’t working in the latest build, so they could decide whether to pick up the new firmware load with its new fixes or stay on their current code to keep making test progress. Second, we instituted a round-robin test: on a rotating schedule, one test squad would pick up that day’s driver, run a set of predetermined tests plus a sprinkle of ad hoc testing based on that squad’s area of expertise, and report the results at the test team’s daily standup. With both methods in place, teams gained confidence in the driver builds and moved to the newest builds more quickly.
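To make the idea of tiered post-build testing concrete, here is a minimal sketch of how a daily driver build could be run through increasingly complex tiers and summarized for a dashboard. The tier contents, test names, and always-passing stubs are illustrative assumptions, not our actual CI/CD tooling:

```python
# Minimal sketch of a tiered post-build test pass. Test names and the
# always-passing stubs are placeholders for real checks.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestCase:
    name: str
    run: Callable[[], bool]  # returns True on pass; stubbed here

TIERS: List[List[TestCase]] = [
    # Tier 1: simple, good-path functional verification
    [TestCase("ipl_and_boot_check", lambda: True)],
    # Tier 2: heavier functional and stress coverage
    [TestCase("io_traffic_smoke", lambda: True)],
    # Tier 3: error injection with hardware recovery actions
    [TestCase("channel_card_error_inject", lambda: True)],
]

def run_driver_tests(driver_id: str) -> Dict:
    """Run each tier in order and collect pass/fail results for a dashboard."""
    results = {
        f"tier{level}": {test.name: test.run() for test in tier}
        for level, tier in enumerate(TIERS, start=1)
    }
    return {"driver": driver_id, "results": results}

if __name__ == "__main__":
    print(run_driver_tests("daily-driver-build"))
```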
Next, we enhanced the testing process by building an automation framework around all our system test tools. Let me preface this by saying that even before this effort our test teams used automation at every level: we had automated unit test cases, and the tools we use to run tests have automation and randomization built into their design. What we started building here was a common framework in which all of our test operating environments, spanning the various operating systems we use to test (z/OS, zTPF, Red Hat, and others), and all of our tooling could work together, reducing the overhead needed to set up and reconfigure new environments.
Our first attempt at this was definitely a case of biting off more than we could chew. We quickly discovered that a lot of foundational pieces of automation were required before we could build the framework. We took this challenge as an opportunity to work on those foundations by partnering with our test floor users, holding several interviews, and running a few design-thinking sessions to identify their common pain points. Based on those pain points, we refocused our automation work on tooling that frees up users’ time by automating more of the repetitive tasks (code installation, error log collection, and grooming) so that time could be spent on more artistic and devious testing.
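As one illustration of the kind of repetitive chore worth automating, here is a small sketch that sweeps error logs off test systems into an archive and grooms out old copies. The host paths, file pattern, and retention policy are assumptions for the example, not our real setup:

```python
# Illustrative sketch of automating a repetitive test-floor chore: collecting
# error logs into a timestamped archive and grooming out stale copies.
import shutil
import time
from pathlib import Path

LOG_SOURCES = [Path("/var/log/testsys1"), Path("/var/log/testsys2")]  # assumed paths
ARCHIVE_DIR = Path("/archive/error_logs")                             # assumed archive
RETENTION_DAYS = 30                                                   # assumed policy

def collect_logs() -> None:
    """Copy new error logs into a timestamped archive folder."""
    dest = ARCHIVE_DIR / time.strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for source in LOG_SOURCES:
        for log in source.glob("*.errlog"):
            shutil.copy2(log, dest / log.name)

def groom_archive() -> None:
    """Delete archived folders older than the retention window."""
    cutoff = time.time() - RETENTION_DAYS * 24 * 3600
    for folder in ARCHIVE_DIR.iterdir():
        if folder.is_dir() and folder.stat().st_mtime < cutoff:
            shutil.rmtree(folder)

if __name__ == "__main__":
    collect_logs()
    groom_archive()
```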
Speaking of devious testing, one of the hardest tests that we do on the IBM Z and LinuxONE platforms is for serviceability. To execute a service test on an I/O channel card, for example, we first need to get one or more operating systems running. Next, we start a workload to generate traffic through that I/O adapter. Then we validate that the applications are operating correctly. Once that is all running well, we inject an error into the I/O adapter. This could be a hardware fault delivered by a specially engineered card, or it could be a software-driven fault. Then we validate that the system reports the error correctly. Lastly, we go through the repair process to replace the faulted component with new hardware concurrently, with all workloads still running, and verify that, once the error has been cleared, the system rebalances the workloads onto the newly replaced hardware. A lot of the system needs to be healthy and functional before we can even attempt these sorts of tests.
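That flow maps naturally onto a scripted sequence. The skeleton below sketches the order of operations described above; every helper is a stub standing in for real test-floor tooling, and all of the names are hypothetical:

```python
# Skeleton of the serviceability test flow; all helpers are hypothetical stubs.
def boot_operating_systems() -> None: ...
def start_workload(adapter_id: str) -> str: return f"workload-{adapter_id}"
def applications_healthy(workload: str) -> bool: return True
def inject_fault(adapter_id: str) -> None: ...
def error_reported_correctly(adapter_id: str) -> bool: return True
def replace_component_concurrently(adapter_id: str) -> None: ...
def clear_error(adapter_id: str) -> None: ...
def workload_rebalanced(adapter_id: str) -> bool: return True

def run_channel_card_service_test(adapter_id: str) -> None:
    boot_operating_systems()                      # one or more OS images
    workload = start_workload(adapter_id)         # drive I/O traffic over the adapter
    assert applications_healthy(workload)         # baseline validation
    inject_fault(adapter_id)                      # hardware- or software-driven error
    assert error_reported_correctly(adapter_id)   # system must surface the fault
    replace_component_concurrently(adapter_id)    # repair with workloads still running
    clear_error(adapter_id)
    assert workload_rebalanced(adapter_id)        # traffic shifts to the new hardware

if __name__ == "__main__":
    run_channel_card_service_test("chpid-0x40")
```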
Innovation is a persistent mindset as we continue to develop new tools, new test cases, and new methodologies in our test environments. The results are evident in the improved reliability metrics as well as in the field quality data that we track and monitor. Without this drive for constant improvement, we would not be able to keep raising the bar on the quality and reliability of our mainframe systems.
*For clients running z/OS v2.5 and a common IBM software stack on IBM z16, up to 99.999999% availability can be expected when the configuration includes an enabled Parallel Sysplex, System Recovery Boost (SRB), GDPS, DS8K with HyperSwap, and best practices.
PJ Catalano is the IBM Z and LinuxONE test architect at IBM. PJ is a senior technical staff member and IBM Master Inventor. He has been at IBM for 20 years, spending his entire career in different areas of test and quality assurance. His new focus is bringing awareness to the mainframe platform through short-form videos and memes. You can find PJ and his content here on LinkedIn, Instagram, and TikTok.