Thirty years ago, IBM’s release of MVS/ESA SP V5.1.0 introduced IBM Z Parallel Sysplex® and the Coupling Facility (CF) to enable a cluster of z/OS systems to achieve near linear scalability and near continuous availability through high performance data sharing with integrity. Eventual consistency of data is not an option for the transaction-oriented workloads running on these systems. As for scalability, I know of several installations whose sysplexes exceed 25 systems.
Though I’m likely biased by assiduous attention to availability during my career, I believe the prospect of near continuous availability of business services is the primary value proposition for Parallel Sysplex.
The oldest active sysplex I know has been running since October of 2002, more than 21 years (issue a DISPLAY XCF,SYSPLEX,ALL command to see when your sysplex started). However, this is a statement about the availability of the infrastructure. The real goal is highly available business services. A highly available infrastructure is necessary, but not sufficient. Indeed, I’ve seen installations with equally beautiful sysplex infrastructures have wildly different availability outcomes for their business services. Based on my experience with IBM’s High Availability Center of Competency (HACoC), I believe the following must be done well to successfully deliver highly available business services.
Set appropriate goals. Assume failures and service outages will occur. Since impactful disruptions tend to be those lasting “too long,” define the maximum time an impact can be tolerated. Call this the Service Restoration Objective (SRO). Design and implement solutions that enable the business service to be restored to normal operation within its SRO when failures occur. An SRO of less than 15 minutes likely requires investment in automation since manual intervention and resolution will typically take longer than that. An SRO of less than a couple of minutes likely demands sysplex-enabled applications since recovery must be accomplished at “machine speeds.” A sysplex-enabled application is deployed with redundant instances, has no affinities to specific nodes, implements transaction routing and workload balancing using whatever middleware-oriented technique is suitable, and supports rolling updates so maintenance and upgrades can be applied to different systems in succession.
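To make the idea concrete, here is a minimal sketch of how an SRO might be mapped to the recovery approach it likely demands. The thresholds and the function name are my own illustrative assumptions drawn from the rules of thumb above, not prescriptive guidance.

```python
# Hypothetical sketch: map a Service Restoration Objective (SRO) to the
# recovery approach it likely requires.  Thresholds mirror the rough rules
# of thumb in the text; they are illustrative assumptions, not IBM guidance.

def recovery_approach(sro_minutes: float) -> str:
    """Return the recovery approach an SRO of the given length likely demands."""
    if sro_minutes < 2:
        # Recovery must happen at "machine speeds": redundant, sysplex-enabled
        # application instances with workload balancing and no node affinities.
        return "sysplex-enabled application with automated failover"
    if sro_minutes < 15:
        # Manual intervention usually takes longer than this, so recovery
        # actions must be automated.
        return "automated recovery actions"
    # Longer SROs may tolerate well-rehearsed manual procedures.
    return "documented, practiced manual recovery procedures"

if __name__ == "__main__":
    for sro in (1, 10, 60):
        print(f"SRO of {sro} minute(s): {recovery_approach(sro)}")
```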
Quantify outage costs. The cost of a service outage is needed to make the business case for availability improvements. Outage costs include lost productivity for impacted users, staff costs for resolving the outage, customer service impacts, reputational damage, revenue loss, financial penalties, wasted goods, etc. Knowing the outage cost, you can then quantify the benefit of achieving a given SRO. Alternatively, you might justify availability investments as an insurance premium to avoid a catastrophic loss. You don’t cancel your home insurance because you’ve never had a fire. On the other hand, you wouldn’t pay a premium that exceeds the cost to replace the house.
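As a concrete illustration of this arithmetic, the following sketch compares the expected annual cost of outages today against what that cost would be if the SRO were met. Every figure is a hypothetical placeholder; substitute your own estimates.

```python
# Hypothetical sketch of the outage-cost arithmetic described above.
# All figures are invented placeholders; substitute your own estimates.

OUTAGES_PER_YEAR = 2        # expected impactful outages per year
AVG_OUTAGE_MINUTES = 90     # average duration without further investment
SRO_MINUTES = 15            # target Service Restoration Objective
COST_PER_MINUTE = 10_000    # lost revenue, productivity, penalties, etc.

current_annual_cost = OUTAGES_PER_YEAR * AVG_OUTAGE_MINUTES * COST_PER_MINUTE
cost_if_sro_met = OUTAGES_PER_YEAR * SRO_MINUTES * COST_PER_MINUTE
annual_benefit = current_annual_cost - cost_if_sro_met

print(f"Expected annual outage cost today:  ${current_annual_cost:,}")
print(f"Expected annual cost if SRO is met: ${cost_if_sro_met:,}")
print(f"Annual benefit of meeting the SRO:  ${annual_benefit:,}")
# Like an insurance premium, an availability investment is justified when its
# annualized cost is less than this benefit (or the catastrophic loss avoided).
```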
Increase reliability. Highly reliable components reduce the likelihood of failure and thus the opportunity for service disruption. Remove known defects by regularly installing maintenance. Replace aging hardware before it is likely to fail. Run on supported levels so you can get fixes. Address issues identified by the IBM Health Checker for z/OS.
Manage change effectively. Change is inevitable. In contrast with “change control,” which seemingly tries to impede change, an effective change management process enables change to be introduced safely into production. The purpose and value of the process are understood by those using it. The inputs are reasonable and make sense to the users. It is rigorous yet not so onerous that people try to skirt the process. Automate to make it reliably repeatable. Doing so helps ensure that the desired tests are performed as the change is rolled through the various environments on its way to production.
Use robust testing to try to discover and remove defects before they reach the production environment. Use testing to enable the staff to observe, understand, and improve system behavior when perturbed by various conditions (load, delays, constraints, component failures, and more). A production-like test environment where it is safe to inject failures increases the likelihood of discovering relevant issues. Recovery testing is important because failed recovery procedures generally lead to long outages.
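Here is a minimal sketch of what an automated recovery test might look like: inject a failure, then check whether the service is restored within its SRO. The functions inject_failure and service_is_healthy, and the timing values, are hypothetical placeholders for whatever tooling your test environment provides.

```python
# Hypothetical recovery test sketch: inject a failure, then verify the
# service is restored within its SRO.  inject_failure(), service_is_healthy(),
# and the names passed to them are placeholders for your own tooling.
import time

SRO_SECONDS = 15 * 60  # assumed 15-minute SRO for illustration

def inject_failure(component: str) -> None:
    """Placeholder: deliberately fail a component in the test environment."""
    print(f"Injecting failure into {component}")

def service_is_healthy(service: str) -> bool:
    """Placeholder: probe the business service (e.g., drive a test transaction)."""
    return True  # replace with a real health check

def recovery_test(component: str, service: str) -> bool:
    """Return True if the service recovers from a component failure within its SRO."""
    inject_failure(component)
    start = time.monotonic()
    while time.monotonic() - start < SRO_SECONDS:
        if service_is_healthy(service):
            print(f"Recovered in {time.monotonic() - start:.0f}s -- within SRO")
            return True
        time.sleep(5)
    print("Service not restored within the SRO -- investigate before production")
    return False
```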
When such a test environment does not exist, take advantage of the available opportunities. For example, expand disaster recovery testing to include production-like testing. Perform tests to identify boundary conditions and choke points that could limit the workload’s ability to scale. Recognize that production is in essence being used as a test environment if that is the place where the combined business services first interact and operate at scale. Recognize, too, that production will expose flows missed by test; driving all possible flows in test is impossible, since there are simply too many possible combinations.
Therefore, installations must be prepared to deal with the defects, failures, and problems that will inevitably appear in production. High-impact events where service is not restored within the SRO will require human intervention. To minimize recovery time, the “first responders” need training and skills. Ideally, the staff has a test environment where they can proactively identify and remediate issues related to monitoring and alerting, develop appropriate automation, and improve failure recovery. By building and maintaining the skills to deal with unusual conditions, the staff gains greater confidence in the appropriateness and effectiveness of their recovery actions during a service disruption. The more they understand expected system behavior and how the system responds to failure, the sooner they will restore service. However, relying on humans to diagnose and repair on the fly makes for long outages.
Avoid elongated outages by focusing on fast recovery that meets a suitable SRO. The Parallel Sysplex is all about creating redundancy that allows service to continue despite a component failure. For sysplex-enabled applications, the surviving instances can seamlessly absorb the work. Since recovery might not begin until the sysplex recognizes the failure and isolates the failed component, exploit the sysplex technologies intended to minimize the time it takes to recover (System Recovery Boost, System Status Detection, Sysplex Failure Management, etc.). For applications that are not sysplex-enabled, the goal is to get the service running on the surviving infrastructure within the SRO. Defer diagnosis and repair of the failed component until after the service is restored. Use automation to initiate recovery actions; for example, trigger a hot standby to pick up work or fire up a new application instance on a different system.
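The following sketch illustrates that “recover first, diagnose later” pattern: when a primary instance fails, automation activates a standby on a surviving system and only then queues the failed component for diagnosis. All of the functions and names are hypothetical placeholders for your own monitoring and automation tooling.

```python
# Hypothetical sketch of "recover first, diagnose later" automation.
# instance_failed(), activate_standby(), and queue_for_diagnosis() are
# placeholders for your own monitoring and automation tooling.
import time

def instance_failed(instance: str) -> bool:
    """Placeholder: detect failure (missed heartbeat, monitoring alert, etc.)."""
    return False

def activate_standby(standby: str) -> None:
    """Placeholder: start or promote the standby instance on a surviving system."""
    print(f"Activating {standby}")

def queue_for_diagnosis(instance: str) -> None:
    """Placeholder: record the failed component so diagnosis happens after restore."""
    print(f"{instance} queued for later diagnosis")

def watch(primary: str, standby: str) -> None:
    """Poll the primary instance and restore service as soon as failure is seen."""
    while True:
        if instance_failed(primary):
            activate_standby(standby)      # restore service first...
            queue_for_diagnosis(primary)   # ...defer diagnosis and repair
            break
        time.sleep(10)
```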
Ideally, work can run on any system in the sysplex. To do so, the service needs access to all its resources (hardware, middleware, application instances, and application data). If services cannot run everywhere, unique environments and procedures are needed (installation, test, monitoring, operations, recovery, etc.). It becomes more difficult to understand how systems interact. Uniqueness creates complexity. In a crisis, complexity makes for a plethora of options, which can create uncertainty that delays action and elongates outages. The exceptions provide opportunity for error. Cloning systems so everything is “the same” helps simplify. The staff can focus on making a smaller set of recovery procedures fast and reliable. Impediments to cloning certainly exist: software pricing is the one typically cited; installation conventions (SYSA is for X, SYSB is for Y) are another.
But a lack of sysplex-enabled applications is often the biggest impediment. It was always intended that applications enjoy the availability benefits of running in a Parallel Sysplex without change. However, simply exploiting middleware capable of data sharing is not always sufficient. If an application can neither be cloned nor moved, it is a single point of failure. The inability to do rolling installs for application changes/upgrades is also problematic. No matter how beautiful and well-managed your sysplex infrastructure might be, services depending on that application will suffer an outage for as long as it takes to restore the application. To deliver highly available services, the applications must participate. The shorter the SRO, the more likely you’ll need to configure your middleware for data sharing and sysplex-enable your applications.
Achieving high availability requires participation and focus from everyone involved in delivering the service: managers, architects, designers, developers, testers, process owners, operations, etc. All parties must work together. The infrastructure team cannot do it alone. Building a redundant infrastructure without designing applications to take advantage of that redundancy undermines service availability. A comprehensive framework with application standards and reference architectures can help drive collaboration and appropriate exploitation of sysplex capabilities. Organizational structures that foster communication and cooperation between the application teams and infrastructure teams can help achieve higher availability more effectively. Management must support availability objectives across the enterprise and provide the resources to achieve those goals.
Finally, analyze and continuously improve. Inevitably, problems escape into production. What process failures allowed that to happen? Where should the problem have been caught? Where could it be caught? Is the SRO appropriate? Was the SRO met? What factors elongated the recovery time? What would reduce the scope of the impact? Are there patterns to the problems that escape? This analysis goes way beyond “X broke, fixed X.” Use this analysis to make and justify investments to improve service availability.
Some mistakenly believe running on a Parallel Sysplex is all that’s needed for highly available business services. Others are lulled into complacency because their sysplex has run for years without an impactful failure. Unfortunately, the pain of a significant outage is sometimes needed to garner support for addressing availability issues. And as with any clustered solution, sympathy sickness (aka “sick but not dead”) can be problematic. But those who build on a Parallel Sysplex foundation, anticipate problems, remain vigilant and prepared, collaborate, and continually improve are generally successful in delivering highly available business services.
Mark Brooks has done z/OS Parallel Sysplex development for nearly 40 years. He frequently speaks at SHARE on a variety of sysplex-related topics.