Sponsored Content by Broadcom
How tough is your IT system? Resiliency is critical, both to us as human beings and to our IT systems. It is defined as the ability to recover from or adjust easily to adversity or change. It doesn’t mean you won’t have problems; it means you can bounce back quickly when a problem occurs.
Successful businesses maintain comprehensive resiliency plans. Unfortunately, as companies began offering digital services, not all of them carried resiliency into their operational IT plans. IT shops have traditionally been tasked with being as efficient as possible, which can sometimes seem opposed to resiliency, which requires duplication, redundancy, and preparation for outlying conditions that may never occur. Often, this has been viewed as an inefficient use of resources, especially when most IT failures generally do not materially impact business results.
But as we have seen in recent years, the exponential growth of data and transactions generated by accelerated digital transformation has made business resiliency highly dependent on IT resiliency. Operational resiliency alone is no longer an option. Business and IT resiliency are now permanently linked.
At Broadcom, we constantly work with our customers to develop and refine a set of best practices to better align IT and business resiliency. These best practices are based on the acceptance that problems are going to occur, and automatic recovery is the preferred response, with an ability to rapidly diagnose and adjust when automation is not possible.
Keep in mind that it’s OK to establish different resiliency plans for different types of services. A digital service that would result in substantial business issues if unavailable will require a predictive and preventative level of resiliency with automatic remediation, but a service that’s invoked, say, quarterly to support internal executive reviews could tolerate a longer mean-time-to-recovery (MTTR). In other words, take an application-centric view when establishing your resiliency plans.
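That application-centric view can be made concrete as a service catalog that maps each service to a resiliency tier. The sketch below is purely illustrative: the tier names, MTTR targets, and service names are hypothetical, and real values would come from a business impact analysis.

```python
from dataclasses import dataclass

@dataclass
class ResiliencyTier:
    """Hypothetical tier mapping business criticality to recovery expectations."""
    name: str
    target_mttr_minutes: int   # how long the business can tolerate an outage
    auto_remediation: bool     # whether recovery must be automatic

# Illustrative tiers; a critical digital service gets automatic remediation,
# while a quarterly internal report can tolerate a two-day MTTR.
TIERS = {
    "critical": ResiliencyTier("critical", target_mttr_minutes=5, auto_remediation=True),
    "standard": ResiliencyTier("standard", target_mttr_minutes=240, auto_remediation=False),
    "periodic": ResiliencyTier("periodic", target_mttr_minutes=2880, auto_remediation=False),
}

def tier_for(service: str, catalog: dict) -> ResiliencyTier:
    """Look up a service's tier, defaulting to the strictest when unknown."""
    return TIERS[catalog.get(service, "critical")]

catalog = {"payments-api": "critical", "exec-quarterly-report": "periodic"}
print(tier_for("exec-quarterly-report", catalog).target_mttr_minutes)  # 2880
```

Defaulting unknown services to the strictest tier is a deliberate safety choice: it is cheaper to relax a tier after review than to discover mid-outage that a critical service was never classified.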
1. Incident Management
Not every incident can be resolved through automated recovery processes. What is your incident management plan for dealing with issues that require manual resolution?
Setting up a “war room” (physical or virtual) is a traditional approach — gather all your cross-functional stakeholders, infrastructure team, and subject matter experts, and figure it out. The challenge is that in war room situations, it’s often unclear who is calling the shots and who is responsible for making decisions. This can create a chaotic, even toxic environment, with the potential for heated arguments and finger-pointing. Such dynamics not only derail the actual debugging of the issue but also inhibit some people from sharing ideas that might resolve it, because they feel unsafe or undervalued.
Let’s face it — war rooms are unlikely to go away anytime soon. But high-performing teams have shown that establishing a “no blame” culture, where team members are willing to openly communicate risks and opportunities, leads to fresh ideas and uncovers new possibilities in recovering from an issue. It can take a lot of work to establish such a culture, but it’s worth it. I outlined some of those ideas in this blog post.
2. Change Management
Business owners tend to discourage change in the digital systems that support their applications — often because there have been impacts associated with change in the past. In some shops, business pressure has limited IT to applying some fixes in ever-narrowing windows. The challenge here is that when you have fewer windows to make changes and the number of changes is accelerating, it becomes a self-fulfilling prophecy that you're going to hit an issue. Why?
Consider the extremes. If you apply one fix, and something breaks, you immediately understand the cause and can back that fix out. If you apply a thousand fixes simultaneously, and something breaks, it’s almost impossible to identify which fix was responsible. In some cases, the root cause of a problem can be obscured by other services in the update that mask the underlying issue. And, of course, delaying the rollout of preventative service leaves known defects in production systems longer, thereby increasing the odds of hitting a problem that has a fix available.
Frequently applying preventative maintenance minimizes the impact of outages. Having fewer changes in a service update makes it easier to diagnose what was changed and what caused the outage, as well as how to revert any of those changes. If you can’t identify which one affected you, backing out just a handful is less risky than backing out hundreds.
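The diagnosis argument above has a well-known algorithmic core: if fixes are applied in a known order, the culprit in a batch can be found by binary search over prefixes, the same idea behind `git bisect`. This sketch assumes a hypothetical `is_healthy` probe that can re-test the system with only a prefix of the fixes applied.

```python
def find_breaking_fix(fixes, is_healthy):
    """Binary-search an ordered batch of fixes for the first one that breaks
    the system. is_healthy(applied) is a hypothetical probe that re-tests the
    system with only that prefix of fixes applied."""
    lo, hi = 0, len(fixes)  # invariant: the prefix of length lo is healthy
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(fixes[: mid + 1]):
            lo = mid + 1   # breakage is after this prefix
        else:
            hi = mid       # breakage is at or before fixes[mid]
    return fixes[lo] if lo < len(fixes) else None  # None: whole batch healthy

fixes = [f"FIX-{n}" for n in range(1, 9)]
bad = "FIX-6"  # pretend this one introduced the outage
culprit = find_breaking_fix(fixes, lambda applied: bad not in applied)
print(culprit)  # FIX-6
```

Even with a probe like this, the cost grows with batch size: eight fixes take three probes, a thousand take about ten, and each probe may mean another disruptive redeploy. Small, frequent batches keep that search, and the backout, cheap.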
Because of past incidents, businesses that advocate continuous delivery of new functionality do not always extend that agile way of working to IT operations. If your business application changes are frequent and granular, but your operational changes are applied infrequently in large batches, you’re facing a conflict that should be resolved. This is one way tension arises between operational resiliency and IT resiliency.
There are so many angles to the change management conversation. My best advice is to work closely with the business to help them understand that staying current with maintenance is a good thing. Examine your change management process and get a clear view of the end objectives, then you can go back and identify the minimum number of artifacts you must have to make a change.
In conjunction with that, automate as much as possible!
Some institutions segregate application change windows from infrastructure windows based on a perception that it’s difficult to debug both at the same time. The challenge here is that if you test application changes and infrastructure changes together, but only roll the infrastructure changes out to production, your production systems become disconnected from what you just tested. This only increases risk: what you tested no longer matches what is running in production, leaving you with an untested variant.
Imagine a scenario where thousands of application changes have been applied, perhaps in an agile process, but changes required within the underlying operating system have not been applied for a relatively long period. Once those infrastructure changes are applied and business applications begin to fail, how do you diagnose the problem? Both the application changes and infrastructure changes were thoroughly tested, but they were not applied together.
Consider moving toward a continuous testing model that uses self-verifying, automated test cases, and a change process in which application teams confirm they have re-tested whenever a new set of infrastructure packages arrives. Be sure to deploy both the application and infrastructure updates simultaneously.
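One way to enforce "deploy both together, never one alone" is to gate the combined rollout on the self-verifying suite itself. In this sketch, `deploy`, `run_smoke_tests`, and the version strings are all hypothetical names; the suite is assumed to return a list of (test_name, passed) pairs.

```python
def deploy(app_version, infra_version, run_smoke_tests):
    """Deploy application and infrastructure as one unit, then self-verify.
    run_smoke_tests is a hypothetical hook returning (test_name, passed)
    pairs from the automated, self-verifying suite."""
    results = run_smoke_tests(app_version, infra_version)
    failures = [name for name, passed in results if not passed]
    if failures:
        # Self-verification failed: reject the combined rollout so production
        # never diverges from the tested app+infra pairing.
        raise RuntimeError(f"rollback {app_version}+{infra_version}: {failures}")
    return "deployed"

print(deploy("app-2.1", "infra-9", lambda a, i: [("smoke", True)]))  # deployed
```

The key design point is that the smoke tests receive both versions: a passing result certifies the pairing, not either component in isolation, which is exactly the property lost when application and infrastructure windows are segregated.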
There was a time when good systems programmers knew the z/OS control blocks, structures, and states and could easily work their way through a system dump. These days, due to retirements and changes in the way new IT professionals are educated, such deep core z/OS technical skills are hard to find, and many operators lack an understanding of the core operating system they're managing. Manual problem diagnosis is no longer the norm; it’s a luxury.
The best practice to compensate for this shift is to use tooling that generates intelligent insights about system-crippling situations and, even better, delivers automated responses.
This leads to the best practice of observability. IT tools can’t derive insights or recommend automation without first being able to measure a system’s current state by examining its outputs — the definition of observability. The best practice here is to work with companies that are committed to building products that expose their data via Application Programming Interfaces (APIs) — preferably open APIs — so that operations tools can help operations staff use that data to begin shifting their focus from reactive recovery to proactive avoidance.
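At its simplest, that shift from reactive recovery to proactive avoidance means routinely comparing the outputs a system exposes through its APIs against known-safe thresholds. The sketch below is a minimal illustration; the metric names, threshold values, and JSON payload are invented, standing in for whatever a product's open API actually exposes.

```python
import json

def assess(metrics, thresholds):
    """Compare a system's exposed outputs (e.g. fetched from an open metrics
    API) against thresholds, returning warnings before a crippling failure."""
    warnings = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds {limit}")
    return warnings

# Illustrative payload a monitoring API might return as JSON.
payload = json.loads('{"cpu_pct": 92.5, "queue_depth": 40, "io_wait_pct": 3.1}')
print(assess(payload, {"cpu_pct": 85.0, "queue_depth": 100}))
```

Metrics absent from the payload are simply skipped here; a production tool would more likely flag a missing metric too, since "no data" is itself a signal that observability has degraded.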
Resilience is not just about the changes you make; it’s a partnership between you and your IT vendors. At Broadcom, we take this partnership seriously, and we are committed to delivering programs to help your resiliency journey. A few brief highlights follow — and as a licensee of Broadcom mainframe software, you already have access to all these programs and more.
All-Access Pass to No-Cost Education: Whether you’re new to the mainframe or building your skills, you’ll find tremendous value in our comprehensive library of training, tutorials, product guides, and documentation — all regularly refreshed. And if you can’t find exactly what you need, use our online communities to collaborate with your peers and industry experts.
Expert Change Planning: Work together with Broadcom SMEs to proactively review your plans in collaborative on-site workshops, tailored to your needs.
Assess Your Mainframe Environment: Customized Mainframe Resource Intelligence (MRI) engagements analyze your mainframe health and environment to give you clear actions you can take right now to save money and allocate resources most effectively.
Mitigate Risk: Reduce the risk and cost of software conversions by taking advantage of a proven methodology and resources before you implement your next software release.
There are always opportunities to improve. We continuously review our documentation, publications, and knowledge documents to ensure that you can quickly find and implement best practices. And we are always refining our tooling to make it easier for you to see whether your systems are in or out of compliance with best practices, and to eliminate, or at least reduce, configuration drift over time.