Larger organizations often have to support a myriad of applications that live on several different platforms. Some are home-grown, some are third-party packages, and some are externally-hosted services. Some are relatively young and others date back decades. Even the ones that happen to be written in the same programming language may require different system configurations or different versions of libraries. It’s a challenging mixed bag of platforms and applications.
A common strategy to deal with this complexity is to avoid making any changes at all unless a vendor threatens to drop support for an application or a system platform, or a regulatory agency demands that a system be brought up to a release level that supports required protocols (a recent example that affected many financial institutions: lack of TLS 1.2 support in older third-party apps).
Usually, three key reasons are given for adopting a no-change strategy.
First, many of these systems are hard to update. In the worst cases, the source code was lost years ago and the original vendor went out of business before the customer’s technical staff was born. In more typical cases, the code was not originally designed to accommodate updates. Vendors concocted a wide variety of schemes to enable customers to update the code. These usually involve matching source code line numbers between the installed version of a product and a set of updates, or some “clever” method even more fragile than that. Complicating the matter further, most customers have made local custom changes to the code over the years, so the line numbers don’t match and the customized routines may not work with the updates installed. Once the technical staff manages to get a stable version of the application working, people leave it alone until it breaks. They may forgo years of updates because the hassle of applying changes is just too great.
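To see why line-number matching is so fragile, here is a minimal sketch in Python. The update format (line number, expected text, replacement text) and the sample source lines are invented for illustration; real vendor schemes differ in the details, but they fail the same way once a local change shifts the lines.

```python
def apply_update(source_lines, update):
    """Apply a line-number-keyed update to a list of source lines.

    `update` is a list of (line_number, expected_text, replacement_text)
    tuples, with 1-based line numbers. This format is hypothetical.
    """
    patched = list(source_lines)
    for line_no, expected, replacement in update:
        found = patched[line_no - 1]
        if found != expected:
            # The installed code no longer matches what the vendor assumed.
            raise ValueError(
                f"line {line_no}: expected {expected!r}, found {found!r}"
            )
        patched[line_no - 1] = replacement
    return patched


# The code as the vendor shipped it (invented sample lines).
vendor_release = ["MOVE A TO B", "ADD 1 TO COUNT", "PERFORM WRITE-REC"]

# A local customization added one line at the top, shifting everything down.
customized = ["MOVE ZERO TO LOCAL-FLAG"] + vendor_release

# The vendor's update assumes the unmodified line numbering.
update = [(2, "ADD 1 TO COUNT", "ADD 2 TO COUNT")]
```

Applying `update` to `vendor_release` succeeds. Applying it to `customized` raises an error, because line 2 is now a different statement. A scheme that skips the check on the expected text would silently overwrite the wrong line instead, which is worse.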
Second, the systems in a large enterprise are usually interconnected in numerous ways. Developers added connections between systems over the years, and data flows across numerous systems in ways that no one in the organization fully understands. Usually, the flow of data through all the systems is not documented anywhere, nor are the cross-application dependencies, nor are the interfaces that will break if someone upstream adds a field or extends the possible values of an existing field. No file, no database column, no program, no routine within a program is ever deleted, because nobody knows what the impact might be. So, there are many resources in the environment that are not actually used, but no one is confident they understand enough about dependencies to delete anything. Any change to any system runs the risk of causing downstream impacts in production. And “upstream” and “downstream” are optimistic words. The streams flow every which way.
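If the data flows were written down, the question "what breaks if I change this?" would become a simple graph traversal. The sketch below assumes a hand-maintained map of which system feeds which; the system names are made up, and the map deliberately includes a cycle, since the streams really do flow every which way.

```python
from collections import deque

# Hypothetical data-flow map: each key sends data to the systems in its list.
# Note the cycle between "billing" and "ledger".
FEEDS = {
    "orders": ["billing", "warehouse"],
    "billing": ["ledger"],
    "ledger": ["reporting", "billing"],
    "warehouse": ["reporting"],
    "reporting": [],
    "old_extract": [],
}


def downstream_of(system, feeds):
    """Return every other system reachable from `system` via data flows."""
    seen = {system}
    queue = deque(feeds.get(system, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; this also stops cycles from looping
        seen.add(current)
        queue.extend(feeds.get(current, []))
    return seen - {system}


def unreferenced(feeds):
    """Return systems that neither send nor receive data in the map."""
    receivers = {target for targets in feeds.values() for target in targets}
    return {
        name for name, targets in feeds.items()
        if not targets and name not in receivers
    }
```

With this map, `downstream_of("billing", FEEDS)` returns `ledger` and `reporting`, and `unreferenced(FEEDS)` returns `old_extract`, a candidate for the kind of cleanup nobody currently dares to do. The code is trivial; the hard part is building and maintaining the map.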
Third, a long-established, large enterprise may have upwards of 2,500 applications in production at any given time. It also has quite a few different platforms in the environment, and different versions of each platform to support different applications that vary from convention in unique ways or that have dependencies on obsolete versions of libraries. Even a company that has 100,000 employees doesn’t have enough technical staff to keep up with all that. If they tried to keep all their systems up to date, the technical staff would have no time to do any other work.
But…that doesn’t sound like false economy. It just sounds like plain, old economy. So, what’s the problem?
At the risk of oversimplifying, we can state the problem this way: The longer we defer updating our systems, the more difficult and more expensive it becomes to update or replace them when the inevitable finally happens.
It might surprise you to learn how many well-established and trustworthy institutions are running their systems just this close to disaster, every day. A single hardware fault could affect customers to the tune of tens of millions of dollars. The delay involved in updating key legacy applications to support a new regulatory requirement could cost tens of thousands per month in fines.
Most of the larger, older companies are running at least one mission-critical application on a hardware platform or operating system version that is no longer supported. Every year they leave that system in place they increase the financial impact that will hit them when it finally fails.
Large companies have clearly-defined disaster recovery (DR) plans. They routinely test their DR procedures at least once annually; sometimes more often than that. The problem is that they typically don’t keep their DR procedures up to date with respect to new applications and technical resources. Their DR testing processes don’t properly exercise the DR plan; they don’t go to the extent of bringing up the full data center at the hot site. They just hit a few high points.
The result is predictable. I’ve seen it more than once while working with clients. They lose the data center or a subset thereof, and they activate the DR procedure to move operations to the hot site, for which they pay high monthly fees. Inevitably, at least one key system fails to come up at the second site. Sometimes they learn the hard way that their database backups are corrupted; their DR test process didn’t verify that all databases could be restored.
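A restore check does not have to be elaborate to catch this kind of failure. The sketch below uses SQLite from Python’s standard library purely as a stand-in; an enterprise database would be verified by restoring it with the vendor’s own tooling. The function name and the list of required tables are assumptions for illustration.

```python
import sqlite3


def verify_backup(path, required_tables):
    """Return a list of problems found in the backup file at `path`.

    An empty list means the backup opened, passed SQLite's integrity
    check, and contains every table named in `required_tables`.
    """
    problems = []
    try:
        conn = sqlite3.connect(path)
        try:
            # SQLite reports the single row "ok" when no damage is found.
            result = conn.execute("PRAGMA integrity_check").fetchone()[0]
            if result != "ok":
                problems.append(f"integrity check failed: {result}")
            existing = {
                row[0]
                for row in conn.execute(
                    "SELECT name FROM sqlite_master WHERE type = 'table'"
                )
            }
            for table in required_tables:
                if table not in existing:
                    problems.append(f"missing table: {table}")
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not a readable database.
        problems.append(f"cannot read backup: {exc}")
    return problems
```

Run against every backup on a schedule, a check like this turns "we assume the backups are good" into something that is actually tested before the day the hot site is needed.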
If it’s hard, do it often
Human nature (I suppose; I’m not a psychologist) leads us to avoid things that are difficult or painful. The reasons companies give to avoid changing systems all boil down to this: It’s hard and painful.
One mantra of the Agile community is that if something is hard, do it often. Do it so often that it becomes routine. Do it so often that you fully understand how to do it when the pressure is turned up. Do it so often that if there’s a flaw, like an unreadable database backup, you discover it before there’s a crisis.
Agilists are usually referring to things like continuous integration when they say this; small potatoes, in perspective. But the same philosophy can apply to data center management and system updates.
The approach works well for application updates, too. I’ve worked with client personnel when they had to update a third-party package by matching up line numbers. Because their management had insisted they not apply updates until it was absolutely unavoidable, they had to apply updates for version 25.32 of a system, released in 2017, to their current production version 3.01, released in 1986. The vendor provides help for customers running old releases as far back as 23.0, just in case everyone isn’t quite up to date. Quite nice of them. It’s precisely as much fun as it sounds. (I made up those numbers, by the way. But you get the point.)
And this is where false economy becomes obvious. The amount of time technical staff must spend to sort out that kind of an update is very costly. Had they simply applied each update as it was released down through the years, each one would have been a painless no-brainer, completed in minutes or hours. And if the DR plan had been properly exercised, they could have made the changes without fear, confident in their recovery process. Instead, the company pays the direct cost of labor for an extraordinary number of work hours, plus the opportunity cost of all the work the staff is not doing while they slog through the update, line by line.
Ultimately it’s far cheaper to keep everything up to date and to fix any inconsistencies that arise immediately than it is to defer updates and changes. Head in the sand and fingers crossed just isn’t a viable strategy.
What about the third reason to keep things the same…insufficient number of technical people to keep up with all the changes? It turns out that the more frequently they apply updates to the systems, the easier and quicker it becomes. The reason they don’t have time to update everything today is that they’ve allowed things to deteriorate for so long. Once they get to the point that updates are routine, they’ll be able to handle the workload.
And that DR procedure? Go ahead and bring up the full data center at the secondary site and run live production from it. Switch back and forth regularly. That way, you absolutely know that each site can stand in for the other. The more often you do it, the more you will iron out the kinks and automate the difficult parts.