Way back in the Dark Ages, it was common for programmers to receive phone calls in the wee hours of the morning to fix production issues in their applications. I remember joining a team in the late 1970s that supported an application used worldwide by a large corporation. When I started, the application suffered an average of 140 production incidents per month. The team resolved to drive that number down. After a few months, we had reduced the average number of incidents to 11 per month.
Experiences like that one led programmers of the era to adopt a philosophy we called defensive programming. The motivation was selfish. We were not driven by a fanatical dedication to quality. We were not driven by the 21st century buzzword, “passion.” We were driven by the simple biological imperative to get a good night’s sleep.
The Age of the Mainframe
Defensive programming consists of learning and using guidelines for software design and coding that tend to minimize the frequency and severity of problems in production. Back in the day, our quest for a good night’s sleep led us to build applications that were reliable and resilient; applications that could gracefully handle most unexpected inputs and unplanned partial outages of systems on which they were dependent, most of the time, without waking us up in the middle of the night.
We built batch applications that could be restarted at any step without losing data. We built “online” applications that could, within limits, self-recover from common problems, and that could (at least) properly record what happened and notify us before the situation went too far. We built systems that could judge (well, follow rules about) which errors were truly critical and which ones could be logged and dealt with in the morning.
We just wanted to sleep through the night for a change. What we got were high availability, reliability, recoverability, and happier customers. And sleep, too.
The Age of the Webapp
The day came when the software community more-or-less forgot about defensive programming. Many systems have been built (I’m thinking largely of webapps, but not exclusively) that break easily and just die or hang without offering much information to help us figure out what went wrong and how to prevent it from happening again.
Recovery came down to restoring data from a backup (if any) and restarting the application. Typically, this was (and still is) done without any root cause analysis. The goal is to get the thing up and running again quickly. Some people just expect the systems they support to fail periodically, and the “fix” is nothing more than a restart.
This may be a consequence of rushing the work. There’s a desire for rapid delivery. Unfortunately, rather than applying Systems Thinking, Lean Thinking, Agile thinking, sound software engineering principles, mindful testing by well-qualified testers, automated failover, and automated system monitoring and recovery tools, there’s been a tendency to use the same methods as in the Old Days, only pushing people to “go faster.” This generally results in more bugs delivered in less time.
Another cause might be the relatively high dependency on frameworks to generate applications quickly. It’s possible to produce robust code using a framework, but it requires a certain amount of hand-rolled code in addition to the generated boilerplate code. If we “go with the flow” and accept whatever the framework generates for us, we may be inviting problems in production.
It’s become a sort of game to take photos of obvious system failures and post them online. The ones I’ve seen in real life have included gas pumps, ATMs, and large advertising displays in shopping areas displaying Microsoft Windows error screens, and an airport kiosk displaying an HTTP 500 error complete with Java stack trace. The stack trace in particular is probably very helpful for vacationing families and traveling business people.
Some examples are rather beautiful, in their own way.
The fact these errors are displayed to the general public demonstrates a level of customer focus on the part of the development teams. Not a good level, but a level.
It’s amusing until one reflects on the potential consequences when a large proportion of software we depend on for everyday life is of this quality.
I guess that worked out okay for a generation of developers. There don’t seem to be many “war stories” of late nights, long weekends, or midnight wake-up calls from this era except in connection with manual deployment procedures. Manual deployment procedures are a pretty dependable recipe for lost sleep. But other than release nights, folks are apparently getting a good night’s sleep.
Sadly, today’s programmers seem to accept all-night “release parties” as the norm for software work, just as we did back in the Bad Old 1980s. When they take the view that this is normal and to be expected, it probably won’t occur to them to do anything about it. The curious and illogical pattern of forgetting or explicitly discarding lessons learned by previous generations of developers afflicts the current generation just as it did our own (well, I mean my own).
The Age of the Cloud
Then the “cloud” era came along. Now applications comprise numerous small pieces (services or microservices) that run inside (or outside) containers that live on dynamically-created virtual machines in an “elastic” environment that responds to changes in usage demand. Cloud infrastructures make it seem as if an application is long-lived, while in reality servers are being destroyed and re-created under the covers all the time.
Developers of microservices have no control over how developers of client applications will orchestrate the services. Developers of client applications have no control over how the microservices will be deployed and operated.
Cloud computing is evolving faster than previous paradigms did. Cloud, fog, and mist computing combined with the Internet of Things (IoT) and artificial intelligence (AI) add up to even more very small services interacting with one another dynamically in ways their developers cannot predict.
When two IoT devices equipped with AI come within radio range of one another, and they figure out how to interact based on algorithms they devised on their own, how can we mere humans know in advance all the possible edge cases and failure scenarios? What if it isn’t two, but twenty thousand IoT devices?
The opportunities for late-night wake-up calls are legion. Resiliency has become a Thing.
Challenges for Resiliency
When a system comprises numerous small components, it becomes easier to gain confidence that each component operates correctly in isolation, including exception scenarios. The trade-off is that it becomes harder to gain confidence that all the interrelated, interdependent pieces work correctly together as a whole.
With that in mind, I think we ought to be as rigorous as we can with the things we can control. There will be plenty of issues to deal with that we can’t control. Prevention isn’t enough, but it’s a strong start. Here are some things for managers, programmers, and testers to keep in mind.
Prevention – Programmers
The most fundamental way to ensure high quality when developing new code or modifying existing code is to learn and use generally-accepted good software design principles. Some of these principles are general and apply to any language and any programming model: things like separating concerns and following consistent naming conventions. Building on those, people have identified good practices within each programming “paradigm,” where paradigm means things like Object-Oriented, Functional, Procedural, and Declarative Programming.
Object-Oriented Programming has guidelines like the single responsibility principle and the open-closed principle; Functional Programming has guidelines like referential transparency and avoiding hidden side-effects; Procedural Programming has guidelines like placing higher-level logic first in the source file, followed by detailed routines, and allowing a single exit from a subroutine or called block. Obviously those are not comprehensive descriptions; I’m just trying to set some context.
One of the most basic and general guidelines is the principle of least surprise (or least astonishment). Doing things in the most “standard” or “vanilla” way, given the programming paradigm and language(s) in use, will result in fewer “surprises” in production than will a “clever” design.
Paying attention to consistency across the board helps, too. Handle exceptions in a consistent way, provide a consistent user experience, use consistent domain language throughout the solution, and so on. When you modify existing code, follow the patterns already present in the code rather than introducing a radically different approach just because it happens to be your personal preference. (This doesn’t apply to routine, incremental refactoring to keep the code “clean,” of course; don’t add to that long conditional block just because it’s already there.)
We can’t absolutely prevent any and all production issues when our solutions comprise massively distributed components that know nothing at all about one another until runtime. That isn’t an invitation to take no precautions at all, however. Some things programmers can do to minimize the risk of runtime issues in a world of cloud-based and IoT solutions:
- Be disciplined about avoiding short-cuts to “meet a date;” buggy software isn’t really “done” anyway, no matter how quickly it’s released.
- Emphasize system qualities like replaceability and resiliency over traditional qualities like maintainability. Treat your code as a temporary, tactical asset. It will be replaced, probably in less time than the typical production lifespan of traditional systems in the past.
- Make effective use of frameworks and libraries by understanding any inherent limitations or “holes,” especially with respect to security, interoperability, and performance.
- Learn and use appropriate development methods and guidelines to minimize the chance of errors down the line, such as Specification By Example to ensure consistent understanding of needs among stakeholders, Detroit School Test-Driven Development for algorithmic modules, London School Test-Driven Development for component interaction, Design By Contract to help components self-check when in operation, and others.
- Make full use of the type systems of the languages in which you write the solution. Don’t define domain entities as simple types like integer or string.
- Self-organize technical teams on a peer model rather than the Chief Programmer model, to help ensure common understanding and knowledge across the team and to increase the team’s bus number.
- Standardize and automate as many routine, repetitive tasks as you can (running test suites, packaging code into deployable units, deploying code, configuring software, provisioning environments, etc.).
- Don’t be satisfied with writing a few happy-path unit checks. Employ robust testing methods such as mutation testing and property-based testing as appropriate. You might be surprised at how many “holes” these tools can find in your unit test suite. It is perfectly okay if you write ten times as much “test” code as “production” code. There is a risk of writing some redundant test cases, but it would be far worse to overlook something important. Err on the side of overdoing it.
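As a minimal illustration of the property-based idea from the last point, here is a hand-rolled sketch using only the Python standard library (a dedicated tool such as Hypothesis does this far more cleverly; the `apply_discount` function is a hypothetical example):

```python
import random

def apply_discount(price_cents, percent):
    """Apply a percentage discount, rounding down to whole cents."""
    return price_cents - (price_cents * percent) // 100

# Property: for any valid input, a discount never increases the price and
# never drives it negative. Check the property against many randomly
# generated cases instead of a handful of hand-picked happy-path ones.
random.seed(42)
for _ in range(1000):
    price = random.randint(0, 10_000_000)
    percent = random.randint(0, 100)
    discounted = apply_discount(price, percent)
    assert 0 <= discounted <= price, (price, percent, discounted)

print("all properties held")
```

The point is the shift in mindset: instead of asserting specific outputs for specific inputs, you assert invariants that must hold for every input, and let generated data hunt for the counterexample.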
Prevention – Testers
A study I’ve cited before on this forum, from the University of Toronto, by Ding Yuan and several colleagues, Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems, suggests there is good value in making sure each small component that will become part of a distributed system is very thoroughly tested in isolation.
The good news is we have tools and techniques that enable this. The bad news is relatively few development organizations use them.
For those who specialize in testing software, there may be another bit of bad news (or good, depending on how you respond to it). The complexity of software has reached the point that purely manual methods can’t provide adequate confidence in systems. There are just too many variables. Even relatively routine functional checking of well-understood system behaviors requires so many test cases that it’s not reasonable to cover everything through manual test scripts.
As the general complexity of software solutions continues to increase, testers have to devote more and more time to exploration and discovery of unknown behaviors. They just don’t have time to perform routine validation manually.
That had become true before the advent of cloud computing. Now, and in the future world of high connectedness, that reality is all the more unavoidable.
That means testers no longer have a choice: They must acquire automation skills. I understand there is some angst about that, and I suppose you already know my thoughts on the subject. There’s no sense in sugar-coating it; you’ve got to learn something new, whether it’s automation skills or a different occupation altogether.
Demand for people who know how to change horseshoes will never again be what it was in the 19th century. Similarly, 21st-century software can’t be supported properly with 20th-century testing methods.
When “test automation” started to become a Thing, it was mainly about writing automated functional checks. Since then, development teams have put automation to work to drive functional requirements (using practices called Behavior-Driven Development, Specification by Example, or Acceptance Test Driven Development) and to support “real” testing activities.
My prediction is that the proportion of “test automation” we do for functional checking will decline, while the proportion we do in support of “manual” testing will increase, because of the rising complexity and dynamic operation of software solutions. The growing area of artificial intelligence also has implications for software testing, both for testing AI solutions and for using AI tools to support testing.
The level of technical expertise required to use this kind of automation far exceeds that necessary for automating conventional functional checks. People who specialize in testing software will need technical skills more-or-less on par with competent software engineers. There’s just no getting around it.
Some things testers can do to minimize risk in a dynamic cloud and IoT world:
- Understand that rigorous and thorough testing of individual software components has been shown to correlate strongly with high software quality. Don’t assume you “can’t” find problems at the unit level that might manifest when components are integrated. Push as much testing and checking as low in the stack as you can, as a way to minimize the cost and overhead of tests higher in the stack.
- Automate any sort of predictable, routine checking at all levels. Your time is too valuable to waste performing this sort of validation manually.
- Engage the rest of your team in exploratory testing sessions on a regular basis. Teach them how to do this effectively. Share your knowledge of software testing fundamentals.
- Learn automation skills and keep up with developments in technology. When you learn something new, consider not only how to test the new technology, but also how the new technology might be used to support testing of other software.
The preventive measures suggested above are things we can do before our solutions are deployed. But such measures will not guarantee a massively-distributed, highly dynamic solution always behaves according to its design intent. We need to “bake in” certain design characteristics to reduce the chances of misbehavior in production.
We want to be as clear as possible about defining the behaviors of APIs at all levels of abstraction, from method calls to RESTful HTTP calls. Our unit test suite gives us a degree of protection at build time, and using Design By Contract gives us a degree of protection at run time. Nothing, however, gives us a guarantee of correctness.
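A lightweight sketch of the Design By Contract idea mentioned above, in Python. Real DbC support (as in Eiffel) is richer, but preconditions and postconditions can be approximated with plain run-time assertions; the `transfer` function and its contract here are hypothetical:

```python
def transfer(balance, amount):
    """Debit `amount` from `balance`, enforcing a simple contract at run time."""
    # Preconditions: the caller's side of the contract.
    assert amount > 0, "precondition: amount must be positive"
    assert amount <= balance, "precondition: insufficient funds"

    new_balance = balance - amount

    # Postcondition: the supplier's side of the contract.
    assert new_balance >= 0, "postcondition: balance may not go negative"
    return new_balance

print(transfer(100, 30))  # a call that honors the contract prints 70
```

A call that violates a precondition fails immediately at the boundary, close to the defect, rather than corrupting state and surfacing as a mystery hours later.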
Contracts vs. Promises
In a 2002 piece entitled The Law of Leaky Abstractions, Joel Spolsky observed that abstractions over any implementation can leak details of that implementation. He gives the example of TCP, a reliable protocol built on top of IP, an unreliable protocol. There are times when the unreliability of IP “leaks” through to the TCP layer, unavoidably. Most, if not all, abstractions have this characteristic.
The observation applies to a wide range of abstractions, including APIs. An API can be thought of as an abstraction over an implementation. By intent, we expose aspects of the code’s behavior that we want clients to use. With Design By Contract, we can enforce those details at run time.
But that level of enforcement does not prevent clients from creating dependencies on the implementation hiding behind the API. Titus Winters observed, “With a sufficient number of users of an API, it does not matter what you promised in the contract, all observable behaviors of your interface will be depended on by somebody.” He called this Hyrum’s Law, after fellow programmer Hyrum Wright.
If this is already true in the “ordinary” world, just imagine how much more true it will be once AI-equipped IoT devices and other intelligent software are the norm. An AI could hammer on an API in millions of different ways in a few seconds, with no preconceptions about which interactions “make sense.” It would discover ways to interact with the API that the designers never imagined, creating dependencies on side-effects of the underlying implementation.
One of the most basic ways to cope with the complexity of massively-distributed services is the idea of Consumer-Driven Contracts. Based on the idea of consumers that request services of providers, the model allows for contracts to be either Provider-Driven or Consumer-Driven.
Clear definitions of expectations on both sides might sound like a good way to ensure distributed systems behave according to design, but there are limitations. Even the original write-up about this approach, cited above, recognizes this:
No matter how lightweight the mechanisms for communicating and representing expectations and obligations, providers and consumers must know about, accept and adopt an agreed upon set of channels and conventions. This inevitably adds a layer of complexity and protocol dependence to an already complex service infrastructure.
Producer Contracts, Consumer Contracts, Design by Contract…these depend on some defined manner of interaction actually occurring at run time. Just as with contracts between humans, the contract itself does not guarantee performance. Contracts are more like “promises” than guarantees. If the client adheres to the contract, the service promises to deliver a result.
Software design patterns based on the idea of “promises” have emerged. These can help us build slightly more reliable distributed solutions than the idea of “contracts” as such when requests and responses occur asynchronously, and when the components involved are not part of a closed system but are dynamically discovered at run time. The idea is summarized nicely in an answer to a question on Quora provided by Evan Priestley in 2012:
A promise (or “future”) is an object which represents the output of some computation which hasn’t necessarily happened yet. When you try to use the value in a promise, the program waits for the computation to complete.
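Python’s standard library expresses exactly this idea: `concurrent.futures` returns a `Future` object immediately, and calling its `result()` method blocks until the computation completes. A minimal sketch:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(n):
    time.sleep(0.1)   # stand-in for a remote call or long computation
    return n * n

with ThreadPoolExecutor() as pool:
    promise = pool.submit(slow_square, 7)  # returns a Future immediately
    # ... other work can proceed here while slow_square runs ...
    print(promise.result())  # blocks until the value is available; prints 49
```

The caller holds a placeholder for a value that may not exist yet, which is precisely what makes the pattern a good fit for asynchronous, dynamically-discovered services.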
It would be challenging enough to test for this sort of behavior with conventional solutions; consider what it could be like in a world of autonomous, mobile, AI-driven IoT devices and AI-enabled client applications. Imagine a service fabric with a registry of services that perform particular types of functions. Assume the services follow the rule of thumb that they should be stateless.
A human-defined algorithm in a piece of client code might use the registry to discover an available service to perform operations on, say, an invoice. Having discovered the service, the client would call APIs to add line items to an invoice, apply sales tax to the amounts, and apply customer loyalty discounts to the total price.
An AI-based client would very likely operate in a different way. It could explore the behavior and reliability of all the available services pertaining to invoices. It might determine that Service A is highly dependable for the function of adding line items to an invoice; Service B is dependable for calculating sales tax; and Service C is dependable for applying customer loyalty rules. It decides to invoke specific APIs exposed by different services for each function, based on empirical data regarding “performance to promise” in the near-term past. Which services are most likely to fulfill their promises? Maybe not the same ones today as yesterday, or in the last sliding 10-minute window.
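One way to picture that selection logic: track each candidate service’s recent “performance to promise” in a sliding window and route each call to the current best. This is a hypothetical sketch; the service names and outcome data are invented for illustration:

```python
from collections import deque

class ServicePicker:
    """Choose among interchangeable services by recent success rate."""

    def __init__(self, names, window=100):
        # One sliding window of recent outcomes (True/False) per service.
        self.history = {name: deque(maxlen=window) for name in names}

    def record(self, name, succeeded):
        self.history[name].append(succeeded)

    def best(self):
        # Prefer the highest success rate within the recent window;
        # untried services get the benefit of the doubt.
        def rate(name):
            h = self.history[name]
            return sum(h) / len(h) if h else 1.0
        return max(self.history, key=rate)

picker = ServicePicker(["invoice-a", "invoice-b"])
for ok in [True, True, False]:
    picker.record("invoice-a", ok)
for ok in [True, True, True]:
    picker.record("invoice-b", ok)
print(picker.best())  # → invoice-b, the more dependable service lately
```

Because the window slides, yesterday’s most dependable service can lose the job today, which is exactly the behavior described above.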
Is there a practical way to test all the edge cases and failure scenarios in this type of solution prior to deployment? If so, I can’t think of it off the top of my head. It’s inevitable that there will be fairly complicated behavior happening in production that none of us could predict or test for in advance.
Stateless vs. Idempotent
When the server side of a client-server solution maintains session state, it can lead to problems. The client becomes locked to the server for the duration of a logical session. In an inherently unreliable network environment, this can lead to a transaction failure every time the slightest problem occurs. There can also be performance problems that result in erratic perceived response times from the perspective of an end user (human or otherwise).
To get around this, we’ve long used a simple workaround: We pass the logical session state back and forth on each request and response.
In the age of the mainframe, it was a standard practice to design CICS applications to be pseudo-conversational. Rather than a conversational design that maintained the connection between a terminal and a CICS task, we passed enough data back and forth so that the back-end resources could be released in between requests.
We did exactly the same thing in the age of the webapp, and for the same basic reason. We passed state information back and forth between the browser and the HTTP server, either through URL rewriting or cookies.
One result, in both those cases, was that the back end remained ignorant of logical session state. Much goodness ensued.
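The pattern from both eras, reduced to its essence: the server keeps nothing between calls, because the session state travels with every request and response. The handler and state format here are hypothetical illustrations:

```python
def handle_request(state, action):
    """A stateless handler: all session context arrives with the request,
    and the updated context is returned with the response."""
    state = dict(state)  # never mutate the caller's copy
    if action == "add_item":
        state["items"] = state.get("items", 0) + 1
    return state

# The client round-trips the state; the server holds no session memory,
# so any server instance can handle any request in the conversation.
session = {}
session = handle_request(session, "add_item")
session = handle_request(session, "add_item")
print(session)  # → {'items': 2}
```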
In a highly dynamic and massively-distributed cloud/fog/mist environment, it’s possible that statelessness isn’t sufficient anymore. Add the activity of AI entities exploring the environment to teach themselves how to use available resources, as well as intentional denial-of-service attacks, and many more potential failure points become evident.
The idea of idempotence is often suggested as a remedy, or at least a step in the right direction beyond statelessness. An idempotent operation can be repeated an arbitrary number of times with the same result as performing it once; retries cause no additional effect.
A familiar example might be a server provisioning script. If the script dies after installing, say, two packages out of five, then when the script is executed a second time it should not try to install the first two packages again. It’s easy to write idempotent scripts in common shell languages like Bash or PowerShell, and many cloud-based provisioning utilities are designed to apply changes idempotently.
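The check-before-act pattern such scripts rely on can be sketched in a few lines (shown in Python rather than shell for readability; the package names and installed-set are hypothetical stand-ins for real system state):

```python
installed = {"pkg-one", "pkg-two"}   # state left behind by the first, failed run

def ensure_installed(package):
    """Idempotent step: installing an already-present package is a no-op."""
    if package in installed:
        print(f"{package}: already installed, skipping")
        return
    installed.add(package)           # stand-in for the real install command
    print(f"{package}: installed")

# Re-running the whole script repeats every step, but each effect happens once.
for pkg in ["pkg-one", "pkg-two", "pkg-three", "pkg-four", "pkg-five"]:
    ensure_installed(pkg)
```

Each step checks the world before acting on it, so the script converges to the desired state no matter how many times, or from what point of failure, it is re-run.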
Services can be written with the same idea in mind. But how can a service distinguish a legitimate transaction that looks like a duplicate from an error or attack?
Some scenarios can be handled gracefully by infrastructure components. For example, if a run-time condition causes an Apache ActiveMQ broker to fail over to a second broker in the middle of a transaction, the software can filter out the duplicate request. Other tools offer similar features. But this sort of functionality doesn’t deal with many other potential duplication scenarios.
Imagine a service that credits one account and debits another. If it receives a second request from the same client to credit/debit the same accounts with the same amount, how can it know whether it’s a legitimate transaction or some sort of hiccup or a fraudulent transaction? There’s been no infrastructure failure, so it isn’t a failover situation; it’s just another request.
Back in the age of the mainframe, we used to pass correlation tags between clients and servers. We used the same mechanism in the age of the webapp, as well. Something like it can work well in a cloud environment, too.
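A minimal sketch of that correlation-tag idea: the client attaches a unique tag to each logical transaction, and the service treats a repeated tag as a retry to be replayed, not a new request to be executed. The tag scheme and in-memory storage are hypothetical; a real service would persist seen tags durably, with an expiry:

```python
executions = 0   # counts how many times the real credit/debit work ran
seen = {}        # correlation tag -> result of the first execution

def credit_debit(tag, from_acct, to_acct, amount):
    """Execute a transfer at most once per correlation tag."""
    global executions
    if tag in seen:
        return seen[tag]             # duplicate tag: replay the original result
    executions += 1                  # the real credit/debit would happen here
    seen[tag] = f"ok:{tag}"
    return seen[tag]

credit_debit("txn-001", "A", "B", 50)
credit_debit("txn-001", "A", "B", 50)   # network hiccup, client resent the request
credit_debit("txn-002", "A", "B", 50)   # same accounts and amount, but new intent
print(executions)  # → 2: the retry was filtered, the genuinely new request was not
```

The tag carries the client’s intent, which is exactly the information the service in the earlier example was missing: two requests with identical payloads are distinguishable by design.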
Tip of the Iceberg
These considerations are only a few of the most obvious ones. There’s much more to consider. Resurrecting the old mindset of defensive programming can help us remember to think about what could happen in a production environment, even as we’re preoccupied with getting the basic functionality of our application working.
If we plan to get a good night’s sleep in the years to come, we’d better come to grips with these issues, and program defensively against them!