Unit Test Shell Scripts:
Part One

Unit Testing Shell Scripts

In the 1960s, it was considered a baseline good practice in software engineering to test your code as you wrote it. The pioneers of software development in that era were proponents of various levels of testing; some advocated “unit” testing and some didn’t, but all recognized the importance of testing code.

Executable tests may have first been introduced by Margaret Hamilton on the Apollo project in the mid-1960s, where she originated a type of executable checking that we now call “static code analysis.” She called it “higher-order software,” by which she meant software that operates against other software rather than directly against the problem domain. Her higher-order software examined source code to look for patterns that were known to lead to integration issues.

By 1970, people had largely forgotten about executable testing. Sure, people would run applications and poke at them here and there by hand, but as long as the building didn’t burn down around them, they figured the code was “good enough.” The result has been over 35 years of code in production worldwide that is inadequately tested, and in many cases does not work entirely as intended, or in a way that satisfies its customers.

The idea of programmers testing as they go made a comeback starting in the mid-1990s, although up to the present time the vast majority of programmers still don’t do it. Infrastructure engineers and system administrators test their scripts even less diligently than programmers test their application code.

As we move into an era where rapid deployment of complicated solutions comprising numerous autonomous components is becoming the norm, and “cloud” infrastructures require us to manage thousands of come-and-go VMs and containers at a scale that can’t be managed using manual methods, the importance of executable, automated testing and checking throughout the development and delivery process can’t be ignored; not only for application programmers, but for everyone involved in IT work.

With the advent of devops (cross-pollinating development and operations skills, methods, and tools), and trends like “infrastructure as code” and “automate all the things,” unit testing shell scripts has become a baseline skill for programmers, testers, system administrators, and infrastructure engineers alike.

In this series of posts, we’ll introduce the idea of unit testing shell scripts, and then we’ll explore several unit test frameworks that can help make that task practical and sustainable at scale.

Another practice that may be unfamiliar to many infrastructure engineers is version control. Later in this series, we’ll touch on version control systems and workflows that application developers use, and that can be effective and useful for infrastructure engineers as well.

A Shell Script to Test

Vivek Gite published a sample shell script to monitor disk usage and to generate an email notification when certain filesystems exceed a threshold. His article is here: Let’s use that as a test subject.

The initial version of his script, with the addition of the -P option on the df command to prevent line breaks in the output, as suggested in a comment from Per Lindahl, looks like this:

df -HP | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | while read output;
do
  usep=$(echo $output | awk '{ print $1}' | cut -d'%' -f1  )
  partition=$(echo $output | awk '{ print $2 }' )
  if [ $usep -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)" |
     mail -s "Alert: Almost out of disk space $usep%"
  fi
done

Vivek goes on to refine the script beyond that point, but this version will serve the purposes of the present post.

Automated Functional Checks

A couple of rules of thumb about automated functional checks, whether we’re checking application code or a script or any other sort of software:

  • the check has to run identically every time, with no manual tweaking required to prepare for each run; and
  • the result can’t be vulnerable to changes in the execution environment, or data, or other factors external to the code under test.

Pass, Fail, and Error

You might point out it’s possible the script will not run at all. That’s normal for any sort of unit test framework for any sort of application. Three outcomes, rather than two, are possible:

  • The code under test exhibits the expected behavior
  • The code under test runs, but doesn’t exhibit the expected behavior
  • The code under test does not run

For practical purposes, the third outcome is the same as the second; we’ll have to figure out what went wrong and fix it. So, we generally think of these things as binary: Pass or fail.
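
The three outcomes can be told apart by exit status. The following is a minimal sketch, assuming the check is a command whose exit status is 0 on success; the helper name classify and the missing script path are made up for illustration. Exit statuses 126 and 127 are what the shell reports when a command cannot be executed or cannot be found.

```shell
# classify runs a command and reports one of three outcomes.
# Hypothetical helper, not part of the article's test script.
classify() {
  "$@" >/dev/null 2>&1
  status=$?
  case $status in
    0)       echo PASS ;;    # ran and behaved as expected
    126|127) echo ERROR ;;   # could not be executed / was not found
    *)       echo FAIL ;;    # ran, but reported a failure
  esac
}

classify true
classify false
classify ./no-such-script.sh   # hypothetical path that does not exist
```

A test runner that treats ERROR the same as FAIL only needs to test for a non-zero status.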

What Should We Check?

In this case, we are interested in verifying that the script will behave as expected given various input values. We don’t want to pollute our unit checks with any other verification beyond that.

Reviewing the code under test, we see that when disk usage hits a threshold of 90%, the script calls mail to send a notification to the system administrator.

In keeping with generally-accepted good practice for unit checks, we want to define separate cases to verify each behavior we expect for each set of initial conditions.

Putting on our “tester” hat, we see that this is a boundary condition sort of thing. We don’t need to check numerous different percentages of disk usage individually. We only need to check behavior at the boundaries. Therefore, the minimum set of cases to provide meaningful coverage will be:

  • It sends an email when disk usage reaches the threshold
  • It does not send an email when disk usage is below the threshold

What Should We Not Check?

In keeping with generally-accepted good practice for unit test isolation, we want to ensure each of our cases can fail for exactly one reason: The expected behavior doesn’t happen. To the extent practical, we want to set up our checks so that other factors will not cause the case to fail.

It may not always be cost-effective (or even possible) to guarantee that external factors won’t affect our automated checks. There are times when we can’t control an external element, or when doing so would involve more time, effort, and cost than the value of the check, and/or involves an obscure edge case that has a very low probability of occurring or very little impact when it does occur. It’s a matter for your professional judgment. As a general rule, do your best to avoid creating dependencies on factors beyond the scope of the code under test.

We don’t need to verify that the df, grep, awk, cut, and mail commands work. That is out of scope for our purposes. Whoever maintains the utilities is responsible for that.

We do want to know if the output from the df command isn’t processed the way we expect by grep or awk. Therefore, we want the real grep and awk commands to run in our checks, based on output from the df command that matches the intent of each test case. That’s in scope because the command-line arguments to df are part of the script, and the script is the code under test.

That means we’ll need a fake version of the df command to use with our unit checks. That sort of fake component is often called a mock. A mock stands in for a real component and provides predefined output to drive system behavior in a controlled way, so we can check the behavior of the code under test reliably.

We see the script sends an email notification when a filesystem reaches the threshold usage level. We don’t want our unit checks to spew out a bunch of useless emails, so we’ll want to mock the mail command as well.

This script is a good example to illustrate mocking these commands, as we’ll do it in a different way for mail than for df.

Mocking the df Command

The script is built around the df command. The relevant line in the script is:

df -HP | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }'

If you run just df -HP, without piping into grep, you’d see output similar to this:

Filesystem      Size  Used Avail Use% Mounted on
udev            492M     0  492M   0% /dev
tmpfs           103M  6.0M   97M   6% /run
/dev/sda1        20G  9.9G  9.2G  52% /
tmpfs           511M   44M  468M   9% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           511M     0  511M   0% /sys/fs/cgroup
tmpfs           103M  8.2k  103M   1% /run/user/1000

The grep and awk commands strip the output down to this:

0% udev
52% /dev/sda1
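
The filtering can be reproduced without touching the real df by feeding the pipeline a saved copy of the output. This is a sketch only; the function name sample_df_output is made up here, and the sample lines are taken from the listing above.

```shell
# Saved df -HP output, copied from the listing above.
sample_df_output() {
  cat <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
udev            492M     0  492M   0% /dev
tmpfs           103M  6.0M   97M   6% /run
/dev/sda1        20G  9.9G  9.2G  52% /
tmpfs           511M   44M  468M   9% /dev/shm
EOF
}

# Same filters as the script: drop the header, tmpfs, and cdrom lines,
# then print the Use% column followed by the filesystem name.
sample_df_output | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }'
```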

We need to control the output from df to drive our test cases. We don’t want the result of the check to vary based on the actual disk usage on the system where we’re executing the test suite. We’re not checking the disk usage; we’re checking the logic of the script. When the script runs in production, it will check disk usage. What we’re doing here is for validation, not production operations. Therefore, we need a fake or “mock” df command with which we can generate the “test data” for each case.

On a *nix platform it’s possible to override the real df command by defining an alias. We want the aliased command to emit test values in the same format as output from df -HP. Here’s one way to do it (this is all one line; it’s broken up below for readability):

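alias df="echo -e 'Filesystem Size Used Avail Use% Mounted on';
    echo -e 'tmpfs 511M 31M 481M 6% /dev/shm';
    echo -e '/dev/sda1 20G 9.9G 9.2G 52% /'"

An alias is plain text substitution, so it takes no arguments: the ‘-HP’ that follows df in the script is simply appended to the last echo in the alias, where it lands after the fields awk reads and does no harm. The aliased df command emits output in the same form as df -HP. Note that only the last command in the alias feeds the pipeline; the lines echoed before it go straight to standard output.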

The test values are piped into grep and then awk when the script executes, so we’re mocking only the minimum necessary to control our test cases. We want our test case to be as close as possible to the “real thing” so that we won’t get false positives.

Mocks created via a mocking library can return a predefined value when called. Our approach to mocking the df command mirrors that function of a mock; we’re specifying predefined output to be returned whenever the code under test calls df.
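
As an alternative to the alias technique this post uses, a shell function can play the same role. The sketch below is not the approach taken in this post; it assumes the code under test runs in the same shell as the function definition. A function receives the -HP argument and can ignore it, and all of its output goes into the pipeline.

```shell
# Mock: a function named df that ignores its arguments (such as -HP)
# and prints canned output in the same column layout as the real command.
df() {
  cat <<'EOF'
Filesystem Size Used Avail Use% Mounted on
tmpfs 511M 31M 481M 6% /dev/shm
/dev/sda1 20G 9.9G 9.2G 52% /
EOF
}

# The pipeline from the script now reads the canned data.
df -HP | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }'
```

Running unset -f df afterwards removes the mock and restores the real command.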

Mocking the mail Command

We want to know if the script tries to send an email under the right conditions, but we don’t want it to send a real email anywhere. Therefore, we want to alias the mail command, as we did the df command earlier. We need to set up something we can check after each test case. One possibility is to write a value to a file when mail is called, and then check the value in our test case. This is shown in the example below. Other methods are also possible.

Mocks created via a mocking library can count the number of times they are called by the code under test, and we can assert the expected number of invocations. Our approach to mocking the mail command mirrors that function of a mock; if the text “mail” is present in the file mailsent after we run the script, it means the script did call the mail command.
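
Counting invocations can be sketched in the same spirit. The version below is an illustration under stated assumptions, not the mock used in this post: it defines mail as a shell function, and the log file and recipient address are placeholders. It records calls in a file rather than a variable because the script calls mail from inside a pipeline, and a variable changed there may not be visible afterwards.

```shell
# Log file that records one line per call (temporary, made up for this sketch).
mail_log=$(mktemp)

# Mock: swallow the message body from stdin and record the arguments.
mail() {
  cat > /dev/null
  echo "mail $*" >> "$mail_log"
}

# Simulate two notifications; the address is a placeholder.
echo "body" | mail -s "Alert: first" admin@example.com
echo "body" | mail -s "Alert: second" admin@example.com

# Assert on the number of recorded invocations.
calls=$(grep -c '^mail ' "$mail_log")
echo "mail was called $calls time(s)"
rm -f "$mail_log"
```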

Pattern for Running Automated Checks

Automated or executable checks at any level of abstraction, for any sort of application or script, in any language, typically comprise three steps. These usually go by the names:

  • Arrange
  • Act
  • Assert

The reason for this is probably that everyone loves alliteration, especially on the letter A, as the word “alliteration” itself begins with the letter A.

Whatever the reason, in the arrange step we set up the preconditions for our test case. In the act step we invoke the code under test. In the assert step we declare the result we expect to see.

When we use a test framework or library, the tool handles the assert step nicely for us so that we need not code a lot of cumbersome if/else logic in our test suites. For our initial example here, we aren’t using a test framework or library, so we check the results of each case with an if/else block. In the next installment, we’ll play with unit test frameworks for shell languages, and see how that looks.
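
To give a feel for what a framework takes off our hands, here is a rough sketch of a hand-rolled assertion helper wrapped around the three steps. The name assert_equals and its argument order are assumptions made for this illustration; real frameworks define their own.

```shell
passed=0
failed=0

# assert_equals EXPECTED ACTUAL DESCRIPTION (hypothetical helper)
assert_equals() {
  if [ "$1" = "$2" ]; then
    passed=$((passed + 1))
    echo "PASS: $3"
  else
    failed=$((failed + 1))
    echo "FAIL: $3 (expected '$1', got '$2')"
  fi
}

# Arrange: a usage value as it appears in the Use% column.
input='90%'
# Act: the same cut command the script uses to strip the percent sign.
actual=$(echo "$input" | cut -d'%' -f1)
# Assert: one line instead of an if/else block.
assert_equals 90 "$actual" "strips the percent sign"
```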

Here’s our crude-but-effective test script for testing Vivek’s shell script, which we’ve named


#!/bin/bash

shopt -s expand_aliases

# Before all
alias mail="echo 'mail' > mailsent;false"
echo 'Test results for' > test_results
tcnt=0

# It does nothing when disk usage is below 90%

# Before (arrange)
alias df="echo 'Filesystem Size Used Avail Use% Mounted on';echo '/dev/sda2 100G 89.0G 11.0G 89% /'"
echo 'no mail' > mailsent

# Run code under test (act)
. ./

# Check result (assert)
tcnt=$((tcnt + 1))
if [[ $(< mailsent) == 'mail' ]]; then
  echo "$tcnt. FAIL: Expected no mail to be sent for disk usage under 90%" >> test_results
else
  echo "$tcnt. PASS: No action taken for disk usage under 90%" >> test_results
fi

# It sends an email notification when disk usage is at 90%

alias df="echo 'Filesystem Size Used Avail Use% Mounted on';echo '/dev/sda1 100G 90.0G 10.0G 90% /'"
echo 'no mail' > mailsent

. ./

tcnt=$((tcnt + 1))
if [[ $(< mailsent) == 'mail' ]]; then
  echo "$tcnt. PASS: Notification was sent for disk usage of 90%" >> test_results
else
  echo "$tcnt. FAIL: Disk usage was 90% but no notification was sent" >> test_results
fi

# After all
unalias df
unalias mail

# Display test results
cat test_results

Here’s a walkthrough of the test script.

First, you see we’re using bash to test a plain old .sh file. That’s perfectly fine. It isn’t necessary, but it’s fine.

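Next, you see a shopt command. It tells bash to expand aliases in this non-interactive shell; by default, bash expands aliases only in interactive shells. Because the script under test is sourced into the same shell, our test aliases apply to its commands too. In most use cases, we wouldn’t rely on aliases inside scripts, but unit testing shell scripts is an exception.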
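
The effect of the shopt line can be seen in isolation. The sketch below assumes bash is installed and on the PATH; the alias name greet and the temporary files are made up for the demonstration. It runs the same two-line script twice, once as-is and once with the shopt line prepended.

```shell
# A tiny script that defines an alias and then uses it.
demo=$(mktemp)
cat > "$demo" <<'EOF'
alias greet='echo hello'
greet
EOF

# Non-interactive bash leaves alias expansion off by default, so "greet"
# is looked up as a command and is not found (stdout stays empty).
without=$(bash "$demo" 2>/dev/null)

# The same script with the shopt line in front of it.
demo_on=$(mktemp)
{ echo 'shopt -s expand_aliases'; cat "$demo"; } > "$demo_on"
with=$(bash "$demo_on")

printf 'without shopt: [%s]\n' "$without"
printf 'with shopt:    [%s]\n' "$with"
rm -f "$demo" "$demo_on"
```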

The comment, “Before all,” is for people who are familiar with unit test frameworks that have set up and tear down commands. These are often named something like “before” and “after,” and there’s usually one pair that brackets the entire test suite and another pair that is executed individually for each test case.

We wanted to show that defining the alias for mail, initializing the test results file, and initializing the test case counter are all done exactly one time, at the beginning of the test suite. This sort of thing is normal in executable test suites. The fact we’re testing a shell script instead of an application program doesn’t change that.

The next comment, “It does nothing…” indicates the start of our first individual test case. Most unit test frameworks offer a way to provide a name for each case, so we can keep track of what’s going on and so that other tools can search, filter, and extract test cases for various reasons.

Next, there’s a comment that reads, “Before (arrange)”. This one represents set up that applies just to the one test case. We’re setting the df alias to emit the output we need for this particular case. We’re also writing the text, “no mail”, to a file. That’s how we will be able to tell whether the script attempted to send a notification email.

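The act step comes next, where we exercise the code under test. In this case, that means running the script itself. We source it instead of executing it directly, because aliases are not inherited by a child process; sourcing runs the script in the current shell, where our mocks are defined.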

Now we do the assert step, which we’re doing the hard way in this example because we haven’t introduced a test framework yet. We increment the test counter so that we can number the test cases in the results file. Otherwise, if we had a large number of cases it could become difficult to figure out which ones had failed. Test frameworks handle this for us.

The alias we defined for the mail command writes the text ‘mail’ to the mailsent file. If the script calls mail, then the mailsent file will contain ‘mail’ instead of the initial value, ‘no mail’. You can see what the pass and fail conditions are by reading the strings echoed into the test results file.

Starting with the comment, “It sends an email notification…” we repeat the arrange, act, assert steps for another test case. We’ll have our fake df command emit different data this time, to drive different behavior from the code under test.

Where the “After all” comment appears, we’re cleaning up after ourselves by eliminating the definitions we created in the “Before all” setup near the top of the test script.

Finally, we dump out the contents of the test_results file so we can see what we got. It looks like this:

Test results for
1. PASS: No action taken for disk usage under 90%
2. PASS: Notification was sent for disk usage of 90%
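
If the suite is to be run by another tool rather than read by a person, it helps to reduce the results file to a single pass/fail status. Here is one possible sketch, assuming result lines keep the "N. PASS:" and "N. FAIL:" layout shown above; the function name suite_passed and the temporary file are made up.

```shell
# A results file in the same layout as the output above.
results=$(mktemp)
cat > "$results" <<'EOF'
Test results for
1. PASS: No action taken for disk usage under 90%
2. PASS: Notification was sent for disk usage of 90%
EOF

# Succeeds only when no result line is marked FAIL.
suite_passed() {
  ! grep -q '^[0-9]*\. FAIL' "$1"
}

if suite_passed "$results"; then
  echo "suite passed"
else
  echo "suite failed"
fi
```

A test script could end with a call like this so that its exit status reflects the outcome.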

Why Use a Test Framework/Library?

We just wrote a couple of unit test cases for a shell script without using a test framework, mocking library, or assertion library. We found that system commands can be mocked by defining aliases (at least on *nix systems), that assertions can be implemented as conditional statements, and that the basic structure of a unit test is easy to set up by hand.

It wasn’t difficult to do this without a framework or library. So, what’s the benefit?

Test frameworks and libraries simplify and standardize the test code and enable much more readable test suites than hand-crafted scripts containing a lot of conditional statements. Some libraries contain useful additional features, such as the ability to trap exceptions or the ability to write table-driven and data-driven test cases. Some are tailored to support specific products of interest to infrastructure engineers, such as Chef and Puppet. And some include functionality to track code coverage and/or to format test results in a form consumable by tooling in the CI/CD pipeline, or at least a Web browser.

Unit Test Frameworks for Shell Scripts

In this series we’ll be exploring several unit test frameworks for shell scripts and scripting languages. Here’s an overview:

  • shunit2 is a very solid Open Source project with a ten-year history. Originally developed by Kate Ward, a Site Reliability Engineer and Manager at Google based in Zürich, it’s actively developed and supported by a team of six people. From humble beginnings as a point solution to test a logging library for shell scripts, it has been intentionally developed into a general-purpose unit test framework that supports multiple shell languages and operating systems. It includes a number of useful features beyond simple assertions, including support for data-driven and table-driven tests. It uses the traditional “assertThat” style of assertions. The project site contains excellent documentation. For general-purpose unit testing of shell scripts, this is my top recommendation.
  • BATS (Bash Automated Testing System) is a unit test framework for bash. It was created by Sam Stephenson about seven years ago, and has had a dozen or so contributors. The last update was four years ago, but this is nothing to worry about, as this sort of tool doesn’t require frequent updates or maintenance. BATS is based on the Test Anything Protocol (TAP), which defines a consistent text-based interface between modules in any sort of test harness. It allows for clean, consistent syntax in test cases, although it doesn’t seem to add much syntactic sugar beyond straight bash statements. For instance, there is no special syntax for assertions; you write bash [ ] commands to test results. With that in mind, its main value may lie in organizing test suites and cases in a logical way. Note, as well, that writing test scripts in bash doesn’t prevent us testing non-bash scripts; we did that earlier in this post. The fact BATS syntax is so close to plain bash syntax gives us a lot of flexibility to handle different shell languages in our test suites, at the possible cost of readability (depending on what you find “readable;” the intended audience for this post probably finds plain shell language syntax pretty readable). One particularly interesting feature (in my opinion) is that you can set up your text editor with syntax highlighting for BATS, as documented on the project wiki. Emacs, Sublime Text 2, TextMate, Vim, and Atom were supported as of the date of this post.
  • zunit (not the IBM one, the other one) is a unit test framework for zsh developed by James Dinsdale. The project site states zunit was inspired by BATS, and it includes the highly useful variables $state, $output, and $lines. But it also has a definitive assertion syntax that follows the pattern, “assert actual matches expected”. Each of these frameworks has some unique features. An interesting feature of zunit, in my opinion, is that it will flag any test cases that don’t contain an assertion as “risky.” You can override this and force the cases to run, but by default the framework helps you remember to include an assertion in each test case.
  • bash-spec is a behavioral-style test framework that supports bash only (or at least, it’s only been tested against bash scripts). It’s a humble side project of mine that has been around over four years and has a few “real” users. It isn’t updated much, as it currently does what it was intended to do. One objective of the project was to make use of bash functions in a “fluid” style. Functions are called in sequence, each passing the entire argument list to the next after consuming however many arguments it needs to perform its task. The result is a readable test suite, with statements such as “expect package-name to_be_installed” and “expect arrayname not to_contain value”. When used to guide test-first development of scripts, its design tends to lead the developer to write functions that support the idea of “modularity” or “single responsibility” or “separation of concerns” (call it what you will), resulting in ease of maintenance and readily-reusable functions. “Behavioral style” means that assertions take the form, “expect this to match that.”
  • korn-spec is a port of bash-spec for the korn shell.
  • Pester is the unit test framework of choice for PowerShell. PowerShell looks and feels more like an application programming language than purely a scripting language, and Pester offers a fully consistent developer experience. Pester ships with Windows 10 and can be installed on any other system that supports PowerShell. It has a robust assertion library, built-in support for mocking, and collects code coverage metrics.
  • ChefSpec builds on rspec to provide a behavioral-style test framework for Chef recipes. Chef is a Ruby application, and ChefSpec takes full advantage of rspec capabilities plus built-in support for Chef-specific functionality.
  • rspec-puppet is a behavioral-style framework for Puppet, functionally similar to ChefSpec.
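
The BATS entry above mentions the Test Anything Protocol. As a rough illustration of what TAP-style output looks like, here is a sketch that prints a plan line followed by one "ok" or "not ok" line per test. The helper tap_report and its "description|status" argument format are invented for this example and are not part of BATS.

```shell
# tap_report prints TAP lines for "description|status" pairs, where a
# status of 0 means the check passed. Hypothetical helper and format.
tap_report() {
  echo "1..$#"                 # the plan: how many tests to expect
  n=0
  for result in "$@"; do
    n=$((n + 1))
    desc=${result%|*}          # text before the last "|"
    status=${result##*|}       # text after the last "|"
    if [ "$status" -eq 0 ]; then
      echo "ok $n - $desc"
    else
      echo "not ok $n - $desc"
    fi
  done
}

tap_report "no mail below threshold|0" "mail at threshold|1"
```

Because the format is plain text, any tool that understands TAP can consume results like these regardless of which framework produced them.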


What’s Next?

In the next installment, we’ll take a closer look at shunit2, BATS, and zunit. We’ll try them out to test scripts that perform common system administrator tasks and server provisioning/configuration tasks on an Ubuntu Linux instance.

Comments (4)

  1. Damian Rivas

    There is actually a fork of Bats that is maintained to this day called Bats-core:

    I linked to the “Background” section of their README which explains why the fork was created. It’s a good place to start, but is pushed to the end of the README unfortunately.

    I’m currently using Bats-core in my own projects and love it!

  2. Milan

    Great article. Very well put together.

    I think unit testing source code is getting more and more traction but we are far from where we should be as an industry. I think unit testing scripts is even less common so article like this one definitely help in this regard. Thank you.

