
Four Classic Problems with Scripted Testing

Date: Jun 2, 2006


Mike Kelly examines a recent testing experience that should have worked: plenty of scripted test cases, plenty of time developing and testing the scripts. So what went wrong? Plenty.

On a recent project, we were instructed by the test manager to create hundreds of scripted test cases. Our intention was to execute these tests on the first good build we received from the development staff. We spent months developing the test scripts and planning for their execution. We created the test cases, entered them in our test management tool, traced them to requirements, and reviewed them with the business and other testers.

In spite of months of planning and hard work, I noticed some problems on our first day of test execution.

Problem 1: Almost None of the Test Scripts Were Correct

By the time test execution came around, the expected results no longer matched the actual requirements. Given the number of test cases and the number of changing requirements, it was impossible to keep them up to date. I won’t say all that time was wasted, because it did get us looking at the requirements, and it did give us something close to an expected result. But it didn’t do what it was intended to do: It didn’t accurately capture the steps needed to execute a test, or serve as an effective oracle for expected results. That is, it didn’t reduce the need for brain-engaged testing.

Problem 2: Inattentional Blindness

When executing the tests, the testers didn’t look for any errors other than the specific errors identified in the test case. For example, I randomly selected a passed test case for review. I executed the test case myself and scanned the actual results looking for the expected value (the value defined in the test case) and I found it. I then re-scanned the actual results looking for other information about the product. I found some funny stuff.

When all was said and done, we logged six defects against this "passed" test case. The only way I can explain this result is that the tester wasn’t brain-engaged. Because we had a script, the tester simply followed it, turning off his or her powers of critical thinking and problem-solving. This was one example of many similar cases we found.

Computerworld has a great article in which Kathleen Melymuka interviewed Max H. Bazerman, a professor of business administration at Harvard Business School. In the article, Bazerman talks about how "bounded awareness" can cause you to ignore critical information when making decisions. The example used by Bazerman is the same example that Cem Kaner and James Bach use in their Black Box Software Testing course on test procedures and scripts. In the course video (the same video to which Bazerman refers), Kaner points out some of the implications of inattentional blindness for our reliance on scripts:

"The idea (the fantasy) is that the script specifies the test so precisely that the person following it can be a virtual robot, doing what a computer would do if the test were automated."

Inattentional blindness teaches us that unless we pay close attention, we can miss even the most conspicuous events that occur while we’re executing our well-planned tests. This means that perhaps our tests are not as powerful as we think. Joel Spolsky provides a wonderful example of this principle:

"All the testing we did, meticulously pulling down every menu and seeing if it worked right, didn’t uncover the showstoppers that made it impossible to do what the product was intended to allow. Trying to use the product, as a customer would, found these showstoppers in a minute.

And not just those. As I worked, not even exercising the features, just quietly trying to build a simple site, I found 45 bugs on one Sunday afternoon. And I am a lazy man, I couldn’t have spent more than 2 hours on this. I didn’t even try anything but the most basic functionality of the product."

Joel’s example also points to the relative power of a test (its ability to find defects). I don’t attribute all of the missed bugs to inattentional blindness, but assuming that Fog Creek has smart people doing its testing, my guess is that inattentional blindness played more than a small part.

Problem 3: Inaccurate Perceptions

The third problem I noticed was a perception problem. Since we had scripted test cases, progress was measured by the number of test cases executed. I don’t want to suggest that this type of information is not valuable. It is. But it isn’t the only valuable information, and on this project it was being treated as if it were.

I think that making the number of scripts executed the driving metric added to the urgency many testers felt to pass each test case (see problem 1 above) and move on to the next one as quickly as possible. It builds the mentality, "We can’t spend time looking for bugs if we’re measured by how many test cases we actually execute."

In his article "Inside the Software Testing Quagmire," Paul Garbaczeski illustrates the perception problem beautifully. Paul asks an important question and follows it with analysis that I couldn’t disagree with more:

Are test cases comprehensive and repeatable; are they executed in a controlled environment?

You’re really asking: Is testing ad hoc or disciplined?

You’re trying to determine: If testing is effective.

Interpreting the response: There should be a set of repeatable test cases and a controlled test environment where the state of the software being tested and the test data are always known. Absent these, it will be difficult to discern true software defects from false alarms caused by flawed test practices.

A related symptom to check: If temporary testers are conscripted from other parts of the organization to "hammer" the software without using formal test cases, it means the organization is reacting to poor testing by adding resources to collapse the test time, rather than addressing the problem’s root causes.

This example illustrates many of the perception problems surrounding scripted testing. Many people believe that the scripts developed for an application can capture all the aspects of the application worth testing. But many aspects of an application traditionally aren’t captured in test scripts. Typically, we capture only the functionality called out in a requirements specification. However, there are many other aspects of the application that we might want to test. Just check out the Product Elements list in the Satisfice Heuristic Test Strategy Model. How many of those elements do you create scripts for? If not all of them, are your scripts complete? Kaner and Bach talk more about the measurement problem and the impossibility of complete testing in their Black Box Software Testing course.

Paul Garbaczeski’s question asks for repeatability. Many people believe that if two people follow the same script, they’ll achieve the same result. However, different people following the same script sometimes get different results. As illustrated in problem 2 above, I found six defects using a test script with which one of the other testers (a very good tester, I might add) found none. James Bach has some great posts on repeatability that I use to guide my decision of when to repeat a test; both are worth checking out.

The question also asks whether testing is ad hoc or disciplined. Many people believe that a tester who doesn’t document all test cases (completely?) is not a "disciplined" tester. I would offer another explanation. When I elect not to document a test case, it’s because I’m fighting bad test documentation, not because I’m undisciplined. I’m disciplined with my testing whether or not I document my test cases. For me, disciplined testing is not a matter of recording steps and results; that’s recordkeeping. Discipline in testing comes from remaining focused on delivering value to the project stakeholders. An ad hoc test is one executed without a clear understanding of what value it delivers to the project team.

Garbaczeski’s conclusion that testers need formal test cases in order to be effective misses the point of testing entirely. It assumes that the person executing a script knows what was in the mind of the test designer when the test case was written. I’ve found that one of the most effective things you can do for a stale test group is to bring in a tester from another part of the organization, let her test in her own way (not using your scripts), see what she finds, and figure out why you didn’t find it. I would also argue that conscripting temporary resources (testers, developers, or others) from other parts of the organization to collapse the test time is a risk regardless of what documents are available to those folks when they get to the project. That approach can work, but its success most likely won’t depend on whether we documented all our test cases up front.

Problem 4: Scripted Test Cases Gave the Illusion of Progress

A final noteworthy aspect of scripted testing: I think many project managers equate scripted test cases with progress milestones. On some of my past projects, managers have had the impression that the more scripted test cases we have, the more testing we’re doing. They believe that the more pages we fill with steps, expected results, and test data, the more effective our testing will be. Scripted test cases also add an element of predictability to a project plan: "If we have 200 test cases, then we know we’re halfway through our testing when we reach 100, right?"

I’m not saying that there’s no correlation between the number of test cases you have and the time it takes to execute your testing. I’m also not saying that measuring your test case execution progress isn’t important; it is. But it’s not as important as making sure that you’re testing the right things (coverage), or that you’re testing for the right types of errors and the right information (risk). Each test case executed should have the potential to reveal new information that could shorten or lengthen the test project. Measuring test progress by counting pieces of paper doesn’t reflect that aspect of test management. This is probably why some managers prefer scripted test cases: They have the potential to give you the illusion that you’re testing, even if you aren’t.

Additional Resources

For more information on the tradeoffs involved with scripted testing, I recommend the following resources (in addition to all the links included above):
