An Interview with Watts Humphrey, Part 26: Catastrophic Software Failures and the Limits of Testing
This interview was provided courtesy of the Computer History Museum.
Initial PSP Use
Humphrey: So a lot of people
were very helpful with my programming problems. But basically coming up with
the PSP and the PSP idea and how to do it, by and large pretty much just about
everybody I was working with at SEI and elsewhere thought I was kind of a nut. I
mean “what in the world is the point of doing that?” was sort of the reaction
that I was getting. And remember I’d gone through and written the programs, and
then I gave a talk down in
There was a group --
a Siemens Research lab in
And so I went home all encouraged that I'll get these guys to actually use it. I'd given them the process, I'd described it to them. And so I kept calling like every week: where do you stand, how are you doing? Oh we're doing a prototype here, we don't have time to do it yet. And they never did. They never got to it. They were in too big a hurry to do other stuff. What was interesting was as part of that, I'd actually taken time to sit down with each of the engineers on the team. There were five engineers, and they had a process guy there. He was sort of a technology guy but he was sort of acting as the process guy I guess. But in any event, he and I would sit down and interview each of the engineers individually on the team. And what I did was I wanted to find out what process are you using personally. And they didn't know how to answer that question, so I'd kind of lead them through it.
I said, okay, when you get an assignment to write a module of code, say 2,000 lines of code, what do you get? So they'd tell me, and then say here's how I'd start. I'd say well, what do you do next? So they'd start to describe that. So they would describe what they did, and I'd lead them step by step, and I wrote down on the board what they were saying. And every so often they'd say, here's what they did next. And I'd say, "Well don't you have to do this too?" And they'd say, "Oh yeah, well I forgot that." And so we built the process for each of these five people. I don't know if it's in my notebooks or not. But I wrote it on the board. I'd write it where they could see it. But in any event, they were all -- the five of them on one team -- totally different. It was amazing. I mean, one young guy came right out of school, seemed like a real sharp guy, but he basically started coding in the middle and he didn't do any design work at all. He just sort of had this idea of what it was and how it was going to work, and so he started coding in the middle of this big thing, and he'd sort of build it.
And at the other
extreme, there was an engineer -- I think he was actually from
And so the individual
process is a critical piece of this, and that's why I went all the way to the
PSP. And so, as I say, I never could find anybody who would use it and so I was
really very frustrated. I think I may have mentioned I was at a conference in
Booch: And the time you were there that's when the wall had just gone down so the dynamic was amazing.
Humphrey: Oh yes, the wall had come down, I walked over there and around it, was looking at the whole area.
Booch: And you saw the piles of rubble.
Humphrey: Oh exactly, exactly. It was a thrilling experience. But in any event I was looking at that and it's amazing, by watching how symphonies work what kind of feeling you get from the dynamics of doing beautiful work. And that's the software issue. We really need symphonic teams. The hacker business is so sad because, just like in a symphony, any individual instrument can destroy the whole effect, and that's exactly true of software. Any individual piece of code can destroy the whole thing. And that's the problem. I'm trying to remember the name of the medical instrument, remember that killed a bunch of people, the Therac-25 machine. And that was a trivial error in an error recovery program. I mean it wasn't something that normally gets used. It was off in the side somewhere. I remember an interesting sideline -- earlier at IBM on OS 360 we had performance problems.
Booch: The Therac-25, that was the machine.
Humphrey: That's what it was, the Therac-25. And that was an error recovery program that had a defect in it, and it missed getting a reset and I believe killed half a dozen people. And so we get those problems. And exactly the same thing with the V-22 Osprey. Do you know the story about the V-22?
Booch: I do but the people listening in might not so why don't you relay.
Humphrey: Well, on the V-22 Osprey, I actually went out and talked to the executives of the company that built that system, and I was talking about the quality of their software. They bridled, they said we have very high quality software. I said, the fact that it's killed 13 marines is a good measure of quality. They didn't buy that. But in any event, the V-22 Osprey is this tilt-wing aircraft that you can fly as a regular airplane and then as you're coming in to land you tilt the wings up and so the propellers are pointed up instead of forward and it lands as a helicopter. I mean it's an enormous technical achievement to build that thing.
And, of course, one of the issues that they had was what happens when the hydraulics fail while you're tilting the wing? You've got a whole hydraulic system that does that, and so they put in a whole electronic backup system in case you have a hydraulic failure -- an electronic backup system that will fly the airplane electronically. It turned out in this particular case, they were in a test flight with a bunch of marines in the aircraft, they were coming in to land, and as they were tilting the wings to bring it in, the hydraulics failed.
And so the backup system took over, the software that controlled the electronics system took over, and the software had an error in it that essentially crossed the controls. And of course a pilot can figure that out if he's got a little time, but when you're coming in to land that's a little hard, and they crashed and killed everybody. And the point is, I run into this in all kinds of things, the number of possibilities that have to be tested, and this is what the executive was telling me, "Oh, we tested it exhaustively." And I agreed they tested exhaustively but exhaustive testing won't find all the defects. People don't know that. They don't understand that. And let me branch a little piece on here. When you think about a big program, big complex system program, 2 million lines of code something like that, and you run exhaustive tests, what percentage of all the possibilities do you think you've tested? Any idea?
Booch: Oh it's going to be an embarrassingly small number probably in the less than 20, 30% would be my guess. What would you say?
Humphrey: You're way off. Way off. I typically ask people and I get back numbers 50%, 30%, that kind of thing. I asked the people at Microsoft, the Windows people, what they thought. And then we chatted about it a bit and they said about 1%.
Booch: Oh my goodness.
Humphrey: And my reaction is they're high by several orders of magnitude. And let me explain the reason why: the conditions that actually affect testing. I mean, testing will only test a specific set of conditions and the conditions that will affect testing include, for instance, how many job streams are running, what the configuration is for the system at that time, all kinds of things. And also what the operator's doing, what the hardware conditions are. So you can have a hardware failure, you could have data errors, you can have operator errors, you can have just an enormous range of things. And you make a list of all the variations, and then by the way you need different data values too. And so you got different data values. So if you go through and actually see what are the conditions -- I did this on a program with 57 lines of code that I'd written for the PSP. I went through and analyzed exactly how many test cases I'd have to run to exhaustively test it. I didn't worry about different data values yet. I assumed I would classify those and I could come up with that and I never did go back and do that. But it was like 268 test cases for 57 lines of code. I mean, it's extraordinary. And that's true. So people can talk about automated testing and that sort of thing, but the number of possibilities is so extraordinary you literally couldn't do a comprehensive test in the lifetime of the universe today.
Booch: So in effect there's a combinatorial explosion due to the number of possible states.
Humphrey: Exactly. And you look at all the number of possible ways things can connect, I mean it's extraordinary. And so when people have this enormous faith in testing it's vastly misplaced. And so the quality problem is really severe. And that's the issue that I was getting at. My sense was if you didn't deal with quality exhaustively at the very beginning, at the smallest unique level of the program, you will never solve the problem anywhere else. And that you can do. And so what I was after finding out with the PSP, prior to my quality study, was: could I produce defect-free programming? And my contention was that a program was defect-free if I had a design, I went through it, I produced a comprehensive test, and I wrote the program and I compiled it without error and I ran all the tests without error. Then I figured I probably had a pretty good program. Now that was without error the first time I ran the tests. So I'm now treating testing as not a way to find defects, it's a way to verify the quality of the work I've done.
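The combinatorial explosion discussed above can be made concrete with a small calculation. The Python sketch below is illustrative only: the condition categories echo the ones Humphrey lists (job streams, configuration, operator actions, hardware conditions, data values), but the counts, the rate of one billion tests per second, and every function name are assumptions made for the example, not figures from the interview.

```python
from math import prod

# Hypothetical counts for each kind of condition a test run depends on.
# The categories follow the interview; the numbers are invented.
CONDITIONS = {
    "job_streams": 8,
    "system_configurations": 50,
    "operator_actions": 20,
    "hardware_states": 10,
    "data_value_classes": 100,
}


def total_combinations(conditions):
    """Number of distinct test situations if every condition varies independently."""
    return prod(conditions.values())


def coverage_fraction(tests_run, conditions):
    """Fraction of all combinations exercised by a given number of distinct tests."""
    return tests_run / total_combinations(conditions)


def binary_conditions_to_exceed(budget):
    """Smallest n such that 2**n combinations no longer fit in the test budget."""
    n = 0
    while 2 ** n <= budget:
        n += 1
    return n


# Assumed budget: one billion tests per second for roughly the age of the
# universe (about 13.8 billion years).
AGE_OF_UNIVERSE_SECONDS = int(13.8e9 * 365.25 * 24 * 3600)
TESTS_PER_SECOND = 10 ** 9
BUDGET = AGE_OF_UNIVERSE_SECONDS * TESTS_PER_SECOND

if __name__ == "__main__":
    print("combinations:", total_combinations(CONDITIONS))
    print("coverage of 10,000 tests:", coverage_fraction(10_000, CONDITIONS))
    print("independent yes/no conditions that exhaust the budget:",
          binary_conditions_to_exceed(BUDGET))
```

Even with these modest invented counts, five kinds of condition multiply out to eight million situations, so a suite of ten thousand tests touches a little over a tenth of one percent of them. Under the assumed budget, fewer than a hundred independent yes/no conditions are enough to put a comprehensive test out of reach, which is in the spirit of the "lifetime of the universe" remark.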