An Interview with Watts Humphrey, Part 25: SEI Strategy and the Trouble with Trivial Errors
This interview was provided courtesy of the Computer History Museum.
Nico Haberman and the SEI Strategy
Humphrey: Fairly early on, you remember now, there were a bunch of techies there at the SEI. I mean good people, but they just weren’t on my wavelength. And Nico Haberman had been the computer science chair or dean, and I think was the guy who had actually conceived the whole idea for CMU proposing that the SEI be there. He was a marvelous guy, a Dutchman. Highly revered. Unfortunately, he died prematurely. He had a heart attack running on the beach, I think, in 1994. I remember because we were
In any event, Larry Druffel called a strategy meeting shortly after he became director. It was a little bit after, because we had gotten far enough along to put together the ideas for a maturity model. I mentioned that I had sort of dreamed up these five levels and worked off that “quality is free” idea and that sort of thing. We put together the five levels with these questions with the people from MITRE.
So we arrived at this strategy meeting with all of the techie people, and Nico was invited. During the meeting, we broke into about three subgroups -- working groups -- to put together what we thought the SEI strategy ought to be, and we were going to have these three presentations at the end. Well, by great good fortune I happened to be in Nico’s group. And we were going through a discussion, and they came to me and I got up and I described this five-level model and how you use it for improving software development work and that sort of thing.
And the guys there were panning it, right? “Well, what’s that for? Why do I want to do that?” And Nico interrupted and said, “This is why we formed the SEI.” He said, “That’s it. That’s marvelous.” And I’ll tell you, without his backing I don't think I would have been able to get it done at the SEI. I mean, he could see it. It was exactly what he was after. So I thought that this man had vision and he really did, he had marvelous vision, a wonderful guy, and I was really sad that we lost him so early. But in any event, he played a very powerful hand in forming the SEI and in putting us on the path that we’ve been on ever since. Because without that we would never have had a process program that went anywhere at all. Okay.
Booch: Very well.
Humphrey: Well, now, let’s shift gears, because I think I’ve pretty much caught all the way up to where I was made a fellow and Barbara had decided that she wanted to live in
And so Barbara rather wisely suggested, why don’t you put together a monthly report? And so I did. So I started writing a Fellow report, and I wrote one every month, and I've done it ever since. And I've got them. That's part of what I'll send to the museum, my monthly reports. So I wrote about the meetings I had and the trips I made and what I'd done. And I was really quite surprised when I started looking at the end of the month that I had a whole bunch of items I'd actually accomplished. I'd been off and done this and I'd been asked to come give the keynote speech at some place and I had some conference calls with people. And so I was actually accomplishing a lot more than I realized. And so that all of a sudden was a great feeling of more self-worth out of it. It's very helpful to do that. And that's one of the reasons I kept it up. I was sending copies to a distribution list at SEI, but since my cancer, I basically stopped. My activities are so limited and I'm writing it but not widely distributing it anymore.
Booch: Well, let's consider this interview here one of your extended reports.
Humphrey: Right, okay. So I was writing the PSP programs and a whole bunch of things came up as I was doing it. One of the questions was “exactly what do you want to measure?” And I decided I wanted to get really basic measures. People get all confused about what measures are, and I wanted things that were auditable. And think about it this way: think about a measurement system that's scalable. Can you think of one that really is scalable from the smallest to the biggest business or operation?
Booch: Well, source lines of code comes to mind.
Humphrey: Well, yes, it is an auditable measurement system, and source lines of code is a measure of size. But a widely-used measure that is scalable from the smallest to the biggest business is cost accounting. I took cost accounting in my MBA which was very helpful, but they use it and it scales up to the largest corporation. Everybody uses that, and the reason is they're working from auditable real data. And you can count on the data. You can actually build it and define it. It's firm, it's clear. They say that the data don't lie. I mean liars can figure but the figures don't lie. You know what I mean? And so that's what I tried to put together: an auditable foundation of data. And what I've discovered was that to do that I had to have really well defined data.
And then I realized that the pretty basic data items that I concluded I needed were size, time and defects. And that's the size of everything you build, the time you spend doing every action, whatever you're doing, every task, and the defects you find at every step. Now to do that, to define that, however, that means you've got to define the steps of your job. So you must have a defined process, you must connect the tasks that you're doing to your process, and then you connect the data and the product so the tasks have to connect to the products and to the process.
And so it's a big interconnected system. But if it's auditable and if you have real sound data under it, what we have discovered is that I can put together a report on a program I write of 100 lines of code, and I can look at it, and we can do that identical report with that information in it and we can scale it up to a system of several million lines of code. The identical numbers. Now as a matter of fact, you can then look at it and you can say, well, what's this mean? And you can bore back down because it's now constructed from an auditable base. And you can go find all the data for all the parts to justify it: the defects, the time spent, what happened, how big this was, the changes, all the activity -- I mean it's amazing what you can get out of this stuff.
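Editor's note: the roll-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PSP's actual forms or tooling: the names `TaskRecord`, `Summary`, `summarize`, and `drill_down` are hypothetical, and the sample numbers are made up. It shows the one idea from the passage: the same report is computed from the same three measures (size, time, defects) whether the input is one small program or many parts, and every total can be traced back to the records behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """One logged task, tied to a product component and a process phase."""
    component: str      # the product part the task worked on
    phase: str          # the process step, e.g. "code", "review", "test"
    size_loc: int       # size of what was built in this task
    minutes: int        # time spent on the task
    defects_found: int  # defects found during the task


@dataclass(frozen=True)
class Summary:
    size_loc: int
    minutes: int
    defects_found: int

    @property
    def defects_per_kloc(self) -> float:
        # Guard against an empty selection so the report never divides by zero.
        return 1000 * self.defects_found / self.size_loc if self.size_loc else 0.0


def summarize(records):
    """Build the same report for any set of records, large or small."""
    records = list(records)
    return Summary(
        size_loc=sum(r.size_loc for r in records),
        minutes=sum(r.minutes for r in records),
        defects_found=sum(r.defects_found for r in records),
    )


def drill_down(records, component):
    """Return the raw records behind one component's numbers."""
    return [r for r in records if r.component == component]


# Made-up sample data, for illustration only.
log = [
    TaskRecord("parser", "code", 60, 90, 3),
    TaskRecord("parser", "test", 0, 30, 2),
    TaskRecord("lexer", "code", 40, 45, 1),
]
system = summarize(log)                        # whole-system report
parser = summarize(drill_down(log, "parser"))  # same report, one component
```

Because `summarize` only adds up auditable base records, the system-level report and the component-level report have identical structure, and `drill_down` recovers the data that justify any number in either one.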
Booch: So the first two things you mentioned are fairly quantifiable but defects seem to need to be something that there's a spectrum of -- different kinds of defects. So in your early incarnation of this, how did you characterize what a defect was and did you have a sense for different classes of defects and if that would weigh things?
Humphrey: Yes, yes. Well that was, of course, the first question: “what is a defect?” Lots of debates with that, and then you try to get an agreement with the lawyers as to what a defect is, but don't bother because you'll never get there because they are all looking for blame. But the issue I found was if you focus on it in terms of actual actions, what you have to do, the programmers have no trouble identifying defects. And I didn't either. It's something I have to fix, period. It was very straightforward. Now I had to define them, and you have got to be very careful defining defects, because people want to confuse the cause with the actual nature of the defect.
And so a defect, you know, may cause a buffer overflow, and there may be things that it does and the results and the problems they cause, but it is in fact an error in a loop. Or it is, you know, some particular error that you've got in the code. And they could be trivial errors. I remember there was one defect. Some guys were teaching the PSP at DEC, and one of the guys called me from there. They'd been to PSP training, but they never went very far with it, which was too bad because, as you know, DEC got bought, and things got moved, and management changed, and it was very hard to maintain this improvement in the face of that dynamic.
Booch: And their artifacts became
Humphrey: But in any event, DEC had this non-stop computing system where they could put two or more processors together, and if one processor had a problem, a second one or a third one would pick up. That was a marvelous system, and it was actually guaranteed to run without failure. The problem was it was failing. And so the guys called me about it. They told me it was really quite an amazing story because they had struggled with this thing. It didn't happen very often but when it hit, all the connected machines locked up, bam, and that was it. And they had to go and shut the whole thing down and restart. And I'll tell you, that was a disaster. And so this guy called me -- I think his name was Goldman. Matter of fact, interesting guy. He worked with Howie Dow, who's the guy that I had known at SEI that taught them the PSP. And he called me, he said what had happened was that they were just about -- I'm going to back up.
This problem finally became so severe that this manager decided he had to take his two best engineers and assign them full time to find and fix this problem. And so they did. And they worked for two to three months before they finally figured out what it was. And before they fixed it, the manager said, "Oh hold now, because where that defect is, is code that they're just about to do an inspection on for a revision of the system. So let's just participate in the inspection and see if they find it." So he did. And he called me to tell me the results.
The guys had gone through the inspection, took them several hours. Remember now, the two best programmers spent three months on this, the inspection team spent a couple of hours going through the code, and then they were going through the defects, and one of them pointed out this defect. And so the manager asked them, he said, "What kind of trouble do you think that would cause?" And the guys thought it would be kind of an annoyance but they'd probably find it in test and this sort of thing. And he asked how hard they thought it would be to find. You know, it'd probably take a little while in testing. And then he told them the story.
Well, it turned out it was a trivial little error, and they could fix it in a few minutes, and they found it real quick with an inspection when they really had a team focused on it. And that's what we're telling engineers they need to do. They need to do personal reviews; they need to do team inspections. Don't count on testing because some of these trivial little problems have enormous consequences, and that's what this was. It was a very simple little problem. It was basically the trigger problem where one computer would actually end up needing feedback from another one, and the other one would end up getting in a loop where it needed feedback, and the two of them would both essentially wait for feedback from the other one. So they’d basically hang. And it was a simple defect in terms of making sure the hang wouldn't occur. And it was a trivial one but as you know, those trivial errors can cause enormous trouble.
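Editor's note: the mutual wait described above can be illustrated with a small Python sketch. This is a hypothetical toy, not DEC's code; the interview does not say what the actual defect looked like. The names `node` and `run` and the 0.5-second timeout are assumptions chosen so the example terminates instead of hanging. Each node waits for feedback from its peer before replying; if neither sends first, both wait until the timeout expires.

```python
import queue
import threading


def node(name, inbox, outbox, send_first, results, timeout=0.5):
    """Exchange one feedback message with a peer node."""
    if send_first:
        # The one-line "fix": this node speaks before it listens.
        outbox.put(f"feedback from {name}")
    try:
        inbox.get(timeout=timeout)  # wait for the peer's feedback
    except queue.Empty:
        results[name] = "hung"      # a real system would block here forever
        return
    if not send_first:
        outbox.put(f"feedback from {name}")
    results[name] = "ok"


def run(a_sends_first):
    """Run nodes A and B against each other and report how each ended."""
    a_inbox, b_inbox = queue.Queue(), queue.Queue()
    results = {}
    a = threading.Thread(target=node, args=("A", a_inbox, b_inbox, a_sends_first, results))
    b = threading.Thread(target=node, args=("B", b_inbox, a_inbox, False, results))
    a.start()
    b.start()
    a.join()
    b.join()
    return results


stuck = run(a_sends_first=False)  # both wait on each other
fixed = run(a_sends_first=True)   # A breaks the cycle by sending first
```

The difference between the two runs is a single boolean, which is the point of the story: the change is trivial to make once someone sees it, while the symptom it produces is rare and expensive to chase down in test.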
So the issue here is to separate out what the defect is. In this case it could be a bit that's set wrong or it could be an overlooked loop closure or it could be an off-by-one error. It could be, you know, you name it. The consequences are another thing and the causes are another thing. So the defect classifications that we have for the PSP simply relate to what the defect itself is: what you've got to fix. And if people want to go and analyze causes and all that, the defect data are very helpful and can be used. But when the engineers are just looking at what it is and what's got to be changed, they have no trouble identifying fixes, and we've got data now on something over 30,000 programs written for the PSP. And on every PSP program the guys record every defect. And they don't have any trouble. No one argues about it; you just say here's all you do and everybody does it.
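Editor's note: a defect log built on this idea might look like the sketch below. It is an illustration only, not the PSP's defect recording form or its defect type standard: the `Defect` fields, the type labels (borrowed from the examples in the paragraph above), and the sample entries are all assumptions. The design point it shows is that a defect is classified by what has to be fixed, while any consequence is kept as a separate note that plays no part in the classification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    program: str
    defect_type: str       # what has to be fixed (illustrative labels only)
    phase_removed: str     # where it was found: "review", "compile", "test"
    fix_minutes: int       # time it took to fix
    consequence: str = ""  # free-text note, deliberately not used to classify


def count_by_type(log):
    """Count defects by what had to be fixed."""
    return Counter(d.defect_type for d in log)


def fix_minutes_by_phase(log):
    """Total fix time for the defects removed in each phase."""
    totals = Counter()
    for d in log:
        totals[d.phase_removed] += d.fix_minutes
    return totals


# Made-up entries, for illustration only.
defect_log = [
    Defect("prog-1", "off-by-one", "review", 2),
    Defect("prog-1", "wrong bit set", "test", 45, consequence="intermittent hang"),
    Defect("prog-2", "off-by-one", "compile", 1),
    Defect("prog-2", "loop closure", "test", 30, consequence="buffer overflow"),
]
by_type = count_by_type(defect_log)
by_phase = fix_minutes_by_phase(defect_log)
```

Anyone who later wants to analyze causes or consequences can still do so from the same records, but recording a defect never requires settling that debate first.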
And so while people may say that you can't actually count the defects, we have no trouble. And I have thousands of engineers’ defect data. Now we're doing it with no trouble, no one debates it, no one argues about it, so it's a lot of smoke. And the reason I say it's smoke is because I’m talking to people who haven't actually tried to do it themselves. So if you sat down and were to track all the defects that you find in your programs, when you review them, when you compile them, when you test them, when you build them, whatever you do, track every one, look at it, figure out what it takes to fix it, track it. No big deal. So, that answer your question?
Booch: Absolutely. It does very much, thank you.
Humphrey: Okay. And that's why I wanted to sit down and write the PSP programs because I was getting all this smoke about stuff you couldn't do and why, and this and that, and so I decided to just try to do it. Here's what I thought we ought to do, so I was trying to act like a CMM level 5 organization of one person. I was trying to do everything I thought was needed and put the whole thing together and I did. And it was extraordinary. So that's what the PSP was all about.
Booch: And while you were doing this, who were you primarily collaborating with? Or was this springing largely from just your mind and work?
Humphrey: It was me. I just did it myself. I was working on statistics because I discovered a lot in my analysis work. I needed simple statistics and so I worked through a statistics book, a graduate text on statistics. I remember it was kind of funny. Barbara and I went on a trip to
And I took along my statistics textbook and spent a fair amount of the trip just going through that. Although we did see Cappadocia and lots of other wonderful places. But so I remember going through that and learning statistics. I had some people in the statistics department at CMU that were very helpful -- John Lahosky and others. I had the people at the SEI help me with programming problems. I hadn't programmed in years. I didn't know the modern systems and so I'd call folks -- Jim Over and others -- and say, "Well, how do you do this?" And so I had all kinds of people help.