An Interview with Watts Humphrey, Part 33: The Boeing B2 System and the "Last Liar Problem"
This interview was provided courtesy of the Computer History Museum.
The Boeing B2 System
Humphrey: Let me move now to large systems. I mentioned one of the problems we've had is that software has been hard to manage since the very beginning, and I ran into that very early on. I'd go out and talk to software teams. They didn't know where they were, they couldn't tell me where they were. I couldn't tell by looking at them what was going on. It was just sort of this fog of “What's happening?” and you sort of hope something will come out some day. And that's basically what we were going through. And I was able at IBM to deal with that when we put in place some structured steps and a process, so I had some idea where they were. But it wasn't at the level we had to reach with the PSP and TSP, to really understand status.
The problem you have on large programs, and it’s the point that Fred Brooks made -- it was a wonderful point, it’s in his “Mythical Man-Month” book -- and if anybody hasn't read it, they need to read it. That's a marvelous book. But he says schedules slip a day at a time, and that's absolutely true. And so the question is, when the schedule is slipping, how long does it take to know? If you're really slipping the schedule, when can you take action to recover? Today, the software community basically knows we're in real trouble when they're not into test when they should be. That's usually a year or more after they start. They're way down the road. It's unrecoverable.
The issue is, how can you detect a one day slip that same day? What's interesting is that, with the TSP, we can do it. So that's what we do. That's why they meet schedules, that's why they can come in on time. We had one team, this was a Boeing team years ago. There were disasters with them. Half of these projects have disasters, because of management changes or something else -- all kinds of stuff that have nothing to do with what we're doing with the process. So we run into this and it's one of my frustrations, the dynamics of the management system. Maintaining continuity in an organization in any kind of improvement effort is not totally impossible, but almost, unless you really are dealing with the top of the organization. And we were not doing so at Boeing.
This story started in about September one year. Boeing had bid to do a major update for the weapons delivery system for the B2 bomber. They were under a subcontract from Northrop, and the Air Force had told them that they wanted a lot of features but that only a few were truly critical. Those critical features had to be completed by December in the following year for a B2 flight test. So Boeing looked over the feature list and gave the Air Force a bid to do them all.
I heard that people had decided to use the PSP on the project, so I called them and told them they should use the TSP because the PSP by itself wasn’t likely to be successful. They agreed and asked me to be the team coach for the first launch. They also trained two of their people to be local coaches.
Well, when I got there and we did the launch, we did not realize that having people watch the launch could be a real problem, so I agreed that they could have people sit in. The team had 18 engineers and we got through the first day-and-a-half of the launch and had made the first preliminary estimates for the development effort. During the launch, some of the observers went to see the program manager and told him that the development estimates were much bigger than their estimates, so the schedule would not meet their December deadline. The program manager then killed the launch during the lunch break. Unfortunately, we did not have enough data then to prove that the test times would be much less than the original plan so they could still meet the schedule, so the team did not officially follow the TSP.
What then happened surprised me, because the Boeing managers didn’t know about the importance of the team having a plan to follow, so they just went along following the plan that had been developed during the proposal. The team, however, had already produced enough of a plan to guide their work, and since they had no other detailed plan for the work, they just followed the partial plan that came out of the partial launch we had done. They also had two coaches who helped them use the PSP.
In any event, this team got going, and they could see -- they had a schedule -- they had to be done in 15 months, and it was a hard stop. They had to be done. After three weeks with the TSP, they could see they were in trouble. They were able to lay it out -- they were going to be about a month, month-and-a-half late. And so the team actually sat down. Remember I talked about task time earlier? The team sat down and figured out, “What can we do?” They concluded they had to go on overtime for a couple of weeks. They decided not to do it right away, because it was going into Thanksgiving and Christmas, but right after Christmas, they went on overtime for several weeks till they got back on schedule.
And they each established goals for personal task time and they tracked them. And they got their hours up and they met the schedule. They actually came in a little bit early. If they hadn't done that, they would have had no way of knowing, and that's the case on all these projects. If you're going to maintain schedules on these projects, you've got to manage every day. That means you've got to know where you are every day, and that takes precise schedules, data, and tracking, all kinds of stuff. And that can only be done by the developers themselves. That's why knowledge workers have to manage themselves.
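The early-warning arithmetic the team relied on can be illustrated with a minimal Python sketch. This is not the TSP's actual tracking method or tooling. It assumes the effort estimate is correct and that the pace observed so far holds for the rest of the job. The function names, the 65-week plan (roughly 15 months), the planned 15 task hours per engineer per week, and the three weeks of logged hours are all hypothetical values chosen only to show the shape of the calculation.

```python
def projected_slip_weeks(plan_weeks, planned_hours_per_week, actual_hours_per_week):
    """Weeks late at completion if the observed task-hour pace continues.

    Assumes (hypothetically) that total planned task hours are correct
    and that only the weekly pace differs from the plan.
    """
    total_task_hours = plan_weeks * planned_hours_per_week
    projected_weeks = total_task_hours / actual_hours_per_week
    return projected_weeks - plan_weeks


def required_hours_per_week(plan_weeks, planned_hours_per_week, logged_hours):
    """Weekly task hours needed from now on to finish on the planned date."""
    weeks_elapsed = len(logged_hours)
    remaining_hours = plan_weeks * planned_hours_per_week - sum(logged_hours)
    return remaining_hours / (plan_weeks - weeks_elapsed)


# Hypothetical data: one engineer's task hours for the first three weeks.
PLAN_WEEKS = 65            # assumed: about 15 months
PLANNED_HOURS = 15.0       # assumed planned task hours per week
logged = [14.0, 13.5, 13.9]

actual_pace = sum(logged) / len(logged)
slip = projected_slip_weeks(PLAN_WEEKS, PLANNED_HOURS, actual_pace)
catch_up = required_hours_per_week(PLAN_WEEKS, PLANNED_HOURS, logged)

print(f"Projected slip: {slip:.1f} weeks")
print(f"Task hours per week needed to recover: {catch_up:.2f}")
```

With these made-up numbers the projection is a slip of between five and six weeks, visible after only three weeks of data. The point is that a small weekly shortfall in task hours compounds into a month or more over a long schedule, and it can only be seen if each developer records the hours.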
The manager's job is to give them support, coach them and protect them and all that sort of stuff, but not to actually manage what they do. That's a key point, so think of it this way, on a big program. Take a program that's got thousands of people or hundreds. As I say, I had the OS/360 with thousands of people, so I have had a lot of experience with great big programs. I'll tell you, I would have given my eye teeth for a process like the TSP. It just would have been a gift, because we really would have had the ability to manage what we were doing.
Booch: Watts, measure for me what a really, really large program is, versus a small one. Bjarne Stroustrup once pointed out to me that if I can't say what a thing is not, I haven't really identified what the thing is. So I'd like to understand your metrics for when you reach that threshold, that it becomes that size.
Humphrey: Before I do that, let me finish the Boeing story, OK?
Booch: OK, then we’ll come back to my question.
Humphrey: Well, the Boeing team ended up producing really high quality designs and code. The overall code ended up being much bigger than planned – as I recall over twice as big – and the effort was much more than expected because they were making a fundamental change in the system. The requirement was to take a hard-coded weapon-attachment design and convert it to a table-driven design. That is a bit like replacing all the wiring and plumbing in a house without affecting the wallpaper. It’s real tough.
So the team kept following the TSP, and they were about five months late finishing the design. Everybody was worried. Because the design was such high quality, however, coding was pretty fast and they got into test only about two months late. Then the testing was a snap – they actually were ready for the December Air Force tests a month ahead of schedule.
You would think that this would have convinced management to keep using the TSP, but it didn’t. The problem was that the contract called for a design review in December that first year and the development team didn’t have the design review in their plan so they missed the date and Boeing forfeited an award fee as a result. In any event, because they missed this fee, management was severely criticized by senior management and the program was always viewed as a failure, even though it was delivered on time.
Several things happened that we could have avoided had we known what we know now. Actually, however, we learned a lot of these lessons at Boeing. The first was that, if management had trusted the team and explained the critical need for a design review in December, the team could have held one. They knew a lot about the design in December but, with the TSP, they were now spending a lot more time in design and a lot less in coding. So they could have given a high level design review in December even though the detailed designs weren’t done.
Some of our early PSP data shows what I mean. During PSP training, the developers write a bunch of programs. In the first courses it was 10 programs, and the data on program 1 is with their old process, the way they had always worked without the PSP. The data we have on several hundred programmers shows that they spent less than 20% of their time in design at the beginning of the course and over 40% at the end. Design was taking twice as long. While programmers don’t now generally use compilers, they use code generators which also try to generate code from whatever defective stuff the programmers submit, which is essentially the same problem we were running into with compilers.
In any event, what is surprising about this is that even though the total design and code time percentages went up, the total time actually went down. That is because the time spent in compile and test dropped from about 45% to only 25%. The compile time at the beginning was more than the design time and at the end it was much less. While these were all important, the key was that program quality was so much better that system test time dropped from a planned six months to about one month, so even though they got into their internal testing late they were able to deliver to the Air Force for flight testing ahead of schedule. In a later presentation on this project, a Boeing executive said that this team cut testing time by 94%.
I also went out to visit the team in about early March and looked at some of their early data. Their plan showed that the program would have about 7,500 new and changed lines of code, and they had estimates for how much this would be in each of the system’s existing modules. When I got there, they had only completed coding for a few dozen modules but, with the TSP, they had the data for the new and changed code for each module. I then assumed that the ratio for these few modules would be about the same for the rest of the program and estimated that total program size would be about 18,000 lines of code. My estimate later turned out to be pretty accurate. The TSP data is really valuable for this kind of thing, and people don’t seem to see how useful it can be in measuring job status or estimating job completion.
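The size re-estimate described above is a simple ratio extrapolation, sketched here in Python. The 7,500-line plan and the roughly 18,000-line result come from the interview. The function name and the per-module planned and actual figures are hypothetical, invented only so the ratio works out, since the real module data is not given.

```python
def extrapolate_total_size(planned_total_loc, completed_modules):
    """Scale the planned total by the growth seen in finished modules.

    completed_modules: list of (planned_loc, actual_loc) pairs for the
    modules coded so far. Assumes the remaining modules will grow by
    the same overall ratio, as the interview describes.
    """
    planned_done = sum(planned for planned, _ in completed_modules)
    actual_done = sum(actual for _, actual in completed_modules)
    growth_ratio = actual_done / planned_done
    return planned_total_loc * growth_ratio


# Hypothetical new-and-changed LOC for a handful of completed modules.
completed = [
    (100, 250),
    (200, 460),
    (150, 370),
    (50, 120),
]

estimate = extrapolate_total_size(7500, completed)
print(f"Re-estimated program size: about {round(estimate):,} lines")
```

The calculation only works because planned and actual size were recorded per module. Without the plan-versus-actual pairs there is no ratio to extrapolate from.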
We also learned that we had to bar observers from all but the opening and closing meetings, and we had to make sure that the managers better understood what this was all about. This experience led to a one-day course for executives. Actually, executives don’t go to courses so we call it a seminar. We also have a three-day manager’s course.
OK, that’s the Boeing story. Now, back to your question about measurement, OK?
Booch: OK. I was asking about your metrics for what is a really large program.
The Large-System Problem
Humphrey: I'm talking about a program, typically it's three or four hundred or more people, typically involves multiple organizations, not necessarily different companies, but certainly different laboratories, different teams, different groups. Typically, they're remote and they're building a fairly big product, usually in multiple releases, not always. But that's what I'm talking about. As I say, it's typically a fairly large product. I won't put it in terms of software lines of code, because many of these are not software systems. They have software in them, but they're other systems. You work on these big nuclear power plant things. We've had some involvement in those. I think we have a team that's actually using the TSP to design nuclear reactor power plants. But there aren't any software people on the team at all. We have requirements people doing it too.
So you can have a big team, which is just a large collection of groups that all have to interact and they all have to synchronize their work. That's the key. And so the issue now -- I ran into it at IBM, and it was one of the most serious problems we had, and we had to keep the managers right on the ball -- because we ran into what we call the “last liar problem.” Fundamentally, if everybody's in trouble on a project, no one wants to admit it, because no one really knows what's going on. They're all sort of in trouble, they can see intuitively, “We're late,” and they wait for somebody to have problems that are so visible, they can no longer be concealed. And then everybody else can relax, because the first ones to admit to problems are the ones that did it. But everybody's in trouble.
And so the real issue with these great big systems is that no one really knows precisely where they are. No one really has a way of tracking a day at a time where they are. The interactions and the connections among the groups are very hard to manage, because people really can't predict exactly when they'll be done. And so all of these great big systems are enormous interconnected things. You have this big network of commitments and that sort of thing. In these great big systems, there are several things that are serious problems, and one is, no one really knows where any of the pieces stand. That means that everybody, when they're talking about their status, is defensive. They're sort of guessing.
So the team leader goes to the individual engineer/developer and says, “Where are you?” He says, “I'm almost done, Chief.” Well, the team leader knows he isn't almost done and he tries to poke at it. So they have kind of a guess at when you're going to get into test, but no one knows. So the team leader’s kind of uneasy about it. He doesn't have this feeling of confidence, and he gets that from all his team. And when he talks to his management, he then gives kind of a fluffy answer and no one is convinced he's right. He knows he's sort of soft and he hasn't met schedules before, so why would they believe him this time? And as you begin to build that up, a layer at a time, you get all the way up in the organization and everybody is being defensive. No one is admitting exactly where they are. No one is sitting down and saying, “Here are the facts. Let's fix it. Let's get in and resolve the problems.”
And so with these great big systems, you get this kind of defensive structure, all the way up. From the management of these big companies, it goes to the Department of Defense. And from the Department of Defense it goes to Congress. So at every level no one knows what they're talking about, and the reason is that they don't start with a solid foundation at the very beginning. The engineers, the individual developers, don't know exactly where they are and if they don't, no one above them can, and the entire system is guessing.
My point is that on these great big systems, they're so sophisticated and so interconnected, that you're literally taking all kinds of risks when you do this. If you count on good luck, you're not going to get it, and that's exactly the case with these systems. That's why, by and large, these great big systems are enormously late and way over budget. The way you can deal with it is to start all the way at the bottom with real precise control, put self directed teams in place, begin to use data to track and manage it. And we know exactly how to do that. That's what's so frustrating to me, because I can't get anybody interested.
So that's the issue we're struggling with. That's the ultra large system problem. How we move to that stuff, how do we handle this? I'm hoping we'll get there someday, but it's going to be a challenge. And if next year I'll be around to do it. But that's what we're struggling with.