Variable and Model Selection
Once we understand what data we actually have, we need to determine what we can learn from it. What possible relationships could we study? What possible relationships should we study? What relationship will we study first? Your answers to the last two questions will depend on the overall goals of the analysis.
Why Do It?
The data may have been collected for a clearly stated purpose. Even so, there might be other interesting relationships to study that occur to you while you are analyzing the data, and which you might be tempted to investigate. However, it is important to decide in advance what you are going to do first and then to complete that task in a meticulous, organized manner. Otherwise, you will find yourself going in lots of different directions, generating lots of computer output, and becoming confused about what you have tried and what you have not tried; in short, you will drown yourself in the data.
It is also at this stage that you may decide to create new variables or to reduce the number of variables in your analysis. Variables of questionable validity, variables not meaningfully related to what you want to study, and categorical variable values that do not have a sufficient number of observations should be dropped from the analysis. (See the case studies in Chapters 2 through 5 for examples of variable reduction/modification.) In the following example, we will use all the variables provided in Table 1.1.
Example
The smallest number of observations for a categorical variable is 3 for the MIS (management information systems) category of the application type (app) variable (see Example 1.2). Given that our data set contains 34 observations, I feel comfortable letting MIS be represented by three projects. No matter how many observations the database contains, I don't believe it is wise to make a judgment about something represented by fewer than three projects. This is my personal opinion. Ask yourself this: If the MIS category contained only one project and you found in your statistical analysis that the MIS category had a significantly higher productivity, would you then conclude that all MIS projects in the bank have a high productivity? I would not. If there were two projects, would you believe it? I would not. If there were three projects, would you believe it? Yes, I would in this case. However, if there were 3000 projects in the database, I would prefer for MIS to be represented by more than three projects. Feel free to use your own judgment.
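The check described above, counting how many observations each category value has and flagging the sparse ones, can be sketched in a few lines of Python. This is only an illustration: the function names, the level names OtherA and OtherB, and the sample values are hypothetical and are not the data from Table 1.1. The threshold of three is the rule of thumb stated above.

```python
from collections import Counter

MIN_OBS = 3  # rule-of-thumb threshold from the text; adjust to your own judgment


def count_levels(values):
    """Return the number of observations for each category value."""
    return Counter(values)


def sparse_levels(values, min_obs=MIN_OBS):
    """Return the category values that have fewer than min_obs observations."""
    counts = count_levels(values)
    return sorted(level for level, n in counts.items() if n < min_obs)


# Hypothetical values of the app variable (NOT the data from Table 1.1)
app = ["OtherA"] * 4 + ["MIS"] * 3 + ["OtherB"] * 2

print(count_levels(app))   # observations per category value
print(sparse_levels(app))  # category values below the threshold
```

With these made-up values, MIS has exactly three observations and passes the check, while OtherB has only two and is flagged. Passing a larger `min_obs` is one way to express the preference for more observations per category in a larger database.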
Even with this small sample of software project data, we could investigate a number of relationships. We could investigate whether any of the factors collected influenced software development effort. Or we could find out which factors influenced software development productivity (i.e., size/effort). We could also look at the relationship between application size (size) and Telon use (telonuse), between size and application type (app), or between application type (app) and staff application knowledge (t13), just to name a few more possibilities. In this example, we will focus on determining which factors affect effort. That is, do size, application type (app), Telon use (telonuse), staff application knowledge (t13), staff tool skills (t14), or a combination of these factors have an impact on effort? Is effort a function of these variables? Mathematically speaking, does:
effort = f (size, app, telonuse, t13, t14)?
In this equation, effort is on the left-hand side (LHS) and the other variables are on the right-hand side (RHS). We refer to the LHS variable as the dependent variable and the RHS variables as independent variables.
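One way to keep track of the chosen relationship is to write the dependent and independent variables down explicitly before running any analysis. The sketch below does this in Python and builds a formula string in the `y ~ x1 + x2` notation that R and some Python statistics libraries accept. The helper name `model_formula` is hypothetical, and the additive form is only one candidate for f; the text has not yet said what form f takes.

```python
# Variable names as they appear in the equation above
dependent = "effort"                                     # LHS variable
independent = ["size", "app", "telonuse", "t13", "t14"]  # RHS variables


def model_formula(lhs, rhs):
    """Build a 'lhs ~ x1 + x2 + ...' formula string (assumed additive form)."""
    return f"{lhs} ~ {' + '.join(rhs)}"


formula = model_formula(dependent, independent)
print(formula)  # effort ~ size + app + telonuse + t13 + t14
```

Recording the variable lists in one place also makes it easier to note which models you have tried and which you have not.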
What to Watch Out For
To develop a predictive model, make sure that the independent variables are all factors that you know or can predict with reasonable accuracy in advance.
Category values with fewer than three observations should be dropped from any multi-variable analysis.
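As a minimal sketch of the second point, the following Python function removes the projects that belong to a category value with fewer than three observations. The function name, the record layout, and the sample projects are hypothetical and are not taken from Table 1.1.

```python
from collections import Counter


def drop_sparse_levels(rows, column, min_obs=3):
    """Keep only rows whose value in `column` occurs at least min_obs times."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= min_obs]


# Hypothetical project records (NOT the data from Table 1.1)
projects = [
    {"id": 1, "app": "OtherA"},
    {"id": 2, "app": "OtherA"},
    {"id": 3, "app": "OtherA"},
    {"id": 4, "app": "MIS"},
    {"id": 5, "app": "MIS"},
    {"id": 6, "app": "MIS"},
    {"id": 7, "app": "OtherB"},  # only one observation, so it is dropped
]

kept = drop_sparse_levels(projects, "app")
print([row["id"] for row in kept])  # [1, 2, 3, 4, 5, 6]
```

The counts are computed once on the full data set, so dropping one sparse category value does not change the counts used to judge the others.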