# A Data Analysis Methodology for Software Managers


## Building the Multi-Variable Model

I call the technique I've developed to build the multi-variable model "stepwise ANOVA" (analysis of variance). It is very similar to forward stepwise regression except I use an analysis of variance procedure to build models with categorical variables. You will learn more about analysis of variance in Chapter 6. For the moment, you just need to know that this procedure allows us to determine the influence of numerical and categorical variables on the dependent variable, leffort. The model starts "empty" and then the variables most related to leffort are added one by one in order of importance until no other variable can be added to improve the model. The procedure is very labor-intensive because I make the decisions at each step myself; it is not automatically done by the computer. Although I am sure this could be automated, there are some advantages to doing it yourself. As you carry out the steps, you will develop a better understanding of the data. In addition, in the real world, a database often contains many missing values and it is not always clear which variable should be added at each step. Sometimes you need to follow more than one path to find the best model. In the following example, I will show you the simplest case using our 34-project, 6-variable database with no missing values. My goal for this chapter is that you understand the methodology. The four case studies in Chapters 2 through 5 present more complicated analyses, and will focus on interpreting the output.

Example

Determine Best One-Variable Model  First, we want to find the best one-variable model. Which variable, lsize, t13, t14, app, or telonuse, explains the most variation in leffort? I run regression procedures for the numerical variables and ANOVA procedures for the categorical variables to determine this. In practice, I do not print all the output. I save it in output listing files and record by hand the key information in a summary sheet. Sidebar 1.2 shows a typical summary sheet. I note the date that I carried out the analysis, the directory where I saved the files, and the names of the data file, the procedure file(s), and the output file(s). I may want to look at them again in the future, and if I don't note their names now, I may never find them again! We are going to be creating lots of procedures and generating lots of output, so it is important to be organized. I also note the name of the dependent variable.

Now I am ready to look at the output file and record the performance of the models. In the summary sheet, I record data only for significant variables. For the regression models, a variable is highly significant if its P>|t| value is 0.05 or less. In this case, I do not record the actual value; I just note the number of observations, the variable's effect on effort, and the adjusted R-squared value. If the significance is borderline, that is, if P>|t| is a number between 0.05 and 0.10, I note its value. If the constant is not significant, I note it in the Comments column. If you are analyzing a very small database, you might like to record these values for every variable—significant or not. Personally, I have found that it is not worth the effort for databases with many variables. If I need this information later, I can easily go back and look at the output file.
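The recording rule above can be written down as a small decision function. This is only an illustrative Python sketch (the chapter itself works in Stata); the function name and the return labels are hypothetical, and treating exactly 0.10 as borderline is an assumption, since the text only says "between 0.05 and 0.10".

```python
def summary_sheet_entry(p_value, significant=0.05, borderline=0.10):
    """Decide what to write in the summary sheet for one variable.

    p_value is the variable's P>|t| (regression) or Prob > F (ANOVA).
    The two thresholds follow the text; the labels are made up for this sketch.
    """
    if p_value <= significant:
        # Significant: note observations, effect, and adjusted R-squared.
        return "record"
    if p_value <= borderline:
        # Borderline: record the same items plus the p-value itself.
        return "record with p-value"
    # Not significant: note nothing and move on to the next model.
    return "skip"


# p-values taken from Examples 1.11, 1.12, and 1.15
for name, p in [("lsize", 0.000), ("t13", 0.536), ("telonuse", 0.0560)]:
    print(name, "->", summary_sheet_entry(p))
```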

For the ANOVA models, I do the same except I look at a variable's Prob > F value to determine whether the variable is significant. The effect of a categorical variable depends on the different types. For example, using Telon (telonuse = Yes) will have one effect on leffort and not using Telon (telonuse = No) will have a different effect on leffort. You cannot determine the effect from the ANOVA table.

In Example 1.11, I have highlighted the key numbers in bold. I see that there is a very significant relationship between leffort and lsize (P>|t| = 0.000): lsize explains 64% of the variation in leffort (this is the adjusted R-squared value recorded in the summary sheet, slightly lower than the R-squared of 0.6509). The coefficient of lsize (Coef.) is a positive number (0.9298). This means that leffort increases with increasing lsize. The model was fit using data from 34 projects. I add this information to the summary sheet (Sidebar 1.2).

#### Example 1.11

```
. regress leffort lsize

Source        SS            df    MS            Number of obs =      34
Model         22.6919055     1    22.6919055    F(1,32)       =   59.67
Residual      12.1687291    32    .380272786    Prob > F      =  0.0000
Total         34.8606346    33    1.05638287    R-squared     =  0.6509
                                                Root MSE      =  .61666

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
lsize         .9297666     .1203611      7.725    0.000    .6845991    1.174934
_cons         3.007431     .7201766      4.176    0.000     1.54048    4.474383
```
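The summary statistics in a regression listing like this one follow directly from the sums of squares and degrees of freedom in its top panel. The Python sketch below (an illustration only, not part of the Stata workflow; the function name is hypothetical) recomputes R-squared, adjusted R-squared, and F from the Model and Residual rows of Example 1.11.

```python
def fit_statistics(model_ss, residual_ss, model_df, residual_df):
    """Recompute R-squared, adjusted R-squared, and F from an ANOVA-style table."""
    total_ss = model_ss + residual_ss
    total_df = model_df + residual_df
    r2 = model_ss / total_ss
    # Adjusted R-squared compares mean squares rather than raw sums of squares.
    adj_r2 = 1 - (residual_ss / residual_df) / (total_ss / total_df)
    f_stat = (model_ss / model_df) / (residual_ss / residual_df)
    return r2, adj_r2, f_stat


# Model and Residual rows of Example 1.11
r2, adj_r2, f_stat = fit_statistics(22.6919055, 12.1687291, 1, 32)
print(f"R-squared = {r2:.4f}, Adj R-squared = {adj_r2:.4f}, F = {f_stat:.2f}")
```

With these inputs the function should reproduce the listed R-squared (0.6509) and F (59.67) and give an adjusted R-squared of about 0.64, the value noted in the summary sheet.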

In Example 1.12, I see that there is not a significant relationship between leffort and t13. Therefore, I do not even look at the coefficient of t13. I note nothing and move on to the next model.

#### Example 1.12

```
. regress leffort t13

Source        SS            df    MS            Number of obs =      34
Model         .421933391     1    .421933391    F(1,32)       =    0.39
Residual      34.4387012    32    1.07620941    Prob > F      =  0.5357
Total         34.8606346    33    1.05638287    R-squared     =  0.0121
                                                Root MSE      =  1.0374

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
t13           .1322679     .2112423      0.626    0.536    -.2980186   .5625544
_cons         8.082423      .706209     11.445    0.000     6.643923   9.520924
```

In Example 1.13, I see that there is a very significant relationship between leffort and t14: t14 explains 36% of the variation in leffort (again, the adjusted R-squared value; the R-squared is 0.3782). The coefficient of t14 is negative. This means that leffort decreases with increasing t14. The model was fit using data from 34 projects. I add this information to the summary sheet (Sidebar 1.2).

#### Example 1.13

```
. regress leffort t14

Source        SS            df    MS            Number of obs =      34
Model         13.1834553     1    13.1834553    F(1,32)       =   19.46
Residual      21.6771793    32    .677411853    Prob > F      =  0.0001
Total         34.8606346    33    1.05638287    R-squared     =  0.3782
                                                Root MSE      =  .82305

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
t14          -.7022183     .1591783     -4.412    0.000    -1.026454  -.3779827
_cons         10.55504     .4845066     21.785    0.000     9.568136   11.54195
```

In Example 1.14, I see that there is no significant relationship between leffort and app. I note nothing and move on to the next model.

#### Example 1.14

```
. anova leffort app

Number of obs  =      34     R-squared     =  0.0210
Root MSE       = 1.06659     Adj R-squared = -0.0769

Source       Partial SS     df    MS            F       Prob > F
Model        .732134098      3    .244044699    0.21    0.8855
app          .732134098      3    .244044699    0.21    0.8855
Residual     34.1285005     30    1.13761668
Total        34.8606346     33    1.05638287
```

In Example 1.15, I see that there is a borderline significant relationship between leffort and telonuse: telonuse explains 8% of the variation in leffort. The model was fit using data from 34 projects. I add this information to the summary sheet (Sidebar 1.2).

#### Example 1.15

```
. anova leffort telonuse

Number of obs  =      34     R-squared     =  0.1094
Root MSE       = .984978     Adj R-squared =  0.0816

Source       Partial SS     df    MS            F       Prob > F
Model        3.81479355      1    3.81479355    3.93    0.0560
telonuse     3.81479355      1    3.81479355    3.93    0.0560
Residual     31.0458411     32    .970182533
Total        34.8606346     33    1.05638287
```

Sidebar 1.2 Statistical Output Summary Sheet

Date: 01/03/2001

Directory: C:\my documents\data analysis book\example34\

Data File: bankdata34.dta

Procedure Files: *var.do (* = one, two, three, etc.)

Output Files: *var.log

Dependent Variable: leffort

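| Variables | Num Obs | Effect | Adj R2 | Significance of Added Variable | Comments |
|---|---|---|---|---|---|
| 1-variable models | | | | | |
| *lsize | 34 | + | 0.64 | | |
| t14 | 34 | – | 0.36 | | |
| telonuse | 34 | | 0.08 | .056 | |
| 2-variable models with lsize | | | | | |
| t14 | 34 | – | 0.73 | | best model, sign. = 0.0000 |
| 3-variable models with lsize, t14 | | | | | |
| none significant | | | | | no further improvement possible |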

Once I have recorded all of the output in the summary sheet, I select the variable that explains the most variation in leffort. In this step, it is obviously lsize. There is no doubt about it. Then I ask myself: Does the relationship between leffort and lsize make sense? Does it correspond to the graph of leffort as a function of lsize (Figure 1.8)? Yes, it does, so I add lsize to the model and continue with the next step.
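The selection step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure itself: the dictionary layout and function name are hypothetical, the 0.10 cut-off mirrors the borderline threshold used for recording, and the sketch deliberately leaves out the judgment calls (Does the relationship make sense? Does it match the graph?) that the text insists on making by hand.

```python
# Results of the one-variable models, copied from Examples 1.11 to 1.15 and
# Sidebar 1.2: name -> (p-value of the variable, adjusted R-squared).
# None means the listing does not show an adjusted R-squared.
ONE_VARIABLE_MODELS = {
    "lsize": (0.000, 0.64),
    "t13": (0.536, None),
    "t14": (0.000, 0.36),
    "app": (0.8855, -0.0769),
    "telonuse": (0.0560, 0.0816),
}


def best_candidate(results, max_p=0.10):
    """Return the recorded variable with the highest adjusted R-squared.

    Variables whose p-value exceeds max_p are ignored, as they would not
    have been written into the summary sheet. Returns None if nothing
    qualifies, which is the stopping condition of the stepwise procedure.
    """
    eligible = {
        name: adj_r2
        for name, (p_value, adj_r2) in results.items()
        if p_value <= max_p and adj_r2 is not None
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


print(best_candidate(ONE_VARIABLE_MODELS))
```

Calling the same function on the results of a later step, where no added variable is significant, returns None and signals that the model cannot be improved further.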

Determine Best Two-Variable Model  Next I want to determine which variable, t13, t14, app, or telonuse, in addition to lsize, explains the most variation in leffort. I add lsize to my regression and ANOVA procedures and run them again. I record the output in the same summary sheet (Sidebar 1.2). What I am principally concerned with at this stage is whether the additional variable is significant. So first I look at the P>|t| value of this variable. If it is not significant, I record nothing and move on to the next model.

In Example 1.16, I see that t13 is not significant (P>|t| = 0.595).

#### Example 1.16

```
. regress leffort lsize t13

Source        SS            df    MS            Number of obs =      34
Model         22.8042808     2    11.4021404    F(2,31)       =   29.32
Residual      12.0563538    31    .388914638    Prob > F      =  0.0000
Total         34.8606346    33    1.05638287    R-squared     =  0.6542
                                                Root MSE      =  .62363

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
lsize          .943487     .1243685      7.586    0.000    .6898359    1.197138
t13          -.0697449     .1297491     -0.538    0.595     -.33437    .1948801
_cons         3.151871     .7763016      4.060    0.000    1.568593    4.735149
```

In Example 1.17, I learn that t14 is significant (P>|t| = 0.002): lsize and t14 together explain 73% of the variation in leffort (adjusted R-squared). The coefficient of t14 is a negative number. This means that leffort decreases with increasing t14. This is the same effect that we found in the one-variable model. If the effect were different in this model, that could signal something strange going on between lsize and t14, and I would look into their relationship more closely. lsize and the constant (_cons) are still significant. If they were not, I would note this in the Comments column. Again, this model was built using data from 34 projects.

#### Example 1.17

```
. regress leffort lsize t14

Source        SS            df    MS            Number of obs =      34
Model         25.9802069     2    12.9901035    F(2,31)       =   45.35
Residual      8.88042769    31    .286465409    Prob > F      =  0.0000
Total         34.8606346    33    1.05638287    R-squared     =  0.7453
                                                Root MSE      =  .53522

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
lsize         .7678266     .1148813      6.684    0.000    .5335247    1.002129
t14          -.3856721     .1138331     -3.388    0.002   -.6178361    -.153508
_cons         5.088876     .8764331      5.806    0.000    3.301379    6.876373
```

In Examples 1.18 and 1.19, I see that app and telonuse are not significant (Prob > F = 0.6938 and 0.8876, respectively).

#### Example 1.18

```
. anova leffort lsize app, category(app)

Number of obs  =      34     R-squared     =  0.6677
Root MSE       =  .63204     Adj R-squared =  0.6218

Source       Partial SS     df    MS            F        Prob > F
Model        23.2758606      4    5.81896516    14.57    0.0000
lsize        22.5437265      1    22.5437265    56.43    0.0000
app          .583955179      3    .194651726     0.49    0.6938
Residual      11.584774     29    .399474964
Total        34.8606346     33    1.05638287
```

#### Example 1.19

```
. anova leffort lsize telonuse, category(telonuse)

Number of obs  =      34     R-squared     =  0.6512
Root MSE       = .626325     Adj R-squared =  0.6287

Source       Partial SS     df    MS            F        Prob > F
Model        22.6998727      2    11.3499363    28.93    0.0000
lsize        18.8850791      1    18.8850791    48.14    0.0000
telonuse     .007967193      1    .007967193     0.02    0.8876
Residual     12.1607619     31    .392282644
Total        34.8606346     33    1.05638287
```

Again, the decision is obvious: The best two-variable model of leffort is lsize and t14. Does the relationship between t14 and leffort make sense? Does it correspond to the graph of leffort as a function of t14? If yes, then we can build on this model.

Determine Best Three-Variable Model  Next I want to determine which variable, t13, app, or telonuse, in addition to lsize and t14, explains the most variation in leffort. I add t14 to my regression and ANOVA procedures from the previous step and run them again. I record the output in the same summary sheet (Sidebar 1.2). As in the previous step, what I am principally concerned with at this stage is if the additional variable is significant. If it is not significant, I record nothing and move on to the next model. Let's look at the models (Examples 1.20, 1.21, and 1.22).

#### Example 1.20

```
. regress leffort lsize t14 t13

Source        SS            df    MS            Number of obs =      34
Model         26.0505804     3    8.68352679    F(3, 30)      =   29.57
Residual      8.81005423    30    .293668474    Prob > F      =  0.0000
Total         34.8606346    33    1.05638287    R-squared     =  0.7473
                                                Root MSE      =  .54191

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
lsize         .7796095      .118781      6.563    0.000    .5370263    1.022193
t14           -.383488     .1153417     -3.325    0.002   -.6190471   -.1479289
t13           -.055234     .1128317     -0.490    0.628    -.285667     .175199
_cons         5.191477     .9117996      5.694    0.000    3.329334     7.05362
```

#### Example 1.21

```
. anova leffort lsize t14 app, category(app)

Number of obs  =      34     R-squared     =  0.7478
Root MSE       = .560325     Adj R-squared =  0.7028

Source       Partial SS     df    MS            F        Prob > F
Model        26.0696499      5    5.21392998    16.61    0.0000
lsize        12.3571403      1    12.3571403    39.36    0.0000
t14          2.79378926      1    2.79378926     8.90    0.0059
app          .089442988      3    .029814329     0.09    0.9622
Residual      8.7909847     28    .313963739
Total        34.8606346     33    1.05638287
```

#### Example 1.22

```
. anova leffort lsize t14 telonuse, category(telonuse)

Number of obs  =      34     R-squared     =  0.7487
Root MSE       = .540403     Adj R-squared =  0.7236

Source       Partial SS     df    MS            F        Prob > F
Model         26.099584      3    8.69986134    29.79    0.0000
lsize         12.434034      1     12.434034    42.58    0.0000
t14          3.39971135      1    3.39971135    11.64    0.0019
telonuse     .119377093      1    .119377093     0.41    0.5274
Residual      8.7610506     30     .29203502
Total        34.8606346     33    1.05638287
```

None of the additional variables in the three models (Examples 1.20, 1.21, and 1.22) are significant.

The Final Model  The stepwise ANOVA procedure ends as no further improvement in the model is possible. The best model is the two-variable model: leffort as a function of lsize and t14. No categorical variables were significant in this example, so this model is the same model found by the automatic stepwise regression procedure. I check one final time that the relationships in the final model (Example 1.23) make sense. We see that lsize has a positive coefficient. This means that the bigger the application size, the greater the development effort required. Yes, this makes sense to me. I would expect bigger projects to require more effort. The coefficient of t14, staff tool skills, is negative. This means that effort decreases with increasing staff tool skills. Projects with very high staff tool skills required less effort than projects with very low staff tool skills, everything else being constant. Yes, this makes sense to me, too. Print the final model's output and save it.

#### Example 1.23

```
. regress leffort lsize t14

Source        SS            df    MS            Number of obs =      34
Model         25.9802069     2    12.9901035    F(2, 31)      =   45.35
Residual      8.88042769    31    .286465409    Prob > F      =  0.0000
Total         34.8606346    33    1.05638287    R-squared     =  0.7453
                                                Root MSE      =  .53522

leffort       Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
lsize         .7678266     .1148813      6.684    0.000    .5335247    1.002129
t14          -.3856721     .1138331     -3.388    0.002   -.6178361    -.153508
_cons         5.088876     .8764331      5.806    0.000    3.301379    6.876373
```
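Because the model is fitted on log-transformed variables, its coefficients translate into multiplicative effects on effort. The short Python sketch below illustrates this with the coefficients of Example 1.23. It assumes that leffort and lsize are natural logarithms of effort and size; this excerpt does not state the base, so treat the t14 factor in particular as conditional on that assumption. The variable names are hypothetical.

```python
import math

# Coefficients of the final model (Example 1.23)
COEF_LSIZE = 0.7678266
COEF_T14 = -0.3856721

# Doubling size adds log(2) * COEF_LSIZE to leffort, so effort is multiplied
# by 2 ** COEF_LSIZE (this holds whenever both logs use the same base).
size_doubling_factor = 2 ** COEF_LSIZE

# Raising t14 by one point adds COEF_T14 to leffort. ASSUMPTION: natural
# logs, so effort is multiplied by exp(COEF_T14).
t14_step_factor = math.exp(COEF_T14)

print(f"Doubling size multiplies effort by about {size_doubling_factor:.2f}")
print(f"One more point of t14 multiplies effort by about {t14_step_factor:.2f}")
```

Under these assumptions, doubling application size raises predicted effort by roughly 70%, and each additional point of staff tool skills lowers it by roughly a third, everything else being constant. Both directions agree with the sanity check made in the text.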

On the summary sheet, I note the significance of the final model. This is the Prob > F value at the top of the output. The model is significant at the 0.0000 level. This is Stata's way of indicating a number smaller than 0.00005. This means that there is less than a 0.005% chance that all the variables in the model (lsize and t14) are not related to leffort. (More information about how to interpret regression output can be found in Chapter 6.)

What to Watch Out For

• Be sure to use an ANOVA procedure that analyzes the variance of unbalanced data sets, that is, data sets that do not contain the same number of observations for each categorical value. I have yet to see a "balanced" software development database. In Stata, the command is anova.

• Use the transformed variables in the model.

• Some models may contain variables with lots of missing values. It might be better to build on the second best model if it is based on more projects.

• If the decision is not obvious, follow more than one path to its final model (see Chapters 4 and 5). You will end up with more than one final model.

• Always ask yourself at each step if a model makes sense. If a model does not make sense, use the next best model.

Checking the Model

Before we can accept the final model found in the previous step, we must check that the assumptions underlying the statistical tests used have not been violated. In particular, this means checking that:

• Independent numerical variables are approximately normally distributed. (We did this in the preliminary analyses.)

• Independent variables are not strongly related to each other. (We did this partially during the preliminary analyses; now, we need to complete it.)

• The errors in our model are random and normally distributed. (We still need to do this.)

In addition, we also need to check that no single project or small number of projects has an overly strong influence on the results.

### Numerical Variable Checks

We already calculated the correlation coefficients of numerical variables in our preliminary analyses and noted them. Now that I have my final model, I need to check that all the independent numerical variables present in the final model are not strongly linearly related to each other. In other words, I need to check for multicollinearity problems. Why would this cause a problem? If two or more explanatory variables are very highly correlated, it is sometimes not possible for the statistical analysis software to separate their independent effects and you will end up with some strange results. Exactly when this will happen is not predictable. So, it is up to you to check the correlations between all numerical variables. Because my model only depends on lsize and t14, I just need to check their correlation with each other. To avoid multicollinearity problems, I do not allow any two variables with an absolute value of Spearman's rho greater than or equal to 0.75 in the final model together. From our preliminary correlation analysis, we learned that size and t14 are slightly negatively correlated; they have a significant Spearman's correlation coefficient of -0.3599. Thus, there are no multicollinearity problems with this model.
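For readers who want to see what the 0.75 rule involves computationally, here is a minimal Python sketch of Spearman's rho (the Pearson correlation of the ranks, with tied values sharing their average rank) and of the threshold check. The function names and the small data set are hypothetical; the project data itself is not reproduced in this section.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def too_correlated(rho, limit=0.75):
    """Apply the rule from the text: reject pairs with |rho| >= 0.75."""
    return abs(rho) >= limit


# Hypothetical values, for illustration only
sizes = [100, 250, 400, 800, 1500]
skills = [4, 4, 3, 3, 2]
print(round(spearman_rho(sizes, skills), 4))

# The coefficient reported in the text for size and t14 passes the check.
print(too_correlated(-0.3599))
```

Because Spearman's rho depends only on ranks, it is unchanged by a monotonic transformation such as taking logarithms.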

You should also be aware that there is always the possibility that a variable outside the analysis is really influencing the results. For example, let's say I have two variables, my weight and the outdoor temperature. I find that my weight increases when it is hot and decreases when it is cold. I develop a model that shows my weight as a function of outdoor temperature. If I did not use my common sense, I could even conclude that the high outdoor temperature causes my weight gain. However, there is an important variable that I did not collect which is the real cause of any weight gain or loss: my ice cream consumption. When it is hot outside, I eat more ice cream, and when it is cold, I eat much less. My ice cream consumption and the outdoor temperature are therefore highly correlated. The model should really be my weight as a function of my ice cream consumption. This model is also more useful because my ice cream consumption is within my control, whereas the outdoor temperature is not. In this case, the outdoor temperature is confounded with my ice cream consumption and the only way to detect this is to think about the results. Always ask yourself if your results make sense and if there could be any other explanation for them. Unfortunately, we are less likely to ask questions and more likely to believe a result when it proves our point.

### Categorical Variable Checks

Strongly related categorical variables can cause problems similar to those caused by numerical variables. Unfortunately, strong relationships involving categorical variables are much more difficult to detect. We do not have any categorical variables in our final effort model, so we do not need to do these checks for our example. However, if we had found that telonuse and app were both in the model, how would we check that they are not related to each other or to the numerical variables in the model?

To determine if there is a relationship between a categorical variable and a numerical variable, I use an analysis of variance procedure. Let's take app and t14 in Example 1.24. Does app explain any variance in t14?

#### Example 1.24

```
. anova t14 app

Number of obs  =      34     R-squared     =  0.1023
Root MSE       = .894427     Adj R-squared =  0.0125

Source       Partial SS     df    MS            F       Prob > F
Model        2.73529412      3    .911764706    1.14    0.3489
app          2.73529412      3    .911764706    1.14    0.3489
Residual          24.00     30           .80
Total        26.7352941     33    .810160428
```

Example 1.24 shows that there is no significant relationship between app and t14 (the Prob > F value for app is a number greater than 0.05). I run ANOVA procedures for every categorical/numerical variable combination in the final model. (Note that the numerical variable must be the dependent variable, that is, the one on the left-hand side.) If I find a very strong relationship, I will not include the two variables together in the same model. I define "very strong relationship" as one variable explaining more than 75% of the variation in another.
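To make the "variation explained" check concrete, the Python sketch below computes a one-way analysis of variance from raw values: between-group and within-group sums of squares, F, R-squared, and adjusted R-squared. It is an illustration only. The group labels and numbers are hypothetical, and using adjusted R-squared for the 75% rule is an assumption based on how "explains" is used elsewhere in the chapter.

```python
def one_way_anova(groups):
    """One-way ANOVA of a numerical variable split by a categorical variable.

    groups maps each category to the list of numerical values observed in it.
    Groups may have different sizes (unbalanced data).
    """
    values = [v for group in groups.values() for v in group]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between_ss = 0.0
    within_ss = 0.0
    for group in groups.values():
        group_mean = sum(group) / len(group)
        between_ss += len(group) * (group_mean - grand_mean) ** 2
        within_ss += sum((v - group_mean) ** 2 for v in group)
    total_ss = between_ss + within_ss
    return {
        "F": (between_ss / (k - 1)) / (within_ss / (n - k)),
        "r2": between_ss / total_ss,
        "adj_r2": 1 - (within_ss / (n - k)) / (total_ss / (n - 1)),
    }


# Hypothetical data: a numerical variable observed in three categories
toy = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [6, 7, 8]}
result = one_way_anova(toy)
print(result)

# ASSUMPTION: "explains more than 75%" is judged on adjusted R-squared.
very_strong = result["adj_r2"] > 0.75
print("Keep both variables in the model:", not very_strong)
```

In this made-up data set the category explains most of the variation in the numerical variable, so by the rule above the two would not be kept together in one model. Example 1.24, with an adjusted R-squared of 0.0125, is at the opposite extreme.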

I would like to point out here that we can get a pretty good idea about which variables are related to each other just by looking at the list of variables that are significant at each step as we build the one-variable, two-variable, three-variable, etc. models. In the statistical output sheet, Sidebar 1.2, we see that telonuse is an important variable in the one-variable model. However, once lsize has been added to the model, telonuse does not appear in the two-variable model. This means that there is probably a relationship between telonuse and lsize. Let's check (Example 1.25):

#### Example 1.25

```
. anova lsize telonuse

Number of obs  =      34     R-squared     =  0.1543
Root MSE       = .832914     Adj R-squared =  0.1279

Source       Partial SS     df    MS            F       Prob > F
Model        4.04976176      1    4.04976176    5.84    0.0216
telonuse     4.04976176      1    4.04976176    5.84    0.0216
Residual     22.1998613     32    .693745665
Total        26.2496231     33    .795443123
```

Yes, there is a significant relationship between lsize and telonuse. The use of Telon explains about 13% of the variance in lsize. Example 1.26 shows that applications that used Telon were much bigger than applications that did not. So, the larger effort required by applications that used Telon (Example 1.5) may not be due to Telon use per se, but to the fact that the applications were bigger. Once size has been added to the effort model, Telon use is no longer important; size is a much more important driver of effort. I learn as I analyze. Had this all been done automatically, I might not have noticed this relationship.

#### Example 1.26

```
. table telonuse, c(mean size)

Telon Use    mean(size)
No                  455
Yes                1056
```

It is more difficult to determine if there is an important relationship between two categorical variables. To check this, I first calculate the chi-square statistic to test for independence. From this I learn if there is a significant relationship between two categorical variables, but not the extent of the relationship. (You will learn more about the chi-square test in Chapter 6.) In Example 1.27, I am interested in the Pr value (in bold). Pr is the probability that we are making a mistake if we say that there is a relationship between two variables. If the value of Pr is less than or equal to 0.05, we can accept that there is a relationship between the two variables. Here, Pr = 0.069, so I conclude that there is no significant relationship between the two variables.

#### Example 1.27

```
. tabulate app telonuse, chi2

                    Telon Use
Application Type    No    Yes    Total
CustServ             6      0        6
MIS                  3      0        3
TransPro            16      4       20
InfServ              2      3        5
Total               27      7       34

Pearson chi2(3) = 7.0878    Pr = 0.069
```
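The chi-square statistic in Example 1.27 can be recomputed by hand from the cross-tabulation. The Python sketch below is an illustration only: the function names are hypothetical, and the p-value routine is a small closed-form helper for integer degrees of freedom rather than a general statistical library. It compares each observed count with the count expected under independence.

```python
import math


def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail(x, df):
    """P(chi-square with integer df exceeds x), built up term by term."""
    y = x / 2
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))  # exact tail for df = 1
    else:
        a, q = 1.0, math.exp(-y)             # exact tail for df = 2
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


# Rows: CustServ, MIS, TransPro, InfServ; columns: Telon use No, Yes
observed = [[6, 0], [3, 0], [16, 4], [2, 3]]
stat, df = chi2_independence(observed)
print(f"chi2({df}) = {stat:.4f}, Pr = {chi2_upper_tail(stat, df):.3f}")
```

Run on the counts of Example 1.27, this should agree with Stata's output to the precision shown. Note that several expected counts in this table are small, which is one more reason to look at the table itself rather than rely on the Pr value alone.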

If there is a significant relationship, I need to look closely at the two variables and judge for myself if they are so strongly related that there could be a problem. For example, if application type (app) and Telon use (telonuse) had been significantly related, I would first look closely at Example 1.27. There I would learn that no customer service (CustServ) or MIS application used Telon. Of the seven projects that used Telon, there is a split between transaction processing (TransPro) applications (a high-effort category; see Example 1.4) and information service (InfServ) applications (a low-effort category). Thus, the high effort for Telon use (see Example 1.5) is not due to an over-representation of high-effort transaction processing applications. In fact, the majority of projects that did not use Telon are transaction processing applications. I conclude that any relationship between Telon use and effort cannot be explained by the relationship between application type and Telon use; i.e. application type and Telon use are not confounded.

If I find any problems in the final model, I return to the step where I added the correlated/confounded variable to the variables already present in the model, take the second best choice, and rebuild the model from there. I do not carry out any further checks. The model is not valid, so there is no point. We have to start again. (See Chapter 5 for an example of confounded categorical variables.)

### Testing the Residuals

In a well-fitted model, there should be no pattern to the errors (residuals) plotted against the fitted values. The term "fitted value" refers to the leffort predicted by our model; the term "residual" is used to express the difference between the actual leffort and the predicted leffort for each project. Your statistical analysis tool should calculate the predicted values and residuals automatically for you. The errors of our model should be random. For example, we should not be consistently overestimating small efforts and underestimating large efforts. It is always a good idea to plot this relationship and take a look. If you see a pattern, it means that there is a problem with your model. If there is a problem with the final model, then try the second best model. If there is a problem with the second best model, then try the third best model, and so on. In Figure 1.12, I see no pattern in the residuals of our final model.

Figure 1.12 Residuals vs. fitted values

In addition, the residuals should be normally distributed. We can see in Figure 1.13 that they are approximately normally distributed. You will learn more about residuals in Chapter 6.

Figure 1.13 Distribution of residuals
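A plot is the best way to look for patterns in the residuals, but a crude numerical cross-check is easy to sketch. The Python fragment below (an illustration only; the function name and data are hypothetical) splits the projects at the median fitted value and compares the average residual in each half. Averages with opposite signs and noticeable size would hint at the kind of systematic over- and underestimation described above.

```python
from statistics import mean, median


def residual_bias_by_half(actual, fitted):
    """Mean residual for the lower and the upper half of the fitted values.

    A residual is the actual value minus the fitted (predicted) value, so a
    negative mean indicates overestimation on average. Assumes the fitted
    values are not all identical (otherwise the upper half would be empty).
    """
    residuals = [a - f for a, f in zip(actual, fitted)]
    cut = median(fitted)
    low = [r for r, f in zip(residuals, fitted) if f <= cut]
    high = [r for r, f in zip(residuals, fitted) if f > cut]
    return mean(low), mean(high)


# Hypothetical log-effort values showing a problematic pattern:
# small efforts are overestimated, large efforts underestimated.
fitted = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
actual = [5.5, 6.6, 7.8, 9.2, 10.4, 11.5]
low_bias, high_bias = residual_bias_by_half(actual, fitted)
print(f"mean residual, low half:  {low_bias:+.2f}")
print(f"mean residual, high half: {high_bias:+.2f}")
```

This check only catches a simple trend; curved or funnel-shaped patterns still need the plot.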

### Detecting Influential Observations

How much is our final model affected by any one project or subset of our data? If we dropped one project from our database, would our model be completely different? I certainly hope not. But we can do better than hope; we can check the model's sensitivity to individual observations. Projects with large errors (residuals) and/or projects whose values for at least one of the independent variables in the model are very different from those of the other projects (leverage) can exert undue influence on the model.

Cook's distance summarizes information about residuals and leverage into a single statistic. Cook's distance can be calculated for each project by dropping that project and re-estimating the model without it. My statistical analysis tool does this automatically. Projects with values of Cook's distance, D, greater than 4/n should be examined closely (n is the number of observations). In our example, n = 34, so we are interested in projects for which D > 0.118. I find that one project, 51, has a Cook's distance of 0.147 (Example 1.28).

#### Example 1.28

```
. list id size effort t14 cooksd if cooksd>4/34

       id    size    effort    t14      cooksd
28.    51    1526      5931      3    .1465599
```

Why do I use Cook's distance? I use it because my statistical analysis tool calculates it automatically after ANOVA procedures. Other statistics, DFITS and Welsch distance, for instance, also summarize residual and leverage information in a single value. Of course, the cut-off values are different for DFITS and Welsch distance. Do not complicate your life; use the influence statistic that your statistical analysis tool provides.

Referring back to Figure 1.8, I see that the influence of Project 51 is due to its effort being slightly low for its size compared to other projects, so it must be pulling down the regression line slightly (leverage problem). After looking closely at this project, I see no reason to drop it from the analysis. The data is valid, and given the small number of large projects we have, we cannot say that it is an atypical project. If we had more data, we could, in all likelihood, find more projects like it. In addition, 0.15 is not that far from the 0.12 cut-off value.

If a project were exerting a very high influence, I would first try to understand why. Is the project special in any way? I would look closely at the data and discuss the project with anyone who remembered it. Even if the project is not special, if the Cook's distance is more than three times larger than the cut-off value, I would drop the project and develop an alternative model using the reduced data set. Then I would compare the two models to better understand the impact of the project.
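
The two thresholds discussed above can be written down as a small decision rule. The following Python sketch is only an illustration: the function name `triage` and the returned labels are hypothetical, and it reads "three times larger than the cut-off" as D > 3 × (4/n), which is an assumption about the intended meaning.

```python
def triage(cooksd, n, drop_multiple=3.0):
    """Suggest a follow-up action for one project's Cook's distance.

    cooksd: Cook's distance, D, of the project.
    n: number of observations used to fit the model.
    drop_multiple: assumed multiple of the 4/n cut-off beyond which an
        alternative model without the project is developed.
    """
    cutoff = 4.0 / n
    if cooksd > drop_multiple * cutoff:
        return "refit without project and compare models"
    if cooksd > cutoff:
        return "examine closely"
    return "no action"

# Project 51 from Example 1.28: D = 0.1465599 with n = 34
print(triage(0.1465599, 34))  # prints "examine closely"
```

With n = 34 the two thresholds are roughly 0.118 and 0.353, so Project 51 is examined but is well below the level at which I would build an alternative model.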
