Excel's LINEST() Function Deconstructed
Microsoft Excel's LINEST() worksheet function has a long and checkered history. It is capable of returning a multiple regression analysis with up to 64 predictor variables and one outcome or "predicted" variable. (Early versions permitted up to 16 predictor variables.)
LINEST() performs quite well in most situations. It returns accurate regression coefficients and intercepts, the standard errors of the coefficients and of the intercept, and six summary statistics regarding the regression: R2, the standard error of estimate, the F ratio for the full regression, the degrees of freedom for the residual, and the sums of squares for the regression and for the residual.
But LINEST has some drawbacks, ranging from the inconvenient to the potentially disastrous.
An Inconvenient Problem
One difficulty is that the regression coefficients and their standard errors are shown in reverse order in which their associated underlying variables appear on the worksheet. See Figure 1.
Figure 1 LINEST() returns coefficients in reverse order of the worksheet.
In Figure 1, the predictor variables are years of education and years of age. Education data is in column A, and Age data is in column B. The predicted variable, Income, is in column C.
The formula that uses the LINEST() function is array-entered (with Ctrl+Shift+Enter) in the range E5:G9. The formula in this example is:
The problem is that the regression coefficient for Age is in cell E5, and the coefficient for Education is in cell F5: in left-to-right order, the coefficient for Age comes before the coefficient for Education. But in the underlying data set, the Education data (column A) precedes the Age data (column B).
(The intercept, in cell G5 in Figure 1, always appears rightmost in the LINEST() results.)
So if you wanted to use the regression equation to estimate the income of the first person in Row 2, you would need to use this formula (parentheses included for clarity only):
instead of the more natural and more easily interpreted:
With just two variables, this is a really minor issue. But with 5, 10, perhaps 20 variables, it becomes exasperating. To complete the regression equation, you need to proceed left-to-right for the variables and right-to-left for the coefficients. With 20 of each, it's tedious and error-prone.
And there is absolutely no good reason for it—statistical, theoretical or programmatic. I recognize that one could use the TREND() function instead of assembling the regression formula, coefficient by coefficient and variable by variable, but there are often times when you need to see the result of modifying one variable or coefficient; the only way to do that is to call them out separately in the full equation.
Nevertheless, this is principally a matter of convenience. The issues that I'm going to discuss in subsequent papers are more serious, particularly if you're still using a version of Excel prior to 2003.
This paper continues with a discussion of how the results provided by LINEST() can be calculated, and how you can replicate those results using Excel's native worksheet functions. A little matrix algebra is needed and it will be necessary for you to be familiar with the concepts behind the worksheet functions MMULT(), MINVERSE(), and TRANSPOSE().
Once you've seen how to replicate the LINEST() results using straightforward matrix algebra, you'll be in a position to see how Microsoft got it badly wrong when it offered LINEST()’s third option, const. That option calculates regression statistics "without the constant," also known as "forcing the intercept through zero." While the associated problems have been fixed, anyone who is still using a version of Excel prior to 2003 is in trouble if that option is selected, whether in LINEST(), TREND(), or the Regression tool in the Data Analysis add-in.
You will see in a subsequent paper how Microsoft has changed its algorithm to avoid returning a negative R2, and how it came about in the first place. This is necessary information for anyone needing to migrate a regression analysis from, say, Excel 2002 to Excel 2010, or to understand how Excel 2002's results can be so different from Excel 2010's. Even if you're using a version subsequent to Excel 2003, the problems still show up in the R2 values associated with chart trendlines.
Microsoft has also included in the code for LINEST() a method for dealing with severe multi-collinearity in the X matrix. (Multi-collinearity is just a hifalutin word for two or more predictor variables that are perfectly correlated, or virtually so.) Microsoft deserves kudos for recognizing that the problem existed. But the way that the solution is manifested in the results of LINEST() since Excel 2003 is potentially disastrous. With the information in this paper, you'll be in a position to avoid that particular LINEST()—well, call it a feature.
Assembling LINEST() Results from Other Functions
In this section, I'm going to show you how to assemble the different results you get from LINEST() using other worksheet functions. Some of these methods will be clear, even obvious. Others will seem unclear, and they aren't at all intuitively rich. But by taking things apart, I think you'll find it much easier to understand the way they work together.
Getting the Regression Coefficients
The first step is to lay out the data as shown in Figure 2.
Figure 2 Add a column that contains nothing but 1's to the range of predictor variables.
Figure 2 shows that a column containing 1's is included with the other predictor, or X, values. This column enables the matrix operations described below to calculate an intercept and its standard error. Although you don't see that column of 1's when you run LINEST() directly on your input data, Excel adds it (invisibly) on your behalf.
Getting the Sum of Squares and Cross Products (SSCP)
You'll need access to what's called the transpose of the data in B3:E22. You can do that explicitly on the worksheet using Excel's TRANSPOSE() function. In Figure 2, the range H2:AA5 contains this array formula:
(Recall that you enter an array formula using Ctrl+Shift+Enter instead of simply Enter.)
With those two matrices set up, you can get what's called the sum of squares and cross-products matrix, often called the SSCP matrix. Use this array formula:
If you don't want to bother putting the transpose of the X matrix directly on the worksheet, you could use this array formula instead to get the SSCP matrix:
Excel's MMULT() function performs matrix multiplication. Here, the transpose of the X matrix (B3:E22) is post-multiplied by the X matrix.
Getting the Inverse of the SSCP Matrix
The next step is to get the inverse of the SSCP matrix. A matrix's inverse is analogous to an inverse in simple arithmetic. The inverse of the number 4 is 1/4: When you multiply a number by its inverse, you get 1.
Similarly, when you multiply a matrix by its inverse, you get a new matrix with 1's in its main diagonal and 0's everywhere else. Figure 3 shows the SSCP matrix in G3:J6, its inverse in G10:J13, and the result of the multiplication of the two matrices in L10:O13.
Figure 3 The matrix in L10:O13 is called an identity matrix.
Calculating the Regression Coefficients and Intercept
I mentioned earlier that much of the derivation of the results that LINEST() returns is not intuitively rich. The inverse of the SSCP matrix is an example of that. There's much information buried in the matrix inverse, but no flash of intuition will tell you that it's hidden there, or even why it's there. For example, see Figure 4.
Figure 4 The SSCP matrix and its inverse, combined with the X and Y matrices, return the regression coefficients and the intercept.
In Figure 4, notice the range G18:J18. It contains this array formula:
In words, the formula uses matrix multiplication via the MMULT() function to combine the transposed X matrix (B3:E32) with the Y matrix (A3:A32) with the inverse of the SSCP matrix (G10:J13). The result in G18:J18 is the intercept (G18) and the regression coefficients (H18:J18). The coefficients are in the same order that the underlying values appear on the worksheet—that is, columns C, D, and E contain the values for variables X1, X2, and X3, respectively, and cells H18, I18, and J18 contain the associated regression coefficients.
Cells G21:J21 contain the first row of the LINEST() results for the same underlying data set (except that the 1's in column B are omitted from the LINEST() arguments because LINEST() supplies them for you). Notice that the values for the intercept and the coefficients are identical to those in row 18. The only difference is that LINEST() has returned them out of order.
In sum, to get the intercept and regression coefficients using matrix algebra instead of using LINEST(), take the following general steps:
- Get the SSCP matrix using X'X. Use MMULT() and TRANSPOSE() to postmultiply the transpose of the X matrix by the X matrix.
- Use MINVERSE() to calculate the inverse of the SSCP matrix.
- Use the array formula given above and repeated here to calculate the intercept and coefficients:
Getting the Sum of Squares Regression and Residual
It probably seems a little perverse to go from the calculation of regression coefficients to sums of squares, skipping over standard errors, R2, F tests, and so on. But you need the sums of squares to calculate those other statistics.
Before getting to the matter of calculating the sums of squares, it's helpful to review the meaning of the sum of squares regression and the sum of squares residual.
A sum of squares, in most statistical contexts, is the sum of the squares of the differences (or deviations) between individual values and the mean of the values. So if our values are 2 and 4, the mean is 3. 2 – 3 is -1, and the squared deviation is +1. 4 – 3 is 1, and the squared deviation is +1. Therefore, the sum of squares is 1 + 1 or 2.
Our purpose in calculating those two sums of squares is to divide the total sum of squares into two parts:
- The sum of squares regression is the sum of the squared deviations of the Y values that are predicted by the regression coefficients and intercept, from the mean of the predicted values.
- The sum of squares residual is the sum of the squared deviations of the differences between the actual Y values and the predicted Y values, from the mean of those deviations.
Calculating the Predicted Values
Those two definitions of sums of squares are fairly dense when written in English. It’s usually easier to understand what's going on if you think about them in the context of an Excel worksheet. See Figure 5.
Figure 5 Calculating the sums of squares
In Figure 5, I have repeated the regression coefficients and the intercept, as calculated using the matrix algebra discussed earlier, in the range G3:J3. Because they appear in the correct order, you can easily use them to calculate the predicted Y values as shown in the range L3:L22. This is the formula that's used in cell L3:
The intercept and coefficients in G3:J3 are identified using dollar signs and therefore absolute addressing. The X values in C3:E3 are identified using relative addressing. Therefore, you can drag and drop or copy and paste from cell L3 into the range L4:L22.
Just as a check, Figure 5 also shows the predicted Y values in M3:M22, using this array formula in that range:
You'll note that the predicted values using matrix algebra are identical to the predicted values using TREND(). There are actually slight differences, but they do not begin to show up until the 14th decimal place. (For example, the difference between cell L8 and cell M8 is 0.000000000000057.)
Calculating the Prediction Errors
The values shown in Figure 5, in the range O3:O22, are the errors in the predicted values. They are simply the differences between the actual Y values in A3:A22 and the predicted values in L3:L22. So, for example, the formula in cell O3 is =A3-L3.
Calculating the Sums of Squares
With the predicted values and the errors of prediction, we're in a position to calculate the sums of squares. The sum of squares regression is found with this formula in cell G24:
and the sum of squares residual is found with a similar formula in cell H24:
Notice that the two sums of squares total to 21612.95. This is the same value as appears in cell G26. The formula in G26 is:
which is the sum of the squared deviations of the original Y values. So, the process described in this section has accomplished the following:
- Predicted Y values on the basis of the combination of the X values and the regression coefficients and intercept.
- Obtained the sum of squared deviations of the predicted Y values (the sum of squares regression).
- Calculated the errors of prediction by subtracting the predicted Y values from the actual Y values.
- Obtained the sum of squared deviations of the errors of prediction (the sum of squares residual).
- Demonstrated that the total sum of squares of the actual Y values has been divided into two portions: the sum of squares regression and the sum of squares residual.
Calculating the Regression Diagnostics
Now that we have the sum of squares regression and the sum of squares residual, it's easy to get the results that help you diagnose the accuracy of the regression equation.
The R2 is simply the proportion of variability in the Y values that can be attributed to variability in the best combination of the X variables. That best combination is the result of applying the regression coefficients to the X variables—that is, the best combination is represented by the predicted Y values.
Therefore, the R2 is calculated by this ratio:
(Sum of Squares Regression) / (Sum of Squares Total)
Because the sum of squares total is the sum of the regression and the residual sums of squares, you can easily calculate R2 on the worksheet as shown in Figure 6.
Figure 6 Calculating the goodness-of-fit statistics
In Figure 6, cell G14 contains this formula:
which returns the ratio of the regression sum of squares to the total sum of squares.
Calculating the Standard Error of Estimate
At this point, you need to keep in mind the way that you’ve set up your inputs. In Figure 6, I’ve set things up so that the column of 1's is shown explicitly on the worksheet. That's because the column is needed if you're going to obtain the value of the regression equation’s intercept by means of matrix algebra—the instances of MMULT(),TRANSPOSE(), and MINVERSE() that I’ve discussed in this paper.
In that case—if you're showing the column of 1's explicitly—you get the degrees of freedom for the sum of squares residual by subtracting the number of X variables on the worksheet from the number of observations, or rows, in the matrix of X values.
In the example shown in Figure 6, the number of observations is 20, found in rows 3 through 22. The number of variables is 4, found in columns B through E. Therefore, the number of degrees of freedom for the sum of squares residual is 16: 20-4. You can confirm this from the LINEST() results in Figure 6, cells G6:J10, where the degrees of freedom shows up in cell H9.
On the other hand, if you want to use LINEST() directly, you don't need to supply the column of 1's on the worksheet: Excel supplies the 1's for you and you never see them. But if you're going to determine the degrees of freedom residual for yourself, then subtract the number of X variables (in this case, 3) from the number of observations (20) and then subtract 1 from the result to get 16.
In fact, you'll find that most intermediate statistics texts tell you that the degrees of freedom for the residual sum of squares is N-k-1, where N is the number of observations, k is the number of predictor variables, and 1 is for the column of 1's that you never see unless you arrange for them yourself.
So, to get the standard error of estimate, divide the sum of squares residual by the degrees of freedom for the residual, and take the square root of the result. The formula used in cell G15 of Figure 6 is:
The result is identical to that provided in the LINEST() results in cell H8.
Calculating the F Ratio for the Regression
There are a couple of ways to go about calculating the F ratio for the full regression. Both involve using the degrees of freedom for the residual and the degrees of freedom for the regression.
The prior section discussed how to get the degrees of freedom for the residual. The degrees of freedom for the regression is the number of X variables minus 1. So, if you have supplied the column of 1's explicitly on the worksheet, as in Figure 6, there are four X variables, and the degrees of freedom for the regression is 3.
If you use LINEST() and do not supply a column of 1's to it as an X variable—because Excel does that on your behalf—you still have four X variables; it's just that you're not looking at one of them. So the degrees of freedom for the regression is still 4-1, or 3, in this example.
One way to calculate the F ratio is to use the R2 value. Figure 6 does that in cell G17, where the formula is:
In words, the numerator is the R2 value divided by the regression degrees of freedom. The denominator is (1 – R2) divided by the residual degrees of freedom.
Another way uses the sums of squares instead of the R2 value. It's mathematically equivalent because we use the sums of squares to calculate the R2 value. The formula used in cell G18 of Figure 6 is:
The numerator is the sum of squares regression divided by its degrees of freedom. The denominator is the sum of squares residual divided by its degrees of freedom.
You may know that a sum of squared deviations divided by its degrees of freedom is a variance, often termed a mean square. That's what we have in cell G18: one variance divided by another. And the ratio of two variances is an F ratio.
Here, we have the variance of the Y scores as predicted by the regression equation, divided by the variance of the errors in those predictions. If the resulting ratio is meaningfully larger than 1.0, we regard the regression as a reliable one: an outcome that we expect to be similar if we repeat this research with a different but similarly obtained sample of observations. And you can test the reliability of the observed F ratio by using Excel's F.DIST() function.
Getting the Standard Errors
The final task in deconstructing the LINEST() function is to calculate the values of the standard errors of the intercept and the regression coefficients. These values are returned in the second row of the LINEST() results. Figure 7 shows the required calculations.
Figure 7 Calculating the standard errors
Figure 7 shows the SSCP matrix and its inverse, shown earlier in Figure 4. To get the standard errors of the regression coefficients and the intercept, we need to multiply the inverse of the SSCP matrix by the mean square for the residual.
Figure 7 shows the inverse of the SSCP matrix in cells G12:J15.
The prior section showed how to calculate the mean square residual: simply divide the sum of squares residual by the residual degrees of freedom. Figure 7 does that for this example in cell M14, using this formula:
Note that L14 contains the sum of squares residual, and 16 is the degrees of freedom for the residual.
The matrix shown in Figure 7, cells G18:J21, is the result of multiplying the inverse of the SSCP matrix by the mean square residual. The array formula is:
The square roots of the elements in the main diagonal of the matrix in G18:J21 are the standard errors for the regression equation. They are shown in Figure 7, in cells G24:J24. The formulas are as follows:
The relevant portion of the LINEST() results is also shown in Figure 7, in cells L24:O24. Note that the values in that range are identical to those in G24:J24, but of course LINEST() returns them in reverse of the order in which the original variables are entered on the worksheet.