
Excel's LINEST() Function Deconstructed

Microsoft Excel has for many years included a worksheet function called LINEST(), which returns a multiple regression analysis of a single outcome or predicted variable on one or more predictor variables. LINEST() returns a regression equation, standard errors of regression coefficients, and goodness-of-fit statistics. In the first of three articles, Excel expert Conrad Carlberg, author of Predictive Analytics: Microsoft Excel, discusses issues regarding LINEST() that have not been covered sufficiently, or even accurately, in the past.

Microsoft Excel's LINEST() worksheet function has a long and checkered history. It is capable of returning a multiple regression analysis with up to 64 predictor variables and one outcome or "predicted" variable. (Early versions permitted up to 16 predictor variables.)

LINEST() performs quite well in most situations. It returns accurate regression coefficients and intercepts, the standard errors of the coefficients and of the intercept, and six summary statistics regarding the regression: R2, the standard error of estimate, the F ratio for the full regression, the degrees of freedom for the residual, and the sums of squares for the regression and for the residual.

But LINEST() has some drawbacks, ranging from the inconvenient to the potentially disastrous.

An Inconvenient Problem

One difficulty is that the regression coefficients and their standard errors are returned in the reverse of the order in which their associated underlying variables appear on the worksheet. See Figure 1.

Figure 1 LINEST() returns coefficients in reverse order of the worksheet.

In Figure 1, the predictor variables are years of education and years of age. Education data is in column A, and Age data is in column B. The predicted variable, Income, is in column C.

The formula that uses the LINEST() function is array-entered (with Ctrl+Shift+Enter) in the range E5:G9. The formula in this example is:

=LINEST(C2:C21,A2:B21,TRUE,TRUE)

The problem is that the regression coefficient for Age is in cell E5, and the coefficient for Education is in cell F5: in left-to-right order, the coefficient for Age comes before the coefficient for Education. But in the underlying data set, the Education data (column A) precedes the Age data (column B).

(The intercept, in cell G5 in Figure 1, always appears rightmost in the LINEST() results.)

So if you wanted to use the regression equation to estimate the income of the first person in Row 2, you would need to use this formula (parentheses included for clarity only):

=(E5*B2)+(F5*A2)+G5

instead of the more natural and more easily interpreted:

=(E5*A2)+(F5*B2)+G5

With just two variables, this is a really minor issue. But with 5, 10, perhaps 20 variables, it becomes exasperating. To complete the regression equation, you need to proceed left-to-right for the variables and right-to-left for the coefficients. With 20 of each, it's tedious and error-prone.

And there is absolutely no good reason for it—statistical, theoretical, or programmatic. I recognize that you could use the TREND() function instead of assembling the regression equation coefficient by coefficient and variable by variable, but there are times when you need to see the result of modifying one variable or coefficient; the only way to do that is to call them out separately in the full equation.

Nevertheless, this is principally a matter of convenience. The issues that I'm going to discuss in subsequent papers are more serious, particularly if you're still using a version of Excel prior to 2003.

This paper continues with a discussion of how the results provided by LINEST() can be calculated, and how you can replicate those results using Excel's native worksheet functions. A little matrix algebra is needed, and you'll need to be familiar with the concepts behind the worksheet functions MMULT(), MINVERSE(), and TRANSPOSE().

Once you've seen how to replicate the LINEST() results using straightforward matrix algebra, you'll be in a position to see how Microsoft got it badly wrong when it offered LINEST()’s third option, const. That option calculates regression statistics "without the constant," also known as "forcing the intercept through zero." While the associated problems have been fixed, anyone who is still using a version of Excel prior to 2003 is in trouble if that option is selected, whether in LINEST(), TREND(), or the Regression tool in the Data Analysis add-in.

You will see in a subsequent paper how Microsoft has changed its algorithm to avoid returning a negative R2, and how it came about in the first place. This is necessary information for anyone needing to migrate a regression analysis from, say, Excel 2002 to Excel 2010, or to understand how Excel 2002's results can be so different from Excel 2010's. Even if you're using a version subsequent to Excel 2003, the problems still show up in the R2 values associated with chart trendlines.

Microsoft has also included in the code for LINEST() a method for dealing with severe multi-collinearity in the X matrix. (Multi-collinearity is just a hifalutin word for two or more predictor variables that are perfectly correlated, or virtually so.) Microsoft deserves kudos for recognizing that the problem existed. But the way that the solution is manifested in the results of LINEST() since Excel 2003 is potentially disastrous. With the information in this paper, you'll be in a position to avoid that particular LINEST()—well, call it a feature.

Assembling LINEST() Results from Other Functions

In this section, I'm going to show you how to assemble the different results you get from LINEST() using other worksheet functions. Some of these methods will be clear, even obvious. Others will seem obscure, because they aren't at all intuitively rich. But by taking things apart, I think you'll find it much easier to understand how they work together.

Getting the Regression Coefficients

The first step is to lay out the data as shown in Figure 2.

Figure 2 Add a column that contains nothing but 1's to the range of predictor variables.

Figure 2 shows that a column containing 1's is included with the other predictor, or X, values. This column enables the matrix operations described below to calculate an intercept and its standard error. Although you don't see that column of 1's when you run LINEST() directly on your input data, Excel adds it (invisibly) on your behalf.

Getting the Sum of Squares and Cross Products (SSCP)

You'll need access to what's called the transpose of the data in B3:E22. You can get it explicitly on the worksheet using Excel's TRANSPOSE() function. In Figure 2, the range H2:AA5 contains this array formula:

=TRANSPOSE(B3:E22)

(Recall that you enter an array formula using Ctrl+Shift+Enter instead of simply Enter.)

With those two matrices set up, you can get what's called the sum of squares and cross-products matrix, often called the SSCP matrix. Use this array formula:

=MMULT(H2:AA5,B3:E22)

If you don't want to bother putting the transpose of the X matrix directly on the worksheet, you could use this array formula instead to get the SSCP matrix:

=MMULT(TRANSPOSE(B3:E22),B3:E22)

Excel's MMULT() function performs matrix multiplication. Here, the transpose of the X matrix (B3:E22) is post-multiplied by the X matrix.
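
If you'd like to verify the arithmetic outside Excel, here is a minimal numpy sketch of the same step. The data are hypothetical stand-ins for the worksheet ranges in Figure 2 (20 rows, a column of 1's, and three X variables), not the actual values from the example:

import numpy as np

# Hypothetical stand-in for B3:E22 in Figure 2: a column of 1's
# followed by three predictor variables, 20 observations each.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])

# The SSCP matrix: X'X, the analogue of
# =MMULT(TRANSPOSE(B3:E22), B3:E22)
sscp = X.T @ X
print(sscp.shape)  # (4, 4): one row and one column per column of X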

Getting the Inverse of the SSCP Matrix

The next step is to get the inverse of the SSCP matrix. A matrix's inverse is analogous to an inverse in simple arithmetic. The inverse of the number 4 is 1/4: When you multiply a number by its inverse, you get 1.

Similarly, when you multiply a matrix by its inverse, you get a new matrix with 1's in its main diagonal and 0's everywhere else. Figure 3 shows the SSCP matrix in G3:J6, its inverse in G10:J13 (array-entered there as =MINVERSE(G3:J6)), and the result of the multiplication of the two matrices in L10:O13.

Figure 3 The matrix in L10:O13 is called an identity matrix.
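
Continuing the hypothetical numpy sketch, the analogue of MINVERSE() is numpy's matrix inverse, and multiplying the SSCP matrix by its inverse reproduces the identity matrix shown in Figure 3:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
sscp = X.T @ X

# The analogue of =MINVERSE() applied to the SSCP range
sscp_inv = np.linalg.inv(sscp)

# SSCP times its inverse is an identity matrix: 1's on the
# main diagonal, 0's everywhere else (within rounding error).
print(np.allclose(sscp @ sscp_inv, np.eye(4)))  # True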

Calculating the Regression Coefficients and Intercept

I mentioned earlier that much of the derivation of the results that LINEST() returns is not intuitively rich. The inverse of the SSCP matrix is an example of that. There's much information buried in the matrix inverse, but no flash of intuition will tell you that it's hidden there, or even why it's there. For example, see Figure 4.

Figure 4 The SSCP matrix and its inverse, combined with the X and Y matrices, return the regression coefficients and the intercept.

In Figure 4, notice the range G18:J18. It contains this array formula:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))

In words, the formula uses matrix multiplication via the MMULT() function to postmultiply the transposed X matrix (B3:E22) by the Y matrix (A3:A22), and then to premultiply the result by the inverse of the SSCP matrix (G10:J13). The result in G18:J18 is the intercept (G18) and the regression coefficients (H18:J18). The coefficients are in the same order that the underlying values appear on the worksheet—that is, columns C, D, and E contain the values for variables X1, X2, and X3, respectively, and cells H18, I18, and J18 contain the associated regression coefficients.

Cells G21:J21 contain the first row of the LINEST() results for the same underlying data set (except that the 1's in column B are omitted from the LINEST() arguments because LINEST() supplies them for you). Notice that the values for the intercept and the coefficients are identical to those in row 18. The only difference is that LINEST() has returned them out of order.

In sum, to get the intercept and regression coefficients using matrix algebra instead of using LINEST(), take the following general steps:

  1. Get the SSCP matrix using X'X. Use MMULT() and TRANSPOSE() to postmultiply the transpose of the X matrix by the X matrix.
  2. Use MINVERSE() to calculate the inverse of the SSCP matrix.
  3. Use the array formula given above and repeated here to calculate the intercept and coefficients:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))
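
In the same hypothetical numpy sketch, those three steps collapse into a single expression, and the results come back in natural order: the intercept first, then the coefficients in the same left-to-right order as their X columns:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)   # hypothetical stand-in for A3:A22

# b = (X'X)^-1 X'y, the matrix-algebra version of the array formula
b = np.linalg.inv(X.T @ X) @ X.T @ y

# b[0] is the intercept; b[1:] are the coefficients for X1, X2, X3,
# in the same left-to-right order as their columns.
print(b)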

Getting the Sum of Squares Regression and Residual

It probably seems a little perverse to go from the calculation of regression coefficients to sums of squares, skipping over standard errors, R2, F tests, and so on. But you need the sums of squares to calculate those other statistics.

Before getting to the matter of calculating the sums of squares, it's helpful to review the meaning of the sum of squares regression and the sum of squares residual.

A sum of squares, in most statistical contexts, is the sum of the squares of the differences (or deviations) between individual values and the mean of the values. So if our values are 2 and 4, the mean is 3. 2 – 3 is -1, and the squared deviation is +1. 4 – 3 is 1, and the squared deviation is +1. Therefore, the sum of squares is 1 + 1 or 2.
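
That arithmetic is exactly what Excel's DEVSQ() function (used later in this section) automates. A quick check in numpy:

import numpy as np

vals = np.array([2.0, 4.0])
ss = ((vals - vals.mean()) ** 2).sum()   # (2-3)^2 + (4-3)^2
print(ss)  # 2.0, matching =DEVSQ({2,4}) in Excel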

Our purpose in calculating those two sums of squares is to divide the total sum of squares into two parts:

  • The sum of squares regression is the sum of the squared deviations of the Y values that are predicted by the regression coefficients and intercept, from the mean of the predicted values.
  • The sum of squares residual is the sum of the squared deviations of the differences between the actual Y values and the predicted Y values, from the mean of those deviations.

Calculating the Predicted Values

Those two definitions of sums of squares are fairly dense when written in English. It’s usually easier to understand what's going on if you think about them in the context of an Excel worksheet. See Figure 5.

Figure 5 Calculating the sums of squares

In Figure 5, I have repeated the regression coefficients and the intercept, as calculated using the matrix algebra discussed earlier, in the range G3:J3. Because they appear in the correct order, you can easily use them to calculate the predicted Y values as shown in the range L3:L22. This is the formula that's used in cell L3:

=$G$3+SUMPRODUCT(C3:E3,$H$3:$J$3)

The intercept and coefficients in G3:J3 are identified using dollar signs and therefore absolute addressing. The X values in C3:E3 are identified using relative addressing. Therefore, you can drag and drop or copy and paste from cell L3 into the range L4:L22.

Just as a check, Figure 5 also shows the predicted Y values in M3:M22, using this array formula in that range:

=TREND(A3:A22,C3:E22)

You'll note that the predicted values from the matrix algebra match the predicted values from TREND(). There are actually slight differences, but they don't begin to show up until the 14th decimal place. (For example, the difference between cell L8 and cell M8 is 0.000000000000057.)
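
In the hypothetical numpy sketch, the predicted values are just the X matrix postmultiplied by the vector of coefficients, the same computation that TREND() carries out for you:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
b = np.linalg.inv(X.T @ X) @ X.T @ y

# Predicted Y values: the analogue of the SUMPRODUCT() formula
# copied down L3:L22, or of =TREND() applied to the same data.
y_hat = X @ b
print(y_hat[:3])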

Calculating the Prediction Errors

The values shown in Figure 5, in the range O3:O22, are the errors in the predicted values. They are simply the differences between the actual Y values in A3:A22 and the predicted values in L3:L22. So, for example, the formula in cell O3 is =A3-L3.

Calculating the Sums of Squares

With the predicted values and the errors of prediction, we're in a position to calculate the sums of squares. The sum of squares regression is found with this formula in cell G24:

=DEVSQ(L3:L22)

and the sum of squares residual is found with a similar formula in cell H24:

=DEVSQ(O3:O22)

Notice that the two sums of squares total to 21612.95. This is the same value as appears in cell G26. The formula in G26 is:

=DEVSQ(A3:A22)

which is the sum of the squared deviations of the original Y values. So, the process described in this section has accomplished the following:

  • Predicted Y values on the basis of the combination of the X values and the regression coefficients and intercept.
  • Obtained the sum of squared deviations of the predicted Y values (the sum of squares regression).
  • Calculated the errors of prediction by subtracting the predicted Y values from the actual Y values.
  • Obtained the sum of squared deviations of the errors of prediction (the sum of squares residual).
  • Demonstrated that the total sum of squares of the actual Y values has been divided into two portions: the sum of squares regression and the sum of squares residual.
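
Here is that same sequence of steps in the hypothetical numpy sketch, with DEVSQ() written out as a sum of squared deviations from the mean:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
b = np.linalg.inv(X.T @ X) @ X.T @ y
y_hat = X @ b

def devsq(a):
    # Sum of squared deviations from the mean: Excel's DEVSQ()
    return ((a - a.mean()) ** 2).sum()

ss_reg = devsq(y_hat)        # sum of squares regression
ss_res = devsq(y - y_hat)    # sum of squares residual
ss_tot = devsq(y)            # total sum of squares

# The two components account for the total (within rounding error).
print(np.isclose(ss_reg + ss_res, ss_tot))  # True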

Calculating the Regression Diagnostics

Now that we have the sum of squares regression and the sum of squares residual, it's easy to get the results that help you diagnose the accuracy of the regression equation.

Calculating R2

The R2 is simply the proportion of variability in the Y values that can be attributed to variability in the best combination of the X variables. That best combination is the result of applying the regression coefficients to the X variables—that is, the best combination is represented by the predicted Y values.

Therefore, the R2 is calculated by this ratio:

     (Sum of Squares Regression) / (Sum of Squares Total)

Because the sum of squares total is the sum of the regression and the residual sums of squares, you can easily calculate R2 on the worksheet as shown in Figure 6.

Figure 6 Calculating the goodness-of-fit statistics

In Figure 6, cell G14 contains this formula:

=G12/(G12+H12)

which returns the ratio of the regression sum of squares to the total sum of squares.
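
In the numpy sketch, that ratio is a single line:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
b = np.linalg.inv(X.T @ X) @ X.T @ y
y_hat = X @ b
devsq = lambda a: ((a - a.mean()) ** 2).sum()
ss_reg, ss_res = devsq(y_hat), devsq(y - y_hat)

# R-squared: the proportion of total variability attributable to
# the regression; the analogue of =G12/(G12+H12)
r2 = ss_reg / (ss_reg + ss_res)
print(r2)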

Calculating the Standard Error of Estimate

At this point, you need to keep in mind the way that you've set up your inputs. In Figure 6, I've set things up so that the column of 1's is shown explicitly on the worksheet. That's because the column is needed if you're going to obtain the value of the regression equation's intercept by means of matrix algebra—the instances of MMULT(), TRANSPOSE(), and MINVERSE() that I've discussed in this paper.

In that case—if you're showing the column of 1's explicitly—you get the degrees of freedom for the sum of squares residual by subtracting the number of X variables on the worksheet from the number of observations, or rows, in the matrix of X values.

In the example shown in Figure 6, the number of observations is 20, found in rows 3 through 22. The number of variables is 4, found in columns B through E. Therefore, the number of degrees of freedom for the sum of squares residual is 16: 20-4. You can confirm this from the LINEST() results in Figure 6, cells G6:J10, where the degrees of freedom shows up in cell H9.

On the other hand, if you want to use LINEST() directly, you don't need to supply the column of 1's on the worksheet: Excel supplies the 1's for you and you never see them. But if you're going to determine the degrees of freedom residual for yourself, then subtract the number of X variables (in this case, 3) from the number of observations (20) and then subtract 1 from the result to get 16.

In fact, you'll find that most intermediate statistics texts tell you that the degrees of freedom for the residual sum of squares is N-k-1, where N is the number of observations, k is the number of predictor variables, and 1 is for the column of 1's that you never see unless you arrange for them yourself.

So, to get the standard error of estimate, divide the sum of squares residual by the degrees of freedom for the residual, and take the square root of the result. The formula used in cell G15 of Figure 6 is:

=SQRT(H12/16)

The result is identical to that provided in the LINEST() results in cell H8.
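
The numpy version of the same calculation, again on the hypothetical data rather than the figures in the example:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
b = np.linalg.inv(X.T @ X) @ X.T @ y

# Raw sum of squared residuals; equal to DEVSQ() of the residuals,
# because their mean is zero when an intercept is fitted.
ss_res = ((y - X @ b) ** 2).sum()

n, k = X.shape            # 20 observations, 4 columns (1's included)
df_res = n - k            # 16, as in cell H9 of the LINEST() results
se_est = np.sqrt(ss_res / df_res)   # the analogue of =SQRT(H12/16)
print(df_res, se_est)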

Calculating the F Ratio for the Regression

There are a couple of ways to go about calculating the F ratio for the full regression. Both involve using the degrees of freedom for the residual and the degrees of freedom for the regression.

The prior section discussed how to get the degrees of freedom for the residual. The degrees of freedom for the regression is the number of X variables minus 1. So, if you have supplied the column of 1's explicitly on the worksheet, as in Figure 6, there are four X variables, and the degrees of freedom for the regression is 3.

If you use LINEST() and do not supply a column of 1's to it as an X variable—because Excel does that on your behalf—you still have four X variables; it's just that you're not looking at one of them. So the degrees of freedom for the regression is still 4-1, or 3, in this example.

One way to calculate the F ratio is to use the R2 value. Figure 6 does that in cell G17, where the formula is:

=(G14/3)/((1-G14)/16)

In words, the numerator is the R2 value divided by the regression degrees of freedom. The denominator is (1 – R2) divided by the residual degrees of freedom.

Another way uses the sums of squares instead of the R2 value. It's mathematically equivalent because we use the sums of squares to calculate the R2 value. The formula used in cell G18 of Figure 6 is:

=(G12/3)/(H12/16)

The numerator is the sum of squares regression divided by its degrees of freedom. The denominator is the sum of squares residual divided by its degrees of freedom.

You may know that a sum of squared deviations divided by its degrees of freedom is a variance, often termed a mean square. That's what we have in cell G18: one variance divided by another. And the ratio of two variances is an F ratio.

Here, we have the variance of the Y scores as predicted by the regression equation, divided by the variance of the errors in those predictions. If the resulting ratio is meaningfully larger than 1.0, we regard the regression as a reliable one: an outcome that we expect to be similar if we repeat this research with a different but similarly obtained sample of observations. And you can test the reliability of the observed F ratio by using Excel's F.DIST() function.
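
Both routes to the F ratio look like this in the hypothetical numpy sketch (assuming SciPy is available; its stats.f.sf() returns the right-tail probability that you would otherwise look up with Excel's F distribution functions):

import numpy as np
from scipy import stats   # assumed available for the F distribution

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
b = np.linalg.inv(X.T @ X) @ X.T @ y
y_hat = X @ b
devsq = lambda a: ((a - a.mean()) ** 2).sum()
ss_reg, ss_res = devsq(y_hat), devsq(y - y_hat)
df_reg, df_res = X.shape[1] - 1, X.shape[0] - X.shape[1]   # 3 and 16

# Route 1: from R-squared, as in =(G14/3)/((1-G14)/16)
r2 = ss_reg / (ss_reg + ss_res)
f_from_r2 = (r2 / df_reg) / ((1 - r2) / df_res)

# Route 2: from the mean squares, as in =(G12/3)/(H12/16)
f_from_ss = (ss_reg / df_reg) / (ss_res / df_res)

# Right-tail probability of the observed F ratio
p = stats.f.sf(f_from_ss, df_reg, df_res)
print(f_from_r2, f_from_ss, p)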

Getting the Standard Errors

The final task in deconstructing the LINEST() function is to calculate the values of the standard errors of the intercept and the regression coefficients. These values are returned in the second row of the LINEST() results. Figure 7 shows the required calculations.

Figure 7 Calculating the standard errors

Figure 7 repeats the SSCP matrix and its inverse, shown earlier in Figure 4. To get the standard errors of the regression coefficients and the intercept, we need to multiply the inverse of the SSCP matrix by the mean square for the residual.

Figure 7 shows the inverse of the SSCP matrix in cells G12:J15.

The prior section showed how to calculate the mean square residual: simply divide the sum of squares residual by the residual degrees of freedom. Figure 7 does that for this example in cell M14, using this formula:

=L14/16

Note that L14 contains the sum of squares residual, and 16 is the degrees of freedom for the residual.

The matrix shown in Figure 7, cells G18:J21, is the result of multiplying the inverse of the SSCP matrix by the mean square residual. The array formula is:

=G12:J15*M14

The square roots of the elements in the main diagonal of the matrix in G18:J21 are the standard errors of the intercept and the regression coefficients. They are shown in Figure 7, in cells G24:J24. The formulas are as follows:

G24: =SQRT(G18)

H24: =SQRT(H19)

I24: =SQRT(I20)

J24: =SQRT(J21)

The relevant portion of the LINEST() results is also shown in Figure 7, in cells L24:O24. Note that the values in that range are identical to those in G24:J24, but of course LINEST() returns them in reverse of the order in which the original variables are entered on the worksheet.
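
As a final check on the deconstruction, here is the numpy version of this last step, once more on the hypothetical data from the earlier sketches:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
sscp_inv = np.linalg.inv(X.T @ X)
b = sscp_inv @ X.T @ y
ss_res = ((y - X @ b) ** 2).sum()
ms_res = ss_res / (X.shape[0] - X.shape[1])   # mean square residual

# Multiply the inverse of the SSCP matrix by the mean square
# residual; the square roots of the main diagonal are the standard
# errors, in worksheet order: intercept first, then X1, X2, X3.
se = np.sqrt(np.diag(ms_res * sscp_inv))
print(se)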
