Home > Articles > Home & Office Computing > Microsoft Applications

This chapter is from the book

Linear Regression with Coded Vectors

Another method for dealing with seasonal time series, whether trended or not, employs linear regression in combination with effect coding. Effect coding is just a way of representing which season a particular observation belongs to, and Excel makes it particularly easy to set up.

The two broad methods that this chapter has already discussed, those of simple averages and of moving averages, also employ linear regression. Figure 5.5 shows how the simple averages method uses regression to calculate a year-to-year trend and distribute that trend equally across the quarters. Figure 5.7 shows how the same method uses regression to forecast from deseasonalized quarters before adding seasonal effects back in. (It would have been feasible to use Holt’s method in Figure 5.7 instead of linear regression.)

Figure 5.14 shows how you can use linear regression just as is done in Figure 5.7, to forecast from deseasonalized observations prior to reseasonalizing them using the seasonal indexes. Again, you could use Holt’s method instead of regression.

If you combine effect coding (or, as you’ll see, dummy coding) with linear regression, you can get the forecasts of a seasonal time series in one step. It’s not necessary to jump through the hoops of calculating averages for each season and for an encompassing period such as a year, deviate them from the grand mean, normalize them, deseasonalize the time series, forecast a trended series, and finally add the seasonal indexes back in. Nor is it necessary to do anything fancy with LINEST() or TREND(). It’s all in the way you indicate the season in the time series.

About Effect Coding and Regression

In the 1960s, when computers were becoming more and more accessible in colleges and universities and in business settings, regression analysis began to supplant an older statistical method, the analysis of variance, or ANOVA. That method, which first saw light in a 1918 paper by Sir Ronald Fisher, helps the researcher draw inferences about the reliability of an empirical finding. For example: “If I repeat this experiment 100 times, how often can I expect to find differences between groups as large as I just saw, assuming that this time around those differences were due to sampling error?”

ANOVA is one way of expressing something called the General Linear Model, in itself a way of describing contributions to numeric observations. Suppose that you were investigating the distribution of C reactive protein (CRP) in humans. Besides the measured level of that protein for each person in a sample, you might have information about each person’s sex, and whether each person has coronary artery disease.

The General Linear Model would tell you that each person’s CRP can be expressed by summing the following terms:

  • The mean level of CRP in your entire sample
  • The elevation or depression from the overall mean in CRP for males, or, if the person is female, for females
  • The elevation or depression from the overall mean in CRP for people with coronary disease; otherwise, for those without such disease
  • The elevation or depression for males with and without coronary disease, and for females with and without coronary disease
  • An error term that measures the difference between a person’s actual measured CRP level and the sum of the other effects (grand mean, elevation or depression specific to each sex, elevation or depression due to coronary disease, interaction between sex and coronary disease)

The categories in ANOVA (in this example, sex and the presence or absence of coronary disease) are traditionally handled separately: Various calculations are performed for the males’ data, and performed separately from the females’ data; and similarly for the presence and absence of coronary disease. In this way, ANOVA can deal with nominal, categorical information.

Regression can’t—or, rather, doesn’t—work directly with nominal information such as “Male or “Has atherosclerosis.” With regression, you use one of several available methods to code those categories numerically. Then, you present those codes along with CRP measures to some function such as Excel’s LINEST() or TREND() functions to handle everything at once. Most, perhaps even the overwhelming majority, of analyses that contrast different groups and subgroups since the 1990s have used a regression approach.

This was not true before significant computing power became readily available, before capabilities that once required mainframes became available on the desktop. Or laptop. Or wristwatch.

Researchers had to wait for that computing power because regression analysis requires a lot more in the way of exacting calculations than does ANOVA. It’s worth it, though. Regression analysis with coded variables gives you all the results that ANOVA does, plus richer statistics that put you in a position to understand your data much more thoroughly.

I’ve dragged you through this mini history lesson not because it’s intrinsically interesting, but to emphasize that a coding scheme in conjunction with regression analysis has been used successfully and effectively for decades, both in more static research such as medical, agricultural, and financial, and also in research that emphasizes changes over time, such as predictive analytics.

Effect Coding with Seasons

Let’s have a look at a full analysis before digging into how it works. Figure 5.16 repeats the time series used with simple averages from Figure 5.5, along with three variables that identify the seasons. Recall that the time series consists of the average number of daily hits on a football-related website, for each quarter in each of five years.

Figure 5.16

Figure 5.16 The variables that identify seasons are more frequently termed vectors.

A time period that belongs to the first quarter gets a 1 in column E and a zero in columns F and G. If it belongs to the second quarter, it gets a 1 in column F and a zero in columns E and G. If it belongs to the third quarter, it gets a 1 in column G and zeroes in columns E and F. If it belongs to the fourth quarter, it gets a –1 in all three columns, E, F, and G.

Besides columns C and D, which contain the variable of interest and the time period during which the measurement was taken, there are three new columns: one that represents membership in Quarter 1, one in Quarter 2, and one in Quarter 3. We have four quarters, but three vectors in columns E, F, and G are sufficient to define the quarter for each of the 20 observations. If a record has a –1 in columns E, F, and G, it belongs to Quarter 4.

In Figure 5.16, notice that the LINEST() function results in the range I3:M7. The values in I3:K3 correspond exactly to the seasonal indexes shown for the same data set in Figure 5.7, cells C2:E2. The values are in reverse order in Figure 5.16 due to a peculiarity in LINEST(), which has been around since the 1990s and which I criticized in a Note toward the end of Chapter 4.

Furthermore, if you used the LINEST() function on Figure 5.7 to regress the values in column C on those in column B, you’d find that the intercept and coefficient correspond exactly to those reported in cells L3:M3 in Figure 5.16. The regression of the deseasonalized quarterlies on the time periods corresponds precisely to the intercept and coefficient in the omnibus analysis in Figure 5.16.

In other words, by coding quarterly membership in the input range for LINEST() by means of effect codes, you can get precisely the same results as you do by going through all the computations in the simple-averages method. I included the simple-averages method earlier in this chapter because if you understand its flow of overt events, you’re better placed to understand what’s going on inside the black box when you submit effect codes to LINEST() along with the actual observations and their time periods.

Given that LINEST() returns the intercept, the trend coefficient and the seasonal indexes that the simple-averages method returns, it shouldn’t be surprising to see that you can easily get the same forecasts from the TREND() function. See Figure 5.17.

Figure 5.17

Figure 5.17 The forecasts are identical to those shown in Figure 5.7.

The LINEST() analysis from Figure 5.16 is repeated in Figure 5.17, and the intercept and coefficients in its first row are used in cell I3 to calculate the forecast for the first observation. It’s identical to the value returned by TREND() in cell H3. No surprise there, of course: That’s exactly what TREND() is supposed to do.

One useful aspect of effect coding is that the regression coefficient associated with a particular coded vector equals the difference between the grand mean of the outcome variable and the mean of the group associated with that vector. Because that’s a confusing concept when it’s presented in words, let’s look at it more closely in Figure 5.18.

Figure 5.18

Figure 5.18 The reason it’s termed effect coding.

There’s a lot going on in Figure 5.18. First, the original observations in column C are regressed onto the time periods in column D. The results of that regression are shown in the range E3:E22, where the array formula is as shown here:

=TREND(C3:C22,D3:D22)

So the values in column E are the forecasts of average daily hits based on the relationship between the observations in column C and the time periods in column D.

Column F subtracts the forecast hits in column E from the actual figures in column C. The results of the subtractions are termed residuals—what’s left over of the original observations after the effect of their relationship to the time periods has been removed.

Finally, the residuals in column F are analyzed with another instance of LINEST(), regressing the residuals on the effect codes in columns G, H, and I. There are a couple of points of interest about the LINEST() results, in N3:Q7:

  • The intercept in cell Q3 is zero. With effect coding, the intercept equals the grand mean of the outcome variable. We’re analyzing residuals here, and residuals always sum to zero in effect coding. Therefore, the intercept also equals zero.
  • The regression coefficients appear in N3:P3. With effect coding, the coefficients equal the difference between the mean of the associated group and the grand mean. Because the grand mean of the residuals is zero, each coefficient equals the mean of the residuals in the associated group.

Figure 5.18 also shows in the range M10:P10 the residual number of hits in each quarter, calculated as a mean for each quarter from the values in column F. So, the formula in cell P10 is as follows:

=AVERAGE(F3,F7,F11,F15,F19)

Notice that the result, 100.14, is equal to the regression coefficient for the first coded vector in cell P3. And the calculated means for the second and third quarters, in O10 and N10, equal the regression coefficients in cells O3 and N3. Again, this is characteristic of pure effect coding. The regression coefficients quantify the effect of being in a given group when all you have as predictors are effect codes. If, as in analysis of covariance, you include one or more covariates—and here the time period is analogous to a covariate—the regression coefficients do not precisely equal the group means.

Also notice that because the design of the analysis includes three coded vectors, not four, you get only three coefficients. (Adding a fourth vector to identify quarterly membership explicitly makes it impossible for linear regression to analyze the problem correctly.) The way to get the fourth seasonal index is to add the other indexes together and multiply by –1. So in Figure 5.18, the formula for cell M3 is this:

=-SUM(N3:P3)

Note that the result in cell M3 is identical to the average of the Quarter 4 residuals calculated in cell M10.

Setting Up the Coded Vectors

If you have just a few observations in your time series—say, on the order of 20 or fewer—it’s probably quicker and easier to enter the 1s, 0s and –1s by hand. You can speed things up a little by selecting the range that the codes will occupy, typing a 0, and then entering it with Ctrl+Enter. That will populate most of the cells with their correct code. There are always more 0s than 1s or –1s when you have at least three vectors. Then, replace the 0s with 1s and –1s where needed.

With so many observations and vectors that you can’t bring yourself to enter the codes by hand, you can speed things up appreciably by using Excel’s VLOOKUP() function. See Figure 5.19 for an example.

Figure 5.19

Figure 5.19 The range A2:D5 serves as a sort of template for the actual code vectors.

Excel uses the term lookup range to refer to a range that’s used as A2:D5 is used in this example. The first column in your lookup range should contain the characters or numbers you use to identify your seasons. Those identifiers could be the numbers 1 through 4 if your seasons are quarters; or 1 through 12 if they are months; or Monday, Tuesday, and so on if they are days. The characters or numbers in the first column of the lookup range must correspond to the characters in the time-period portion of your time series. (So if your seasons are days and are represented as Mon, Tue, Wed, Thu, Fri, Sat, and Sun, you should use the same abbreviations in the first column of the lookup table.)

You’ll also need as many vector columns in the lookup range as you have seasons, minus 1. In Figure 5.19, for example, there are four quarters, so the lookup range has three columns—one for each vector.

Enter the 1s, 0s, and –1s as shown. Using the codes in this way results in a set of regression coefficients which quantify the effect of being in a given group.

The basic rules are these:

  • Enter a 1 in the column that represents the season identified in the lookup range’s first column, and 0s in the remaining columns of that row.
  • For the final season, enter –1s throughout the vectors on its row.

Then, enter this formula in the first row, first code vector (in Figure 5.19, that’s cell E8):

=VLOOKUP($B8,$A$2:$D$5,2)

The first argument, which here is $B8, should be a reference to the season identifier for the first record. It anchors the reference to column B so that you can drag it to the right without column B adjusting to columns C and D. Note that cell B8 contains a 1. The fact that the value 1 is in column B means that the record on row 8 belongs to Quarter 1.

The second argument, in this case $A$2:$D$5, should refer to the lookup range (without the column headers that Figure 5.19 shows in row 1). The third argument should identify the column with the codes for the first vector; here, that’s column B, the second column in the lookup range.

Use cell E8’s selection handle to drag the formula into columns F and G. You’ll need to edit the final argument from 2 to 3 in column F, and from 2 to 4 in column G. The formulas in F8 and G8 should now be, respectively,

=VLOOKUP($B8,$A$2:$D$5,3)

and

=VLOOKUP($B8,$A$2:$D$5,4)

Finally, make a multiple selection of E8:G8 and drag it down as far as your time series goes—in the layout shown in Figure 5.19, that’s through row 27, although only the first 23 rows are visible in the figure. If the column immediately to the left of the coded vectors has data all the way to the end of the time series, just double-click cell G8’s selection handle to copy and paste the formulas all the way down, and no farther.

At this point, you’re in a position to use LINEST() to get the regression equation and associated statistics to evaluate it, and TREND() to get the seasonal forecasts. Figure 5.19 uses this array formula in F2:J6:

=LINEST(C8:C27,D8:G27,,TRUE)

And this one in J8:J27:

=TREND(C8:C27,D8:G27)

You can get the one-step-ahead forecast for the first quarter of 2006 using the same approach as shown at the bottom of Figure 5.17.

Dummy Coding

Effect coding has a close relative called dummy coding. It uses the same data setup as effect coding, with one crucial difference: The records that get –1s in each vector with effect coding get 0s instead in dummy coding. Figure 5.20 shows the data in Figure 5.19 set up for dummy coding.

Figure 5.20

Figure 5.20 Notice that the overall regression statistics such as R2 are unaffected by the choice of coding method.

Using dummy coding instead of effect coding changes no regression statistics other than the regression coefficients and intercept for the coded vectors, and their standard errors (found in LINEST()’s second row). With dummy coding, the coefficient for each coded variable expresses the difference between the mean of the group associated with that variable, and the mean of the group that gets 0s on all vectors.

This characteristic makes dummy coding helpful in experiments in which two or more treatment groups are contrasted with a control group. Dummy coding can also prove useful in logistic regression because it can make interpretation of the results more straightforward.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020