Linear Regression with Coded Vectors
Another method for dealing with seasonal time series, whether trended or not, employs linear regression in combination with effect coding. Effect coding is just a way of representing which season a particular observation belongs to, and Excel makes it particularly easy to set up.
The two broad methods that this chapter has already discussed, those of simple averages and of moving averages, also employ linear regression. Figure 5.5 shows how the simple averages method uses regression to calculate a year-to-year trend and distribute that trend equally across the quarters. Figure 5.7 shows how the same method uses regression to forecast from deseasonalized quarters before adding seasonal effects back in. (It would have been feasible to use Holt’s method in Figure 5.7 instead of linear regression.)
Figure 5.14 shows how you can use linear regression just as is done in Figure 5.7, to forecast from deseasonalized observations prior to reseasonalizing them using the seasonal indexes. Again, you could use Holt’s method instead of regression.
If you combine effect coding (or, as you’ll see, dummy coding) with linear regression, you can get the forecasts of a seasonal time series in one step. It’s not necessary to jump through the hoops of calculating averages for each season and for an encompassing period such as a year, deviate them from the grand mean, normalize them, deseasonalize the time series, forecast a trended series, and finally add the seasonal indexes back in. Nor is it necessary to do anything fancy with LINEST() or TREND(). It’s all in the way you indicate the season in the time series.
About Effect Coding and Regression
In the 1960s, when computers were becoming more and more accessible in colleges and universities and in business settings, regression analysis began to supplant an older statistical method, the analysis of variance, or ANOVA. That method, which first saw light in a 1918 paper by Sir Ronald Fisher, helps the researcher draw inferences about the reliability of an empirical finding. For example: “If I repeat this experiment 100 times, how often can I expect to find differences between groups as large as I just saw, assuming that this time around those differences were due to sampling error?”
ANOVA is one way of expressing something called the General Linear Model, in itself a way of describing contributions to numeric observations. Suppose that you were investigating the distribution of C reactive protein (CRP) in humans. Besides the measured level of that protein for each person in a sample, you might have information about each person’s sex, and whether each person has coronary artery disease.
The General Linear Model would tell you that each person’s CRP can be expressed by summing the following terms:
- The mean level of CRP in your entire sample
- The elevation or depression from the overall mean in CRP for males, or, if the person is female, for females
- The elevation or depression from the overall mean in CRP for people with coronary disease; otherwise, for those without such disease
- The elevation or depression for males with and without coronary disease, and for females with and without coronary disease
- An error term that measures the difference between a person’s actual measured CRP level and the sum of the other effects (grand mean, elevation or depression specific to each sex, elevation or depression due to coronary disease, interaction between sex and coronary disease)
The categories in ANOVA (in this example, sex and the presence or absence of coronary disease) are traditionally handled separately: Various calculations are performed for the males’ data, and performed separately from the females’ data; and similarly for the presence and absence of coronary disease. In this way, ANOVA can deal with nominal, categorical information.
Regression can’t—or, rather, doesn’t—work directly with nominal information such as “Male or “Has atherosclerosis.” With regression, you use one of several available methods to code those categories numerically. Then, you present those codes along with CRP measures to some function such as Excel’s LINEST() or TREND() functions to handle everything at once. Most, perhaps even the overwhelming majority, of analyses that contrast different groups and subgroups since the 1990s have used a regression approach.
This was not true before significant computing power became readily available, before capabilities that once required mainframes became available on the desktop. Or laptop. Or wristwatch.
Researchers had to wait for that computing power because regression analysis requires a lot more in the way of exacting calculations than does ANOVA. It’s worth it, though. Regression analysis with coded variables gives you all the results that ANOVA does, plus richer statistics that put you in a position to understand your data much more thoroughly.
I’ve dragged you through this mini history lesson not because it’s intrinsically interesting, but to emphasize that a coding scheme in conjunction with regression analysis has been used successfully and effectively for decades, both in more static research such as medical, agricultural, and financial, and also in research that emphasizes changes over time, such as predictive analytics.
Effect Coding with Seasons
Let’s have a look at a full analysis before digging into how it works. Figure 5.16 repeats the time series used with simple averages from Figure 5.5, along with three variables that identify the seasons. Recall that the time series consists of the average number of daily hits on a football-related website, for each quarter in each of five years.
Figure 5.16 The variables that identify seasons are more frequently termed vectors.
A time period that belongs to the first quarter gets a 1 in column E and a zero in columns F and G. If it belongs to the second quarter, it gets a 1 in column F and a zero in columns E and G. If it belongs to the third quarter, it gets a 1 in column G and zeroes in columns E and F. If it belongs to the fourth quarter, it gets a –1 in all three columns, E, F, and G.
Besides columns C and D, which contain the variable of interest and the time period during which the measurement was taken, there are three new columns: one that represents membership in Quarter 1, one in Quarter 2, and one in Quarter 3. We have four quarters, but three vectors in columns E, F, and G are sufficient to define the quarter for each of the 20 observations. If a record has a –1 in columns E, F, and G, it belongs to Quarter 4.
In Figure 5.16, notice that the LINEST() function results in the range I3:M7. The values in I3:K3 correspond exactly to the seasonal indexes shown for the same data set in Figure 5.7, cells C2:E2. The values are in reverse order in Figure 5.16 due to a peculiarity in LINEST(), which has been around since the 1990s and which I criticized in a Note toward the end of Chapter 4.
Furthermore, if you used the LINEST() function on Figure 5.7 to regress the values in column C on those in column B, you’d find that the intercept and coefficient correspond exactly to those reported in cells L3:M3 in Figure 5.16. The regression of the deseasonalized quarterlies on the time periods corresponds precisely to the intercept and coefficient in the omnibus analysis in Figure 5.16.
In other words, by coding quarterly membership in the input range for LINEST() by means of effect codes, you can get precisely the same results as you do by going through all the computations in the simple-averages method. I included the simple-averages method earlier in this chapter because if you understand its flow of overt events, you’re better placed to understand what’s going on inside the black box when you submit effect codes to LINEST() along with the actual observations and their time periods.
Given that LINEST() returns the intercept, the trend coefficient and the seasonal indexes that the simple-averages method returns, it shouldn’t be surprising to see that you can easily get the same forecasts from the TREND() function. See Figure 5.17.
The LINEST() analysis from Figure 5.16 is repeated in Figure 5.17, and the intercept and coefficients in its first row are used in cell I3 to calculate the forecast for the first observation. It’s identical to the value returned by TREND() in cell H3. No surprise there, of course: That’s exactly what TREND() is supposed to do.
One useful aspect of effect coding is that the regression coefficient associated with a particular coded vector equals the difference between the grand mean of the outcome variable and the mean of the group associated with that vector. Because that’s a confusing concept when it’s presented in words, let’s look at it more closely in Figure 5.18.
Figure 5.18 The reason it’s termed effect coding.
There’s a lot going on in Figure 5.18. First, the original observations in column C are regressed onto the time periods in column D. The results of that regression are shown in the range E3:E22, where the array formula is as shown here:
So the values in column E are the forecasts of average daily hits based on the relationship between the observations in column C and the time periods in column D.
Column F subtracts the forecast hits in column E from the actual figures in column C. The results of the subtractions are termed residuals—what’s left over of the original observations after the effect of their relationship to the time periods has been removed.
Finally, the residuals in column F are analyzed with another instance of LINEST(), regressing the residuals on the effect codes in columns G, H, and I. There are a couple of points of interest about the LINEST() results, in N3:Q7:
- The intercept in cell Q3 is zero. With effect coding, the intercept equals the grand mean of the outcome variable. We’re analyzing residuals here, and residuals always sum to zero in effect coding. Therefore, the intercept also equals zero.
- The regression coefficients appear in N3:P3. With effect coding, the coefficients equal the difference between the mean of the associated group and the grand mean. Because the grand mean of the residuals is zero, each coefficient equals the mean of the residuals in the associated group.
Figure 5.18 also shows in the range M10:P10 the residual number of hits in each quarter, calculated as a mean for each quarter from the values in column F. So, the formula in cell P10 is as follows:
Notice that the result, 100.14, is equal to the regression coefficient for the first coded vector in cell P3. And the calculated means for the second and third quarters, in O10 and N10, equal the regression coefficients in cells O3 and N3. Again, this is characteristic of pure effect coding. The regression coefficients quantify the effect of being in a given group when all you have as predictors are effect codes. If, as in analysis of covariance, you include one or more covariates—and here the time period is analogous to a covariate—the regression coefficients do not precisely equal the group means.
Also notice that because the design of the analysis includes three coded vectors, not four, you get only three coefficients. (Adding a fourth vector to identify quarterly membership explicitly makes it impossible for linear regression to analyze the problem correctly.) The way to get the fourth seasonal index is to add the other indexes together and multiply by –1. So in Figure 5.18, the formula for cell M3 is this:
Note that the result in cell M3 is identical to the average of the Quarter 4 residuals calculated in cell M10.
Setting Up the Coded Vectors
If you have just a few observations in your time series—say, on the order of 20 or fewer—it’s probably quicker and easier to enter the 1s, 0s and –1s by hand. You can speed things up a little by selecting the range that the codes will occupy, typing a 0, and then entering it with Ctrl+Enter. That will populate most of the cells with their correct code. There are always more 0s than 1s or –1s when you have at least three vectors. Then, replace the 0s with 1s and –1s where needed.
With so many observations and vectors that you can’t bring yourself to enter the codes by hand, you can speed things up appreciably by using Excel’s VLOOKUP() function. See Figure 5.19 for an example.
Figure 5.19 The range A2:D5 serves as a sort of template for the actual code vectors.
Excel uses the term lookup range to refer to a range that’s used as A2:D5 is used in this example. The first column in your lookup range should contain the characters or numbers you use to identify your seasons. Those identifiers could be the numbers 1 through 4 if your seasons are quarters; or 1 through 12 if they are months; or Monday, Tuesday, and so on if they are days. The characters or numbers in the first column of the lookup range must correspond to the characters in the time-period portion of your time series. (So if your seasons are days and are represented as Mon, Tue, Wed, Thu, Fri, Sat, and Sun, you should use the same abbreviations in the first column of the lookup table.)
You’ll also need as many vector columns in the lookup range as you have seasons, minus 1. In Figure 5.19, for example, there are four quarters, so the lookup range has three columns—one for each vector.
Enter the 1s, 0s, and –1s as shown. Using the codes in this way results in a set of regression coefficients which quantify the effect of being in a given group.
The basic rules are these:
- Enter a 1 in the column that represents the season identified in the lookup range’s first column, and 0s in the remaining columns of that row.
- For the final season, enter –1s throughout the vectors on its row.
Then, enter this formula in the first row, first code vector (in Figure 5.19, that’s cell E8):
The first argument, which here is $B8, should be a reference to the season identifier for the first record. It anchors the reference to column B so that you can drag it to the right without column B adjusting to columns C and D. Note that cell B8 contains a 1. The fact that the value 1 is in column B means that the record on row 8 belongs to Quarter 1.
The second argument, in this case $A$2:$D$5, should refer to the lookup range (without the column headers that Figure 5.19 shows in row 1). The third argument should identify the column with the codes for the first vector; here, that’s column B, the second column in the lookup range.
Use cell E8’s selection handle to drag the formula into columns F and G. You’ll need to edit the final argument from 2 to 3 in column F, and from 2 to 4 in column G. The formulas in F8 and G8 should now be, respectively,
Finally, make a multiple selection of E8:G8 and drag it down as far as your time series goes—in the layout shown in Figure 5.19, that’s through row 27, although only the first 23 rows are visible in the figure. If the column immediately to the left of the coded vectors has data all the way to the end of the time series, just double-click cell G8’s selection handle to copy and paste the formulas all the way down, and no farther.
At this point, you’re in a position to use LINEST() to get the regression equation and associated statistics to evaluate it, and TREND() to get the seasonal forecasts. Figure 5.19 uses this array formula in F2:J6:
And this one in J8:J27:
You can get the one-step-ahead forecast for the first quarter of 2006 using the same approach as shown at the bottom of Figure 5.17.
Effect coding has a close relative called dummy coding. It uses the same data setup as effect coding, with one crucial difference: The records that get –1s in each vector with effect coding get 0s instead in dummy coding. Figure 5.20 shows the data in Figure 5.19 set up for dummy coding.
Figure 5.20 Notice that the overall regression statistics such as R2 are unaffected by the choice of coding method.
Using dummy coding instead of effect coding changes no regression statistics other than the regression coefficients and intercept for the coded vectors, and their standard errors (found in LINEST()’s second row). With dummy coding, the coefficient for each coded variable expresses the difference between the mean of the group associated with that variable, and the mean of the group that gets 0s on all vectors.
This characteristic makes dummy coding helpful in experiments in which two or more treatment groups are contrasted with a control group. Dummy coding can also prove useful in logistic regression because it can make interpretation of the results more straightforward.