Elementary techniques for deriving causal insights from observational data. These techniques include natural experiments, the difference-in-difference design, and regression discontinuity—all of which can help us derive actionable insights from real-world data.
In Chapter 10, we discussed the nuts and bolts of causal inference from observational data. We explored examples of natural experiments, where assignment into treatment and control groups occurs due to some natural process. We also went through an application of the difference-in-difference design. A difference-in-difference (DID) design can be used to model both natural experiments and quasi-experiments, which have a counterfactual, as in the designated market area (DMA) example.
Recall that a quasi-experimental design is used when the assignment process is not random. The reality is the vast majority of cases of causal inference from observational data are quasi-experiments, not natural experiments. To make these designs work, the researcher must control assignment into treatment and control groups. The idea of controlling assignment may seem abstract now, but you’ll understand what that means after you see the examples in this chapter and Chapter 12.
In this chapter, we’ll discuss a very popular quasi-experimental design technique, regression discontinuity (RD). This approach can be applied in a variety of cases, but adequate testing must be done to substantiate its validity. We will also cover time-series models and seasonality methods to improve estimation for the time-series case of RD, called interrupted time series (ITS).
Causal inference from observational data is difficult, as discussed in Chapter 10. Although causal inference generally takes more effort, it leads to the most prescriptive and actionable results. If we can utilize causal inference methods in addition to broader explanatory and predictive methods, we can understand our web product on a much deeper level.
With prediction, we can often throw everything but the kitchen sink at a problem. In contrast, with causal inference, our approach must be much more thoughtful. As described in earlier chapters, prediction can be validated and improved on a post hoc basis with external or test data; causal inference cannot. Causal inference relies on internal validity, or the underlying logic of the design, to establish the credibility of the results.
With causal inference, we must put on our detective hat, as we are always looking for ways to invalidate our designs. If a design has been invalidated, we can sometimes move to a smaller coverage area (or a smaller population for which there is support in the data, such as focusing only on male users in the dating website “liking” example), but often we must start over or rethink our initial design. Causal inference is also a much more one-off endeavor, which means that as a data scientist, you must have a large toolkit of methods to draw on for each specific problem. In addition, you might have to be much more creative in your application or design than with predictive methodologies.
In this chapter, we will cover RD, a quasi-experimental method. In practice, RD is one of the hardest designs to implement correctly. Many situations may initially seem like good candidates for an RD design, but after further evaluation, it becomes obvious they are plagued by selection bias. It’s not a one-size-fits-all approach, since we depend on the statistical estimation of the counterfactual, unlike natural experiments and DID methods. We’ll discuss these issues throughout the next two sections on RD and ITS.
This chapter is one of the more technically rigorous chapters, as we will use advanced methods to model counterfactuals. If you don’t understand the modeling methods, don’t let that frighten you away. Unlike many other methods, many RD designs can be invalidated by graphing. In the best RD cases, the causal effect can be found visually. Also, applying high-level modeling methods is uncomplicated in R, and these modeling methods also can be visually invalidated. In the next section, we’ll cover RD design and its applications.
11.1 Regression Discontinuity
In Chapter 10, we discussed DID modeling. When we use this approach, we’ve found a comparable counterfactual or control group to compare to our treated users. In contrast, regression discontinuity relies on a break or level change in our treatment variable or timing of our treatment variable to assign users to treatment and control.
11.1.1 Nuts and Bolts of RD
Suppose there is an arbitrary cut point or step change in a game or website feature. For instance, a user who gets 50 points within a specific time frame is awarded an enthusiast badge. This might be a good candidate for an RD design.
You will have users on both sides of this cut point: a group of users at 49 points and a group of users at 51 points. Theoretically, these users on either side of the cut point are similar in terms of skill level, motivation, and time in product. The RD design assumes that users on both sides of the cut point are more similar to each other than to the other users in their own respective groups. For instance, a user who gets 75 points is substantially different from a user who gets 51 points, even though both received a badge. Similarly, a user with 20 points is a lot different from a user with 49 points, even though neither progresses to the next level.
The RD design finds the local average treatment effect (LATE). We’ve already seen the ATE and the ATT. LATE is the average treatment effect defined in a local area of variation. Here the local area is defined around the cut point.
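The idea of a local comparison around the cut point can be sketched in a few lines. The following is a minimal illustration in Python (not the R workflow mentioned elsewhere in this chapter), using simulated data. The function names, the simulated retention probabilities, and the bandwidth of 3 points are all assumptions made for the example. It compares mean retention for users just below and just above the 50-point badge threshold.

```python
import random
from statistics import mean

CUT_POINT = 50  # badge threshold from the example above


def simulate_users(n=50000, true_effect=0.10, seed=0):
    """Hypothetical data: integer scores and a 0/1 retention outcome."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        score = rng.randint(0, 100)
        badge = score >= CUT_POINT
        # Retention rises slowly with score and jumps by true_effect at the cut.
        p_retain = 0.30 + 0.002 * score + (true_effect if badge else 0.0)
        users.append((score, int(rng.random() < p_retain)))
    return users


def local_effect(users, bandwidth):
    """Difference in mean outcome just right vs. just left of the cut point."""
    left = [y for s, y in users if CUT_POINT - bandwidth <= s < CUT_POINT]
    right = [y for s, y in users if CUT_POINT <= s < CUT_POINT + bandwidth]
    return mean(right) - mean(left), len(left), len(right)


users = simulate_users()
effect, n_left, n_right = local_effect(users, bandwidth=3)
print(f"Local effect estimate: {effect:.3f} (n_left={n_left}, n_right={n_right})")
```

A plain difference in means is the simplest possible estimator. It is only credible when the users inside the window really are comparable, which is the assumption the rest of this section examines.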
11.1.2 Potential RD Designs
Regression discontinuity exploits breaks in the treatment variable. Let’s explore some potential examples to get a better understanding of where RD is best applied. Here are some non-web analytics examples of potential RD designs:
The effect of a scholarship program on future earnings. Scholarships are given to participants who score an 80 or higher on a national standardized test. We assume getting a score of 79 or 80 is essentially random. Here the treatment is receiving the scholarship. The control group consists of students who score 79, and the treatment group consists of students who score 80. The design can be invalidated if richer students with higher social status or with certain instructors are more likely to get an 80 than a 79, suggesting that there is selection at the cut point.
The effect of winning a U.S. House of Representatives election on personal wealth. The assumption here is that the winners of close elections are random. Thus, we can assess the effect of winning a U.S. House seat by comparing the personal wealth of near winners and near losers. This design would be invalidated if there was selection in terms of who won, based on other factors such as personal wealth, meaning that richer candidates were more likely to win close elections than poorer candidates.
These examples are great candidates for RD, although they are not guaranteed to work. For instance, the elections example might ultimately fail because richer candidates and incumbents are able to squeeze out a win more often than not. Selection on wealth and incumbency means that the assumption of randomness at the cut point (i.e., who wins and who loses the election) is erroneous, invalidating the design. This was found in practice by Caughey and Sekhon (2011).
Nevertheless, RD is better than many other modeling methods, because clustering or density at the cut point can be plotted and observed. We’ll see examples of this in later sections. RD can also be invalidated quickly and easily, unlike pure modeling approaches.
11.1.3 The Enemy of the Good: Nonrandom Selection at the Cut Point
Now that we’ve seen some examples of successful and unsuccessful RD designs, let’s consider the core assumption that is required for RD to work. The primary assumption is random selection at the cut point. In our game example, we assumed that users who progressed right at the cut point did so randomly. Those users who progressed could easily have not progressed; likewise, those users who did not progress could easily have progressed.
The users right at the cut point who did not progress become the control group for the users who did progress. We can then compare the two groups to find the effect of progressing on product retention. The defined cut point, score, and treatment make this a sharp regression discontinuity design. There is also a fuzzy RD design that involves a gradual change; it is not discussed in this book, but can be explored in reference readings.
Similar to the other quasi-experimental designs, selection bias can invalidate this design. For instance, suppose that users who progressed were more likely to have been in the product for a few days, rather than it being their first day. They were also more likely to have friends in the product than the users who did not progress. If this is the case—that is, there is nonrandom selection at the cut point—then our estimated “causal” effect of gaining a badge is not valid. We cannot rule out that it is a spurious relationship, where, for instance, having more friends is driving gaining a badge and retention.
Our estimate could be measuring the slight advantages that helped certain users progress to higher levels, instead of our desired causal variable, “gaining a badge.” Selection at the cut point breaks the randomness assumption. It means that the assumption that the users who progressed were similar to those who did not fails. We might be able to model the effects of confounders by adding them into our model estimator, but generally if there is one confounder, there are likely to be many, and there might be no support for controlling for them.
Let’s explore this idea in a little more detail. To remove the effect of confounders, we theoretically need to find users in the treatment group who look like users in the control group. However, if there are one or more confounders, all users with a particular confounder may be found in the treatment group and not in the control group. For instance, suppose all users with friends in the product progress to 51 points, and there is no support for any users with friends in the 45–49 interval. We then cannot estimate the “causal” effect on the full population; that is, we might be able to estimate the effect only for users with no friends.
To model out confounders, we need to understand where the selection is occurring, have support in both the treatment and control groups, and properly model the selection. We might just need to drop observations where support in the treatment or control group is lacking.
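A support check of this kind can be sketched as a simple cross-tabulation. The example below is a hedged illustration in Python. The `has_friends` field, the six hypothetical users, and the helper names are assumptions introduced for the example. It counts users by side of the cut point and confounder value, then keeps only confounder values observed on both sides.

```python
from collections import Counter

CUT_POINT = 50


def support_table(users, bandwidth, confounder="has_friends"):
    """Count users by side of the cut point and confounder value."""
    counts = Counter()
    for u in users:
        if CUT_POINT - bandwidth <= u["score"] < CUT_POINT + bandwidth:
            side = "treated" if u["score"] >= CUT_POINT else "control"
            counts[(side, u[confounder])] += 1
    return counts


def values_with_support(counts, min_n=1):
    """Confounder values observed at least min_n times on both sides."""
    values = {value for _, value in counts}
    return {
        v for v in values
        if counts[("control", v)] >= min_n and counts[("treated", v)] >= min_n
    }


# Hypothetical users: nobody with friends lands just below the cut point.
users = [
    {"score": 47, "has_friends": False},
    {"score": 48, "has_friends": False},
    {"score": 49, "has_friends": False},
    {"score": 50, "has_friends": True},
    {"score": 51, "has_friends": False},
    {"score": 52, "has_friends": True},
]
counts = support_table(users, bandwidth=5)
supported = values_with_support(counts)
# Drop observations whose confounder value lacks support on one side.
kept = [u for u in users if u["has_friends"] in supported]
print(supported)
print(len(kept))
```

In this toy data, only users without friends appear on both sides, so any estimate computed from `kept` would apply to users with no friends rather than to the full population.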
Selection is a problem with all quasi-experimental designs and often drives “causal” effects, especially if they are particularly large. Selection at the cut point is actually very common in practice. Unfortunately, the more important (i.e., the larger) the causal effect, the more selection you’ll likely see. The reason is that the more important this factor is for positive outcomes, the more users will try to “game the system,” so to speak, making those who get a badge less and less random.
For instance, suppose we offered a $1,000 reward to everyone who progresses. You’d see some strong selection at the cut point. People would talk to one another in an effort to improve their game play, leading to selection. Generally, where regression discontinuity is most useful, it’s also most likely to be wrong. The one exception is with time as the discontinuity variable. We’ll discuss this special case when we consider interrupted time series. RD can require some advanced, or one-off, modeling techniques to estimate the LATE, but has some advantages over the DID design. In particular, it is easily observable, if data is plotted correctly, when RD’s core assumptions fail.
11.1.4 RD Complexities
Three complexities arise with the RD design:
Selection at the cut point
RD is only defined in the limit
“Clumpiness” in the data around the cut point
Generally, if you can handle these three issues, then the RD design is valid.
First, the difficulty with arbitrary human cut points is that there is often selection at the cut point. In two very popular examples of RD design, scholarships and elections, researchers have found rampant selection at the “random” cut points: higher-income students are more likely to get scholarships, and candidates with more campaign contributions are more likely to win elections. This is a huge problem. The benefit here is that selection can be observed by plotting each confounding variable close to the cut point.
The second complexity is that RD is only defined in the limit. If you remember back to your high school calculus class, the limit is the value that a function approaches as its input gets closer and closer to some point, here the cut point. In our example, the arbitrary cut point is 50, and the effect is defined only at values extremely close to 50. In practice, however, only whole scores such as 49 and 51 are observed; a value such as 49.999999999 is undefined, and no user can take it. We then face the problem of how large the control group (left of the cut point) and the treatment group (right of the cut point) should be. How much data do we have close to the cut point? Is a player who scored 48 or 52 really still comparable? How about 45 and 55? We can do robustness checks and vary our sample sizes to see how much those changes affect the estimate. However, if the results strongly differ, which they theoretically should not, we need to ask ourselves if there is enough coverage at the cut point.
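The robustness check described above can be sketched by re-estimating the jump over several window widths. This is a minimal illustration in Python under stated assumptions: the data are hypothetical and noise-free, with one user per score, an outcome that rises by 0.002 per point, and a true jump of 0.10 at the cut point. The function name and the list of bandwidths are also assumptions.

```python
from statistics import mean

CUT_POINT = 50


def window_estimate(users, bandwidth):
    """Difference in mean outcome within `bandwidth` points of the cut point."""
    left = [y for s, y in users if CUT_POINT - bandwidth <= s < CUT_POINT]
    right = [y for s, y in users if CUT_POINT <= s < CUT_POINT + bandwidth]
    if not left or not right:
        return None  # no coverage on one side of the cut point
    return mean(right) - mean(left)


# Noise-free hypothetical outcomes: a slow upward trend plus a 0.10 jump.
users = [
    (s, 0.30 + 0.002 * s + (0.10 if s >= CUT_POINT else 0.0))
    for s in range(101)
]

for h in (1, 2, 5, 10, 20):
    print(f"bandwidth={h:>2}  estimate={window_estimate(users, h):.3f}")
```

By construction, the estimate here equals 0.10 plus 0.002 times the bandwidth, so wider windows absorb more of the underlying trend into the "effect." With real data, sampling noise pulls in the other direction: narrow windows are less biased but rest on fewer users.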
The third problem is how to model “clumpiness” near the cut points. If different types of models get different estimates for the pre-treatment and post-treatment values, what model should we use? We’ll see an example related to this issue in the next section, and suggest how to estimate the effect. Is there a true effect or is there rampant selection? Clumpiness at the cut point could also be a sign of selection. [You can test for clumpiness attributable to selection. While beyond the scope of this book, check out the McCrary density test described by McCrary (2008) and implemented in the R package ‘rdd’. The idea here is that if there is no selection, there should not be irregular clumpiness on either side of the cut point and both sides of the cut point should be proportionally similar.]
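A much cruder check than the McCrary test conveys the same intuition: count the users just left and just right of the cut point and ask whether the split is plausibly even. The sketch below is in Python and is not the McCrary test. It assumes the density of scores is roughly flat inside a narrow window, so that an even split is the right benchmark, and it uses an exact two-sided binomial test. The scores and function names are hypothetical.

```python
from math import comb

CUT_POINT = 50


def side_counts(scores, bandwidth):
    """Number of users just left and just right of the cut point."""
    left = sum(1 for s in scores if CUT_POINT - bandwidth <= s < CUT_POINT)
    right = sum(1 for s in scores if CUT_POINT <= s < CUT_POINT + bandwidth)
    return left, right


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n fair coin flips."""
    # Sum the probabilities of all outcomes at least as far from n/2 as k.
    distance = abs(k - n / 2)
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return tail / 2 ** n


# Hypothetical scores: 40 users just below the cut point, 70 just above.
scores = [48] * 20 + [49] * 20 + [50] * 35 + [51] * 35
left, right = side_counts(scores, bandwidth=2)
p_value = binomial_two_sided_p(right, left + right)
print(left, right, p_value)
```

A small p-value says only that users pile up on one side more than an even split would predict. It does not explain why, so it is best read alongside plots of the data near the cut point.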
11.1.5 Graphing the Data
With RD, there’s a simple rule: When in doubt, graph. How do we graph RD? The x-axis is the treatment variable, and we should focus on the area around the cut point. The y-axis is either our outcome variable or a confounder variable measured prior to treatment. Often with RD designs, the effect or selection in the data becomes visible by just graphing the data.
Graph the data closest to the cut point, because that is where RD is defined. Even if your assumptions may hold away from the cut point, that is largely irrelevant in an RD design. RD designs have been invalidated because of selection very close to the cut point. Even if only a few well-connected users are being selected for, that will invalidate your design.
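The quantities behind such a graph can be sketched without any plotting library. The Python example below is an illustration on hypothetical data. The `retained` and `has_friends` fields, the user records, and the helper names are assumptions. It computes the mean of a variable at each score near the cut point and renders it as a rough text chart, with `|` marking scores at or above the cut point.

```python
from collections import defaultdict
from statistics import mean

CUT_POINT = 50


def means_by_score(users, field, bandwidth):
    """Mean of `field` at each score within `bandwidth` of the cut point."""
    grouped = defaultdict(list)
    for u in users:
        if CUT_POINT - bandwidth <= u["score"] < CUT_POINT + bandwidth:
            grouped[u["score"]].append(u[field])
    return {score: mean(values) for score, values in sorted(grouped.items())}


def text_plot(series, width=40):
    """One bar per score; assumes values lie between 0 and 1."""
    lines = []
    for score, value in series.items():
        marker = "|" if score >= CUT_POINT else " "
        lines.append(f"{score:>3} {marker} {'#' * round(value * width)} {value:.2f}")
    return "\n".join(lines)


# Hypothetical users: ten per score, with a jump in both fields at the cut.
users = []
for score in range(46, 54):
    treated = score >= CUT_POINT
    for i in range(10):
        users.append({
            "score": score,
            "retained": int(i < (6 if treated else 4)),
            "has_friends": int(i < (5 if treated else 2)),
        })

print(text_plot(means_by_score(users, "retained", bandwidth=3)))
print(text_plot(means_by_score(users, "has_friends", bandwidth=3)))
```

In this toy data, the outcome jumps at the cut point, but so does the pre-treatment variable `has_friends`. That pattern points to selection at the cut point, not to a clean causal effect.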
When we have adequately accounted for these problems in the design, RD can be a useful tool. It is best applied when an arbitrary cut point is not well known by the players, so that no strategies or selection problems occur at the cut point. Now, we’ll discuss a numeric regression discontinuity example.