# Observational Data Analysis Techniques

This chapter introduces elementary techniques for deriving causal insights from observational data. These techniques include natural experiments, the difference-in-difference design, and regression discontinuity, all of which can help us derive actionable insights from real-world data.

In Chapter 10, we discussed the nuts and bolts of causal inference from observational data. We explored examples of natural experiments, where assignment into treatment and control groups occurs due to some natural process. We also went through an application of the difference-in-difference design. A difference-in-difference (DID) design can be used to model both natural experiments and quasi-experiments, which have a counterfactual, as in the designated market area (DMA) example.

Recall that a quasi-experimental design is used when the assignment process is not random. In reality, the vast majority of cases of causal inference from observational data are quasi-experiments, not natural experiments. To make these designs work, the researcher must control for assignment into treatment and control groups. The idea of controlling for assignment may seem abstract now, but you’ll understand what that means after you see the examples in this chapter and Chapter 12.

In this chapter, we’ll discuss a very popular quasi-experimental design technique, regression discontinuity (RD). This approach can be applied in a variety of cases, but each application must be tested adequately to substantiate its validity. We will also cover time-series models and seasonality methods to improve estimation for the time-series case of RD, called interrupted time series (ITS).

Causal inference from observational data is difficult, as discussed in Chapter 10. Although causal inference generally takes more effort, it leads to the most prescriptive and actionable results. If we can utilize causal inference methods in addition to broader explanatory and predictive methods, we can understand our web product on a much deeper level.

With prediction, we can often throw everything but the kitchen sink at a problem. In contrast, with causal inference, our approach must be much more thoughtful. As described in earlier chapters, prediction can be validated and improved on a post hoc basis with external or test data; causal inference cannot. Causal inference relies on internal validity, that is, the underlying logic of the design, to establish the credibility of the results.

With causal inference, we must put on our detective hat, as we are always looking for ways to invalidate our designs. If a design has been invalidated, we can sometimes move to a smaller coverage area (or a smaller population for which there is support in the data, such as focusing on only male users for the dating website “liking” example), but often we must start over or rethink our initial design. Causal inference is also a much more one-off endeavor, which means that as a data scientist, you must have a large toolkit of methods that you shuffle through for each specific problem. In addition, you might have to be much more creative in your application or design than with predictive methodologies.

In this chapter, we will cover RD, a quasi-experimental method. In practice, RD is one of the hardest designs to implement correctly. Many situations may initially seem like good candidates for an RD design, but after further evaluation, it becomes obvious they are plagued by selection bias. It’s not a one-size-fits-all approach, since we depend on the statistical estimation of the counterfactual, unlike natural experiments and DID methods. We’ll discuss these issues throughout the next two sections on RD and ITS.

This chapter is one of the more technically rigorous chapters, as we will use advanced methods to model counterfactuals. If you don’t understand the modeling methods, don’t let that frighten you away. Unlike many other methods, many RD designs can be invalidated by graphing. In the best RD cases, the causal effect can be found visually. Also, applying high-level modeling methods is uncomplicated in R, and these modeling methods also can be visually invalidated. In the next section, we’ll cover RD design and its applications.

## 11.1 Regression Discontinuity

In Chapter 10, we discussed DID modeling. When we use this approach, we’ve found a comparable counterfactual or control group to compare to our treated users. In contrast, regression discontinuity relies on a break or level change in our treatment variable or timing of our treatment variable to assign users to treatment and control.

### 11.1.1 Nuts and Bolts of RD

Suppose there is an arbitrary cut point or step change in a game or website feature. For instance, a user who gets 50 points within a specific time frame is awarded an enthusiast badge. This might be a good candidate for an RD design.

You will have users on both sides of this cut point: a group of users at 49 points and a group of users at 51 points. Theoretically, these users on either side of the cut point are similar in terms of skill level, motivation, and time in product. The RD design assumes that users on both sides of the cut point are more similar to each other than to the other users in their own respective groups. For instance, a user who gets 75 points is substantially different from a user who gets 51 points, even though both received a badge. Similarly, a user with 20 points is a lot different from a user with 49 points, even though neither progresses to the next level.

The RD design finds the local average treatment effect (LATE). We’ve already seen the ATE and the ATT. The LATE is the average treatment effect defined in a local area of variation. Here the local area is defined around the cut point.
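One common way to write this down is shown below. The notation is our own labeling rather than the book's: $X_i$ is a user's score, $c$ is the cut point (50 in the badge example), and $Y_i$ is the outcome. It assumes treatment switches on exactly at the cut point.

$$
\tau_{\text{LATE}} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] \; - \; \lim_{x \uparrow c} E[Y_i \mid X_i = x]
$$

The first term is the expected outcome for users who just cleared the cut point. The second is the expected outcome for users who just missed it.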

### 11.1.2 Potential RD Designs

Regression discontinuity exploits breaks in the treatment variable. Let’s explore some potential examples to get a better understanding of where RD is best applied. Here are some non-web analytics examples of potential RD designs:

• The effect of a scholarship program on future earnings. Scholarships are given to participants who score an 80 or higher on a national standardized test. We assume that whether a student scores 79 or 80 is essentially random. Here the treatment is receiving the scholarship. The control group consists of students who score 79, and the treatment group consists of students who score 80. The design can be invalidated if richer students with higher social status or with certain instructors are more likely to get an 80 than a 79, suggesting that there is selection at the cut point.

• The effect of winning a U.S. House of Representatives election on personal wealth. The assumption here is that the winners of close elections are random. Thus, we can assess the effect of winning a U.S. House seat by comparing the personal wealth of near winners and near losers. This design would be invalidated if there were selection in terms of who won, based on other factors such as personal wealth, meaning that richer candidates were more likely to win close elections than poorer candidates.

These examples are great candidates for RD, although they are not guaranteed to work. For instance, the elections example might ultimately fail because richer candidates and incumbents are able to squeeze out a win more often than not. Selection on wealth and incumbency means that the assumption of randomness at the cut point (i.e., who wins and who loses the election) is erroneous, invalidating the design. This was found in practice by Caughey and Sekhon (2011).

Nevertheless, RD is better than many other modeling methods, because clustering or density at the cut point can be plotted and observed. We’ll see examples of this in later sections. RD can also be invalidated quickly and easily, unlike pure modeling approaches.

### 11.1.3 The Enemy of the Good: Nonrandom Selection at the Cut Point

Now that we’ve seen some examples of successful and unsuccessful RD designs, let’s consider the core assumption that is required for RD to work. The primary assumption is random selection at the cut point. In our game example, we assumed that users who progressed right at the cut point did so randomly. Those users who progressed could easily have not progressed; likewise, those users who did not progress could easily have progressed.

The users right at the cut point who did not progress become the control group for the users who did progress. We can then compare the two groups to find the effect of progressing on product retention. Because every user at or above the cut point receives the treatment and no user below it does, this is a sharp regression discontinuity design. There is also a fuzzy RD design, in which crossing the cut point changes the probability of treatment rather than determining it outright; it is not discussed in this book, but can be explored in reference readings.

Similar to the other quasi-experimental designs, selection bias can invalidate this design. For instance, suppose that users who progressed were more likely to have been in the product for a few days, rather than it being their first day. They were also more likely to have friends in the product than the users who did not progress. If this is the case—that is, there is nonrandom selection at the cut point—then our estimated “causal” effect of gaining a badge is not valid. We cannot rule out that it is a spurious relationship, where, for instance, having more friends is driving gaining a badge and retention.

Our estimate could be measuring the slight advantages that helped certain users progress to higher levels, instead of our desired causal variable, “gaining a badge.” Selection at the cut point breaks the randomness assumption. It means that the assumption that the users who progressed were similar to those who did not fails. We might be able to model the effects of confounders by adding them into our model estimator, but generally if there is one confounder, there are likely to be many, and there might be no support for controlling for them.

Let’s explore this idea in a little more detail. To remove the effect of confounders, we theoretically need to find users in the treatment group who look like users in the control group. However, if there are one or more confounders, all users with a particular confounder may be found in the treatment group and not in the control group. For instance, suppose all users with friends in the product progress to 51 points, and there is no support for any users with friends in the 45–49 interval. We then cannot estimate the “causal” effect on the full population; that is, we might be able to estimate the effect only for users with no friends.

To model out confounders, we need to understand where the selection is occurring, have support in both the treatment and control groups, and properly model the selection. We might just need to drop observations where support in the treatment or control group is lacking.
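A support check of this kind can be sketched in a few lines of Python (the book itself works in R, so this is only an illustrative stand-in). The function names, the `(score, has_friends)` record layout, and the toy data are all hypothetical. The sketch counts users on each side of the cut point per confounder value and keeps only values seen on both sides.

```python
from collections import defaultdict

CUT_POINT = 50  # badge threshold from the running example


def support_by_stratum(users, window=5):
    """Count control and treated users near the cut point per confounder value.

    `users` is a list of (score, has_friends) tuples; both fields are
    hypothetical stand-ins for whatever the product actually logs.
    """
    counts = defaultdict(lambda: {"control": 0, "treated": 0})
    for score, has_friends in users:
        if abs(score - CUT_POINT) > window:
            continue  # outside the local window, so not used by the RD design
        side = "treated" if score >= CUT_POINT else "control"
        counts[has_friends][side] += 1
    return dict(counts)


def strata_with_support(counts, min_n=1):
    """Keep only confounder values observed on both sides of the cut point."""
    return [value for value, c in counts.items()
            if c["control"] >= min_n and c["treated"] >= min_n]


# Toy data: every user with friends lands at or above the cut point.
users = [(46, False), (48, False), (49, False), (50, False), (52, False),
         (51, True), (53, True), (54, True)]

counts = support_by_stratum(users)
supported = strata_with_support(counts)
print(counts)
print(supported)
```

In this toy data, users with friends appear only on the treated side, so an effect could be estimated only for users without friends.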

Selection is a problem with all quasi-experimental designs and often drives “causal” effects, especially if they are particularly large. Selection at the cut point is actually very common in practice. Unfortunately, the more important (i.e., the larger) the causal effect, the more selection you’ll likely see. The reason is that the more important this factor is for positive outcomes, the more users will try to “game the system,” so to speak, making those who get a badge less and less random.

For instance, suppose we offered a \$1,000 reward to everyone who progresses. You’d see some strong selection at the cut point. People would talk to one another in an effort to improve their game play, leading to selection. Generally, where regression discontinuity is most useful, it’s also most likely to be wrong. The one exception is with time as the discontinuity variable. We’ll discuss this special case when we consider interrupted time series. RD can require some advanced, or one-off, modeling techniques to estimate the LATE, but has some advantages over the DID design. In particular, it is easily observable, if data is plotted correctly, when RD’s core assumptions fail.

### 11.1.4 RD Complexities

Three complexities arise with the RD design:

• Selection at the cut point

• RD is only defined in the limit

• “Clumpiness” in the data around the cut point

Generally, if you can handle these three issues, then the RD design is valid.

First, the difficulty with arbitrary human cut points is that there is often selection at the cut point. In two very popular examples of RD design, scholarships and elections, researchers have found rampant selection at the supposedly “random” cut points: higher-income students are more likely to get scholarships, and candidates with more campaign contributions are more likely to win elections. This is a huge problem. The benefit of RD is that such selection is often observable when each confounding variable is plotted close to the cut point.

The second complexity is that RD is only defined in the limit. If you remember back to your high school calculus class, the limit is the value that a function approaches as it gets closer and closer to some point, here the cut point. In our example, the arbitrary cut point is 50, and the effect is only defined in extremely close increments from 50. In practice, however, scores are whole numbers: 49 and 51 are defined, while 49.999999999 is undefined and no user can take this value. We then face the problem of how large the control group (left of the cut point) and the treatment group (right of the cut point) should be. How much data do we have close to the cut point? Is a player who scored 48 or 52 really still comparable? How about 45 and 55? We can do robustness checks and vary our sample sizes to see how much those changes affect the estimate. However, if the results strongly differ, which they theoretically should, we need to ask ourselves if there is enough coverage at the cut point.
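The robustness check described above can be sketched in Python (the book works in R, so this is only an illustration under stated assumptions). It assumes integer scores, a binary `retained` outcome, and a plain difference in means rather than a fitted model. The function `local_difference` and the toy data are hypothetical.

```python
from statistics import mean

CUT_POINT = 50  # badge threshold from the running example


def local_difference(users, bandwidth):
    """Difference in mean outcome just above versus just below the cut point.

    `users` is a list of (score, retained) tuples, with `retained` equal to
    1 or 0. The window holds `bandwidth` integer scores on each side.
    Returns None when either side of the window is empty.
    """
    treated = [y for x, y in users if CUT_POINT <= x < CUT_POINT + bandwidth]
    control = [y for x, y in users if CUT_POINT - bandwidth <= x < CUT_POINT]
    if not treated or not control:
        return None
    return mean(treated) - mean(control)


# Hand-built toy data: (score, retained)
users = [(40, 0), (42, 0), (44, 1), (46, 0), (47, 1), (48, 0), (49, 1),
         (50, 1), (51, 1), (52, 0), (53, 1), (55, 1), (57, 1), (59, 1)]

# Re-estimate the effect as the window around the cut point widens.
for bandwidth in (1, 2, 5, 10):
    print(bandwidth, local_difference(users, bandwidth))
```

If the estimate swings widely as the window changes, as it does with this deliberately tiny sample, that is a sign there may not be enough data close to the cut point.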

The third problem is how to model “clumpiness” near the cut points. If different types of models get different estimates for the pre-treatment and post-treatment values, what model should we use? We’ll see an example related to this issue in the next section, and suggest how to estimate the effect. Is there a true effect or is there rampant selection? Clumpiness at the cut point could also be a sign of selection. [You can test for clumpiness attributable to selection. While beyond the scope of this book, check out the McCrary density test described by McCrary (2008) and implemented in the R package ‘rdd’. The idea here is that if there is no selection, there should not be irregular clumpiness on either side of the cut point, and both sides of the cut point should be proportionally similar.]
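As a rough first look at clumpiness, you can count users just below and just above the cut point and ask how lopsided the split is. The Python sketch below is not the McCrary test. It is a much cruder check that assumes the score density is roughly flat inside a narrow window, so a user in the window is about equally likely to fall on either side. The function name and the toy scores are hypothetical.

```python
from math import comb

CUT_POINT = 50  # badge threshold from the running example


def density_check(scores, window=3):
    """Compare how many users fall just below versus just above the cut point.

    Returns (n_below, n_above, p_value), where p_value comes from an exact
    two-sided binomial test with success probability 0.5. The p-value is
    None when the window contains no users.
    """
    below = sum(1 for s in scores if CUT_POINT - window <= s < CUT_POINT)
    above = sum(1 for s in scores if CUT_POINT <= s < CUT_POINT + window)
    n = below + above
    if n == 0:
        return below, above, None
    # Total probability of splits at least as lopsided as the observed one.
    extreme = abs(above - n / 2)
    p_value = sum(comb(n, k) for k in range(n + 1)
                  if abs(k - n / 2) >= extreme) / 2 ** n
    return below, above, p_value


# Toy scores with suspicious bunching at exactly 50 points.
scores = [47, 49] + [50] * 6 + [51] * 3 + [52]
result = density_check(scores)
print(result)
```

A small p-value here only flags an imbalance worth investigating. It does not prove selection, because a density that slopes naturally through the window would also produce an uneven split.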

### 11.1.5 Graphing the Data

With RD, there’s a simple rule: When in doubt, graph. How do we graph RD? The x-axis is the treatment variable, and we should focus on the area around the cut point. The y-axis is either our outcome or a confounder variable measured prior to treatment. Often with RD designs, the effect or the selection becomes visible just by graphing the data.

Graph the data closest to the cut point, because that is where RD is defined. Whether your assumptions hold away from the cut point is largely irrelevant in an RD design. RD designs have been invalidated because of selection very close to the cut point. Even if only a few well-connected users are being selected for, that will invalidate your design.
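Before plotting, it often helps to average the y-axis variable within narrow score bins on each side of the cut point. The Python sketch below prepares those binned means. The function name, bin width, window, and toy data are hypothetical, and the y values stand in for a pre-treatment covariate such as whether a user has friends in the product.

```python
from collections import defaultdict

CUT_POINT = 50  # badge threshold from the running example


def binned_means(points, bin_width=2, window=10):
    """Average y within score bins near the cut point, ready for plotting.

    `points` is a list of (score, y) tuples, where y is either the outcome
    or a pre-treatment covariate. Bins are aligned to the cut point so that
    no bin straddles it. Returns a sorted list of (bin_start, mean_y, n).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, y in points:
        if not (CUT_POINT - window <= score < CUT_POINT + window):
            continue  # keep only the area around the cut point
        # Floor division relative to the cut point keeps each bin on one side.
        bin_start = CUT_POINT + ((score - CUT_POINT) // bin_width) * bin_width
        sums[bin_start] += y
        counts[bin_start] += 1
    return [(b, sums[b] / counts[b], counts[b]) for b in sorted(counts)]


# Toy data: (score, has_friends) with has_friends coded as 1 or 0.
points = [(45, 0), (46, 0), (47, 1), (48, 0), (49, 0),
          (50, 1), (51, 1), (52, 1), (53, 0)]

bins = binned_means(points)
for bin_start, mean_y, n in bins:
    print(bin_start, mean_y, n)
```

Plot `bin_start` on the x-axis and the bin mean on the y-axis. A visible jump in a pre-treatment covariate at the cut point, as in this toy data, suggests selection rather than a treatment effect.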

When we have adequately accounted for these problems in the design, RD can be a useful tool. It is best applied when an arbitrary cut point is not well known by the players, so that no strategies or selection problems occur at the cut point. Now, we’ll discuss a numeric regression discontinuity example.
