12.4 tidyr in Action: Exploring Educational Statistics
This section uses a real data set to demonstrate how reshaping your data with tidyr is an integral part of the data exploration process. The data in this example was downloaded from the World Bank Data Explorer,2 which is a data collection of hundreds of indicators (measures) of different economic and social development factors. In particular, this example considers educational indicators3 that capture a relevant signal of a country’s level of (or investment in) education—for example, government expenditure on education, literacy rates, school enrollment rates, and dozens of other measures of educational attainment. The imperfections of this data set (unnecessary rows at the top of the .csv file, a substantial amount of missing data, long column names with special characters) are representative of the challenges involved in working with real data sets. All graphics in this section were built using the ggplot2 package, which is described in Chapter 16. The complete code for this analysis is also available online in the book’s code repository.4
After having downloaded the data, you will need to load it into your R environment:
# Load data, skipping the unnecessary first 4 rows wb_data <- read.csv( "data/world_bank_data.csv", stringsAsFactors = F, skip = 4 )
When you first load the data, each observation (row) represents an indicator for a country, with features (columns) that are the values of that indicator in a given year (see Figure 12.3). Notice that many values, particularly for earlier years, are missing (NA). Also, because R does not allow column names to be numbers, the read.csv() function has prepended an X to each column name (which is just a number in the raw .csv file).
FIGURE 12.3 Untransformed World Bank educational data used in Section 12.4.
While in terms of the indicator this data is in long format, in terms of the indicator and year the data is in wide format—a single column contains all the values for a single year. This structure allows you to make comparisons between years for the indicators by filtering for the indicator of interest. For example, you could compare each country’s educational expenditure in 1990 to its expenditure in 2014 as follows:
# Visually compare expenditures for 1990 and 2014 # Begin by filtering the rows for the indicator of interest indicator <- "Government expenditure on education, total (% of GDP)" expenditure_plot_data <- wb_data %>% filter(Indicator.Name == indicator) # Plot the expenditure in 1990 against 2014 using the `ggplot2` package # See Chapter 16 for details expenditure_chart <- ggplot(data = expenditure_plot_data) + geom_text_repel( mapping = aes(x = X1990 / 100, y = X2014 / 100, label = Country.Code) ) + scale_x_continuous(labels = percent) + scale_y_continuous(labels = percent) + labs(title = indicator, x = "Expenditure 1990", y = "Expenditure 2014")
Figure 12.4 shows that the expenditure (relative to gross domestic product) is fairly correlated between the two time points: countries that spent more in 1990 also spent more in 2014 (specifically, the correlation—calculated in R using the cor() function—is .64).
Figure 12.4 A comparison of each country’s education expenditures in 1990 and 2014.
However, if you want to extend your analysis to visually compare how the expenditure across all years varies for a given country, you would need to reshape the data. Instead of having each observation be an indicator for a country, you want each observation to be an indicator for a country for a year—thereby having all of the values for all of the years in a single column and making the data long(er) format.
To do this, you can gather() the year columns together:
# Reshape the data to create a new column for the `year` long_year_data <- wb_data %>% gather( key = year, # `year` will be the new key column value = value, # `value` will be the new value column X1960:X # all columns between `X1960` and `X` will be gathered )
As shown in Figure 12.5, this gather() statement creates a year column, so each observation (row) represents the value of an indicator in a particular country in a given year. The expenditure for each year is stored in the value column created (coincidentally, this column is given the name "value").
FIGURE 12.5 Reshaped educational data (long format by year). This structure allows you to more easily create visualizations across multiple years.
This structure will now allow you to compare fluctuations in an indicator’s value over time (across all years):
# Filter the rows for the indicator and country of interest indicator <- "Government expenditure on education, total (% of GDP)" spain_plot_data <- long_year_data %>% filter( Indicator.Name == indicator, Country.Code == "ESP" # Spain ) %>% mutate(year = as.numeric(substr(year, 2, 5))) # remove "X" before each year # Show the educational expenditure over time chart_title <- paste(indicator, " in Spain") spain_chart <- ggplot(data = spain_plot_data) + geom_line(mapping = aes(x = year, y = value / 100)) + scale_y_continuous(labels = percent) + labs(title = chart_title, x = "Year", y = "Percent of GDP Expenditure")
The resulting chart, shown in Figure 12.6, uses the available data to show a timeline of the fluctuations in government expenditures on education in Spain. This produces a more complete picture of the history of educational investment, and draws attention to major changes as well as the absence of data in particular years.
Figure 12.6 Education expenditures over time in Spain.
You may also want to compare two indicators to each other. For example, you may want to assess the relationship between each country’s literacy rate (a first indicator) and its unemployment rate (a second indicator). To do this, you would need to reshape the data again so that each observation is a particular country and each column is an indicator. Since indicators are currently in one column, you need to spread them out using the spread() function:
# Reshape the data to create columns for each indicator wide_data <- long_year_data %>% select(-Indicator.Code) %>% # do not include the `Indicator.Code` column spread( key = Indicator.Name, # new column names are `Indicator.Name` values value = value # populate new columns with values from `value` )
This wide format data shape allows for comparisons between two different indicators. For example, you can explore the relationship between female unemployment and female literacy rates, as shown in Figure 12.7.
# Prepare data and filter for year of interest x_var <- "Literacy rate, adult female (% of females ages 15 and above)" y_var <- "Unemployment, female (% of female labor force) (modeled ILO estimate)" lit_plot_data <- wide_data %>% mutate( lit_percent_2014 = wide_data[, x_var] / 100, employ_percent_2014 = wide_data[, y_var] / 100 ) %>% filter(year == "X2014") # Show the literacy vs. employment rates lit_chart <- ggplot(data = lit_plot_data) + geom_point(mapping = aes(x = lit_percent_2014, y = employ_percent_2014)) + scale_x_continuous(labels = percent) + scale_y_continuous(labels = percent) + labs( x = x_var, y = "Unemployment, female (% of female labor force)", title = "Female Literacy Rate versus Female Unemployment Rate" )
Figure 12.7 Female literacy rate versus unemployment rate in 2014.
Each comparison in this analysis—between two time points, over a full time-series, and between indicators—required a different representation of the data set. Mastering use of the tidyr functions will allow you to quickly transform the shape of your data set, allowing for rapid and effective data analysis. For practice reshaping data with the tidyr package, see the set of accompanying book exercises.5