Exploring Your Data
Later in this book, you’ll see a range of functionality for manipulating data frames. For now, it is useful for you to look at a few simple functions that will help you to quickly understand the data stored in a data frame.
The Top and Bottom of Your Data
The head function returns the first few rows of a data frame. This is particularly useful when you have a large data frame and want only a high-level understanding of its structure. The head function accepts any data frame and returns (by default) only the first six rows. For this example, we use the built-in iris data frame (for more information, open its help file using the ?iris command):
> nrow(iris) # Number of rows in iris
[1] 150
> head(iris) # Return only the first 6 rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
This immediately gives us a view of the structure of the data. We can see that the iris data frame has five columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four columns seem to be numeric, whereas the Species column appears to hold text values (it is stored as a “factor,” as briefly discussed earlier).
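The column types inferred from the printed rows can also be checked directly. The following is a minimal sketch, assuming the standard iris data frame that ships with R; sapply, class, and levels are base R functions that this section does not otherwise introduce:

```r
# Report the class of every column in iris
col_classes <- sapply(iris, class)
print(col_classes)

# Species is stored as a factor; list its levels
species_levels <- levels(iris$Species)
print(species_levels)
```

This confirms the impression from head without relying on how the values happen to print.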
The second argument to the head function is the number of rows to return. Therefore, we could look at more or fewer rows if we wish:
> head(iris, 2) # Return only the first 2 rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
If instead we wanted to look at the last few rows, we could use the tail function. This works in the same way as the head function, with the data frame as the first input and (optionally) the number of rows to return as the second input:
> tail(iris) # Return only the last 6 rows
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
> tail(iris, 2) # Return only the last 2 rows
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
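The row-count argument of head and tail can also be passed by name, and a negative value is accepted as well. The sketch below is illustrative only; the variable names are hypothetical, and the negative-n behavior is described in the help file (?head) rather than in the text above:

```r
first_two <- head(iris, n = 2)   # first 2 rows
last_two  <- tail(iris, n = 2)   # last 2 rows

# tail keeps the original row names of the rows it returns
print(rownames(last_two))

# A negative n drops rows from the opposite end instead:
# here, everything except the last 148 rows is kept
all_but_last <- head(iris, n = -148)
print(nrow(all_but_last))
```

Naming the argument (n = 2) makes the intent clearer when the call appears in a longer script.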
Viewing Your Data
If you are using the RStudio interface, you can use the View function to open the data in a viewing grid. This feature in RStudio is evolving quickly, so readers of this book may find the functionality richer than that presented here (the version of RStudio being used is 0.99.441). See Figure 4.1 for an example.
FIGURE 4.1 The iris dataset viewed in the RStudio data grid viewer
If we use the View function, our data frame is opened in the data grid viewer in RStudio:
> View(iris) # Open the iris data in the data grid viewer
This window allows us to scroll around our data, and tells us the range of data we are viewing (for example, in Figure 4.1 the message at the bottom of the viewer tells us that we are looking at rows “1 to 19 of 150”).
The search bar (top right of the window) allows us to input search criteria that will be used to search the entire dataset. This is used to interactively filter the data based on a partial matching of the search term. As a quick example, look at the result of typing 4.5 in the search bar, as shown in Figure 4.2.
FIGURE 4.2 Using the search bar in the data grid viewer
If we click the Filter icon from the top of the data grid viewer window, we will see a number of filtering fields appear, which we can use to interactively subset the data in a more data-driven manner. This example uses the filter feature to look only at rows for the “setosa” species with Sepal.Length greater than 5.5 (see Figure 4.3).
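The same filter can be reproduced in code, which is useful when the result needs to be reused rather than just inspected. This is a rough equivalent and not a description of how the RStudio viewer works internally; it assumes the base R subset function, and the object names are hypothetical:

```r
# Keep only setosa rows whose Sepal.Length exceeds 5.5
setosa_long <- subset(iris, Species == "setosa" & Sepal.Length > 5.5)
print(setosa_long)

# The same filter written with logical indexing
keep <- iris$Species == "setosa" & iris$Sepal.Length > 5.5
same_rows <- iris[keep, ]
```

Unlike the interactive filter, the result here is an ordinary data frame that can be passed to other functions such as head or summary.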
FIGURE 4.3 Filtering data in the data grid viewer
Summarizing Your Data
We can use the summary function to produce a statistical summary of our data. The summary function accepts a data frame and produces a textual summary of each column:
> summary(iris) # Produce a textual summary
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Note that the summaries produced are suitable for each column type (statistical summary for numeric columns, frequency count for factor columns).
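The type-dependent behavior is easier to see when summary is applied to one column at a time. A minimal sketch, assuming the $ operator is used to extract single columns; the object names are hypothetical:

```r
# Numeric column: minimum, quartiles, mean, and maximum
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# Factor column: number of rows at each level
species_counts <- summary(iris$Species)
print(species_counts)
```

The values returned match the corresponding columns of the full summary(iris) output.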
Visualizing Your Data
In this book, you will see a number of functions for creating sophisticated graphical outputs. However, let’s look at one simple function that creates an immediate visualization of the structure of our data.
We can create a scatter-plot matrix of our data frame using the pairs function as follows:
> pairs(iris) # Scatter-plot matrix of iris
In the graphic shown in Figure 4.4, each variable in the data is plotted against every other variable. For example, the plot in the top-right corner shows Sepal.Length (y axis) against Species (x axis).
FIGURE 4.4 Scatter-plot matrix of the iris data frame
From this plot we can quickly identify a number of characteristics of our data:
- We see that the data has five columns, whose names are printed on the diagonal of the plot.
- We can again see that Species is a factor column, whereas the rest are numeric.
- If we look at the plots on the right side of the chart, we can see each numeric variable plotted against Species and note that the numeric data would seem to differ between the levels of Species.
- Columns Petal.Length and Petal.Width would seem to be highly correlated.
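The last two observations can be checked numerically. The following sketch is one possible approach, not the only one; it assumes the base R functions cor and aggregate, neither of which is introduced in this section, and the object names are hypothetical:

```r
# Pearson correlation between the two petal measurements
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
print(petal_cor)

# Mean of each numeric column within each level of Species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

A correlation close to 1 supports the visual impression of a strong positive linear relationship, and the per-species means show how the numeric columns differ from one level of Species to the next.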