This chapter is from the book

Exploring Your Data

Later in this book, you’ll see a range of functionality for manipulating data frames. For now, it is useful for you to look at a few simple functions that will help you to quickly understand the data stored in a data frame.

The Top and Bottom of Your Data

The head function returns the first few rows of a data frame. This is particularly useful when you have a large data frame and want only a high-level understanding of its structure. The head function accepts any data frame and returns, by default, only the first six rows. For this example, we use the built-in iris data frame (for more information, open its help file using the ?iris command):

> nrow(iris)         # Number of rows in iris
[1] 150
> head(iris)         # Return only the first 6 rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

This immediately gives us a view of the structure of the data. We can see that the iris data frame has five columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four columns seem to be numeric, whereas the Species column appears to be character (or a “factor,” as briefly discussed earlier).
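To confirm these impressions rather than infer them from the printed rows, one option is to ask R for the class of each column. The following is a minimal sketch using the base functions sapply, class, and str; the variable name col_classes is ours, chosen for illustration:

```r
# Class of each column in the built-in iris data frame
col_classes <- sapply(iris, class)
print(col_classes)

# Compact overview: dimensions, column types, and the first few values
str(iris)
```

The str output also lists the levels of any factor column, which makes it a convenient companion to head.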

The second argument to the head function is the number of rows to return. Therefore, we could look at more or fewer rows if we wish:

> head(iris, 2)   # Return only the first 2 rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

If instead we wanted to look at the last few rows, we could use the tail function. This works in the same way as the head function, with the data frame as the first input and (optionally) the number of rows to return as the second input:

> tail(iris)      # Return only the last 6 rows
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
> tail(iris, 2)   # Return only the last 2 rows
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
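Both functions also accept a negative second argument, which is read as "all rows except this many" rather than "this many rows." The sketch below illustrates this with the 150-row iris data frame; the variable names are ours and purely illustrative:

```r
# A negative n drops rows from the opposite end instead of counting from this one.
# iris has 150 rows, so dropping 148 leaves 2.
first_two <- head(iris, -148)   # everything except the last 148 rows
last_two  <- tail(iris, -148)   # everything except the first 148 rows

print(first_two)
print(last_two)
```

This form can be handy when you know how many rows to discard but not how many the data frame contains.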

Viewing Your Data

If you are using the RStudio interface, you can use the View function to open the data in a viewing grid. This feature in RStudio is evolving quickly, so readers of this book may find the functionality richer than that presented here (the version of RStudio being used is 0.99.441). See Figure 4.1 for an example.


FIGURE 4.1 The iris dataset viewed in the RStudio data grid viewer

If we use the View function, our data frame is opened in the data grid viewer in RStudio:

> View(iris)     # Open the iris data in the data grid viewer

This window allows us to scroll around our data, and tells us the range of data we are viewing (for example, in Figure 4.1 the message at the bottom of the viewer tells us that we are looking at rows “1 to 19 of 150”).

The search bar (top right of the window) allows us to enter a search term that is matched against the entire dataset. The viewer then interactively filters the data based on a partial match of that term. As a quick example, look at the result of typing 4.5 in the search bar, as shown in Figure 4.2.


FIGURE 4.2 Using the search bar in the data grid viewer

If we click the Filter icon from the top of the data grid viewer window, we will see a number of filtering fields appear, which we can use to interactively subset the data in a more data-driven manner. This example uses the filter feature to look only at rows for the “setosa” species with Sepal.Length greater than 5.5 (see Figure 4.3).


FIGURE 4.3 Filtering data in the data grid viewer
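The same subset can be reproduced at the console, which is useful when the filter needs to be repeated or saved. The following is a sketch of one way to do it with base R, assuming the same criteria as the filter example (the “setosa” species with Sepal.Length greater than 5.5); the names setosa_long, keep, and same_rows are illustrative only:

```r
# Rows for the setosa species with Sepal.Length greater than 5.5
setosa_long <- subset(iris, Species == "setosa" & Sepal.Length > 5.5)
print(setosa_long)

# The same selection written with a logical index
keep <- iris$Species == "setosa" & iris$Sepal.Length > 5.5
same_rows <- iris[keep, ]
```

Unlike the data grid viewer, which only changes what is displayed, these commands return a new data frame that can be stored and reused.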

Summarizing Your Data

We can use the summary function to produce a statistical summary of our data. The summary function accepts a data frame and produces a textual summary of each column:

> summary(iris)   # Produce a textual summary
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Note that the summaries produced are suitable for each column type (statistical summary for numeric columns, frequency count for factor columns).
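The summary function can also be applied to a single column, and the frequency counts it reports for a factor can be obtained directly with table. The following short sketch shows both; the variable names are ours:

```r
# Summary of a single numeric column
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# Frequency count for a factor column
species_counts <- table(iris$Species)
print(species_counts)
```

The values returned should agree with the corresponding columns of the summary(iris) output above.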

Visualizing Your Data

In this book, you will see a number of functions for creating sophisticated graphical outputs. However, let’s look at one simple function that creates an immediate visualization of the structure of our data.

We can create a scatter-plot matrix of our data frame using the pairs function as follows:

> pairs(iris)   # Scatter-plot matrix of iris

In the graphic shown in Figure 4.4, each variable in the data is plotted against every other variable. For example, the plot in the top-right corner shows Sepal.Length (y axis) against Species (x axis).


FIGURE 4.4 Scatter-plot matrix of the iris data frame

From this plot we can quickly identify a number of characteristics of our data:

  • We see that the data has five columns, whose names are printed on the diagonal of the plot.
  • We can again see that Species is a factor column, whereas the rest are numeric.
  • If we look at the plots on the right side of the chart, we can see each numeric variable plotted against Species and note that the numeric data would seem to vary across the levels of Species.
  • Columns Petal.Length and Petal.Width would seem to be highly correlated.
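
The last two observations can be checked numerically. The sketch below is one possible approach, using the base functions cor and aggregate; the variable names petal_cor and species_means are illustrative:

```r
# Correlation between the two petal measurements
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
print(petal_cor)

# Mean of each numeric column within each level of Species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

A correlation close to 1 would support the impression given by the plot, and differing group means would support the impression that the measurements vary by species.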