Home > Articles

Tidy Data

This chapter is from the book

4.2 Columns Contain Multiple Variables

Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.

ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia',
       'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal',
       'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali',
       'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
      dtype='object')
# print select rows and columns
print(ebola.iloc[:5, [0, 1, 2,10]])
         Date  Day  Cases_Guinea  Deaths_Guinea
0    1/5/2015  289        2776.0         1786.0
1    1/4/2015  288        2775.0         1781.0
2    1/3/2015  287        2769.0         1767.0
3    1/2/2015  286           NaN            NaN
4  12/31/2014  284        2730.0         1739.0

The column names Cases_Guinea and Deaths_Guinea actually contain two variables. The individual status (cases and deaths, respectively) as well as the country name, Guinea. The data is also arranged in a wide format that needs to be reshaped (with the .melt() method).

First, let’s fix the problem we know how to fix, by melting the data into long format.

ebola_long = ebola.melt(id_vars=['Date', 'Day'])
print(ebola_long)
           Date  Day       variable   value
0      1/5/2015  289  Cases_Guinea   2776.0
1      1/4/2015  288  Cases_Guinea   2775.0
2      1/3/2015  287  Cases_Guinea   2769.0
3      1/2/2015  286  Cases_Guinea      NaN
4    12/31/2014  284  Cases_Guinea   2730.0
...         ...  ...           ...      ...


1947  3/27/2014    5   Deaths_Mali      NaN
1948  3/26/2014    4   Deaths_Mali      NaN
1949  3/25/2014    3   Deaths_Mali      NaN
1950  3/24/2014    2   Deaths_Mali      NaN
1951  3/22/2014    0   Deaths_Mali      NaN

[1952 rows x 4 columns]

Conceptually, the column of interest can be split based on the underscore in the column name, _. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 11). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how Series can have methods such as .mean(), and DataFrames can have methods such as .to_csv(). Strings have methods as well. In this case, we will use the .split() method that takes a string and “splits” it up based on a given delimiter. By default, .split() will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the .str. attribute. .str. is a special type of attribute that Pandas calls an “accessor” because it can “access” string methods (see Chapter 11 for more on strings). Access to the Python string methods and allow us to work across the entire column. This will be the key to parting out the multiple bits of information stored in each value.

4.2.1 Split and Add Columns Individually

We can use the .str accessor to make a call to the .split() method and pass in the _ understore.

# get the variable column
# access the string methods
# and split the column based on a delimiter
variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0    [Cases, Guinea]
1    [Cases, Guinea]
2    [Cases, Guinea]
3    [Cases, Guinea]
4    [Cases, Guinea]
Name: variable, dtype: object

After we split on the underscore, the values are returned in a list. We can tell it’s a list by:

  1. Knowing about the .split() method on base Python string objects4

  2. Visually seeing the square brackets in the output, [ ]

  3. Getting the type() of one of the items in the Series

# the entire container
print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container
print(type(variable_split[0]))
<class 'list'>

Now that the column has been split into various pieces, the next step is to assign those pieces to a new column. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the .get() method to “get” the index we want for each row.

status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)
print(status_values)
0        Cases
1        Cases
2        Cases
3        Cases
4        Cases
         ...
1947    Deaths
1948    Deaths
1949    Deaths
1950    Deaths
1951    Deaths
Name: variable, Length: 1952, dtype: object

Now that we have the vectors we want, we can add them to our dataframe.

ebola_long['status'] = status_values
ebola_long['country'] = country_values
print(ebola_long)
           Date  Day       variable   value   status   country
0      1/5/2015  289   Cases_Guinea  2776.0    Cases    Guinea
1      1/4/2015  288   Cases_Guinea  2775.0    Cases    Guinea
2      1/3/2015  287   Cases_Guinea  2769.0    Cases    Guinea
3      1/2/2015  286   Cases_Guinea     NaN    Cases    Guinea
4    12/31/2014  284   Cases_Guinea  2730.0    Cases    Guinea

...         ...  ...            ...     ...      ...       ...
1947  3/27/2014    5    Deaths_Mali     NaN   Deaths      Mali
1948  3/26/2014    4    Deaths_Mali     NaN   Deaths      Mali
1949  3/25/2014    3    Deaths_Mali     NaN   Deaths      Mali
1950  3/24/2014    2    Deaths_Mali     NaN   Deaths      Mali
1951  3/22/2014    0    Deaths_Mali     NaN   Deaths      Mali

[1952 rows x 6 columns]

4.2.2 Split and Combine in a Single Step

We can actually do the above steps in a single step. If we look at the .str.split() method documentation (you can find this by looking by going to the Pandas API documentation > Series > String Handling (.str.) > .split() method5), there is a parameter named expand that defaults to False, but when we set it to True, it will return a DataFrame where each result of the split is in a separate column, instead of a Series of list containers.

# reset our ebola_long data
ebola_long = ebola.melt(id_vars=['Date', 'Day'])

# split the column by _ into a dataframe using expand
variable_split = ebola_long.variable.str.split('_', expand=True)

print(variable_split)
           0       1
0      Cases  Guinea
1      Cases  Guinea
2      Cases  Guinea
3      Cases  Guinea
4      Cases  Guinea
...      ...     ...
1947  Deaths    Mali
1948  Deaths    Mali
1949  Deaths    Mali
1950  Deaths    Mali
1951  Deaths    Mali

[1952 rows x 2 columns]

From here, we can actually use the Python and Pandas multiple assignment feature (Appendix Q), to directly assign the newly split columns into the original DataFrame. Since our output variable_split returned a DataFrame with two columns, we can assign two new columns to our ebola_long DataFrame.

ebola_long[['status', 'country']] = variable_split
print(ebola_long)
           Date  Day      variable    value  status country
0      1/5/2015  289  Cases_Guinea   2776.0   Cases  Guinea
1      1/4/2015  288  Cases_Guinea   2775.0   Cases  Guinea
2      1/3/2015  287  Cases_Guinea   2769.0   Cases  Guinea
3      1/2/2015  286  Cases_Guinea      NaN   Cases  Guinea
4    12/31/2014  284  Cases_Guinea   2730.0   Cases  Guinea
...         ...  ...           ...      ...     ...     ...
1947  3/27/2014    5   Deaths_Mali      NaN  Deaths    Mali
1948  3/26/2014    4   Deaths_Mali      NaN  Deaths    Mali
1949  3/25/2014    3   Deaths_Mali      NaN  Deaths    Mali
1950  3/24/2014    2   Deaths_Mali      NaN  Deaths    Mali
1951  3/22/2014    0   Deaths_Mali      NaN  Deaths    Mali

[1952 rows x 6 columns]

You can also opt to do this as a concatenation (pd.concat()) function call as well (Chapter 6).

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.