Reshaping and Cleaning Common Data

By Daniel Y. Chen
Feb 21, 2018

📄 Contents

␡

6.1 Introduction
Concept Map
Objectives
6.2 Columns Contain Values, Not Variables
6.3 Columns Contain Multiple Variables
6.4 Variables in Both Rows and Columns
6.5 Multiple Observational Units in a Table (Normalization)
6.6 Observational Units Across Multiple Tables
6.7 Conclusion

⎙ Print

< Back Page 6 of 9 Next >

This chapter is from the book 

Pandas for Everyone: Python Data Analysis

Learn More Buy

6.4 Variables in Both Rows and Columns

At times data will be formatted so that variables are in both rows and columns—that is, in some combination of the formats described in previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented. What is left to show is what happens if a column of data actually holds two variables instead of one variable. In this case, we will have to pivot or cast the variable into separate columns.

weather = pd.read_csv('../data/weather.csv')
print(weather.iloc[:5, :11])

        id  year  month element  d1    d2    d3  d4    d5  d6  d7
0  MX17004  2010      1    tmax NaN   NaN   NaN NaN   NaN NaN NaN
1  MX17004  2010      1    tmin NaN   NaN   NaN NaN   NaN NaN NaN
2  MX17004  2010      2    tmax NaN  27.3  24.1 NaN   NaN NaN NaN
3  MX17004  2010      2    tmin NaN  14.4  14.4 NaN   NaN NaN NaN
4  MX17004  2010      3    tmax NaN   NaN   NaN NaN  32.1 NaN NaN

The weather data include minimum and maximum (tmin and tmax values in the element column, respectively) temperatures recorded for each day (d1, d2, ..., d31)ofthe month (month). The element column contains variables that need to be casted/pivoted to become new columns, and the day variables need to be melted into row values. Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let’s first melt/unpivot the day values.

weather_melt = pd.melt(weather,
                       id_vars=['id', 'year', 'month', 'element'],
                       var_name='day',
                       value_name='temp')
print(weather_melt.head())

        id  year  month element day  temp
0  MX17004  2010      1    tmax  d1   NaN
1  MX17004  2010      1    tmin  d1   NaN
2  MX17004  2010      2    tmax  d1   NaN
3  MX17004  2010      2    tmin  d1   NaN
4  MX17004  2010      3    tmax  d1   NaN

print(weather_melt.tail())

          id  year  month element  day  temp
677  MX17004  2010     10    tmin  d31   NaN
678  MX17004  2010     11    tmax  d31   NaN
679  MX17004  2010     11    tmin  d31   NaN
680  MX17004  2010     12    tmax  d31   NaN
681  MX17004  2010     12    tmin  d31   NaN

Next, we need to pivot up the variables stored in the element column. This process is referred to as casting or spreading in other statistical languages. One of the main differences between pivot_table and melt is that melt is a function within Pandas, whereas pivot_table is a method we call on a DataFrame object.

weather_tidy = weather_melt.pivot_table(
    index=['id', 'year', 'month', 'day'],
    columns='element',
    values='temp')

Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave this table in its current state, but we can also flatten the hierarchical columns.

weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat.head())

element       id  year  month  day  tmax  tmin
0        MX17004  2010      1   d1   NaN   NaN
1        MX17004  2010      1  d10   NaN   NaN
2        MX17004  2010      1  d11   NaN   NaN
3        MX17004  2010      1  d12   NaN   NaN
4        MX17004  2010      1  d13   NaN   NaN

Likewise, we can apply these methods without the intermediate dataframe:

weather_tidy = weather_melt.    pivot_table(
        index=['id', 'year', 'month', 'day'],
        columns='element',
        values='temp').    reset_index()

print(weather_tidy.head())

element       id  year  month  day  tmax  tmin
0        MX17004  2010      1   d1   NaN   NaN
1        MX17004  2010      1  d10   NaN   NaN
2        MX17004  2010      1  d11   NaN   NaN
3        MX17004  2010      1  d12   NaN   NaN
4        MX17004  2010      1  d13   NaN   NaN

< Back Page 6 of 9 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Email Address

Reshaping and Cleaning Common Data

This chapter is from the book

This chapter is from the book

This chapter is from the book 

6.4 Variables in Both Rows and Columns

InformIT Promotional Mailings & Special Offers