Reshaping and Cleaning Common Data

By Daniel Y. Chen
Feb 21, 2018

📄 Contents

␡

6.1 Introduction
Concept Map
Objectives
6.2 Columns Contain Values, Not Variables
6.3 Columns Contain Multiple Variables
6.4 Variables in Both Rows and Columns
6.5 Multiple Observational Units in a Table (Normalization)
6.6 Observational Units Across Multiple Tables
6.7 Conclusion

⎙ Print

< Back Page 7 of 9 Next >

This chapter is from the book 

Pandas for Everyone: Python Data Analysis

Learn More Buy

6.5 Multiple Observational Units in a Table (Normalization)

One of the simplest ways of knowing whether multiple observational units are represented in a table is by looking at each of the rows, and taking note of any cells or values that are being repeated from row to row. This is very common in government education administration data, where student demographics are reported for each student for each year the student is enrolled.

Let’s look again at the Billboard data we cleaned in Section 6.2.2.

print(billboard_long.head())

   year        artist                    track  time date.entered  0  2000         2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26
1  2000       2Ge+her  The Hardest Part Of ...  3:15   2000-09-02
2  2000  3 Doors Down               Kryptonite  3:53   2000-04-08
3  2000  3 Doors Down                    Loser  4:24   2000-10-21
4  2000      504 Boyz            Wobble Wobble  3:35   2000-04-15

  week  rating
0  wk1    87.0
1  wk1    91.0
2  wk1    81.0
3  wk1    76.0
4  wk1    57.0

Suppose we subset (Section 2.4.1) the data based on a particular track:

print(billboard_long[billboard_long.track == 'Loser'].head())

      year        artist  track  time date.entered week rating
3     2000  3 Doors Down  Loser  4:24   2000-10-21  wk1   76.0
320   2000  3 Doors Down  Loser  4:24   2000-10-21  wk2   76.0
637   2000  3 Doors Down  Loser  4:24   2000-10-21  wk3   72.0
954   2000  3 Doors Down  Loser  4:24   2000-10-21  wk4   69.0
1271  2000  3 Doors Down  Loser  4:24   2000-10-21  wk5   67.0

We can see that this table actually holds two types of data: the track information and the weekly ranking. It would be better to store the track information in a separate table. This way, the information stored in the year, artist, track, and time columns would not be repeated in the data set. This consideration is particularly important if the data is manually entered. Repeating the same values over and over during data entry increases the risk of inconsistent data.

What we should do in this case is to place the year, artist, track, time, and date.entered in a new dataframe, with each unique set of values being assigned a unique ID. We can then use this unique ID in a second dataframe that represents a song, date, week number, and ranking. This entire process can be thought of as reversing the steps in concatenating and merging data described in Chapter 4.

billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
print(billboard_songs.shape)

(24092, 4)

We know there are duplicate entries in this dataframe, so we need to drop the duplicate rows.

billboard_songs = billboard_songs.drop_duplicates()
print(billboard_songs.shape)

(317, 4)

We can then assign a unique value to each row of data.

billboard_songs['id'] = range(len(billboard_songs))
print(billboard_songs.head(n=10))

   year          artist                    track  time  id
0  2000           2 Pac  Baby Don't Cry (Keep...  4:22   0
1  2000         2Ge+her  The Hardest Part Of ...  3:15   1
2  2000    3 Doors Down               Kryptonite  3:53   2
3  2000    3 Doors Down                    Loser  4:24   3
4  2000        504 Boyz            Wobble Wobble  3:35   4
5  2000            98^0  Give Me Just One Nig...  3:24   5
6  2000         A*Teens            Dancing Queen  3:44   6
7  2000         Aaliyah            I Don't Wanna  4:15   7
8  2000         Aaliyah                Try Again  4:03   8
9  2000  Adams, Yolanda            Open My Heart  5:30   9

Now that we have a separate dataframe about songs, we can use the newly created id column to match a song to its weekly ranking.

# Merge the song dataframe to the original data set
billboard_ratings = billboard_long.merge(
     billboard_songs, on=['year', 'artist', 'track', 'time'])
print(billboard_ratings.shape)

(24092, 8)

print(billboard_ratings.head())

   year artist                    track  time date.entered week  0  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk1
1  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk2
2  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk3
3  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk4
4  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk5

   rating  id
0    87.0   0
1    82.0   0
2    72.0   0
3    77.0   0
4    87.0   0

Finally, we subset the columns to the ones we want in our ratings dataframe.

billboard_ratings =     billboard_ratings[['id', 'date.entered', 'week', 'rating']]
print(billboard_ratings.head())

   id date.entered week  rating
0   0   2000-02-26  wk1    87.0
1   0   2000-02-26  wk2    82.0
2   0   2000-02-26  wk3    72.0
3   0   2000-02-26  wk4    77.0
4   0   2000-02-26  wk5    87.0

< Back Page 7 of 9 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Email Address

Reshaping and Cleaning Common Data

This chapter is from the book

This chapter is from the book

This chapter is from the book 

6.5 Multiple Observational Units in a Table (Normalization)

InformIT Promotional Mailings & Special Offers