R: Tidy Data

R: Tidy Data

Unfortunately, data comes in all shapes and sizes. Especially when analyzing data from authorities. You’ll have to be able to deal with pdfs, fused table cells and frequent changes in terms and spelling.

When I analyzed the swiss arms export data as an intern at SRF Data, we had to work with scanned copies of data sheets that weren’t machine-readable, datasets with either french, german or french and german countrynames in the same column as well as fused cells and changing spelling of the categories.

Unsurprisingly, preparing and cleaning messy datasets is often the most time-consuming part of data analysis. Hardley Wickham, creator of  R packages like ggplot and reshape, wrote a very interesting paper about an important part of this data cleaning: the data tidying.
According to him, tidy data has a specific structure:

Each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets.

As you may have seen in our post on ggplot2, Wickham calls this tidy format molten data. The idea behind this is to facilitate the analysis procedure by minimizing the effort in preparing the data for the different tools over and over again. His suggestion: Working on tidy, molten data with a set of tidy tools, allowing you to use the saved time to focus on the results.

Bildschirmfoto 2016-02-29 um 15.02.26

Excerpt of Wickhams “Tidy Data”

Practicing data tidying

But how do we tidy messy data? How do we get from raw to molten data and what are tidy tools? Let’s practice this on a messy dataset.

On our GitHub-page, we deposited an Excel file containing some data on marriages in Germany per state and for different years. Download it and open it with Excel to have a first look at it. As you’ll see, it’s a workbook with seven sheets. We have data for 1990, 2003 and for every year from 2010 through 2014. Although this is a quite small dataset which we could tidy manually in Excel, we’ll use this to practice skills that will come in handy when it comes to bigger datasets.

Now check whether this marriage data needs to be tidied:

  • Are there any changing terms?
  • Is the spelling correct?
  • Is every column that contains numbers correctly saved as a numeric column?
  • Are there needless titles, fused cells, empty cells or other annoying noise?

Spoiler alert: The sheets on 2010-2015 are okay, but the first two — not that much. We have different spelling and terms here, as well as three useless columns and one row plus all the numbers saved as text in the first sheet. As said, the mess in this example is limited and we could tidy it up manually with a few clicks. But let’s keep in mind that we’re here to learn how to handle those problems with larger datasets as well.

Within Excel, we will:

  • Delete spare rows and columns (we could do that in R too when it comes to REALLY messy data)
  • Save columns containing numbers as numeric type

Now we’ll change to R.

First of all, we need to install and require all the packages we’re going to use. We’ll do that with an if-statement telling R only to install the package if it hasn’t been installed yet. You could of course do this in the straightforward way without the conditional statement if you remember wether you already installed the package, but this is a quick way to make sure you don’t install something twice needlessly.

To read in the sheets of an Excel workbook, read_excel() from the readxl-package is a useful function. Because we don’t want to load the sheets separately, we’re going to use a loop for this. If you’re interested in learning more about loop functions, stay tuned for our upcomming tutorial on this topic.

messy_data is now a list of seven local data frames with messy_data[[1]] containing the data for 1990, messy_data[[2]] for 2003 and so on. Also, we added a “timestamp” column to each list element which contains the index of the list element.

To save the sheets as list elements is time saving, but we want all the data in one data frame:

If you get an error telling you the frames have different lengths you probably forgot to delete the spare columns in the 1990 sheet. Sometimes there even seems to be something invisble left in empty excel columns. I usually delete three or so of the empty columns and empty rows next to my data to be sure there isn’t something left I can’t see.

Next part: Restructuring the data

With the function gather() of Wickhams tidyr-package, we’ll melt the raw data frame to convert it to a molten format. And let’s change the timestamps created by the read-in loop to the actual year (we could do that with a loop, too, but this is good enough for now).

Oo-De-Lally! This is tidy data in a tidy format! Now we can check if we have to correct the state names (because with bigger datasets, you can’t quickly check and correct spelling and term issues within Excel):

So we got 19 different german Bundesländer. But Google tells us that there are only 16 states in Germany! Let’s have a closer look at the names to check whether we’ll find duplicates:

Yes, there are! For example Baden-Württemberg and BaWü refer to the same state, as well as Hessen, Hesse and Hesssen. You can just manually correct this. For really big datasets, you could also work with regular expressions and string replacement to find the duplicates, but for now, this should be enough:

Now that your data is tidy, the actual analysis can start. A very useful package to prepare molten data is dplyr. Its functions ease filtering the data or grouping it. Not only is this great for taking a closer look at certain subsets of your data, but, because Wickhams graphics package ggplot2 was created to fit the tidy data principle, we can quickly shape the data to be visually analyzed, too.

Here we have some examples for you showing how tidy data and tidy tools can work hand in hand. If you want to learn something about the graphics package ggplot2 first, visit our post for beginners on this!

Visual analysis with ggplot2: this may look complicated at first, but once you have coded the first ggplot you only have to change or/and add a few things to create several more and totally different plots.

Bildschirmfoto 2016-03-05 um 00.15.33

Maybe you’ve got some other questions this data could answer for you? Feel free to continue this analysis or try to tidy your own data set!

If you have any questions, problems or feedback, simply leave a comment, email us or join our slack team to talk to us any time!

 

{Credits for the awesome featured image go to Phil Ninh}

Comments ( 7 )

  1. Similarity and distance in data: Part 2 | Journocode
    […] you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the […]
  2. ReplyManuel
    Hi, thank you very much for the tutorial! I'm new in learning R, so the question might be a bit general: Anyway, it didn't quite become clear to me what the expressions between the percentage signs mean in detail. I googled a bit, but I did not find a sufficent and easy understandable answer. best regards, Manuel
    • ReplyMarie-Louise Timcke
      Hey Manuel, great you ask! Actually, I used to use this operator without even thinking why or how it actually works. I just knew it links functions together, so basically R knows at the end of the line that there will be some more input comming in the next line (similar to the "+" in ggplot). Now that you asked, I googled a bit for myself and - tadaa - found the magrittr-package. The "%>%"-operator is called pipe-operator. It even has some more functions than I knew until now that you asked, so thank you very much! You can read the most important information here: https://github.com/smbache/magrittr. Have a great day and keep asking questions! Greets, Marie-Louise
      • ReplyMarie-Louise Timcke
        Edit: if you referred to the %in%, too: The %in%-operator comes from the match-function which is defined in the base-package of R. The function match() returns a vector of the positions of matches of its first argument in its second and %in% is basically defined as "%in%" <- function(x, table) match(x, table, nomatch = 0) > 0 so it shortens the usage of the match-function for you. You can read something about this operator here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.html!
  3. ReplyManuel
    Thanks! You are doing a great job here
  4. R: Your first web application with shiny | Journocode
    […] app, you might want to have a look at our tutorial on the graphic package ggplot2 and our guide to tidy data, since we will use some of the functions and principles for it. If you’re already familiar […]
  5. Journocode-Beitrag: Your first web application with shiny (R) – Datentäter
    […] app, you might want to have a look at our tutorial on the graphic package ggplot2 and our guide to tidy data, since we will use some of the functions and principles for it. If you’re already familiar […]

Leave a reply

Your email address will not be published.

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>