From the data to the story: A typical ddj workflow in R

by Marie-Louise Timcke 2 Comments
From the data to the story: A typical ddj workflow in R

R is getting more and more popular among Data Journalists worldwide, as Timo Grossenbacher from SRF Data pointed out recently in a talk at useR!2017 conference in Brussels. Working as a data trainee at Berliner Morgenpost’s Interactive Team, I can confirm that R indeed played an important role in many of our lately published projects, for example when we identified the strongholds of german parties. While we also use the software for more complex statistics from time to time, something that R helps us with on a near-daily basis is the act of cleaning, joining and superficially analyzing data. Sometimes it’s just to briefly check if there is a story hiding in the data. But sometimes, the steps you will learn in this tutorial are just the first part of a bigger, deeper data analysis.

So the following tutorial will guide you through the standard steps I always take when getting the data for a new data driven project at the Morgenpost. For this exemplary data journalism R workflow, we’ll have a look at data from BBSR Germany. As always, you’ll find the data and the (less well-commented) code on our GitHub page. This time, the R script is organized as an R Project, which makes collaboration easier because the working directory is automatically set to the directory where the Rproj file is located. A double click on the Rproj file opens a new RStudio Window that is already set to the right working directory of the analysis. Then you can open the R script as usual.

Note: R has a great selection of basic functions to get you going. And because it’s open source, there are plenty of great packages with additional functions. I’d really like to recommend to you these five packages, which we’ll also use in this tutorial:

My five favorite R packages

I always subdivide my R scripts into three parts:
Head: The part where I load packages and data
Preprocessing: The part where I clean and, well, preprocess the data for analysis
Analysis: The part where I analyze the data

The three parts don’t necessarily have to be in the same script. To make the analysis cleaner, you could save the head and preprocessing in a separate R-file that can either be sourced in the beginning of the analysis script or have the preprocessed data saved as a new data source that can be loaded in the analysis script.

Head

Enough talk! Let’s start with loading the packages we need for our analysis

Next, we need to load the data. For this example, I prepared an Excel Worksheet with two data sheets. Both contain data on Germany’s 402 city districts. We have the unique district ID and the district name as well as information on whether the district is a county or a city district. The first Excel sheet contains the average age of the district’s male population, the second contains the same data for the female population.

Preprocessing

Now that we have the data, we have to do a little preprocessing. We want to merge both dataframes into one that contains the age of the male and female population for each district. Let’s have a look at the frames to check whether we can do the merge:

We have two findings:

  1. The dataframes aren’t sorted in the same way and the column names aren’t the same either.  Luckily, the order isn’t important for the merge. We’ll get there later.
  2. age_female has more rows than age_male and more rows than there are german districts. What’s the reason for this?

Maybe there are duplicated rows:

Great, we just had to remove some duplicated rows. Now, let’s merge the dataframes! If they were ordered in exactly the same way,  we could use the function cbind() to simply add the age column of age_female to age_male. But because the order isn’t the same, we have to use merge(). merge() joins dataframes according to a column containing unique values that both have in common. For our example, the unique district ID seems to be the best choice. Because we just want to add the age column of age_female and not all its columns, we only select its first and fourth column by indexing age_female[c(1,4)]. Then we have to tell merge() the name of the matching column with the parameters by.x for the first and by.y for the second dataframe. If the matching columns had the same name, we could just use the parameter by.

P.S.: Merging also works with data frames of different lengths. In that case, you can specify whether you want to keep unmatchable rows. Type ?merge into the R console for more information.

Now for the tidyverse: We want the columns average_age_males and average_age_females to be converted into one column containing the variables name and one containing the matching values. With the parameter key, we tell gather() what the new column with the attribute names should be called. We give value the new column name for the values and then specify the columns that should be gathered by applying the columns’ indexes.

By the way: 1:3 is just the same as c(1,2,3).

Analysis

As mentioned before, a very nice package that’s great for getting a first overview of your data is dplyr. Like tidyr, it’s a package by Hadley Wickham and designed to work well with the tidy data format.

Helpful dplyr function we'll use in the following steps

Let’s get an overview of our data by filtering and summarizing the values:

See how dplyr makes it really easy to have a look at different aspects of your data by just combining different functions?
You can even easily answer questions that seem a little bit more difficult at first, as long as you know your R functions. This is something that won’t be as simple with tools like Excel.

For example:
Let’s find the youngest city for every german state.

The states have unique IDs, represented by the first two numbers of the district ID. So we have to group the data by the first two numbers of district_id, then only select the row in each group with the minimum value in district_mean. To make this specific grouping possible, we need help from the base function substr(). To check out how it works, just type ?substr into your console.

Finally, arrange the data in ascending order according to the district_mean.

…now take some time to imagine getting the same result with a tool like Excel…

Understood how dplyr works? Then try your own combinations on the data. Find the cities that are closest to the german age average or look whether there are more districts where males are older than females or if it’s the other way around.

 


 

Visualize

We’ve now found out a lot about our data by simply filtering, summarizing and arranging it. But sometimes, a simple visualization helps a lot in finding patterns, too. In a previous post, we already explained how ggplot2 works. This is why this time, we’ll go a step further and plot the data as a static choropleth map using the package.
The following guide is just one possible approach to such a map.

First, we need some geodata.It can be provided in different formats, for example as a GeoJSON. In this example, we have an ESRI Shapefile of Germany’s city and county districts. An ESRI Shapefile contains multiple files that have to be stored in the same directory. Nevertheless, we only will load the SHP file into R using rgdal’s readOGR().

krs_shape basically consists of two parts: A dataframe in krs_shape@data and the geographic information in krs_shape@polygons. The dataframe has a column KRS containing the very same unique district IDs as our age dataframe. We could merge the dataframes, plot the shapefile and colorize the plot according to the age values.
But, as always, I like to work with tidy data. This is why we’ll loaded the package broom before. If you take a look at the shapefile…

…you may understand why I’d like to keep the data a little bit more simple. broom’s tidy()-function simplifies the geodata:

Much better! We now have one row per polygon point and group IDs so we know which points belong to the same shape.
But we have a loss of information here, too: Where’s the district ID?
The district ID is swept away by broom. We still have IDs though, starting at zero and identifying the districts in the same order as they appeared in krs_shape@data. So “Schweinfurt” now has id=0 and “Würzburg” id=1.
There may be a simpler way to work around this problem, but here is what I usually do:

In the first step, I save the shapefiles’ district IDs as numerics in a new variable

Next, I arrange my dataframe to match the order of the IDs in the shapefile, then add new IDs from 0 to 401 that will match the tidied shapefile IDs:

Now I merge the tidied shapefile with my data by the new ID. It is important to not lose any shapefile rows while merging and keep the plott order straight, so be sure to set all.x to TRUE and arrange in ascending order according to the ID column.

This is our final plotting data. Every point of each district’s shapefile now has additional information like the district’s average age. Let’s plot this data with ggplot2!

Because we already explained how ggplot basically works in a previous post, I’ll only comment the code for the choropleth:

And this is what the result should look like:

 

Of course, our analyzed example data isn’t a data story treasure. The old eastern Germany might be a story, one of our arranged lists might be, too. Or the results just gave you a hint where to dig deeper. Whatever conclusions you draw from your brief data analysis with R: It was brief! With the few R methods we’ve shown you today, it won’t take you much time to be able to draw the first important conclusions from any given data.

As always, you’ll find the entire code for this example on our GitHub page. If you have any questions, suggestions or feedback, feel free to leave a comment! We’ll try to answer fast.

Comments ( 2 )

  1. Replythe T
    Nice guide! Little question: which editor theme are you using here?

Leave a reply

Your email address will not be published.

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>