Scraping for everyone

by Sophie Rotgeri, Moritz Zajonz and Elena Erdmann

One of the most important skills for data journalists is scraping. It allows us to download any data that is openly available online as part of a website, even when it’s not supposed to be downloaded: be it information about the members of parliament or – as in our Christmas-themed example – a list of Christmas markets in Germany.

There are tools that don’t require any programming and are sufficient for many standard scraping tasks. A hand-coded scraper, however, can be fitted precisely to the task at hand, so more complicated scraping jobs might require programming.

Here, we explain three ways of scraping: the no-programming-needed online tool import.io, and writing a scraper in either Python or R – the two most common programming languages in data journalism. If you read the different tutorials, you will see that the steps to programming a scraper are not that different after all. Make your pick:

  • No programming needed (import.io)
  • Python
  • R


Scraping with Import.io

There are two steps to scraping a website. With import.io, you first have to select the information important to you. Then the tool will extract the data for you so you can download it.

You start by telling import.io which URL it should look at, in this case the address “http://weihnachtsmarkt-deutschland.de”. When you start a new “Extractor”, as import.io calls it, you will see a graphical interface. It has two tabs: “Edit” will display the website. Import.io analyses the website’s structure and automatically tries to find and highlight structured information for extraction. If it didn’t select the correct information, you can easily change the selection by clicking on the website elements you are interested in.

In the other tab, “Data”, the tool shows the selected data as a spreadsheet, which is what the downloaded data will look like. If you are satisfied, click “Extract data from website”.

Now, the actual scraping begins. This might take a few seconds or minutes, depending on the amount of data you told it to scrape. And there we go, all the Christmas markets are already listed in a file type of your choosing.


Scraping with Python

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: Downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML-file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In Python, the requests library does that job for you. Install it via pip and import it into your program. To download the webpage, you only have to call requests.get() and hand over the URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. requests.get() returns a response object, and you can display the downloaded website by calling response.text.

Now, the entire website source, with all the Christmas markets, is saved on your computer. You basically just need to clean the file: remove all the HTML tags and the text you’re not interested in, and find the Christmas markets. A computer scientist would call this process parsing the file, so you’re about to create a parser.

Before we start on the parser, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable Developer options or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: If you click on a line of the source code, the corresponding unit of the website is highlighted. Click yourself through the website, until you find the piece of code that encodes exactly the information that you want to extract: in this case the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag, and it ends with </table>. The rows of the table are called <tr> (table row) in HTML. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can turn to your Python script. bs4 (Beautiful Soup 4) is a Python library that can be used to parse HTML files and extract HTML elements from them. If you haven’t used it before, install it using pip. From the package, import BeautifulSoup (because the name is so long, we renamed it to bs in the import). Remember, the code of the website is in response.text. Load it by calling BeautifulSoup. As bs4 can be used for different file types, you need to specify the parser. For your HTML needs, the ‘lxml’ parser is best suited.

Now the entire website is loaded into BeautifulSoup. The cool thing is, BeautifulSoup understands the HTML format. Thus, you can easily search for all kinds of HTML elements in the code. The BeautifulSoup method ‘find_all()’ takes an HTML tag and returns all instances of it. You can further specify the id or class of an element – but this is not necessary for the Christmas markets. As you have seen before, each Christmas market is listed as one row in the table, marked by a <tr> tag. Thus, all you have to do is find all elements with the <tr> tag in the website.

And there we go, all the Christmas markets are already listed in your Python output!

However, the data is still quite dirty. If you have a second look at it, you find that each row consists of two data cells, marked by <td>. The first one is the name of the Christmas market, and links to a separate website with more information. The second data cell is the location of the Christmas market. You can now choose which part of the data you are interested in. BeautifulSoup lets you extract all the text immediately by calling .text on the item at hand. For each row, this will give you the name of the Christmas market.

Done! Now you have the list of Christmas markets ready to work with.


Scraping with R

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: Downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML-file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In R, the “rvest” package packs all the required functions. Install and load it (or use “needs()”) and call “read_html()” to… well, read the HTML from a specified URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. read_html() will return an XML document. To display the structure, call html_structure().
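The original code isn’t embedded here, but a minimal sketch could look like this (calling the document “doc”, which is also the name we’ll use below):

```r
library(rvest)

doc <- read_html("http://weihnachtsmarkt-deutschland.de")  # download and parse the page
xml2::html_structure(doc)                                  # show the nested structure (from xml2, which rvest builds on)
```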

Now that the XML document is downloaded, we can “walk” down the nodes of its tree structure until we can single out the table. Before we do that, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable Developer options or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: If you click on a line of the source code, the corresponding unit of the website is highlighted. Click yourself through the website, until you find the piece of code that encodes exactly the information that you want to extract: in this case the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag, and it ends with </table>. The rows of the table are called <tr> (table row) in HTML. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can go back to R.

First, we will create a nodeset from the document “doc” with “html_children()”. This finds all children of the specified document or node. In this case we call it on the main XML document, so it will find the two main children: “<head>” and “<body>”. From inspecting the source code of the christmas market website, we know that the <table> is a child of <body>. We can specify that we only want <body> and all of its children with an index.

Now we have to narrow it down further: we really only want the table and nothing else. To achieve that, we can navigate to the corresponding <table> tag with “html_node()”. This will return a nodeset of one node, the “<table>”. Now, if we just use the handy “html_table()” function – sounds like it was made just for us! – we can extract all the information that is inside this HTML table directly into a data frame.
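Put together, the walk down to the data frame might look roughly like this (the object names are ours):

```r
body_node  <- html_children(doc)[2]          # the two children are <head> and <body>; keep <body>
table_node <- html_node(body_node, "table")  # single out the <table>
markets    <- html_table(table_node)         # and turn it into a data frame
head(markets)
```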

Done! Now you have a dataframe ready to work with.

From the data to the story: A typical ddj workflow in R

R is getting more and more popular among data journalists worldwide, as Timo Grossenbacher from SRF Data pointed out recently in a talk at the useR!2017 conference in Brussels. Working as a data trainee at Berliner Morgenpost’s Interactive Team, I can confirm that R indeed played an important role in many of our recently published projects, for example when we identified the strongholds of German parties. While we also use the software for more complex statistics from time to time, something that R helps us with on a near-daily basis is the act of cleaning, joining and superficially analyzing data. Sometimes it’s just to briefly check if there is a story hiding in the data. But sometimes, the steps you will learn in this tutorial are just the first part of a bigger, deeper data analysis.

R: Your first web application with shiny

Data-driven journalism doesn’t necessarily involve user interaction. The analysis and its results may be enough to write a dashing article without ever mentioning a number. But let’s face it: We love to interact with data visualizations! To build those, some basic knowledge of JavaScript and HTML is usually required.
What? Your only coding skills are a bit of R? No problemo! What if I told you there was a way to interactively show users your most interesting R-results in a fancy web app?

Shiny to the rescue

Shiny is a highly customizable web application framework that turns your analysis into an interactive web app. No HTML, no JavaScript, no CSS required – although you can use them to expand your app. Also, the layout is responsive (although it’s not perfect for every phone).
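Just to give you a feeling for how little code a Shiny app needs, here is a minimal, generic skeleton – not the emissions app from this tutorial, just a made-up slider-and-histogram example:

```r
library(shiny)

# every Shiny app has two parts: a UI definition and a server function
ui <- fluidPage(
  sliderInput("n", "Number of random values:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))  # re-renders whenever the slider moves
}

shinyApp(ui = ui, server = server)
```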

In this tutorial, we will learn step by step how to code the Shiny app on Germany’s air pollutant emissions that you can see below.

Similarity and distance in data: Part 2

Part 1 | Code

In part one of this tutorial, you learned what distance and similarity mean for data and how to measure them. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.

Implementation in R: the dist() function

The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.

The crucial argument here is method. It has six options – several of them closely related, as you’ll see:

  • “euclidean”: the Euclidean distance.
  • “maximum”: the maximum distance – the largest difference between any two components.
  • “manhattan”: the Manhattan or city block distance.
  • “canberra”: a weighted variant of the Manhattan distance, where each term is divided by the sum of the absolute values.
  • “binary”: the Jaccard distance.
  • “minkowski”: also called the Lp norm, the generalized version of the Euclidean and Manhattan distance. It returns the Manhattan distance for p = 1 and the Euclidean distance for p = 2.

We’re going to be working with the Jaccard distance in this tutorial, but it works just as well for the other distance measures.

Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function.

It contains the data frame values, which holds data on which European right-wing parties agree with which right-wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros – one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.

Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is a very important distinction, since some distance functions will calculate over rows by default, some over columns.

The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():
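A sketch of that step – the file name is a placeholder for whatever you saved the download as; “values” is the data frame described above, and “jacc” is the name we’ll keep using:

```r
load("parties.Rdata")                       # loads the data frame "values" into the workspace
jacc <- dist(t(values), method = "binary")  # transpose first, then the Jaccard ("binary") distance
```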

Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape. That’s because we’re calculating the distance of every party to every other party, so the resulting matrix would be symmetric and half of it redundant. For our visualization, though, we need the full rectangle. So to prepare for plotting, we’ll have to do two things:

  • Add the diagonal and the upper triangle to make a complete rectangle shape.
  • Convert back from a dist object to a data frame so ggplot can work with the data.

Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? Once we’ve converted back to a data frame, we can simply use 1 - jacc to get the Jaccard similarities from the Jaccard distances the dist() function returns.
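One way to do both steps, sketched out (as.matrix() fills in the diagonal and the upper triangle for us):

```r
jaccsim <- as.data.frame(as.matrix(jacc))   # full symmetric matrix, then back to a data frame
jaccsim <- 1 - jaccsim                      # distances become similarities
```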

If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.

From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.

Melting the data

If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the specific tidy format ggplot works with. To melt our dataframe into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a column containing the party names currently stored as row names, so the melting function will know what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data.
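A sketch of the melting step – we call the new column “names” so it matches the aes() call further down, and “jaccsim_molten” is just our name for the result:

```r
jaccsim$names <- rownames(jaccsim)                           # party names as a proper column
jaccsim_molten <- reshape2::melt(jaccsim, id.vars = "names") # thin, long format: names, variable, value
```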

Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).

The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:
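Sketched out, with the party order simply taken from the data:

```r
parties <- rownames(jaccsim)

jaccsim_molten$names    <- factor(jaccsim_molten$names, levels = parties)
jaccsim_molten$variable <- factor(jaccsim_molten$variable, levels = rev(parties))
```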

The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.

Visualization: Tile plot

There’s lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.

First, we’re going to build the basic plot structure:
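A sketch of that basic structure (the blue used for the gradient is just one possible choice):

```r
library(ggplot2)

sim <- ggplot(jaccsim_molten, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +             # one white-bordered tile per pair
  scale_fill_gradient(low = "white", high = "steelblue")       # gradient of blues for the similarity
```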

Remember to load the ggplot2 package before you start plotting. We’re going to specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: A set of tiles. They’re going to be filled according to the Jaccard similarities stored in the column value (aes(fill=value)). Their basic color is defined to be white, but we’ll create a gradient of blues with scale_fill_gradient(). Try different color schemes if you like.

With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:

[Plot: the basic tile plot of the Jaccard similarities]

Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tiles comparing each party to itself. The lighter the color, the lower the similarity between the parties.

But this plot doesn’t look as pretty as we’d like it to yet. The labels are too small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case and the legend doesn’t look as nice as it could.

Thankfully, ggplot2 lets us edit all of that. Add these settings to your plot with the + operator and see what they do:
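Roughly like this – the exact text size and rotation angle are a matter of taste:

```r
sim <- sim +
  theme_light(base_size = 16) +                 # clean theme, bigger base text
  labs(x = "", y = "") +                        # no axis labels
  scale_x_discrete(expand = c(0, 0)) +          # no space between axes and tiles
  scale_y_discrete(expand = c(0, 0)) +
  guides(fill = guide_legend(title = NULL)) +   # drop the legend title
  theme(axis.ticks = element_blank(),           # no axis ticks
        axis.text.x = element_text(angle = 45, hjust = 1))  # rotate the x axis text
```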

theme_light() is a standard theme with a clean look to it that fits our needs for this plot. The base_size argument lets us modify the text size of every text element in our plot. The default is 12px, but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.

The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:

[Plot: the finished, restyled tile plot]

That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title or the rotation angle of the x axis text.

Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there are lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on Slack if you want to keep up with all of the hot Journocode gossip. Have fun!

Part 1 | Code

R: Tidy Data

Unfortunately, data comes in all shapes and sizes – especially when it comes from public authorities. You’ll have to be able to deal with PDFs, fused table cells and frequent changes in terms and spelling.

When I analyzed the Swiss arms export data as an intern at SRF Data, we had to work with scanned copies of data sheets that weren’t machine-readable, datasets with French, German or mixed French and German country names in the same column, as well as fused cells and changing spelling of the categories.

Unsurprisingly, preparing and cleaning messy datasets is often the most time-consuming part of data analysis. Hadley Wickham, creator of R packages like ggplot and reshape, wrote a very interesting paper about an important part of this data cleaning: the data tidying.
According to him, tidy data has a specific structure:

Each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets.

As you may have seen in our post on ggplot2, Wickham calls this tidy format molten data. The idea behind this is to facilitate the analysis procedure by minimizing the effort in preparing the data for the different tools over and over again. His suggestion: Working on tidy, molten data with a set of tidy tools, allowing you to use the saved time to focus on the results.


Excerpt from Wickham’s “Tidy Data”

Practicing data tidying

But how do we tidy messy data? How do we get from raw to molten data and what are tidy tools? Let’s practice this on a messy dataset.

On our GitHub page, we deposited an Excel file containing some data on marriages in Germany per state and for different years. Download it and open it with Excel to have a first look at it. As you’ll see, it’s a workbook with seven sheets. We have data for 1990, 2003 and for every year from 2010 through 2014. Although this is quite a small dataset which we could tidy manually in Excel, we’ll use it to practice skills that will come in handy when it comes to bigger datasets.

Now check whether this marriage data needs to be tidied:

  • Are there any changing terms?
  • Is the spelling correct?
  • Is every column that contains numbers correctly saved as a numeric column?
  • Are there needless titles, fused cells, empty cells or other annoying noise?

Spoiler alert: The sheets for 2010–2014 are okay, but the first two – not so much. We have different spelling and terms here, as well as three useless columns and one row, plus all the numbers saved as text in the first sheet. As mentioned, the mess in this example is limited and we could tidy it up manually with a few clicks. But let’s keep in mind that we’re here to learn how to handle those problems with larger datasets as well.

Within Excel, we will:

  • Delete spare rows and columns (we could do that in R too when it comes to REALLY messy data)
  • Save columns containing numbers as numeric type

Now we’ll change to R.

First of all, we need to install and require all the packages we’re going to use. We’ll do that with an if-statement telling R only to install the package if it hasn’t been installed yet. You could of course do this in the straightforward way without the conditional statement if you remember whether you already installed the package, but this is a quick way to make sure you don’t install something twice needlessly.
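For example, for the readxl package we’ll need in a second (repeat the pattern for the other packages you use):

```r
if (!require("readxl")) {
  install.packages("readxl")   # only runs if the package isn't installed yet
  library(readxl)
}
```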

To read in the sheets of an Excel workbook, read_excel() from the readxl package is a useful function. Because we don’t want to load the sheets separately, we’re going to use a loop for this. If you’re interested in learning more about loop functions, stay tuned for our upcoming tutorial on this topic.
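A sketch of that loop – the file name is a placeholder for whatever you called the downloaded workbook:

```r
messy_data <- list()

for (i in 1:7) {
  messy_data[[i]] <- read_excel("marriages.xlsx", sheet = i)   # one sheet per list element
  messy_data[[i]]$timestamp <- i                               # remember which sheet it came from
}
```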

messy_data is now a list of seven local data frames with messy_data[[1]] containing the data for 1990, messy_data[[2]] for 2003 and so on. Also, we added a “timestamp” column to each list element which contains the index of the list element.

Saving the sheets as list elements is a time saver, but we want all the data in one data frame:
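For example like this (the name “marriages” is ours):

```r
marriages <- do.call(rbind, messy_data)   # stack the seven sheets on top of each other
```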

If you get an error telling you the frames have different lengths, you probably forgot to delete the spare columns in the 1990 sheet. Sometimes there even seems to be something invisible left in empty Excel columns. I usually delete three or so of the empty columns and empty rows next to my data to be sure there isn’t something left I can’t see.

Next part: Restructuring the data

With the function gather() from Wickham’s tidyr package, we’ll melt the raw data frame into a molten format. And let’s change the timestamps created by the read-in loop to the actual years (we could do that with a loop, too, but this is good enough for now).
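A sketch of both steps. The exact gather() call depends on how the sheets are laid out, so treat the column names here as assumptions:

```r
library(tidyr)

# melt everything except the timestamp into key/value pairs
molten <- gather(marriages, key = "state", value = "count", -timestamp)

# turn the sheet index (1 to 7) into the actual year
years <- c(1990, 2003, 2010, 2011, 2012, 2013, 2014)
molten$year <- years[molten$timestamp]
```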

Oo-De-Lally! This is tidy data in a tidy format! Now we can check if we have to correct the state names (because with bigger datasets, you can’t quickly check and correct spelling and term issues within Excel):
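For example (again assuming the molten state column is called “state”):

```r
length(unique(molten$state))   # returns 19 here
```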

So we got 19 different German Bundesländer. But Google tells us that there are only 16 states in Germany! Let’s have a closer look at the names to check whether we’ll find duplicates:

Yes, there are! For example, Baden-Württemberg and BaWü refer to the same state, as do Hessen, Hesse and Hesssen. You can just correct this manually. For really big datasets, you could also work with regular expressions and string replacement to find the duplicates, but for now, this should be enough:
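A sketch of the manual correction for the duplicates named above:

```r
molten$state[molten$state == "BaWü"] <- "Baden-Württemberg"
molten$state[molten$state %in% c("Hesse", "Hesssen")] <- "Hessen"

unique(molten$state)   # re-check until only the 16 real states are left
```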

Now that your data is tidy, the actual analysis can start. A very useful package to prepare molten data is dplyr. Its functions ease filtering the data or grouping it. Not only is this great for taking a closer look at certain subsets of your data, but, because Wickham’s graphics package ggplot2 was created to fit the tidy data principle, we can quickly shape the data to be visually analyzed, too.

Here we have some examples for you showing how tidy data and tidy tools can work hand in hand. If you want to learn something about the graphics package ggplot2 first, visit our post for beginners on this!

Visual analysis with ggplot2: this may look complicated at first, but once you have coded the first ggplot you only have to change and/or add a few things to create several more and totally different plots.

[Plot: visual analysis of the marriage data with ggplot2]

Maybe you’ve got some other questions this data could answer for you? Feel free to continue this analysis or try to tidy your own data set!

If you have any questions, problems or feedback, simply leave a comment, email us or join our slack team to talk to us any time!

 


R: plotting with the ggplot2 package

While crunching numbers, a visual analysis of your data may help you get an overview of your data or compare filtered information at a glance. Aside from the built-in graphics package, R has many additional packages to help you with that.
We want to focus on ggplot2 by Hadley Wickham, which is a very nice and quite popular graphics package.

ggplot2 is based on a kind of statistical philosophy from a book I really recommend reading. In The Grammar of Graphics, author Leland Wilkinson goes deep into the structure of quantitative plotting. As a result, he establishes a rulebook for building charts the right way. Hadley Wickham built ggplot2 to follow these aesthetics and principles.

Project: Visualizing WhatsApp chat logs – Part 1: Cleaning the data

Part 2 | Code

A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.

  1. Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
  2. Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.

You can find the entire code for the project on our GitHub page. In this part, we’ll walk you through the process of cleaning a dataset step by step. This is what the final product of part two will look like (of course, yours could be something else entirely – there’s heaps of great material in those logs):

 

[Embedded interactive visualization of the chat logs]

Getting the data

First things first: We’ll need some data to work with. WhatsApp has a built-in function for exporting chat logs via e-mail. Just choose any chat you want to analyze. Group chats are especially interesting for this particular visualization, since we’ll take a look at the number and timing of messages. In case you already know how to get the logs, you can just skip this step.

How to get the chat logs

Depending on what kind of phone you have, this might work a little differently. But this is how it works on Android:

While in a chat, tap the three dots in the upper right corner and select “More”, the last option.


Then, select “E-mail chat”, the second option. It will let you choose an address to send to and voilà, there’s your text file.


Alternatively, you can also go via the WhatsApp main page. Tap the three dots and select “Settings” > “Chat history” > “Send chat history”. Then, just select the chat you want to export.

Our to-do-list

The .txt file you’ll get is, well, not as difficult to handle as it could be, but it has a few quirks. The basic structure is pretty easy. Every row follows this basic pattern:

<time stamp> – <name>: <text>

Looks alright, doesn’t it? But there’s a few issues we’ll run into, especially if we don’t want to analyze just the message count but the content as well. Some of them are easy to correct, like the dash between the time stamp and the name, some are more complicated:

  • The .txt file isn’t formatted like a .csv or a proper table. That is, not every row has the same number of elements.
  • Some rows don’t have a time stamp, but immediately start with text if a message has multiple paragraphs.
  • Some names have one, some names have two words, depending on whether they’re saved with or without surname.
  • The time stamp isn’t formatted in a way R can evaluate and spans multiple columns.
  • Names have a colon at the end. That doesn’t look nice in graphics.

Converting and importing the file

Before we can start cleaning in R, we have to tackle the first issue on our to-do-list. If you try to read the text file into R right away, you’ll get an error:

We’ll have to convert it into a proper table structure. There’s multiple ways to do that. We used Excel to convert the file to .csv. If you already have your own favourite way to convert the file, you can do it your way.

Converting text to .csv with Excel

First, obviously, open Excel. Open the .txt file. Remember to switch from “Excel files” to “All file types” in the drop-down menu so your text file is visible. You should be led to the text import wizard.

In the first step, set the file type to “Delimited”.


Then, separate by spaces (remember to un-check “Tab”).


In the last step, you can just leave the data format at “General” and click “Finish”.


The resulting dataset should look somewhat like this:

[Screenshot: the chat data split into columns in Excel]

Just save this as a .csv file and you should be good to go.

Now that we’ve got a proper .csv file, we can start cleaning it in R. First, read in the file and save it as a variable. For more info on data import in R, check out our previous tutorial.
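A sketch of the import – the file name, separator and the other arguments depend on your export, so check them before you read it in:

```r
chat <- read.csv("whatsapp_chat.csv", sep = ";", header = FALSE,
                 stringsAsFactors = FALSE, fill = TRUE)
```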

Check that you specify the right separator. It’s probably a comma or a semicolon. Just open your file in a text editor and find out.

Regular expressions

To clean up the file, we’ll need to work with regular expressions. They’re used for finding and manipulating patterns in text and have their own syntax. Regex syntax is a little hard to wrap your head around, but there’s lots of reference sheets and expression testers like regexr online that help you translate. In R, you’ll use the grep() function family for text matching and replacement. Let’s try it out. As mentioned on our to-do-list, some rows don’t start with a timestamp. Visually, they’re easy to spot, because they don’t have a number at the beginning.

In regex, the pattern “character string without a digit at the beginning” is translated as “^\D”. “^” matches the beginning of a string, “\D” means “anything except a digit”.

The call to grep(“^\\D”, chat[,1]) tells R to look in the first column of our chat for rows that fit the regular expression “^\D”. The second backslash is an escape character that is only necessary in R, because the backslash serves other purposes there as well.

We’re not going to get into the details of regular expressions here, that’s a post for another time. Feel free to look them up on your own, though. If you want to analyze text files in your projects, chances are you’ll encounter them anyway.

Shifting stampless rows

We’re going to shift the rows without time stamp a bit to the right, so we can copy down the time stamp and name of the sender. First, we’re going to make room at the end of the data frame, in case the stampless rows also happen to be very long messages:
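For example by binding a few empty columns to the right-hand side:

```r
chat <- cbind(chat, matrix(NA, nrow = nrow(chat), ncol = 5))   # make room at the end
```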

Then, we’ll just move the first five elements that block the space for the time stamp to the end of the line, leaving the beginning of the line blank for the time being. We’ll use a for loop that goes through every row without a time stamp and moves the first five elements to the end of the line.
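A sketch of that loop, assuming the empty columns we just added sit at the very end of the data frame:

```r
stampless <- grep("^\\D", chat[, 1])   # rows that start with text instead of a time stamp
n <- ncol(chat)

for (i in stampless) {
  chat[i, (n - 4):n] <- chat[i, 1:5]   # park the first five cells at the end of the line
  chat[i, 1:5] <- NA                   # leave room for the time stamp and name
}
```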

We’ll write a tutorial on loops and conditional statements soon, but in the meantime, check out this short explanation if you want to know more about loops.

We could copy down the time stamp and name right now, but since there’s still a few issues with the name columns, we’ll sort out those first. Before we do that, though, we’ll just quickly delete any entirely empty rows that might have snuck in. We’ll use the apply() function for that. It’s basically like a loop, just much faster and easier to handle in most cases. The R package swirl contains in-console tutorials and has a great one for the apply() function family as well.
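A sketch using apply() over the rows:

```r
# drop rows where every single cell is empty or NA
empty <- apply(chat, 1, function(row) all(is.na(row) | row == ""))
chat  <- chat[!empty, ]
```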

Cleaning the surname column

Now, some contacts might be saved by first name, some by first and last name, right? So the column containing the surnames also sometimes contains a bit of text. The difference is, the text bit probably won’t end with a colon, the surnames definitely will. We can use regular expressions to filter the surname column accordingly.
Also, some messages aren’t actually chat content, but rather activity notifications like adding new members to a group. They’ll say something like “Marie Timcke added you”. Good thing is: Those messages don’t contain colons either, so we can use the same regular expression for the surnames and the notifications.

The regex we’ll use is “.+:$”. It matches any pattern with one or more characters (“.” for any character, “+” for “one or more”) followed by a literal colon (“:”) and then the end of the line (“$”).

The first part reduces the chat data frame to all rows that either have a colon at the end of column 5 (“grepl(“.+:$”, chat[,5])”) or (“|”) at the end of column 4 (“grepl(“.+:$”, chat[,4])”). Of course, the stampless rows we just created are left in as well (“is.na(chat[,1])”). This effectively removes the notifications.
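The first chunk could look roughly like this (the second chunk, shifting the leftover text parts, works just like the loop above and is left out here):

```r
# keep rows with a colon at the end of column 5 or 4, plus the stampless rows
chat <- chat[grepl(".+:$", chat[, 5]) | grepl(".+:$", chat[, 4]) | is.na(chat[, 1]), ]
```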

In the second chunk, we move the text parts in the surname column to the end of the line, the same way we shifted the rows without time stamp. By now, our file looks something like this (check yours with View(chat)):

[Screenshot: the chat data frame after shifting the stampless rows]

The bigger part of our work is done. We’ll just format the time stamp in such a way that R can evaluate it and make a few cosmetic adjustments.

Converting the time stamp to date format

For R to convert the first two columns into a format it can work with, we’ll have to help it a bit. First, we’ll copy down time stamps and name from the previous row to all the rows we shifted before.
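Sketched as a simple loop (it assumes the chat doesn’t start with a stampless line):

```r
# copy time stamp and name down from the previous message
for (i in which(is.na(chat[, 1]))) {
  chat[i, 1:5] <- chat[i - 1, 1:5]
}
```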

Then, we’ll clean the first few columns a bit, deleting the column with the dash, merging the time stamp so it’s all in one place and naming the first few columns.

Now, we can easily convert the first column into a date format. R has a few different classes for date and time objects. We’ll use the strptime() function, which produces an object of class “POSIXlt”. If you want to know more about dates and times in R, again, the swirl lesson on that topic is great.

We need to tell strptime in which format the date is stored. In our case, it’s
“<day>.<month>.<full year>, <hours>:<minutes>”. In strptime() language, this is written as “%d.%m.%Y, %H:%M”.
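So the conversion could look like this – “timestamp” stands for whatever you named the merged column in the previous step:

```r
chat$time <- strptime(chat$timestamp, format = "%d.%m.%Y, %H:%M")
```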

One last cosmetic edit: The names still have colons at the end. This issue is easily solved with — you guessed it — regular expressions! We can use the gsub() function to search and replace patterns. We’ll use it on the “name” and “surname” column by replacing every colon at the end of a line with nothing, like this:
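For example:

```r
chat$name    <- gsub(":$", "", chat$name)      # strip the trailing colon
chat$surname <- gsub(":$", "", chat$surname)
```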

Congratulations, you’ve cleaned up the entire dataset! It should now look like this — no empty lines, no colons or text in the name columns and a wonderfully formatted time stamp.

[Screenshot: the cleaned chat data frame]

Saving the data

Now, the only thing left to do is to save our beautiful, sparkly clean dataset to a new file. If you want to work with the file in a program other than R, you can use, for example, the write.table() or write.csv() function to export your data frame. Since we want to continue working in R for our visualizations, we’ll go with save() for now. It will create an .Rdata file that can be read back into R easily with the load() function.
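A sketch of that last step (the file name is up to you):

```r
save(chat, file = "whatsapp_clean.Rdata")   # load("whatsapp_clean.Rdata") brings it back later
```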

There you go, all done! Give yourself a big pat on the back, because cleaning data is hard.

If you want to continue right away, check out part two of our WhatsApp project where we visualize the data we just cleaned. If you need help with the cleaning script or have suggestions on how to improve it, write us an e-mail or join our Slack team. Our help and discussion channels are open for everyone!

 

Part 2 | Code

 


Your first interactive choropleth map with R

When it comes to data journalism, visualizing your data isn’t what it’s all about. Getting and cleaning your data, analyzing and verifying your findings is way more important.

Still, an interactive eye-catcher holding interesting information will definitely not hurt your data story. Plus, you can use graphics for a visual analysis, too.

Here, we’ll show you how to build a choropleth map, where your data is visualized as colored polygon areas like countries and states.
We will code a multilayer map on Dortmund’s students as an example. You’ll be able to switch between layered data from different years. The popups hold additional information on Dortmund’s districts.

Now for the data

First of all, you need to read a kml-file into R. KML stands for Keyhole Markup Language and, as I just learned from the comment section of this tutorial, it is an XML data format used to display geospatial information in a web browser. With a bit of googling, you’ll find kml-files holding geospatial information on your own city, state or country. For this example, we’ll use this data on Dortmund’s districts. Right-click the link and save the kml-file to a new directory named “journocode” (or anything you want, really, but we’ll work with this for now).

Start RStudio. If you haven’t installed it yet, have a look at our first R Tutorial post. After starting RStudio, open a new R script and save it to the right directory. For example, if your “journocode” directory was placed on your desktop (and your username was MarieLou), type
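something like this (a Mac-style path – adjust it to wherever your directory actually lives):

```r
setwd("/Users/MarieLou/Desktop/journocode")
```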

Remember to use a normal slash (/) in your file path instead of a backslash. Now, we can read the shape file directly into R. If you don’t use our example data, try opening your kml-file with a text editor first to look for the layer name! As you can see on this screenshot, for “Statistische Bezirke.kml” we have a layer named “Statistische_Bezirke”, defined in row four, and utf-8 encoding (see row 1), since we have the German umlauts “ä”, “ö” and “ü” in our file.

[Screenshot: “Statistische Bezirke.kml” opened in a text editor]

Let’s load the data into R. We’ll do this with a function from the rgdal-package.
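The import could look like this – the layer name and encoding are the ones we just looked up, the object name “dortmund” is ours:

```r
library(rgdal)

dortmund <- readOGR("Statistische Bezirke.kml", layer = "Statistische_Bezirke",
                    encoding = "utf-8")
```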

If you get an Error that says “Cannot open data source”, chances are there’s something wrong with your file name. Check that your working directory is properly set and that the file name is correct. Some browsers will change the .kml file type to .txt, or even just add the .txt ending so you get “filename.kml.txt”. You’ll usually find the “layer” argument in your text file, named something like “name” or “id”, as shown above.

Did it work? Try to plot the polygons with the generic plot() function:
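With the object from the sketch above, that’s just:

```r
plot(dortmund)   # black outlines of the districts
```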

You should now see the black outlines of your polygons. Neat, isn’t it?

Next, we’ll need a little data to add to our map. To show you how to build a multilayer map, we will use two different csv files: student1 & student2

The data contains information on the percentage of 18- to 25-year-olds living in Dortmund in 2000 and 2014. Download the files and save them to your journocode directory. Make sure they’re still named student1 and student2.

This can be tricky sometimes: For our data, the encoding is “latin1” and the separation marks are commas. Open the csv files with a text editor to check if your separator is a comma, a semicolon or even a slash.
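A sketch of the import under those assumptions (the file names follow the download step above):

```r
student1 <- read.csv("student1.csv", sep = ",", encoding = "latin1")
student2 <- read.csv("student2.csv", sep = ",", encoding = "latin1")
```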

If everything worked out for you, celebrate a little! You’re a big step closer to your multilayer map!

 

Now for the interactive part

After looking through your data and analyzing it, you will now have some important information on how many values you have, which are the smallest and the biggest. For our example, we did that for you:

The highest value is 26%, so we can now think of a color scale from 0 to 26 to fill in our map. There are different statistical ways to decide what classes we want to divide our data into. For this mapping exercise, we will simply take eight classes: 0-5, 5-8, 8-10, 10-12, 12-14, 14-18, 18-24 and 24-26.

For every class, we want our map to fill the polygons in a different color. We’ll use a color vector generated with ColorBrewer here. Just copy the colour codes you want, put them in a vector and replace it in the code. To assign the colors to the classes, use the function colorBin(). This is where you’ll need the leaflet package, which we will use to build our map. Install it if you haven’t already.
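A sketch with the class breaks from above; “Blues” is just one ColorBrewer palette, paste in your own colour vector if you prefer:

```r
library(leaflet)

pal <- colorBin("Blues", domain = c(0, 26),
                bins = c(0, 5, 8, 10, 12, 14, 18, 24, 26))   # our eight classes
```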

Next up is the little infowindow we want to pop up when we click on the map. As you can see, I used some HTML code to specify some parameters for the first popup. For the second popup, I used a simpler way.
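Roughly like this – the column names of the kml attributes and the csv files are assumptions, so adapt them to your data:

```r
# first popup with a bit of HTML, second one kept simple
popup1 <- paste0("<b>", dortmund$Name, "</b><br>",
                 "18- to 25-year-olds in 2000: ", student1$percent, " %")
popup2 <- paste0(dortmund$Name, ": ", student2$percent, " %")
```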

paste0() does the same thing as paste() but with no default separator. Check ?paste0 for more info. If something doesn’t work, check the punctuation!

 

Now for the map

After that, we’ll start right away with puzzling together all the parts we need:
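A sketch of the start; “CartoDB.Positron” is just one light grey provider tile, and “mymap” is the name we’ll keep building on:

```r
mymap <- leaflet(dortmund) %>%
  addProviderTiles("CartoDB.Positron")   # the base map and background
```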

The %>% pipe operator comes from the magrittr package and is re-exported by leaflet. Similar to the “+” in ggplot, it’s used to chain functions together. So remember: if you have a “%>%” operator at the end of a line, R will expect more input from you.

The call to the function leaflet() starts the mapping process. The provider tile is your map base and background. If you don’t want to use the grey tile in the example, have a look at this page and choose your own. Don’t worry if no map appears yet. With leaflet, you won’t see the actual map right away. First we’ll add the polygon layers and the popups we’ve defined to our map:
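Continuing the sketch, with one layer per year and the group names we’ll need for the layer control:

```r
mymap <- mymap %>%
  addPolygons(fillColor = pal(student1$percent), fillOpacity = 0.8, color = "white",
              weight = 1, popup = popup1, group = "2000") %>%
  addPolygons(fillColor = pal(student2$percent), fillOpacity = 0.8, color = "white",
              weight = 1, popup = popup2, group = "2014")
```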

In our map, we want to be able to switch layers by clicking on a layer control panel with the group names. We’ll code that now:
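For example:

```r
mymap <- mymap %>%
  addLayersControl(baseGroups = c("2000", "2014"),
                   options = layersControlOptions(collapsed = FALSE))
```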

Next, we want to add a thin color legend that shows the minimum and maximum value and the palette colors.
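Sketched out (the legend title is up to you):

```r
mymap <- mymap %>%
  addLegend("bottomright", pal = pal, values = c(0, 26),
            title = "18- to 25-year-olds (%)", opacity = 1)
```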

The big moment: did it work? No mistake with the brackets or the punctuation? You’ll find out by typing:
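(mymap being the object we built above):

```r
mymap   # renders the interactive map in the viewer
```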

Congratulations! You made your first multilayer choropleth with R! Now have fun building multilayer maps of your own city/country or even the whole world! If you want to publish your map, make sure you have the “htmlwidgets” package installed and add the following code to your script:
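With the object name from above, the export could look like this:

```r
library(htmlwidgets)

saveWidget(mymap, file = "mymap.html", selfcontained = FALSE)
```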

This will create a directory named “mymap_files” and a “mymap.html” file. Save these two files in the same directory and upload both to your server. Et voilà: Your map is online!

If you publish a map based on our tutorial, feel free to link to our webpage and tell your colleagues! We’d be delighted!

 


R crash course: Basic data structures

 

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.” – John M. Chambers

Data structures in R are quite different from those in most programming languages. Understanding them is a necessity, because they define the way you’ll work with your data. Problems in understanding data structures will probably also produce problems in your code.
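To make that concrete, here are a few of the structures we’ll talk about – the values are made up:

```r
v  <- c(1.5, 2, 3.7)                  # atomic vector: all elements share one type
l  <- list(numbers = v, name = "R")   # list: elements can have different types
m  <- matrix(1:6, nrow = 2)           # matrix: a vector with dimensions
df <- data.frame(city = c("Dortmund", "Berlin"),
                 values = c(26, 18))  # data frame: a list of equal-length vectors

str(df)   # str() reveals the structure of any object
```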