Scraping for everyone


by Sophie Rotgeri, Moritz Zajonz and Elena Erdmann

One of the most important skills for data journalists is scraping. It allows us to download any data that is openly available online as part of a website, even when it’s not meant to be downloaded: be it information about members of parliament or – as in our Christmas-themed example – a list of Christmas markets in Germany.

There are some tools that don’t require any programming and are sufficient for many standard scraping tasks. A programmed scraper, however, can be fitted precisely to the task at hand, so more complicated scraping jobs may require programming.

Here, we explain three ways of scraping: the no-programming-needed online tool import.io, and writing a scraper in either R or Python – the two most common programming languages in data journalism. If you read the different tutorials, you will see that the steps to programming a scraper are not that different after all. Take your pick:

No programming needed · Python · R


Scraping with Import.io

There are two steps to scraping a website. With import.io, you first have to select the information important to you. Then the tool will extract the data for you so you can download it.

You start by telling import.io which URL it should look at, in this case the address “http://weihnachtsmarkt-deutschland.de”. When you start a new “Extractor”, as import.io calls it, you will see a graphical interface. It has two tabs: “Edit” displays the website. Import.io analyses the website’s structure and automatically tries to find and highlight structured information for extraction. If it didn’t select the correct information, you can easily change the selection by clicking on the elements of the website you are interested in.

In the other tab, “Data”, the tool shows the selected data as a spreadsheet, which is what the downloaded data will look like. If you are satisfied, click “Extract data from website”.

Now, the actual scraping begins. This might take a few seconds or minutes, depending on the amount of data you told it to scrape. And there we go: all the Christmas markets are listed in a file format of your choosing.


Scraping with Python

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: Downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML-file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In Python, the requests library does that job for you. Install it via pip and import it into your program. To download the webpage, you only have to call requests.get() and hand over the URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. requests.get() returns a response object, and you can display the downloaded website by calling response.text.
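A minimal sketch of this step, assuming requests has been installed with pip install requests:

```python
import requests

url = "http://weihnachtsmarkt-deutschland.de"
response = requests.get(url)

# The raw HTML source of the page
print(response.text)
```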

Now, the entire website source, with all the Christmas markets, is saved on your computer. You basically just need to clean the file: remove all the HTML tags and the text you’re not interested in, and find the Christmas markets. A computer scientist would call this process parsing the file, so you’re about to create a parser.

Before we start on the parser, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding element of the website is highlighted. Click yourself through the website until you find the piece of code that contains exactly the information you want to extract: in this case the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can turn to your Python script. bs4 (Beautiful Soup) is a Python library that can be used to parse HTML files and extract HTML elements from them. If you haven’t used it before, install it using pip. From the package, import BeautifulSoup (because the name is so long, we rename it to bs in the import). Remember, the code of the website is in response.text. Load it by calling BeautifulSoup. As bs4 can handle different file types, you need to specify the parser; for HTML, the ‘lxml’ parser is well suited.
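A sketch of loading the page into the parser, assuming bs4 and lxml are installed (pip install beautifulsoup4 lxml) and that response comes from the download step above:

```python
from bs4 import BeautifulSoup as bs

# Parse the downloaded HTML with the lxml parser
soup = bs(response.text, "lxml")
```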

Now the entire website is loaded into BeautifulSoup. The cool thing is, BeautifulSoup understands the HTML format, so you can easily search for all kinds of HTML elements in the code. The BeautifulSoup method find_all() takes an HTML tag and returns all instances of it. You can further specify the id or class of an element – but this is not necessary for the Christmas markets. As you have seen before, each Christmas market is listed as one row in the table, marked by a <tr> tag. Thus, all you have to do is find all <tr> elements in the website.
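Using the soup object from above, that boils down to something like:

```python
# Each <tr> element corresponds to one Christmas market
rows = soup.find_all("tr")

for row in rows:
    print(row)
```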

And there we go: all the Christmas markets are already listed in your Python output!

However, the data is still quite dirty. If you have a second look at it, you find that each row consists of two data cells, marked by <td>. The first one is the name of the Christmas market and links to a separate website with more information. The second data cell is the location of the Christmas market. You can now choose which part of the data you are interested in. BeautifulSoup lets you extract the text immediately by calling .text on the item at hand. For each row, calling .text on the first data cell gives you the name of the Christmas market.
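A sketch of that cleanup, assuming the two-cell row layout described above (name first, location second):

```python
markets = []

for row in rows:
    cells = row.find_all("td")
    if len(cells) == 2:
        name = cells[0].text.strip()      # first cell: name of the market
        location = cells[1].text.strip()  # second cell: its location
        markets.append((name, location))

print(markets)
```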

Done! Now you have the list of christmas markets ready to work with.


Scraping with R

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: Downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML-file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In R, the “rvest” package packs all the required functions. Install and load it (or use “needs()”) and call read_html() to... well, read the HTML from a specified URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. read_html() will return an XML document. To display its structure, call html_structure().
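A minimal sketch of this step – note that html_structure() lives in the xml2 package, which rvest builds on:

```r
# install.packages(c("rvest", "xml2"))
library(rvest)
library(xml2)

url <- "http://weihnachtsmarkt-deutschland.de"
doc <- read_html(url)

# Show the nested structure of the downloaded document
html_structure(doc)
```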

Now that the XML document is downloaded, we can “walk” down the nodes of its tree structure until we can single out the table. Before we do that, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding element of the website is highlighted. Click yourself through the website until you find the piece of code that contains exactly the information you want to extract: in this case the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can go back to R.

First, we will create a nodeset from the document “doc” with “html_children()”. This finds all children of the specified document or node. In this case we call it on the main XML document, so it will find the two main children: “<head>” and “<body>”. From inspecting the source code of the Christmas market website, we know that the <table> is a child of <body>. We can specify that we only want <body> and all of its children with an index.
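A sketch of that step, assuming the document was read into an object called doc as above:

```r
# All children of the document: <head> and <body>
children <- html_children(doc)
children

# Keep only the second child, <body>, with everything inside it
body <- children[2]
```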

Now we have to narrow it down further: we really only want the table and nothing else. To achieve that, we can navigate to the corresponding <table> tag with “html_node()”. This will return a nodeset of one node, the “<table>”. Now, if we just use the handy “html_table()” function – sounds like it was made just for us! – we can extract all the information inside this HTML table directly into a dataframe.
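Put together, those last two steps could look like this:

```r
# Navigate to the <table> node and turn it into a dataframe
table_node <- html_node(body, "table")
markets <- html_table(table_node)

head(markets)
```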

Done! Now you have a dataframe ready to work with.

From the data to the story: A typical ddj workflow in R


R is getting more and more popular among data journalists worldwide, as Timo Grossenbacher from SRF Data pointed out recently in a talk at the useR!2017 conference in Brussels. Working as a data trainee in Berliner Morgenpost’s Interactive Team, I can confirm that R indeed played an important role in many of our recently published projects, for example when we identified the strongholds of German parties. While we also use the software for more complex statistics from time to time, something that R helps us with on a near-daily basis is cleaning, joining and quickly exploring data. Sometimes it’s just to briefly check whether there is a story hiding in the data. But sometimes, the steps you will learn in this tutorial are just the first part of a bigger, deeper data analysis.

Your first interactive choropleth map with R


When it comes to data journalism, visualizing your data isn’t what it’s all about. Getting and cleaning your data, analyzing and verifying your findings are way more important.

Still, an interactive eye-catcher holding interesting information will definitely not hurt your data story. Plus, you can use graphics for a visual analysis, too.

Here, we’ll show you how to build a choropleth map, where your data is visualized as colored polygon areas like countries and states.
We will code a multilayer map of Dortmund’s students as an example. You’ll be able to switch between data layers from different years, and popups hold additional information on Dortmund’s districts.

Now for the data

First of all you need to read a KML file into R. KML stands for Keyhole Markup Language and, as I just learned from the comment section of this tutorial, it is an XML-based data format used to display geospatial information in a web browser. With a bit of googling, you’ll find KML files holding geospatial information on your own city, state or country. For this example, we’ll use this data on Dortmund’s districts. Right-click the link, download the KML file and save it to a new directory named “journocode” (or anything you want, really, but we’ll work with this for now).

Start RStudio. If you haven’t installed it yet, have a look at our first R tutorial post. After starting RStudio, open a new R script and save it to the right directory. For example, if your “journocode” directory was placed on your desktop (and your username was MarieLou), type
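```r
# A sketch with an example Windows-style path – swap in your own user name and directory
setwd("C:/Users/MarieLou/Desktop/journocode")
```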

Remember to use a normal slash (/) in your file path instead of a backslash. Now, we can read the shape file directly into R. If you don’t use our example data, try opening your KML file with a text editor first to look for the layer name! As you can see in the screenshot below, for “Statistische Bezirke.kml” we have a layer named “Statistische_Bezirke”, defined in row four, and UTF-8 encoding (see row one), since we have the German umlauts “ä”, “ö” and “ü” in our file.

[Screenshot: “Statistische Bezirke.kml” opened in a text editor, showing the layer name and the UTF-8 encoding]

Let’s load the data into R. We’ll do this with a function from the rgdal package.
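A sketch of that call, using the file and layer names from above; the object name dortmund is just our pick:

```r
# install.packages("rgdal")
library(rgdal)

dortmund <- readOGR("Statistische Bezirke.kml",
                    layer = "Statistische_Bezirke",
                    encoding = "UTF-8")
```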

If you get an error that says “Cannot open data source”, chances are there’s something wrong with your file name. Check that your working directory is properly set and that the file name is correct. Some browsers will change the .kml file type to .txt, or even just add the .txt ending so you get “filename.kml.txt”. You’ll usually find the “layer” argument in your text file, named something like “name” or “id”, as shown above.

Did it work? Try to plot the polygons with the generic plot() function:
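```r
# Draw the outlines of the district polygons
plot(dortmund)
```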

You should now see the black outlines of your polygons. Neat, isn’t it?

Next, we’ll need a little data to add to our map. To show you how to build a multilayer map, we will use two different CSV files: student1 & student2

The data contains information on the percentage of 18- to 25-year-olds living in Dortmund in 2000 and 2014. Download the files and save them to your journocode directory. Make sure they’re still named student1 and student2.

This can be tricky sometimes: for our data, the encoding is “latin1” and the separator is a comma. Open the CSV files with a text editor to check whether your separator is a comma, a semicolon or even a slash.
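With that in mind, reading the files could look roughly like this, assuming they were saved as student1.csv and student2.csv in your working directory:

```r
# Adjust sep and fileEncoding if your files differ
student1 <- read.csv("student1.csv", sep = ",", fileEncoding = "latin1")
student2 <- read.csv("student2.csv", sep = ",", fileEncoding = "latin1")
```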

If everything worked out for you, celebrate a little! You’re a big step closer to your multilayer map!

 

Now for the interactive part

After looking through your data and analyzing it, you will know how many values you have and which are the smallest and the biggest. For our example, we did that for you.
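If you want to check for yourself, a quick summary() of both data frames shows the smallest and the biggest values:

```r
summary(student1)
summary(student2)
```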

The highest value is 26%, so we can now think of a color scale from 0 to 26 to fill in our map. There are different statistical ways to decide what classes we want to divide our data into. For this mapping exercise, we will simply take eight classes: 0-5, 5-8, 8-10, 10-12, 12-14, 14-18, 18-24 and 24-26.

For every class, we want our map to fill the polygons in a different color. We’ll use a color vector generated with ColorBrewer here. Just copy the colour codes you want, put them in a vector and swap them into the code. To map the colors to the classes, use the function colorBin(). This is where you’ll need the leaflet package, which we will use to build our map. Install it if you haven’t already.
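A sketch of that step, using the class breaks from above and – purely as an example – the 8-class “OrRd” ColorBrewer palette:

```r
# install.packages("leaflet")
library(leaflet)

bins <- c(0, 5, 8, 10, 12, 14, 18, 24, 26)
colors <- c("#fff7ec", "#fee8c8", "#fdd49e", "#fdbb84",
            "#fc8d59", "#ef6548", "#d7301f", "#990000")

# Assigns one of the eight colors to every value between 0 and 26
palette <- colorBin(palette = colors, domain = c(0, 26), bins = bins)
```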

Next up is the little info window we want to pop up when we click on the map. As you can see, I used some HTML code to specify some parameters for the first popup. For the second popup, I used a simpler way.
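Built with paste0(), the popups could look roughly like this – the column names (dortmund$Name for the district names, student1$percent and student2$percent for the shares) are hypothetical, so check your own data with head() first:

```r
# First popup: a bit of HTML for a bold district name and a line break
popup1 <- paste0("<b>", dortmund$Name, "</b><br>",
                 "18- to 25-year-olds in 2000: ", student1$percent, " %")

# Second popup: the simpler version
popup2 <- paste0(dortmund$Name, ": ", student2$percent, " %")
```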

paste0() does the same thing as paste() but with no default separator. Check ?paste0 for more info. If something doesn’t work, check the punctuation!

 

Now for the map

After that, we’ll start right away with puzzling together all the parts we need:

The %>% (“pipe”) operator comes from the magrittr package and is made available by leaflet. Similar to the “+” in ggplot, it’s used to chain functions together. So remember: if you have a “%>%” operator at the end of a line, R will expect more input from you.

The call to the function leaflet() starts the mapping process. The provider tile is your map base and background. If you don’t want to use the grey tile from the example, have a look at this page and choose your own. Don’t worry if no map appears yet: with leaflet, you won’t see the actual map right away. First we’ll add the polygon layers and the popups we’ve defined to our map:
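```r
# A sketch that assumes the polygons in dortmund are in the same order as the rows
# of student1/student2; "CartoDB.Positron" is just one possible grey base map
mymap <- leaflet(dortmund) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(fillColor = palette(student1$percent), fillOpacity = 0.8,
              color = "white", weight = 1, popup = popup1, group = "2000") %>%
  addPolygons(fillColor = palette(student2$percent), fillOpacity = 0.8,
              color = "white", weight = 1, popup = popup2, group = "2014")
```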

In our map, we want to be able to switch layers by clicking on a layer control panel with the group names. We’ll code that now:
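```r
# A control panel with the two group names defined above
mymap <- mymap %>%
  addLayersControl(baseGroups = c("2000", "2014"),
                   options = layersControlOptions(collapsed = FALSE))
```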

Next, we want to add a thin color legend that shows the minimum and maximum values and the palette colors:
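```r
# A thin legend in the bottom right, reusing the palette and value range from above;
# the title text is just an example – change it as you like
mymap <- mymap %>%
  addLegend(position = "bottomright", pal = palette, values = c(0, 26),
            title = "18- to 25-year-olds (%)", opacity = 0.8)
```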

The big moment: did it work? No mistake with the brackets or the punctuation? You’ll find out by typing:
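```r
# Print the map object to render it in the RStudio viewer
mymap
```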

Congratulations! You made your first multilayer choropleth with R! Now have fun building multilayer maps of your own city/country or even the whole world! If you want to publish your map, make sure you have the “htmlwidgets” package installed and add the following code to your script:
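```r
library(htmlwidgets)

# selfcontained = FALSE writes mymap.html plus a supporting "mymap_files" directory
saveWidget(mymap, file = "mymap.html", selfcontained = FALSE)
```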

This will create a directory named “mymap_files” and a “mymap.html” file. Upload both to your server, keeping them in the same directory. Et voilà: your map is online!

If you publish a map based on our tutorial, feel free to link to our webpage and tell your fellows! We’d be delighted!

 

{Credits for the awesome featured image go to Phil Ninh}