Scraping for everyone

One of the most important skills for data journalists is scraping. It allows us to download any data that is openly available online as part of a website, even when it’s not supposed to be downloaded: be it information about members of parliament or – as in our Christmas-themed example – a list of Christmas markets in Germany.


There are some scraping tools that don’t require any programming and are sufficient for many standard scraping tasks. A programmed scraper, however, can be fitted precisely to the task at hand, so more complicated scraping jobs may still require programming.

Here, we explain three ways of scraping: the no-programming-needed online tool import.io, and writing a scraper in both R and Python – the two most common programming languages in data journalism. If you read the different tutorials, you will see that the steps to programming a scraper are not that different after all.

Scraping with Import.io

There are two steps to scraping a website. With import.io, you first have to select the information important to you. Then the tool will extract the data for you so you can download it.

You start by telling import.io which URL it should look at, in this case the address http://weihnachtsmarkt-deutschland.de. When you start a new Extractor, as import.io calls it, you will see a graphical interface with two tabs. Edit displays the website: import.io analyses the website’s structure and automatically tries to find and highlight structured information for extraction. If it didn’t select the correct information, you can easily change the selection by clicking on the elements of the website you are interested in.

Website with the list of Christmas markets

In the other tab, Data, the tool shows the selected data as a spreadsheet, which is what the downloaded data will look like. If you are satisfied, click Extract data from website.

Website opened in import.io

Now, the actual scraping begins. This might take a few seconds or minutes, depending on the amount of data you told it to scrape. And there we go: all the Christmas markets are listed in a file type of your choosing.

Scraping with Python

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In Python, the requests library does that job for you. Install it via pip and import it into your program. To download the webpage, you only have to call requests.get() and pass it the URL, in this case the address http://weihnachtsmarkt-deutschland.de. The call returns a Response object, and you can display the downloaded website by printing response.text.

import requests

# Download the page; the server's answer is stored in a Response object
response = requests.get("http://weihnachtsmarkt-deutschland.de")
# response.text holds the raw HTML of the page
print(response.text)
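
Before moving on, it can be worth a quick sanity check that the download actually worked – an optional sketch using two attributes of the Response object:

# 200 means the request succeeded; other codes point to a problem
print(response.status_code)
# The character encoding requests guessed for the page
print(response.encoding)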

Now, the entire website source, with all the Christmas markets, is saved on your computer. You basically just need to clean the file: remove all the HTML tags and the text that you’re not interested in, and find the Christmas markets. A computer scientist would call this process parsing the file, so you’re about to create a parser.

Before we start on the parser, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you see the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding part of the website is highlighted. Click through the website until you find the piece of code that contains exactly the information you want to extract: in this case the list of Christmas markets.

Website structure as seen via developer mode

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can turn to your Python script. bs4 (Beautiful Soup) is a Python library that can be used to parse HTML files and extract HTML elements from them. If you haven’t used it before, install it using pip. From the package, import BeautifulSoup (because the name is so long, we rename it to bs in the import). Remember, the code of the website is in response.text. Load it by calling BeautifulSoup. As bs4 can be used for different file types, you need to specify the parser. For your HTML needs, the ‘lxml’ parser is best suited.

from bs4 import BeautifulSoup as bs

# Parse the downloaded HTML with the lxml parser
soup = bs(response.text, 'lxml')
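
If you want to make sure the parse worked, one quick optional check is to print the page’s title tag:

# The <title> element of the page, as found by BeautifulSoup
print(soup.title)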

Now the entire website is loaded into BeautifulSoup. The cool thing is, BeautifulSoup understands the HTML format, so you can easily search for all kinds of HTML elements in the code. The BeautifulSoup method find_all() takes an HTML tag name and returns all instances of it. You can further narrow the search down by the id or class of an element – this is not necessary for the Christmas markets, but a small sketch below shows how it works. As you have seen before, each Christmas market is listed as one row in the table, marked by a <tr> tag. Thus, all you have to do is find all elements with the <tr> tag in the website.

# Collect every table row – one per Christmas market
rows = soup.find_all('tr')
print(rows)
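
As mentioned above, find_all() can also filter by id or class. You don’t need that here, but here is a minimal, self-contained sketch – the HTML snippet and the class name location are made up purely for illustration:

sample_html = """
<table>
  <tr><td class="name">Hypothetical market</td><td class="location">Sample town</td></tr>
</table>
"""
sample_soup = bs(sample_html, 'lxml')
# Note the underscore: class_ avoids a clash with Python's class keyword
locations = sample_soup.find_all('td', class_='location')
print([td.text for td in locations])   # ['Sample town']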

And there we go: all the Christmas markets are listed in your Python output!

However, the data is still quite dirty. If you have a second look at it, you will find that each row consists of two data cells, marked by <td>. The first one holds the name of the Christmas market and links to a separate page with more information. The second data cell holds the location of the Christmas market. You can now choose which part of the data you are interested in. BeautifulSoup lets you extract all the text immediately by calling .text on the item at hand. For each row, this will give you the text of both cells, i.e. the name and the location of the Christmas market.

# Print the text content of every row
for row in rows:
    print(row.text)
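
If you want the name, the location and the link separately, you can take each row apart – a sketch that assumes the two-cell structure described above:

markets = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) < 2:               # skip header rows or anything unexpected
        continue
    name = cells[0].text.strip()
    location = cells[1].text.strip()
    link = cells[0].find('a')        # the first cell links to a page with more details
    url = link.get('href') if link else None
    markets.append((name, location, url))
print(markets[:5])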

Done! Now you have the list of Christmas markets ready to work with.
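
If you would rather have a file you can open in a spreadsheet, one possible final step is to write the markets list from the sketch above to a CSV file – the filename and the column names are just examples:

import csv

with open('christmas_markets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'location', 'url'])   # example header row
    writer.writerows(markets)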

Scraping with R

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In R, the rvest package packs all the required functions. Install it and load it with library() (or use needs()), then call read_html() to... well, read the HTML from a specified URL, in this case the address http://weihnachtsmarkt-deutschland.de. read_html() will return an XML document. To display its structure, call html_structure().

needs(
     dplyr,
     rvest
)

# Read the page into an XML document and inspect its structure
doc <- read_html("http://weihnachtsmarkt-deutschland.de/")
html_structure(doc)

Now that the XML document is downloaded, we can walk down the nodes of its tree structure until we have singled out the table. Before we do that, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you see the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding part of the website is highlighted. Click through the website until you find the piece of code that contains exactly the information you want to extract: in this case the list of Christmas markets.

Website structure as seen via developer mode

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can go back to R.

First, we will create a nodeset from the document doc with html_children(). This finds all children of the specified document or node. In this case we call it on the main XML document, so it will find its two main children: <head> and <body>. From inspecting the source code of the Christmas market website, we know that the <table> is a child of <body>. We can specify that we only want <body> and all of its children with an index.

doc %>%
     html_children() -> nodes

# The second child is the <body> of the page
body <- nodes[2]

Now we have to narrow it down further: we really only want the table and nothing else. To achieve that, we can navigate to the corresponding <table> tag with html_node(). This will return a nodeset of one node, the <table>. Now, if we just use the handy html_table() function – sounds like it was made just for us! – we can extract all the information inside this HTML table directly into a dataframe.

body %>%
     html_node("table") -> table_node

# Convert the HTML table into a dataframe
table_node[[1]] %>%
     html_table() -> df

Done! Now you have a dataframe ready to work with.