Scraping for everyone

by Sophie Rotgeri, Moritz Zajonz and Elena Erdmann

One of the most important skills for data journalists is scraping. It allows us to download any data that is openly available online as part of a website, even when it’s not meant to be downloaded: be it information about the members of parliament or – as in our Christmas-themed example – a list of Christmas markets in Germany.

Some tools don’t require any programming and are sufficient for many standard scraping tasks. A programmed scraper, however, can be fitted precisely to the task at hand, so more complicated scraping jobs may require writing code.

Here, we explain three ways of scraping: the no-programming-needed online tool import.io, and writing a scraper yourself in R or Python – the two most common programming languages in data journalism. However, if you read the different tutorials, you will see that the steps to programming a scraper are not that different after all. Take your pick:

No programming needed (import.io)
Python
R


Scraping with Import.io

There are two steps to scraping a website. With import.io, you first have to select the information important to you. Then the tool will extract the data for you so you can download it.

You start by telling import.io which URL it should look at, in this case the address “http://weihnachtsmarkt-deutschland.de”. When you start a new “Extractor”, as import.io calls it, you will see a graphical interface with two tabs: “Edit” displays the website. Import.io analyses the website’s structure and automatically tries to find and highlight structured information for extraction. If it didn’t select the correct information, you can easily change the selection by clicking on the elements of the website you are interested in.

In the other tab, “Data”, the tool shows the selected data as a spreadsheet, which is what the downloaded data will look like. If you are satisfied, click “Extract data from website”.

Now, the actual scraping begins. This might take a few seconds or minutes, depending on the amount of data you told it to scrape. And there we go: all the Christmas markets are listed in a file format of your choosing.


Scraping with Python

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In Python, the requests library does that job for you. Install it via pip and import it into your program. To download the webpage, you only have to call requests.get() and hand over the URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. requests.get() returns a response object, and you can display the downloaded website by calling response.text.
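
Put together, the download step could look like this (a minimal sketch; the variable names are ours):

    import requests

    # Download the page that lists the Christmas markets
    url = "http://weihnachtsmarkt-deutschland.de"
    response = requests.get(url)

    # The raw HTML of the page as one long string
    print(response.text)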

Now, the entire website source, with all the Christmas markets, is saved on your computer. You basically just need to clean the file: remove all the HTML tags and the text that you’re not interested in, and find the Christmas markets. A computer scientist would call this process parsing the file, so you’re about to create a parser.

Before we start on the parser, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable the developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding part of the website is highlighted. Click yourself through the website until you find the piece of code that encodes exactly the information you want to extract: in this case, the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.
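
Stripped of everything else, the part of the page you care about resembles this (simplified – the real page contains more attributes and surrounding markup):

    <table>
      <tr>
        <td><a href="...">Name of the Christmas market</a></td>
        <td>Location</td>
      </tr>
      <!-- one <tr> per Christmas market -->
    </table>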

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can turn to your Python script. bs4 (Beautiful Soup) is a Python library that can be used to parse HTML files and extract HTML elements from them. If you haven’t used it before, install it using pip. From the package, import BeautifulSoup (because the name is so long, we renamed it to bs in the import). Remember, the code of the website is in response.text. Load it by calling BeautifulSoup. As bs4 can be used for different file types, you need to specify the parser; for HTML, the ‘lxml’ parser is well suited.
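
In code, the parsing step might look like this (a sketch; the ‘lxml’ parser requires the lxml package, which can also be installed via pip):

    from bs4 import BeautifulSoup as bs

    # Parse the HTML string we downloaded with requests
    soup = bs(response.text, "lxml")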

Now the entire website is loaded into BeautifulSoup. The cool thing is, BeautifulSoup understands the HTML format. Thus, you can easily search for all kinds of HTML elements in the code. The BeautifulSoup method ‘find_all()’ takes an HTML tag and returns all instances of it. You can further specify the id or class of an element – but this is not necessary for the Christmas markets. As you have seen before, each Christmas market is listed as one row in the table and marked by a <tr> tag. Thus, all you have to do is find all elements with the <tr> tag in the website.
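
For example (a sketch, continuing from the soup object above):

    # Each Christmas market is one table row
    rows = soup.find_all("tr")

    print(len(rows))  # how many rows were found
    print(rows[0])    # the first row, still with its HTML tags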

And there we go, all the Christmas markets are already listed in your Python output!

However, the data is still quite dirty. If you have a second look at it, you find that each row consists of two data cells, marked by <td>. The first one is the name of the Christmas market and links to a separate website with more information. The second data cell is the location of the Christmas market. You can now choose which part of the data you are interested in. BeautifulSoup lets you extract the text of any element by calling .text on it – called on the first data cell of a row, for example, it gives you the name of the Christmas market.
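
Putting it together, here is a sketch that collects the name and location from every row (assuming the two-cell structure described above; rows without two data cells are skipped):

    markets = []
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip rows that are not market entries
        name = cells[0].text.strip()      # first cell: name of the market
        location = cells[1].text.strip()  # second cell: its location
        markets.append((name, location))

    print(markets[:5])  # a first look at the result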

Done! Now you have the list of Christmas markets ready to work with.


Scraping with R

There are two steps to scraping a website. First, you download the content from the web. Then, you have to extract the information that is important to you. Good news: downloading a website is easy as pie. The main difficulty in scraping is the second step, getting from the raw HTML file to the content that you really want. But no need to worry, we’ll guide you with our step-by-step tutorial.

Let’s get to the easy part first: downloading the HTML page from the Internet. In R, the “rvest” package packs all the required functions. Install() and library() it (or use “needs()”) and call “read_html()” to – well – read the HTML from a specified URL, in this case the address “http://weihnachtsmarkt-deutschland.de”. read_html() will return an XML document. To display its structure, call html_structure().
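
A minimal sketch of this step (html_structure() comes from the xml2 package, which rvest builds on):

    # install.packages("rvest")  # if you haven't installed it yet
    library(rvest)
    library(xml2)

    url <- "http://weihnachtsmarkt-deutschland.de"
    doc <- read_html(url)

    # Show the nested structure of the downloaded document
    html_structure(doc)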

Now that the XML document is downloaded, we can “walk” down the nodes of its tree structure until we can single out the table. Before we do that, it is helpful to inspect the website you’re interested in. A basic understanding of front-end technologies comes in very handy, but we’ll guide you through it even if you have never seen an HTML file before. If you have some spare time, check out our tutorial on HTML, CSS and JavaScript.

Every browser has the option to inspect the source code of a website. Depending on your browser, you might have to enable the developer options, or you can inspect the code straight away by right-clicking on the website. Below, you can see a screenshot of the developer tools in Chrome. On the left, you find the website – in this example the Christmas markets. On the right, you see the HTML source code. The best part is the highlighting: if you click on a line of the source code, the corresponding part of the website is highlighted. Click yourself through the website until you find the piece of code that encodes exactly the information you want to extract: in this case, the list of Christmas markets.

All the Christmas markets are listed in one table. In HTML, a table starts with an opening <table> tag and ends with </table>. The rows of the table are marked with <tr> (table row) tags. For each Christmas market, there is exactly one row in the table.

Now that you have made yourself familiar with the structure of the website and know what you are looking for, you can go back to R.

First, we will create a nodeset from the document “doc” with “html_children()”. This finds all children of the specified document or node. In this case we call it on the main XML document, so it will find the two main children: “<head>” and “<body>”. From inspecting the source code of the Christmas market website, we know that the <table> is a child of <body>. We can specify that we only want <body> and all of its children with an index.
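
In code, assuming “doc” is the object returned by read_html() above, this might look like:

    children <- html_children(doc)
    children             # a nodeset with two nodes: <head> and <body>

    body <- children[2]  # keep only <body> and everything inside it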

Now we have to narrow it down further: we really only want the table and nothing else. To achieve that, we can navigate to the corresponding <table> tag with “html_node()”. This will return a nodeset of one node, the “<table>”. Now, if we just use the handy “html_table()” function – sounds like it was made just for us! – we can extract all the information that is inside this HTML table directly into a dataframe.
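
A sketch of these two calls (the names table_node and markets are our choice):

    table_node <- html_node(body, "table")   # the one <table> on the page
    markets <- html_table(table_node)        # parse it into a dataframe

    head(markets)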

Done! Now you have a dataframe ready to work with.

Extracting geodata from OpenStreetMap with Osmfilter

A guest post by Hans Hack

When working on map-related projects, I often need specific geographical data from OpenStreetMap (OSM) for a certain area. For a recent project of mine, I needed all the roads in Germany in a useful format so I could work with them in a GIS program. So how do I get the data to work with? With a useful little program called Osmfilter.

There are various sites that provide OSM datasets for certain areas. However, these datasets include ALL the geodata OSM provides. That means houses, streets, rivers etc. – basically everything you see on a normal map. In addition, the file is in a rather inconvenient format for direct usage. A quite complicated way to proceed with the dataset would be to load it into a database and query what you need. This can be time-consuming and – depending on the power of your PC – impossible.

If you only need to get info about small areas, I recommend using the site overpass turbo. For larger datasets, there is Osmfilter, which easily lets you filter OSM geodata. With a little help from two other free tools, you will have a dataset you can work with in no time.

 

What you’ll need for this tutorial

For this tutorial you will need the command line tools Osmfilter and osmconvert, ogr2ogr (which is part of GDAL) and a shell to run them in. A GIS program such as QGIS is handy for looking at the results.

Get an OSM dataset

To start out, we need to download an OSM dataset, which is saved in a format called pbf (a format to compress large datasets). For this tutorial, I will use a dataset provided by Geofabrik, but there are other sources, too. Let’s download the pbf file for Liechtenstein and save it in a folder of your choice. Once you have downloaded the data, open the shell and go to the folder where your new dataset is stored using the cd command:
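
The path below is only a placeholder – use the folder you actually saved the file in:

    cd path/to/the/folder/with/your/osm/data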

Prepare your data

Osmfilter only supports the file formats osm and o5m. For fast data processing, using o5m is recommended. You can convert your pbf file to o5m with osmconvert in your shell like this:
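
Assuming the file name of the Geofabrik download, the command would look roughly like this:

    osmconvert liechtenstein-latest.osm.pbf -o=liechtenstein.o5m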

This translates to: use the program osmconvert and convert the file called liechtenstein-latest.osm.pbf to an o5m file called liechtenstein.o5m. The -o stands for output.

You will now have the same dataset in the o5m format in the same folder.

Filter your data

Now, you can filter your geodata using the shell that should still be open. The osmfilter command logic is built up as follows:
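
The general pattern looks roughly like this (the parts in capitals are placeholders, not literal text):

    osmfilter INPUT_FILE --FILTER_COMMANDS -o=OUTPUT_FILE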

Let’s look at the part about the filter commands. Here is where you can tell the program which parts of the dataset you need by writing --keep=DATA_I_WANT. Here is an example which creates a file for you called buildings.osm that contains all the buildings (and only the buildings) from the Liechtenstein dataset:
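
One way to write it (here --keep="building=" is meant to match every object that carries a building tag, regardless of its value):

    osmfilter liechtenstein.o5m --keep="building=" -o=buildings.osm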

To find out how features are stored and classified in OSM, you can go to this site and look up the feature you want. Tip: You can head over to overpass turbo to test your query on a small area of your choice by using the ‘wizard’.

Of course you can do much more with Osmfilter. You can specify which building type you want. For example, you might only want to look at schools:
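
For instance (the output file name is our choice):

    osmfilter liechtenstein.o5m --keep="building=school" -o=schools.osm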

You can query multiple features by chaining them. For example, if you want all the schools and universities:
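
Chained values for the same key are separated by spaces, roughly like this (again, the output file name is our choice):

    osmfilter liechtenstein.o5m --keep="building=school =university" -o=schools_universities.osm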

You can exclude things by adding the flag --drop. For example, if you don’t want to have buildings that are warehouses but keep everything else:
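
A sketch of such a command:

    osmfilter liechtenstein.o5m --keep="building=" --drop="building=warehouse" -o=buildings_without_warehouses.osm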

You can reduce the final file size by dropping extra data on the authors and the version by adding:
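
The flags in question should be --drop-author and --drop-version, appended to the rest of the command:

    --drop-author --drop-version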

You can, of course, combine these flags. Here is a query that gives you all the highway types that cars can use in Liechtenstein, without the ones where motor vehicles are not allowed:
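
A possible version of that query – the exact list of car-usable highway types is a judgment call, and the output name streets_liechtenstein.osm is reused in the conversion step below:

    osmfilter liechtenstein.o5m --keep="highway=motorway =trunk =primary =secondary =tertiary =unclassified =residential =service =living_street" --drop="motor_vehicle=no" --drop-author --drop-version -o=streets_liechtenstein.osm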

You can find more on the filtering options, with some examples, on the Osmfilter site.

Convert to a useful format

As a final step, you can convert your osm file to the most widely supported geodata format, the Shapefile. (The GIS program QGIS can handle osm files, but it sometimes doesn’t work well with large datasets.) The conversion is done with the program ogr2ogr, like this:
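
Assuming the file and folder names used in this tutorial, the command would be roughly:

    ogr2ogr -f "ESRI Shapefile" streets_shapefiles streets_liechtenstein.osm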

The above command converts the file streets_liechtenstein.osm to the Shapefile format and stores the result in a folder called streets_shapefiles. In the newly created folder you will find 4 different shapefiles (one for every geometry type). In the case of the streets, we are only interested in the file lines.shp. You can open this file in a GIS program of your choice.

Ogr2ogr also allows you to convert your newly created Shapefiles to other geodata formats that you may need, such as GeoJSON, CSV and many more. Have a look at the ogr2ogr website for more info. If you’re tired of using the shell, try the online tool Mapshaper, which allows you to convert your Shapefile to formats such as GeoJSON, SVG and CSV. The file size for Mapshaper is limited, but I have tried it with files bigger than 1 GB.

 

Have fun filtering OSM and happy mapping!