A web scraping toolkit for journalists

Web scraping is one of the most useful and least understood methods for journalists to gather data. It’s the thing that helps you when, in your online research, you come across information that qualifies as data, but does not have a handy “Download” button. Here’s your guide on how to get started — without any coding necessary.

1/28/2019

Say, for example, I am looking for coffee on Amazon. When I hit search, I get a list of results that's made to be easily readable for humans. But it doesn't help me much if I want to analyze the underlying data – how much the average coffee package costs or which brands dominate the Amazon coffee market, for example. For that purpose, a handy table might be more practical.

[Image: an Amazon list of coffee and a table of data from that list]

One option, then, might be to copy the information from each result by hand. Let's say that takes me 5 seconds per search result. With 200,000 results, that's a million seconds of copying and pasting, or roughly 35 eight-hour days: more than a month of full-time work from 9 to 5 at constant speed, without a break.

My main motivation for learning how to code has always been laziness: I hate doing the same boring thing twice, let alone 200,000 times. So, I learned how to scrape data.

Scrapers, in practice, are little programs that extract and refine information from web pages.

They can come in the form of point-and-click tools or scripts that you write in a programming language. Big advantages to using scrapers include:

  • Speed: scrapers can process a lot of data in a short time
  • Automation: scrapers can save manual work
  • Repetition: scrapers can be reused in regular intervals as websites get updated

If you need to scrape many differently structured sites, though, you'll quickly notice their biggest drawback: scrapers are pretty fragile. They have to be configured for the exact structure of one website. If the structure changes, your scraper might break and not produce the output you expect.

This is also what differentiates scrapers from APIs. If you haven't heard of those before: Application Programming Interfaces are portals that website creators use to grant developers direct access to the structured database where they store their information. They're much more stable because they're designed for data extraction, but the website creator gets to decide the rules by which you get access.

[Image: from a structured database to an unstructured website to a table via a scraper, or directly from the database to a table via an API]

They might limit the scope of data you have access to, or the extraction speed. Scrapers, in contrast, can extract anything you can see on a web page, and even some things you can't (the ones that are in the website's source code — we'll get to that in a second). Also, far from every website has an API, but you can scrape information from virtually any site.
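
If you've never seen what using an API looks like in practice: it usually boils down to requesting a documented URL and getting structured data (often JSON) back. Here's a rough sketch in R, using the httr and jsonlite packages and a completely made-up endpoint; real APIs document their own URLs, parameters and access rules.

    library(httr)     # sends HTTP requests
    library(jsonlite) # parses JSON responses

    # "api.example.com" is a placeholder: a real API documents its own
    # endpoints, parameters and access rules (API keys, rate limits, ...)
    response <- GET(
      "https://api.example.com/v1/products",
      query = list(q = "coffee", page = 1)
    )

    # Turn the JSON body of the response into an R object
    products <- fromJSON(content(response, as = "text", encoding = "UTF-8"))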

Scrapers therefore occupy an important place in the scope of data sources available to data journalists. So let's get to it: how to scrape data yourself.

A bit of a dampener first: scraping is one of the more advanced ways to gather data. Still, there are some tools you can — and should — start using immediately.

Level 1: Capture tables from websites

This is the first step in your scraping career: there are extensions for Chrome (Table Capture) and Firefox (Table to Excel) that help you easily copy tables from websites into Excel or a similar program. They're the same program, they just have different names, because why make it easy? (Note: if you used to have an add-on called TableTools in Firefox and wonder where it went: it's the Table to Excel one. Go ahead and reinstall it!)

With some tables, just marking them in your browser and using copy and paste will work, but it will often mess up the structure or the formatting of the table. These extensions spare you some trouble. If you have them installed, you can just go ahead and:

  1. Right-click on any table on a web page (try this list of countries in the Eurovision Song Contest, for example)
  2. Choose the option Table to Excel – Display Inline (or Table Capture – Display Inline if you're in Chrome). A field should appear in the upper left corner of the table that says something like Copy to clipboard and Save to Excel
  3. Click Copy to Clipboard
  4. Open Excel or your program of choice
  5. Paste the data. Voila — a neatly formatted table.

Level 2: Scrape a single website with the Scraper extension

If you're feeling a little more adventurous, try the Scraper extension for Chrome. It can scrape more than just tables — it scrapes anything you can see on a website, with no programming knowledge necessary.

Say, for example, I desperately need a table with links to every xkcd comic in existence, as well as its publishing date and title. With the Scraper extension, you can go to their Archive page,

  1. Right-click one of the links and
  2. Select Scrape similar…

A new window will open and, if you wait a minute, you'll see that the program has already tried to guess which elements of the web page you want information about. It saw that there are many links like the one you clicked on the site, and thought you might want to scrape all of them.

If you know your way around the inner workings of a website a bit, you'll recognize that the XPath field specifies which kinds of elements you want to extract. If you don't: don't worry about it for now, this is generated automatically if you click on the right element to scrape.

For links, it will extract the text and the URL by default. If you want to extract more properties, try this:

  1. Click the little green + next to URL
  2. Type @title in the new field that says XPath, and Date (or anything, this is just the column name) where it says Name
  3. Click Scrape at the bottom
  4. Wait a second

[Image: screenshot of the Chrome Scraper extension in action]

Your window should now look like the one in the image. Congratulations: you've now also extracted the publication date of each comic. This only works in this example because the xkcd website administrators specified a title attribute for each link, hidden in the source code of the website (it makes a tooltip show up when you hover over each link), and wrote the publication date into that attribute.

This will make sense to you if you already know a bit about how websites work. If you don't: may I recommend our Journocode tutorial on the basics of HTML, CSS and JavaScript? Here's the very short version: the structure of a website is determined mainly by a language called HTML. It's the one with the many angle brackets <>. You can see the structure of the xkcd archive page in the image below (see for yourself by right-clicking on the Laptop Issues link on the page and selecting Inspect… to open your browser's developer tools). The title and href attributes are the ones that the Scraper extension extracted from the page.

[Image: example of how to find the structure behind the data shown on a website]
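
If you want to see the same logic outside the browser, here's a small sketch in R with the rvest package, run on a made-up miniature of that markup (the URLs and dates below are invented for illustration):

    library(rvest)

    # A made-up miniature of the archive markup: each link stores its target
    # in the href attribute and its publication date in the title attribute
    snippet <- '<div id="archive">
      <a href="/2129/" title="2019-1-28">Laptop Issues</a>
      <a href="/2128/" title="2019-1-25">Another Comic</a>
    </div>'

    page  <- read_html(snippet)
    links <- html_nodes(page, xpath = "//div[@id='archive']/a")

    data.frame(
      name = html_text(links),          # the visible link text
      url  = html_attr(links, "href"),  # the link target
      date = html_attr(links, "title")  # the hidden tooltip text
    )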

Feel free to try the Scraper extension on any information you want to extract as well as on other elements – it doesn't just work on links. Once you have the information you want and the table preview looks good, just click Copy to clipboard or Export to Google Docs… and do some magic with your newly scraped data!

Level 3: Scrape many web pages with the Web Scraper extension

Often, the data we want is not presented neatly on one page, but spread out over multiple pages. One very common example of this is a search that only displays a few results at a time. Even with our fancy extensions, this would reduce us to clicking through each results page and scraping them one by one. But we don’t have to do that anymore, remember? Thankfully, there's a programming-free solution for this as well. It's the Web Scraper extension for Chrome.

As you've probably already realized with the previous extension, you really need to know how websites work to build more complex scrapers. Web Scraper is still pretty interactive and doesn't require coding, but if you've never opened the developer tools in your browser before, it might get confusing pretty quickly.

The big advantage of Web Scraper is that it lets you scrape not only one page, but its sub-pages as well. So, for example, if you want to scrape not only the titles and links of all xkcd comics, but also extract the direct URL of each image, you could make the extension click on the link to each comic in turn, get the image URL from each sub-page and add it to your dataset.

I won't go into much detail about how this works here – that might be material for its own tutorial – but I do want to show you the basic process.

  1. Web Scraper lives in the Developer Tools, so right-click > Inspect... and select the Web Scraper tab
  2. Create a sitemap
  3. Click into the sitemap and create a selector. This tells the program which elements you want to scrape information from
  4. Click Sitemap [name] > Scrape and wait until it’s done
  5. Click Sitemap [name] > Export data as CSV

Congratulations! You did it.

This produces the same output as we got with the Scraper plugin before. The exciting part starts once you add another layer to the process. It works pretty much the same way:

  1. Click into the sitemap, click into the selector and create a new selector inside the first.
    (You can see the hierarchy of selected elements by clicking on Sitemap [name] > Selector graph.)
  2. Click Sitemap [name] > Scrape and wait until it's done. This might take a while depending on the number of sub-pages you need to loop through. Once it's done:
  3. Click Sitemap [name] > Export data as CSV.
  4. Congratulations! You did it.
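
By the way, if you're curious what this sub-page trick looks like in code: with the rvest library for R (more on programming libraries below), a rough sketch might look like this. The CSS selectors are assumptions about how xkcd's pages are built, so check them in your browser's developer tools before relying on them.

    library(rvest)

    archive <- read_html("https://xkcd.com/archive/")

    # Collect all comic links from the archive page
    # ("#middleContainer a" is an assumption about the page's structure)
    links <- html_nodes(archive, "#middleContainer a")
    urls  <- paste0("https://xkcd.com", html_attr(links, "href"))

    # Visit each comic's sub-page and pull the direct image URL
    # ("#comic img" is likewise an assumption); only the first 5 here
    image_urls <- sapply(urls[1:5], function(u) {
      Sys.sleep(1)               # pause between requests to be polite
      comic <- read_html(u)
      html_attr(html_node(comic, "#comic img"), "src")
    })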

Even more tools

With this, you can tackle many of the scraping challenges that will present themselves to you. Of course, there are many more possible tools out there. I don't know most of them, and many of the fancier ones are pretty expensive, but that should not keep you from knowing about them. So here's an incomplete list of other tools to check out:

Programming libraries for scraping

As you can see, there's a bunch of options out there for those of you who do not code at all. Still, programming your own scrapers gives you a lot more freedom in the process and helps you get past the limitations of the tools I've just introduced. I mostly do my scraping with the programming language R and the function library rvest, because that's what I'm most comfortable with. But Python, as well as Node.js (so I'm told) and probably many other programming languages, offer scraping functionality as well. To see what programming can look like in data journalism, head over to our post about R workflows for journalists.

[Image: screenshot of the Guardian Datablog]
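
To give you a taste of what that looks like: with rvest, the table-capturing trick from Level 1 fits into a handful of lines. The Wikipedia URL below is my assumption about where that Eurovision list lives; any page with an HTML table works the same way.

    library(rvest)

    # Read the page and convert every HTML <table> on it into a data frame
    page   <- read_html("https://en.wikipedia.org/wiki/List_of_countries_in_the_Eurovision_Song_Contest")
    tables <- html_table(html_nodes(page, "table"))

    # Pick the table you need and save it
    write.csv(tables[[1]], "eurovision_countries.csv", row.names = FALSE)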

Some function libraries that help you with scraping, apart from rvest, are Requests, Puppeteer, PhantomJS, Selenium and Scrapy. Try them out and tell us your favourite! I use Selenium for more complex problems, because there are some situations where many other libraries give up.

Technical difficulties

There are some hurdles you might encounter in your work, and I want to go through some common issues here to give you a heads-up:

Data isn't part of the text, but in an image or a PDF

Scrapers can only "see" what's in a website's source code, so you'll have to use different methods here. Download the documents you want to analyse, be they images, PDFs or something else, and use the appropriate software. If it's an image or a non-searchable PDF file, you might need optical character recognition (OCR) software.
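
In R, for example, the pdftools package extracts text from searchable PDFs and the tesseract package does OCR; the file names below are just placeholders.

    library(pdftools)   # text extraction from searchable PDFs
    library(tesseract)  # optical character recognition (OCR)

    # A PDF that already contains a text layer:
    pages <- pdf_text("report.pdf")   # returns one character string per page

    # A scanned image (or a page of a non-searchable PDF saved as an image):
    scanned_text <- ocr("scan.png")   # returns the recognised text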

The website is too unstructured

If the website isn't consistently structured, your scraper will have a hard time figuring out where the data you want to scrape is located. Try using a pattern recognition algorithm. Regular expressions, for example, might be the way to go.
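
For example, if the dates and prices you're after end up buried in one messy string per result, a regular expression can still fish them out. A tiny R sketch with made-up input:

    # Made-up examples of inconsistently structured text
    messy <- c("Order #4711 from 2019-01-28, total: 12.99 EUR",
               "total 7.50 EUR for order #4712, shipped 2019-02-03")

    # Pull out anything that looks like a date (YYYY-MM-DD)
    dates  <- regmatches(messy, regexpr("\\d{4}-\\d{2}-\\d{2}", messy))

    # Pull out anything that looks like a price followed by "EUR"
    prices <- regmatches(messy, regexpr("\\d+\\.\\d{2} EUR", messy))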

Data is being loaded dynamically

You know this from social media websites: Facebook or Twitter don't load their entire database at once; they have infinite scrolling, where a new section only comes into view once you scroll down. In this case, you might need a scraping library that simulates a user and interacts with the website directly – scrolling down, entering user data, everything you would do by hand. Selenium can do this, for example. Note: this is one of the more advanced methods, even for scraping, but it's also one of the coolest. Seeing a browser window open up on its own and enter information because I told a robot to do it is pretty awesome.
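
In R, the RSelenium package can remote-control a real browser. Here's a rough sketch of the scroll-and-wait pattern, assuming you have a working Selenium setup and with a placeholder URL:

    library(RSelenium)

    # Start a Selenium-controlled browser (needs a local driver installation)
    driver <- rsDriver(browser = "firefox", port = 4545L)
    remDr  <- driver$client

    remDr$navigate("https://example.com/search?q=coffee")  # placeholder URL

    # Scroll to the bottom a few times so the page loads more results
    for (i in 1:5) {
      remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
      Sys.sleep(2)   # give the page time to load the next batch
    }

    # Hand the fully loaded page over to your usual scraping code
    html <- remDr$getPageSource()[[1]]

    remDr$close()
    driver$server$stop()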

Target is behind a paywall or login window

Selenium or similar libraries can help you with this, too — as I said, they can do anything you can do on the web, just automatically. Be careful when scraping content behind a login window, though: Some websites forbid automatic data extraction, and once you create an account and accept their terms of service, you might actually break your contract with them by doing it anyway.

Your IP gets blocked

Some pages try to detect unusual traffic. If you scrape a lot of pages really quickly, they might block your IP to protect themselves from attacks or to prevent you from extracting data. There are services that offer IP rotation: They automatically change your IP every X seconds. A VPN might also be helpful. Still, once you encounter that barrier, maybe check with your legal department whether your scraping is still legally above-board.

Captchas

You know them. They're supposed to make sure the people using a site are actual humans. And your scraper might be one of the non-people they're trying to keep out. This is a tricky problem. You might think about entering the Captchas by hand as your scraper encounters them or, if you don't want to do that and you really need what's behind that no robots sign, you might ask a proper nerd for help.

Happy scraping, and don't panic!

Don't worry: Most of the situations you'll encounter won't require heavy-duty scrapers.

Programming your own scrapers gives you even more freedom, but even if you don't code at all, you have plenty of options. And if you know how to extract a table, and maybe even how to use the Scraper extension: awesome! You already know more than 99 percent of the population. And that fact is definitely true; we're data journalists, after all.

We hope this tutorial-slash-toolkit-overview has provided you with a good starting point for your scraping endeavours.

Thanks for reading, and happy scraping!