A web scraping toolkit for journalists
Web scraping is one of the most useful and least understood methods for journalists to gather data. It’s the thing that helps you when, in your online research, you come across information that qualifies as data, but does not have a handy “Download” button. Here’s your guide on how to get started — without any coding necessary.
Say, for example, I am looking for “coffee” on Amazon. When I hit “search”, I get a list of results that’s made to be easily readable for humans. But it doesn’t help me much if I want to analyze the underlying data – how much the average coffee package costs or which brands dominate the Amazon coffee market, for example. For that purpose, a handy table might be more practical.
One option, then, might be to copy the information on each result by hand. Let’s say that takes my unpaid intern — whom I hired because I didn’t want to do the job myself — 5 seconds for each search result. With 200,000 results, that still takes them more than a month, if they work full-time from 9 to 5 at constant speed, without pause.
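If you want to check that estimate yourself, here is a minimal sketch of the arithmetic. The 200,000 results and 5 seconds per result come from the example above; the eight-hour day and the five-day week are my assumptions for "full-time from 9 to 5".

```python
# Back-of-the-envelope estimate for copying search results by hand.
RESULTS = 200_000          # number of search results (from the example above)
SECONDS_PER_RESULT = 5     # time to copy one result (from the example above)
HOURS_PER_WORKDAY = 8      # assumption: a 9-to-5 day with no breaks
WORKDAYS_PER_WEEK = 5      # assumption: a five-day working week

total_seconds = RESULTS * SECONDS_PER_RESULT
total_hours = total_seconds / 3600
workdays = total_hours / HOURS_PER_WORKDAY
work_weeks = workdays / WORKDAYS_PER_WEEK

print(f"{total_hours:.0f} hours, {workdays:.1f} workdays, {work_weeks:.1f} work weeks")
# prints "278 hours, 34.7 workdays, 6.9 work weeks"
```

That is roughly seven working weeks, so "more than a month" is, if anything, an understatement.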
So, even if I had an unpaid intern at hand, this way is just not practical. My main motivation for learning how to code has always been laziness: I hate doing the same boring thing twice, let alone 200,000 times. So, I learned how to scrape data.
Scrapers, in practice, are little programs that extract and refine information from web pages.
They can come in the form of point-and-click tools or scripts that you write in a programming language. Their big advantages are:
- They’re much quicker than manual work,
- they can automate the task of information extraction, and
- they can be recycled whenever you need to scrape the same website again.
If you need to scrape many differently structured sites, though, you’ll quickly notice their biggest drawback: scrapers are pretty fragile. They have to be configured for the exact structure of one website. If the structure changes, your scraper might break and not produce the output you expect anymore.
This is also what differentiates scrapers from APIs. If you haven’t heard of those before: Application Programming Interfaces are portals that website creators use to grant developers direct access to the structured database where they store their information. They’re much more stable because they’re designed for data extraction, but: The website creator gets to decide the rules by which you can get access.
They might limit the scope of data you have access to, or the extraction speed. Scrapers, in contrast, can extract anything you can see on a web page, and even some things you can’t (the ones that are in the website’s source code — we’ll get to that in a second). Also, not nearly every website has an API, but you can scrape info from virtually any site.
Scrapers occupy an important place among the data sources available to data journalists. So let’s get to it: how to scrape data yourself.
A bit of a damper first: scraping is one of the more advanced ways to gather data. Still, there are some tools you can — and should — start using immediately.
😊 Level 1: Capture tables from Websites
This is the first step in your scraping career: there are extensions for Chrome (“Table Capture”) and Firefox (“Table to Excel”) that help you easily copy tables from websites into Excel or a similar program. They’re the same program; they just have different names, because why make it easy. (Note: if you used to have an add-on called “TableTools” in Firefox and wonder where it went: it’s the “Table to Excel” one. Go ahead and reinstall it!)
With some tables, just marking them in your browser and using copy and paste will work, but it will often mess up the structure or the formatting of the table. These extensions spare you some trouble. If you have them installed, you can just go ahead and:
- Right-click on any table on a web page (try this list of countries in the Eurovision Song Contest, for example)
- Choose the option “Table to Excel – Display Inline” (or “Table Capture – Display Inline” if you’re in Chrome). A field should appear in the upper left corner of the table that says something like “Copy to clipboard” and “Save to Excel”.
- Click “Copy to clipboard”
- Open Excel or your program of choice
- Paste the data. Voila — a neatly formatted table.
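For the curious, here is a rough idea of what such a table extension does under the hood: it walks through the rows and cells of the HTML table and writes them out as a spreadsheet-friendly format. This is a minimal sketch using only Python's standard library, not how the extensions are actually implemented. The `TableParser` class, the `table_to_csv` function and the sample table (with made-up placeholder values) are all my own inventions for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell in an HTML document, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # the row currently being built
        self._cell = None   # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer).writerows(parser.rows)
    return buffer.getvalue()


# Made-up sample table with placeholder values, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Debut year</th></tr>
  <tr><td>Exampleland</td><td>1999</td></tr>
  <tr><td>Placeholdia</td><td>2005</td></tr>
</table>
"""

print(table_to_csv(SAMPLE_HTML))
```

The sketch ignores things real tables often have, such as nested tables or cells that span several columns, which is exactly the kind of trouble the extensions spare you.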
😳 Level 2: Scrape a single web page with the “Scraper” extension
If you’re feeling a little more adventurous, try the Scraper extension for Chrome. It can scrape more than just tables — it scrapes anything you can see on a website, with no programming knowledge necessary.
- Right-click one of the links and
- Select “Scrape similar…”.
A new window will open and, if you wait a minute, you’ll see that the program has already tried to guess which elements of the web page you want information about. It saw that there are many links like the one you clicked on the site, and thought you might want to scrape all of them.
If you know your way around the inner workings of a website a bit, you’ll recognize that the “XPath” field specifies which kinds of elements you want to extract. If you don’t: don’t worry about it for now. It is generated automatically if you click on the right element to scrape.
For links, it will extract the text and the URL by default. If you want to extract more properties, try this:
- Click the little green “+” next to “URL”
- Type “@title” in the new field that says “XPath”, and “Date” (or anything, this is just the column name) where it says “Name”
- Click “Scrape” at the bottom
- Wait a second
Feel free to try the Scraper extension on any information you want to extract as well as on other elements – it doesn’t just work on links. Once you have the information you want and the table preview looks good, just click “Copy to clipboard” or “Export to Google Docs…” and do some magic with your newly scraped data!
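If you're wondering what an XPath like "every link, plus its title attribute" amounts to in code, here is a minimal sketch in Python. It uses the standard library's ElementTree, which only understands a small subset of XPath and needs well-formed markup, so treat it as an illustration of the idea rather than a robust scraper. The sample page, the `scrape_links` function and the column names "Text", "URL" and "Date" are assumptions I made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page; real pages are usually messier.
SAMPLE_PAGE = """
<html><body>
  <h1>Archive</h1>
  <a href="/1/" title="2020-1-1">First post</a>
  <a href="/2/" title="2020-1-3">Second post</a>
  <p>Not a link, so the XPath below skips it.</p>
</body></html>
"""


def scrape_links(markup):
    """Return one row per link: its text, its URL and its title attribute."""
    root = ET.fromstring(markup.strip())
    rows = []
    # ".//a" is an XPath expression meaning "every <a> element, at any depth".
    for link in root.findall(".//a"):
        rows.append({
            "Text": link.text,
            "URL": link.get("href"),     # what "@href" would select
            "Date": link.get("title"),   # what "@title" would select
        })
    return rows


for row in scrape_links(SAMPLE_PAGE):
    print(row)
```

Each dictionary corresponds to one row of the table preview, and each key to one column name.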
😨 Level 3: Scrape many web pages with the “Web Scraper” extension
Often, the data we want is not presented neatly on one page, but spread out over multiple pages. One very common example of this is a search that only displays a few results at a time. Even with our fancy extensions, this would reduce us to clicking through each results page and scraping them one by one. But we didn’t want to do that anymore, remember? Thankfully, there’s a programming-free solution for this as well. It’s the Web Scraper extension for Chrome.
As you’ve probably already realized with the previous extension, you really need to know how websites work to build more complex scrapers. Web Scraper is still pretty interactive and doesn’t require coding, but if you’ve never opened the developer tools in your browser before, it might get confusing pretty quickly.
The big advantage that Web Scraper has is that it lets you scrape not only one page, but go into its sub-pages as well. So, for example, if you want to scrape not only the titles and links of all xkcd comics, but also extract the direct URL of each image, you could make the extension click on the link to each comic in turn, get the image URL from each subpage and add it to your dataset.
I won’t go into much detail about how this works here – that might be material for its own tutorial – but I do want to show you the basic process. You can see it in the video below.
- Web Scraper lives in the Developer Tools, so right-click > “Inspect” and select the “Web Scraper” tab.
- Create a sitemap.
- Click into the sitemap and create a selector. This tells the program which elements you want to scrape information from.
- Click “Sitemap [name]” > “Scrape” and wait until it’s done.
- Click “Sitemap [name]” > “Export data as CSV”.
Congratulations! You did it.
- Click into the sitemap, click into the selector and create a new selector inside the first.
(You can see the hierarchy of selected elements by clicking on “Sitemap [name]” > “Selector graph”.)
- Click “Sitemap [name]” > “Scrape” and wait until it’s done. This might take a while depending on the number of sub-pages you need to loop through. Once it’s done:
- Click “Sitemap [name]” > “Export data as CSV”.
- Congratulations! You did it.
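The nested-selector idea (collect the links on a start page, then visit each linked page and pull something out of it) can be sketched in a few lines of Python. This is not how Web Scraper itself is built; it is a minimal illustration using only the standard library. The names `AttributeCollector`, `collect`, `scrape_with_subpages` and the tiny fake site on example.org are all hypothetical. The `fetch` argument is any function that returns the HTML for a URL, which lets the sketch run without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AttributeCollector(HTMLParser):
    """Collect one attribute from every occurrence of one tag."""

    def __init__(self, wanted_tag, wanted_attr):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_attr = wanted_attr
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            value = dict(attrs).get(self.wanted_attr)
            if value is not None:
                self.found.append(value)


def collect(html, tag, attribute):
    """Return the given attribute of every `tag` element in `html`."""
    parser = AttributeCollector(tag, attribute)
    parser.feed(html)
    return parser.found


def scrape_with_subpages(start_url, fetch):
    """Follow every link on the start page and record the first image on each sub-page."""
    rows = []
    for href in collect(fetch(start_url), "a", "href"):        # outer selector: links
        page_url = urljoin(start_url, href)
        images = collect(fetch(page_url), "img", "src")        # inner selector: images
        image_url = urljoin(page_url, images[0]) if images else None
        rows.append({"page": page_url, "image": image_url})
    return rows


# A made-up site held in memory, standing in for real web pages.
FAKE_SITE = {
    "https://example.org/archive/": '<a href="/1/">One</a> <a href="/2/">Two</a>',
    "https://example.org/1/": '<img src="/img/one.png">',
    "https://example.org/2/": '<img src="/img/two.png">',
}

rows = scrape_with_subpages("https://example.org/archive/", FAKE_SITE.__getitem__)
for row in rows:
    print(row)
```

To point this at a real site, you would pass in a `fetch` function that actually downloads pages, and ideally pause between requests so you don't hammer the server.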
Even more tools
With this, you can tackle many of the scraping challenges that will present themselves to you. Of course, there are many more possible tools out there. I don’t know most of them, and many of the more fancy ones are pretty expensive, but that should not keep you from knowing about them. So here’s an incomplete list of other tools to check out:
Programming libraries for scraping
As you can see, there’s a bunch of options out there for those of you who do not code at all. Still, programming your own scrapers gives you a lot more freedom in the process and helps you get past any limitations of the tools I’ve just introduced. I mostly use the function library rvest to do my scraping with the programming language R, because that’s what I’m most comfortable with. But Python, as well as Node (so I’m told) and probably many other programming languages, offer scraping functionality as well. For a head start in scraping with R or Python, head to our tutorial.
Some function libraries that help you with scraping, apart from rvest, are Requests, Puppeteer, PhantomJS, Selenium and Scrapy. Try them out and tell us your favorite! I use Selenium for more complex problems, because there are some situations where even many programming libraries give up.
There are some hurdles you might encounter in your work, and I want to go through some common phenomena here to give you a heads-up.
Happy scraping, and don’t panic!
Don’t worry, though: Most of the situations you’ll encounter won’t require heavy-duty scrapers.
If you know how to extract a table, and maybe even how to use the Scraper extension: awesome! You already know more than 99 percent of the population. And that fact is definitely true: we’re data journalists, after all.
We hope this tutorial-slash-toolkit-overview has provided you with a good starting point for your scraping endeavours.
Thanks for reading, and happy scraping!