A web scraping toolkit for journalists

Web scraping is one of the most useful and least understood methods for journalists to gather data. It’s the thing that helps you when, in your online research, you come across information that qualifies as data, but does not have a handy “Download” button. Here’s your guide on how to get started — without any coding necessary.

Note: This tutorial was first published in our data-driven Advent calendar 2018 behind door number 13. You can check it out here.

Say, for example, I am looking for “coffee” on Amazon. When I hit “search”, I get a list of results that’s made to be easily readable for humans. But it doesn’t help me much if I want to analyze the underlying data – how much the average coffee package costs or which brands dominate the Amazon coffee market, for example. For that purpose, a handy table might be more practical.

One option, then, might be to copy the information on each result by hand. Let’s say that takes my unpaid intern — whom I hired because I didn’t want to do the job myself — 5 seconds for each search result. With 200,000 results, that still takes them more than a month, if they work full-time from 9 to 5 at constant speed, without pause.
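If you'd like to see that arithmetic spelled out, here is a small Python sketch. It assumes an eight-hour working day (9 to 5, no breaks); the variable names are just for illustration.

```python
SECONDS_PER_RESULT = 5
RESULTS = 200_000
WORK_HOURS_PER_DAY = 8  # assumption: 9 to 5 without any pause

total_seconds = SECONDS_PER_RESULT * RESULTS
total_hours = total_seconds / 3600
working_days = total_hours / WORK_HOURS_PER_DAY

print(f"{total_hours:.0f} hours, or about {working_days:.0f} working days")
```

At five working days a week, that comes to roughly seven weeks of nothing but copying and pasting.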

So, even if I had an unpaid intern at hand, this way is just not practical. My main motivation for learning how to code has always been laziness: I hate doing the same boring thing twice, let alone 200,000 times. So, I learned how to scrape data.

Scrapers, in practice, are little programs that extract and refine information from web pages.

They can come in the form of point-and-click tools or scripts that you write in a programming language. Their big advantages are:

  • They’re much quicker than manual work,
  • can automate the task of information extraction and
  • can be recycled whenever you need to scrape the same website again.

If you need to scrape many differently structured sites, though, you’ll quickly notice their biggest drawback: scrapers are pretty fragile. They have to be configured for the exact structure of one website. If the structure changes, your scraper might break and not produce the output you expect anymore.

This is also what differentiates scrapers from APIs. If you haven’t heard of those before: Application Programming Interfaces are portals that website creators use to grant developers direct access to the structured database where they store their information. They’re much more stable because they’re designed for data extraction, but: The website creator gets to decide the rules by which you can get access.

They might limit the scope of data you have access to, or the extraction speed. Scrapers, in contrast, can extract anything you can see on a web page, and even some things you can’t (the ones that are in the website’s source code — we’ll get to that in a second). Also, not nearly every website has an API, but you can scrape info from virtually any site.

Scrapers occupy an important place in the scope of data sources available to data journalists. So let’s get to it: how to scrape data yourself.

A bit of a damper first: scraping is one of the more advanced ways to gather data. Still, there are some tools you can — and should — start using immediately.

😊 Level 1: Capture tables from websites

This is the first step in your scraping career: there are extensions for Chrome (“Table Capture”) and Firefox (“Table to Excel”) that help you easily copy tables from websites into Excel or a similar program. They’re the same program, they just have different names because why make it easy. (Note: if you used to have an add-on called “TableTools” in Firefox and wonder where it went: it’s the “Table to Excel” one. Go ahead and reinstall it!)

With some tables, just marking them in your browser and using copy and paste will work, but it will often mess up the structure or the formatting of the table. These extensions spare you some trouble. If you have them installed, you can just go ahead and:

  1. Right-click on any table on a web page (try this list of countries in the Eurovision Song Contest, for example)
  2. Choose the option “Table to Excel – Display Inline” (or “Table Capture – Display Inline” if you’re in Chrome). A field should appear in the upper left corner of the table that says something like “Copy to clipboard” and “Save to Excel”
  3. Click “Copy to Clipboard”
  4. Open Excel or your program of choice
  5. Paste the data. Voilà: a neatly formatted table.
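For the curious, here is roughly what such a table extension does under the hood, sketched in Python with only the standard library. This is not the extensions' actual code: the `TableParser` class and the sample table are made up for illustration, and a real page would have to be downloaded first.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Debut</th></tr>
  <tr><td>Exampleland</td><td>1999</td></tr>
  <tr><td>Samplestan</td><td>2004</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# Write the rows as CSV text, which Excel and similar programs can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The sketch ignores nested tables and merged cells, which is exactly the kind of mess the extensions spare you.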

😳 Level 2: Scrape a single website with the “Scraper” extension

If you’re feeling a little more adventurous, try the Scraper extension for Chrome. It can scrape more than just tables — it scrapes anything you can see on a website, with no programming knowledge necessary.

Say, for example, I desperately need a table with links to every xkcd comic in existence, as well as its publishing date and title. With the Scraper extension, you can go to the xkcd “Archive” page,

  1. Right-click one of the links and
  2. Select “Scrape similar…”.

A new window will open and, if you wait a minute, you’ll see that the program has already tried to guess which elements of the web page you want information about. It saw that there are many links like the one you clicked on the site, and thought you might want to scrape all of them.

If you know your way around the inner workings of a website a bit, you’ll recognize that the “XPath” field specifies which kinds of elements you want to extract. If you don’t: don’t worry about it for now. It is generated automatically if you click on the right element to scrape.

For links, it will extract the text and the URL by default. If you want to extract more properties, try this:

  1. Click the little green “+” next to “URL”
  2. Type “@title” in the new field that says “XPath”, and “Date” (or anything, this is just the column name) where it says “Name”
  3. Click “Scrape” at the bottom
  4. Wait a second

Your window should now look like the one in the image. Congratulations: you’ve now also extracted the publication date of each comic. This only works with this example, and only because, hidden in the source code of the website, the xkcd website administrators also specified a “title” attribute for each link (it makes a tooltip show up when you hover over the link) and wrote the publication date into that attribute.

This will make sense to you if you already know a bit about how websites work. If you don’t: may I recommend our Journocode tutorial on the basics of HTML, CSS and JavaScript? Here, I’ll give you the very short version: the structure of a website is determined mainly by a language called HTML. It’s the one with the many angle brackets. You can see the structure of the xkcd archive page in the image below (see for yourself by right-clicking on the “Laptop Issues” link on the page and selecting “Inspect…” to open your browser’s developer tools). The title and href attributes are the ones that the Scraper extension extracted from the page.
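If you're wondering what the same extraction looks like in code, here is a minimal Python sketch. The HTML snippet is made up and only shaped like a list of archive links, so the real page's markup may differ. It uses the standard library's ElementTree, which only understands well-formed markup and a small subset of XPath; for real-world HTML you would likely reach for a dedicated scraping library.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: links that carry a date in their title attribute.
SNIPPET = """
<div id="archive">
  <a href="/1/" title="2001-1-1">First comic</a>
  <a href="/2/" title="2001-1-3">Second comic</a>
</div>
"""

root = ET.fromstring(SNIPPET)

# ".//a[@title]" selects every link that has a title attribute.
rows = [
    {"Link": a.text, "URL": a.get("href"), "Date": a.get("title")}
    for a in root.findall(".//a[@title]")
]

for row in rows:
    print(row)
```

Each dictionary corresponds to one row in the Scraper preview: the link text, the URL from `href` and the date from `title`.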

Feel free to try the Scraper extension on any information you want to extract as well as on other elements – it doesn’t just work on links. Once you have the information you want and the table preview looks good, just click “Copy to clipboard” or “Export to Google Docs…” and do some magic with your newly scraped data!

😨 Level 3: Scrape many web pages with the “Web Scraper” extension

Often, the data we want is not presented neatly on one page, but spread out over multiple pages. One very common example of this is a search that only displays a few results at a time. Even with our fancy extensions, this would reduce us to clicking through each results page and scraping them one by one. But we didn’t want to do that anymore, remember? Thankfully, there’s a programming-free solution for this as well. It’s the Web Scraper extension for Chrome.

As you’ve probably already realized with the previous extension, you really need to know how websites work to build more complex scrapers. Web Scraper is still pretty interactive and doesn’t require coding, but if you’ve never opened the developer tools in your browser before, it might get confusing pretty quickly.

The big advantage that Web Scraper has is that it lets you scrape not only one page, but go into its sub-pages as well. So, for example, if you want to scrape not only the titles and links of all xkcd comics, but also extract the direct URL of each image, you could make the extension click on the link to each comic in turn, get the image URL from each subpage and add it to your dataset.

I won’t go into much detail about how this works here – that might be material for its own tutorial – but I do want to show you the basic process. You can see it in the video below.

    1. Web Scraper lives in the Developer Tools, so right-click > “Inspect…” and select the “Web Scraper” tab.
    2. Create a sitemap.
    3. Click into the sitemap and create a selector. This tells the program which elements you want to scrape information from.
    4. Click “Sitemap [name]” > “Scrape” and wait until it’s done.
    5. Click “Sitemap [name]” > “Export data as CSV”.

Congratulations! You did it.

This produces the same output as we got with the Scraper plugin before. The exciting part starts once you add another layer to the process. It works pretty much the same way, as you can see in the second video:

  1. Click into the sitemap, click into the selector and create a new selector inside the first. (You can see the hierarchy of selected elements by clicking on “Sitemap [name]” > “Selector graph”.)
  2. Click “Sitemap [name]” > “Scrape” and wait until it’s done. This might take a while depending on the number of sub-pages you need to loop through. Once it’s done:
  3. Click “Sitemap [name]” > “Export data as CSV”.
  4. Congratulations! You did it.
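The idea of a selector inside a selector can be sketched in a few lines of Python. This is not how Web Scraper is implemented; it only illustrates the two layers. The tiny in-memory `PAGES` "site" and the `fetch` helper are hypothetical stand-ins for real downloads.

```python
import xml.etree.ElementTree as ET

# Made-up two-level site: one overview page linking to two sub-pages.
PAGES = {
    "/archive": '<div><a href="/1/">First</a><a href="/2/">Second</a></div>',
    "/1/": '<div><img src="/img/first.png"/></div>',
    "/2/": '<div><img src="/img/second.png"/></div>',
}


def fetch(path):
    """Hypothetical stand-in for downloading and parsing a page."""
    return ET.fromstring(PAGES[path])


dataset = []
for link in fetch("/archive").findall(".//a"):  # layer 1: links on the overview page
    subpage = fetch(link.get("href"))           # follow each link in turn
    image = subpage.find(".//img")              # layer 2: the image on the sub-page
    dataset.append(
        {"title": link.text, "url": link.get("href"), "image": image.get("src")}
    )

for row in dataset:
    print(row)
```

The outer loop plays the role of the first selector, and the lookup inside it plays the role of the nested one.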

Even more tools

With this, you can tackle many of the scraping challenges that will present themselves to you. Of course, there are many more possible tools out there. I don’t know most of them, and many of the more fancy ones are pretty expensive, but that should not keep you from knowing about them. So here’s an incomplete list of other tools to check out:

Programming libraries for scraping

As you can see, there’s a bunch of options out there for those of you who do not code at all. Still, programming your own scrapers gives you a lot more freedom in the process and helps you get past any limitations of the tools I’ve just introduced. I mostly use the function library rvest to do my scraping with the programming language R, because that’s what I’m most comfortable with. But Python, as well as node (so I’m told) and probably many other programming languages, offer scraping functionalities as well. For a head start in scraping with R or Python, head to our tutorial.

Some function libraries that help you with scraping, apart from rvest, are Requests, Puppeteer, PhantomJS, Selenium and Scrapy. Try them out and tell us your favorite! I use Selenium for more complex problems, because there are some situations where even many programming libraries give up.

Technical difficulties

There are some hurdles you might encounter in your work, and I want to go through some common phenomena here to give you a heads-up on how to get around them.

Data isn’t part of the text, but in an image or a PDF

Scrapers can only “see” what’s in a website’s source code, so you’ll have to use different methods here. Download the documents you want to analyze, be they images, PDFs or something else, and use appropriate software. If it’s an image or a non-searchable PDF file, you might need optical character recognition (OCR) software.

The website is too unstructured

If the website isn’t consistently structured, your scraper will have a hard time figuring out where the data you want to scrape is located. Try using a pattern recognition algorithm. Regular expressions, for example, might be the way to go.
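Here is a minimal Python sketch of the regular expression idea. The text and the price format are made up for illustration; a real page will need its own pattern.

```python
import re

# Made-up, inconsistently formatted text, as you might find on an unstructured page.
TEXT = """
Espresso beans 500g - now only EUR 7.99!
Price: 12.50 EUR (filter coffee, 1kg)
decaf ... EUR 5.00
"""

# A price is digits, a dot and two digits, with "EUR" either before or after it.
PRICE = re.compile(r"EUR\s*(\d+\.\d{2})|(\d+\.\d{2})\s*EUR")

# findall returns one (number after EUR, number before EUR) pair per match;
# exactly one of the two is filled in.
prices = [float(after_eur or before_eur) for after_eur, before_eur in PRICE.findall(TEXT)]
print(prices)
```

The pattern doesn't care where on the page a price sits, only what it looks like, which is what makes this approach useful when the structure is unreliable.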

Data is being loaded dynamically

You know this from social media websites: Facebook or Twitter don’t load their entire database at once, but use “infinite scrolling”, where a new section only comes into view once you scroll down. In this case, you might need a scraping library that simulates a user and actually interacts with the website: scrolling down, entering user data, everything that you do. Selenium can do this, for example. Note: This is one of the more advanced methods, even for scraping, but it’s also one of the coolest. Seeing a browser window open up on its own and enter information because I told a robot to do it is pretty awesome.

Target is behind a paywall or login window

Selenium or similar libraries can help you with this, too — as I said, they can do anything you can do on the web, just automatically. Be careful when scraping content behind a login window, though: Some websites forbid automatic data extraction, and once you create an account and accept their terms of service, you might actually break your contract with them by doing it anyway.

Your IP gets blocked

Some pages try to detect unusual traffic. If you scrape a lot of pages really quickly, they might block your IP to protect themselves from attacks or to prevent you from extracting data. There are services that offer IP rotation: They automatically change your IP every X seconds. A VPN might also be helpful. Still, once you encounter that barrier, maybe check with your legal department whether your scraping is still on the legal side.
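Since speed is what tends to trigger these blocks, one thing you can try in your own scripts is to simply pause between requests. Below is a minimal Python sketch of that idea. The `fetch_politely` function, its parameters and the two-second delay are assumptions for illustration, not a recommendation from any particular website, and the example uses a fake `fetch` so that nothing is actually downloaded.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


# Demo with stand-ins: a fake fetch, and a "sleep" that only records the pauses.
pauses = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), "pages fetched, pauses:", pauses)
```

In a real script you would leave out the `sleep` argument so the function actually waits, and pass in a `fetch` function that downloads the page.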

Captchas

You know those. They’re supposed to make sure the people using a site are actual humans. And your scraper might be one of the non-people they’re trying to keep out. This is a tricky problem. You might think about entering the Captchas by hand as your scraper encounters them or, if you don’t want to do that and you really need what’s behind that “no robots” sign, you might ask a proper nerd for help.

 

Happy scraping, and don’t panic!

Don’t worry, though: Most of the situations you’ll encounter won’t require heavy-duty scrapers.

As you can see, there’s a bunch of options out there for those of you who do not code at all. Programming your own scrapers gives you even more freedom in the process and helps you get past any limitations to the tools I’ve just introduced. Still: if you know how to extract a table, and maybe even how to use the Scraper extension: awesome! You already know more than 99 percent of the population. And that fact is definitely true — we’re data journalists, after all.

We hope this tutorial-slash-toolkit-overview has provided you with a good starting point for your scraping endeavours.

Thanks for reading, and happy scraping!
