by Kira Schacht

rvest is a super useful R library that lets you scrape static websites with minimal effort.

rvest can get you anything from the source code of a website. The process is simple:

  1. Use read_html to get the website’s code.
  2. Find HTML elements with html_node – or html_nodes, if you want multiple.
  3. Use html_text or html_attr to get the text inside the element or the value of an attribute, respectively.
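The three steps above can be sketched in a few lines. To keep the example self-contained, a small inline HTML snippet stands in for a live page (read_html also accepts a literal HTML string); the selector and link are made up for illustration:

```r
library(rvest)

# 1. Get the page's code – here an inline snippet instead of a URL
html <- read_html('<div><a href="https://www.dw.com" class="link">DW</a></div>')

# 2. Find the element with a CSS selector
link <- html_node(html, "a.link")

# 3. Extract text or an attribute value
html_text(link)         # "DW"
html_attr(link, "href") # "https://www.dw.com"
```

With a real site you would pass the URL to read_html instead, and the rest of the pipeline stays the same.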

It even features functions like html_table, which imports an entire table directly as a data frame. That’s immensely helpful, since we data-driven people deal with tables a lot.
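A quick sketch of html_table, again using an inline table with made-up numbers in place of one found on a real page:

```r
library(rvest)

# An inline HTML table stands in for a table on a real website
html <- read_html("<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Berlin</td><td>3644826</td></tr>
  <tr><td>Hamburg</td><td>1841179</td></tr>
</table>")

# html_table converts the <table> element straight into a data frame,
# using the <th> cells as column names
df <- html_table(html_node(html, "table"))
df
```

No manual loop over rows and cells needed – the table arrives ready for analysis.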

One thing rvest can’t do, though, is get you all the content from websites that use lazy loading. For example, your Twitter feed will, of course, not load in its entirety the first time you open the page; more posts will only be shown once you scroll down. rvest loads only the content that’s there when you first open a page, so it won’t get you entire streams.

Another thing that might be difficult is content generated by embedded scripts. But if you’re unsure, take a look at the site’s source code: Anything you can Ctrl+F on there, you can also get with rvest. And believe me: That will get you a long way.

Comments (2)

  1. Macky
    Hello, I'm trying to scrape a table from a webpage that uses lazy loading. I'm only able to scrape the content that is visible, as you said, and my Google searches brought me here. :) Is there something that you could suggest to solve this problem? Thanks in advance.
