rvest

by Kira Schacht

This is a super useful R package that lets you scrape static websites with very little effort.

rvest can get you anything that appears in the source code of a website. The process is simple, as you can see in the image above:

  1. Use read_html to get the website’s code.
  2. Find HTML elements with html_node, or html_nodes if you want multiple.
  3. Use html_text or html_attr to get the text inside the element or the value of an attribute, respectively.
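
Here is a minimal sketch of those three steps. To keep it self-contained, it parses a small made-up HTML string instead of a live URL, so the markup, the class name and the link are assumptions for illustration only. In newer rvest releases (1.0.0 onward), html_element and html_elements are the preferred names for html_node and html_nodes, but the older names still work.

```r
library(rvest)

# Hypothetical page source; in practice you would pass a URL to read_html().
page_source <- '<html><body>
  <h1 class="title">Example page</h1>
  <p>First paragraph.</p>
  <p>Second paragraph with a <a href="https://example.com/more">link</a>.</p>
</body></html>'

# 1. Parse the HTML.
page <- read_html(page_source)

# 2. Find elements: a single node, or all matching nodes.
title_node <- html_node(page, "h1.title")
paragraph_nodes <- html_nodes(page, "p")

# 3. Extract the text inside an element or the value of an attribute.
title_text <- html_text(title_node)
paragraph_texts <- html_text(paragraph_nodes)
link_target <- html_attr(html_node(page, "a"), "href")
```

The selectors are ordinary CSS selectors, so whatever you would write in a stylesheet to target an element should work here too.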

It even features functions like html_table, which imports an entire table directly as a data frame. That’s immensely helpful, since we data-driven people deal with tables a lot.
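
As a rough illustration, html_table can be pointed at a table node like this. The table below is invented for the example; a real page would be read from its URL.

```r
library(rvest)

# Hypothetical table markup with made-up values.
table_source <- '<table>
  <tr><th>fruit</th><th>count</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>'

page <- read_html(table_source)

# Select the table node and convert it to a data frame.
# The row of <th> cells is used as the column names.
fruit_counts <- html_table(html_node(page, "table"))
```

If you call html_table on several nodes at once (for example the result of html_nodes), you get a list of data frames, one per table.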

One thing rvest can’t do, though, is get you all the content from websites that use lazy loading. For example, your Twitter feed will, of course, not load in its entirety the first time you open the page; more posts are only shown once you scroll down. rvest loads only the content that’s there when you first open a page, so it won’t get you entire streams.

Parts of a site that are generated by embedded scripts can also be difficult. But if you’re unsure, take a look at the site’s source code: Anything you can Ctrl+F on there, you can also get with rvest. And believe me: That will get you a long way.
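
To make the script limitation concrete, here is a small sketch with an invented page. One paragraph is written directly into the HTML; the other would only be filled in by a script once a browser runs it. rvest parses the source but does not execute scripts, so the second paragraph comes back empty.

```r
library(rvest)

# Hypothetical page: one paragraph is static, the other is filled by a script.
page_source <- '<html><body>
  <p id="static">Written directly into the HTML.</p>
  <p id="dynamic"></p>
  <script>
    document.getElementById("dynamic").textContent = "Added by a script.";
  </script>
</body></html>'

page <- read_html(page_source)

# The static paragraph is in the source, so its text is available.
static_text <- html_text(html_node(page, "#static"))

# The script is never run, so this paragraph is still empty.
dynamic_text <- html_text(html_node(page, "#dynamic"))
```

In a browser, the second paragraph would show its text. In the parsed source, that string only exists inside the script element.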
