Scraping [ˈskreɪpɪŋ]: Automatically extracting data from the Internet. This is useful, for example, when your government releases a dataset but does not make it easy to download.

Scraping usually involves two steps: first you download a web page, then you process its source code to extract the data. There are tools to help you, but a little programming experience usually comes in handy. A good library for downloading a web page is Python's requests package. To process the content further, a basic knowledge of web programming is very helpful. BeautifulSoup (bs4) is a great Python package for processing HTML pages. Often the data already comes in JSON format and can thus be extracted directly.
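The two steps can be sketched roughly as follows. To keep the example runnable without installing anything, it swaps in Python's standard library: urllib.request stands in for requests and html.parser stands in for BeautifulSoup. The names download, TableParser, extract_rows and SAMPLE_HTML are made up for this illustration, and the sample table is invented data.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def download(url: str) -> str:
    """Step 1: fetch the raw source of a page as text."""
    # Some servers reject requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "scraping-example/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableParser(HTMLParser):
    """Step 2: collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # guard against an unclosed cell


def extract_rows(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample page, so the demo works offline.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Population</th></tr>
  <tr><td>North</td><td>1200</td></tr>
  <tr><td>South</td><td>3400</td></tr>
</table>
"""

if __name__ == "__main__":
    # For a live page, combine both steps: extract_rows(download(url)).
    print(extract_rows(SAMPLE_HTML))
    # When the server already answers with JSON, no HTML parsing is needed.
    print(json.loads('{"region": "North", "population": 1200}')["population"])
```

With requests and BeautifulSoup installed, the same two steps would typically be shorter, but the structure stays the same: one call to fetch the page, one pass to pull the values out of the markup.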

There is static and dynamic scraping. Simply downloading the source of a web page is called static scraping. Sometimes, however, you need to navigate and interact with the page to access all the data. That is when you want to try dynamic scraping. Selenium offers libraries for Python and Java, among other languages, for that. It automatically opens a web browser and navigates through the pages just as you tell it to!
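A minimal sketch of dynamic scraping with Selenium's Python library might look like this. It assumes a hypothetical page that lists data in table rows and has a "next" link for paging; the function name scrape_pages, the URL and both CSS selectors are placeholders, and it assumes Selenium and Firefox are installed. The Selenium import sits in the main block so the paging logic itself can be read and tried without a browser.

```python
CSS = "css selector"  # the same string as selenium's By.CSS_SELECTOR


def scrape_pages(driver, start_url, row_selector, next_selector, max_pages=5):
    """Open start_url, collect row text, and follow the "next" control."""
    driver.get(start_url)
    collected = []
    for _ in range(max_pages):
        # Read the text of every element matching the row selector.
        for element in driver.find_elements(CSS, row_selector):
            collected.append(element.text)
        next_buttons = driver.find_elements(CSS, next_selector)
        if not next_buttons:
            break  # no further page to navigate to
        next_buttons[0].click()
    return collected


if __name__ == "__main__":
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()  # opens a real browser window
    try:
        # Placeholder URL and selectors; replace them with the real page's.
        rows = scrape_pages(driver, "https://example.org/data", "table tr", "a.next")
        print(rows)
    finally:
        driver.quit()  # always close the browser again
```

On real pages the content often loads with a delay after a click, so a production script would usually wait for the new rows to appear (Selenium provides WebDriverWait for this) before reading them.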
