Scraping

Scraping [ˈskreɪpɪŋ]: Automatically extracting data from the Internet. Might be useful when your government releases a dataset but does not make it easy to download.

Usually the process of scraping involves two steps: first you download a web page, then you process the source code to extract the data. There are tools to help you, but a little programming experience usually comes in handy. A good library for downloading a web page is Python's requests package. To further process the content, a basic knowledge of web programming is very helpful. BeautifulSoup (bs4) is a great Python package for processing HTML pages. Often the data already comes in JSON format and can be extracted directly.
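
As a minimal sketch of those two steps with requests and BeautifulSoup; the URL and the table structure are hypothetical placeholders, not a real dataset:

```python
import requests
from bs4 import BeautifulSoup

# Step 1: download the web page (the URL is a hypothetical placeholder).
response = requests.get("https://example.org/open-data")
response.raise_for_status()

# Step 2: process the source code to extract the data,
# here the cells of every table row on the page.
soup = BeautifulSoup(response.text, "html.parser")
for row in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```

If the page serves JSON instead of HTML, `response.json()` gives you the data directly, with no parsing step.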

Significance

Significance [sɪɡˈnɪf.ə.kəns]: A measure of certainty of a statistical result, used in hypothesis testing. In other words, the result would be unlikely to occur if the hypothesis you want to disprove were true. To be statistically significant, the probability of the result (the p-value) has to be smaller than a significance level chosen beforehand. That level might be 5 % or 1 %, for example. If that’s the case, you win: you’d reject the hypothesis you wanted to disprove since it’s so unlikely to be true.

Significance Level

Significance Level [sɪɡˈnɪf.ə.kəns ˈlev.əl]: A threshold for the probability of wrongly rejecting a true hypothesis (a false positive) when testing a hypothesis. Usually called alpha (α). Common choices are α = 0.05 (5 %) or α = 0.01 (1 %). If the calculated p-value is lower than the significance level, the result is called statistically significant.
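
As a sketch of how this comparison looks in practice, here is a two-sample t-test with SciPy; the numbers and the choice of test are purely illustrative:

```python
from scipy import stats

# Two made-up samples, e.g. measurements under two conditions.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.0, 5.7]

alpha = 0.05  # significance level chosen beforehand

# Two-sample t-test: the hypothesis to disprove is that
# both groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: statistically significant")
else:
    print(f"p = {p_value:.4f} >= {alpha}: not statistically significant")
```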

Simpson’s paradox

Simpson’s paradox [ˈsɪmps(ə)nz ˈpærədɒks]: A phenomenon in statistics by which one can derive opposite conclusions from the same data, depending on whether you look at the data as a whole or separate them by a specific factor. A correlation observed in every subgroup of the data does not have to hold in the dataset as a whole, and vice versa. It is named after Edward H. Simpson, who described it in 1951.
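
A small numeric sketch with made-up data makes this concrete: within each of the two groups below, y rises with x, yet in the pooled data the trend reverses.

```python
import numpy as np

# Two made-up groups; within each, y rises with x (correlation +1).
x1, y1 = np.array([1, 2, 3]), np.array([8, 9, 10])
x2, y2 = np.array([7, 8, 9]), np.array([1, 2, 3])

print(np.corrcoef(x1, y1)[0, 1])  # 1.0
print(np.corrcoef(x2, y2)[0, 1])  # 1.0

# Pooled together, high x now goes with low y: the trend reverses.
x_all = np.concatenate([x1, x2])
y_all = np.concatenate([y1, y2])
print(np.corrcoef(x_all, y_all)[0, 1])  # about -0.9
```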