Simpson’s paradox

Simpson’s paradox [ˈsɪmps(ə)nz ˈpærədɒks]: A phenomenon in statistics in which the same data can support opposite conclusions, depending on whether you look at the data as a whole or split them by a specific factor. A correlation observed in every part of the data does not have to appear in the dataset as a whole, and vice versa. The paradox is named after Edward H. Simpson, who described it in 1951, although similar effects had been noted earlier by Karl Pearson and Udny Yule.

If you plot the percentage of AfD (Alternative for Germany, a right-wing populist party) voters against the percentage of migrant population for each city, a clear trend emerges in the data from the 2017 German federal election: Cities with a higher share of migrants seem to be less likely to vote for AfD. But the trend does not reflect a causal relationship between the share of migrants and the share of AfD voters: Once you split the data into East and West Germany, the effect disappears. The two parts of the country are still so profoundly different in their culture and demographics that this difference alone explains the pattern: Cities in the East tend to have more AfD voters, but also fewer migrants. Pooling both regions therefore produces a downward trend that is absent within either region. The statistical phenomenon underlying this strange result is called Simpson’s paradox.
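To see how pooling two groups can create a trend that neither group shows on its own, here is a minimal Python sketch. The numbers are invented purely for illustration and are not real election results; the names `east`, `west`, `pearson` and `correlation` are hypothetical helpers, not part of any dataset or library. Each region is constructed so that migrant share and AfD share are uncorrelated within it, while the "East" cities have fewer migrants and more AfD voters overall.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented numbers for illustration only, NOT real election results.
# Each tuple is (migrant share in %, AfD vote share in %) for one
# hypothetical city.
east = [(2, 22), (4, 18), (6, 18), (8, 22)]
west = [(12, 10), (14, 6), (16, 6), (18, 10)]

def correlation(cities):
    """Correlation between migrant share and AfD share for a list of cities."""
    migrants, afd = zip(*cities)
    return pearson(migrants, afd)

r_all = correlation(east + west)   # both regions pooled
r_east = correlation(east)         # East only
r_west = correlation(west)         # West only

print(f"all cities: {r_all:.2f}")
print(f"East only:  {r_east:.2f}")
print(f"West only:  {r_west:.2f}")
```

With these made-up values, the pooled correlation is strongly negative (about -0.87), while the correlation inside each region is zero. The apparent trend comes entirely from the gap between the two groups, which is the pattern the election example describes.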

