Similarity and distance in data: Part 1

Similarity and distance in data: Part 1

Part 2

In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.

Before we get to know some of them, though, let’s think about what we’d expect such a measure to do. It’s easily done: If two objects are similar, the measure should be high (maximum for two perfectly similar objects). If they’re dissmilar, the value of the similarity measure should be low, so it should either converge to zero or to a negative number. We can, of course, set other expectations, but this is the bare minimum any measure of similarity should satisfy.

The more distant, the less similar

Because of these properties, similarity measures are often obtained by simply using the inverse of a distance metric. The intuition behind this is that the futher apart two objects are, the more dissimilar they are and the bigger the “distance” between them is. The more similar the objects are, the closer they are and the smaller the distance between them is. This is why in this tutorial, we’ll take a look at different ways to measure the elusive conept of a “distance” between two points of data.

Distance measures should have a few specific properties. They might sound a little math-y, but we’ll concentrate on the relatively straightforward concepts behind them:

d(x,y) \geq 0
The distance of two objects x and y can’t be less than zero.

d(x,y) = 0 \iff x = y
Two perfectly similar objects have distance zero.

d(x,y) = d(y,x)
The distance between x and y is the same as between y and x — it doesn’t matter which way you go.

d(x,z) \leq d(x,y) + d(x,z)
If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. This is called the triangle inequality.

Now that we got that out of the way, let’s look at a few distance measures. Again, if it sounds too mathematical, just take a deep breath and focus on the concepts. Or just skip the math altogether and look at how to implement and visualize distance measures in R, which we’ll focus on in the second part of this tutorial.

Euclidean or Non-Euclidean?

There’s two major classes of distance measures we can distinguish: Euclidean ones and Non-Euclidean ones. You should choose the appropriate one according to wether or not your data can be represented as points in a Euclidean space. A Euclidean space is any space that has some real-valued number of dimensions where points can be located. Your common two-dimensional or three-dimensional coordinate systems are examples for such spaces.

The important thing is that it has to be possible to define an average over the data points for it to be a Euclidean space. So if you’re working with vectors that have real-valued components you can compute an average over, then voilà, you’re working in a Euclidean space.

We’re going to look more closely at a few distance measures, Euclidean ones as well as Non-Euclidean ones:

Euclidean distance

This is pretty much the most common distance measure. It’s so common, in fact, that it’s often called the Euclidean distance, even though there’s many Euclidean distance measures, as we just learned. It’s defined as

\sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2}

This Euclidean distance adds up all the squared distances between corresponding data points and takes the square root of the result. Remember the Pythagorean theorem? If you look closely, the Euclidean distance is just that theorem solved for the hypothenuse — which is, in this case, the distance between x and y. The Euclidean distance is pretty solid: It’s bigger for larger distances, and smaller for closer data points. It can get arbitrarily large and is only zero if the data points are all exactly the same. That’s fine though. If you take a look at the requirements we set for a distance function, that’s exactly what we want.

Manhattan distance

\sum\limits_{i=1}^n |x_i - y_i|

Also known as city block distance, Canberra distance, taxicab metric or snake distance, this is definitely the distance measure with the coolest name(s). Incidentally, they’re also pretty decriptive: The Manhattan distance is the shortest distance a car would have to drive in a city block structure to get from x to y. Since it takes the absolute distances in each dimension before we sum them up, the Manhattan distance will always be bigger or equal to the Euclidean distance, which we can imagine as the linear distance between the two points.

Maximum distance

The maximum distance looks at the distance of two points in each dimension and selects the biggest one. This one is pretty straightforward, but we can express it as a fancy formula anyway:

\max_{i}(|x_i - y_i|)

L-Norm / Minkoswki distance

The L-Norm is the generalized version of the aforementioned distance measures. It is defined as

(\sum\limits_{i=1}^n |x_i - y_i|^p)^{\frac{1}{p}}

 If p is equal to 2, we get the Euclidean distance, which is why it’s also called the L2-Norm. p = 1 returns the Manhattan distance or L1-Norm and p = \infty  equals the maximum.

To sum up Euclidean distance measures, let’s take a look at how they work in a simple two-dimensional space. The maximum distance is equal to the biggest distance in any dimension. In this case, that’s the difference betwenn the x values of points p and q, which is 8. The Manhattan distance sums up the distances in each dimension, so it’s 8 + 3 = 11 in this case.


What would the Euclidean distance, symbolizes by the orange line, be? Visualized like this, it’s pretty obvious how we can use the Pythagorean formula to get the result:

d_E(p,q)^2 = | p_x - q_x | ^2 + | p_y - q_y |^2 = 8^2 + 3^2\iff

 d(p,q) = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8,5

Amazing what can be done with a little trigonometry, right? Take a deep breath, because there’s more! Let’s look at some Non-Euclidean distance measures to make sure we can satisfy all our similarity measuring needs.

Cosine distance and similarity

The Cosine distance is defined by the angle between two vectors. As we know from basic linear algebra, the dot product of two vectors is defined by

x \cdot y = \|x\| \|y\| \cos{\theta}

where \theta is the angle between the two vectors. the smaller the angle is, the closer to 1 the cosine of the angle is, and the bigger the angle, the closer it is to -1. If you take a look at what we expected from a similarity measure, then the cosine meets our demands rather well. After all, if the angle between two vectors is very small, that means they’re very close together, and therefore more similar. So we’ll just solve the above equation for the cosine and define the cosine similariy to be equal to

\cos{\theta} = \frac{x \cdot y}{\|x\| \|y\|}

If we need to construct a distance measure from here, we can just take the inverse, as we learned before. So the cosine distance is defined as

1 - \cos{\theta}

Since we’re talking about vectors, it might be easy to assume this is also a Euclidean distance measure — and that may be right. If the vectors in question are actual real values, the cosine distance is Euclidean. But if the vectors have to be, say, integer components, we can’t compute an average over the points or we might get a non-integer result. Also, the cosine distance as such doesn’t satisfy the triangle inequality unless we alter it a bit. The cosine similarity, though, is a nice and efficient way to determine similarity in all kinds of multi-dimensional, numeric data.

Jaccard distance and similarity

Like with the cosine distance and similarity, the Jaccard distance is defines by one minus the Jaccard similarity. The Jaccard similarity uses a different approach to similarity than the measures we’ve seen so far. To compare two objects, it looks at the elements they have in common (the intersection) and divides it by the number of elements the two objects have in total (the union). Written out as a formula, that definition looks like this

\frac{X \cap Y}{X \cup Y}

\cap is the mathematical sign for intersection, \cup means union. With this definition, the similarity is only equal to one if all elements are the same and only becomes zero if all elements are different. Perfect for a similarity measure, but the wrong way around for a distance measure. This is easily solved by defining the Jaccard distance to be

1 - \frac{X \cap Y}{X \cup Y}

As an example, let’s compare the two sentences “Yesterday, the warm weather was perfect for my cat” and “My cat liked the warm weather yesterday”. Let’s call them X and Y. We could, of course, have used numbers or a mix of both as well, the Jaccard similarity doesn’t care.

The sentences have 6 words in common and 10 unique words in total. So the Jaccard similarity between them is 6/10 = 0.6 = 60 %. Their Jaccard distance is 1 – 0.6 = 0.4 = 40%.

A nice way to represent objects you want to compute the Jaccard similarity of is in the form of a Boolean matrix, a matrix with only ones and zeroes. The columns of our matrix symbolize the objects we want to find the similarity of and our rows are the unique elements of both objects — in this case, the words. One means the word is present in the object, zero means it isn’t. To compute the Jaccard similarity over the columns, all we have to do is count the rows where both objects have ones (6) and divide it by the total number of rows (10).



We don’t have to stop at single sentences, though. The Jaccard similarity is an efficient way to compute similarity over entire documents — a lot of documents if necessary. Our corresponding Boolean matrix will get very big, of course, but since the formula is relatively simple, it scales rather well to large datasets.

Edit distance

Lastly, let’s think about how to measure the similarity of two character strings. One way to do that is the edit distance. The edit distance is simply the minimum number of inserts and deletes needed to get from one string to the other.

Let’s say we have the words “knock” and “flocks”. To get from one to the other, we have to delete one letter (k) and insert three (f,l,s):

knock → _nock → lock → flock → flocks

So the edit distance betweeen them is four. The edit distance is  a proper distance measure since it satisfies all four requirements we set at the beginning of this lesson.

  • The distance of two objects x and y can’t be less than zero. There’s no way to do a negative number of edits, so that’s true.
  • Two perfectly similar ojects have distance zero. We don’t need any edits to transform a word into itself.
  • The distance between x and y is the same as between y and x. Every insert into one word is equal to a delete from the other, so the paths you take are always inverse and have the same number of steps.
  • If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. Changing from word x to word y before you change to z is one way to go from x to z. The direct way might be shorter, but it can never be longer than the detour.


Congratulations! You made it through all of the math and learned a lot about some ways to measure distance and similarity in your data. In Part two of this lesson, we’re going to leave the theory behind us. We’ll take a look at how to actually compute these distance measures in R and think about how to visualize similarity in data.

Part 2

R: Tidy Data

R: Tidy Data

Unfortunately, data comes in all shapes and sizes. Especially when analyzing data from authorities. You’ll have to be able to deal with pdfs, fused table cells and frequent changes in terms and spelling.

When I analyzed the swiss arms export data as an intern at SRF Data, we had to work with scanned copies of data sheets that weren’t machine-readable, datasets with either french, german or french and german countrynames in the same column as well as fused cells and changing spelling of the categories.

Unsurprisingly, preparing and cleaning messy datasets is often the most time-consuming part of data analysis. Hardley Wickham, creator of  R packages like ggplot and reshape, wrote a very interesting paper about an important part of this data cleaning: the data tidying.
According to him, tidy data has a specific structure:

Each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets.

As you may have seen in our post on ggplot2, Wickham calls this tidy format molten data. The idea behind this is to facilitate the analysis procedure by minimizing the effort in preparing the data for the different tools over and over again. His suggestion: Working on tidy, molten data with a set of tidy tools, allowing you to use the saved time to focus on the results.

Bildschirmfoto 2016-02-29 um 15.02.26

Excerpt of Wickhams “Tidy Data”

Practicing data tidying

But how do we tidy messy data? How do we get from raw to molten data and what are tidy tools? Let’s practice this on a messy dataset.

On our GitHub-page, we deposited an Excel file containing some data on marriages in Germany per state and for different years. Download it and open it with Excel to have a first look at it. As you’ll see, it’s a workbook with seven sheets. We have data for 1990, 2003 and for every year from 2010 through 2014. Although this is a quite small dataset which we could tidy manually in Excel, we’ll use this to practice skills that will come in handy when it comes to bigger datasets.

Now check whether this marriage data needs to be tidied:

  • Are there any changing terms?
  • Is the spelling correct?
  • Is every column that contains numbers correctly saved as a numeric column?
  • Are there needless titles, fused cells, empty cells or other annoying noise?

Spoiler alert: The sheets on 2010-2015 are okay, but the first two — not that much. We have different spelling and terms here, as well as three useless columns and one row plus all the numbers saved as text in the first sheet. As said, the mess in this example is limited and we could tidy it up manually with a few clicks. But let’s keep in mind that we’re here to learn how to handle those problems with larger datasets as well.

Within Excel, we will:

  • Delete spare rows and columns (we could do that in R too when it comes to REALLY messy data)
  • Save columns containing numbers as numeric type

Now we’ll change to R.

First of all, we need to install and require all the packages we’re going to use. We’ll do that with an if-statement telling R only to install the package if it hasn’t been installed yet. You could of course do this in the straightforward way without the conditional statement if you remember wether you already installed the package, but this is a quick way to make sure you don’t install something twice needlessly.

To read in the sheets of an Excel workbook, read_excel() from the readxl-package is a useful function. Because we don’t want to load the sheets separately, we’re going to use a loop for this. If you’re interested in learning more about loop functions, stay tuned for our upcomming tutorial on this topic.

messy_data is now a list of seven local data frames with messy_data[[1]] containing the data for 1990, messy_data[[2]] for 2003 and so on. Also, we added a “timestamp” column to each list element which contains the index of the list element.

To save the sheets as list elements is time saving, but we want all the data in one data frame:

If you get an error telling you the frames have different lengths you probably forgot to delete the spare columns in the 1990 sheet. Sometimes there even seems to be something invisble left in empty excel columns. I usually delete three or so of the empty columns and empty rows next to my data to be sure there isn’t something left I can’t see.

Next part: Restructuring the data

With the function gather() of Wickhams tidyr-package, we’ll melt the raw data frame to convert it to a molten format. And let’s change the timestamps created by the read-in loop to the actual year (we could do that with a loop, too, but this is good enough for now).

Oo-De-Lally! This is tidy data in a tidy format! Now we can check if we have to correct the state names (because with bigger datasets, you can’t quickly check and correct spelling and term issues within Excel):

So we got 19 different german Bundesländer. But Google tells us that there are only 16 states in Germany! Let’s have a closer look at the names to check whether we’ll find duplicates:

Yes, there are! For example Baden-Württemberg and BaWü refer to the same state, as well as Hessen, Hesse and Hesssen. You can just manually correct this. For really big datasets, you could also work with regular expressions and string replacement to find the duplicates, but for now, this should be enough:

Now that your data is tidy, the actual analysis can start. A very useful package to prepare molten data is dplyr. Its functions ease filtering the data or grouping it. Not only is this great for taking a closer look at certain subsets of your data, but, because Wickhams graphics package ggplot2 was created to fit the tidy data principle, we can quickly shape the data to be visually analyzed, too.

Here we have some examples for you showing how tidy data and tidy tools can work hand in hand. If you want to learn something about the graphics package ggplot2 first, visit our post for beginners on this!

Visual analysis with ggplot2: this may look complicated at first, but once you have coded the first ggplot you only have to change or/and add a few things to create several more and totally different plots.

Bildschirmfoto 2016-03-05 um 00.15.33

Maybe you’ve got some other questions this data could answer for you? Feel free to continue this analysis or try to tidy your own data set!

If you have any questions, problems or feedback, simply leave a comment, email us or join our slack team to talk to us any time!


{Credits for the awesome featured image go to Phil Ninh}

R: plotting with the ggplot2 package

R: plotting with the ggplot2 package

While crunching numbers, a visual analysis of your data may help you get an overview of your data or compare filtered information at a glance. Aside from the built-in graphics package, R has many additional packages to help you with that.
We want to focus on ggplot2 by Hadley Wickham, which is a very nice and quite popular graphics package.

Ggplot2 is based on a kind of statistical philosophy from a book I really recommend reading. In The Grammar of Graphics, author Leland Wilkinson goes deep into the structure of quantitative plotting. As a product, he establishes a rulebook for building charts the right way. Hadley Wickham built ggplot2 to follow these aesthetics and principles.

Your first interactive choropleth map with R

Your first interactive choropleth map with R

When it comes to data journalism, visualizing your data isn’t what it’s all about. Getting and cleaning your data, analyzing and verifying your findings is way more important.

Still, an interactive eye-catcher holding interesting information will definitely not hurt your data story. Plus, you can use graphics for a visual analysis, too.

Here, we’ll show you how to build a choropleth map, where your data is visualized as colored polygon areas like countries and states.
We will code a multilayer map on Dortmunds students as an example. You’ll be able to switch between layered data from different years. The popups hold additional information on Dortmunds districts.

Now for the data

First of all you need to read a kml-file into R. KML stands for Keyhole Markup Language and as I just learned from the comment section of this tutorial it is a XML data format used to display geospatial information in a web browser. With a bit of googling, you’ll find kml-files holding geospatial informations of your own city, state or country. For this example, we’ll use this data on Dortmunds districts. Right click the link and save the file. Download the kml-file and save it to a new directory named “journocode” (or anything you want, really, but we’ll work with this for now).

Start RStudio. If you haven’t installed it yet, have a look at our first R Tutorial post. After starting RStudio, open a new R script and save it to the right directory. For example, if your “journocode”-directory was placed on your desktop (and your Username was MarieLou), type

Remember to use a normal slash (/) in your file path instead of a backslash. Now, we can read the shape file directly into R. If you don’t use our example data, try open your kml-file with a text editor first to look for the layer name! As you can see on this screenshot, for “Statistische Bezirke.kml” we have a layer named “Statistische_Bezirke”, defined in row four, and utf-8 encoding (see row 1), since we have the german umlauts “ä”, “ö” and “ü” in our file.

Bildschirmfoto 2016-01-22 um 12.31.12

Let’s load the data into R. We’ll do this with a function from the rgdal-package.

If you get an Error that says “Cannot open data source”, chances are there’s something wrong with your file name. Check that your working directory is properly set and that the file name is correct. Some browsers will change the .kml fily type to .txt, or even just add the .txt ending so you get “filename.kml.txt”. You’ll usually find the “layer” argument in your text file, named something like “name” or “id”, as shown above.

Did it work? Try to plot the polygons with the generic plot() function:

You should now see the black outlines of your polygons. Neat, isn’t it?

Next, we’ll need a little data to add to our map. To show you how to build a multilayer map, we will use two different csv files:   student1 & student2

The data contains information on the percentage of 18 to 25 year olds living in Dortmund in 2000 and 2014. Download the files and save them to your journocode directory. Make sure they’re still named student1 and student2.

This can be tricky sometimes: For our data, the encoding is “latin1” and the separation marks are commas. Open the csv files with a text editor to check if your separator is a comma, a semicolon or even a slash.

If everything worked out for you, celebrate a little! You’re a big step closer to your multilayer map!


Now for the interactive part

After looking through your data and analyzing it, you will now have some important information on how many values you have, which are the smallest and the biggest. For our example, we did that for you:

The highest value is 26%, so we can now think of a color scale from 0 to 26 to fill in our map. There are different statistical ways to decide what classes we want to divide our data into. For this mapping exercise, we will simply take eight classes: 0-5, 5-8, 8-10, 10-12, 12-14, 14-18, 18-24 and 24-26.

For every class, we want our map to fill the polygons in a different color. We’ll use a color vector generated with ColorBrewer here. Just copy the colour code you want, put it in a vector and replace it in the code. To paste the colors to the classes, use the function colorBin(). This is were you’ll need the package leaflet, which we will use to build our map. Install it, if you haven’t already.

Next up is the little infowindow we want to pop up when we click on the map. As you can see, I used some html code to specify some parameters for the first popup. For the second popup, I used a simpler way.

paste0() does the same thing as paste() but with no default separator. Check ?paste0 for more info. If something doesn’t work, check the punctuation!


Now for the map

After that, we’ll start right away with puzzling together all the parts we need:

The %>% operator is special to the leaflet package. Similar to the “+” in ggplot, it’s used to link two functions together. So remember: If you have a “%>%” opearator at the end of the line, R will expect more input from you.

The call to the function leaflet() starts the mapping procedd. The Provider Tile is your map base and background. If you don’t want to use the grey Tile in the example, have a look at this page and choose your own. Don’t worry if no map appears yet. With leaflet, you won’t see the actual map right away. First we’ll add the polygon layers and the popups we’ve defined to our map:

In our map, we want to be able to switch layers by clicking on a layer control panel with the group names. We’ll code that now:

Next, we want to add a thin color legend that shows the minimum and maximum value and the palette colors

The big moment: did it work? No mistake with the brackets or the punctuation? You’ll find out by typing:

Congratulations! You made your first multilayer choropleth with R! Now have fun building multilayer maps of your own city/country or even the whole world! If you want to publish your map, make sure you have the “htmlwidgets” package installed and add the following code to your script:

This will create a directory named “mymap_files” and a “mymap.html”-file. Save these two files in the same directory and load that on to your server. Et voilà: Your map is online!

If you publish a map based on our tutorial, feel free to link to our webpage and tell your fellows! We’d be delighted!


{Credits for the awesome featured image go to Phil Ninh}

R crash course: Basic data structures

R crash course: Basic data structures


„To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.“John M. Chambers

Data structures in R are quite different from most programming languages. Understanding them is a necessity, because they define the way you’ll work with your data. Problems in understanding data structures will probably also produce problems in your code.

R crash course: Writing functions

R crash course: Writing functions

As you know by now, R is all about functions. In the event that there isn’t one for the exact thing you want to do, you can even write your own! Writing your own functions is a very useful way to automate your work. Once defined, it’s easy to call new functions as often as you need. It’s a good habit to get into when programming with R — and with lots of other languages as well.

Defining a function uses another function simply called function(). Function names follow pretty much the same rules as variable names, so you can call them anything that would also be acceptable as a variable name.

Let’s try an easy example to see how function definitions work:

A function of questionable usefulness: It essentially does the same thing as print(). It takes an argument called x, and prints whatever you put as x to the console.

Theoretically, you can make your function take as many arguments as you want. Just write them in the parentheses of function(). You can call the arguments however you want, too. Also, your functions will probably often require more than one line. In that case, just put whatever you want your function to do in curly brackets {}. It will look somewhat like this:

Let’s mess with that one a bit! Run the following code line by line and try to guess what went wrong.

Possible errors while writing functions

Errors aren’t just a necessary evil in coding. By making mistakes, you get to know your programming language better and find out what works — and, of course, what doesn’t work. Let’s go through the errors one by one:

  • squareadd(3): You passed the function only one argument (3, which was attributed to the “x” argument) to work with when it expected two values, one for x and one for y.
  • squareadd(3,”two”): Now you passed the function two arguments, but one’s not a number. It’s a character, since it has quotes around it. But R can’t execute the function with a character. After all, what is 3^2 + “two” supposed to mean?
  • squareadd(3,two): No quotes this time in the second argument. Because the “y” argument is not in quotes and not a number, either, R assumes it’s a variable or some other object. Problem is: R can’t find the object called two anywhere
  • After you define the object two to be equal to 2, though, R does find a matching object to put as an argument. So this time around, squareadd(3,two) should return the number 11

After we change the function definition to include only the “x” argument, the errors we get change a little. Note that we there’s still a “y” in the function body.

  • squareadd2(3,2): Other way around this time. Your function expected only one argument, but got two.
  • squareadd2(3): You passed the correct number of arguments, but R can’t find anything to use for the y in the function body, neither inside the function nor in the global environment.
  • This is why, after you defined y to be equal to four in the global environment, squareadd2(3) works fine and will return 13 (since 3^2 + 4 = 13).

Scoping Rules in R

Some of the errors you’ll get, such as those in the last two lines, are due to something called the scoping rules of R. These rules define how R looks for the variables it needs to execute a function. It does that by looking through different environments — sub-spaces of your working environment that have their own variables and object definitions — in a certain order. There’s two basic types of scoping:

  • Lexical scoping: Looking for missing objects in the environment where the function was defined.
  • Dynamic scoping: Looking for missing objects in the environment where the function was called.

R uses lexical scoping. So if it doesn’t find the stuff it needs within the function (which, incidentally, has its own little environment), it goes on to look in the environment where the function was defined. In many cases, this will be the global environment, which is what you’re coding in if you’re not inside a specific function. If it doesn’t find what it needs there either, it will continue down the search list of environments. You can take a look at the list by typing search() into your console.

Let’s take a quick look at the difference between dynamic and lexical scoping. Look at the following code and try to guess its output. Execute it in RStudio and see if you’re right.

The output depends on the scoping rules your programming languages uses. As you just learned, R uses lexical scoping. So if you call check(), a is set to FALSE only on the function environment of check(). But since istrue() was defined in the global environment, where a is still equal to TRUE, it will print “that’s right!” to your console. If R used dynamic scoping, it would go with a <- FALSE, since that is accurate for the environment where istrue() was called.

You don’t have to worry too much about the specifics of scoping rules and environments when starting to code, but it’s a useful thing to keep in mind. There’s lots of good info on scoping, searching and environments in R on the web, as well as more tutorials on writing your own functions. We’ll be putting together some resources on our website soon, so stay tuned for that.

But for now — well done! That was a lot of new info to process. print() yourself a “Good job!” to the console before you go on and practice writing some more functions. We’re looking forward to your coding experiences!

Bonus round: Can you count how often the word “function” appears in this text? Guess right and win a complimentary function congratulating you on your newly acquired coding skills.


{Credits for the awesome featured image go to Phil Ninh}

R exercise: Analysing data

R exercise: Analysing data

While using R for your everyday calculations is so much more fun than using your smartphone, that’s not the (only) reason we’re here. So let’s move on to the real thing: How to make data tell us a story.

First you’ll need some data. You haven’t learned how to get and clean data, yet. We’ll get to that later. For now you can practice on this data set. The data journalists at Berliner Morgenpost used it to take a closer look at refugees in Germany and kindly put the clean data set online. You can also play around with your own set of data. Feel free to look for something entertaining on the internet – or in hidden corners of your hard drive. Remember to save your data in your working directory to save yourself some unneccessary typing.

Read your data set into R with read.csv(). For this you need a .csv file. Excel sheets can easily be saved as such.

Now you have a data frame. Name it anything you want. We’ll go with data. Check out class(data). It tells you what kind of object you have before you. In this case, it should return data frame.

Time to play!

Remember, if you just type data and run that command, it will print the whole table to the console. That might be not exactly what you want if your dataset is very big. Instead, you can use the handy functions below to get an overview of your data.

Try them and play around a little bit. Found anything interesting yet? Anything odd? In the data set we suggested, you’ll notice that the mean and the median are very different in the column “Asylantraege” (applications for asylum). What does that tell you?

Row and column indices

This is how you can take a closer look at a part of the whole set using indices. Indices are the numbers or names by which R identifies the rows or columns of data.

The last two alternatives only work if your columns have names. Use the function names() to look them up or change them.

Here are some more useful functions that will give you more information about the columns you’re interested in. Try them!

Subsets and Logic

Now you can take and even more detailed look by forming subsets, parts of your data that meet certain criteria. You’ll need the following logical operators.

Try to form different subsets of your data to find out interesting stuff. Check if it worked with View()head()tail(), etc.

Try to kick out all the rows that have “0” in the column “Asylantraege” (applications for asylum). Look at it again. What happened to mean and median?

Get the answers you want

With everything you learned so far, you can start to get answers. See what questions about your data can be answered by forming data subsets. For example, if you used the data set we suggested: Where do most people seeking refuge in Germany come from?

We made a list of the ten most common countries of origin.


Ask your own questions. What do you want your data to tell you?


{Credits for the awesome featured image go to Phil Ninh}

R crash course: Workspace, packages and data import

R crash course: Workspace, packages and data import

In this crash course section, we’ll talk about importing all sorts of data into R and installing fancy new packages. Also, we’ll learn to know our way around the workspace.

Your workspace in R is like the desk you work at. It’s where all the data, defined variables and other objects you’re currently working with are stored. Like with a desk, you might want to clean it every once in a while and throw out stuff you don’t need any more. There’s a few useful commands to help you do that. Take a look and try them out:

R crash course: Vectors

R crash course: Vectors

Now that you installed RStudio, learned about assignments and wrote some basic code, there’s nothing stopping you from becoming a journocoder!

To get a deeper understanding of how R stores your data, we’re now going to take a closer look at data structures in R, starting with a central concept: Vectors.

Working with vectors

You will work with vectors a lot in R — and I mean a lot. R loves vectors. It treats a scalar — a single value — as nothing but a vector with only one value. There’s all kinds of data structures in R, but most of them are basically just different compositions of vectors. We will get to know them better as we go along. For example, a matrix consists of a vector cut into multiple pieces of the same length. A list is a combination of vectors with different lengths and R even manages to see data frames as something made of vectors. So if you know how to handle vectors in R, that’s a good step towards coding proficiency.

Vectors are created with the c()-function. Like single values, you can name your vectors however you want and perform all kinds of calculations on them.

Elements of a vector are seperated by a comma in the c() function, but you can generate sequences of numbers in different ways. For example, if you write “1:10” instead of a value, R will add the numbers 1 through 10 to your vector. Also, instead of writing “c(3,3,2,2)”, you can tell R to repeat the numbers 3 and 2 two times each with the rep() function — like I did below with the variable h2. You can also tell R to repeat a whole sequence like with p2. Run the code below and have a closer look at the variables and the output R returns.

Try to create some vectors in different ways by yourself!

Now, define two vectors of the same length (with the same number of elements) and try to do some basic math you’ve learned in the chapter before. For example, try:

Try some more things if you want. Now go for the basic math functions:

In the last chapter I said sum(5, 4) does the same as 5+4. Is this still true when it comes to vectors? Compare the results!

Operations like sqrt() and log() can only be applied to positive values. They will work for every positive value of your vector but will give you an error message and return NaN instead of a result for the negative elements. NaN stands for “not a number”. It is possible to work with a vector containing NaNs, but you should double check if you actually want them in there.


Watch out!

So far for vectors of the same length. What about vectors that have a different number of elements? Try this:

Works well, hm? But why? The answer is something you should keep in mind: If (for an operation where the vectors have to be the same length) one vector is shorter than the other, R repeats the elements of the shorter vector until the two are the same length! So for “n+m”, R doesn’t calculate “(1, 2)+(4, 5, 6, 7)” but “(1, 2, 1, 2)+(4, 5, 6,7)”.


Interesting functions for your first data analysis

Let’s look at a few useful functions that can help you analyze vectors. Remember to use the help functions or the internet if you don’t understand a function.

But wait, there’s more: You can round vector elements or turn a vector to a matrix. Look closely at the output of this piece of code: What is the difference between C and C2? What is the difference between C2 and C3?

Functions, as you may have already noticed, can work with different parameters that determine their output. These are called arguments. They can be specified in parentheses after the function name. Here, the second argument of the matrix function tells R how many rows the matrix will have. The logical argument byrow controls in what way my matrix will be filled with the vectors elements. Because this is a crash course, we won’t go much further into vectors and matrices. But if you want to learn more about them, go for it!



At this point you know enough about programming in R to have a closer look at what’s useful for journocoding! In the meantime, it’s always a good idea to play around with what you’ve already learned!

In the next chapters, we will get to know other data structures, like lists or data frames. We will learn how to load data into the workspace, like excel sheets or csv files.  And we will have a look at the most important statistical values that are interesting for journocoders like you and how R can help you analyze and visualize your data. You will learn how to use and write functions and how to use packages in R. Sounds awesome, right? Let’s do it!


{Credits for the awesome featured image go to Phil Ninh}