Similarity and distance in data: Part 2
In part one of this tutorial, you learned what distance and similarity mean for data and how to measure them. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.
Implementation in R: the dist() function
The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
The crucial argument here is method. It has six options, some of which are special cases of others, as you’ll see:
- “euclidean” The Euclidean distance.
- “maximum” The maximum distance, meaning the largest absolute difference in any single dimension.
- “manhattan” The Manhattan or city block distance.
- “canberra” The Canberra distance, a weighted version of the Manhattan distance: each absolute difference is divided by the sum of the absolute values of the two entries before summing up.
- “binary” The Jaccard distance.
- “minkowski” The Minkowski distance, based on the Lp norm. The generalized version of Euclidean and Manhattan distance. Returns the Manhattan distance if p = 1 and the Euclidean distance for p = 2.
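To get a feel for how these options relate to each other, here’s a small sketch on two made-up observations (the matrix x and its values are invented for illustration, not part of the tutorial’s dataset). The comments show the arithmetic behind each result.

```r
# Two made-up observations (rows) with three numeric features each
x <- rbind(a = c(1, 2, 3),
           b = c(4, 6, 3))

d_euclidean <- dist(x, method = "euclidean")         # sqrt(3^2 + 4^2 + 0^2) = 5
d_maximum   <- dist(x, method = "maximum")           # max(3, 4, 0) = 4
d_manhattan <- dist(x, method = "manhattan")         # 3 + 4 + 0 = 7
d_canberra  <- dist(x, method = "canberra")          # 3/5 + 4/8 + 0/6 = 1.1
d_mink_p1   <- dist(x, method = "minkowski", p = 1)  # same as Manhattan
d_mink_p2   <- dist(x, method = "minkowski", p = 2)  # same as Euclidean

print(c(euclidean = as.numeric(d_euclidean),
        maximum   = as.numeric(d_maximum),
        manhattan = as.numeric(d_manhattan),
        canberra  = as.numeric(d_canberra)))
```

Note how the Canberra result differs from the Manhattan one, while the Minkowski distance reproduces Manhattan and Euclidean for p = 1 and p = 2.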
We’re going to be working with the Jaccard distance in this tutorial, but the same steps work just as well for the other distance measures.
Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function.
It contains the data frame values, which holds data on which European right wing parties agree with which right wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros: one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.
Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is a very important distinction, since some distance functions calculate over rows by default and others over columns.
The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():
#dist() calculates over rows, so we'll use t(values)
jacc <- dist(t(values), method = "binary")
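If you don’t have the dataset at hand, the following sketch uses a tiny made-up stand-in (the data frame toy, its party names and its entries are all invented). It also computes one Jaccard distance by hand with a hypothetical helper, jaccard_dist(), to show what method = "binary" measures: the share of positions where exactly one party says yes, among the positions where at least one does.

```r
# Made-up stand-in for the tutorial's data:
# rows are policies, columns are parties, entries are 0/1
toy <- data.frame(
  party_a = c(1, 1, 0, 1, 0),
  party_b = c(1, 0, 0, 1, 1),
  party_c = c(0, 0, 1, 0, 0)
)

# Hypothetical helper: Jaccard distance between two 0/1 vectors
jaccard_dist <- function(u, v) {
  sum(xor(u, v)) / sum(u | v)
}

# party_a and party_b disagree on 2 of the 4 policies at least one supports
by_hand <- jaccard_dist(toy$party_a, toy$party_b)

# dist() compares rows, so we transpose to compare parties (columns)
toy_jacc <- dist(t(toy), method = "binary")
print(toy_jacc)
```

The entry for party_a and party_b in toy_jacc should match by_hand, while party_c shares no policy with the others and ends up at the maximum distance of one.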
Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape. That’s because we’re calculating the distance of every party to every other party, so the full matrix would be symmetric and its upper half redundant. For our visualization, though, we want the complete matrix. So to prepare for visualization, we’ll have to do two things:
- Add the diagonal and the upper triangle to make a complete rectangle shape.
- Convert back from a dist object to a data frame so ggplot can work with the data.
Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? For the Jaccard measure, the similarity is simply one minus the distance. So once we’ve converted back to a data frame, we can use 1 - jacc to get the Jaccard similarities from the Jaccard distances the dist() function returns.
#we want values for diagonal and upper as well, so:
jacc <- dist(t(values), method = "binary",
diag = TRUE, upper = TRUE)
#ggplot needs a dataframe:
jacc <- as.data.frame(as.matrix(jacc))
#we want the jaccard similarity, not the distance:
jaccsim <- 1 - jacc
If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.
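If you’d rather check this in code than by eye, here’s a sketch of a few sanity checks, again on made-up data (toy is an invented stand-in for values). It also illustrates a detail worth knowing: as far as the conversion is concerned, diag and upper only change how a dist object is printed, so as.matrix() returns the full square matrix either way.

```r
# Made-up stand-in for the tutorial's data (parties in columns)
toy <- data.frame(
  party_a = c(1, 1, 0, 1, 0),
  party_b = c(1, 0, 0, 1, 1),
  party_c = c(0, 0, 1, 0, 0)
)

d_default <- dist(t(toy), method = "binary")
d_full    <- dist(t(toy), method = "binary", diag = TRUE, upper = TRUE)

# Both conversions give the same full, square matrix
same_matrix <- isTRUE(all.equal(as.matrix(d_default), as.matrix(d_full)))

# Convert to a data frame and flip distances into similarities
toy_sim <- 1 - as.data.frame(as.matrix(d_full))

# Sanity checks: symmetric, ones on the diagonal, values within [0, 1]
stopifnot(
  same_matrix,
  isSymmetric(as.matrix(toy_sim)),
  all(diag(as.matrix(toy_sim)) == 1),
  all(toy_sim >= 0 & toy_sim <= 1)
)
```

If one of the checks fails on your own data, stopifnot() will stop with a message naming the condition that didn’t hold.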
From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.
Melting the data
If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the specific tidy format ggplot works with. To melt our data frame into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a column containing the party names currently stored as row names, so the melting function will know what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data.
#add a column with names to melt on
jaccsim$names <- rownames(jaccsim)
#melt the data frame to make it tidy
jacc.m <- reshape2::melt(jaccsim, id.vars = "names")
Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).
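If it’s hard to picture what melting does, here’s a minimal sketch on a made-up two-party similarity table (sim_toy and its values are invented for illustration, and the reshape2 package needs to be installed). Each cell of the square table becomes one row of the long data frame.

```r
# Made-up similarity values for two hypothetical parties
sim_toy <- data.frame(party_a = c(1, 0.5),
                      party_b = c(0.5, 1),
                      row.names = c("party_a", "party_b"))

# Add the id column, then melt: one row per (names, variable) pair
sim_toy$names <- rownames(sim_toy)
long_toy <- reshape2::melt(sim_toy, id.vars = "names")

print(long_toy)
```

The result has the three columns names, variable and value, and four rows, one for each combination of the two parties.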
The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:
#make sure the parties are in correct order in the plot
#convert to factor
jacc.m$names <- factor(jacc.m$names, rownames(jaccsim))
jacc.m$variable <- factor(jacc.m$variable, rev(rownames(jaccsim)))
#sort the data frame
jacc.m <- plyr::arrange(jacc.m, variable, plyr::desc(names))
The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.
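Here’s a quick sketch of how the levels argument controls the order, using made-up party names. Without explicit levels, factor() sorts them alphabetically, which is usually not the order we want on the axes.

```r
# Made-up party names in the order we'd like to see them
parties <- c("party_b", "party_c", "party_a")

f_default  <- factor(parties)                        # levels sorted alphabetically
f_custom   <- factor(parties, levels = parties)      # levels in the given order
f_reversed <- factor(parties, levels = rev(parties)) # same order, reversed

print(levels(f_default))
print(levels(f_custom))
print(levels(f_reversed))
```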
Visualization: Tile plot
There are lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.
First, we’re going to build the basic plot structure:
#load ggplot2 for plotting
library(ggplot2)
sim <- ggplot(jacc.m, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient(low = "#b7f7ff", high = "#0092a3")
Remember to load the ggplot2 package with library(ggplot2) before you start plotting. We’re going to specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: A set of tiles. They’re going to be filled according to the Jaccard similarities stored in the column value (aes(fill=value)). The colour = "white" argument draws a thin white border around each tile, and scale_fill_gradient() maps the similarity values to a gradient of blues. Try different color schemes if you like.
With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:
Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tiles comparing each party to itself. The lighter the color, the lower the similarity between the parties.
But this plot doesn’t look as pretty as we’d like it to yet. The labels are too small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case and the legend doesn’t look as nice as it could.
Thankfully, ggplot2 lets us edit all of that. Add these settings to your plot with the + operator and see what they do:
base_size <- 20
sim + theme_light(base_size = base_size) +
  labs(x = "", y = "") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  #remove the legend title
  guides(fill = guide_legend(title = NULL)) +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = base_size * 0.8,
                                   angle = 330, hjust = 0),
        axis.text.y = element_text(size = base_size * 0.8))
theme_light() is a standard theme with a clean look to it that fits our needs for this plot. The base_size argument sets the base font size, in points, from which the sizes of the text elements in our plot are derived. The default in current ggplot2 versions is 11, but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.
The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:
That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title or the rotation angle of the x axis text.
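If you’d like to experiment without the original dataset, here’s a compact end-to-end sketch on made-up data. Everything about toy (party names, entries) is invented, and to keep the sketch free of extra packages it swaps in a base R reshape via as.table() instead of the reshape2::melt() call used above, so treat it as one possible variant rather than the tutorial’s exact code.

```r
library(ggplot2)

# Made-up stand-in for the tutorial's data (parties in columns)
toy <- data.frame(
  party_a = c(1, 1, 0, 1, 0),
  party_b = c(1, 0, 0, 1, 1),
  party_c = c(0, 0, 1, 0, 0)
)

# Jaccard similarities between parties as a full square matrix
toy_sim <- 1 - as.matrix(dist(t(toy), method = "binary"))

# Base R reshape to long format: one row per pair of parties
long <- as.data.frame(as.table(toy_sim), responseName = "value")
names(long)[1:2] <- c("names", "variable")

# Reverse the y axis order so the diagonal runs from top left to bottom right
long$variable <- factor(long$variable, levels = rev(rownames(toy_sim)))

p <- ggplot(long, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient(low = "#b7f7ff", high = "#0092a3", limits = c(0, 1)) +
  theme_light(base_size = 14) +
  labs(x = "", y = "")

print(p)
```

Swapping the two hex codes in scale_fill_gradient() or changing base_size is a quick way to see how each setting affects the result.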
Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there are lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on Slack if you want to keep up with all of the hot Journocode gossip. Have fun!