In part one of this tutorial, you learned what distance and similarity mean for data and how to measure them. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.
Implementation in R: the dist() function
The simplest way to compute distance measures in R is the dist() function. It works with matrices as well as data frames and has options for many of the measures we got to know in part one.
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
The crucial argument here is method. It has six options — actually more like four and a half, but you’ll see:
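As a minimal sketch of how the method argument behaves, the snippet below runs dist() once for each of the six values listed in its help page ("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"). The toy matrix x and the choice p = 3 are assumptions made up for illustration, not data from this tutorial.

```r
# Three made-up points in the plane; the row names label the result.
x <- matrix(c(0, 0,
              3, 4,
              6, 8),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("x", "y")))

# The six method values documented for dist().
methods <- c("euclidean", "maximum", "manhattan",
             "canberra", "binary", "minkowski")

# One distance object per method. The argument p only matters
# for "minkowski"; the other methods ignore it.
distances <- lapply(methods, function(m) dist(x, method = m, p = 3))
names(distances) <- methods

# as.matrix() turns the compact dist object into a full square matrix,
# which is easier to index by row and column name.
d_euclid    <- as.matrix(distances$euclidean)
d_manhattan <- as.matrix(distances$manhattan)
d_maximum   <- as.matrix(distances$maximum)

print(distances$euclidean)
```

By default, dist() prints only the lower triangle, because the distance from a to b equals the distance from b to a. Setting diag = TRUE or upper = TRUE changes what is printed, not what is computed.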
In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.
A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.
- Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
- Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.
As you know by now, R is all about functions. If there isn’t one for the exact thing you want to do, you can even write your own! Writing your own functions is a very useful way to automate your work. Once defined, a new function is easy to call as often as you need. It’s a good habit to get into when programming with R, and with lots of other languages as well.
Defining a function uses another function simply called function(). Function names follow pretty much the same rules as variable names, so you can call them anything that would also be acceptable as a variable name.
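Here is a small sketch of what such a definition can look like. The function name euclidean_distance and its arguments a and b are made up for this example; any valid variable name would work just as well.

```r
# A hypothetical helper: the straight-line distance between two
# numeric vectors of the same length.
euclidean_distance <- function(a, b) {
  # Stop early with an error if the inputs cannot be compared.
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  # The value of the last expression is what the function returns.
  sqrt(sum((a - b)^2))
}

# Once defined, the function can be called like any built-in one.
result <- euclidean_distance(c(0, 0), c(3, 4))
print(result)
```

The name on the left of the assignment is how you call the function later, and the arguments inside function() are the inputs it expects.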
In this crash course section, we’ll talk about importing all sorts of data into R and installing fancy new packages. We’ll also learn our way around the workspace.
Your workspace in R is like the desk you work at. It’s where all the data, defined variables and other objects you’re currently working with are stored. Like with a desk, you might want to clean it every once in a while and throw out stuff you don’t need anymore. There are a few useful commands to help you do that. Take a look and try them out:
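The following is a sketch of base R commands that are commonly used for this kind of housekeeping; it is not necessarily the exact list from the original post. The object names my_number and my_text are made up for the example.

```r
# Put two example objects on the "desk".
my_number <- 42
my_text <- "hello"

# ls() lists the names of the objects currently in the workspace.
objects_before <- ls()
print(objects_before)

# rm() removes the objects you name.
rm(my_number)

# exists() checks whether an object with that name is still around.
number_still_there <- exists("my_number", inherits = FALSE)
text_still_there <- exists("my_text", inherits = FALSE)

# getwd() shows the working directory, the folder R reads from
# and writes to by default.
print(getwd())

# To clear the whole workspace at once, you could run the line below.
# It is commented out here because it deletes everything.
# rm(list = ls())
```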
This is exciting! We’re very happy to announce that our website — this website — is finally up and running. Here you will find documentation and tutorials from all our meetings as well as info on our projects and upcoming events.
We are journocode, a group of journalists and computer scientists from Dortmund, Germany. We’ve been meeting since October to code for journalism. We’re especially invested in data-driven journalism and want to teach ourselves and everyone who is interested the skills it takes to tell stories with data. This includes, but is not limited to: