# Similarity and distance in data: Part 1

1 Comment Part 2

In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.

Before we get to know some of them, though, let’s think about what we’d expect such a measure to do. It’s easily done: If two objects are similar, the measure should be high (maximum for two perfectly similar objects). If they’re dissmilar, the value of the similarity measure should be low, so it should either converge to zero or to a negative number. We can, of course, set other expectations, but this is the bare minimum any measure of similarity should satisfy.

The more distant, the less similar

Because of these properties, similarity measures are often obtained by simply using the inverse of a distance metric. The intuition behind this is that the futher apart two objects are, the more dissimilar they are and the bigger the “distance” between them is. The more similar the objects are, the closer they are and the smaller the distance between them is. This is why in this tutorial, we’ll take a look at different ways to measure the elusive conept of a “distance” between two points of data.

Distance measures should have a few specific properties. They might sound a little math-y, but we’ll concentrate on the relatively straightforward concepts behind them: $d(x,y) \geq 0$
The distance of two objects x and y can’t be less than zero. $d(x,y) = 0 \iff x = y$
Two perfectly similar objects have distance zero. $d(x,y) = d(y,x)$
The distance between x and y is the same as between y and x — it doesn’t matter which way you go. $d(x,z) \leq d(x,y) + d(x,z)$
If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. This is called the triangle inequality.

Now that we got that out of the way, let’s look at a few distance measures. Again, if it sounds too mathematical, just take a deep breath and focus on the concepts. Or just skip the math altogether and look at how to implement and visualize distance measures in R, which we’ll focus on in the second part of this tutorial.

Euclidean or Non-Euclidean?

There’s two major classes of distance measures we can distinguish: Euclidean ones and Non-Euclidean ones. You should choose the appropriate one according to wether or not your data can be represented as points in a Euclidean space. A Euclidean space is any space that has some real-valued number of dimensions where points can be located. Your common two-dimensional or three-dimensional coordinate systems are examples for such spaces.

The important thing is that it has to be possible to define an average over the data points for it to be a Euclidean space. So if you’re working with vectors that have real-valued components you can compute an average over, then voilà, you’re working in a Euclidean space.

We’re going to look more closely at a few distance measures, Euclidean ones as well as Non-Euclidean ones:

Euclidean distance

This is pretty much the most common distance measure. It’s so common, in fact, that it’s often called the Euclidean distance, even though there’s many Euclidean distance measures, as we just learned. It’s defined as $\sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2}$

This Euclidean distance adds up all the squared distances between corresponding data points and takes the square root of the result. Remember the Pythagorean theorem? If you look closely, the Euclidean distance is just that theorem solved for the hypothenuse — which is, in this case, the distance between x and y. The Euclidean distance is pretty solid: It’s bigger for larger distances, and smaller for closer data points. It can get arbitrarily large and is only zero if the data points are all exactly the same. That’s fine though. If you take a look at the requirements we set for a distance function, that’s exactly what we want.

Manhattan distance $\sum\limits_{i=1}^n |x_i - y_i|$

Also known as city block distance, Canberra distance, taxicab metric or snake distance, this is definitely the distance measure with the coolest name(s). Incidentally, they’re also pretty decriptive: The Manhattan distance is the shortest distance a car would have to drive in a city block structure to get from x to y. Since it takes the absolute distances in each dimension before we sum them up, the Manhattan distance will always be bigger or equal to the Euclidean distance, which we can imagine as the linear distance between the two points.

Maximum distance

The maximum distance looks at the distance of two points in each dimension and selects the biggest one. This one is pretty straightforward, but we can express it as a fancy formula anyway: $\max_{i}(|x_i - y_i|)$

L-Norm / Minkoswki distance

The L-Norm is the generalized version of the aforementioned distance measures. It is defined as $(\sum\limits_{i=1}^n |x_i - y_i|^p)^{\frac{1}{p}}$

If $p$ is equal to 2, we get the Euclidean distance, which is why it’s also called the L2-Norm. $p = 1$ returns the Manhattan distance or L1-Norm and $p = \infty$ equals the maximum.

To sum up Euclidean distance measures, let’s take a look at how they work in a simple two-dimensional space. The maximum distance is equal to the biggest distance in any dimension. In this case, that’s the difference betwenn the x values of points p and q, which is 8. The Manhattan distance sums up the distances in each dimension, so it’s 8 + 3 = 11 in this case. What would the Euclidean distance, symbolizes by the orange line, be? Visualized like this, it’s pretty obvious how we can use the Pythagorean formula to get the result: $d_E(p,q)^2 = | p_x - q_x | ^2 + | p_y - q_y |^2 = 8^2 + 3^2\iff$ $d(p,q) = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8,5$

Amazing what can be done with a little trigonometry, right? Take a deep breath, because there’s more! Let’s look at some Non-Euclidean distance measures to make sure we can satisfy all our similarity measuring needs.

Cosine distance and similarity

The Cosine distance is defined by the angle between two vectors. As we know from basic linear algebra, the dot product of two vectors is defined by $x \cdot y = \|x\| \|y\| \cos{\theta}$

where $\theta$ is the angle between the two vectors. the smaller the angle is, the closer to 1 the cosine of the angle is, and the bigger the angle, the closer it is to -1. If you take a look at what we expected from a similarity measure, then the cosine meets our demands rather well. After all, if the angle between two vectors is very small, that means they’re very close together, and therefore more similar. So we’ll just solve the above equation for the cosine and define the cosine similariy to be equal to $\cos{\theta} = \frac{x \cdot y}{\|x\| \|y\|}$

If we need to construct a distance measure from here, we can just take the inverse, as we learned before. So the cosine distance is defined as $1 - \cos{\theta}$

Since we’re talking about vectors, it might be easy to assume this is also a Euclidean distance measure — and that may be right. If the vectors in question are actual real values, the cosine distance is Euclidean. But if the vectors have to be, say, integer components, we can’t compute an average over the points or we might get a non-integer result. Also, the cosine distance as such doesn’t satisfy the triangle inequality unless we alter it a bit. The cosine similarity, though, is a nice and efficient way to determine similarity in all kinds of multi-dimensional, numeric data.

Jaccard distance and similarity

Like with the cosine distance and similarity, the Jaccard distance is defines by one minus the Jaccard similarity. The Jaccard similarity uses a different approach to similarity than the measures we’ve seen so far. To compare two objects, it looks at the elements they have in common (the intersection) and divides it by the number of elements the two objects have in total (the union). Written out as a formula, that definition looks like this $\frac{X \cap Y}{X \cup Y}$ $\cap$ is the mathematical sign for intersection, $\cup$ means union. With this definition, the similarity is only equal to one if all elements are the same and only becomes zero if all elements are different. Perfect for a similarity measure, but the wrong way around for a distance measure. This is easily solved by defining the Jaccard distance to be $1 - \frac{X \cap Y}{X \cup Y}$

As an example, let’s compare the two sentences “Yesterday, the warm weather was perfect for my cat” and “My cat liked the warm weather yesterday”. Let’s call them X and Y. We could, of course, have used numbers or a mix of both as well, the Jaccard similarity doesn’t care.

The sentences have 6 words in common and 10 unique words in total. So the Jaccard similarity between them is 6/10 = 0.6 = 60 %. Their Jaccard distance is 1 – 0.6 = 0.4 = 40%.

A nice way to represent objects you want to compute the Jaccard similarity of is in the form of a Boolean matrix, a matrix with only ones and zeroes. The columns of our matrix symbolize the objects we want to find the similarity of and our rows are the unique elements of both objects — in this case, the words. One means the word is present in the object, zero means it isn’t. To compute the Jaccard similarity over the columns, all we have to do is count the rows where both objects have ones (6) and divide it by the total number of rows (10).

 words X Y yesterday 1 1 the 1 1 warm 1 1 weather 1 1 was 1 0 perfect 1 0 for 1 0 my 1 1 cat 1 1 liked 0 1

We don’t have to stop at single sentences, though. The Jaccard similarity is an efficient way to compute similarity over entire documents — a lot of documents if necessary. Our corresponding Boolean matrix will get very big, of course, but since the formula is relatively simple, it scales rather well to large datasets.

Edit distance

Lastly, let’s think about how to measure the similarity of two character strings. One way to do that is the edit distance. The edit distance is simply the minimum number of inserts and deletes needed to get from one string to the other.

Let’s say we have the words “knock” and “flocks”. To get from one to the other, we have to delete one letter (k) and insert three (f,l,s):

knock → _nock → lock → flock → flocks

So the edit distance betweeen them is four. The edit distance is  a proper distance measure since it satisfies all four requirements we set at the beginning of this lesson.

• The distance of two objects x and y can’t be less than zero. There’s no way to do a negative number of edits, so that’s true.
• Two perfectly similar ojects have distance zero. We don’t need any edits to transform a word into itself.
• The distance between x and y is the same as between y and x. Every insert into one word is equal to a delete from the other, so the paths you take are always inverse and have the same number of steps.
• If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. Changing from word x to word y before you change to z is one way to go from x to z. The direct way might be shorter, but it can never be longer than the detour.

Congratulations! You made it through all of the math and learned a lot about some ways to measure distance and similarity in your data. In Part two of this lesson, we’re going to leave the theory behind us. We’ll take a look at how to actually compute these distance measures in R and think about how to visualize similarity in data.

Part 2

### Comment ( 1 )

1. Similarity and distance in data: Part 2 | Journocode
[…] Part 1 | Code […]

This site uses Akismet to reduce spam. Learn how your comment data is processed.