## Extracting geodata from OpenStreetMap with Osmfilter

A guest post by Hans Hack

When working on map-related projects, I often need specific geographical data from OpenStreetMap (OSM) for a certain area. For a recent project of mine, I needed all the roads in Germany in a useful format so that I could work with them in a GIS program. So how do I get that data? With a useful little program called Osmfilter.

There are various sites that provide OSM datasets for certain areas. However, these datasets include ALL the geodata OSM provides. That means houses, streets, rivers etc. – basically everything you see on a normal map. In addition, the file comes in a rather inconvenient format for direct use. A quite complicated way to proceed would be to load the dataset into a database and query what you need. This can be time-consuming and – depending on the power of your PC – impossible.

If you only need to get info about small areas, I recommend using the site overpass turbo. For larger datasets, there is Osmfilter, which easily lets you filter OSM geodata. With a little help from two other free tools, you will have a dataset you can work with in no time.

## Get an OSM Dataset

To start out, we need to download an OSM dataset, which comes in a format called pbf (a compressed format for large datasets). For this tutorial, I will use a dataset provided by Geofabrik, but there are other sources, too. Let’s download the pbf file for Liechtenstein and save it in a folder of your choice. Once you have downloaded the data, open the shell and navigate to the folder where your new dataset is stored with the cd command:
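```
# adjust the path to wherever you saved the download
cd path/to/your/folder
```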

Osmfilter only supports the file formats osm and o5m. For fast data processing, using o5m is recommended. You can convert your pbf file to o5m with osmconvert in your shell like this:
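```
# convert the pbf file to the o5m format; -o names the output file
osmconvert liechtenstein-latest.osm.pbf -o=liechtenstein.o5m
```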

This translates to: use the program osmconvert to convert the file called liechtenstein-latest.osm.pbf to an o5m file called liechtenstein.o5m. The -o stands for output.

You will now have the same dataset in the o5m format in the same folder.

Now, you can filter your geodata using the shell that should still be open. The osmfilter command logic is built up as follows:
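```
# general pattern (placeholders in caps)
osmfilter INPUT_FILE --FILTER_OPTIONS -o=OUTPUT_FILE
```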

Let’s look at the part about the filter commands. This is where you can tell the program which parts of the dataset you need by writing --keep=DATA_I_WANT. Here is an example which creates a file called buildings.osm that contains all the buildings (and only the buildings) from the Liechtenstein dataset. The output file names in this and the following examples are just suggestions:
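```
# an empty value after "building=" keeps objects with any value of that tag
osmfilter liechtenstein.o5m --keep="building=" -o=buildings.osm
```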

To find out how features are stored and classified in OSM, you can go to this site and look up the feature you want. Tip: You can head over to overpass turbo to test your query on a small area of your choice by using the ‘wizard’.

Of course you can do much more with Osmfilter. You can specify which building type you want. For example, you might only want to look at schools:
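```
osmfilter liechtenstein.o5m --keep="building=school" -o=schools.osm
```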

You can query multiple features by chaining them. For example, if you want all the schools and universities:
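```
# additional values for the same key are chained with a space and a leading =
osmfilter liechtenstein.o5m --keep="building=school =university" -o=schools_universities.osm
```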

You can exclude things by adding the flag --drop. For example, if you don’t want to have buildings that are warehouses but keep everything else:
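```
# keep all buildings, then drop the warehouses among them
osmfilter liechtenstein.o5m --keep="building=" --drop="building=warehouse" -o=buildings_no_warehouses.osm
```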

You can reduce the final file size by dropping extra data on the authors and the version by adding:
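```
# --drop-author and --drop-version remove metadata you usually don't need
osmfilter liechtenstein.o5m --keep="building=" --drop-author --drop-version -o=buildings_small.osm
```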

You can, of course, combine these flags. Here is a query that gives you the highway types that cars can use in Liechtenstein, without the ones where motor vehicles are not allowed. The exact list of highway values below is one reasonable choice; adjust it to your needs:
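```
osmfilter liechtenstein.o5m \
  --keep="highway=motorway =trunk =primary =secondary =tertiary =unclassified =residential =service" \
  --drop="motor_vehicle=no" \
  -o=streets_liechtenstein.osm
```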

You can find more on the filtering options, with some examples, on the Osmfilter site.

## Convert to a useful format

As a final step, you can convert your osm file to the most widely supported geodata format, the Shapefile. (The GIS program QGIS can handle osm files directly, but it sometimes doesn’t work well with large datasets.) You can do the conversion with the program ogr2ogr like this:
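```
# -f selects the output format; the target folder is created automatically
ogr2ogr -f "ESRI Shapefile" streets_shapefiles streets_liechtenstein.osm
```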

The above command converts the file streets_liechtenstein.osm to the Shapefile format and stores the result in a folder called streets_shapefiles. In the newly created folder, you will find 4 different shapefiles (one for every geometry type). In the case of the streets, we are only interested in the file lines.shp. You can open this file in a GIS program of your choice.

Ogr2ogr also allows you to convert your newly created Shapefiles to other geodata formats that you may need, such as GeoJSON, CSV and many more. Have a look at the ogr2ogr website for more info. If you’re tired of using the shell, try the online tool Mapshaper, which allows you to convert your Shapefile to formats such as GeoJSON, SVG and CSV. Mapshaper’s file size is limited, but I have tried it with files bigger than 1 GB.

Have fun filtering OSM and happy mapping!

## Similarity and distance in data: Part 2


In part one of this tutorial, you learned what distance and similarity mean for data and how to measure them. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.

Implementation in R: the dist() function

The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.
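Called with its default arguments, it looks like this:

```
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
```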

The crucial argument here is method. It has six options — actually more like four and a half, but you’ll see:

• “euclidean”: The Euclidean distance.
• “maximum”: The maximum distance.
• “manhattan”: The Manhattan or city block distance.
• “canberra”: A weighted version of the Manhattan distance, where each absolute difference is divided by the sum of the absolute values of the components.
• “binary”: The Jaccard distance.
• “minkowski”: Also called the L-norm, the generalized version of the Euclidean and Manhattan distances. Returns the Manhattan distance for p = 1 and the Euclidean distance for p = 2.

We’re going to be working with the Jaccard distance in this tutorial, but everything here works just as well for the other distance measures.

Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function. Assuming the file sits in your working directory (the file name below is a stand-in for whatever the download is called):
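```
# loads a data frame called "values" into your workspace
load("right_wing_parties.Rdata")
```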

It contains the data frame values, which holds data on which European right-wing parties agree with which right-wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros — one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.

Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is an important distinction, since some distance functions calculate over rows by default, some over columns.

The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():
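```
# transpose so the parties become rows, then compute the Jaccard distance
# ("binary" is dist()'s name for it)
jacc <- dist(t(values), method = "binary")
```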

Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape. That’s because we’re calculating the distance of every party to every other party, so the full matrix would be symmetric and half of it redundant. Since we want to visualize our results, though, we need the complete rectangle. So to prepare for visualization, we’ll have to do two things:

• Add the diagonal and the upper triangle to make a complete rectangle shape.
• Convert back from a dist object to a data frame so ggplot can work with the data.

Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? Once we’ve converted back to a data frame, we can simply use 1 - jacc  to get the Jaccard similarities from the Jaccard distances the dist() function returns.
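Put together, those steps could look like this:

```
# expand the triangle to a full symmetric matrix, convert to a data frame,
# then turn distances into similarities
jacc <- as.data.frame(as.matrix(jacc))
jaccsim <- 1 - jacc
```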

If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.

From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.

Melting the data

If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the specific tidy format ggplot works with. To melt our data frame into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a column containing the party names currently stored as row names, so the melting function will know what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data:
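```
# add the row names as a proper column, then melt into long format;
# the name of the molten data frame is up to you
jaccsim$names <- rownames(jaccsim)
jaccsim_molten <- reshape2::melt(jaccsim, id.vars = "names")
```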

Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).

The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:
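```
# fix the level order; reversing it on one axis makes the diagonal
# run from the top left to the bottom right of the plot
party_order <- rownames(jaccsim)
jaccsim_molten$names <- factor(jaccsim_molten$names, party_order)
jaccsim_molten$variable <- factor(jaccsim_molten$variable, rev(party_order))
```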

The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.

Visualization: Tile plot

There’s lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.

First, we’re going to build the basic plot structure:
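```
library(ggplot2)

# one tile per pair of parties, filled according to their similarity
sim <- ggplot(jaccsim_molten, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
```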

Remember to load the ggplot2 package before you start plotting. We specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: a set of tiles. They’re filled according to the Jaccard similarities stored in the column value (aes(fill = value)), their border color is defined to be white, and we create a gradient of blues with scale_fill_gradient(). Try different color schemes if you like.

With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:

Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tiles comparing one party to itself. The lighter the color, the lower the similarity between the parties.

But this plot doesn’t look as pretty as we’d like it to yet. The labels are too small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case, and the legend doesn’t look as nice as it could.

Thankfully, ggplot2 lets us edit all of that. Add these settings to your plot with the + operator and see what they do (the exact text size and rotation angle below are just suggestions):
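```
sim + theme_light(base_size = 16) +            # clean theme, bigger text
  labs(x = "", y = "") +                       # no axis labels
  scale_x_discrete(expand = c(0, 0)) +         # no space between axes and tiles
  scale_y_discrete(expand = c(0, 0)) +
  guides(fill = guide_legend(title = NULL)) +  # no legend title
  theme(axis.ticks = element_blank(),          # no axis ticks
        axis.text.x = element_text(angle = 45, hjust = 1))  # rotated x labels
```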

theme_light() is a standard theme with a clean look that fits our needs for this plot. The base_size argument lets us modify the text size of every text element in our plot. The default is 12 points, but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.

The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:

That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title or the rotation angle of the x axis text.

Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there’s lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on slack if you want to keep up with all of the hot Journocode gossip. Have fun!

## Similarity and distance in data: Part 1



In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.

Before we get to know some of them, though, let’s think about what we’d expect such a measure to do. It’s easily done: If two objects are similar, the measure should be high (maximum for two perfectly similar objects). If they’re dissimilar, the value of the similarity measure should be low, either close to zero or negative. We can, of course, set other expectations, but this is the bare minimum any measure of similarity should satisfy.

The more distant, the less similar

Because of these properties, similarity measures are often obtained by simply using the inverse of a distance metric. The intuition behind this is that the further apart two objects are, the more dissimilar they are and the bigger the “distance” between them is. The more similar the objects are, the closer they are and the smaller the distance between them is. This is why, in this tutorial, we’ll take a look at different ways to measure the elusive concept of a “distance” between two points of data.

Distance measures should have a few specific properties. They might sound a little math-y, but we’ll concentrate on the relatively straightforward concepts behind them:

$d(x,y) \geq 0$
The distance of two objects x and y can’t be less than zero.

$d(x,y) = 0 \iff x = y$
Two perfectly similar objects have distance zero.

$d(x,y) = d(y,x)$
The distance between x and y is the same as between y and x — it doesn’t matter which way you go.

$d(x,z) \leq d(x,y) + d(y,z)$
If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. This is called the triangle inequality.

Now that we got that out of the way, let’s look at a few distance measures. Again, if it sounds too mathematical, just take a deep breath and focus on the concepts. Or just skip the math altogether and look at how to implement and visualize distance measures in R, which we’ll focus on in the second part of this tutorial.

Euclidean or Non-Euclidean?

There’s two major classes of distance measures we can distinguish: Euclidean ones and Non-Euclidean ones. You should choose the appropriate one according to whether or not your data can be represented as points in a Euclidean space. A Euclidean space is any space that has some real-valued number of dimensions where points can be located. Your common two-dimensional or three-dimensional coordinate systems are examples of such spaces.

The important thing is that it has to be possible to define an average over the data points for it to be a Euclidean space. So if you’re working with vectors that have real-valued components you can compute an average over, then voilà, you’re working in a Euclidean space.

We’re going to look more closely at a few distance measures, Euclidean ones as well as Non-Euclidean ones:

Euclidean distance

This is pretty much the most common distance measure. It’s so common, in fact, that it’s often called the Euclidean distance, even though there’s many Euclidean distance measures, as we just learned. It’s defined as

$\sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2}$

This Euclidean distance adds up the squared differences between corresponding components and takes the square root of the result. Remember the Pythagorean theorem? If you look closely, the Euclidean distance is just that theorem solved for the hypotenuse — which is, in this case, the distance between x and y. The Euclidean distance is pretty solid: It’s bigger for larger distances, and smaller for closer data points. It can get arbitrarily large and is only zero if the two points are exactly the same. That’s fine, though. If you take a look at the requirements we set for a distance function, that’s exactly what we want.

Manhattan distance

$\sum\limits_{i=1}^n |x_i - y_i|$

Also known as city block distance, taxicab metric or snake distance, this is definitely the distance measure with the coolest names. Incidentally, they’re also pretty descriptive: The Manhattan distance is the shortest distance a car would have to drive in a city block structure to get from x to y. Since it takes the absolute distances in each dimension before we sum them up, the Manhattan distance will always be bigger than or equal to the Euclidean distance, which we can imagine as the straight-line distance between the two points.

Maximum distance

The maximum distance looks at the distance of two points in each dimension and selects the biggest one. This one is pretty straightforward, but we can express it as a fancy formula anyway:

$\max_{i}(|x_i - y_i|)$

L-Norm / Minkowski distance

The L-Norm is the generalized version of the aforementioned distance measures. It is defined as

$(\sum\limits_{i=1}^n |x_i - y_i|^p)^{\frac{1}{p}}$

If $p$ is equal to 2, we get the Euclidean distance, which is why it’s also called the L2-norm. $p = 1$ returns the Manhattan distance or L1-norm, and $p = \infty$ gives the maximum distance.

To sum up the Euclidean distance measures, let’s take a look at how they work in a simple two-dimensional space. The maximum distance is equal to the biggest distance in any dimension. In this case, that’s the difference between the x values of points p and q, which is 8. The Manhattan distance sums up the distances in each dimension, so it’s 8 + 3 = 11 in this case.

What would the Euclidean distance, symbolized by the orange line, be? Visualized like this, it’s pretty obvious how we can use the Pythagorean formula to get the result:

$d_E(p,q)^2 = |p_x - q_x|^2 + |p_y - q_y|^2 = 8^2 + 3^2$

$d_E(p,q) = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8.5$
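If you like, you can already check these numbers with R’s dist() function, which we’ll get to know properly in part two, using two points with the same coordinate differences as p and q:

```
pq <- rbind(c(0, 0), c(8, 3))   # two points differing by 8 in x and 3 in y

dist(pq, method = "maximum")    # 8
dist(pq, method = "manhattan")  # 11
dist(pq, method = "euclidean")  # sqrt(73), about 8.54
```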

Amazing what can be done with a little trigonometry, right? Take a deep breath, because there’s more! Let’s look at some Non-Euclidean distance measures to make sure we can satisfy all our similarity measuring needs.

Cosine distance and similarity

The Cosine distance is defined by the angle between two vectors. As we know from basic linear algebra, the dot product of two vectors is defined by

$x \cdot y = \|x\| \|y\| \cos{\theta}$

where $\theta$ is the angle between the two vectors. The smaller the angle is, the closer the cosine is to 1; the bigger the angle, the closer it is to -1. If you take a look at what we expected from a similarity measure, the cosine meets our demands rather well. After all, if the angle between two vectors is very small, that means they’re very close together, and therefore more similar. So we’ll just solve the above equation for the cosine and define the cosine similarity to be equal to

$\cos{\theta} = \frac{x \cdot y}{\|x\| \|y\|}$

If we need to construct a distance measure from here, we can just take the inverse, as we learned before. So the cosine distance is defined as

$1 - \cos{\theta}$

Since we’re talking about vectors, it might be easy to assume this is also a Euclidean distance measure — and that may be right. If the vectors have real-valued components, the cosine distance operates in a Euclidean space. But if the vectors are restricted to, say, integer components, we can’t compute an average over the points without possibly getting a non-integer result. Also, the cosine distance as such doesn’t satisfy the triangle inequality unless we alter it a bit. The cosine similarity, though, is a nice and efficient way to determine similarity in all kinds of multi-dimensional, numeric data.
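Here is a minimal sketch of both measures in R; the function names are just suggestions:

```
# dot product divided by the product of the vector lengths
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
cosine_distance <- function(x, y) 1 - cosine_similarity(x, y)

cosine_similarity(c(1, 2, 3), c(2, 4, 6))  # parallel vectors: 1
cosine_similarity(c(1, 0), c(0, 1))        # orthogonal vectors: 0
```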

Jaccard distance and similarity

As with the cosine measures, the Jaccard distance is defined as one minus the Jaccard similarity. The Jaccard similarity uses a different approach to similarity than the measures we’ve seen so far. To compare two objects, it counts the elements they have in common (the intersection) and divides that by the number of elements the two objects have in total (the union). Written out as a formula, that definition looks like this

$\frac{|X \cap Y|}{|X \cup Y|}$

$\cap$ is the mathematical sign for intersection, $\cup$ means union. With this definition, the similarity is only equal to one if all elements are the same and only becomes zero if all elements are different. Perfect for a similarity measure, but the wrong way around for a distance measure. This is easily solved by defining the Jaccard distance to be

$1 - \frac{|X \cap Y|}{|X \cup Y|}$

As an example, let’s compare the two sentences “Yesterday, the warm weather was perfect for my cat” and “My cat liked the warm weather yesterday”. Let’s call them X and Y. We could, of course, have compared numbers or a mix of words and numbers as well; the Jaccard similarity doesn’t care.

The sentences have 6 words in common and 10 unique words in total. So the Jaccard similarity between them is 6/10 = 0.6 = 60%. Their Jaccard distance is 1 - 0.6 = 0.4 = 40%.

A nice way to represent objects you want to compute the Jaccard similarity of is in the form of a Boolean matrix, a matrix with only ones and zeroes. The columns of our matrix symbolize the objects we want to find the similarity of and our rows are the unique elements of both objects — in this case, the words. One means the word is present in the object, zero means it isn’t. To compute the Jaccard similarity over the columns, all we have to do is count the rows where both objects have ones (6) and divide it by the total number of rows (10).

| words | X | Y |
| --- | --- | --- |
| yesterday | 1 | 1 |
| the | 1 | 1 |
| warm | 1 | 1 |
| weather | 1 | 1 |
| was | 1 | 0 |
| perfect | 1 | 0 |
| for | 1 | 0 |
| my | 1 | 1 |
| cat | 1 | 1 |
| liked | 0 | 1 |
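In R, the same calculation could look like this, with the two columns of the matrix typed in by hand:

```
X <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
Y <- c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1)

sum(X & Y) / sum(X | Y)      # Jaccard similarity: 6/10 = 0.6
1 - sum(X & Y) / sum(X | Y)  # Jaccard distance: 0.4
```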

We don’t have to stop at single sentences, though. The Jaccard similarity is an efficient way to compute similarity over entire documents — a lot of documents if necessary. Our corresponding Boolean matrix will get very big, of course, but since the formula is relatively simple, it scales rather well to large datasets.

Edit distance

Lastly, let’s think about how to measure the similarity of two character strings. One way to do that is the edit distance. The edit distance is simply the minimum number of inserts and deletes needed to get from one string to the other.

Let’s say we have the words “knock” and “flocks”. To get from one to the other, we have to delete two letters (k, n) and insert three (f, l, s):

knock → nock → ock → lock → flock → flocks

So the edit distance between them is five. The edit distance is a proper distance measure, since it satisfies all four requirements we set at the beginning of this lesson:

• The distance of two objects x and y can’t be less than zero. There’s no way to do a negative number of edits, so that’s true.
• Two perfectly similar objects have distance zero. We don’t need any edits to transform a word into itself.
• The distance between x and y is the same as between y and x. Every insert into one word is equal to a delete from the other, so the paths you take are always inverse and have the same number of steps.
• If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. Changing from word x to word y before you change to z is one way to go from x to z. The direct way might be shorter, but it can never be longer than the detour.
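You can reproduce the example in R with the built-in adist() function. It computes a generalized edit distance that also allows substitutions, so to match our insert/delete definition, we make a substitution as expensive as one delete plus one insert:

```
adist("knock", "flocks",
      costs = list(insertions = 1, deletions = 1, substitutions = 2))  # 5
```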

Congratulations! You made it through all of the math and learned a lot about some ways to measure distance and similarity in your data. In Part two of this lesson, we’re going to leave the theory behind us. We’ll take a look at how to actually compute these distance measures in R and think about how to visualize similarity in data.
