Extracting geodata from OpenStreetMap with Osmfilter

Extracting geodata from OpenStreetMap with Osmfilter

A guest post by Hans Hack

When working on map related projects, I often need specific geographical data from OpenStreetMap (OSM) from a certain area. For a recent project of mine, I needed all the roads in Germany in a useful format so I can work with them in a GIS program. So how do I do I get the data to work with? With a useful little program called Osmfilter.

There are various sites that provide OSM datasets for certain areas. However, these datasets include ALL the geodata OSM provides. That means houses, streets, rivers etc – basically everything you see on a normal map. In addition, the file is in a rather inconvenient format for direct usage. A quite complicated way to proceed with the dataset would be to load it into a database and query what you need. This can be time consuming and – depending on the power of your PC – impossible.

If you only need to get info about small areas, I recommend using the site overpass turbo. For larger datasets, there is Osmfilter, which easily lets you filter OSM geodata. With a little help from two other free tools, you will have a dataset you can work with in no time.

 

What you’ll need for this tutorial

 

Get an OSM Dataset

To start out, we need to download an OSM dataset, which is saved in a format called pbf (a format to compress large data sets). For this tutorial, I will use a dataset provided by geofabrik, but there are other sources, too. Lets download the pbf file for Liechtenstein and save it in a folder of your choice. Once you have downloaded the data, open the shell and go to the folder where your new dataset is stored with the command

Prepare your data

Osmfilter only supports the file formats osm and o5m. For fast data processing, using o5m is recommended. You can convert your pbf file to o5m with osmconvert in your shell like this:

This translates to: Use the program osmconvert and convert the file called liechtenstein-latest.osm.pbf to a o5m file called liechtenstein. The -o stands for output.

You will now have the same dataset in the o5m format in the same folder.

Filter your data

Now, you can filter your geodata using the shell that should still be open. The osmfilter command logic is built up as follows:

Let’s look at the part about the filter commands. Here is where you can tell the program which parts of the dataset you need by writing --keep=DATA_I_WANT . Here is an example which creates a file for you called buildings.osm that contains all the buildings (and only the buildings) from the Liechtenstein dataset:

To find out how features are stored and classified in OSM, you can go to this site and look up the feature you want. Tip: You can head over to overpass turbo to test your query on a small area of your choice by using the ‘wizard’.

Of course you can do much more with Osmfilter. You can specify which building type you want. For example, you might only want to look at schools:

You can query multiple features by chaining them. For example if you want all the schools and universities:

You can exclude things by adding the flag –drop. For example, if you don’t want to have buildings that are warehouses but keep everything else:

You can reduce the final file size by dropping extra data on the authors and the version by adding:

You can, of course, combine these flags. Here is a query that gives you all the highway types that cars can use in Lichtenstein without the ones where motor vehicles are not allowed:

You find more on the filtering options on the Osmfilter site with some examples too.

Convert to a useful format

As a final step, you can convert your osm file to the most widely supported geodata format called a Shapefile (the GIS program QGIS can handle osm files, but it sometimes doesn’t work well with large datasets). You can convert your osm file to a Shapefile with the program ogr2ogr like this:

The above command converts the file streets_liechtenstein.osm to the shapefile format and tells it to store it in a folder called streets_shapefiles. In the newly created folder you will find 4 different shapefiles (one for every geometry type). In case of the streets, we are only interested in the file lines.shp. You can open this file in a GIS program of your choice.

Ogr2org also allows you to convert your newly created Shapefiles to other geodata formats that you may need, such as GeoJSON, CSV and many more. Have a look at the ogr2ogr website for more info. If you’re tired of using the shell, try the online tool Mapshaper which allows you to convert your Shapefile file to formats such as GeoJSON, SVG and CSV. The file size for Mapshaper is limited but I have tried it with files bigger than 1 GB.

 

Have fun filtering OSM and happy mapping!

 

Similarity and distance in data: Part 2

Similarity and distance in data: Part 2

Part 1 | Code

In part one of this tutorial, you learned about what distance and similarity mean for data and how to measure it. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.

Implementation in R: the dist() function

The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.

The crucial argument here is method. It has six options — actually more like four and a half, but you’ll see:

  • euclidean” Is the Euclidean distance.
  • maximum” The maximum distance.
  • manhattan” The Manhattan or city block distance.
  • canberra” Another name for the Manhattan distance.
  • binary” The Jaccard distance.
  • minkowski” Also called L-norm. The generalized version of Euclidean and Manhattan distance. Returns the Manhattan/Canberra distance if p = 1 and the Euclidean distance for p = 2.

We’re going to be working with the Jaccard distance in this lecture, but it works just as well for the other distance measures.

Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function.

It contains the data frame values, which contains data on which european right wing parties agree with which right wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros — one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.

Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is a very important distinction, since some distance functions will calculate over rows per default, some over columns.

The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():

Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape. That’s because we’re calculating the distance of every party to every other party, so the resulting matrix would be symmetric. Since we want to visualize our results, though, that’s what we want. So to prepare for visualization, we’ll have to do two things:

  • Add the diagonal and the upper triangle to make a complete rectangle shape.
  • Convert back from a dist object to a data frame so ggplot can work with the data.

Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? Once we’ve converted back to a data frame, we can simply use 1 - jacc  to get the Jaccard similarities from the Jaccard distances the dist() function returns.

If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.

From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.

Melting the data

If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the specific tidy format ggplot works with. To melt our dataframe into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a row containing the party names currently stored as row names, so the melting function will know what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data.

Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).

The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:

The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.

Visualization: Tile plot

There’s lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.

First, we’re going zo build the basic plot structure:

Remember to load the ggplot2 package before you start plotting. We’re going to specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: A set of tiles. They’re going to be filled according to the Jaccard similarities stored in the column value (aes(fill=value)). Their basic color is defined to be white, but we’ll create a gradient of blues with scale_file_gradient(). Try different color schemes if you like.

With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:

sim1

Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tile comparing one party to itself. The lighter the color, the lower the similarity between the parties.

But this plot doesn’t look as pretty as we’d like it to yet. The labels are to small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case and the legend doesn’t look as nice as it could.

Thankfully, ggplot2 let’s us edit all of that. Add these settings to your plot with the + operator and see what they do:

theme_light() is a standard theme with a clean look to it that fits our needs for this plot. The base_size argument lets us modify the text size of every text element in our plot. The default is 12px, but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.

The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:

sim2

That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title of the rotation angle of the x axis text.

Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there’s lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on slack if you want to keep up with all of the hot Journocode gossip. Have fun!

Part 1 | Code

Project: Visualizing WhatsApp chat logs –
Part 1: Cleaning the data

Project: Visualizing WhatsApp chat logs – <br>Part 1: Cleaning the data

Part 2 | Code

A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.

  1. Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
  2. Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.

You can find the entire code for the project on our github page. In this part, we’ll walk you through the process of cleaning a dataset step by step. This is what the final product of part two will look like (Of course, yours could be something entirely else. There are heaps of great material in those logs):

 

is this the tooltip?

Getting the data

First things first: We’ll need some data to work with. WhatsApp has a built-in function for exporting chat logs via e-mail. Just choose any chat you want to analyze. Group chats are especially interesting for this particular visualization, since we’ll take a look at the number and timing of messages. In case you already know how to get the logs, you can just skip this step.

How to get the chat logs

Depending on what kind of phone you have, this might work a little differently. But this is how it works on Android:

While in a chat, tap the three dots in the upper right corner and select “More”, the last option.

Screenshot_2016-01-28-12-02-49

Then, select “E-mail chat”, the second option. It will let you choose an address to send to and voilà, there’s your text file.

Screenshot_2016-01-28-12-02-54

Alternatively, you can also go via the WhatsApp main page. Tap the three dots and select “Settings” > “Chat history” > “Send chat history”. Then, just select the chat you want to export.

Our to-do-list

The .txt file you’ll get is, well, not as difficult to handle as it could be, but it has a few quirks. The basic structure is pretty easy. Every row follows this basic pattern:

<time stamp> – <name>: <text>

Looks alright, doesn’t it? But there’s a few issues we’ll run into, especially if we don’t want to analyze just the  message count but the content as well. Some of them are easy to correct, like the dash between the time stamp and the name, some are more complicated:

  • The .txt file isn’t formatted like a .csv or a proper table. That is, not every row has the same amount of elements
  • Some rows don’t have a time stamp, but immediately start with text if a message has multiple paragraphs.
  • Some names have one, some names have two words, depending on wether they’re saved with or without surname.
  • The time stamp isn’t formatted to be evaluated and spans multiple columns.
  • Names have a colon at the end. That doesn’t look nice in graphics.

Converting and importing the file

Before we can start cleaning in R, we have to tackle the first issue on our to-do-list. If you try to read the text file into R right away, you’ll get an error:

We’ll have to convert it into a proper table structure. There’s multiple ways to do that. We used Excel to convert the file to .csv. If you already have your own favourite way to convert the file, you can do it your way.

Converting text to .csv with Excel

First, obviously, open Excel. Open the .txt file. Remember to switch from “Excel files” to “All file types” in the drop-down menu so your text file is visible. You should be led to the the text import wizard.

In the first step, set the file type to “Delimited”.

text-import-wizard-step-1

Then, separate by spaces (remember to un-check “Tab”).

text-import-wizard-step-2

In the last step, you can just leave the data format at “General” and click “Finish”.

text-import-wizard-step-3

The resulting dataset should look somewhat like this:

csv

Just save this as a .csv file and you should be good to go.

Now that we’ve got a proper .csv file, we can start cleaning it in R. First, read in the file and save it as a variable. For more info on data import in R, check our our previous tutorial.

Check that you specify the right separator. It’s probably a comma or a semicolon. Just open your file in a text editor and find out.

Regular expressions

To clean up the file, we’ll need to work with regular expressions. They’re used for finding and manipulating patterns in text and have their own syntax. Regex syntax is a little hard to wrap your head around, but there’s lots of reference sheets and expression testers like regexr online that help you translate. In R, you’ll use the grep() function family for text matching and replacement. Let’s try it out. As mentioned on our to-do-list, some rows don’t start with a timestamp. Visually, they’re easy to spot, because they don’t have a number at the beginning.

In regex, the pattern “character string without a digit at the beginning” is translated as “^\D”“^” matches the beginning of a string, “\D” means “anything except a digit”.

The call  to grep(“^\\D”, chat[,1]) tells R to look in the first column of our chat for rows that fit the regular expression “^\D”. The second backslash is an excape character only necessary in R, because the backslash serves other purposes there as well.

We’re not going to get into the details of regular expressions here, that’s a post for another time. Feel free to look them up on your own, though. If you want to analyze text files in your projects, it’s pretty sure you’ll encounter them anyway.

Shifting stampless rows

We’re going to shift the rows without time stamp a bit to the right, so we can copy down the time stamp and name of the sender. First, we’re going to make room at the end of the data frame, in case the stampless rows also happen to be very long messages:

Then, we’ll just move the first five rows that block the space for the time stamp to the end of the line, leaving the beginning of the line blank for the time being. We’ll use a for loop that goes through every row without a time stamp and moves the first five elements to the end of the line.

We’ll write a tutorial on loops and conditional statements soon, but in the meantime, check out this short explanation of you want to know more about loops.

We could copy down the time stamp and name right now, but since there’s still a few issues with the name columns, we’ll sort out those first. Before we do that, though, we’ll just quickly delete any entirely empty rows that might have snuck in. We’ll use the apply() function for that. It’s basically like a loop, just much faster and easier to handle in most cases. The R package swirl contains in-console tutorials and has a great one for the apply() function family as well.

Cleaning the surname column

Now, some contacts might be saved by first name, some by first and last name, right? So the column containing the surnames also sometimes contains a bit of text. The difference is, the text bit probably won’t end with a colon, the surnames definitely will. We can use regular expressions to filter the surname column accordingly.
Also, some messages aren’t actually chat content, but rather activity notifications like adding new members to a group. They’ll say something like “Marie Timcke added you”. Good thing is: Those messages don’t contain colons either, so we can use the same regular expression for the surnames and the notifications.

The regex we’ll use is “.+:$”. It matches any pattern with one or more characters (“.” for any character, “+” for “one or more”) followed by a literal colon (“:”) and then the end of the line (“$”).

The first part reduces the chat data frame to all columns that either have a colon in columns 5 (“grepl(“.+:$”, chat[,5])”) or (“|”) in column 4 (“grepl(“.+:$”, chat[,4])”). Of course, the stampless rows we just created are left in as well (“is.na(chat[,1])”). This effectively removes the notifications.

In the second chunk, we move the text parts in the surname column to the end of the line, the same way we shifted the rows without time stamp. By now, our file looks something like this (check yours with View(chat)):

chat2

The bigger part of our work is done. We’ll just format the time stamp in such a way that R can evaluate it and make a few cosmetic adjustments.

Converting the time stamp to date format

For R to convert the first two columns into a format it can work with, we’ll have to help it a bit. First, we’ll copy down time stamps and name from the previous row to all the rows we shifted before.

Then, we’ll clean the first few columns a bit, deleting the column with the dash, merging the time stamp so its all in one place and naming the first few columns.

Now, we can easily convert the first column into a date format. R has a few different classes for date and time objects. We’ll use the strptime() function, which produces an object of class “Posixlt”. If you want to know more about dates and times in R, again, the swirl lesson on that topic is great.

We need to tell strptime in which format the date is stored. In our case, it’s
“<day>.<month>.<full year>, <hours>:<minutes>”. In strptime() language, this is written as “%d.%m.%Y, %H:%M”.

One last cosmetic edit: The names still have colons at the end. This issue is easily solved with — you guessed it — regular expressions! We can use the gsub() function to search and replace patterns. We’ll use it on the “name” and “surname” column by replacing every colon at the end of a line with nothing, like this:

Congratulations, you’ve cleaned up the entire dataset! It should now look like this — no empty lines, no colons or text in the name columns and a wonderfully formatted time stamp.

chat3

Saving the data

Now, the only thing left to do is to save our beautiful, sparkly clean dataset to a new file. If you want to work with the file in another programm except R, you can use, for example, the write.table() or write.csv() function to export your data frame. Since we want to continue working in R for our visualizations, we’ll go with save() for now. It will create an .Rdata file that can be read back into R easily with the load() function.

There you go, all done! Give yourself a big pat on the back, because cleaning data is hard.

If you want to continue right away, check out part two of our WhatsApp project where we visualize the data we just cleaned. If you need help with the cleaning script or have suggestions on how to improve it, write us an e-mail or join our slack team. Our help and discussion channels are open for everyone!

 

Part 2 | Code

 

{Credits for the awesome featured image go to Phil Ninh}

R exercise: Analysing data

R exercise: Analysing data

While using R for your everyday calculations is so much more fun than using your smartphone, that’s not the (only) reason we’re here. So let’s move on to the real thing: How to make data tell us a story.

First you’ll need some data. You haven’t learned how to get and clean data, yet. We’ll get to that later. For now you can practice on this data set. The data journalists at Berliner Morgenpost used it to take a closer look at refugees in Germany and kindly put the clean data set online. You can also play around with your own set of data. Feel free to look for something entertaining on the internet – or in hidden corners of your hard drive. Remember to save your data in your working directory to save yourself some unneccessary typing.

Read your data set into R with read.csv(). For this you need a .csv file. Excel sheets can easily be saved as such.

Now you have a data frame. Name it anything you want. We’ll go with data. Check out class(data). It tells you what kind of object you have before you. In this case, it should return data frame.

Time to play!

Remember, if you just type data and run that command, it will print the whole table to the console. That might be not exactly what you want if your dataset is very big. Instead, you can use the handy functions below to get an overview of your data.

Try them and play around a little bit. Found anything interesting yet? Anything odd? In the data set we suggested, you’ll notice that the mean and the median are very different in the column “Asylantraege” (applications for asylum). What does that tell you?

Row and column indices

This is how you can take a closer look at a part of the whole set using indices. Indices are the numbers or names by which R identifies the rows or columns of data.

The last two alternatives only work if your columns have names. Use the function names() to look them up or change them.

Here are some more useful functions that will give you more information about the columns you’re interested in. Try them!

Subsets and Logic

Now you can take and even more detailed look by forming subsets, parts of your data that meet certain criteria. You’ll need the following logical operators.

Try to form different subsets of your data to find out interesting stuff. Check if it worked with View()head()tail(), etc.

Try to kick out all the rows that have “0” in the column “Asylantraege” (applications for asylum). Look at it again. What happened to mean and median?

Get the answers you want

With everything you learned so far, you can start to get answers. See what questions about your data can be answered by forming data subsets. For example, if you used the data set we suggested: Where do most people seeking refuge in Germany come from?

We made a list of the ten most common countries of origin.

Unbenannt

Ask your own questions. What do you want your data to tell you?

 

{Credits for the awesome featured image go to Phil Ninh}