Project: Visualizing WhatsApp chat logs –
Part 1: Cleaning the data

by Kira Schacht 1 Comment
Project: Visualizing WhatsApp chat logs – <br>Part 1: Cleaning the data

Part 2 | Code

A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.

  1. Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
  2. Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.

You can find the entire code for the project on our github page. In this part, we’ll walk you through the process of cleaning a dataset step by step. This is what the final product of part two will look like (Of course, yours could be something entirely else. There are heaps of great material in those logs):


is this the tooltip?

Getting the data

First things first: We’ll need some data to work with. WhatsApp has a built-in function for exporting chat logs via e-mail. Just choose any chat you want to analyze. Group chats are especially interesting for this particular visualization, since we’ll take a look at the number and timing of messages. In case you already know how to get the logs, you can just skip this step.

How to get the chat logs

Our to-do-list

The .txt file you’ll get is, well, not as difficult to handle as it could be, but it has a few quirks. The basic structure is pretty easy. Every row follows this basic pattern:

<time stamp> – <name>: <text>

Looks alright, doesn’t it? But there’s a few issues we’ll run into, especially if we don’t want to analyze just the  message count but the content as well. Some of them are easy to correct, like the dash between the time stamp and the name, some are more complicated:

  • The .txt file isn’t formatted like a .csv or a proper table. That is, not every row has the same amount of elements
  • Some rows don’t have a time stamp, but immediately start with text if a message has multiple paragraphs.
  • Some names have one, some names have two words, depending on wether they’re saved with or without surname.
  • The time stamp isn’t formatted to be evaluated and spans multiple columns.
  • Names have a colon at the end. That doesn’t look nice in graphics.

Converting and importing the file

Before we can start cleaning in R, we have to tackle the first issue on our to-do-list. If you try to read the text file into R right away, you’ll get an error:

We’ll have to convert it into a proper table structure. There’s multiple ways to do that. We used Excel to convert the file to .csv. If you already have your own favourite way to convert the file, you can do it your way.

Converting text to .csv with Excel

Now that we’ve got a proper .csv file, we can start cleaning it in R. First, read in the file and save it as a variable. For more info on data import in R, check our our previous tutorial.

Check that you specify the right separator. It’s probably a comma or a semicolon. Just open your file in a text editor and find out.

Regular expressions

To clean up the file, we’ll need to work with regular expressions. They’re used for finding and manipulating patterns in text and have their own syntax. Regex syntax is a little hard to wrap your head around, but there’s lots of reference sheets and expression testers like regexr online that help you translate. In R, you’ll use the grep() function family for text matching and replacement. Let’s try it out. As mentioned on our to-do-list, some rows don’t start with a timestamp. Visually, they’re easy to spot, because they don’t have a number at the beginning.

In regex, the pattern “character string without a digit at the beginning” is translated as “^\D”“^” matches the beginning of a string, “\D” means “anything except a digit”.

The call  to grep(“^\\D”, chat[,1]) tells R to look in the first column of our chat for rows that fit the regular expression “^\D”. The second backslash is an excape character only necessary in R, because the backslash serves other purposes there as well.

We’re not going to get into the details of regular expressions here, that’s a post for another time. Feel free to look them up on your own, though. If you want to analyze text files in your projects, it’s pretty sure you’ll encounter them anyway.

Shifting stampless rows

We’re going to shift the rows without time stamp a bit to the right, so we can copy down the time stamp and name of the sender. First, we’re going to make room at the end of the data frame, in case the stampless rows also happen to be very long messages:

Then, we’ll just move the first five rows that block the space for the time stamp to the end of the line, leaving the beginning of the line blank for the time being. We’ll use a for loop that goes through every row without a time stamp and moves the first five elements to the end of the line.

We’ll write a tutorial on loops and conditional statements soon, but in the meantime, check out this short explanation of you want to know more about loops.

We could copy down the time stamp and name right now, but since there’s still a few issues with the name columns, we’ll sort out those first. Before we do that, though, we’ll just quickly delete any entirely empty rows that might have snuck in. We’ll use the apply() function for that. It’s basically like a loop, just much faster and easier to handle in most cases. The R package swirl contains in-console tutorials and has a great one for the apply() function family as well.

Cleaning the surname column

Now, some contacts might be saved by first name, some by first and last name, right? So the column containing the surnames also sometimes contains a bit of text. The difference is, the text bit probably won’t end with a colon, the surnames definitely will. We can use regular expressions to filter the surname column accordingly.
Also, some messages aren’t actually chat content, but rather activity notifications like adding new members to a group. They’ll say something like “Marie Timcke added you”. Good thing is: Those messages don’t contain colons either, so we can use the same regular expression for the surnames and the notifications.

The regex we’ll use is “.+:$”. It matches any pattern with one or more characters (“.” for any character, “+” for “one or more”) followed by a literal colon (“:”) and then the end of the line (“$”).

The first part reduces the chat data frame to all columns that either have a colon in columns 5 (“grepl(“.+:$”, chat[,5])”) or (“|”) in column 4 (“grepl(“.+:$”, chat[,4])”). Of course, the stampless rows we just created are left in as well (“[,1])”). This effectively removes the notifications.

In the second chunk, we move the text parts in the surname column to the end of the line, the same way we shifted the rows without time stamp. By now, our file looks something like this (check yours with View(chat)):


The bigger part of our work is done. We’ll just format the time stamp in such a way that R can evaluate it and make a few cosmetic adjustments.

Converting the time stamp to date format

For R to convert the first two columns into a format it can work with, we’ll have to help it a bit. First, we’ll copy down time stamps and name from the previous row to all the rows we shifted before.

Then, we’ll clean the first few columns a bit, deleting the column with the dash, merging the time stamp so its all in one place and naming the first few columns.

Now, we can easily convert the first column into a date format. R has a few different classes for date and time objects. We’ll use the strptime() function, which produces an object of class “Posixlt”. If you want to know more about dates and times in R, again, the swirl lesson on that topic is great.

We need to tell strptime in which format the date is stored. In our case, it’s
“<day>.<month>.<full year>, <hours>:<minutes>”. In strptime() language, this is written as “%d.%m.%Y, %H:%M”.

One last cosmetic edit: The names still have colons at the end. This issue is easily solved with — you guessed it — regular expressions! We can use the gsub() function to search and replace patterns. We’ll use it on the “name” and “surname” column by replacing every colon at the end of a line with nothing, like this:

Congratulations, you’ve cleaned up the entire dataset! It should now look like this — no empty lines, no colons or text in the name columns and a wonderfully formatted time stamp.


Saving the data

Now, the only thing left to do is to save our beautiful, sparkly clean dataset to a new file. If you want to work with the file in another programm except R, you can use, for example, the write.table() or write.csv() function to export your data frame. Since we want to continue working in R for our visualizations, we’ll go with save() for now. It will create an .Rdata file that can be read back into R easily with the load() function.

There you go, all done! Give yourself a big pat on the back, because cleaning data is hard.

If you want to continue right away, check out part two of our WhatsApp project where we visualize the data we just cleaned. If you need help with the cleaning script or have suggestions on how to improve it, write us an e-mail or join our slack team. Our help and discussion channels are open for everyone!


Part 2 | Code


{Credits for the awesome featured image go to Phil Ninh}

Comment ( 1 )

Leave a reply

Your email address will not be published.

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>