R: Tidy Data


Unfortunately, data comes in all shapes and sizes, especially when it comes from public authorities. You’ll have to be able to deal with PDFs, fused table cells and frequent changes in terms and spelling.

When I analyzed the Swiss arms export data as an intern at SRF Data, we had to work with scanned copies of data sheets that weren’t machine-readable, datasets with French, German or mixed French and German country names in the same column, as well as fused cells and changing spellings of the categories.

Unsurprisingly, preparing and cleaning messy datasets is often the most time-consuming part of data analysis. Hadley Wickham, creator of R packages like ggplot2 and reshape, wrote a very interesting paper about an important part of this data cleaning: data tidying.
According to him, tidy data has a specific structure:

Each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets.

As you may have seen in our post on ggplot2, Wickham calls this tidy format molten data. The idea behind this is to facilitate the analysis procedure by minimizing the effort in preparing the data for the different tools over and over again. His suggestion: Working on tidy, molten data with a set of tidy tools, allowing you to use the saved time to focus on the results.

[Screenshot: excerpt from Wickham’s “Tidy Data”]

Practicing data tidying

But how do we tidy messy data? How do we get from raw to molten data and what are tidy tools? Let’s practice this on a messy dataset.

On our GitHub page, we uploaded an Excel file containing data on marriages in Germany per state and for different years. Download it and open it with Excel to have a first look at it. As you’ll see, it’s a workbook with seven sheets. We have data for 1990, 2003 and for every year from 2010 through 2014. Although this is quite a small dataset that we could tidy manually in Excel, we’ll use it to practice skills that will come in handy with bigger datasets.

Now check whether this marriage data needs to be tidied:

  • Are there any changing terms?
  • Is the spelling correct?
  • Is every column that contains numbers correctly saved as a numeric column?
  • Are there needless titles, fused cells, empty cells or other annoying noise?

Spoiler alert: The sheets for 2010 through 2014 are okay, but the first two are not. They use different spellings and terms, and the first sheet also has three useless columns, one useless row and all of its numbers saved as text. As said, the mess in this example is limited and we could tidy it up manually with a few clicks. But let’s keep in mind that we’re here to learn how to handle those problems with larger datasets as well.

Within Excel, we will:

  • Delete spare rows and columns (we could do that in R too when it comes to REALLY messy data)
  • Save columns containing numbers as numeric type

Now we’ll change to R.

First of all, we need to install and load all the packages we’re going to use. We’ll do that with an if-statement telling R to install a package only if it hasn’t been installed yet. You could of course do this the straightforward way without the conditional statement if you remember whether you already installed the package, but this is a quick way to make sure you don’t install something twice needlessly.
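A minimal sketch of such a conditional install could look like the following. The helper name install_if_missing is our own invention for illustration. It is demonstrated on "tools", a package that ships with R, so nothing gets downloaded; for this tutorial you would pass names like "readxl", "tidyr", "dplyr" and "ggplot2" instead.

```r
# Hypothetical helper: install a package only if it is not available yet,
# then attach it to the current session.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "tools" ships with R, so this call never triggers an install.
# For the tutorial, use e.g. c("readxl", "tidyr", "dplyr", "ggplot2").
for (pkg in c("tools")) {
  install_if_missing(pkg)
}
```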

To read in the sheets of an Excel workbook, read_excel() from the readxl package is a useful function. Because we don’t want to load the sheets separately, we’re going to use a loop for this. If you’re interested in learning more about loop functions, stay tuned for our upcoming tutorial on this topic.

messy_data is now a list of seven local data frames with messy_data[[1]] containing the data for 1990, messy_data[[2]] for 2003 and so on. Also, we added a “timestamp” column to each list element which contains the index of the list element.

Saving the sheets as list elements saves time, but we want all the data in one data frame:
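One way to do this in base R is do.call() with rbind(). The sketch below uses a tiny made-up stand-in for messy_data (two sheets, invented numbers), and the name all_data is just a placeholder.

```r
# Toy stand-in for messy_data: a list of data frames with identical columns.
# The numbers are made up for illustration.
messy_data <- list(
  data.frame(state = c("Bayern", "Hessen"), marriages = c(10, 5), timestamp = 1),
  data.frame(state = c("Bayern", "Hessen"), marriages = c(8, 4), timestamp = 2)
)

# Stack all list elements on top of each other.
# This only works if every element has the same columns.
all_data <- do.call(rbind, messy_data)
print(dim(all_data))
```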

If you get an error telling you the frames have different lengths, you probably forgot to delete the spare columns in the 1990 sheet. Sometimes there even seems to be something invisible left in empty Excel columns. I usually delete three or so of the empty columns and empty rows next to my data to be sure there isn’t something left I can’t see.

Next part: Restructuring the data

With the function gather() of Wickham’s tidyr package, we’ll melt the raw data frame to convert it to a molten format. And let’s change the timestamps created by the read-in loop to the actual year (we could do that with a loop, too, but this is good enough for now).
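To show the idea without depending on any package, here is a sketch that melts a small made-up table with base R’s stack() as a stand-in for gather(). The column layout (one column per state plus the sheet index) is an assumption about the raw data, and the numbers are invented. The tidyr call that would do the same melting is shown in a comment.

```r
# Hypothetical wide data: one column per state, plus the sheet index.
raw <- data.frame(
  timestamp = c(1, 2),
  Bayern = c(100, 90),
  Hessen = c(50, 45)
)

# Base R stand-in for melting: stack() collects all state columns into one
# value column ("values") and one key column ("ind").
# With tidyr, the equivalent would be:
#   gather(raw, key = "state", value = "marriages", -timestamp)
stacked <- stack(raw[, c("Bayern", "Hessen")])
molten <- data.frame(
  timestamp = rep(raw$timestamp, times = 2),
  state     = as.character(stacked$ind),
  marriages = stacked$values
)

# Translate the sheet index into the actual year via a lookup vector.
years <- c(1990, 2003, 2010, 2011, 2012, 2013, 2014)
molten$year <- years[molten$timestamp]
print(molten)
```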

Oo-De-Lally! This is tidy data in a tidy format! Now we can check if we have to correct the state names (because with bigger datasets, you can’t quickly check and correct spelling and term issues within Excel):

So we got 19 different German Bundesländer. But Google tells us that there are only 16 states in Germany! Let’s have a closer look at the names to check whether we’ll find duplicates:

Yes, there are! For example Baden-Württemberg and BaWü refer to the same state, as well as Hessen, Hesse and Hesssen. You can just manually correct this. For really big datasets, you could also work with regular expressions and string replacement to find the duplicates, but for now, this should be enough:
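A small sketch of both steps, counting the distinct names and correcting them by hand, using a made-up vector of state names that contains the duplicates mentioned above:

```r
# Toy vector with the spelling variants mentioned in the text.
states <- c("Hessen", "Hesse", "Hesssen", "BaWü", "Baden-Württemberg", "Bayern")

# How many distinct spellings are there, and which ones?
print(length(unique(states)))
print(unique(states))

# Manual correction: overwrite every known variant with one spelling.
states[states %in% c("Hesse", "Hesssen")] <- "Hessen"
states[states == "BaWü"] <- "Baden-Württemberg"
print(unique(states))
```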

Now that your data is tidy, the actual analysis can start. A very useful package to prepare molten data is dplyr. Its functions make it easier to filter or group the data. Not only is this great for taking a closer look at certain subsets of your data, but, because Wickham’s graphics package ggplot2 was created to fit the tidy data principle, we can quickly shape the data to be visually analyzed, too.

Here we have some examples for you showing how tidy data and tidy tools can work hand in hand. If you want to learn something about the graphics package ggplot2 first, visit our post for beginners on this!

Visual analysis with ggplot2: this may look complicated at first, but once you have coded your first ggplot, you only have to change or add a few things to create several more, totally different plots.


Maybe you’ve got some other questions this data could answer for you? Feel free to continue this analysis or try to tidy your own data set!

If you have any questions, problems or feedback, simply leave a comment, email us or join our slack team to talk to us any time!

 

{Credits for the awesome featured image go to Phil Ninh}

R: plotting with the ggplot2 package


While crunching numbers, a visual analysis of your data may help you get an overview of your data or compare filtered information at a glance. Aside from the built-in graphics package, R has many additional packages to help you with that.
We want to focus on ggplot2 by Hadley Wickham, which is a very nice and quite popular graphics package.

The package is based on a kind of statistical philosophy from a book I really recommend reading. In The Grammar of Graphics, author Leland Wilkinson goes deep into the structure of quantitative plotting. As a result, he establishes a rulebook for building charts the right way. Hadley Wickham built ggplot2 to follow these aesthetics and principles.

Project: Visualizing WhatsApp chat logs –
Part 1: Cleaning the data



A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.

  1. Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
  2. Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.

You can find the entire code for the project on our GitHub page. In this part, we’ll walk you through the process of cleaning a dataset step by step. This is what the final product of part two will look like (of course, yours could be something else entirely; there’s heaps of great material in those logs):

 


Getting the data

First things first: We’ll need some data to work with. WhatsApp has a built-in function for exporting chat logs via e-mail. Just choose any chat you want to analyze. Group chats are especially interesting for this particular visualization, since we’ll take a look at the number and timing of messages. In case you already know how to get the logs, you can just skip this step.

How to get the chat logs

Depending on what kind of phone you have, this might work a little differently. But this is how it works on Android:

While in a chat, tap the three dots in the upper right corner and select “More”, the last option.


Then, select “E-mail chat”, the second option. It will let you choose an address to send to and voilà, there’s your text file.


Alternatively, you can also go via the WhatsApp main page. Tap the three dots and select “Settings” > “Chat history” > “Send chat history”. Then, just select the chat you want to export.

Our to-do-list

The .txt file you’ll get is, well, not as difficult to handle as it could be, but it has a few quirks. The basic structure is pretty easy. Every row follows this basic pattern:

<time stamp> – <name>: <text>

Looks alright, doesn’t it? But there are a few issues we’ll run into, especially if we don’t want to analyze just the message count but the content as well. Some of them are easy to correct, like the dash between the time stamp and the name; some are more complicated:

  • The .txt file isn’t formatted like a .csv or a proper table. That is, not every row has the same amount of elements
  • Some rows don’t have a time stamp, but immediately start with text if a message has multiple paragraphs.
  • Some names have one, some names have two words, depending on whether they’re saved with or without surname.
  • The time stamp isn’t formatted to be evaluated and spans multiple columns.
  • Names have a colon at the end. That doesn’t look nice in graphics.

Converting and importing the file

Before we can start cleaning in R, we have to tackle the first issue on our to-do-list. If you try to read the text file into R right away, you’ll get an error:

We’ll have to convert it into a proper table structure. There’s multiple ways to do that. We used Excel to convert the file to .csv. If you already have your own favourite way to convert the file, you can do it your way.

Converting text to .csv with Excel

First, obviously, open Excel. Open the .txt file. Remember to switch from “Excel files” to “All file types” in the drop-down menu so your text file is visible. You should be led to the text import wizard.

In the first step, set the file type to “Delimited”.


Then, separate by spaces (remember to un-check “Tab”).


In the last step, you can just leave the data format at “General” and click “Finish”.


The resulting dataset should look somewhat like this:

[Screenshot: the chat log after import into Excel]

Just save this as a .csv file and you should be good to go.

Now that we’ve got a proper .csv file, we can start cleaning it in R. First, read in the file and save it as a variable. For more info on data import in R, check out our previous tutorial.

Check that you specify the right separator. It’s probably a comma or a semicolon. Just open your file in a text editor and find out.
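As a sketch of the import step, the example below reads two made-up chat lines with a semicolon as separator. It uses the text argument of read.csv() so it runs without a file; with your real export, you would pass the file name instead. The variable name chat matches the one used later in this post, everything else is invented.

```r
# Toy stand-in for the converted file: two invented chat lines,
# already split at the spaces and separated by semicolons.
csv_text <- "28.01.2016,;12:02;-;Anna:;Hello;there
28.01.2016,;12:03;-;Marie;Timcke:;Hi"

# With a real file, use read.csv("yourfile.csv", sep = ";", ...) instead.
chat <- read.csv(text = csv_text, sep = ";", header = FALSE,
                 stringsAsFactors = FALSE)
print(chat)
```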

Regular expressions

To clean up the file, we’ll need to work with regular expressions. They’re used for finding and manipulating patterns in text and have their own syntax. Regex syntax is a little hard to wrap your head around, but there’s lots of reference sheets and expression testers like regexr online that help you translate. In R, you’ll use the grep() function family for text matching and replacement. Let’s try it out. As mentioned on our to-do-list, some rows don’t start with a timestamp. Visually, they’re easy to spot, because they don’t have a number at the beginning.

In regex, the pattern “character string without a digit at the beginning” is translated as “^\D”. Here, “^” matches the beginning of a string and “\D” means “anything except a digit”.

The call to grep(“^\\D”, chat[,1]) tells R to look in the first column of our chat for rows that fit the regular expression “^\D”. The second backslash is an escape character that is only necessary in R, because the backslash serves other purposes there as well.
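To see what grep() returns, here is a small sketch on a made-up first column. Two of the four entries start with text instead of a date, so their positions are what we get back.

```r
# Toy first column: entries 2 and 4 have no time stamp.
first_col <- c("28.01.2016,", "and", "28.01.2016,", "see")

# grep() returns the positions of all elements that match the pattern.
stampless <- grep("^\\D", first_col)
print(stampless)
```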

We’re not going to get into the details of regular expressions here; that’s a post for another time. Feel free to look them up on your own, though. If you want to analyze text files in your projects, you’re almost certain to encounter them anyway.

Shifting stampless rows

We’re going to shift the rows without time stamp a bit to the right, so we can copy down the time stamp and name of the sender. First, we’re going to make room at the end of the data frame, in case the stampless rows also happen to be very long messages:

Then, we’ll just move the first five elements that block the space for the time stamp to the end of the line, leaving the beginning of the line blank for the time being. We’ll use a for loop that goes through every row without a time stamp and moves the first five elements to the end of the line.
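The following sketch shows one possible reading of these two steps on a made-up two-row data frame: first add five empty columns, then shift each stampless row five positions to the right. The column names and the shift-by-five logic are assumptions for illustration, not necessarily the exact code of the project.

```r
# Toy chat: row 2 is the second paragraph of a message, without time stamp.
chat <- data.frame(
  V1 = c("28.01.2016,", "see"),
  V2 = c("12:02", "you"),
  V3 = c("-", "later"),
  V4 = c("Anna:", NA),
  V5 = c("Hi", NA),
  stringsAsFactors = FALSE
)

shift <- 5
n_old <- ncol(chat)

# Make room: add five empty columns at the end of the data frame.
chat[paste0("V", n_old + seq_len(shift))] <- NA_character_

# Shift every stampless row five columns to the right, blank the beginning.
stampless <- grep("^\\D", chat[, 1])
for (row in stampless) {
  chat[row, (shift + 1):(n_old + shift)] <- unlist(chat[row, 1:n_old])
  chat[row, 1:shift] <- NA
}
print(chat)
```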

We’ll write a tutorial on loops and conditional statements soon, but in the meantime, check out this short explanation if you want to know more about loops.

We could copy down the time stamp and name right now, but since there’s still a few issues with the name columns, we’ll sort out those first. Before we do that, though, we’ll just quickly delete any entirely empty rows that might have snuck in. We’ll use the apply() function for that. It’s basically like a loop, just much faster and easier to handle in most cases. The R package swirl contains in-console tutorials and has a great one for the apply() function family as well.
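A minimal sketch of how apply() can find entirely empty rows, using a made-up data frame. Treating both NA and empty strings as “empty” is our assumption here.

```r
# Toy data: the second row is entirely empty.
chat <- data.frame(
  V1 = c("a", NA, "c"),
  V2 = c("x", NA, NA),
  stringsAsFactors = FALSE
)

# apply(..., 1, f) calls f once per row; a row counts as empty
# if every element is NA or an empty string.
empty <- apply(chat, 1, function(row) all(is.na(row) | row == ""))
chat <- chat[!empty, ]
print(chat)
```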

Cleaning the surname column

Now, some contacts might be saved by first name, some by first and last name, right? So the column containing the surnames also sometimes contains a bit of text. The difference is, the text bit probably won’t end with a colon, the surnames definitely will. We can use regular expressions to filter the surname column accordingly.
Also, some messages aren’t actually chat content, but rather activity notifications like adding new members to a group. They’ll say something like “Marie Timcke added you”. Good thing is: Those messages don’t contain colons either, so we can use the same regular expression for the surnames and the notifications.

The regex we’ll use is “.+:$”. It matches any pattern with one or more characters (“.” for any character, “+” for “one or more”) followed by a literal colon (“:”) and then the end of the line (“$”).

The first part reduces the chat data frame to all rows that have an entry ending with a colon either in column 5 (“grepl(“.+:$”, chat[,5])”) or (“|”) in column 4 (“grepl(“.+:$”, chat[,4])”). Of course, the stampless rows we just created are left in as well (“is.na(chat[,1])”). This effectively removes the notifications.
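Here is a sketch of that filter on four made-up rows: a message from a contact saved by first name, one from a contact saved with surname, a notification, and a shifted stampless row. Only the notification should disappear.

```r
# Toy chat with six columns; row 3 is a notification ("Marie Timcke added ..."),
# row 4 is a stampless row that was shifted to the right.
chat <- data.frame(
  V1 = c("28.01.2016,", "28.01.2016,", "28.01.2016,", NA),
  V2 = c("12:02", "12:03", "12:04", NA),
  V3 = c("-", "-", "-", NA),
  V4 = c("Anna:", "Marie", "Marie", NA),
  V5 = c("Hello", "Timcke:", "Timcke", NA),
  V6 = c(NA, "Hi", "added", "shifted text"),
  stringsAsFactors = FALSE
)

# Keep rows with a colon-terminated entry in column 5 or 4,
# plus all rows without a time stamp.
keep <- grepl(".+:$", chat[, 5]) | grepl(".+:$", chat[, 4]) | is.na(chat[, 1])
chat <- chat[keep, ]
print(chat)
```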

In the second chunk, we move the text parts in the surname column to the end of the line, the same way we shifted the rows without time stamp. By now, our file looks something like this (check yours with View(chat)):

[Screenshot: the chat data frame after cleaning the surname column]

The bigger part of our work is done. We’ll just format the time stamp in such a way that R can evaluate it and make a few cosmetic adjustments.

Converting the time stamp to date format

For R to convert the first two columns into a format it can work with, we’ll have to help it a bit. First, we’ll copy down time stamps and name from the previous row to all the rows we shifted before.

Then, we’ll clean the first few columns a bit, deleting the column with the dash, merging the time stamp so it’s all in one place and naming the first few columns.

Now, we can easily convert the first column into a date format. R has a few different classes for date and time objects. We’ll use the strptime() function, which produces an object of class “POSIXlt”. If you want to know more about dates and times in R, again, the swirl lesson on that topic is great.

We need to tell strptime in which format the date is stored. In our case, it’s
“<day>.<month>.<full year>, <hours>:<minutes>”. In strptime() language, this is written as “%d.%m.%Y, %H:%M”.
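A short sketch of the conversion with a made-up time stamp. Setting tz = "UTC" is our own choice to keep the example independent of your computer’s time zone.

```r
# Convert a time stamp string into a date-time object.
stamp <- strptime("28.01.2016, 12:02", format = "%d.%m.%Y, %H:%M", tz = "UTC")

print(class(stamp))
# Once converted, the date can be printed in any other format.
print(format(stamp, "%Y-%m-%d %H:%M"))
```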

One last cosmetic edit: The names still have colons at the end. This issue is easily solved with — you guessed it — regular expressions! We can use the gsub() function to search and replace patterns. We’ll use it on the “name” and “surname” column by replacing every colon at the end of a line with nothing, like this:
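A minimal sketch with made-up values: the pattern “:$” only matches a colon at the very end of a string, so a colon in the middle (as in a time) stays untouched.

```r
# Toy values: two names with trailing colons, one time with a colon inside.
names_raw <- c("Anna:", "Timcke:", "12:02")

# Replace a colon at the end of each string with nothing.
names_clean <- gsub(":$", "", names_raw)
print(names_clean)
```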

Congratulations, you’ve cleaned up the entire dataset! It should now look like this — no empty lines, no colons or text in the name columns and a wonderfully formatted time stamp.

[Screenshot: the cleaned chat data frame]

Saving the data

Now, the only thing left to do is to save our beautiful, sparkly clean dataset to a new file. If you want to work with the file in a program other than R, you can use, for example, the write.table() or write.csv() function to export your data frame. Since we want to continue working in R for our visualizations, we’ll go with save() for now. It will create an .Rdata file that can be read back into R easily with the load() function.
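A small round-trip sketch of save() and load(). It writes to a temporary directory so it doesn’t clutter your project folder; the file name chat.Rdata and the one-row data frame are placeholders.

```r
# Toy data frame standing in for the cleaned chat.
chat <- data.frame(name = "Anna", text = "Hello", stringsAsFactors = FALSE)

# Save the object to an .Rdata file in a temporary directory.
path <- file.path(tempdir(), "chat.Rdata")
save(chat, file = path)

# Remove it from the workspace and read it back in.
# load() restores the object under its original name.
rm(chat)
load(path)
print(chat)
```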

There you go, all done! Give yourself a big pat on the back, because cleaning data is hard.

If you want to continue right away, check out part two of our WhatsApp project where we visualize the data we just cleaned. If you need help with the cleaning script or have suggestions on how to improve it, write us an e-mail or join our slack team. Our help and discussion channels are open for everyone!

 


 


Your first interactive choropleth map with R


When it comes to data journalism, visualizing your data isn’t what it’s all about. Getting and cleaning your data, analyzing and verifying your findings is way more important.

Still, an interactive eye-catcher holding interesting information will definitely not hurt your data story. Plus, you can use graphics for a visual analysis, too.

Here, we’ll show you how to build a choropleth map, where your data is visualized as colored polygon areas like countries and states.
We will code a multilayer map on Dortmund’s students as an example. You’ll be able to switch between layered data from different years. The popups hold additional information on Dortmund’s districts.

Now for the data

First of all, you need to read a kml file into R. KML stands for Keyhole Markup Language and, as I just learned from the comment section of this tutorial, it is an XML data format used to display geospatial information in a web browser. With a bit of googling, you’ll find kml files holding geospatial information on your own city, state or country. For this example, we’ll use this data on Dortmund’s districts. Right-click the link, download the kml file and save it to a new directory named “journocode” (or anything you want, really, but we’ll work with this for now).

Start RStudio. If you haven’t installed it yet, have a look at our first R Tutorial post. After starting RStudio, open a new R script and save it to the right directory. For example, if your “journocode”-directory was placed on your desktop (and your Username was MarieLou), type

Remember to use a normal slash (/) in your file path instead of a backslash. Now, we can read the file directly into R. If you don’t use our example data, try opening your kml file with a text editor first to look for the layer name! As you can see in this screenshot, for “Statistische Bezirke.kml” we have a layer named “Statistische_Bezirke”, defined in row four, and utf-8 encoding (see row 1), since we have the German umlauts “ä”, “ö” and “ü” in our file.

[Screenshot: “Statistische Bezirke.kml” opened in a text editor]

Let’s load the data into R. We’ll do this with a function from the rgdal-package.

If you get an error that says “Cannot open data source”, chances are there’s something wrong with your file name. Check that your working directory is properly set and that the file name is correct. Some browsers will change the .kml file type to .txt, or even just add the .txt ending so you get “filename.kml.txt”. You’ll usually find the value for the “layer” argument in your text file, named something like “name” or “id”, as shown above.

Did it work? Try to plot the polygons with the generic plot() function:

You should now see the black outlines of your polygons. Neat, isn’t it?

Next, we’ll need a little data to add to our map. To show you how to build a multilayer map, we will use two different csv files: student1 and student2

The data contains information on the percentage of 18 to 25 year olds living in Dortmund in 2000 and 2014. Download the files and save them to your journocode directory. Make sure they’re still named student1 and student2.

This can be tricky sometimes: For our data, the encoding is “latin1” and the separation marks are commas. Open the csv files with a text editor to check if your separator is a comma, a semicolon or even a slash.

If everything worked out for you, celebrate a little! You’re a big step closer to your multilayer map!

 

Now for the interactive part

After looking through your data and analyzing it, you will know how many values you have and which are the smallest and the biggest. For our example, we did that for you:

The highest value is 26%, so we can now think of a color scale from 0 to 26 to fill in our map. There are different statistical ways to decide what classes we want to divide our data into. For this mapping exercise, we will simply take eight classes: 0-5, 5-8, 8-10, 10-12, 12-14, 14-18, 18-24 and 24-26.
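To get a feeling for these classes, here is a sketch that sorts a few made-up percentages into the eight intervals with base R’s cut(). This is only a stand-in to illustrate the binning; for the map itself, break points like these can be handed to the bins argument of leaflet’s colorBin(), which may treat values exactly on a boundary differently.

```r
# The eight classes from the text, written as break points.
breaks <- c(0, 5, 8, 10, 12, 14, 18, 24, 26)

# Hypothetical percentages for four districts.
shares <- c(3.2, 9.5, 13, 26)

# cut() assigns each value to its interval; include.lowest keeps 0 in class 1.
classes <- cut(shares, breaks = breaks, include.lowest = TRUE)
print(classes)
print(table(classes))
```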

For every class, we want our map to fill the polygons in a different color. We’ll use a color vector generated with ColorBrewer here. Just copy the color code you want, put it in a vector and replace it in the code. To assign the colors to the classes, use the function colorBin(). This is where you’ll need the package leaflet, which we will use to build our map. Install it, if you haven’t already.

Next up is the little infowindow we want to pop up when we click on the map. As you can see, I used some html code to specify some parameters for the first popup. For the second popup, I used a simpler way.

paste0() does the same thing as paste() but with no default separator. Check ?paste0 for more info. If something doesn’t work, check the punctuation!
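A quick sketch of the difference, followed by a popup string built with paste0(). The district name and the percentage are made up, and the HTML tags are just one way to format such a popup.

```r
# paste() inserts a space by default, paste0() does not.
print(paste("a", "b"))
print(paste0("a", "b"))

# Hypothetical popup text for one district.
district <- "Innenstadt-West"
share <- 12.5
popup <- paste0("<strong>District: </strong>", district,
                "<br>Share of 18 to 25 year olds: ", share, "%")
print(popup)
```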

 

Now for the map

After that, we’ll start right away with puzzling together all the parts we need:

The %>% operator comes from the magrittr package and is made available by leaflet. Similar to the “+” in ggplot2, it’s used to link two functions together. So remember: If you have a “%>%” operator at the end of the line, R will expect more input from you.

The call to the function leaflet() starts the mapping process. The Provider Tile is your map base and background. If you don’t want to use the grey Tile in the example, have a look at this page and choose your own. Don’t worry if no map appears yet. With leaflet, you won’t see the actual map right away. First we’ll add the polygon layers and the popups we’ve defined to our map:

In our map, we want to be able to switch layers by clicking on a layer control panel with the group names. We’ll code that now:

Next, we want to add a thin color legend that shows the minimum and maximum value and the palette colors:

The big moment: did it work? No mistake with the brackets or the punctuation? You’ll find out by typing:

Congratulations! You made your first multilayer choropleth with R! Now have fun building multilayer maps of your own city/country or even the whole world! If you want to publish your map, make sure you have the “htmlwidgets” package installed and add the following code to your script:

This will create a directory named “mymap_files” and a file named “mymap.html”. Keep both in the same directory and upload them to your server. Et voilà: Your map is online!

If you publish a map based on our tutorial, feel free to link to our webpage and tell your fellows! We’d be delighted!

 


R crash course: Basic data structures


 

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.” (John M. Chambers)

Data structures in R are quite different from most programming languages. Understanding them is a necessity, because they define the way you’ll work with your data. Problems in understanding data structures will probably also produce problems in your code.

R crash course: Writing functions


As you know by now, R is all about functions. In the event that there isn’t one for the exact thing you want to do, you can even write your own! Writing your own functions is a very useful way to automate your work. Once defined, it’s easy to call new functions as often as you need. It’s a good habit to get into when programming with R — and with lots of other languages as well.

Defining a function uses another function simply called function(). Function names follow pretty much the same rules as variable names, so you can call them anything that would also be acceptable as a variable name.

Let’s try an easy example to see how function definitions work:

A function of questionable usefulness: It essentially does the same thing as print(). It takes an argument called x, and prints whatever you put as x to the console.

Theoretically, you can make your function take as many arguments as you want. Just write them in the parentheses of function(). You can call the arguments however you want, too. Also, your functions will probably often require more than one line. In that case, just put whatever you want your function to do in curly brackets {}. It will look somewhat like this:
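A sketch of such a two-argument function, reconstructed from the results quoted later in this post (squareadd(3, 2) gives 11), so treat the exact body as an assumption:

```r
# Squares the first argument and adds the second.
squareadd <- function(x, y) {
  result <- x^2 + y
  return(result)
}

print(squareadd(3, 2))
```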

Let’s mess with that one a bit! Run the following code line by line and try to guess what went wrong.

Possible errors while writing functions

Errors aren’t just a necessary evil in coding. By making mistakes, you get to know your programming language better and find out what works — and, of course, what doesn’t work. Let’s go through the errors one by one:

  • squareadd(3): You passed the function only one argument (3, which was attributed to the “x” argument) to work with when it expected two values, one for x and one for y.
  • squareadd(3,"two"): Now you passed the function two arguments, but one’s not a number. It’s a character, since it has quotes around it. But R can’t execute the function with a character. After all, what is 3^2 + "two" supposed to mean?
  • squareadd(3,two): No quotes this time in the second argument. Because the “y” argument is not in quotes and not a number, either, R assumes it’s a variable or some other object. Problem is: R can’t find the object called two anywhere.
  • After you define the object two to be equal to 2, though, R does find a matching object to put as an argument. So this time around, squareadd(3,two) should return the number 11.

After we change the function definition to include only the “x” argument, the errors we get change a little. Note that there’s still a “y” in the function body.

  • squareadd2(3,2): Other way around this time. Your function expected only one argument, but got two.
  • squareadd2(3): You passed the correct number of arguments, but R can’t find anything to use for the y in the function body, neither inside the function nor in the global environment.
  • This is why, after you defined y to be equal to four in the global environment, squareadd2(3) works fine and will return 13 (since 3^2 + 4 = 13).
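The last two points can be replayed with the sketch below, assuming a fresh session in which no object called y exists yet. The function body is again a reconstruction from the results quoted above. tryCatch() is only used here so the script keeps running after the error.

```r
# Only x is an argument; y has to come from somewhere else.
squareadd2 <- function(x) {
  x^2 + y
}

# Without a y anywhere, the call fails; we catch and print the error message.
msg <- tryCatch(squareadd2(3), error = function(e) conditionMessage(e))
print(msg)

# Once y exists in the environment where the function was defined, it works.
y <- 4
print(squareadd2(3))
```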

Scoping Rules in R

Some of the errors you’ll get, such as those in the last two lines, are due to something called the scoping rules of R. These rules define how R looks for the variables it needs to execute a function. It does that by looking through different environments — sub-spaces of your working environment that have their own variables and object definitions — in a certain order. There’s two basic types of scoping:

  • Lexical scoping: Looking for missing objects in the environment where the function was defined.
  • Dynamic scoping: Looking for missing objects in the environment where the function was called.

R uses lexical scoping. So if it doesn’t find the stuff it needs within the function (which, incidentally, has its own little environment), it goes on to look in the environment where the function was defined. In many cases, this will be the global environment, which is what you’re coding in if you’re not inside a specific function. If it doesn’t find what it needs there either, it will continue down the search list of environments. You can take a look at the list by typing search() into your console.

Let’s take a quick look at the difference between dynamic and lexical scoping. Look at the following code and try to guess its output. Execute it in RStudio and see if you’re right.

The output depends on the scoping rules your programming language uses. As you just learned, R uses lexical scoping. So if you call check(), a is set to FALSE only in the function environment of check(). But since istrue() was defined in the global environment, where a is still equal to TRUE, it will print “that’s right!” to your console. If R used dynamic scoping, it would go with a <- FALSE, since that is accurate for the environment where istrue() was called.
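The scenario described above can be reconstructed roughly like this. The function names istrue() and check() come from the text; the exact bodies and the “that’s wrong!” branch are our assumptions.

```r
a <- TRUE

# Defined at the top level, so it looks up a where it was defined.
istrue <- function() {
  if (a) "that's right!" else "that's wrong!"
}

# Sets its own local a, then calls istrue().
check <- function() {
  a <- FALSE
  istrue()
}

# Lexical scoping: istrue() ignores the local a inside check().
print(check())
```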

You don’t have to worry too much about the specifics of scoping rules and environments when starting to code, but it’s a useful thing to keep in mind. There’s lots of good info on scoping, searching and environments in R on the web, as well as more tutorials on writing your own functions. We’ll be putting together some resources on our website soon, so stay tuned for that.

But for now — well done! That was a lot of new info to process. print() yourself a “Good job!” to the console before you go on and practice writing some more functions. We’re looking forward to your coding experiences!

Bonus round: Can you count how often the word “function” appears in this text? Guess right and win a complimentary function congratulating you on your newly acquired coding skills.

 


R exercise: Analysing data


While using R for your everyday calculations is so much more fun than using your smartphone, that’s not the (only) reason we’re here. So let’s move on to the real thing: How to make data tell us a story.

First you’ll need some data. You haven’t learned how to get and clean data yet. We’ll get to that later. For now you can practice on this data set. The data journalists at Berliner Morgenpost used it to take a closer look at refugees in Germany and kindly put the clean data set online. You can also play around with your own set of data. Feel free to look for something entertaining on the internet – or in hidden corners of your hard drive. Remember to save your data in your working directory to save yourself some unnecessary typing.

Read your data set into R with read.csv(). For this you need a .csv file. Excel sheets can easily be saved as such.

Now you have a data frame. Name it anything you want. We’ll go with data. Check out class(data). It tells you what kind of object you have before you. In this case, it should return “data.frame”.
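As a sketch, the example below reads a tiny made-up table through the text argument of read.csv(), so it runs without a file. With the real data set, you would pass the file name instead. Only the column name “Asylantraege” comes from the suggested data set; the column “Land” and all values are invented.

```r
# Toy csv content; the numbers are made up for illustration.
csv_text <- "Land,Asylantraege
A,120
B,0
C,45"

# With a real file: data <- read.csv("yourfile.csv")
data <- read.csv(text = csv_text)
print(class(data))
```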

Time to play!

Remember, if you just type data and run that command, it will print the whole table to the console. That might be not exactly what you want if your dataset is very big. Instead, you can use the handy functions below to get an overview of your data.
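A few base R functions that are commonly used for such an overview, shown on a small made-up data frame (this selection is ours and may differ from the original list):

```r
# Toy data frame; the numbers are made up for illustration.
data <- data.frame(
  Land = c("A", "B", "C", "D"),
  Asylantraege = c(120, 0, 45, 3000)
)

print(head(data, 2))   # first two rows
print(tail(data, 2))   # last two rows
print(dim(data))       # number of rows and columns
print(names(data))     # column names
str(data)              # structure: column types and first values
print(summary(data))   # minimum, median, mean, maximum per column
```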

Try them and play around a little bit. Found anything interesting yet? Anything odd? In the data set we suggested, you’ll notice that the mean and the median are very different in the column “Asylantraege” (applications for asylum). What does that tell you?

Row and column indices

This is how you can take a closer look at a part of the whole set using indices. Indices are the numbers or names by which R identifies the rows or columns of data.

The last two alternatives only work if your columns have names. Use the function names() to look them up or change them.
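A sketch of the most common ways to index a data frame, again on made-up data. The last two lines are the name-based alternatives.

```r
# Toy data frame; the numbers are made up for illustration.
data <- data.frame(
  Land = c("A", "B", "C"),
  Asylantraege = c(120, 0, 45)
)

print(data[1, ])                 # first row, all columns
print(data[, 2])                 # second column, all rows
print(data[2, 2])                # second row, second column
print(data$Asylantraege)         # column by name with $
print(data[, "Asylantraege"])    # column by name in brackets
```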

Here are some more useful functions that will give you more information about the columns you’re interested in. Try them!

Subsets and Logic

Now you can take an even more detailed look by forming subsets, parts of your data that meet certain criteria. You’ll need the following logical operators.

Try to form different subsets of your data to find out interesting stuff. Check if it worked with View(), head(), tail(), etc.

Try to kick out all the rows that have “0” in the column “Asylantraege” (applications for asylum). Look at it again. What happened to mean and median?
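Here is a sketch of that exercise on made-up numbers, using the comparison operators != , > and < and the “and” operator &. Watch how mean and median move once the zeros are gone.

```r
# Toy data frame; the numbers are made up for illustration.
data <- data.frame(
  Land = c("A", "B", "C", "D", "E"),
  Asylantraege = c(120, 0, 45, 0, 3000)
)

print(mean(data$Asylantraege))
print(median(data$Asylantraege))

# Keep only rows where the value is not equal to 0.
nonzero <- data[data$Asylantraege != 0, ]
print(mean(nonzero$Asylantraege))
print(median(nonzero$Asylantraege))

# Conditions can be combined, e.g. with & for "and".
middle <- data[data$Asylantraege > 0 & data$Asylantraege < 1000, ]
print(middle)
```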

Get the answers you want

With everything you learned so far, you can start to get answers. See what questions about your data can be answered by forming data subsets. For example, if you used the data set we suggested: Where do most people seeking refuge in Germany come from?

We made a list of the ten most common countries of origin.


Ask your own questions. What do you want your data to tell you?

 


R crash course: Workspace, packages and data import


In this crash course section, we’ll talk about importing all sorts of data into R and installing fancy new packages. Also, we’ll get to know our way around the workspace.

Your workspace in R is like the desk you work at. It’s where all the data, defined variables and other objects you’re currently working with are stored. Like with a desk, you might want to clean it every once in a while and throw out stuff you don’t need any more. There’s a few useful commands to help you do that. Take a look and try them out:
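A sketch with the two base R functions typically used for this, ls() and rm(). The object names are made up.

```r
# Create two objects in the workspace.
scratch <- 1:3
keep_me <- "important"

# ls() lists everything that is currently in the workspace.
print(ls())

# rm() removes single objects.
rm(scratch)
print(exists("scratch"))

# To wipe the entire workspace at once, you could run:
# rm(list = ls())
```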