R exercise: Analysing data

While using R for your everyday calculations is so much more fun than using your smartphone, that’s not the (only) reason we’re here. So let’s move on to the real thing: How to make data tell us a story.
First you’ll need some data. You haven’t learned how to get and clean data yet; we’ll get to that later. For now you can practice on this data set. The data journalists at Berliner Morgenpost used it to take a closer look at refugees in Germany and kindly put the clean data set online. You can also play around with your own set of data. Feel free to look for something entertaining on the internet – or in hidden corners of your hard drive. Remember to save your data in your working directory to save yourself some unnecessary typing.
Read your data set into R with read.csv(). For this you need a .csv file. Excel sheets can easily be saved as such.
    data <- read.csv("filename.csv", header = TRUE, sep = ",")
Now you have a data frame. Name it anything you want. We’ll go with data. Check out class(data). It tells you what kind of object you have before you. In this case, it should return "data.frame".
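If you don’t have a file at hand yet, here is a minimal sketch of the whole round trip. The toy table and its numbers are made up for illustration (only the column names Land and Asylantraege are borrowed from the suggested data set). It writes a tiny .csv to a temporary file and reads it back in.

```r
# Made-up toy data, NOT the real figures from the suggested data set
toy <- data.frame(
  Land = c("A", "B", "C"),
  Asylantraege = c(0, 120, 4500)
)

# Write the toy table to a temporary .csv file ...
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# ... and read it back in, just like a file in your working directory
data <- read.csv(path, header = TRUE, sep = ",")
class(data)  # "data.frame"
```

With your own file, you would simply replace path with the file name in quotes.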
Time to play!
Remember, if you just type data and run that command, it will print the whole table to the console. That might not be exactly what you want if your data set is very big. Instead, you can use the handy functions below to get an overview of your data.
    View(data)          # Beware the capital letter! It opens a new tab showing your data set.
    dim(data)           # Tells you the dimensions of the object: first the number of rows, then columns.
    names(data)         # Gives you the names of the columns as a vector.
    str(data)           # Short for "structure". A very powerful function.
                        # Shows which class of data is in each column.
    head(data)          # Shows the first 6 rows.
    tail(data)          # Shows the last 6 rows.
    head(data, n = 3)   # Shows the first 3 rows.
    tail(data, n = 5)   # Shows the last 5 rows.
    summary(data)       # Shows basic summary statistics for each column:
                        # minimum, maximum, median, mean, 1st and 3rd quartile.
Try them and play around a little bit. Found anything interesting yet? Anything odd? In the data set we suggested, you’ll notice that the mean and the median are very different in the column “Asylantraege” (applications for asylum). What does that tell you?
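To get a feeling for what a big gap between mean and median can mean, here is a small sketch with made-up numbers (not taken from the data set): most values are small, one is huge.

```r
# Made-up numbers: lots of small values and a single very large one
x <- c(0, 0, 1, 2, 3, 5, 1000)

mean(x)    # pulled far upwards by the single large value
median(x)  # 2, the middle value, barely affected by the outlier
```

One possible reading: when the mean is much larger than the median, a few very large values are probably dominating the column.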
Row and column indices
This is how you can take a closer look at a part of the whole set using indices. Indices are the numbers or names by which R identifies the rows or columns of data.
    data[5, 2]       # The element in the fifth row and the second column.
    data[3, ]        # The third row and all columns, so the entire third row.
    data[, 2]        # The second column (of all rows).
    # Also try:
    data[, "Land"]   # Land is the name of the second column.
    # Or:
    data$Land        # The $ operator is a useful way to save on typing square brackets.
The last two alternatives only work if your columns have names. Use the function names() to look them up or change them.
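Here is a short sketch of looking up and changing column names with names(). The data frame toy and the new name applications are made up for illustration.

```r
# Made-up toy data frame
toy <- data.frame(Land = c("A", "B"), Asylantraege = c(10, 20))

names(toy)                        # look up the column names
names(toy)[2] <- "applications"   # rename the second column

toy$applications                  # the new name works with $ ...
toy[, "applications"]             # ... and inside square brackets
```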
Here are some more useful functions that will give you more information about the columns you’re interested in. Try them!
    levels(data$Land)   # Lists the unique values in that column.
                        # levels() only works for factor variables;
                        # for a character column it returns NULL, so try unique(data$Land).
    length(data$Land)   # Number of elements in a vector
                        # (or a row/column of a data frame, same thing)
    mean(data$Asylantraege)
    median(data$Asylantraege)
    min(data$Asylantraege)
    max(data$Asylantraege)
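A note on levels(): since R 4.0.0, read.csv() reads text columns as plain character vectors by default (stringsAsFactors = FALSE), so levels() may return NULL for a column like Land. The sketch below uses a made-up vector to show the difference and two ways around it.

```r
# Made-up character vector standing in for a text column
land <- c("A", "B", "A", "C")

levels(land)           # NULL, because land is not a factor
unique(land)           # "A" "B" "C", works for any vector
levels(factor(land))   # "A" "B" "C", after converting to a factor
```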
Subsets and Logic
Now you can take an even more detailed look by forming subsets, parts of your data that meet certain criteria. You’ll need the following logical operators.
    ==   # equals
    !=   # does not equal
    <=   # smaller or equal
    >=   # bigger or equal
    <    # smaller
    >    # bigger

    # You can use them in statements like this:
    zero <- data[data$Asylantraege == 0, ]
    # Forms a new data frame that only contains those rows
    # which have a zero in the column 'Asylantraege'.

    &    # and
    |    # or
    !    # not

    sort()    # sorts a vector
    order()   # returns the positions that would sort a vector;
              # use it as a row index to reorder a whole data set
Try to form different subsets of your data to find out interesting stuff. Check if it worked with View(), head(), tail(), etc.
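As a starting point, here is a sketch that combines conditions with & and | and shows the difference between sort() and order(). The toy data frame and the thresholds are made up for illustration.

```r
# Made-up toy data, not the real figures
toy <- data.frame(
  Land = c("A", "B", "C", "D"),
  Asylantraege = c(0, 120, 4500, 30)
)

# Rows with more than 0 AND fewer than 1000 applications
mid <- toy[toy$Asylantraege > 0 & toy$Asylantraege < 1000, ]

# Rows with exactly 0 OR more than 1000 applications
extremes <- toy[toy$Asylantraege == 0 | toy$Asylantraege > 1000, ]

sort(toy$Asylantraege)           # the sorted values: 0 30 120 4500
order(toy$Asylantraege)          # the row positions in sorted order: 1 4 2 3
toy[order(toy$Asylantraege), ]   # the whole data frame, sorted by that column
```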
Try to kick out all the rows that have “0” in the column “Asylantraege” (applications for asylum). Look at it again. What happened to mean and median?
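If you want to see the effect on a small scale first, here is a sketch with made-up numbers. It drops the zero rows with != and compares mean and median before and after.

```r
# Made-up toy data: half of the rows are zero
toy <- data.frame(
  Land = c("A", "B", "C", "D", "E", "F"),
  Asylantraege = c(0, 0, 0, 10, 20, 300)
)

# Keep only the rows where the column is not zero
nonzero <- toy[toy$Asylantraege != 0, ]

# Before: mean 55, median 5
c(mean = mean(toy$Asylantraege), median = median(toy$Asylantraege))

# After: mean 110, median 20
c(mean = mean(nonzero$Asylantraege), median = median(nonzero$Asylantraege))
```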
Get the answers you want
With everything you learned so far, you can start to get answers. See what questions about your data can be answered by forming data subsets. For example, if you used the data set we suggested: Where do most people seeking refuge in Germany come from?
We made a list of the ten most common countries of origin.
    ranked <- data[order(-data$Asylantraege), ]
    # We know, it's weird that order() is inside the index.
    # This tells R to order data by applications for asylum, largest first.
    rang <- 1:10
    country <- ranked$Land[1:10]
    applications <- ranked$Asylantraege[1:10]
    data2 <- data.frame(rang, country, applications)
    # Make a new data frame with these 3 columns.
    View(data2)
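A slightly shorter route to the same kind of ranking uses head() on the sorted data frame. This is only a sketch on made-up toy data (a top 2 instead of a top 10), not the way the list above was built.

```r
# Made-up toy data, not the real figures
toy <- data.frame(
  Land = c("A", "B", "C", "D"),
  Asylantraege = c(0, 120, 4500, 30)
)

# Sort by applications, largest first
ranked <- toy[order(-toy$Asylantraege), ]

# Take the first 2 rows and add a rank column
top2 <- head(ranked, n = 2)
top2$rang <- seq_len(nrow(top2))
rownames(top2) <- NULL   # reset the row names left over from sorting
top2
```

Because head() never returns more rows than exist, this also works when the data set has fewer rows than the n you ask for.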
Ask your own questions. What do you want your data to tell you?
{Credits for the awesome featured image go to Phil Ninh}
Comments
Looks like there might be a problem with the vector data$Asylantraege. Maybe something went wrong while reading in the data? min(numeric(0)) returns the same warning; so R thinks your vector does not contain any numeric elements. Try checking head(data$Asylantraege) to see if the numbers look right, or str(data$Asylantraege) to look at the format of the vector (should be numeric). If something doesn't look right there, try reading the data in again and double check the arguments to read.csv(). A more detailed look at that function can be found in our data import tutorial.