R crash course: Basic data structures

by Sakander Zirai

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.” (John M. Chambers)

Data structures in R are quite different from those in most programming languages. Understanding them is a necessity, because they define the way you’ll work with your data. Problems in understanding data structures will probably also produce problems in your code.

The five main data structures are:

  • Atomic vectors
  • Matrices
  • Arrays
  • Lists
  • Data frames

Data structures have different properties. “Homogeneous” structures contain only one type of data. A vector of strings (like “hello”, “good”, “morning”) or of numeric values (22, 23, 24) is an example of a homogeneous structure. “Heterogeneous” means the data structure can contain data of different types, as is the case with lists.

Atomic vectors

Vectors are the foundation of data structures. We’ve discussed how to work with them before, but we’ll cover a few structural specifics right here. Remember: In R, there are no scalars or 0-dimensional data. A number is actually a 1-dimensional vector of length one.
Homogeneous vectors are called atomic vectors. They are mostly of the type logical, integer, double or character and can be created with the infamous c() function, which combines elements. Look up more info on it with help("c").
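
A minimal sketch of the four common atomic vector types. The variable names are made up for illustration:

```r
# Four common atomic vector types, each built with c()
lgl <- c(TRUE, FALSE, TRUE)
int <- c(1L, 2L, 3L)                  # the L suffix marks integers
dbl <- c(22, 23, 24)                  # plain numbers are doubles
chr <- c("hello", "good", "morning")

typeof(lgl)  # "logical"
typeof(int)  # "integer"
typeof(dbl)  # "double"
typeof(chr)  # "character"

# A single number is just a vector of length one
length(42)     # 1
is.vector(42)  # TRUE
```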

To access a value of a vector, type vector_name[index]. In many other programming languages indexing begins at 0, but in R a vector’s index starts at 1, so be careful.
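
A short sketch of 1-based indexing, using an example vector invented for this purpose:

```r
greeting <- c("hello", "good", "morning")

greeting[1]        # "hello", the first element
greeting[3]        # "morning"
greeting[c(1, 3)]  # "hello" "morning", several elements at once
greeting[0]        # character(0): index 0 selects nothing
```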

If you want to combine two atomic vectors, the resulting vector needs to contain data of one and the same type. R automatically tries to convert the types for you, so the resulting vector will have the most flexible type. There is a defined hierarchy of flexibility in R:

logical<integer<double<character

Some examples:

  • TRUE will be interpreted as a numerical 1
  • FALSE will be interpreted as a numerical 0
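
The hierarchy can be checked by combining values of different types and asking for the type of the result. The values below are only illustrative:

```r
typeof(c(TRUE, 2L))     # "integer":   logical gives way to integer
typeof(c(2L, 3.5))      # "double":    integer gives way to double
typeof(c(3.5, "text"))  # "character": double gives way to character

c(TRUE, FALSE, 5)          # 1 0 5
c(TRUE, "a")               # "TRUE" "a"
sum(c(TRUE, FALSE, TRUE))  # 2, because each TRUE counts as 1
```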

NA is a legitimate value in all vectors. In an atomic vector, it is treated as having the type of the vector itself.
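
A quick sketch of how NA adapts to the vector it sits in:

```r
typeof(NA)           # "logical" when NA stands alone
typeof(c(1.5, NA))   # "double"
typeof(c(1L, NA))    # "integer"
typeof(c("a", NA))   # "character"

is.na(c(1.5, NA))    # FALSE TRUE
```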

Because a numeric value is more flexible than a logical value, a vector that mixes the two will be of type double.
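
As a sketch, take a vector that mixes logical and numeric values. The name which_type comes from the text, but its contents here are an assumption:

```r
# Assumed contents: two logical and two numeric values
which_type <- c(TRUE, FALSE, 1, 0)

which_type          # 1 0 1 0
typeof(which_type)  # "double"
```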

But what if you want R to interpret the 0 and 1 in the vector as logical?
as.logical(which_type) will do the job. Telling R explicitly which type a vector should be converted to can avoid confusion.
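
A minimal sketch of the explicit conversion, again assuming which_type holds a mix of logical and numeric values:

```r
which_type <- c(TRUE, FALSE, 1, 0)  # assumed contents, stored as double

as_lgl <- as.logical(which_type)
as_lgl          # TRUE FALSE TRUE FALSE
typeof(as_lgl)  # "logical"
```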
What will happen if you try to convert a vector with numbers other than 0 or 1 into a logical vector? Let’s try!
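
One way to try it, with a few made-up numbers:

```r
numbers <- c(-2, 0, 0.5, 3)

as.logical(numbers)  # TRUE FALSE TRUE TRUE
```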

R treats every number except 0 as TRUE! This is not self-evident: it would also be plausible, for example, to convert 0 and negative numbers to FALSE and only positive numbers to TRUE.

There are also examples where R does not know how to convert elements. For example:
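
A sketch of such a case, assuming a character vector in which one element is not a number:

```r
mixed <- c("1", "2.5", "a string")  # illustrative values

# Without suppressWarnings(), R warns "NAs introduced by coercion"
converted <- suppressWarnings(as.numeric(mixed))

converted          # 1.0 2.5 NA
typeof(converted)  # "double"
```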

In this example, there is no sensible way to convert the characters to a number. So R replaces the string “a string” with NA, since NA does not interfere with the type of the resulting vector.

Why does R evaluate 1 == "1" to TRUE? Remember, everything that happens is a function call, so 1 == "1" is essentially a call to the function "==" with the arguments (1, "1"). R converts the number 1 to the character "1", because characters are more flexible than double numbers, so R will actually calculate "1" == "1", which evaluates to TRUE.
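
The comparison can be spelled out both as an operator and as an ordinary function call. The last line is an extra illustration of why relying on this coercion can surprise you:

```r
1 == "1"           # TRUE
`==`(1, "1")       # TRUE, the same call written in prefix form
as.character(1)    # "1", what the number is converted to

identical(1, "1")  # FALSE: identical() does not coerce types
1 == "1.0"         # FALSE: "1" and "1.0" are different strings
```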

Matrices

Everything you’ve just learned about vectors applies to matrices as well. A matrix in R is just a vector cut up into pieces of the same length. So they’re homogeneous as well, but they’re 2-dimensional. You create them by passing a vector to the matrix() function and specifying how many rows and columns it should have. Technically, you only have to specify either the rows or the columns if R can compute the other component from the number of elements the matrix should contain. For example, a matrix with the elements 1 to 6, 3 columns and 2 rows can be created from the vector 1:6. Notice that the number of elements should equal the product of the dimensions: 6 = 3*2
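
A minimal sketch of that matrix. Note that R fills it column by column:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

dim(m)   # 2 3
m[2, 3]  # 6, row 2 and column 3

# Giving only the columns is enough, R works out the rows
m2 <- matrix(1:6, ncol = 3)
identical(m, m2)  # TRUE
```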

Arrays

Arrays go one step further: they’re vectors with more than two dimensions. As with a matrix, you create them by passing a vector to the array() function and specifying the rows, columns and, in the case of a 3-dimensional array, the layers. For example, an array with the elements 1 to 16 can have the dimensions 2, 4 and 2. Again, the number of elements equals the product of the dimensions: 16 = 2*4*2
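
Sketched in code, with the dimensions passed through the dim argument of array():

```r
# 2 rows, 4 columns, 2 layers
a <- array(1:16, dim = c(2, 4, 2))

dim(a)      # 2 4 2
a[1, 1, 1]  # 1, first element of the first layer
a[1, 1, 2]  # 9, first element of the second layer
a[2, 4, 2]  # 16, the very last element
```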

 

Lists

Vectors and matrices are fine for a lot of purposes. But what if you want to store different types of data? Lists are handy for this task. A list can have elements of different types, even other lists.

Because of that, lists are called a recursive data structure.
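
A small sketch of a nested list. The name long_list is taken from the text, while its contents and the element names in the second list are assumptions:

```r
# A list holding another list and an atomic vector
long_list <- list(list(1, "two"), c(3, 4))

length(long_list)       # 2
typeof(long_list[[1]])  # "list"
typeof(long_list[[2]])  # "double", the vector stays a vector

# The same structure with (hypothetical) element names
named <- list(inner = list(1, "two"), numbers = c(3, 4))
names(named)            # "inner" "numbers"
```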

A list long_list built from a list and the atomic vector c(3,4) would have two elements. c(3,4) would not be converted to a list though, because lists can contain different types at the same time. You don’t have to name your list elements, but it may come in handy.

You can address the elements of a list by their indices in multiple different ways.
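
A sketch of the difference between single and double brackets, using a made-up list:

```r
named <- list(inner = list(1, "two"), numbers = c(3, 4))

named[2]            # single brackets: a list of length one
named[[2]]          # double brackets: the element itself, 3 4
named[["numbers"]]  # the same element, addressed by name

named[[2]][1]       # 3, first value of the vector
named[[1]][[2]]     # "two", second element of the nested list
```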

The “$” operator can address list elements by name and can save you quite a bit of typing.
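
For instance, with a hypothetical named list:

```r
named <- list(inner = list(1, "two"), numbers = c(3, 4))

named$numbers     # 3 4
named$numbers[2]  # 4

# $ is shorthand for [[ ]] with a name
identical(named$numbers, named[["numbers"]])  # TRUE
```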

Data frames

We have one important structure left: The data frame. A data frame is the data structure you’ll probably use most. It’s how we save tabular data. It works pretty much like a list — because it is one. Data frames in R are just lists where all elements have the same length. You’ve already gotten to know them a bit in our exercise on analyzing data.
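
A minimal sketch of creating a data frame. The column values are made up, and stringsAsFactors = TRUE is set explicitly so that the strings become a factor in any R version:

```r
data_frame <- data.frame(
  x = c(1.5, 2.5, 3.5),
  y = c("a", "b", "c"),
  stringsAsFactors = TRUE
)

str(data_frame)      # 3 obs. of 2 variables
class(data_frame$x)  # "numeric"
class(data_frame$y)  # "factor"
```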

Take a data frame with two columns: y, a factor variable, and x, a numeric vector. Strings are only stored as factors if stringsAsFactors = TRUE. Before R 4.0.0 that was the default, so you had to set stringsAsFactors = FALSE if you didn’t want strings to be interpreted as factors. As mentioned, under the hood a data frame is a list of equal-length vectors.
Asking for the type of the data frame with typeof(data_frame) will tell you it’s a list. If you want to check whether what you work with is really a data frame, use the function class() or is.data.frame().
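
These checks can be sketched on a small example data frame (contents assumed):

```r
data_frame <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(data_frame)         # "list"
is.list(data_frame)        # TRUE
class(data_frame)          # "data.frame"
is.data.frame(data_frame)  # TRUE

# Every column (list element) has the same length
sapply(data_frame, length)  # x: 3, y: 3
```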

With cbind(data_frame, data.frame(`4th` = 1:3, check.names = FALSE)), you can add a new column with the name ‘4th’ and the elements 1, 2, 3, and
rbind(data_frame, data.frame(`1st` = 4, `2nd` = 4, `3rd` = 4, check.names = FALSE)) adds a row to the end of the data frame. Because names like 1st do not start with a letter, they have to be wrapped in backticks, and check.names = FALSE keeps data.frame() from rewriting them.

data_frame$`1st` gets you the vector for the column named ‘1st’, so data_frame$`1st`[1] directly gives you the first element of that column.
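
Put together as a sketch, assuming a data frame with the columns ‘1st’, ‘2nd’ and ‘3rd’ (the values are invented). Names that start with a digit need backticks, and check.names = FALSE stops data.frame() from changing them:

```r
data_frame <- data.frame(`1st` = 1:3, `2nd` = 4:6, `3rd` = 7:9,
                         check.names = FALSE)

# Add a column
wider <- cbind(data_frame, data.frame(`4th` = 1:3, check.names = FALSE))
names(wider)  # "1st" "2nd" "3rd" "4th"

# Add a row; the column names have to match
longer <- rbind(data_frame,
                data.frame(`1st` = 4, `2nd` = 4, `3rd` = 4,
                           check.names = FALSE))
nrow(longer)  # 4

data_frame$`1st`     # 1 2 3
data_frame$`1st`[1]  # 1
```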

Attributes

All objects can have additional attributes. They might give you meta information about your object. One important attribute is the names attribute, which you read and set with names(). It lets you define names for the elements of your vector. For example, names(v) <- c("x", "y", "z") gives a three-element vector v the element names x, y and z. Now we can access the elements not just by index, but also by name: if the third element is "3rd", v["z"] outputs "3rd".
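
A short sketch of a named vector. The first two values are assumptions, only the third is given in the text:

```r
v <- c("1st", "2nd", "3rd")
names(v) <- c("x", "y", "z")

v["z"]          # the element named z, "3rd"
v[3]            # the same element by index
unname(v["z"])  # "3rd" without the name attached

attributes(v)   # $names: "x" "y" "z"
```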

Another special attribute for vectors is the dim attribute, set with dim(). It changes the dimensionality of a vector, transforming it into a matrix if it has 2 dimensions or into an array if it has more than 2. Remember how we created a matrix with the matrix() function? You could do the same by adding the dim attribute to a vector:
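
A minimal sketch, assuming the 2-by-3 matrix with the elements 1 to 6 from the matrix section:

```r
m <- 1:6
dim(m) <- c(2, 3)  # 2 rows, 3 columns

is.matrix(m)  # TRUE
identical(m, matrix(1:6, nrow = 2, ncol = 3))  # TRUE
```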

Same with the array we created:
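
Again as a sketch, assuming the 2*4*2 array with the elements 1 to 16:

```r
a <- 1:16
dim(a) <- c(2, 4, 2)  # rows, columns, layers

is.array(a)  # TRUE
identical(a, array(1:16, dim = c(2, 4, 2)))  # TRUE
```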

Well done!

This is probably a lot to chew on, but knowing your data structures can save you lots of headaches later. Checking if the data type fits is always a good place to start when dealing with errors in your code. Of course, we can’t cover every last detail in this tutorial. But there are lots more great resources out there. Go check them out!

 

{Credits for the awesome featured image go to Phil Ninh}
