# R crash course: Basic data structures

Data structures in R are quite different from those in most programming languages. Understanding them is a necessity, because they define the way you’ll work with your data. If you misunderstand the data structures, you will probably also run into problems in your code.

The five main data structures are:

- Atomic vectors
- Matrices
- Arrays
- Lists
- Data frames

Data structures have different properties. “Homogeneous” structures contain only one type of data: a vector of strings (“hello”, “good”, “morning”) or a vector of numeric values (22, 23, 24) is homogeneous. “Heterogeneous” structures can contain data of different types, as is the case with lists.
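To make the difference concrete, here is a minimal sketch. The variable names (*numbers*, *words*, *mixed_bag*) are made up for illustration:

```r
# Homogeneous: every element of an atomic vector shares one type
numbers <- c(22, 23, 24)
words <- c("hello", "good", "morning")
typeof(numbers)  # "double"
typeof(words)    # "character"

# Heterogeneous: a list keeps each element's own type
mixed_bag <- list(22, "hello", TRUE)
sapply(mixed_bag, typeof)  # "double" "character" "logical"
```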

## Atomic vectors

Vectors are the foundation of data structures. We’ve discussed how to work with them before, but we’ll cover a few structural specifics right here. Remember: in R, there are no scalars or 0-dimensional data. A single number is actually a 1-dimensional vector of length one.
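You can check the "no scalars" claim yourself. A quick sketch, where *x* is just an example name:

```r
x <- 5
length(x)           # 1
is.vector(x)        # TRUE
x[1]                # 5, a single number can be indexed like any vector
identical(x, c(5))  # TRUE, wrapping it in c() changes nothing
```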

Homogeneous vectors are called atomic vectors. They are mostly of the type *logical*, *integer*, *double* or *character* and can be created with the infamous *c()* function, which **c**ombines elements. Look up more info on it with *help('c')*.

To access a value of a vector, type *vector_name[index]*. In many other programming languages indices begin at 0, but in R vector indices start at *1*, so be careful.

```r
vector <- c('1st','2nd','3rd')
vector[1]
#[1] "1st"
```
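A few more indexing cases are worth trying once, especially if you come from a 0-based language. This is only a sketch, and the name *ranks* is made up for illustration:

```r
ranks <- c('1st', '2nd', '3rd')
ranks[1]        # "1st", the first element
ranks[0]        # character(0), index 0 returns an empty vector and no error
ranks[4]        # NA, indexing past the end yields NA
ranks[-1]       # "2nd" "3rd", a negative index drops that element
ranks[c(1, 3)]  # "1st" "3rd", an index can itself be a vector
```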

If you want to combine two atomic vectors, the resulting vector needs to contain data of one and the same type. R automatically tries to convert the types for you, so the resulting vector will have the most flexible type. There is a defined hierarchy of flexibility in R:

*logical<integer<double<character*

Some examples:

```r
which_type <- c('string', 1)
typeof(which_type[2])
#[1] "character"
which_type[2] + 2
#Error in which_type[2] + 2 : non-numeric argument to binary operator

which_type <- c(TRUE, 1)
typeof(which_type[1])
#[1] "double"
```

- TRUE will be interpreted as a numerical 1
- FALSE will be interpreted as a numerical 0
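This coercion is handy in practice: you can count or average logical values directly. A small sketch, with *passed* as a hypothetical example vector:

```r
passed <- c(TRUE, FALSE, TRUE, TRUE)
sum(passed)   # 3, the number of TRUE values
mean(passed)  # 0.75, the share of TRUE values
TRUE + TRUE   # 2
```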

*NA* is a legitimate value in all vectors. In an atomic vector, it is treated as having the type of the vector itself.

```r
double_v <- c(1, NA ,3)
char_v <- c('1st', NA, '3rd')

typeof(double_v[2])
#[1] "double"

typeof(char_v[2])
#[1] "character"
```
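On its own, a plain *NA* is logical, and R also provides typed variants. The sketch below shows them, along with two common ways of dealing with missing values:

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

double_v <- c(1, NA, 3)
is.na(double_v)               # FALSE TRUE FALSE
mean(double_v)                # NA, missing values propagate
mean(double_v, na.rm = TRUE)  # 2, drop them explicitly
```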

Back to *which_type <- c(TRUE, 1)*: because a numeric value is more flexible than a logical one, the whole vector is of type *double*.

But what if you want R to interpret the 0 and 1 in the vector as logical?

*as.logical(which_type)* will do the job. Telling R explicitly which type a vector should be converted to can avoid confusion.
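Spelled out, the explicit conversion might look like this. The name *as_flags* is made up for illustration:

```r
which_type <- c(TRUE, 1)
typeof(which_type)  # "double", TRUE was coerced to 1

as_flags <- as.logical(which_type)
typeof(as_flags)    # "logical"
as_flags            # TRUE TRUE
```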

What will happen if you try to convert a vector with numbers other than 0 or 1 into a logical vector? Let’s try!

```r
v1 <- c(TRUE,-1,0,1,2)
v1 <- as.logical(v1)

v1[1] # TRUE
v1[2] # TRUE
v1[3] # FALSE
v1[4] # TRUE
v1[5] # TRUE
```

R treats every number except 0 as TRUE! This is not obvious: it would also be plausible to convert 0 and all negative numbers to *FALSE* and only positive numbers to *TRUE*.
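If you want a different rule, say "only positive numbers count as TRUE", write the comparison out instead of relying on *as.logical()*. A minimal sketch:

```r
x <- c(-1, 0, 1, 2)
as.logical(x)  # TRUE FALSE TRUE TRUE, everything except 0 is TRUE
x != 0         # TRUE FALSE TRUE TRUE, the same rule written explicitly
x > 0          # FALSE FALSE TRUE TRUE, only positive numbers are TRUE
```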

There are also examples where R does not know how to convert elements. For example:

```r
mixed <- c('a string', 10)
typeof(mixed)
# [1] "character"

as.double(mixed)
# [1] NA 10
# Warning message: NAs introduced by coercion
```

In this example, there is no sensible way to convert the string to a number. So R replaces “a string” with *NA*, since *NA* does not interfere with the type of the resulting vector.
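When you clean real data, you usually want to know which entries failed to convert. One possible approach, sketched with made-up names (*raw*, *converted*, *failed*), is to compare the missing values before and after the conversion:

```r
raw <- c('a string', '10', '3.5')

# suppressWarnings() hides the coercion warning; we check for NA ourselves
converted <- suppressWarnings(as.numeric(raw))
converted    # NA 10.0 3.5

# An entry failed if it is NA now but was not NA before
failed <- is.na(converted) & !is.na(raw)
raw[failed]  # "a string"
```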

```r
1=='1'
# TRUE.
```

Why is R evaluating *1=='1'* to TRUE? Remember, everything that happens is a function call, so *1=='1'* is essentially a call to the function *'=='* with the arguments *(1, '1')*. R converts both arguments to character, because character is more flexible than double, so it actually compares *'1'=='1'*, which evaluates to TRUE.
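You can call the operator as an ordinary function to see this for yourself. If you need a comparison that does not coerce, *identical()* is one option:

```r
`==`(1, '1')        # TRUE, the same call as 1 == '1'
as.character(1)     # "1", what the number becomes before comparing
identical(1, '1')   # FALSE, identical() also compares the types
```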

## Matrices

Everything you’ve just learned about vectors applies to matrices as well. A matrix in R is just a vector cut up into pieces of the same length. So matrices are homogeneous as well, but they’re 2-dimensional. You create them by passing a vector to the *matrix()* function and specifying how many rows and columns it should have. Technically, you only have to specify either the rows or the columns, because R can compute the other from the number of elements. For example, the following code will create a matrix with the elements 1 to 6, with 3 columns and 2 rows. Notice that the number of elements has to equal the product of the dimensions: 6 = 3 × 2.

```r
m <- matrix(1:6, ncol=3, nrow=2)
m[2,3] # Element at 2nd row, 3rd column.
m[2,]  # Vector of the 2nd row
m[,3]  # Vector of the 3rd column
```
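Two details are easy to miss: R fills a matrix column by column, and it can infer the missing dimension. A short sketch, where *m_auto* and *m_rows* are made-up names:

```r
# Only ncol is given; R works out that 6 elements / 3 columns = 2 rows
m_auto <- matrix(1:6, ncol = 3)
dim(m_auto)   # 2 3
m_auto[2, ]   # 2 4 6, filled column by column
m_auto[2, 3]  # 6

# byrow = TRUE fills row by row instead
m_rows <- matrix(1:6, ncol = 3, byrow = TRUE)
m_rows[2, ]   # 4 5 6
```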

## Arrays

Arrays go one step further: they’re vectors with more than two dimensions. As with a matrix, you create them by passing a vector to the *array()* function and specifying the number of rows, columns and, in the case of a 3-dimensional array, layers. This code will create an array with the elements 1 to 16. Again, the number of elements equals the product of the dimensions: 16 = 2 × 4 × 2.

```r
a <- array(1:16, c(2,4,2))
a[1,1,1] # 1
a[2,1,1] # 2
a[1,2,1] # 3
# ...
a[1,4,2] # 15
a[2,4,2] # 16
```
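Leaving an index empty selects everything along that dimension, just as with matrices. The sketch below pulls the whole second layer out of the array. The name *layer2* is made up for illustration:

```r
a <- array(1:16, c(2, 4, 2))
dim(a)        # 2 4 2
length(a)     # 16

layer2 <- a[, , 2]  # all rows and columns of the 2nd layer
dim(layer2)   # 2 4, a plain matrix
layer2[1, 1]  # 9, the first element of the second layer
```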

## Lists

Vectors and matrices are fine for a lot of purposes. But what if you want to store different types of data? *Lists* are handy for this task. A list can have elements of different types, even other lists.

```r
a_list <- list(list(list(list())))
str(a_list)
# List of 1
#  $ :List of 1
#   ..$ :List of 1
#   .. ..$ : list()
```

Because of that, lists are called a *recursive* data structure.

```r
long_list <- list(first = list(1, 2), second = c(3, 4))
```

Here, *long_list* is a list of two elements: a list and an atomic vector. *c(3,4)* is not converted to a list, because lists can contain different types at the same time. You don’t have to name your list elements, but it may come in handy.

You can address the elements of a list in several ways, by *index* or by name.

```r
long_list[[1]]
#returns the entire first element of the list
long_list[[2]][1]
#returns the first element of the second list element
#in this case, that's the number 3, the first element of the vector in our list.
long_list$second
#will return the second list element. only works with named elements.
```

The *“$”* operator can address list elements by name and can save you quite a bit of typing.
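A common stumbling block is the difference between single and double brackets: single brackets return a smaller list, double brackets return the element itself. A short sketch using the same *long_list*:

```r
long_list <- list(first = list(1, 2), second = c(3, 4))

typeof(long_list$first)   # "list"
typeof(long_list$second)  # "double", the vector was not turned into a list

class(long_list[2])       # "list", single brackets keep the list wrapper
long_list[[2]]            # 3 4, double brackets return the element itself
long_list[["second"]]     # 3 4, the same element, addressed by name
```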

## Data frames

We have one important structure left: The data frame. A data frame is the data structure you’ll probably use most. It’s how we save tabular data. It works pretty much like a list — because it is one. Data frames in R are just lists where all elements have the same length. You’ve already gotten to know them a bit in our exercise on analyzing data.

```r
data_frame <- data.frame(
  x = 1:3,
  y = c("1st", "2nd", "3rd"),
  stringsAsFactors = TRUE)
# stringsAsFactors = TRUE was the default before R 4.0.0; since 4.0.0 the default is FALSE
```

This code will create a data frame with two columns: *x*, an integer vector, and *y*, a factor variable. Set *stringsAsFactors = FALSE* if you don’t want strings to be interpreted as factors (since R 4.0.0, that is the default). As mentioned, under the hood a data frame is a list of equal-length vectors.
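To see what the setting changes, build the same data frame both ways and compare the class of *y*. The names *df_factor* and *df_char* are made up for this sketch:

```r
df_factor <- data.frame(x = 1:3, y = c("1st", "2nd", "3rd"),
                        stringsAsFactors = TRUE)
df_char   <- data.frame(x = 1:3, y = c("1st", "2nd", "3rd"),
                        stringsAsFactors = FALSE)

class(df_factor$y)   # "factor"
levels(df_factor$y)  # "1st" "2nd" "3rd"
class(df_char$y)     # "character"
```

Passing the argument explicitly keeps the result the same across R versions.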

Asking for the type of the data frame with *typeof(data_frame)* will tell you it’s a list. If you want to check whether what you’re working with is really a data frame, use the function *class()* or *is.data.frame()*.
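A quick sketch of these checks, run on the data frame from above:

```r
data_frame <- data.frame(x = 1:3, y = c("1st", "2nd", "3rd"),
                         stringsAsFactors = TRUE)

typeof(data_frame)         # "list"
class(data_frame)          # "data.frame"
is.data.frame(data_frame)  # TRUE
is.list(data_frame)        # TRUE, a data frame is still a list
length(data_frame)         # 2, the number of columns (list elements)
```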

With *cbind(data_frame, data.frame(z = 1:3))*, you can add a new column named *z* with the elements 1, 2, 3, and *rbind(data_frame, data.frame(x = 4, y = '4th'))* adds a row to the end of the data frame. The new row must use the same column names as the data frame.

*data_frame$y* gets you the vector for the column named *y*, so *data_frame$y[1]* directly gives you the first element of that column.
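Here is a runnable sketch of adding a column, adding a row and extracting a column. It assumes the two-column data frame from above with columns *x* and *y*. The column *z* and the names *wider* and *longer* are made up for illustration, and strings are kept as characters to keep the output simple:

```r
data_frame <- data.frame(x = 1:3, y = c("1st", "2nd", "3rd"),
                         stringsAsFactors = FALSE)

# Add a column: the new data frame must have the same number of rows
wider <- cbind(data_frame, data.frame(z = c(TRUE, FALSE, TRUE)))
names(wider)     # "x" "y" "z"

# Add a row: the new data frame must have the same column names
longer <- rbind(data_frame,
                data.frame(x = 4L, y = "4th", stringsAsFactors = FALSE))
dim(longer)      # 4 2

data_frame$y     # "1st" "2nd" "3rd"
data_frame$y[1]  # "1st"
```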

**Attributes**

All objects can have additional attributes. They can hold meta information about your object. One important attribute is *names*, which you read and set with the *names()* function. It lets you define names for the elements of your vector. For example,

```r
v <- c(x='1st', y='2nd', z='3rd')
```

will create a vector with the element names x, y, and z. Now we can access the elements not just by index, but also by name: *v['z']* returns '3rd'.
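A short sketch of reading and changing names on such a vector:

```r
v <- c(x = '1st', y = '2nd', z = '3rd')

names(v)       # "x" "y" "z"
v['z']         # the element "3rd", still labelled with its name z
v[['z']]       # "3rd", double brackets drop the name
attributes(v)  # a list with a single entry, $names

names(v)[3] <- 'last'  # names can be changed after creation
names(v)       # "x" "y" "last"
```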

Another special attribute for vectors is *dim*, set with the *dim()* function. It changes the dimensionality of a vector, transforming it into a matrix if it has 2 dimensions or into an array if it has more than 2. Remember how we created a matrix with the *matrix()* function? You could do the same by setting the *dim* attribute of a vector:

```r
m <- 1:6
dim(m) <- c(2,3) # 2 rows, 3 columns, like the matrix above
```

Same with the array we created:

```r
a <- 1:16
dim(a) <- c(2,4,2)
```
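To convince yourself that setting *dim* really produces the same object as *matrix()*, compare the two with *identical()*. In this sketch, *x* is an arbitrary example name:

```r
x <- 1:6
dim(x) <- c(2, 3)
identical(x, matrix(1:6, nrow = 2, ncol = 3))  # TRUE
attributes(x)  # a list with a single entry, $dim

# Removing the attribute turns the matrix back into a plain vector
dim(x) <- NULL
identical(x, 1:6)  # TRUE
```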

**Well done!**

This is probably a lot to chew on, but knowing your data structures can save you lots of headaches later. Checking whether the data type fits is always a good place to start when dealing with errors in your code. Of course, we can’t cover every last detail in this tutorial. But there are lots more great resources out there. Go check them out!

*{Credits for the awesome featured image go to Phil Ninh}*
