Data Structures

Adapted from Software Carpentry

Overview

Questions:

  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?

Objectives:

  • To be able to identify the 5 main data types
  • To begin exploring data frames, and understand how they are related to vectors and lists
  • To be able to ask R about the type, class, and structure of an object
  • To understand what the attributes “names”, “class”, and “dim” tell us about an object

Creating and Reading Data

One of R’s most powerful features is its ability to deal with tabular data - such as you may already have in a spreadsheet or a CSV file. Let’s start by making a toy dataset in your data/ directory, called feline-data.csv:
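A minimal sketch of such a toy dataset follows. The column names and values are illustrative assumptions, not data taken from a real file.

```r
# Toy data set; column names and values are assumed for illustration.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
cats  # print the data frame
```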

We can now save cats as a CSV file. It is good practice to name the arguments explicitly, so it is clear which value each argument receives:
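One way to do this is sketched below. The toy values are assumptions repeated here so the snippet runs on its own, and `row.names = FALSE` is a choice made for this example to keep R's row numbers out of the file.

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

dir.create("data", showWarnings = FALSE)  # make sure data/ exists
write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
```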

We can load this into R via the following:
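A sketch of the loading step. The file is written first with `writeLines()` only so that the example runs on its own; its contents are the assumed toy values.

```r
# Write the assumed file contents so the example is self-contained.
dir.create("data", showWarnings = FALSE)
writeLines(c("coat,weight,likes_string",
             "calico,2.1,1",
             "black,5.0,0",
             "tabby,3.2,1"),
           "data/feline-data.csv")

# Read the CSV file into a data frame.
cats <- read.csv(file = "data/feline-data.csv")
cats
```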

Editing Text Files in R

Alternatively, you can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File → New File → Text File menu item.

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:
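For example, assuming the toy values below for cats:

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

cats$weight  # the weight column as a numeric vector
cats$coat    # the coat column
```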

We can do other operations on the columns:
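Two such operations are sketched here, again on assumed toy values. The names `heavier` and `sentences` are hypothetical, chosen for this example.

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

heavier <- cats$weight + 2                   # adds 2 to every weight
sentences <- paste("My cat is", cats$coat)   # one string per cat
heavier
sentences
```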

But what about:
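The expression in question adds a numeric column to a text column. In this sketch it is wrapped in `try()` only so that the snippet keeps running, and `stringsAsFactors = FALSE` is set explicitly so the coat column is character regardless of R version. The toy values are assumptions.

```r
# Assumed toy data; coat is forced to be character, not factor.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# try() captures the failure instead of stopping the script.
result <- try(cats$weight + cats$coat, silent = TRUE)
inherits(result, "try-error")  # TRUE: the addition failed
```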

This will return an error because 2.1 plus "black" is nonsense!

Data Types

Understanding data types is key to successfully analyzing data in R. We can ask what type of data something is:
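The `typeof()` function answers this question. A short sketch with literal values chosen for illustration:

```r
typeof(3.14)      # "double"
typeof(1L)        # "integer" (the L suffix marks an integer)
typeof(1+1i)      # "complex"
typeof(TRUE)      # "logical"
typeof("banana")  # "character"
```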

There are 5 main types: double, integer, complex, logical and character. For historic reasons, double is also called numeric.

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.

Type Coercion in Data Frames

A user has provided details of another cat. We can add an additional row to our cats table:
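One way to add the row is `rbind()`, sketched below. The name `additional_cat` and the coat value are hypothetical; the uncertain weight is entered as the text "2.3 or 2.4".

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# The new cat's weight was reported as text, not as a number.
additional_cat <- data.frame(coat = "tabby",
                             weight = "2.3 or 2.4",
                             likes_string = 1)
cats2 <- rbind(cats, additional_cat)
cats2
```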

Let’s check what type of data we find in the weight column:
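A sketch of that check, rebuilding cats2 from assumed toy values so the snippet runs on its own:

```r
# Rebuild cats2: three numeric weights plus one weight given as text.
cats2 <- rbind(
  data.frame(coat = c("calico", "black", "tabby"),
             weight = c(2.1, 5.0, 3.2),
             likes_string = c(1, 0, 1)),
  data.frame(coat = "tabby", weight = "2.3 or 2.4", likes_string = 1)
)

typeof(cats2$weight)  # "character"
```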

Oh no! A given column in a data frame cannot be composed of different data types. When R can’t store everything as numbers (because of “2.3 or 2.4”), the entire column changes to character type.

Vectors and Type Coercion

To better understand this behavior, let’s meet another of the data structures: the vector.

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type.

You can check if something is a vector:
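A small sketch using `vector()`, `is.vector()`, and `str()`. The variable names are hypothetical.

```r
my_vector <- vector(length = 3)   # the default mode is logical
my_vector                         # FALSE FALSE FALSE

another_vector <- vector(mode = "character", length = 3)
is.vector(another_vector)         # TRUE
str(another_vector)               # a character vector of 3 empty strings
```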

Combining Vectors

You can make vectors with explicit contents using the combine function:
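The combine function is `c()`. For instance, with values chosen for illustration:

```r
combine_vector <- c(2, 6, 3)
combine_vector          # 2 6 3
typeof(combine_vector)  # "double"
```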

What do you think the following will produce?
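A possible quiz of this kind mixes numbers with a quoted number. The values and the name `quiz_vector` are assumptions for illustration.

```r
quiz_vector <- c(2, 6, "3")  # two numbers and one string
quiz_vector
typeof(quiz_vector)
```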

This is called type coercion - when R encounters a mix of types to be combined, it forces them all to be the same type:
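Two further illustrations, with hypothetical variable names:

```r
coercion_vector <- c("a", TRUE)        # logical becomes character
coercion_vector                        # "a" "TRUE"

another_coercion_vector <- c(0, TRUE)  # logical becomes double
another_coercion_vector                # 0 1
```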

The Type Hierarchy

The coercion rules go: logical → integer → double → complex → character
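Each step of this hierarchy can be checked by combining two neighbouring types and asking for the type of the result. The values are illustrative.

```r
typeof(c(TRUE, 1L))    # "integer":   logical is promoted to integer
typeof(c(1L, 2.5))     # "double":    integer is promoted to double
typeof(c(2.5, 1+0i))   # "complex":   double is promoted to complex
typeof(c(1+0i, "a"))   # "character": complex is promoted to character
```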

You can force coercion using the as. functions:
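A short sketch with `as.double()` and `as.logical()`. The variable names are hypothetical.

```r
character_vector_example <- c("0", "2", "4")

# Text that looks like numbers can be converted to double.
character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double          # 0 2 4

# Zero becomes FALSE; any other number becomes TRUE.
double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical            # FALSE TRUE TRUE
```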

Challenge 1: Cleaning the Cat Data

An important part of every data analysis is cleaning the input data. Clean the cat data set:

  1. Print cats2 to the console
  2. Use str(cats2) to see the overview of data types
  3. The “weight” column has the incorrect data type. What is the correct data type?
  4. Correct the 4th weight data point with the mean: cats2$weight[4] <- 2.35
  5. Convert the weight to the right data type using as.numeric() or as.double()
  6. Calculate mean(cats2$weight) to test yourself

Solution to Challenge 1

cats2  # Print the data
str(cats2)  # Shows weight is "character"
# The correct data type is "double" or "numeric"
cats2$weight[4] <- 2.35
cats2$weight <- as.numeric(cats2$weight)
mean(cats2$weight)  # Should work now!

Vector Functions

The combine function will also append things to an existing vector:
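For example, with hypothetical names and values:

```r
ab_vector <- c("a", "b")
combine_example <- c(ab_vector, "SWC")  # append one more element
combine_example                         # "a" "b" "SWC"
```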

You can make series of numbers:
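The colon operator and `seq()` both do this. The names `my_series` and `tenths` are hypothetical.

```r
my_series <- 1:10                # integers from 1 to 10
seq(10)                          # the same series
tenths <- seq(1, 10, by = 0.1)   # from 1 to 10 in steps of 0.1
length(tenths)                   # 91
```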

We can ask questions about vectors:
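A sketch with a small example vector. The name `sequence_example` is hypothetical.

```r
sequence_example <- 20:25
head(sequence_example, n = 2)  # first two elements: 20 21
tail(sequence_example, n = 4)  # last four elements: 22 23 24 25
length(sequence_example)       # 6
typeof(sequence_example)       # "integer"
```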

We can get individual elements using bracket notation:
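Indexing in R starts at 1. The same notation can be used to change an element, as this sketch with hypothetical names shows:

```r
sequence_example <- 20:25
first_element <- sequence_example[1]  # read the first element
first_element                         # 20

sequence_example[1] <- 30             # overwrite the first element
sequence_example                      # 30 21 22 23 24 25
```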

Challenge 2: Vector Arithmetic

Start by making a vector with the numbers 1 through 26. Then, multiply the vector by 2.

Solution to Challenge 2

x <- 1:26
x * 2

Lists

Another data structure you’ll want is the list. A list can have different data types:
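A minimal sketch; the name `list_example` and its contents are assumptions for illustration.

```r
# Four elements, each of a different basic type.
list_example <- list(1, "a", TRUE, 1+4i)
list_example
```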

Printing the structure with str() shows the data type of each element:
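A sketch of this, using a hypothetical list with four elements of different types:

```r
list_example <- list(1, "a", TRUE, 1+4i)

# str() prints one line per element, including its type.
str(list_example)

# The same information as a character vector.
sapply(list_example, typeof)
```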

To retrieve an element of a list, use double brackets:
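For example, with a hypothetical list. Single brackets are shown for contrast: they return a shorter list rather than the element itself.

```r
list_example <- list(1, "a", TRUE, 1+4i)

list_example[[2]]          # "a": the element itself
typeof(list_example[[2]])  # "character"
typeof(list_example[2])    # "list": a list containing one element
```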

List elements can have names:
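Names are given when the list is created and can then be used with the $ operator. The list below is a hypothetical example.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

another_list$title    # "Numbers"
another_list$numbers  # the integers 1 to 10
names(another_list)   # "title" "numbers" "data"
```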

Accessing Vectors by Name

Named vectors are generated similarly to named lists:
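A sketch of a named vector called pizza_price. The shop names and prices are hypothetical values chosen for this example.

```r
# Hypothetical pizza shops and prices.
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Elements are accessed by name with square brackets.
pizza_price["pizzasubito"]
```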

Note: The $ operator doesn’t work for vectors, only for lists and data frames.

If you’re only interested in the names:
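The `names()` function returns them. The vector below uses hypothetical shop names and prices.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
names(pizza_price)  # "pizzasubito" "pizzafresh" "callapizza"
```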

You can access and change names:
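Because `names()` returns an ordinary vector, a single name can be read or replaced with bracket notation. The values are hypothetical.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

names(pizza_price)[3]                    # "callapizza"
names(pizza_price)[3] <- "call-a-pizza"  # rename the third element
names(pizza_price)
```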

Challenge 3: Data Type of Names

What is the data type of the names of pizza_price? Use str() or typeof() to find out.

Solution to Challenge 3

typeof(names(pizza_price))  # "character"

Challenge 4: Letters and Numbers

Create a vector that gives the number for each letter in the alphabet:

  1. Generate a vector called letter_no with numbers from 1 to 26
  2. R has a built-in object called LETTERS (A to Z). Set the names to these letters
  3. Test by calling letter_no["B"], which should give you 2

Solution to Challenge 4

letter_no <- 1:26
names(letter_no) <- LETTERS
letter_no["B"]  # B: 2

Data Frames

We can now understand something surprising about data frames:
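Asking for the type of a data frame shows it. The toy values below are assumptions.

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

typeof(cats)  # "list"
```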

Data frames are really lists of vectors! A data frame is a special list in which all the vectors must have the same length.

The class is an attribute that tells us what this object means for humans. typeof() tells us how the object is constructed, while class() tells us its purpose.
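The two functions can be compared side by side on assumed toy data:

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

typeof(cats)  # "list": how the object is built
class(cats)   # "data.frame": what the object is meant to be
```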

Each column is a vector:
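A sketch on assumed toy data. `stringsAsFactors = FALSE` is set explicitly so the coat column is a character vector regardless of R version.

```r
# Assumed toy data; coat is forced to be character, not factor.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

cats$coat             # the first column
is.vector(cats$coat)  # TRUE
typeof(cats[, 1])     # "character"
```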

Each row is a data frame:
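A row holds values of different types, so it cannot be a plain vector. A sketch on assumed toy data; the name `first_row` is hypothetical.

```r
# Assumed toy data, repeated so this snippet is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

first_row <- cats[1, ]
class(first_row)   # "data.frame"
typeof(first_row)  # "list"
str(first_row)     # one observation of three variables
```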

Challenge 5: Subsetting Data Frames

There are several ways to call variables from data frames. Try these and explain what each returns:

  • cats[1]
  • cats[[1]]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Hint: Use typeof() to examine what is returned.

Solution to Challenge 5

cats[1]        # data frame with 1 column
cats[[1]]      # vector (the column itself)
cats$coat      # vector (same as above)
cats["coat"]   # data frame with 1 column
cats[1, 1]     # single value (character)
cats[, 1]      # vector (column)
cats[1, ]      # data frame (row)

Renaming Data Frame Columns

Data frames have column names, which can be accessed with names():

names(cats)

# Rename the second column
names(cats)[2] <- "weight_kg"
cats

Matrices

Last but not least is the matrix. We can declare a matrix full of zeros:
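A sketch using the dimensions that appear in the challenge below (3 rows, 6 columns):

```r
# A 3-by-6 matrix in which every element is 0.
matrix_example <- matrix(0, ncol = 6, nrow = 3)
matrix_example
```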

What makes it special is the dim() attribute:
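A few ways to inspect it, sketched on a 3-by-6 matrix of zeros:

```r
matrix_example <- matrix(0, ncol = 6, nrow = 3)

dim(matrix_example)        # 3 6 (rows, then columns)
nrow(matrix_example)       # 3
ncol(matrix_example)       # 6
typeof(matrix_example)     # "double"
is.matrix(matrix_example)  # TRUE
```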

Challenge 6: Matrix Length

What do you think will be the result of length(matrix_example)? Try it. Were you right? Why / why not?

Solution to Challenge 6

length(matrix_example)  # 18

Because a matrix is a vector with added dimension information, length gives you the total number of elements (3 × 6 = 18).

Challenge 7: Creating Matrices

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (Hint: read the documentation for matrix!)

Solution to Challenge 7

x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE)  # filled by row

Challenge 8: Lists of Data Structures

Create a list of length two containing a character vector for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

Solution to Challenge 8

dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
answer <- list(dataTypes, dataStructures)

Key Points

  • Use read.csv to read tabular data in R
  • The basic data types in R are double, integer, complex, logical, and character
  • Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes
  • Vectors must contain elements of the same type; type coercion happens when mixing types
  • Data frames are lists of vectors with equal length
  • Use typeof(), class(), and str() to understand your data structures