Data Structures
Adapted from Software Carpentry
Overview
Questions:
- How can I read data in R?
- What are the basic data types in R?
- How do I represent categorical information in R?
Objectives:
- To be able to identify the 5 main data types
- To begin exploring data frames, and understand how they are related to vectors and lists
- To be able to ask R questions about the type, class, and structure of an object
- To understand the information stored in the attributes “names”, “class”, and “dim”
Creating and Reading Data
One of R’s most powerful features is its ability to deal with tabular data, such as you may already have in a spreadsheet or a CSV file. Let’s start by making a toy dataset, which we will save in your data/ directory as feline-data.csv:
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))
We can now save cats as a CSV file. It is good practice to call the argument names explicitly:
write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
We can load this into R via the following:
cats <- read.csv(file = "data/feline-data.csv")
cats
Alternatively, you can create data/feline-data.csv using a text editor such as Nano, or within RStudio with the File → New File → Text File menu item.
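The write-then-read round trip can be tried without touching your project files. The sketch below is illustrative only: it writes to a temporary file created with tempfile() (an assumption made here so the example runs on its own; the lesson itself uses data/feline-data.csv), and the names csv_path and cats_from_disk are hypothetical.

```r
# Rebuild the toy dataset so this block runs on its own
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))

# Hypothetical location: a temporary file rather than data/feline-data.csv
csv_path <- tempfile(fileext = ".csv")
write.csv(x = cats, file = csv_path, row.names = FALSE)

# Read the file back into a new data frame
cats_from_disk <- read.csv(file = csv_path)

# The column names and dimensions should match the original
names(cats_from_disk)
dim(cats_from_disk)

# str() summarises the type R chose for each column when reading
str(cats_from_disk)
```

Checking the result with str() right after reading is a cheap way to confirm that each column arrived with the type you expected.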
We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:
cats$weight
cats$coat
We can do other operations on the columns:
# Say we discovered that the scale weighs two kg light:
cats$weight + 2
paste("My cat is", cats$coat)
But what about:
cats$weight + cats$coat
This will return an error because 2.1 plus "calico" is nonsense!
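If you want to see the error without halting a script, one option is to wrap the expression in tryCatch(). This is a minimal sketch rather than part of the lesson; stringsAsFactors = FALSE is set explicitly as an assumption so that coat is a character column on older R versions too, and the name result is hypothetical.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1),
                   stringsAsFactors = FALSE)  # assumption: keep coat as character

# tryCatch() evaluates the expression and, if it fails, passes the
# error object to the handler instead of stopping the script
result <- tryCatch(
  cats$weight + cats$coat,
  error = function(e) conditionMessage(e)
)

# result now holds the error message as a character string
result
```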
Data Types
Understanding data types is key to successfully analyzing data in R. We can ask what type of data something is:
typeof(cats$weight)
There are 5 main types: double, integer, complex, logical and character. For historic reasons, double is also called numeric.
typeof(3.14) # "double"
typeof(1L) # "integer" (L suffix forces integer)
typeof(1+1i) # "complex"
typeof(TRUE) # "logical"
typeof('banana') # "character"
No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
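The remark that double is also called numeric can be seen by comparing typeof() with class() on the same values. The following sketch is only an illustration; the list name values is hypothetical.

```r
# One example value for each of the five basic types
values <- list(3.14, 1L, 1+1i, TRUE, "banana")

# typeof() reports the storage type, while class() reports
# "numeric" for a double
data.frame(
  typeof = sapply(values, typeof),
  class  = sapply(values, class)
)

# is.numeric() is TRUE for both doubles and integers
is.numeric(3.14)    # TRUE
is.numeric(1L)      # TRUE
is.numeric("3.14")  # FALSE: this is a character string
```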
Type Coercion in Data Frames
A user has provided details of another cat. We can add an additional row to our cats table:
additional_cat <- data.frame(coat = "tabby",
                             weight = "2.3 or 2.4",
                             likes_catnip = 1)
cats2 <- rbind(cats, additional_cat)
Let’s check what type of data we find in the weight column:
typeof(cats2$weight) # "character"
Oh no! A given column in a data frame cannot be composed of different data types. When R can’t store everything as numbers (because of “2.3 or 2.4”), the entire column changes to character type.
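One way to track down which entries caused the column to become character is to attempt the numeric conversion and look for the values that turn into NA. This is a hedged sketch, not the lesson's own method; weight_as_number and bad_rows are hypothetical names, and stringsAsFactors = FALSE is an assumption added for older R versions.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1),
                   stringsAsFactors = FALSE)
additional_cat <- data.frame(coat = "tabby",
                             weight = "2.3 or 2.4",
                             likes_catnip = 1,
                             stringsAsFactors = FALSE)
cats2 <- rbind(cats, additional_cat)

# Each column has exactly one storage type
sapply(cats2, typeof)

# as.numeric() gives NA (with a warning) wherever the text is not a number
weight_as_number <- suppressWarnings(as.numeric(cats2$weight))
bad_rows <- which(is.na(weight_as_number))

# Show the rows whose weight could not be converted
cats2[bad_rows, ]
```

Once the offending rows are corrected or removed, converting the column back with as.numeric() restores a numeric weight column.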
Vectors and Type Coercion
To better understand this behavior, let’s meet another of the data structures: the vector.
my_vector <- vector(length = 3)
my_vector
A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type.
another_vector <- vector(mode='character', length=3)
another_vector
You can check if something is a vector:
str(another_vector)
str(cats$weight) # Columns are vectors too!
Combining Vectors
You can make vectors with explicit contents using the combine function:
combine_vector <- c(2, 6, 3)
combine_vector
What do you think the following will produce?
quiz_vector <- c(2, 6, '3')
This is called type coercion - when R encounters a mix of types to be combined, it forces them all to be the same type:
coercion_vector <- c('a', TRUE)
coercion_vector # "a" "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector # 0 1
The Type Hierarchy
The coercion rules go: logical → integer → double → complex → character
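Each step of this hierarchy can be checked by combining two neighbouring types with c() and asking for the type of the result. The values below are arbitrary examples chosen for illustration.

```r
# Mixing two types always yields the one further along the hierarchy
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(2.5, 1+0i))    # "complex"
typeof(c(1+0i, "a"))    # "character"

# With every type present, character wins
typeof(c(TRUE, 1L, 2.5, 1+0i, "a"))  # "character"

# The same rule is why arithmetic on logicals works:
# TRUE is treated as 1 and FALSE as 0
sum(c(TRUE, FALSE, TRUE))  # 2
```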
You can force coercion using the as. functions:
character_vector_example <- c('0','2','4')
character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double # 0 2 4
double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical # FALSE TRUE TRUE
Vector Functions
The combine function will also append things to an existing vector:
ab_vector <- c('a', 'b')
combine_example <- c(ab_vector, 'SWC')
combine_example # "a" "b" "SWC"
You can make series of numbers:
mySeries <- 1:10
seq(10)
seq(1, 10, by=0.1)
We can ask questions about vectors:
sequence_example <- 20:25
head(sequence_example, n=2) # 20 21
tail(sequence_example, n=4) # 22 23 24 25
length(sequence_example) # 6
typeof(sequence_example) # "integer"
We can get individual elements using bracket notation:
first_element <- sequence_example[1]
first_element # 20
# Change a single element
sequence_example[1] <- 30
sequence_example # 30 21 22 23 24 25
Lists
Another data structure you’ll want is the list. A list can have different data types:
list_example <- list(1, "a", TRUE, 1+4i)
list_example
We can see the data type of each element by printing the structure with str():
str(list_example)
To retrieve an element of a list, use double brackets:
list_example[[2]] # "a"
List elements can have names:
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
another_list
# Access by name
another_list$title # "Numbers"
Names
Accessing Vectors by Name
Named vectors are generated similarly to named lists:
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_price["pizzasubito"] # pizzasubito 5.64
Note: The $ operator doesn’t work for atomic vectors, only for lists and data frames.
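The difference between the accessors can be seen by trying each one on the same named vector. This is an illustrative sketch; dollar_result is a hypothetical name, and tryCatch() is used only so the failed $ call does not stop the script.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# $ on an atomic vector raises an error; capture its message
dollar_result <- tryCatch(
  pizza_price$pizzasubito,
  error = function(e) conditionMessage(e)
)
dollar_result

# Single brackets keep the name, double brackets drop it
pizza_price["pizzasubito"]
pizza_price[["pizzasubito"]]  # 5.64

# After converting to a list, $ works as it does for any named list
as.list(pizza_price)$pizzasubito  # 5.64
```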
If you’re only interested in the names:
names(pizza_price)
You can access and change names:
names(pizza_price)[3] # "callapizza"
names(pizza_price)[3] <- "call-a-pizza"
pizza_price
Data Frames
We can now understand something surprising about data frames:
typeof(cats) # "list"
Data frames are really lists of vectors! A data frame is a special list in which all the vectors must have the same length.
class(cats) # "data.frame"
The class is an attribute that tells us what this object means for humans. typeof() tells us how the object is constructed, while class() tells us its purpose.
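Because a data frame is a list underneath, the usual list tools work on it, and the equal-length requirement is enforced when the data frame is built. The sketch below illustrates both points; the name unequal and the toy columns a and b are hypothetical.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))

is.list(cats)     # TRUE
length(cats)      # 3: one list element per column
lengths(cats)     # every column holds 3 values

# Double brackets pull out a column, just as they pull out a list element
identical(cats[["weight"]], cats$weight)  # TRUE

# Columns of different lengths are rejected; capture the error message
unequal <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
unequal
```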
Each column is a vector:
cats$coat
cats[,1]
typeof(cats[,1]) # "character"
Each row is a data frame:
cats[1,]
typeof(cats[1,]) # "list"
Data frames have column names, which can be accessed with names():
names(cats)
# Rename the second column
names(cats)[2] <- "weight_kg"
cats
Matrices
Last but not least is the matrix. We can declare a matrix full of zeros:
matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
What makes it special is the dim attribute, which we can query with dim():
dim(matrix_example) # 3 6
typeof(matrix_example) # "double"
class(matrix_example) # "matrix" "array"
nrow(matrix_example) # 3
ncol(matrix_example) # 6
Key Points
- Use read.csv to read tabular data in R
- The basic data types in R are double, integer, complex, logical, and character
- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes
- Vectors must contain elements of the same type; type coercion happens when mixing types
- Data frames are lists of vectors with equal length
- Use typeof(), class(), and str() to understand your data structures
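As a closing illustration of the point that matrices and data frames are vectors and lists with added attributes, the sketch below inspects those attributes directly. The name six_numbers is hypothetical, and the class output noted in the comment assumes R 4.0 or later.

```r
# A plain integer vector carries no attributes
six_numbers <- 1:6
attributes(six_numbers)  # NULL

# Giving it a dim attribute is enough to make R treat it as a matrix
dim(six_numbers) <- c(2, 3)
is.matrix(six_numbers)   # TRUE
class(six_numbers)       # "matrix" "array" (assuming R >= 4.0)
typeof(six_numbers)      # still "integer"

# A data frame is a list that carries names, class, and row.names
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))
names(attributes(cats))
```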