Adapted from Software Carpentry
One of R’s most powerful features is its ability to deal with tabular data, such as the data you may already have in a spreadsheet or a CSV file. Let’s start by making a toy dataset called feline-data.csv in your data/ directory:
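A minimal sketch of such a toy dataset. The coat and weight columns are used later in the text; the third column (likes_string) and all the values are assumptions chosen for illustration.

```r
# Build a small data frame of cats (values are illustrative)
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
cats
```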
We can now save cats as a CSV file. It is good practice to name the arguments explicitly so that it is clear which value is passed to which argument:
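One way this could look, assuming the toy cats data frame sketched here (its values are illustrative). The dir.create() call is an addition so the sketch runs even if data/ does not exist yet.

```r
# Toy data (illustrative values)
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Make sure the data/ directory exists, then write the CSV
dir.create("data", showWarnings = FALSE)
write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
```

Setting row.names = FALSE keeps R from writing an extra column of row numbers into the file.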
We can load this into R via the following:
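A sketch of the loading step. The first lines only recreate an illustrative file so the example runs on its own; the file contents are assumed, not taken from the lesson.

```r
# Setup: write an illustrative CSV so this sketch is self-contained
dir.create("data", showWarnings = FALSE)
writeLines(c("coat,weight,likes_string",
             "calico,2.1,1",
             "black,5.0,0",
             "tabby,3.2,1"),
           "data/feline-data.csv")

# Read the file into a data frame
cats <- read.csv(file = "data/feline-data.csv")
cats
```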
Editing Text Files in R
Alternatively, you can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File → New File → Text File menu item.
We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:
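For example, with an illustrative cats data frame (the values are assumptions):

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

cats$weight   # the weight column: 2.1 5.0 3.2
cats$coat     # the coat column
```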
We can do other operations on the columns:
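A short sketch of two such operations, again on illustrative data: arithmetic on a numeric column and pasting text onto a character column.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Say the scale reads 2 kg too light: add 2 to every weight
heavier <- cats$weight + 2                  # 4.1 7.0 5.2

# Build a sentence for each coat colour
sentences <- paste("My cat is", cats$coat)  # "My cat is calico" ...
sentences
```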
But what about:
This will raise an error because adding a number such as 2.1 to a string such as "black" is nonsense!
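A minimal sketch of the failing addition, wrapped in tryCatch() so the script keeps running and the message can be inspected. The data values are illustrative, and stringsAsFactors = FALSE is set explicitly so that coat is a character column on every R version.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# Adding numbers to strings fails; capture the error message instead of stopping
result <- tryCatch(cats$weight + cats$coat,
                   error = function(e) conditionMessage(e))
result
```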
Understanding data types is key to successfully analyzing data in R. We can ask what type of data something is:
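For instance, typeof() can be called on a vector of weights and on a few literal values (the values themselves are arbitrary):

```r
typeof(c(2.1, 5.0, 3.2))  # "double"
typeof(3.14)              # "double"
typeof(1L)                # "integer" (the L suffix forces an integer)
typeof(1+1i)              # "complex"
typeof(TRUE)              # "logical"
typeof("banana")          # "character"
```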
There are 5 main types: double, integer, complex, logical and character. For historical reasons, double is also called numeric.
No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
A user has provided details of another cat. We can add an additional row to our cats table:
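A sketch of how the new row could be appended with rbind(). The name additional_cat and the cats values are assumptions; the weight "2.3 or 2.4" and the result name cats2 come from the text.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# The user was unsure about the weight, so it arrives as text
additional_cat <- data.frame(coat = "tabby",
                             weight = "2.3 or 2.4",
                             likes_string = 1,
                             stringsAsFactors = FALSE)

cats2 <- rbind(cats, additional_cat)
cats2
```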
Let’s check what type of data we find in the weight column:
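The check could look like this, rebuilding cats2 from illustrative data so the sketch runs on its own:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)
cats2 <- rbind(cats,
               data.frame(coat = "tabby", weight = "2.3 or 2.4",
                          likes_string = 1, stringsAsFactors = FALSE))

typeof(cats$weight)    # "double"
typeof(cats2$weight)   # "character"
```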
Oh no! A given column in a data frame cannot be composed of different data types. When R can’t store everything as numbers (because of “2.3 or 2.4”), the entire column changes to character type.
To better understand this behavior, let’s meet another of the data structures: the vector.
A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type.
You can check if something is a vector:
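For example, with two empty vectors created by vector() (the variable names are arbitrary):

```r
my_vector <- vector(length = 3)                           # logical by default
another_vector <- vector(mode = "character", length = 3)  # three empty strings

is.vector(my_vector)        # TRUE
is.vector(another_vector)   # TRUE
```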
You can make vectors with explicit contents using the combine function:
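The combine function is c(). A short example with arbitrary values:

```r
combine_vector <- c(2, 6, 3)
combine_vector   # 2 6 3
```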
What do you think the following will produce?
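A possible quiz of this kind mixes numbers with a quoted number (the name quiz_vector and the values are illustrative):

```r
quiz_vector <- c(2, 6, "3")
quiz_vector           # "2" "6" "3"
typeof(quiz_vector)   # "character"
```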
This is called type coercion - when R encounters a mix of types to be combined, it forces them all to be the same type:
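Two small illustrations of coercion with arbitrary values: a logical combined with a character becomes character, and a logical combined with a number becomes a number.

```r
coercion_vector <- c("a", TRUE)
coercion_vector           # "a" "TRUE"

another_coercion_vector <- c(0, TRUE)
another_coercion_vector   # 0 1
```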
The coercion rules go: logical → integer → double → complex → character
You can force coercion using the as. functions:
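A sketch using as.double() and as.logical() on illustrative values:

```r
character_vector_example <- c("0", "2", "4")

# Text that looks like numbers can be converted to double
character_coerced_to_double <- as.double(character_vector_example)    # 0 2 4

# Zero becomes FALSE, every other number becomes TRUE
double_coerced_to_logical <- as.logical(character_coerced_to_double)  # FALSE TRUE TRUE
double_coerced_to_logical
```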
Challenge 1: Cleaning the Cat Data
An important part of every data analysis is cleaning the input data. Clean the cat data set:
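- Print cats2 to the console.
- Use str(cats2) to see the overview of data types.
- Correct the fourth weight, for example with cats2$weight[4] <- 2.35.
- Convert the weight column with as.numeric() or as.double().
- Calculate mean(cats2$weight) to test yourself.

The combine function will also append things to an existing vector: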
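For example, appending one element with c() (names and values are arbitrary):

```r
ab_vector <- c("a", "b")
combine_example <- c(ab_vector, "SWC")
combine_example   # "a" "b" "SWC"
```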
You can make series of numbers:
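Three common ways to build a series, shown as a sketch:

```r
my_series <- 1:10                      # integers 1 to 10
same_series <- seq(10)                 # the same sequence via seq()
fine_series <- seq(1, 10, by = 0.1)    # 1.0, 1.1, ..., 10.0

length(fine_series)   # 91
```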
We can ask questions about vectors:
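A few such questions, asked of an arbitrary example sequence:

```r
sequence_example <- 20:25

head(sequence_example, n = 2)   # 20 21
tail(sequence_example, n = 4)   # 22 23 24 25
length(sequence_example)        # 6
typeof(sequence_example)        # "integer"
```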
We can get individual elements using bracket notation:
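For example, reading and then replacing the first element of an arbitrary sequence:

```r
sequence_example <- 20:25

first_element <- sequence_example[1]   # 20

# The same notation on the left-hand side changes an element
sequence_example[1] <- 30L
sequence_example                       # 30 21 22 23 24 25
```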
Another data structure you’ll want is the list. A list can have different data types:
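A small list holding four different types (the values are arbitrary):

```r
list_example <- list(1, "a", TRUE, 1+4i)
list_example
```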
When printing the structure:
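With an arbitrary example list, str() reports each element together with its own type:

```r
list_example <- list(1, "a", TRUE, 1+4i)

# Shows one line per element: num, chr, logi, cplx
str(list_example)
```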
To retrieve an element of a list, use double brackets:
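A sketch contrasting double and single brackets on an arbitrary list:

```r
list_example <- list(1, "a", TRUE, 1+4i)

list_example[[2]]   # "a", the element itself
list_example[2]     # a list of length one containing "a"
```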
List elements can have names:
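For example, a named list whose elements can be reached with $ (names and values are illustrative):

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

another_list$title     # "Numbers"
another_list$numbers   # 1 2 3 ... 10
```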
Named vectors are generated similarly to named lists:
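A sketch with a made-up vector of pizza prices. Elements are retrieved by name with bracket notation.

```r
# Hypothetical prices, named by shop
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

pizza_price["pizzasubito"]     # named element: pizzasubito 5.64
pizza_price[["pizzasubito"]]   # just the value: 5.64
```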
Note: The $ operator doesn’t work for vectors, only for lists and data frames.
If you’re only interested in the names:
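The names() function returns them as a character vector (the vector here is made up for illustration):

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

names(pizza_price)   # "pizzasubito" "pizzafresh" "callapizza"
```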
You can access and change names:
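For example, reading the third name and then replacing it (illustrative data):

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

names(pizza_price)[3]                     # "callapizza"
names(pizza_price)[3] <- "call-a-pizza"   # rename the third element
pizza_price
```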
Challenge 4: Letters and Numbers
Create a vector that gives the number for each letter in the alphabet:
- Generate a vector called letter_no with numbers from 1 to 26.
- R has a built-in object called LETTERS (A to Z). Set the names to these letters.
- Test yourself by calling letter_no["B"], which should give you 2.

We can now understand something surprising about data frames:
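Asking for the type and the class of an illustrative cats data frame shows the surprise:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

typeof(cats)   # "list"
class(cats)    # "data.frame"
```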
A data frame is really a list of vectors! It is a special list in which all vectors must have the same length.
The class is an attribute that tells us what this object means for humans. typeof() tells us how the object is constructed, while class() tells us its purpose.
Each column is a vector:
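A sketch on illustrative data. stringsAsFactors = FALSE is set explicitly so that coat is a character vector on every R version.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

cats$coat              # the first column, by name
cats[, 1]              # the same column, by position
typeof(cats[, 1])      # "character"
is.vector(cats[, 1])   # TRUE
```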
Each row is a data frame:
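Selecting one row of an illustrative cats data frame keeps the data frame structure, because a row can mix types:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

cats[1, ]           # first row, all columns
typeof(cats[1, ])   # "list"
class(cats[1, ])    # "data.frame"
```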
Challenge 5: Subsetting Data Frames
There are several ways to call variables from data frames. Try these and explain what each returns:
- cats[1]
- cats[[1]]
- cats$coat
- cats["coat"]
- cats[1, 1]
- cats[, 1]
- cats[1, ]

Hint: Use typeof() to examine what is returned.
Last but not least is the matrix. We can declare a matrix full of zeros:
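For example, a matrix of zeros with 3 rows and 6 columns (the dimensions are arbitrary):

```r
matrix_example <- matrix(0, ncol = 6, nrow = 3)
matrix_example
```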
What makes it special is the dim() attribute:
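A sketch that inspects the dimensions of an arbitrary example matrix:

```r
matrix_example <- matrix(0, ncol = 6, nrow = 3)

dim(matrix_example)      # 3 6 (rows, then columns)
nrow(matrix_example)     # 3
ncol(matrix_example)     # 6
typeof(matrix_example)   # "double"
```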
Challenge 6: Matrix Length
Challenge 7: Creating Matrices
Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (Hint: read the documentation for matrix!)
Challenge 8: Lists of Data Structures
Create a list of length two containing a character vector for each of the sections in this part of the workshop:
Populate each character vector with the names of the data types and data structures we’ve seen so far.
- Use read.csv to read tabular data in R.
- Use typeof(), class(), and str() to understand your data structures.