Exploring Data Frames in R
Adapted from Software Carpentry
Overview
Today we will learn to:
- Add and remove rows or columns
- Append two data frames
- Explore data frame properties
- Read data from CSV files
What is a Data Frame?
A data frame is a table where:
- Columns are vectors (same data type within column)
- Rows are lists (can mix data types across columns)
- It is the most common structure for tabular data in R
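A minimal sketch of these points, using a small hand-built data frame. The name cats_demo and its values are illustrative assumptions, not the lesson's data file:

```r
# Hypothetical three-cat table; names and values are made up for illustration.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column is a vector holding a single type.
col_types <- sapply(cats_demo, class)
print(col_types)

# A single row is a one-row data frame, which is a list underneath,
# so it can hold a different type in each position.
first_row <- cats_demo[1, ]
print(typeof(first_row))
```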
Reading Data
Load data from CSV files:
cats <- read.csv("data/feline-data.csv")
cats
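If data/feline-data.csv is not at hand, the same call can be tried on a stand-in file. The sketch below writes a small CSV to a temporary location first; the file contents and the name cats_demo are assumptions made for illustration:

```r
# Write a small stand-in CSV to a temporary file (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "coat,weight,likes_catnip",
  "calico,2.1,1",
  "black,5.0,0",
  "tabby,3.2,1"
), csv_path)

# read.csv() treats the first line as the header by default.
cats_demo <- read.csv(csv_path)
str(cats_demo)

# Remove the temporary file once the data is in memory.
unlink(csv_path)
```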
Adding Columns
Use cbind() to add columns:
age <- c(2, 3, 5)
cbind(cats, age)
Try: Adding a Different Vector
What happens if we try with a different number of values?
age <- c(2, 3, 5, 12)
cbind(cats, age)
Error: Too Many Values
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 3, 4
The data frame has 3 rows but age has 4 values.
Adding Columns: Key Rule
Number of rows must match vector length:
nrow(cats) # 3
length(age) # must also be 3
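Mismatched lengths cause an error, unless the shorter vector's length divides the row count evenly (for example, a single value), in which case R recycles it.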
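One way to make the rule explicit is to check the length before binding. The helper add_column_checked() below is hypothetical (it is not part of R or the lesson), and cats_demo is a made-up stand-in for cats:

```r
# Hypothetical stand-in for the lesson's cats data frame.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: refuse to bind unless lengths agree exactly.
add_column_checked <- function(df, values, name) {
  if (length(values) != nrow(df)) {
    stop("expected ", nrow(df), " values for '", name, "', got ", length(values))
  }
  out <- cbind(df, values)
  names(out)[ncol(out)] <- name  # rename the column that was just added
  out
}

ok <- add_column_checked(cats_demo, c(2, 3, 5), "age")

# Capture the error message instead of stopping the script.
bad <- tryCatch(
  add_column_checked(cats_demo, c(2, 3, 5, 12), "age"),
  error = function(e) conditionMessage(e)
)
print(bad)

# Plain cbind() silently recycles a single value across all rows,
# which is why an explicit check can be useful.
recycled <- cbind(cats_demo, age = 1)
```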
Saving the New Column
To keep the new column, assign it back to cats:
age <- c(2, 3, 5)
cats <- cbind(cats, age)
cats
Adding Rows
Use rbind() to add rows (as lists):
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats
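A self-contained sketch of the same step, assuming a four-column table like the one built so far. The column names and values of cats_demo are illustrative assumptions:

```r
# Hypothetical stand-in for cats after the age column was added.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(TRUE, FALSE, TRUE),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# The list has one value per column, in column order,
# and each value matches the type of its column.
new_row <- list("tortoiseshell", 3.3, TRUE, 9)
cats_demo <- rbind(cats_demo, new_row)

# Confirm that the column types survived the append.
str(cats_demo)
```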
Removing Rows
Use negative indices to drop rows:
# Remove row 4
cats[-4, ]
Remove multiple rows:
# Remove rows 3 and 4
cats[c(-3, -4), ]
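Negative indexing returns a copy and leaves the original untouched until the result is assigned back. A small sketch, using a made-up cats_demo table:

```r
# Hypothetical four-row table for illustration.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  stringsAsFactors = FALSE
)

without_last <- cats_demo[-4, ]       # drops row 4
without_two <- cats_demo[-c(3, 4), ]  # same as cats_demo[c(-3, -4), ]

# The original still has all four rows.
print(nrow(cats_demo))
```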
Removing Columns
Use negative column indices:
# Remove column 4
cats[, -4]
Or use column names with %in%:
drop <- names(cats) %in% c("age")
cats[, !drop]
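The two approaches give the same result when they target the same column. The sketch below compares them on a made-up cats_demo table and also shows drop = FALSE, which keeps the result a data frame when only one column is left:

```r
# Hypothetical three-column table for illustration.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# By position: age is column 3 here.
by_index <- cats_demo[, -3]

# By name: build a logical mask, then keep the columns that are not flagged.
drop <- names(cats_demo) %in% c("age")
by_name <- cats_demo[, !drop]

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps it as a one-column data frame.
one_col <- cats_demo[, "coat", drop = FALSE]
```

Dropping by name is usually the safer choice, because it keeps working if the column order changes.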
Appending Data Frames
Combine two data frames with rbind():
cats <- rbind(cats, cats)
cats
Result: every row of cats now appears twice.
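A short sketch of appending, first with two different data frames and then with a data frame appended to itself. The names first, second, combined, and doubled are illustrative assumptions:

```r
# Two hypothetical data frames with the same columns.
first <- data.frame(
  coat = c("calico", "black"),
  weight = c(2.1, 5.0),
  stringsAsFactors = FALSE
)
second <- data.frame(
  coat = "tabby",
  weight = 3.2,
  stringsAsFactors = FALSE
)

# Column names and types must line up for rbind() to work.
combined <- rbind(first, second)

# Appending a data frame to itself doubles the row count.
doubled <- rbind(combined, combined)
rownames(doubled) <- NULL  # reset row names to 1, 2, 3, ...

# unique() removes the repeated rows again.
deduplicated <- unique(doubled)
```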
Challenge 1
Create a data frame with your information:
df <- data.frame(
first_name = c("Your", "Name"),
last_name = c("Goes", "Here"),
lucky_number = c(7, 13)
)
Then:
- Use rbind() to add an entry for the person sitting beside you
- Use cbind() to add a column with each person’s answer to the question, “Is it time for a coffee break?”
Commit Your Work
Save progress to version control:
git add .
git commit -m "Add data frame manipulation examples"
Reading the Gapminder Dataset
Now let’s work with a realistic dataset. Load the gapminder data:
gapminder <- read.csv("data/gapminder_data.csv")
Or read directly from the internet:
gapminder <- read.csv(
"https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)
Exploring Gapminder: str()
Check the structure of the dataset:
str(gapminder)
The output shows 1704 observations of 6 variables:
country (character)
year (integer)
pop (numeric)
continent (character)
lifeExp (numeric)
gdpPercap (numeric)
Exploring Gapminder: summary()
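Get summary statistics:
summary(gapminder)
For each numeric column, this shows the minimum, quartiles, median, mean, and maximum. Character columns report only their length, class, and mode.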
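Both functions can be tried on a small table before running them on the full dataset. The data frame demo below is a made-up stand-in with gapminder-like column names; its values are not real data:

```r
# Hypothetical stand-in with gapminder-like columns (values are invented).
demo <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(1952L, 1957L, 1952L, 1957L),
  lifeExp = c(30.5, 32.0, 68.1, 69.4),
  stringsAsFactors = FALSE
)

# str() prints one line per column: name, type, and the first few values.
str(demo)

# summary() on a numeric column returns a named vector of statistics.
stats <- summary(demo$lifeExp)
print(stats)
```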
Data Type by Column
Check individual column types:
typeof(gapminder$year) # integer
typeof(gapminder$country) # character
str(gapminder$country) # character vector length 1704
Checking Data Frame Properties
Examine dimensions and types:
typeof(gapminder) # list
length(gapminder) # 6 (columns)
nrow(gapminder) # 1704 rows
ncol(gapminder) # 6 columns
dim(gapminder) # 1704 6
colnames(gapminder) # column names
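The same checks on a small table make the relationships easier to see. The data frame demo is an invented stand-in, not the gapminder data:

```r
# Hypothetical 4-row, 3-column table (values are invented).
demo <- data.frame(
  country = c("A", "B", "C", "D"),
  year = c(1952L, 1957L, 1962L, 1967L),
  pop = c(1.0e6, 1.1e6, 1.2e6, 1.3e6),
  stringsAsFactors = FALSE
)

print(typeof(demo))    # a data frame is stored as a list of columns
print(length(demo))    # so length() counts columns
print(dim(demo))       # rows first, then columns
print(colnames(demo))
```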
Viewing the Data
Check the first and last rows:
head(gapminder)
tail(gapminder)
Challenge 2
Examine the gapminder data:
- Check the last few lines of the data
- Check some rows in the middle
- Try to pull a few random rows
(Hint: Use functions like tail(), subsetting, and sample())
Challenge 2 Solution
tail(gapminder)
# Middle rows (example: rows 800–810)
gapminder[800:810, ]
# Random rows
gapminder[sample(nrow(gapminder), 5), ]
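sample() picks different rows on every run. If the same rows are needed again, for example to discuss them with a neighbor, one option is to set a seed first. The sketch below uses an invented table demo and an arbitrary seed value:

```r
# Hypothetical 20-row table for illustration.
demo <- data.frame(id = 1:20, value = (1:20)^2)

set.seed(42)  # arbitrary seed; any fixed integer works
picked <- sample(nrow(demo), 5)  # 5 distinct row numbers
random_rows <- demo[picked, ]

# Resetting the same seed reproduces the same selection.
set.seed(42)
again <- demo[sample(nrow(demo), 5), ]
print(identical(random_rows, again))
```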
Challenge 3
Create an R script:
- Go to File > New File > R Script
- Write code to load the gapminder dataset
- Save it in the scripts/ directory
- Add to version control
- Run the script using the source() function
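To see what source() does without touching the project files, the sketch below writes a tiny script to a temporary file and runs it. The script body and the object name loaded are assumptions; in the challenge the script would call read.csv() and live in scripts/:

```r
# Write a hypothetical two-line script to a temporary file.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# Stand-in for a script that would load the gapminder data",
  "loaded <- data.frame(x = 1:3)"
), script_path)

# source() runs every line of the file, by default in the global environment,
# so objects the script creates are available afterwards.
source(script_path)
print(exists("loaded"))

unlink(script_path)
```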
Challenge 4
Interpret the output of str(gapminder):
Using what you know about lists and vectors, explain:
- What does each line of str() output mean?
- Why does length() return 6 while nrow() returns 1704?
- How does this relate to the data structure?
Discuss with your neighbor.
Important Reminders
- Each column holds a single data type
- Number of elements must match when adding columns
- Rows are added as lists
- Always check structure after loading data
- Use str() first, then explore further
Commit Your Final Work
Save all changes to version control:
git add .
git commit -m "Complete data frame exploration lesson"
Always commit your final work.
Resources
- Software Carpentry: r-novice-gapminder
- RStudio Cheatsheets
- R Documentation: ?read.csv, ?str, etc.