Data Frame Manipulation with dplyr
Adapted from Software Carpentry
Overview
Today we will learn to:
Use the six main data frame manipulation ‘verbs’ with pipes in dplyr
Understand how group_by() and summarize() can be combined to summarize datasets
Analyze a subset of data using logical filtering
Questions
How can I manipulate data frames without repeating myself?
The Problem with Base R
We often need to select observations, group data, or calculate statistics.
Using base R:
gapminder <- read.csv (
"https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)
mean (gapminder$ gdpPercap[gapminder$ continent == "Africa" ])
mean (gapminder$ gdpPercap[gapminder$ continent == "Americas" ])
mean (gapminder$ gdpPercap[gapminder$ continent == "Asia" ])
This is repetitive and error-prone!
The dplyr Package
The dplyr package provides functions for manipulating data frames that:
Reduce repetition
Reduce probability of errors
Are easier to read
Tidyverse
dplyr belongs to the Tidyverse - a family of R packages designed for data science.
These packages work harmoniously together. Learn more at: tidyverse.org
The Five Main Verbs
We’ll cover these commonly used functions:
select() - choose columns
filter() - choose rows
group_by() - group data
summarize() - calculate summaries
mutate() - create new columns
Plus the pipe operator: %>%
Loading dplyr
If not installed, run: install.packages("dplyr")
Using select()
Keep only specific columns:
year_country_gdp <- select (gapminder, year, country, gdpPercap)
head (year_country_gdp)
Removing Columns with select()
Use - to remove columns:
smaller_gapminder <- select (gapminder, - continent)
head (smaller_gapminder)
Introduction to Pipes
The pipe symbol %>% passes data from one function to the next:
year_country_gdp <- gapminder %>%
select (year, country, gdpPercap)
head (year_country_gdp)
Understanding Pipes
Step by step:
Start with gapminder data frame
Pass it using %>% to the next step
select() receives the data automatically
Fun Fact: Similar to the shell pipe | but the concept is the same!
Renaming Columns
Use rename() within a pipeline:
tidy_gdp <- year_country_gdp %>%
rename (gdp_per_capita = gdpPercap)
head (tidy_gdp)
Syntax: rename(new_name = old_name)
Using filter()
Keep only rows that meet criteria:
year_country_gdp_euro <- gapminder %>%
filter (continent == "Europe" ) %>%
select (year, country, gdpPercap)
head (year_country_gdp_euro)
Combining filter() Conditions
Filter with multiple conditions:
europe_lifeExp_2007 <- gapminder %>%
filter (continent == "Europe" , year == 2007 ) %>%
select (country, lifeExp)
head (europe_lifeExp_2007)
Challenge 1
Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents.
How many rows does your data frame have and why?
Challenge 1 Solution
year_country_lifeExp_Africa <- gapminder %>%
filter (continent == "Africa" ) %>%
select (year, country, lifeExp)
nrow (year_country_lifeExp_Africa)
624 rows (52 African countries × 12 years of data)
Order of Operations
The order of operations matters!
If we used select() first, filter() would not be able to find the variable continent since we would have removed it.
Always filter before selecting when you need to filter on a column you won’t keep!
Using group_by()
Instead of filtering one group at a time, use group_by() to work with all groups:
group_by() Structure
str (gapminder %>% group_by (continent))
Notice the grouped_df class - it’s now organized by continent!
Using summarize()
Combine group_by() with summarize() to calculate statistics per group:
gdp_bycontinents <- gapminder %>%
group_by (continent) %>%
summarize (mean_gdpPercap = mean (gdpPercap))
gdp_bycontinents
Challenge 2
Calculate the average life expectancy per country.
Which has the longest average life expectancy and which has the shortest average life expectancy?
Challenge 2 Solution
lifeExp_bycountry <- gapminder %>%
group_by (country) %>%
summarize (mean_lifeExp = mean (lifeExp))
lifeExp_bycountry %>%
filter (mean_lifeExp == min (mean_lifeExp) | mean_lifeExp == max (mean_lifeExp))
Challenge 2 Solution (using arrange)
Another way using arrange():
# Shortest life expectancy
lifeExp_bycountry %>%
arrange (mean_lifeExp) %>%
head (1 )
# Longest life expectancy
lifeExp_bycountry %>%
arrange (desc (mean_lifeExp)) %>%
head (1 )
Grouping by Multiple Variables
Group by more than one variable:
gdp_bycontinents_byyear <- gapminder %>%
group_by (continent, year) %>%
summarize (mean_gdpPercap = mean (gdpPercap))
head (gdp_bycontinents_byyear)
Multiple Summary Statistics
Create multiple summary variables at once:
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by (continent, year) %>%
summarize (
mean_gdpPercap = mean (gdpPercap),
sd_gdpPercap = sd (gdpPercap),
mean_pop = mean (pop),
sd_pop = sd (pop)
)
head (gdp_pop_bycontinents_byyear)
count() and n()
Count observations per group with count():
gapminder %>%
filter (year == 2002 ) %>%
count (continent, sort = TRUE )
Using n() in Calculations
Use n() for the number of observations in calculations:
gapminder %>%
group_by (continent) %>%
summarize (se_le = sd (lifeExp) / sqrt (n ()))
Multiple Summary Statistics Example
Calculate min, max, mean, and standard error:
gapminder %>%
group_by (continent) %>%
summarize (
mean_le = mean (lifeExp),
min_le = min (lifeExp),
max_le = max (lifeExp),
se_le = sd (lifeExp) / sqrt (n ())
)
Using mutate()
Create new columns with mutate():
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate (gdp_billion = gdpPercap * pop / 10 ^ 9 ) %>%
group_by (continent, year) %>%
summarize (
mean_gdpPercap = mean (gdpPercap),
mean_gdp_billion = mean (gdp_billion)
)
head (gdp_pop_bycontinents_byyear)
mutate() with ifelse()
Combine mutate() with ifelse() for conditional values:
gdp_pop_above25 <- gapminder %>%
mutate (gdp_billion = ifelse (lifeExp > 25 , gdpPercap * pop / 10 ^ 9 , NA )) %>%
group_by (continent, year) %>%
summarize (mean_gdp_billion = mean (gdp_billion))
head (gdp_pop_above25)
Scaling Values Conditionally
Update values based on conditions:
gdp_future <- gapminder %>%
mutate (gdp_futureExpectation = ifelse (lifeExp > 40 , gdpPercap * 1.5 , gdpPercap)) %>%
group_by (continent, year) %>%
summarize (
mean_gdpPercap = mean (gdpPercap),
mean_gdpPercap_expected = mean (gdp_futureExpectation)
)
head (gdp_future)
Combining dplyr and ggplot2
Load ggplot2:
Traditional Approach
Creating an intermediate variable:
americas <- gapminder[gapminder$ continent == "Americas" , ]
ggplot (data = americas, mapping = aes (x = year, y = lifeExp)) +
geom_line () +
facet_wrap (~ country) +
theme (axis.text.x = element_text (angle = 45 ))
Piping into ggplot2
No intermediate variable needed:
gapminder %>%
filter (continent == "Americas" ) %>%
ggplot (mapping = aes (x = year, y = lifeExp)) +
geom_line () +
facet_wrap (~ country) +
theme (axis.text.x = element_text (angle = 45 ))
Combining mutate and ggplot2
Create new columns and plot in one pipeline:
gapminder %>%
mutate (startsWith = substr (country, 1 , 1 )) %>%
filter (startsWith %in% c ("A" , "Z" )) %>%
ggplot (aes (x = year, y = lifeExp, colour = continent)) +
geom_line () +
facet_wrap (vars (country)) +
theme_minimal ()
Advanced Challenge
Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent.
Then arrange the continent names in reverse order.
Hint: Use the dplyr functions arrange() and sample_n().
Advanced Challenge Solution
set.seed (42 ) # For reproducibility
lifeExp_2countries_bycontinents <- gapminder %>%
filter (year == 2002 ) %>%
group_by (continent) %>%
sample_n (2 ) %>%
summarize (mean_lifeExp = mean (lifeExp)) %>%
arrange (desc (mean_lifeExp))
lifeExp_2countries_bycontinents
Key dplyr Verbs Summary
select()
Choose columns
filter()
Choose rows
group_by()
Group data
summarize()
Calculate summaries
mutate()
Create new columns
arrange()
Sort rows
count()
Count observations
rename()
Rename columns
Key Points
Use the dplyr package to manipulate data frames
Use select() to choose variables from a data frame
Use filter() to choose data based on values
Use group_by() and summarize() to work with subsets of data
Use mutate() to create new variables
Use pipes (%>%) to chain operations together
Important Reminders
Order of operations matters when combining select() and filter()
group_by() changes the data frame structure
n() returns the count of observations in the current group
Pipes make code more readable and reduce intermediate variables
Combine dplyr with ggplot2 for powerful data visualization workflows