Data Frame Manipulation with dplyr

Adapted from Software Carpentry

Overview

Today we will learn to:

Use the six main data frame manipulation ‘verbs’ with pipes in dplyr
Understand how group_by() and summarize() can be combined to summarize datasets
Analyze a subset of data using logical filtering

Questions

How can I manipulate data frames without repeating myself?

The Problem with Base R

We often need to select observations, group data, or calculate statistics.

Using base R:

gapminder <- read.csv(
  "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])

This is repetitive and error-prone!

The dplyr Package

The dplyr package provides functions for manipulating data frames that:

Reduce repetition
Reduce probability of errors
Are easier to read

Tidyverse

dplyr belongs to the Tidyverse - a family of R packages designed for data science.

These packages work harmoniously together. Learn more at: tidyverse.org

The Five Main Verbs

We’ll cover these commonly used functions:

select() - choose columns
filter() - choose rows
group_by() - group data
summarize() - calculate summaries
mutate() - create new columns

Plus the pipe operator: %>%

Loading dplyr

library(dplyr)

If not installed, run: install.packages("dplyr")

Using select()

Keep only specific columns:

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)

Removing Columns with select()

Use - to remove columns:

smaller_gapminder <- select(gapminder, -continent)
head(smaller_gapminder)

Introduction to Pipes

The pipe symbol %>% passes data from one function to the next:

year_country_gdp <- gapminder %>% 
  select(year, country, gdpPercap)

head(year_country_gdp)

Understanding Pipes

Step by step:

Start with gapminder data frame
Pass it using %>% to the next step
select() receives the data automatically

Fun Fact: Similar to the shell pipe | but the concept is the same!

Renaming Columns

Use rename() within a pipeline:

tidy_gdp <- year_country_gdp %>% 
  rename(gdp_per_capita = gdpPercap)

head(tidy_gdp)

Syntax: rename(new_name = old_name)

Using filter()

Keep only rows that meet criteria:

year_country_gdp_euro <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

head(year_country_gdp_euro)

Combining filter() Conditions

Filter with multiple conditions:

europe_lifeExp_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

head(europe_lifeExp_2007)

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents.

How many rows does your data frame have and why?

Challenge 1 Solution

year_country_lifeExp_Africa <- gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)

nrow(year_country_lifeExp_Africa)

624 rows (52 African countries × 12 years of data)

Order of Operations

The order of operations matters!

If we used select() first, filter() would not be able to find the variable continent since we would have removed it.

Always filter before selecting when you need to filter on a column you won’t keep!

Using group_by()

Instead of filtering one group at a time, use group_by() to work with all groups:

str(gapminder)

group_by() Structure

str(gapminder %>% group_by(continent))

Notice the grouped_df class - it’s now organized by continent!

Using summarize()

Combine group_by() with summarize() to calculate statistics per group:

gdp_bycontinents <- gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

gdp_bycontinents

Challenge 2

Calculate the average life expectancy per country.

Which has the longest average life expectancy and which has the shortest average life expectancy?

Challenge 2 Solution

lifeExp_bycountry <- gapminder %>%
  group_by(country) %>%
  summarize(mean_lifeExp = mean(lifeExp))

lifeExp_bycountry %>%
  filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))

Challenge 2 Solution (using arrange)

Another way using arrange():

# Shortest life expectancy
lifeExp_bycountry %>%
  arrange(mean_lifeExp) %>%
  head(1)

# Longest life expectancy
lifeExp_bycountry %>%
  arrange(desc(mean_lifeExp)) %>%
  head(1)

Grouping by Multiple Variables

Group by more than one variable:

gdp_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

head(gdp_bycontinents_byyear)

Multiple Summary Statistics

Create multiple summary variables at once:

gdp_pop_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    sd_gdpPercap = sd(gdpPercap),
    mean_pop = mean(pop),
    sd_pop = sd(pop)
  )

head(gdp_pop_bycontinents_byyear)

count() and n()

Count observations per group with count():

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

Using n() in Calculations

Use n() for the number of observations in calculations:

gapminder %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))

Multiple Summary Statistics Example

Calculate min, max, mean, and standard error:

gapminder %>%
  group_by(continent) %>%
  summarize(
    mean_le = mean(lifeExp),
    min_le = min(lifeExp),
    max_le = max(lifeExp),
    se_le = sd(lifeExp) / sqrt(n())
  )

Using mutate()

Create new columns with mutate():

gdp_pop_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    mean_gdp_billion = mean(gdp_billion)
  )

head(gdp_pop_bycontinents_byyear)

mutate() with ifelse()

Combine mutate() with ifelse() for conditional values:

gdp_pop_above25 <- gapminder %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdp_billion = mean(gdp_billion))

head(gdp_pop_above25)

Scaling Values Conditionally

Update values based on conditions:

gdp_future <- gapminder %>%
  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    mean_gdpPercap_expected = mean(gdp_futureExpectation)
  )

head(gdp_future)

Combining dplyr and ggplot2

Load ggplot2:

library(ggplot2)

Traditional Approach

Creating an intermediate variable:

americas <- gapminder[gapminder$continent == "Americas", ]

ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))

Piping into ggplot2

No intermediate variable needed:

gapminder %>%
  filter(continent == "Americas") %>%
  ggplot(mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))

Combining mutate and ggplot2

Create new columns and plot in one pipeline:

gapminder %>%
  mutate(startsWith = substr(country, 1, 1)) %>%
  filter(startsWith %in% c("A", "Z")) %>%
  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
  geom_line() +
  facet_wrap(vars(country)) +
  theme_minimal()

Advanced Challenge

Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent.

Then arrange the continent names in reverse order.

Hint: Use the dplyr functions arrange() and sample_n().

Advanced Challenge Solution

set.seed(42)  # For reproducibility

lifeExp_2countries_bycontinents <- gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  sample_n(2) %>%
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  arrange(desc(mean_lifeExp))

lifeExp_2countries_bycontinents

Key dplyr Verbs Summary

Verb	Purpose
`select()`	Choose columns
`filter()`	Choose rows
`group_by()`	Group data
`summarize()`	Calculate summaries
`mutate()`	Create new columns
`arrange()`	Sort rows
`count()`	Count observations
`rename()`	Rename columns

Key Points

Use the dplyr package to manipulate data frames
Use select() to choose variables from a data frame
Use filter() to choose data based on values
Use group_by() and summarize() to work with subsets of data
Use mutate() to create new variables
Use pipes (%>%) to chain operations together

Important Reminders

Order of operations matters when combining select() and filter()
group_by() changes the data frame structure
n() returns the count of observations in the current group
Pipes make code more readable and reduce intermediate variables
Combine dplyr with ggplot2 for powerful data visualization workflows

Resources

R for Data Science (online book)
Data Wrangling Cheat Sheet (PDF)
Introduction to dplyr (documentation)
Tidyverse Skills for Data Science (online book)