Data Frame Manipulation with dplyr

Adapted from Software Carpentry

Overview

Today we will learn to:

  • Use the six main data frame manipulation ‘verbs’ with pipes in dplyr
  • Understand how group_by() and summarize() can be combined to summarize datasets
  • Analyze a subset of data using logical filtering

Questions

  • How can I manipulate data frames without repeating myself?

The Problem with Base R

We often need to select observations, group data, or calculate statistics.

Using base R:

gapminder <- read.csv(
  "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
[1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
[1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
[1] 7902.15

This is repetitive and error-prone!

The dplyr Package

The dplyr package provides functions for manipulating data frames that:

  • Reduce repetition
  • Reduce probability of errors
  • Are easier to read

Tidyverse

dplyr belongs to the Tidyverse - a family of R packages designed for data science.

These packages work harmoniously together. Learn more at: tidyverse.org

The Five Main Verbs

We’ll cover these commonly used functions:

  1. select() - choose columns
  2. filter() - choose rows
  3. group_by() - group data
  4. summarize() - calculate summaries
  5. mutate() - create new columns

Plus the pipe operator: %>%

Loading dplyr

library(dplyr)

If not installed, run: install.packages("dplyr")

Using select()

Keep only specific columns:

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
  year     country gdpPercap
1 1952 Afghanistan  779.4453
2 1957 Afghanistan  820.8530
3 1962 Afghanistan  853.1007
4 1967 Afghanistan  836.1971
5 1972 Afghanistan  739.9811
6 1977 Afghanistan  786.1134

Removing Columns with select()

Use - to remove columns:

smaller_gapminder <- select(gapminder, -continent)
head(smaller_gapminder)
      country year      pop lifeExp gdpPercap
1 Afghanistan 1952  8425333  28.801  779.4453
2 Afghanistan 1957  9240934  30.332  820.8530
3 Afghanistan 1962 10267083  31.997  853.1007
4 Afghanistan 1967 11537966  34.020  836.1971
5 Afghanistan 1972 13079460  36.088  739.9811
6 Afghanistan 1977 14880372  38.438  786.1134

Introduction to Pipes

The pipe symbol %>% passes data from one function to the next:

year_country_gdp <- gapminder %>% 
  select(year, country, gdpPercap)

head(year_country_gdp)
  year     country gdpPercap
1 1952 Afghanistan  779.4453
2 1957 Afghanistan  820.8530
3 1962 Afghanistan  853.1007
4 1967 Afghanistan  836.1971
5 1972 Afghanistan  739.9811
6 1977 Afghanistan  786.1134

Understanding Pipes

Step by step:

  1. Start with gapminder data frame
  2. Pass it using %>% to the next step
  3. select() receives the data automatically

Fun Fact: Similar to the shell pipe | but the concept is the same!

Renaming Columns

Use rename() within a pipeline:

tidy_gdp <- year_country_gdp %>% 
  rename(gdp_per_capita = gdpPercap)

head(tidy_gdp)
  year     country gdp_per_capita
1 1952 Afghanistan       779.4453
2 1957 Afghanistan       820.8530
3 1962 Afghanistan       853.1007
4 1967 Afghanistan       836.1971
5 1972 Afghanistan       739.9811
6 1977 Afghanistan       786.1134

Syntax: rename(new_name = old_name)

Using filter()

Keep only rows that meet criteria:

year_country_gdp_euro <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

head(year_country_gdp_euro)
  year country gdpPercap
1 1952 Albania  1601.056
2 1957 Albania  1942.284
3 1962 Albania  2312.889
4 1967 Albania  2760.197
5 1972 Albania  3313.422
6 1977 Albania  3533.004

Combining filter() Conditions

Filter with multiple conditions:

europe_lifeExp_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

head(europe_lifeExp_2007)
                 country lifeExp
1                Albania  76.423
2                Austria  79.829
3                Belgium  79.441
4 Bosnia and Herzegovina  74.852
5               Bulgaria  73.005
6                Croatia  75.748

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents.

How many rows does your data frame have and why?

Challenge 1 Solution

year_country_lifeExp_Africa <- gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)

nrow(year_country_lifeExp_Africa)
[1] 624

624 rows (52 African countries × 12 years of data)

Order of Operations

The order of operations matters!

If we used select() first, filter() would not be able to find the variable continent since we would have removed it.

Always filter before selecting when you need to filter on a column you won’t keep!

Using group_by()

Instead of filtering one group at a time, use group_by() to work with all groups:

str(gapminder)
'data.frame':   1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

group_by() Structure

str(gapminder %>% group_by(continent))
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
 - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
  ..$ .rows    : list<int> [1:5] 
  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

Notice the grouped_df class - it’s now organized by continent!

Using summarize()

Combine group_by() with summarize() to calculate statistics per group:

gdp_bycontinents <- gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

gdp_bycontinents
# A tibble: 5 × 2
  continent mean_gdpPercap
  <chr>              <dbl>
1 Africa             2194.
2 Americas           7136.
3 Asia               7902.
4 Europe            14469.
5 Oceania           18622.

Challenge 2

Calculate the average life expectancy per country.

Which has the longest average life expectancy and which has the shortest average life expectancy?

Challenge 2 Solution

lifeExp_bycountry <- gapminder %>%
  group_by(country) %>%
  summarize(mean_lifeExp = mean(lifeExp))

lifeExp_bycountry %>%
  filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
# A tibble: 2 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Iceland              76.5
2 Sierra Leone         36.8

Challenge 2 Solution (using arrange)

Another way using arrange():

# Shortest life expectancy
lifeExp_bycountry %>%
  arrange(mean_lifeExp) %>%
  head(1)
# A tibble: 1 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Sierra Leone         36.8
# Longest life expectancy
lifeExp_bycountry %>%
  arrange(desc(mean_lifeExp)) %>%
  head(1)
# A tibble: 1 × 2
  country mean_lifeExp
  <chr>          <dbl>
1 Iceland         76.5

Grouping by Multiple Variables

Group by more than one variable:

gdp_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

head(gdp_bycontinents_byyear)
# A tibble: 6 × 3
# Groups:   continent [1]
  continent  year mean_gdpPercap
  <chr>     <int>          <dbl>
1 Africa     1952          1253.
2 Africa     1957          1385.
3 Africa     1962          1598.
4 Africa     1967          2050.
5 Africa     1972          2340.
6 Africa     1977          2586.

Multiple Summary Statistics

Create multiple summary variables at once:

gdp_pop_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    sd_gdpPercap = sd(gdpPercap),
    mean_pop = mean(pop),
    sd_pop = sd(pop)
  )

head(gdp_pop_bycontinents_byyear)
# A tibble: 6 × 6
# Groups:   continent [1]
  continent  year mean_gdpPercap sd_gdpPercap mean_pop    sd_pop
  <chr>     <int>          <dbl>        <dbl>    <dbl>     <dbl>
1 Africa     1952          1253.         983. 4570010.  6317450.
2 Africa     1957          1385.        1135. 5093033.  7076042.
3 Africa     1962          1598.        1462. 5702247.  7957545.
4 Africa     1967          2050.        2848. 6447875.  8985505.
5 Africa     1972          2340.        3287. 7305376. 10130833.
6 Africa     1977          2586.        4142. 8328097. 11585184.

count() and n()

Count observations per group with count():

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)
  continent  n
1    Africa 52
2      Asia 33
3    Europe 30
4  Americas 25
5   Oceania  2

Using n() in Calculations

Use n() for the number of observations in calculations:

gapminder %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))
# A tibble: 5 × 2
  continent se_le
  <chr>     <dbl>
1 Africa    0.366
2 Americas  0.540
3 Asia      0.596
4 Europe    0.286
5 Oceania   0.775

Multiple Summary Statistics Example

Calculate min, max, mean, and standard error:

gapminder %>%
  group_by(continent) %>%
  summarize(
    mean_le = mean(lifeExp),
    min_le = min(lifeExp),
    max_le = max(lifeExp),
    se_le = sd(lifeExp) / sqrt(n())
  )
# A tibble: 5 × 5
  continent mean_le min_le max_le se_le
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 Africa       48.9   23.6   76.4 0.366
2 Americas     64.7   37.6   80.7 0.540
3 Asia         60.1   28.8   82.6 0.596
4 Europe       71.9   43.6   81.8 0.286
5 Oceania      74.3   69.1   81.2 0.775

Using mutate()

Create new columns with mutate():

gdp_pop_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    mean_gdp_billion = mean(gdp_billion)
  )

head(gdp_pop_bycontinents_byyear)
# A tibble: 6 × 4
# Groups:   continent [1]
  continent  year mean_gdpPercap mean_gdp_billion
  <chr>     <int>          <dbl>            <dbl>
1 Africa     1952          1253.             5.99
2 Africa     1957          1385.             7.36
3 Africa     1962          1598.             8.78
4 Africa     1967          2050.            11.4 
5 Africa     1972          2340.            15.1 
6 Africa     1977          2586.            18.7 

mutate() with ifelse()

Combine mutate() with ifelse() for conditional values:

gdp_pop_above25 <- gapminder %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdp_billion = mean(gdp_billion))

head(gdp_pop_above25)
# A tibble: 6 × 3
# Groups:   continent [1]
  continent  year mean_gdp_billion
  <chr>     <int>            <dbl>
1 Africa     1952             5.99
2 Africa     1957             7.36
3 Africa     1962             8.78
4 Africa     1967            11.4 
5 Africa     1972            15.1 
6 Africa     1977            18.7 

Scaling Values Conditionally

Update values based on conditions:

gdp_future <- gapminder %>%
  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    mean_gdpPercap_expected = mean(gdp_futureExpectation)
  )

head(gdp_future)
# A tibble: 6 × 4
# Groups:   continent [1]
  continent  year mean_gdpPercap mean_gdpPercap_expected
  <chr>     <int>          <dbl>                   <dbl>
1 Africa     1952          1253.                   1578.
2 Africa     1957          1385.                   1803.
3 Africa     1962          1598.                   2227.
4 Africa     1967          2050.                   2953.
5 Africa     1972          2340.                   3416.
6 Africa     1977          2586.                   3830.

Combining dplyr and ggplot2

Load ggplot2:

library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.5.2

Traditional Approach

Creating an intermediate variable:

americas <- gapminder[gapminder$continent == "Americas", ]

ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))

Piping into ggplot2

No intermediate variable needed:

gapminder %>%
  filter(continent == "Americas") %>%
  ggplot(mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))

Combining mutate and ggplot2

Create new columns and plot in one pipeline:

gapminder %>%
  mutate(startsWith = substr(country, 1, 1)) %>%
  filter(startsWith %in% c("A", "Z")) %>%
  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
  geom_line() +
  facet_wrap(vars(country)) +
  theme_minimal()

Advanced Challenge

Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent.

Then arrange the continent names in reverse order.

Hint: Use the dplyr functions arrange() and sample_n().

Advanced Challenge Solution

set.seed(42)  # For reproducibility

lifeExp_2countries_bycontinents <- gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  sample_n(2) %>%
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  arrange(desc(mean_lifeExp))

lifeExp_2countries_bycontinents
# A tibble: 5 × 2
  continent mean_lifeExp
  <chr>            <dbl>
1 Oceania           79.7
2 Asia              75.9
3 Europe            75.9
4 Americas          74.3
5 Africa            59.8

Key dplyr Verbs Summary

Verb Purpose
select() Choose columns
filter() Choose rows
group_by() Group data
summarize() Calculate summaries
mutate() Create new columns
arrange() Sort rows
count() Count observations
rename() Rename columns

Key Points

  • Use the dplyr package to manipulate data frames
  • Use select() to choose variables from a data frame
  • Use filter() to choose data based on values
  • Use group_by() and summarize() to work with subsets of data
  • Use mutate() to create new variables
  • Use pipes (%>%) to chain operations together

Important Reminders

  • Order of operations matters when combining select() and filter()
  • group_by() changes the data frame structure
  • n() returns the count of observations in the current group
  • Pipes make code more readable and reduce intermediate variables
  • Combine dplyr with ggplot2 for powerful data visualization workflows

Resources

Questions?