Exploring Data Frames in R

Adapted from Software Carpentry

Overview

Today we will learn to:

Add and remove rows or columns
Append two data frames
Explore data frame properties
Read data from CSV files

What is a Data Frame?

A data frame is a table where:

Columns are vectors (same data type within column)
Rows are lists (can mix data types across columns)
Data frames are the most common structure for tabular data in R
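
The points about columns and rows can be checked directly in R. This is a minimal sketch, assuming a small hand-built data frame; `pets` and its values are hypothetical and not part of the lesson data.

```r
# Build a small data frame by hand (hypothetical example data)
pets <- data.frame(
  name = c("Rex", "Tom"),
  weight = c(12.5, 4.2),
  stringsAsFactors = FALSE
)

# A data frame is stored as a list of columns
is.list(pets)
typeof(pets)

# Each column is a vector with a single type
typeof(pets$name)
typeof(pets$weight)

# A single row mixes types, so it stays a data frame (a list), not a vector
first_row <- pets[1, ]
class(first_row)
```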

Reading Data

Load data from CSV files:

cats <- read.csv("data/feline-data.csv")
cats

    coat weight likes_catnip
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
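
If data/feline-data.csv is not available, the same round trip can be tried with a temporary file. This is a sketch under that assumption: the values are copied from the table above, while `cats_example`, `csv_path`, and `cats_from_csv` are illustrative names.

```r
# Recreate the example data (values copied from the table above)
cats_example <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1)
)

# Write to a temporary CSV file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(cats_example, csv_path, row.names = FALSE)

# Read it back the same way the lesson reads data/feline-data.csv
cats_from_csv <- read.csv(csv_path)
cats_from_csv
```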

Adding Columns

Use cbind() to add columns:

age <- c(2, 3, 5)
cbind(cats, age)

    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Try: Adding a Different Vector

What happens if we try with a different number of values?

age <- c(2, 3, 5, 12)
cbind(cats, age)

Error: Too Many Values

Error in `data.frame()`:
! arguments imply differing number of rows: 3, 4

The data frame has 3 rows but age has 4 values.

Adding Columns: Key Rule

Number of rows must match vector length:

nrow(cats)  # 3
length(age) # must also be 3

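A vector of the wrong length usually causes an error. The exception is a shorter vector whose length divides the number of rows evenly (such as a single value), which R recycles silently.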
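
One way to apply this rule is to compare the two lengths before calling cbind(). This is a minimal sketch, assuming the three-row cats data is rebuilt by hand; `add_column_checked` is a hypothetical helper, not a base R function.

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1)
)

# Hypothetical helper: bind only when the lengths match exactly.
# This is stricter than cbind(), which would recycle a single value.
add_column_checked <- function(df, values, name) {
  if (length(values) != nrow(df)) {
    stop("expected ", nrow(df), " values, got ", length(values))
  }
  out <- cbind(df, values)
  names(out)[ncol(out)] <- name   # rename the new last column
  out
}

cats_with_age <- add_column_checked(cats, c(2, 3, 5), "age")
names(cats_with_age)

# A vector of the wrong length is rejected; capture the message instead of stopping
result <- tryCatch(
  add_column_checked(cats, c(2, 3, 5, 12), "age"),
  error = function(e) conditionMessage(e)
)
result
```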

Saving the New Column

To keep the new column, assign it back to cats:

age <- c(2, 3, 5)
cats <- cbind(cats, age)
cats

    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Adding Rows

Use rbind() to add rows (as lists):

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats

           coat weight likes_catnip age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9
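
An unnamed list is matched to the columns by position, so the order of its values matters. An alternative is to add the row as a one-row data frame, which rbind() matches by column name. This is a sketch, assuming the cats data is rebuilt by hand; `new_cat` and `cats_plus` are illustrative names.

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  age = c(2, 3, 5)
)

# One-row data frame; columns deliberately given in a different order
new_cat <- data.frame(
  age = 9,
  coat = "tortoiseshell",
  likes_catnip = 1,
  weight = 3.3
)

# rbind() on two data frames matches columns by name, not position
cats_plus <- rbind(cats, new_cat)
cats_plus
```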

Removing Rows

Use negative indices to drop rows:

# Remove row 4
cats[-4, ]

    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Remove multiple rows:

# Remove rows 3 and 4
cats[c(-3, -4), ]

    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3

Removing Columns

Use negative column indices:

# Remove column 4
cats[, -4]

           coat weight likes_catnip
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1

Or use column names with %in%:

drop <- names(cats) %in% c("age")
cats[, !drop]

           coat weight likes_catnip
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1
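
One caveat when removing columns: if only one column remains, `[` returns a plain vector unless drop = FALSE is given. This is a minimal sketch, assuming a hand-built three-column version of cats; `keep`, `as_vector`, and `as_frame` are illustrative names.

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# TRUE for the columns we want to keep
keep <- !(names(cats) %in% c("weight", "age"))

# Only one column is left, so the result collapses to a vector
as_vector <- cats[, keep]
class(as_vector)

# drop = FALSE keeps the data frame structure
as_frame <- cats[, keep, drop = FALSE]
class(as_frame)
```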

Appending Data Frames

Combine two data frames with rbind():

cats <- rbind(cats, cats)
cats

           coat weight likes_catnip age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9
5        calico    2.1            1   2
6         black    5.0            0   3
7         tabby    3.2            1   5
8 tortoiseshell    3.3            1   9

Result: duplicate rows appended to original data
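
The same call appends two different data frames, provided their column names agree. This is a sketch, assuming two small hand-built data frames; `indoor`, `outdoor`, and `mismatched` are hypothetical and not part of the lesson data.

```r
indoor <- data.frame(coat = c("calico", "black"), weight = c(2.1, 5.0))
outdoor <- data.frame(coat = "tabby", weight = 3.2)

# Same column names, so the rows are stacked
all_cats <- rbind(indoor, outdoor)
all_cats

# A data frame with a different column name cannot be appended
mismatched <- data.frame(colour = "white", weight = 4.0)
outcome <- tryCatch(
  rbind(indoor, mismatched),
  error = function(e) "error"
)
outcome
```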

Challenge 1

Create a data frame with your information:

# One row holding your own details; replace the placeholder values
df <- data.frame(
  first_name = "YourFirstName",
  last_name = "YourLastName",
  lucky_number = 7
)

Then:

Use rbind() to add an entry for the person sitting beside you
Use cbind() to add a column with each person’s answer to the question, “Is it time for coffee break?”

Commit Your Work

Save progress to version control:

git add .
git commit -m "Add data frame manipulation examples"

Reading the Gapminder Dataset

Now let’s work with a realistic dataset. Load the gapminder data:

gapminder <- read.csv("data/gapminder_data.csv")

Or read directly from the internet:

gapminder <- read.csv(
  "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)

Exploring Gapminder: str()

Check the structure of the dataset:

str(gapminder)

Output shows 1704 observations and 6 variables:

country (character)
year (integer)
pop (numeric)
continent (character)
lifeExp (numeric)
gdpPercap (numeric)
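
The per-column types that str() reports can also be collected programmatically. This is a sketch, assuming a two-row stand-in with the same column names as gapminder, so it runs without the CSV file; `gap_mini` and `column_classes` are illustrative names.

```r
# Two-row stand-in using the first and last gapminder rows
gap_mini <- data.frame(
  country = c("Afghanistan", "Zimbabwe"),
  year = c(1952L, 2007L),
  pop = c(8425333, 12311143),
  continent = c("Asia", "Africa"),
  lifeExp = c(28.801, 43.487),
  gdpPercap = c(779.4453, 469.7093),
  stringsAsFactors = FALSE
)

str(gap_mini)

# One class per column, returned as a named character vector
column_classes <- sapply(gap_mini, class)
column_classes
```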

Exploring Gapminder: summary()

Get summary statistics:

summary(gapminder)

   country               year           pop             continent        
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
                    Mean   :1980   Mean   :2.960e+07                     
                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
                    Max.   :2007   Max.   :1.319e+09                     
    lifeExp        gdpPercap       
 Min.   :23.60   Min.   :   241.2  
 1st Qu.:48.20   1st Qu.:  1202.1  
 Median :60.71   Median :  3531.8  
 Mean   :59.47   Mean   :  7215.3  
 3rd Qu.:70.85   3rd Qu.:  9325.5  
 Max.   :82.60   Max.   :113523.1

For numeric columns, summary() shows the minimum, quartiles, median, mean, and maximum. For character columns, it reports only the length, class, and mode.
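
summary() also works on a single numeric vector, and the result can be indexed by statistic name. This is a minimal sketch, assuming the six lifeExp values that head(gapminder) prints later in the lesson; `life_exp` and `life_summary` are illustrative names.

```r
# The first six lifeExp values from gapminder
life_exp <- c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)

life_summary <- summary(life_exp)
life_summary

# The statistics are named, so individual values can be picked out
names(life_summary)
life_summary["Median"]
```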

Data Type by Column

Check individual column types:

typeof(gapminder$year)      # integer

[1] "integer"

typeof(gapminder$country)   # character

[1] "character"

str(gapminder$country)      # character vector length 1704

 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Checking Data Frame Properties

Examine dimensions and types:

typeof(gapminder)        # list

[1] "list"

length(gapminder)        # 6 (columns)

[1] 6

nrow(gapminder)          # 1704 rows

[1] 1704

ncol(gapminder)          # 6 columns

[1] 6

dim(gapminder)           # 1704 6

[1] 1704    6

colnames(gapminder)      # column names

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
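
These functions are related to one another, which follows from a data frame being a list of columns. This is a sketch, assuming a small stand-in data frame (`gap_mini` is an illustrative name) so it runs without the CSV file.

```r
gap_mini <- data.frame(
  country = c("Afghanistan", "Zimbabwe"),
  year = c(1952L, 2007L),
  pop = c(8425333, 12311143)
)

# length() counts list elements, and each element is a column
same_length <- length(gap_mini) == ncol(gap_mini)

# dim() is nrow() followed by ncol()
same_dim <- identical(dim(gap_mini), c(nrow(gap_mini), ncol(gap_mini)))

# names() and colnames() agree for data frames
same_names <- identical(names(gap_mini), colnames(gap_mini))

c(same_length, same_dim, same_names)
```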

Viewing the Data

Check the first and last rows:

head(gapminder)

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

tail(gapminder)

      country year      pop continent lifeExp gdpPercap
1699 Zimbabwe 1982  7636524    Africa  60.363  788.8550
1700 Zimbabwe 1987  9216418    Africa  62.351  706.1573
1701 Zimbabwe 1992 10704340    Africa  60.377  693.4208
1702 Zimbabwe 1997 11404948    Africa  46.809  792.4500
1703 Zimbabwe 2002 11926563    Africa  39.989  672.0386
1704 Zimbabwe 2007 12311143    Africa  43.487  469.7093

Challenge 2

Examine the gapminder data:

Check the last few lines of the data
Check some rows in the middle
Try to pull a few random rows

(Hint: Use functions like tail(), subsetting, and sample())

Challenge 2 Solution

tail(gapminder)

# Middle rows (example: rows 800–810)
gapminder[800:810, ]

# Random rows
gapminder[sample(nrow(gapminder), 5), ]
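
sample() draws different rows on every run. Calling set.seed() first makes the draw repeatable. This is a sketch, assuming a hypothetical 20-row stand-in (`gap_mini`) in place of gapminder; the seed value 42 is arbitrary.

```r
# Hypothetical stand-in data so the example runs without the CSV file
gap_mini <- data.frame(id = 1:20, value = (1:20)^2)

# Same seed, same call: the same five rows are drawn both times
set.seed(42)
first_draw <- gap_mini[sample(nrow(gap_mini), 5), ]

set.seed(42)
second_draw <- gap_mini[sample(nrow(gap_mini), 5), ]

identical(first_draw, second_draw)
```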

Challenge 3

Create an R script:

Go to File > New File > R Script
Write code to load the gapminder dataset
Save in the scripts/ directory
Add to version control
Run the script using the source() function

Challenge 4

Interpret the output of str(gapminder):

Using what you know about lists and vectors, explain:

What does each line of str() output mean?
Why does length() return 6 while nrow() returns 1704?
How does this relate to the data structure?

Discuss with your neighbor.

Important Reminders

Columns must have consistent types
Number of elements must match when adding columns
Rows are added as lists
Always check structure after loading data
Use str() first, then explore further

Commit Your Final Work

Save all changes to version control:

git add .
git commit -m "Complete data frame exploration lesson"

Always commit your final work.

Resources

Software Carpentry: r-novice-gapminder
RStudio Cheatsheets
R Documentation: ?read.csv, ?str, etc.