Exploring Data Frames in R

Adapted from Software Carpentry

Overview

Today we will learn to:

  • Add and remove rows or columns
  • Append two data frames
  • Explore data frame properties
  • Read data from CSV files

What is a Data Frame?

A data frame is a table, and the most common data structure in R:

  • Each column is a vector (one data type per column)
  • Each row is a list (data types can differ across columns)
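
To make the column and row distinction concrete, here is a minimal sketch that builds a small table by hand with data.frame() and inspects one column and one row. The name demo and the values are illustrative only.

```r
# Build a small data frame by hand (values are illustrative)
demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1L, 0L, 1L)
)

# A column is an atomic vector with a single type
col_type <- typeof(demo$weight)   # "double"
is.vector(demo$weight)            # TRUE

# A row mixes types, so it comes back as a one-row data frame,
# which is itself a list under the hood
row_type <- typeof(demo[1, ])     # "list"
```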

Reading Data

Load data from CSV files:

cats <- read.csv("data/feline-data.csv")
cats
    coat weight likes_catnip
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
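
If the data/feline-data.csv file is not at hand, the same steps can be tried on a throwaway file. The sketch below writes the three rows shown above to a temporary file and reads them back; csv_path and cats_demo are names made up for this example.

```r
# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("coat,weight,likes_catnip",
    "calico,2.1,1",
    "black,5.0,0",
    "tabby,3.2,1"),
  csv_path
)

# read.csv() treats the first line as column names by default (header = TRUE)
cats_demo <- read.csv(csv_path)
str(cats_demo)
```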

Adding Columns

Use cbind() to add columns:

age <- c(2, 3, 5)
cbind(cats, age)
    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Try: Adding a Different Vector

What happens if we try with a different number of values?

age <- c(2, 3, 5, 12)
cbind(cats, age)

Error: Too Many Values

Error in `data.frame()`:
! arguments imply differing number of rows: 3, 4

The data frame has 3 rows but age has 4 values.

Adding Columns: Key Rule

Number of rows must match vector length:

nrow(cats)  # 3
length(age) # must also be 3

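Mismatched lengths cause an error. The one exception is a shorter vector whose length divides the row count evenly (such as a single value), which R recycles to fill the column.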
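
One way to apply this rule is to compare the two lengths before calling cbind(). The following is only a sketch of that check, using a hand-built stand-in called cats_demo rather than the cats object from the slides.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)
age <- c(2, 3, 5, 12)  # one value too many

# Compare lengths first instead of letting cbind() fail
if (length(age) == nrow(cats_demo)) {
  cats_demo <- cbind(cats_demo, age)
} else {
  message("age has ", length(age), " values but the data frame has ",
          nrow(cats_demo), " rows; column not added")
}
```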

Saving the New Column

To keep the new column, assign it back to cats:

age <- c(2, 3, 5)
cats <- cbind(cats, age)
cats
    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Adding Rows

Use rbind() to add a row. Because a row mixes data types, pass it as a list (TRUE is stored as 1 in the integer likes_catnip column):

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats
           coat weight likes_catnip age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9
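
A list relies on the values being in the right order. As an alternative sketch (not the approach used in the slides), a one-row data frame names every value, so rbind() matches columns by name. The names cats_demo and new_cat are illustrative.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1L, 0L, 1L),
  age = c(2, 3, 5)
)

# A one-row data frame names each value, so rbind() can match columns by name
new_cat <- data.frame(
  coat = "tortoiseshell",
  weight = 3.3,
  likes_catnip = 1L,
  age = 9
)
cats_demo <- rbind(cats_demo, new_cat)
nrow(cats_demo)  # 4
```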

Removing Rows

Use negative indices to drop rows:

# Remove row 4
cats[-4, ]
    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Remove multiple rows:

# Remove rows 3 and 4
cats[c(-3, -4), ]
    coat weight likes_catnip age
1 calico    2.1            1   2
2  black    5.0            0   3

Removing Columns

Use negative column indices:

# Remove column 4
cats[, -4]
           coat weight likes_catnip
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1

Or use column names with %in%:

drop <- names(cats) %in% c("age")
cats[, !drop]
           coat weight likes_catnip
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1
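
Another way to drop by name, offered here as a sketch rather than the slides' method, is to compute the columns to keep with setdiff(). The example also shows a related caution: selecting a single column returns a plain vector unless drop = FALSE is supplied. Object names are illustrative.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  age = c(2, 3, 5)
)

# Keep every column except "age", selecting by name
keep <- setdiff(names(cats_demo), "age")
without_age <- cats_demo[, keep]
names(without_age)  # "coat" "weight"

# Caution: selecting a single column returns a plain vector
# unless drop = FALSE is given
one_col <- cats_demo[, "weight", drop = FALSE]
is.data.frame(one_col)  # TRUE
```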

Appending Data Frames

Combine two data frames with rbind():

cats <- rbind(cats, cats)
cats
           coat weight likes_catnip age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9
5        calico    2.1            1   2
6         black    5.0            0   3
7         tabby    3.2            1   5
8 tortoiseshell    3.3            1   9

Result: every row now appears twice, because cats was appended to itself.
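
If the duplicates are not wanted, base R's unique() can remove repeated rows. This is a minimal sketch on a two-row stand-in table; doubled and deduped are hypothetical names.

```r
cats_demo <- data.frame(
  coat = c("calico", "black"),
  weight = c(2.1, 5.0)
)

doubled <- rbind(cats_demo, cats_demo)
nrow(doubled)                 # 4

# unique() drops rows that repeat an earlier row
deduped <- unique(doubled)
nrow(deduped)                 # 2
```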

Challenge 1

Create a data frame with your information:

df <- data.frame(
  first_name = c("Your", "Name"),
  last_name = c("Goes", "Here"),
  lucky_number = c(7, 13)
)

Then:

  1. Use rbind() to add an entry for the person sitting beside you

  2. Use cbind() to add a column with each person’s answer to the question, “Is it time for coffee break?”

Commit Your Work

Save progress to version control:

git add .
git commit -m "Add data frame manipulation examples"

Reading the Gapminder Dataset

Now let’s work with a realistic dataset. Load the gapminder data:

gapminder <- read.csv("data/gapminder_data.csv")

Or read directly from the internet:

gapminder <- read.csv(
  "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv"
)

Exploring Gapminder: str()

Check the structure of the dataset:

str(gapminder)

Output shows 1704 observations and 6 variables:

  • country (character)
  • year (integer)
  • pop (numeric)
  • continent (character)
  • lifeExp (numeric)
  • gdpPercap (numeric)
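
The same per-column type information can also be collected programmatically. The sketch below uses a two-row stand-in (values taken from the first and last gapminder rows) so that it runs without the CSV file; demo and col_classes are illustrative names.

```r
demo <- data.frame(
  country = c("Afghanistan", "Zimbabwe"),
  year = c(1952L, 2007L),
  lifeExp = c(28.801, 43.487),
  stringsAsFactors = FALSE
)

str(demo)

# The same type information as a named character vector
col_classes <- sapply(demo, class)
col_classes
```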

Exploring Gapminder: summary()

Get summary statistics:

summary(gapminder)
   country               year           pop             continent        
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
                    Mean   :1980   Mean   :2.960e+07                     
                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
                    Max.   :2007   Max.   :1.319e+09                     
    lifeExp        gdpPercap       
 Min.   :23.60   Min.   :   241.2  
 1st Qu.:48.20   1st Qu.:  1202.1  
 Median :60.71   Median :  3531.8  
 Mean   :59.47   Mean   :  7215.3  
 3rd Qu.:70.85   3rd Qu.:  9325.5  
 Max.   :82.60   Max.   :113523.1  

For numeric columns, summary() shows the minimum, quartiles, median, mean, and maximum. For character columns, it shows only the length, class, and mode.
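
summary() also works on a single numeric vector, and its result can be indexed by name. A short sketch, using six lifeExp values from the first gapminder rows as sample input:

```r
# Six life expectancy values, typed in by hand as sample input
life_exp <- c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)

s <- summary(life_exp)
names(s)   # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

# Individual statistics can be pulled out by name
s[["Max."]]
```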

Data Type by Column

Check individual column types:

typeof(gapminder$year)      # integer
[1] "integer"
typeof(gapminder$country)   # character
[1] "character"
str(gapminder$country)      # character vector length 1704
 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Checking Data Frame Properties

Examine dimensions and types:

typeof(gapminder)        # list
[1] "list"
length(gapminder)        # 6 (columns)
[1] 6
nrow(gapminder)          # 1704 rows
[1] 1704
ncol(gapminder)          # 6 columns
[1] 6
dim(gapminder)           # 1704 6
[1] 1704    6
colnames(gapminder)      # column names
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
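
Why length() reports columns while nrow() reports rows is easier to see on a tiny table. The following sketch uses a hand-built four-row data frame (demo is an illustrative name).

```r
demo <- data.frame(
  year = c(1952L, 1957L, 1962L, 1967L),
  lifeExp = c(28.801, 30.332, 31.997, 34.020)
)

length(demo)          # 2: a data frame is a list with one element per column
ncol(demo)            # 2
nrow(demo)            # 4
length(demo$year)     # 4: each column vector has one value per row
dims <- dim(demo)     # 4 2 (rows first, then columns)
```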

Viewing the Data

Check the first and last rows:

head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
tail(gapminder)
      country year      pop continent lifeExp gdpPercap
1699 Zimbabwe 1982  7636524    Africa  60.363  788.8550
1700 Zimbabwe 1987  9216418    Africa  62.351  706.1573
1701 Zimbabwe 1992 10704340    Africa  60.377  693.4208
1702 Zimbabwe 1997 11404948    Africa  46.809  792.4500
1703 Zimbabwe 2002 11926563    Africa  39.989  672.0386
1704 Zimbabwe 2007 12311143    Africa  43.487  469.7093

Challenge 2

Examine the gapminder data:

  1. Check the last few lines of the data
  2. Check some rows in the middle
  3. Try to pull a few random rows

(Hint: Use functions like tail(), subsetting, and sample())

Challenge 2 Solution

tail(gapminder)

# Middle rows (example: rows 800–810)
gapminder[800:810, ]

# Random rows
gapminder[sample(nrow(gapminder), 5), ]
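
Because sample() draws different rows on every run, it can help to fix the random seed when results need to be repeatable. This sketch uses a made-up 100-row table in place of gapminder, and the seed value 42 is arbitrary.

```r
demo <- data.frame(id = 1:100, value = (1:100) * 2)

# Fixing the seed makes the "random" draw repeatable
set.seed(42)
first_draw <- demo[sample(nrow(demo), 5), ]

set.seed(42)
second_draw <- demo[sample(nrow(demo), 5), ]

identical(first_draw, second_draw)  # TRUE
```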

Challenge 3

Create an R script:

  1. Go to File > New File > R Script
  2. Write code to load the gapminder dataset
  3. Save in the scripts/ directory
  4. Add to version control
  5. Run the script using the source() function
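
To see what source() does without creating a project file, the sketch below writes a tiny script to a temporary location and runs it. The script contents and the names script_path, lucky_numbers, and total are hypothetical stand-ins for a real file in scripts/.

```r
# Write a tiny script to a temporary file (a stand-in for a file in scripts/)
script_path <- tempfile(fileext = ".R")
writeLines(
  c("# Hypothetical script contents",
    "lucky_numbers <- c(7, 13)",
    "total <- sum(lucky_numbers)"),
  script_path
)

# source() runs the file; by default the objects it creates
# land in the global environment
source(script_path)
total  # 20
```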

Challenge 4

Interpret the output of str(gapminder):

Using what you know about lists and vectors, explain:

  • What does each line of str() output mean?
  • Why does length() return 6 while nrow() returns 1704?
  • How does this relate to the data structure?

Discuss with your neighbor.

Important Reminders

  • Every value in a column must have the same type
  • A new column needs one value per existing row
  • Rows are added as lists
  • Always check structure after loading data
  • Use str() first, then explore further

Commit Your Final Work

Save all changes to version control:

git add .
git commit -m "Complete data frame exploration lesson"

Always commit your final work.

Resources

  • Software Carpentry: r-novice-gapminder
  • RStudio Cheatsheets
  • R Documentation: ?read.csv, ?str, etc.

Questions?