cats <- read.csv("data/feline-data.csv")
cats
coat weight likes_catnip
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
Adapted from Software Carpentry
Today we will learn to:
A data frame is a table where:
Load data from CSV files:
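A minimal sketch of the round trip, assuming a temporary file stands in for data/feline-data.csv so the example runs anywhere. The values are the three cats printed above; `csv_path` is a hypothetical name.

```r
# Write a small CSV to a temporary location (stand-in for data/feline-data.csv).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "coat,weight,likes_catnip",
  "calico,2.1,1",
  "black,5.0,0",
  "tabby,3.2,1"
), csv_path)

# read.csv() returns a data frame with one column per CSV field.
cats <- read.csv(csv_path)
cats
```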
Use cbind() to add columns:
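For example, assuming the cats table is rebuilt by hand (rather than read from the CSV) and that the ages are the ones shown in later output:

```r
# Rebuild the three-row cats table from the values printed earlier.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# One age per cat: the vector length matches the number of rows.
age <- c(2, 3, 5)

# cbind() returns a new data frame; cats itself is not modified here.
cbind(cats, age)
```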
What happens if we try with a different number of values?
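A sketch of that experiment, wrapped in tryCatch() so the script keeps running and the message can be inspected. The fourth age value is illustrative, not taken from the data.

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Four values for a three-row data frame (the extra value is made up).
age <- c(2, 3, 5, 12)

# Capture the error message instead of stopping the script.
result <- tryCatch(
  cbind(cats, age),
  error = function(e) conditionMessage(e)
)
result
```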
Error in `data.frame()`:
! arguments imply differing number of rows: 3, 4
The data frame has 3 rows but age has 4 values.
Number of rows must match vector length:
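One way to check before combining, assuming the same hand-built cats table:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  stringsAsFactors = FALSE
)
age <- c(2, 3, 5)

nrow(cats)                  # number of rows in the data frame
length(age)                 # number of values in the new column
nrow(cats) == length(age)   # must be TRUE for cbind() to succeed
```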
Mismatched lengths will cause an error.
To keep the new column, assign it back to cats:
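A short sketch, again assuming cats is rebuilt by hand:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  stringsAsFactors = FALSE
)
age <- c(2, 3, 5)

# Overwrite cats with the widened data frame so the column persists.
cats <- cbind(cats, age)
names(cats)
```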
Use rbind() to add rows (as lists):
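A sketch using the tortoiseshell row that appears in later output. A list is used because a row mixes types (text and numbers), which a vector cannot hold. `newRow` is a hypothetical name.

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# One value per column, in column order.
newRow <- list("tortoiseshell", 3.3, 1, 9)

# Assign back to keep the new row.
cats <- rbind(cats, newRow)
cats
```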
Use negative indices to drop rows:
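For instance, assuming the four-row cats table built up so far:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  likes_catnip = c(1, 0, 1, 1),
  age = c(2, 3, 5, 9),
  stringsAsFactors = FALSE
)

# Everything except row 4; the empty slot after the comma keeps all columns.
cats[-4, ]
```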
coat weight likes_catnip age
1 calico 2.1 1 2
2 black 5.0 0 3
3 tabby 3.2 1 5
Remove multiple rows:
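A sketch with the same four-row table; the row numbers chosen are illustrative:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  likes_catnip = c(1, 0, 1, 1),
  age = c(2, 3, 5, 9),
  stringsAsFactors = FALSE
)

# Put the negative indices in a vector to drop several rows at once.
cats[c(-3, -4), ]
```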
Use negative column indices:
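For example, dropping the fourth column (age) from the four-row table:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  likes_catnip = c(1, 0, 1, 1),
  age = c(2, 3, 5, 9),
  stringsAsFactors = FALSE
)

# The empty slot before the comma keeps all rows; -4 drops column 4.
cats[, -4]
```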
coat weight likes_catnip
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
4 tortoiseshell 3.3 1
Or use column names with %in%:
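A sketch of the name-based version; `drop` is a hypothetical variable name:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  likes_catnip = c(1, 0, 1, 1),
  age = c(2, 3, 5, 9),
  stringsAsFactors = FALSE
)

# TRUE for each column whose name is in the list to remove.
drop <- names(cats) %in% c("age")

# Negate the logical vector to keep everything else.
cats[, !drop]
```

Unlike a numeric index, this keeps working if the column order changes.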
Combine two data frames with rbind():
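For instance, binding the four-row table to itself:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3),
  likes_catnip = c(1, 0, 1, 1),
  age = c(2, 3, 5, 9),
  stringsAsFactors = FALSE
)

# Both data frames must have the same columns.
cats <- rbind(cats, cats)
cats
```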
coat weight likes_catnip age
1 calico 2.1 1 2
2 black 5.0 0 3
3 tabby 3.2 1 5
4 tortoiseshell 3.3 1 9
5 calico 2.1 1 2
6 black 5.0 0 3
7 tabby 3.2 1 5
8 tortoiseshell 3.3 1 9
Result: duplicate rows appended to original data
Create a data frame with your information:
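A possible starting point. The column names and values here are placeholders, not part of the exercise text; substitute your own.

```r
# One row describing one person (placeholder values).
df <- data.frame(
  first = "Grace",
  last = "Hopper",
  lucky_number = 0,
  stringsAsFactors = FALSE
)
df
```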
Then:
Use rbind() to add an entry for the person sitting beside you
Use cbind() to add a column with each person’s answer to the question, “Is it time for a coffee break?”
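One way the two steps might look. All names and values are placeholders, and `coffeetime` is a hypothetical column name.

```r
df <- data.frame(
  first = "Grace",
  last = "Hopper",
  lucky_number = 0,
  stringsAsFactors = FALSE
)

# Add a second person as a list, one value per column.
df <- rbind(df, list("Marie", "Curie", 238))

# Add one answer per row; the vector length must equal nrow(df).
df <- cbind(df, coffeetime = c(TRUE, TRUE))
df
```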
Save progress to version control:
Now let’s work with a realistic dataset. Load the gapminder data:
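A hedged sketch of the load step. The file name and location are assumptions (they are not given here), so the code checks for the file first and adjusts gracefully if it is missing.

```r
# Assumed path; change it to wherever your copy of the data lives.
gapminder_path <- "data/gapminder_data.csv"

if (file.exists(gapminder_path)) {
  gapminder <- read.csv(gapminder_path)
} else {
  gapminder <- NULL
  message("No file at ", gapminder_path, "; adjust the path to your copy.")
}
```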
Or read directly from the internet:
Check the structure of the dataset:
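A sketch using a six-row stand-in built from rows printed in this lesson, so it runs without the CSV. With the real file loaded, the same call describes the full dataset.

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

# str() prints the number of observations, the variables, and each column's type.
str(gapminder)
```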
Output shows 1704 observations and 6 variables:
country (character)
year (integer)
pop (numeric)
continent (character)
lifeExp (numeric)
gdpPercap (numeric)
Get summary statistics:
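A sketch of the call on a six-row stand-in built from rows printed in this lesson; the numbers it reports will differ from those for the full dataset.

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

# One summary block per column.
gapminder_summary <- summary(gapminder)
gapminder_summary
```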
   country               year           pop             continent
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character
                    Mean   :1980   Mean   :2.960e+07
                    3rd Qu.:1993   3rd Qu.:1.959e+07
                    Max.   :2007   Max.   :1.319e+09
    lifeExp        gdpPercap
 Min.   :23.60   Min.   :   241.2
 1st Qu.:48.20   1st Qu.:  1202.1
 Median :60.71   Median :  3531.8
 Mean   :59.47   Mean   :  7215.3
 3rd Qu.:70.85   3rd Qu.:  9325.5
 Max.   :82.60   Max.   :113523.1
For numeric columns, summary() shows the minimum, quartiles, median, mean, and maximum. For character columns, it shows only the length, class, and mode.
Check individual column types:
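For example, on a six-row stand-in built from rows printed in this lesson:

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

# $ extracts one column as a vector; typeof() reports its storage type.
typeof(gapminder$year)
typeof(gapminder$country)
typeof(gapminder$lifeExp)
```

Note that typeof() reports numeric columns with decimals as "double".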
Examine dimensions and types:
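A sketch of the usual inspection calls, run on a six-row stand-in built from rows printed in this lesson:

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

class(gapminder)     # the object's class
length(gapminder)    # number of columns, because a data frame is a list of columns
nrow(gapminder)      # number of rows
ncol(gapminder)      # number of columns
dim(gapminder)       # rows, then columns
colnames(gapminder)  # column names
```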
Check the first and last rows:
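A sketch on a six-row stand-in built from rows printed in this lesson. Both functions return six rows by default; `n = 3` is used here only because the stand-in is small.

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

head(gapminder, n = 3)  # first three rows
tail(gapminder, n = 3)  # last three rows
```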
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
country year pop continent lifeExp gdpPercap
1699 Zimbabwe 1982 7636524 Africa 60.363 788.8550
1700 Zimbabwe 1987 9216418 Africa 62.351 706.1573
1701 Zimbabwe 1992 10704340 Africa 60.377 693.4208
1702 Zimbabwe 1997 11404948 Africa 46.809 792.4500
1703 Zimbabwe 2002 11926563 Africa 39.989 672.0386
1704 Zimbabwe 2007 12311143 Africa 43.487 469.7093
Examine the gapminder data:
(Hint: Use functions like tail(), subsetting, and sample())
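One way to follow the hint, sketched on a six-row stand-in built from rows printed in this lesson. The row counts and positions are illustrative; pick larger ones for the full dataset.

```r
# Six-row stand-in for gapminder (values copied from the lesson's printed rows).
gapminder <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 3),
  year = c(1952L, 1957L, 1962L, 1997L, 2002L, 2007L),
  pop = c(8425333, 9240934, 10267083, 11404948, 11926563, 12311143),
  continent = rep(c("Asia", "Africa"), each = 3),
  lifeExp = c(28.801, 30.332, 31.997, 46.809, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 792.4500, 672.0386, 469.7093),
  stringsAsFactors = FALSE
)

# Last few rows.
tail(gapminder, n = 2)

# Rows from the middle, selected by position.
gapminder[3:4, ]

# A few rows at random: sample() picks row numbers without replacement.
random_rows <- gapminder[sample(nrow(gapminder), 3), ]
random_rows
```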
Create an R script:
scripts/ directory
source() function
Interpret the output of str(gapminder):
Using what you know about lists and vectors, explain:
What does the str() output mean?
Discuss with your neighbor.
str() first, then explore further
Save all changes to version control:
Always commit your final work.
?read.csv, ?str, etc.