dplyr is an important tool for cleaning and transforming data. If you need further help, see the dplyr tutorial.
These notes correspond to Datacamp’s Introduction to the tidyverse.
Data.frames are the older structure, and tibbles are their newer version. Tibbles never change data types automatically. A data.frame sometimes guesses (before R 4.0, for example, it converted strings to factors by default), but a tibble keeps incoming data exactly as given. If you want a factor, you must create it yourself. Tibbles also print better: they show only the first rows, along with the type of each column.
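As a minimal sketch of creating a factor manually, the example below builds a tibble with one plain string column and one column wrapped in factor(). The tibble name pets and its values are made up for illustration.

```r
library(tidyverse)

# A tibble keeps strings as character vectors.
# Wrap a column in factor() when you want a factor.
pets <- tibble(
  name = c("Rex", "Tom", "Ace"),
  species = factor(c("dog", "cat", "dog"))
)

# Check what each column became
class(pets$name)
class(pets$species)
levels(pets$species)
```

The name column stays a character vector, while species becomes a factor only because we asked for it.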
If you’ve used data.frames, you may have used the approach below for subsetting. The syntax for filtering rows/columns is pretty twitchy. dplyr is easier.
library(tidyverse)
# Create a data.frame
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c("a", "b", "c", "d", "e"))
# Only include values with x>3
# Note that we use [rows_to_include, columns_to_include]
df[df$x > 3, ]
# Subset a specific column. Single brackets return a data.frame (a list),
# while $ returns the column itself as a vector.
typeof(df[1])
typeof(df$x)
is.vector(df$x)
class(df[1])
You will eventually load tibble data from a file, but for now I will usually define the tibbles directly in these notes.
library(tidyverse)
# a tibble created from vectors.
# Note that the vectors must either have the same length,
# or a length of one (which is automatically recycled to fill every row)
t <- tibble(
x = c(1, 2, 3, 4),
y = c(5, 6, 7, 9),
z = c("?")
)
# You can subset a tibble and get a tibble as a result
t["x"]
t[c("x", "y")]
# Getting a vector requires double brackets.
t[["x"]]
# You can verify the output by using the class function.
class(t[["x"]])
Select allows us to choose columns. It helps when we load a large dataset with more columns than we need.
Pick columns by passing their field names, with each separated by a comma.
library(tidyverse)
t <- tibble(
names = c("Bob", "Sarah", "Tim", "John"),
ages = c(23, 61, 17, 9),
id = c(1, 2, 3, 4)
)
t_names <- t %>%
select(names, ages)
print(t_names)
## # A tibble: 4 × 2
## names ages
## <chr> <dbl>
## 1 Bob 23
## 2 Sarah 61
## 3 Tim 17
## 4 John 9
Remove specific columns by placing a - in front of each.
library(tidyverse)
t <- tibble(
names = c("Bob", "Sarah", "Tim", "John"),
ages = c(23, 61, 17, 9),
id = c(1, 2, 3, 4),
null_field = c(NA, NA, NA, NA)
)
t_no_id_or_null <- t %>%
select(-id, -null_field)
print(t_no_id_or_null)
## # A tibble: 4 × 2
## names ages
## <chr> <dbl>
## 1 Bob 23
## 2 Sarah 61
## 3 Tim 17
## 4 John 9
You can select a contiguous range of columns with start_column:end_column. Both end columns are included.
library(tidyverse)
t <- tibble(
extra = 1:4,
name = c("Bob", "Sarah", "Tim", "John"),
ages = c(23, 61, 17, 9),
id = c(1, 2, 3, 4),
null_field = c(NA, NA, NA, NA)
)
t_name_to_id <- t %>%
select(name:id)
print(t_name_to_id)
## # A tibble: 4 × 3
## name ages id
## <chr> <dbl> <dbl>
## 1 Bob 23 1
## 2 Sarah 61 2
## 3 Tim 17 3
## 4 John 9 4
You can also use the where() function to pick only fields meeting a condition.
library(tidyverse)
t <- tibble(
extra = 1:4,
name = c("Bob", "Sarah", "Tim", "John"),
ages = c(23, 61, 17, 9),
id = c(1, 2, 3, 4),
null_field = c(NA, NA, NA, NA)
)
t_numbers <- t %>%
select(where(is.numeric))
print(t_numbers)
## # A tibble: 4 × 3
## extra ages id
## <int> <dbl> <dbl>
## 1 1 23 1
## 2 2 61 2
## 3 3 17 3
## 4 4 9 4
We can use helper functions to select fields:
starts_with()
ends_with()
contains()
matches()
See the docs for other options.
library(tidyverse)
t <- tibble(
id = c(1, 2, 3),
population_start = c(100, 200, 300),
population_middle = c(110, 150, 200),
population_end = c(120, 80, 20),
not_needed_field = c(NA, "hey", "umn")
)
t_smaller <- t %>%
select(starts_with('population'), id)
print(t_smaller)
## # A tibble: 3 × 4
## population_start population_middle population_end id
## <dbl> <dbl> <dbl> <dbl>
## 1 100 110 120 1
## 2 200 150 80 2
## 3 300 200 20 3
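The other helpers work the same way. The sketch below reuses the population columns from the example above and tries ends_with(), contains(), and matches(). The regular expression passed to matches() is just one illustrative pattern.

```r
library(tidyverse)

t <- tibble(
  id = c(1, 2, 3),
  population_start = c(100, 200, 300),
  population_middle = c(110, 150, 200),
  population_end = c(120, 80, 20)
)

# ends_with(): columns whose name ends with "_end"
t_end <- t %>%
  select(ends_with("_end"))

# contains(): columns whose name contains the literal text "middle"
t_middle <- t %>%
  select(contains("middle"))

# matches(): columns whose name matches a regular expression
t_start_or_end <- t %>%
  select(matches("_(start|end)$"))

names(t_end)
names(t_middle)
names(t_start_or_end)
```

Use contains() for plain text and matches() when you need a pattern.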
Filter keeps the rows (observations) that pass a logical test and removes the rest. It uses the symbols:
& as and
| as or
! as not
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))
t_men_only <- t %>%
filter(sex == "man")
print(t_men_only)
## # A tibble: 3 × 1
## sex
## <chr>
## 1 man
## 2 man
## 3 man
You can require several logical tests to all be true in three different ways:
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(10, 20, 30, 40, 50))
# Use one of the following:
# Option 1: Use & between each test
t_men_over_30 <- t %>%
filter(sex == "man" & age > 30)
# Option 2: Separate each test with a comma
t_men_over_30 <- t %>%
filter(sex == "man", age > 30)
# Option 3: Use a separate filter for each logical test
t_men_over_30 <- t %>%
filter(sex == "man") %>%
filter(age > 30)
print(t_men_over_30)
## # A tibble: 1 × 2
## sex age
## <chr> <dbl>
## 1 man 50
The | symbol is used when we want to satisfy at least one of several logical tests.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(10, 20, 30, 40, 50))
t_men_or_those_over_30 <- t %>%
filter(sex == "man" | age > 30)
print(t_men_or_those_over_30)
## # A tibble: 4 × 2
## sex age
## <chr> <dbl>
## 1 man 10
## 2 man 30
## 3 woman 40
## 4 man 50
Use the ! symbol to negate a test, keeping the rows where the test is not true.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(10, 20, 30, 40, 50))
t_not_age_30 <- t %>%
filter(!age == 30)
print(t_not_age_30)
## # A tibble: 4 × 2
## sex age
## <chr> <dbl>
## 1 man 10
## 2 woman 20
## 3 woman 40
## 4 man 50
Test for NA values with is.na(). You can remove NA values by filtering for !is.na(column_name).
NA means "unknown", so comparing anything to NA with == gives NA rather than TRUE or FALSE. A test for column_name == NA is therefore never TRUE, and filter() returns no rows.
library(tidyverse)
t <- tibble(sex = c("man", "woman", NA))
t_not_na <- t %>%
filter(!is.na(sex))
print(t_not_na)
## # A tibble: 2 × 1
## sex
## <chr>
## 1 man
## 2 woman
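A small sketch of why == NA never matches, using the same tibble as above. The names t_wrong and t_only_na are made up for illustration.

```r
library(tidyverse)

t <- tibble(sex = c("man", "woman", NA))

# Comparing with == gives NA, not TRUE or FALSE
NA == NA

# filter() keeps only rows where the test is TRUE,
# so this returns zero rows, even though one value is missing
t_wrong <- t %>%
  filter(sex == NA)
nrow(t_wrong)

# is.na() is the test that actually finds the missing row
t_only_na <- t %>%
  filter(is.na(sex))
nrow(t_only_na)
```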
We can use the %in% operator to keep rows whose value appears in a vector.
library(tidyverse)
t <- tibble(tool = c("driver", "saw", "nail", "hammer"))
t2 <- t %>%
filter(tool %in% c("nail", "hammer"))
print(t2)
## # A tibble: 2 × 1
## tool
## <chr>
## 1 nail
## 2 hammer
Mutate adds a new column or changes an existing one.
library(tidyverse)
# population in millions
cities <- tibble(population = c(10, 20, 15, 2, 3))
# Change population from millions to thousands
cities <- cities %>%
mutate(population = population * 1000)
print(cities)
## # A tibble: 5 × 1
## population
## <dbl>
## 1 10000
## 2 20000
## 3 15000
## 4 2000
## 5 3000
We can change a value into a percentage (or ratio) by using sum() and then dividing a field by that value.
library(tidyverse)
# population in millions
cities <- tibble(population = c(10, 20, 15, 2, 3))
total_population <- sum(cities$population)
# Change each population into its share of the total
cities <- cities %>%
mutate(population = population / total_population)
print(cities)
## # A tibble: 5 × 1
## population
## <dbl>
## 1 0.2
## 2 0.4
## 3 0.3
## 4 0.04
## 5 0.06
A logical test inside mutate creates a TRUE/FALSE column.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))
t_is_man <- t %>%
mutate(is_man = sex == "man")
print(t_is_man)
## # A tibble: 5 × 2
## sex is_man
## <chr> <lgl>
## 1 man TRUE
## 2 woman FALSE
## 3 man TRUE
## 4 woman FALSE
## 5 man TRUE
We typically use ifelse() to create new columns. The first argument is a logical test. If the test is TRUE, the function returns its second argument. If the test is FALSE, it returns the third argument.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))
# Function:
# ifelse(*logical_test*, *result_if_true*, *result_if_false*)
#
# Below gives a 1 for men and a 0 for everyone else
#
# Note that we end with two parentheses: one closes ifelse() and one closes mutate().
t_with_is_man <- t %>%
mutate(is_man01 = ifelse(sex == "man", 1, 0) )
print(t_with_is_man)
## # A tibble: 5 × 2
## sex is_man01
## <chr> <dbl>
## 1 man 1
## 2 woman 0
## 3 man 1
## 4 woman 0
## 5 man 1
This example shows ifelse() returning a string.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))
t_with_m <- t %>%
mutate(s = ifelse(sex == "man", "m", "w") )
print(t_with_m)
## # A tibble: 5 × 2
## sex s
## <chr> <chr>
## 1 man m
## 2 woman w
## 3 man m
## 4 woman w
## 5 man m
We can add multiple columns with a single mutate. Below adds two new columns and then modifies the sex field.
library(tidyverse)
t <- tibble(sex = c("Man", "Woman", "MAN", "wOMan", "man"))
t2 <- t %>%
mutate(sex_capitalized = str_to_upper(sex),
sex_lower = str_to_lower(sex),
sex = str_to_title(sex))
print(t2)
## # A tibble: 5 × 3
## sex sex_capitalized sex_lower
## <chr> <chr> <chr>
## 1 Man MAN man
## 2 Woman WOMAN woman
## 3 Man MAN man
## 4 Woman WOMAN woman
## 5 Man MAN man
The arrange() function sorts data.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(20, 30, 40, 30, 50))
t_sorted <- t %>%
arrange(sex)
print(t_sorted)
## # A tibble: 5 × 2
## sex age
## <chr> <dbl>
## 1 man 20
## 2 man 40
## 3 man 50
## 4 woman 30
## 5 woman 30
Pass several fields to sort by the first one and then break ties with the next.
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(20, 30, 40, 30, 50))
t_sorted <- t %>%
arrange(sex, age)
print(t_sorted)
## # A tibble: 5 × 2
## sex age
## <chr> <dbl>
## 1 man 20
## 2 man 40
## 3 man 50
## 4 woman 30
## 5 woman 30
Wrap a field with desc(field_name) to reverse sort (z-a).
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
age = c(20, 30, 40, 30, 50))
t_sorted <- t %>%
arrange(desc(sex), desc(age))
print(t_sorted)
## # A tibble: 5 × 2
## sex age
## <chr> <dbl>
## 1 woman 30
## 2 woman 30
## 3 man 50
## 4 man 40
## 5 man 20
Rename allows you to change the name of a field. This is really handy when dealing with badly-named fields containing spaces.
Note that we use the backtick (`) for naming fields with spaces. These are different from 'single quotes' or "double quotes".
library(tidyverse)
# Create sample tibbles.
checks <- tibble(
`Item Name` = c("Cheese", "Bread", "Milk", "Mustard", "Milk"))
checks <- checks %>%
rename(item_name = `Item Name`)
print(checks)
## # A tibble: 5 × 1
## item_name
## <chr>
## 1 Cheese
## 2 Bread
## 3 Milk
## 4 Mustard
## 5 Milk
The relocate verb moves columns in a table. It is handy when you are looking for a table to be in a certain order.
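A minimal sketch of relocate(), reusing the names/ages/id tibble from the select examples. The result names t_id_first and t_names_last are made up for illustration.

```r
library(tidyverse)

t <- tibble(
  names = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23, 61, 17, 9),
  id = c(1, 2, 3, 4)
)

# With no other arguments, relocate() moves the column to the front
t_id_first <- t %>%
  relocate(id)

# .after (or .before) places the column relative to another one
t_names_last <- t %>%
  relocate(names, .after = id)

names(t_id_first)
names(t_names_last)
```

Unlike select(), relocate() keeps every column and only changes the order.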
We can put these all together using the %>% (pipe) operator. It passes the result of one step into the next, so we can chain several steps without retyping intermediate results.
Note that all dplyr functions take the tibble as the first argument. Rather than nesting calls or using intermediate variables, %>% lets us write the steps in the order they happen.
x %>% f(y) turns into f(x, y)
Read left-to-right, top-to-bottom.
In RStudio, you can type %>% with Control+Shift+M (Command+Shift+M on a Mac).
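To see the difference, the sketch below runs the same three steps twice: once as nested calls and once with the pipe. The tibble and the cutoff of 18 are made up for illustration.

```r
library(tidyverse)

t <- tibble(
  names = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23, 61, 17, 9),
  id = c(1, 2, 3, 4)
)

# Nested version: has to be read from the inside out
adults_nested <- arrange(select(filter(t, ages >= 18), names, ages), desc(ages))

# Piped version: same steps, read top-to-bottom
adults_piped <- t %>%
  filter(ages >= 18) %>%
  select(names, ages) %>%
  arrange(desc(ages))

print(adults_piped)

# Both versions give the same result
all.equal(adults_nested, adults_piped)
```

Both versions keep the adults, drop the id column, and sort from oldest to youngest. The piped version lists the steps in the order they run.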
The book “R for Data Science” is very good. Below are several key chapters: