dplyr is an important tool for cleaning and transforming data. If you need further help, see the dplyr tutorial.

These notes correspond to DataCamp’s Introduction to the Tidyverse.

Why tibbles?

Data.frames are an older technology, and tibbles are their newer version. Tibbles do not automatically change data types. A data.frame guesses at times (for example, older versions of R converted strings to factors by default), but a tibble leaves incoming data as it is. For example, a factor must be created manually. Tibbles are also better at printing: they show only the first rows, along with each column’s type.
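A minimal sketch of the difference, assuming the tidyverse is installed. The names df and tb are only for illustration. Selecting one column with [ , ] silently turns a data.frame into a vector, while a tibble stays a tibble, and a factor appears only when we ask for one.

```r
library(tidyverse)

# The same data as a data.frame and as a tibble (illustrative names).
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# A data.frame drops to a plain vector when one column is selected...
class(df[, "x"])
## [1] "integer"

# ...but a tibble stays a tibble.
class(tb[, "x"])
## [1] "tbl_df"     "tbl"        "data.frame"

# Strings stay character until we create the factor ourselves.
tb$y <- factor(tb$y)
class(tb$y)
## [1] "factor"
```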

Converting from data.frame to tibble

If you’ve used data.frames, you may have used the approach below for subsetting. The syntax for filtering rows/columns is pretty twitchy. dplyr is easier.

library(tidyverse)

# Create a data.frame
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c("a", "b", "c", "d", "e"))

# Only include values with x>3
# Note that we use [rows_to_include, columns_to_include]
df[df$x > 3, ]

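# Subsetting a single column returns different types depending on the syntax.
typeof(df[1])    # "list": single brackets keep the data.frame
typeof(df$x)     # "double": $ extracts the underlying vector
is.vector(df$x)  # TRUE
class(df[1])     # "data.frame"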
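As a rough sketch of the conversion itself, as_tibble() turns an existing data.frame into a tibble, after which the dplyr verbs covered later in these notes can replace the bracket syntax. The names tb and tb_over_3 are only illustrative.

```r
library(tidyverse)

df <- data.frame(x = c(1, 2, 3, 4, 5), y = c("a", "b", "c", "d", "e"))

# Convert the data.frame to a tibble
tb <- as_tibble(df)
class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"

# The same "x > 3" subset as df[df$x > 3, ], written with dplyr
tb_over_3 <- tb %>%
  filter(x > 3)

print(tb_over_3)
## # A tibble: 2 × 2
##       x y    
##   <dbl> <chr>
## 1     4 d    
## 2     5 e
```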

Creating a tibble

You will eventually load tibble data from a file, but for now I will usually create the tibbles directly in the code.

library(tidyverse)

# A tibble created from vectors.
# Note that each vector must either have the same length
# or a length of one (length-one vectors are automatically recycled).
t <- tibble(
  x = c(1, 2, 3, 4),
  y = c(5,  6,  7,  9),
  z = c("?")
)

# You can subset a tibble and get a tibble as a result
t["x"]
t[c("x", "y")]

# Getting a vector requires double brackets (or the $ operator, as in t$x).
t[["x"]]

# You can verify the output by using the class function.
class(t[["x"]])

dplyr commands

select

Select allows us to choose columns. It helps when we load a large dataset with more columns than we need.

Columns by name

Pick columns by passing their field names, with each separated by a comma.

library(tidyverse)
t <- tibble(
  names = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23,  61,  17,  9),
  id = c(1, 2, 3, 4)
)

t_names <- t %>% 
  select(names, ages)

print(t_names)
## # A tibble: 4 × 2
##   names  ages
##   <chr> <dbl>
## 1 Bob      23
## 2 Sarah    61
## 3 Tim      17
## 4 John      9

Remove columns by name

Remove specific columns by placing a - in front of each.

library(tidyverse)

t <- tibble(
  names = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23,  61,  17,  9),
  id = c(1, 2, 3, 4),
  null_field = c(NA, NA, NA, NA)
)

t_no_id_or_null <- t %>% 
  select(-id, -null_field)

print(t_no_id_or_null)
## # A tibble: 4 × 2
##   names  ages
##   <chr> <dbl>
## 1 Bob      23
## 2 Sarah    61
## 3 Tim      17
## 4 John      9

Select a range of columns

You can select a range of adjacent columns with start_column:end_column. Both the start and end columns are included.

library(tidyverse)

t <- tibble(
  extra = 1:4,
  name = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23,  61,  17,  9),
  id = c(1, 2, 3, 4),
  null_field = c(NA, NA, NA, NA)
)

t_name_to_id <- t %>% 
  select(name:id)

print(t_name_to_id)
## # A tibble: 4 × 3
##   name   ages    id
##   <chr> <dbl> <dbl>
## 1 Bob      23     1
## 2 Sarah    61     2
## 3 Tim      17     3
## 4 John      9     4

All numeric columns

You can also use the where() function to pick only the fields meeting a condition.

library(tidyverse)

t <- tibble(
  extra = 1:4,
  name = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23,  61,  17,  9),
  id = c(1, 2, 3, 4),
  null_field = c(NA, NA, NA, NA)
)

t_numbers <- t %>% 
  select(where(is.numeric))

print(t_numbers)
## # A tibble: 4 × 3
##   extra  ages    id
##   <int> <dbl> <dbl>
## 1     1    23     1
## 2     2    61     2
## 3     3    17     3
## 4     4     9     4

Helper functions

We can use helper functions to select fields:

  • starts_with()
  • ends_with()
  • contains()
  • matches()

See the docs for other options.

library(tidyverse)

t <- tibble(
  id = c(1, 2, 3),
  population_start = c(100, 200, 300),
  population_middle = c(110, 150, 200),
  population_end = c(120, 80, 20),
  not_needed_field = c(NA, "hey", "umn")
)

t_smaller <- t %>% 
  select(starts_with('population'), id)

print(t_smaller)
## # A tibble: 3 × 4
##   population_start population_middle population_end    id
##              <dbl>             <dbl>          <dbl> <dbl>
## 1              100               110            120     1
## 2              200               150             80     2
## 3              300               200             20     3
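The other helpers work the same way. Below is a small sketch using the same tibble. The result names (t_end, t_middle, t_start_end) and the regular expression are only illustrative choices.

```r
library(tidyverse)

t <- tibble(
  id = c(1, 2, 3),
  population_start = c(100, 200, 300),
  population_middle = c(110, 150, 200),
  population_end = c(120, 80, 20),
  not_needed_field = c(NA, "hey", "umn")
)

# ends_with(): names ending in "_end"
t_end <- t %>% 
  select(ends_with("_end"))
names(t_end)
## [1] "population_end"

# contains(): names containing the literal text "middle"
t_middle <- t %>% 
  select(contains("middle"))
names(t_middle)
## [1] "population_middle"

# matches(): names matching a regular expression
t_start_end <- t %>% 
  select(matches("^population_(start|end)$"))
names(t_start_end)
## [1] "population_start" "population_end"
```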

filter

Keeps the rows (observations) that pass a logical test and removes the rest. Uses the symbols:

  • & as and
  • | as or
  • ! as not

Basic example

library(tidyverse)

t <- tibble(sex = c("man", "woman", "man", "woman", "man"))

t_men_only <- t %>%
  filter(sex == "man")

print(t_men_only)
## # A tibble: 3 × 1
##   sex  
##   <chr>
## 1 man  
## 2 man  
## 3 man

And

You can combine multiple logical tests in 3 different ways. All three give the same result.

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(10, 20, 30, 40, 50))


# Use one of the following:

# Option 1: Use & between each test
t_men_over_30 <- t %>%
  filter(sex == "man" & age > 30)

# Option 2: Separate each test with a comma
t_men_over_30 <- t %>%
  filter(sex == "man",  age > 30)

# Option 3: Use a separate filter for each logical test
t_men_over_30 <- t %>%
  filter(sex == "man") %>%
  filter(age > 30)

print(t_men_over_30)
## # A tibble: 1 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      50

Or

The | symbol is used when we want to satisfy at least one of several logical tests.

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(10, 20, 30, 40, 50))

t_men_or_those_over_30 <- t %>%
  filter(sex == "man" | age > 30)

print(t_men_or_those_over_30)
## # A tibble: 4 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      10
## 2 man      30
## 3 woman    40
## 4 man      50

Not

Use the ! symbol to negate a logical test. It keeps the rows where the test is FALSE.

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(10, 20, 30, 40, 50))

t_not_age_30 <- t %>%
  filter(!age == 30)

print(t_not_age_30)
## # A tibble: 4 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      10
## 2 woman    20
## 3 woman    40
## 4 man      50
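Two variations on this example, as a short sketch: != is the usual shorthand for a negated ==, and ! can also wrap a larger test in parentheses. The name t_not_men_over_30 is only illustrative.

```r
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(10, 20, 30, 40, 50))

# != keeps the same rows as !age == 30
t_not_age_30 <- t %>%
  filter(age != 30)

# Negate a combined test: everyone who is NOT a man over 30
t_not_men_over_30 <- t %>%
  filter(!(sex == "man" & age > 30))

print(t_not_men_over_30)
## # A tibble: 4 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      10
## 2 woman    20
## 3 man      30
## 4 woman    40
```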

Test for NA values

Test for NA values with is.na(). You can remove NA values by filtering for !is.na(column_name).

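NA means "unknown", so comparing anything with NA returns NA rather than TRUE or FALSE. Because filter only keeps rows where the test is TRUE, testing for column_name == NA will never keep any rows.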

library(tidyverse)
t <- tibble(sex = c("man", "woman", NA))

t_not_na <- t %>%
  filter(!is.na(sex))

print(t_not_na)
## # A tibble: 2 × 1
##   sex  
##   <chr>
## 1 man  
## 2 woman
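To see why == does not work here, the sketch below compares the column with NA directly and then uses that comparison inside filter. The names t_equals_na and t_only_na are only illustrative.

```r
library(tidyverse)
t <- tibble(sex = c("man", "woman", NA))

# Comparing with NA gives NA for every row, never TRUE
t$sex == NA
## [1] NA NA NA

# filter() keeps only rows where the test is TRUE, so nothing is kept
t_equals_na <- t %>%
  filter(sex == NA)
nrow(t_equals_na)
## [1] 0

# is.na() finds the missing value
t_only_na <- t %>%
  filter(is.na(sex))
nrow(t_only_na)
## [1] 1
```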

Filter by list

We can use the %in% operator to keep rows whose value appears in a vector.

library(tidyverse)
t <- tibble(tool = c("driver", "saw", "nail", "hammer"))

t2 <- t %>%
  filter(tool %in% c("nail", "hammer"))

print(t2)
## # A tibble: 2 × 1
##   tool  
##   <chr> 
## 1 nail  
## 2 hammer
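The opposite filter, keeping values that are not in the vector, can be sketched by wrapping the %in% test in !( ). The name t_not_in is only illustrative.

```r
library(tidyverse)
t <- tibble(tool = c("driver", "saw", "nail", "hammer"))

# Keep the rows whose tool is NOT in the vector
t_not_in <- t %>%
  filter(!(tool %in% c("nail", "hammer")))

print(t_not_in)
## # A tibble: 2 × 1
##   tool  
##   <chr> 
## 1 driver
## 2 saw
```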

mutate

Adds a new column or changes an existing one.

Change values in a column

library(tidyverse)
# population in millions
cities <- tibble(population = c(10, 20, 15, 2, 3))

# Change population from millions to thousands
cities <- cities %>% 
  mutate(population = population * 1000)

print(cities)
## # A tibble: 5 × 1
##   population
##        <dbl>
## 1      10000
## 2      20000
## 3      15000
## 4       2000
## 5       3000

Change value to ratio

We can change a value into a ratio (a share of the total) by using sum() to compute the total and then dividing the field by that value. Multiply by 100 if you want a percentage.

library(tidyverse)
# population in millions
cities <- tibble(population = c(10, 20, 15, 2, 3))

total_population <- sum(cities$population)

# Change population from millions to a share of the total
cities <- cities %>% 
  mutate(population = population / total_population)

print(cities)
## # A tibble: 5 × 1
##   population
##        <dbl>
## 1       0.2 
## 2       0.4 
## 3       0.3 
## 4       0.04
## 5       0.06
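The total does not need its own variable. As a sketch, sum() can be called inside mutate, where it sees the whole column. The column names share and percent are illustrative additions, not part of the original example.

```r
library(tidyverse)
# population in millions
cities <- tibble(population = c(10, 20, 15, 2, 3))

# sum(population) is computed over the whole column inside mutate().
# A column created in a mutate() can be used later in the same call.
cities <- cities %>% 
  mutate(share = population / sum(population),
         percent = share * 100)

print(cities)
```

The share column should match the ratios printed above, and percent should be those same values multiplied by 100.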

New Boolean column

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))

t_is_man <- t %>% 
  mutate(is_man = sex == "man")

print(t_is_man)
## # A tibble: 5 × 2
##   sex   is_man
##   <chr> <lgl> 
## 1 man   TRUE  
## 2 woman FALSE 
## 3 man   TRUE  
## 4 woman FALSE 
## 5 man   TRUE

ifelse returning 1/0

We typically use ifelse() to create new columns. The first argument will be a logical test. If the test is TRUE, the function will return its second argument. If the test is FALSE, it will return the third argument.

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))

# Function:
#    ifelse(*logical_test*, *result_if_true*, *result_if_false*)
# 
# Below gives a 1 for males, and a 0 for all others
# 
# Note that we have to end with two parentheses, one for the ifelse and one for the mutate.
t_with_is_man <- t %>% 
  mutate(is_man01 = ifelse(sex == "man", 1, 0) )

print(t_with_is_man)
## # A tibble: 5 × 2
##   sex   is_man01
##   <chr>    <dbl>
## 1 man          1
## 2 woman        0
## 3 man          1
## 4 woman        0
## 5 man          1

ifelse returning string

This example shows ifelse() returning a string.

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"))

t_with_m <- t %>% 
  mutate(s = ifelse(sex == "man", "m", "w") )

print(t_with_m)
## # A tibble: 5 × 2
##   sex   s    
##   <chr> <chr>
## 1 man   m    
## 2 woman w    
## 3 man   m    
## 4 woman w    
## 5 man   m

Multiple new columns

We can add multiple columns with a single mutate. Below adds two new columns and then modifies the sex field.

library(tidyverse)
t <- tibble(sex = c("Man", "Woman", "MAN", "wOMan", "man"))

t2 <- t %>% 
  mutate(sex_capitalized = str_to_upper(sex),
         sex_lower = str_to_lower(sex),
         sex = str_to_title(sex))

print(t2)
## # A tibble: 5 × 3
##   sex   sex_capitalized sex_lower
##   <chr> <chr>           <chr>    
## 1 Man   MAN             man      
## 2 Woman WOMAN           woman    
## 3 Man   MAN             man      
## 4 Woman WOMAN           woman    
## 5 Man   MAN             man

arrange

The arrange function sorts data. It sorts in ascending order by default.

Ascending

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(20, 30, 40, 30, 50))

t_sorted <- t %>% 
  arrange(sex)

print(t_sorted)
## # A tibble: 5 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      20
## 2 man      40
## 3 man      50
## 4 woman    30
## 5 woman    30

Ascending with two or more columns

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(20, 30, 40, 30, 50))

t_sorted <- t %>% 
  arrange(sex, age)

print(t_sorted)
## # A tibble: 5 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      20
## 2 man      40
## 3 man      50
## 4 woman    30
## 5 woman    30

Descending order

Wrap a field with desc(field_name) to reverse sort (z-a).

library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(20, 30, 40, 30, 50))

t_sorted <- t %>% 
  arrange(desc(sex), desc(age))

print(t_sorted)
## # A tibble: 5 × 2
##   sex     age
##   <chr> <dbl>
## 1 woman    30
## 2 woman    30
## 3 man      50
## 4 man      40
## 5 man      20
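Ascending and descending sorts can be mixed in one call. A short sketch with the same data, sorting sex a-z and then age from highest to lowest within each sex:

```r
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(20, 30, 40, 30, 50))

# sex ascending, then age descending within each sex
t_sorted <- t %>% 
  arrange(sex, desc(age))

print(t_sorted)
## # A tibble: 5 × 2
##   sex     age
##   <chr> <dbl>
## 1 man      50
## 2 man      40
## 3 man      20
## 4 woman    30
## 5 woman    30
```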

Less commonly used dplyr commands

rename

Rename allows you to change the name of a field, using the form new_name = old_name. This is really handy when dealing with badly-named fields containing spaces.

Note that we use the backtick for naming fields with spaces. These are different from 'single quotes' or "double quotes".

library(tidyverse)

# Create a sample tibble.
checks <- tibble(
  `Item Name` = c("Cheese", "Bread", "Milk", "Mustard", "Milk"))

checks <- checks %>% 
  rename(item_name = `Item Name`)

print(checks)
## # A tibble: 5 × 1
##   item_name
##   <chr>    
## 1 Cheese   
## 2 Bread    
## 3 Milk     
## 4 Mustard  
## 5 Milk

relocate

The relocate verb moves columns within a table. It is handy when you want the columns to be in a certain order.
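A minimal sketch of relocate, reusing the names/ages/id tibble from the select examples. The result names t_id_first and t_ages_last are only illustrative.

```r
library(tidyverse)

t <- tibble(
  names = c("Bob", "Sarah", "Tim", "John"),
  ages = c(23,  61,  17,  9),
  id = c(1, 2, 3, 4)
)

# With no position given, relocate() moves the column to the front
t_id_first <- t %>% 
  relocate(id)
names(t_id_first)
## [1] "id"    "names" "ages"

# .after (or .before) places the column relative to another one
t_ages_last <- t %>% 
  relocate(ages, .after = id)
names(t_ages_last)
## [1] "names" "id"    "ages"
```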

pipe operator

We can put these all together using the %>% (pipe) operator. It chains multiple steps together, passing the result of each step to the next, without needing to re-type stuff.

Note that all dplyr functions take the tibble as the first argument. Rather than nesting calls or using intermediate variables, %>% allows us to write the steps in the order they happen.

x %>% f(y) turns into f(x, y)

Read left-to-right, top-to-bottom.

In RStudio, you can type this with Control+Shift+M (Command+Shift+M on a Mac).
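To make the comparison concrete, the sketch below writes the same three steps twice, first as nested calls and then with the pipe. The names nested and piped are only illustrative.

```r
library(tidyverse)
t <- tibble(sex = c("man", "woman", "man", "woman", "man"),
            age = c(20, 30, 40, 30, 50))

# Nested calls: read from the inside out
nested <- arrange(select(filter(t, sex == "man"), age), desc(age))

# Piped: read left-to-right, top-to-bottom
piped <- t %>% 
  filter(sex == "man") %>% 
  select(age) %>% 
  arrange(desc(age))

# Both versions give the same tibble
all.equal(nested, piped)
## [1] TRUE

print(piped)
## # A tibble: 3 × 1
##     age
##   <dbl>
## 1    50
## 2    40
## 3    20
```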

References

The book “R for Data Science” is very good. Below are several key chapters:

Application problem

See problems on GitHub