group_by and summarize
Joins
Less common commands
- slice_min/max
  - slice_min when n = 1
  - slice_min when n > 1
- relocate
Application problem

These functions in dplyr allow us to reshape data. See dplyr tutorial for more help.

group_by and summarize

These two command are similar to SQL’s GROUP BY or Excel’s Pivot Table.

First, the group_by command adds a note on the tibble. It does not change any of the data.

Second, the summarize command combines rows. It reads the note on the tibble set by group_by. That field will be used as the labels in my smaller tibble. summarize then performs some type of aggregation (such as sum or count).

Summarize without group_by

Both summarize and mutate create a new column. But, summarize requires that each new field have function, such as sum or sd. summarize also reduces the number of rows, one for each unique value in the group_by note.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Show the sum of dales by region
t_sum <- t %>% 
  summarise(sum_of_sales = sum(sales))

print(t_sum)

## # A tibble: 1 × 1
##   sum_of_sales
##          <dbl>
## 1           21

Summarize with group_by

While summarize is helpful, we normally want to the results grouped by a field.

Start by using group_by, which includes region in our results. Then use summarize to show which field we want to aggregate.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Show the sum of sales by region
t_region_sum <- t %>% 
  group_by(region) %>% 
  summarise(sum_of_sales = sum(sales))

print(t_region_sum)

## # A tibble: 3 × 2
##   region sum_of_sales
##   <chr>         <dbl>
## 1 a                 5
## 2 b                11
## 3 c                 5

Two or more groups

We can group by more than one field.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  country = c('USA', 'USA', 'DEN', 'USA', 'USA', 'USA'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Show the sum of dales by region
t_region_sum <- t %>% 
  group_by(country, region) %>% 
  summarise(sum_of_sales = sum(sales))

print(t_region_sum)

## # A tibble: 3 × 3
## # Groups:   country [2]
##   country region sum_of_sales
##   <chr>   <chr>         <dbl>
## 1 DEN     c                 5
## 2 USA     a                 5
## 3 USA     b                11

Functions

There are a number of useful functions.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Show the sum of sales by region
t_region_stats <- t %>% 
  group_by(region) %>% 
  summarise(n_of_rows = n(),
            sum_of_sales = sum(sales),
            median_sales = median(sales),
            mean_sales = mean(sales),
            max_sales = max(sales),
            min_sales = min(sales),
            sd_sales = sd(sales),
            n_of_distinct_sales = n_distinct(sales))

print(t_region_stats)

## # A tibble: 3 × 9
##   region n_of_rows sum_of_sales median_sales mean_sales max_sales min_sales
##   <chr>      <int>        <dbl>        <dbl>      <dbl>     <dbl>     <dbl>
## 1 a              3            5          2         1.67         3         0
## 2 b              2           11          5.5       5.5         10         1
## 3 c              1            5          5         5            5         5
## # ℹ 2 more variables: sd_sales <dbl>, n_of_distinct_sales <int>

Function NA problem

Functions will have issues if there is a NA. Use an argument to remove all NAs

library(tidyverse)

t <- tibble(
  region = c('a', 'a', 'b', 'b'),
  sales = c(0, 10, 5, NA)
)

# Show the sum of dales by region
t_region_stats <- t %>% 
  group_by(region) %>% 
  summarise(n_of_rows = n(),
            sum_of_sales = sum(sales, na.rm = TRUE),
            median_sales = median(sales, na.rm = TRUE),
            mean_sales = mean(sales, na.rm = TRUE),
            max_sales = max(sales, na.rm = TRUE),
            min_sales = min(sales, na.rm = TRUE),
            sd_sales = sd(sales, na.rm = TRUE),
            n_of_distinct_sales = n_distinct(sales, na.rm = TRUE))

print(t_region_stats)

## # A tibble: 2 × 9
##   region n_of_rows sum_of_sales median_sales mean_sales max_sales min_sales
##   <chr>      <int>        <dbl>        <dbl>      <dbl>     <dbl>     <dbl>
## 1 a              2           10            5          5        10         0
## 2 b              2            5            5          5         5         5
## # ℹ 2 more variables: sd_sales <dbl>, n_of_distinct_sales <int>

ungroup

Sometimes we want to remove a group from a tibble. You can call ungroup. In this course, we generally create a new tibble for each summary. But, in the real world, you sometimes want to avoid duplicating tibbles when working with large datasets.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Group and then ungroup
t <- t %>% 
  group_by(region) %>% 
  ungroup()

print(t)

## # A tibble: 6 × 2
##   region sales
##   <chr>  <dbl>
## 1 a          0
## 2 b         10
## 3 c          5
## 4 a          3
## 5 a          2
## 6 b          1

Joins

Joins are used to combine multiple tibbles. They are very similar to joins in SQL.

Here is a more thorough guide: Let’s talk about joins

inner_join

Inner join will return all rows that exist in both tables.

library(tidyverse)

# Create sample tibbles.
checks <- tibble(
  vendor_pk = c(2, 2, 2, 3, 5),
  amount = c(10, 15, 30, 40, 50)
)

vendors <- tibble(
  vendor_pk = c(2, 3, 4, 9),
  vendor_name = c("Bob", "Sarah", "Tim", "Andrew"),
)

# Note that we get the same results if we start with either vendor or checks
joined_table <- checks %>% 
  inner_join(vendors, by = "vendor_pk")

print(joined_table)

## # A tibble: 4 × 3
##   vendor_pk amount vendor_name
##       <dbl>  <dbl> <chr>      
## 1         2     10 Bob        
## 2         2     15 Bob        
## 3         2     30 Bob        
## 4         3     40 Sarah

left_join

Left (outer) join will return all rows from one table, and matching rows in the second table.

Order matters! When using a left join, whatever table you start with will have all rows returned. Only matching rows from the right will be returned.

library(tidyverse)

# Create sample tibbles.
checks <- tibble(
  vendor_pk = c(2, 2, 2, 3, 5),
  amount = c(10, 15, 30, 40, 50)
)

vendors <- tibble(
  vendor_pk = c(2, 3, 4, 9),
  vendor_name = c("Bob", "Sarah", "Tim", "Andrew"),
)

# Note that we get *different* results if we start with either vendor or checks
# Because we start with checks, all checks are included (and only matching vendors)
joined_table <- checks %>% 
  left_join(vendors, by = "vendor_pk")

print(joined_table)

## # A tibble: 5 × 3
##   vendor_pk amount vendor_name
##       <dbl>  <dbl> <chr>      
## 1         2     10 Bob        
## 2         2     15 Bob        
## 3         2     30 Bob        
## 4         3     40 Sarah      
## 5         5     50 <NA>

rename when fields have different names

Rename allows you to change the name of a field. This is really handy when each tibble has a different name for a field.

library(tidyverse)

# Create sample tibbles.
checks <- tibble(
  check_id = c(1, 2, 3, 4),
  check_amount = c(100, 200, 150, 250)
)

invoices <- tibble(
  invoice_key = c(10, 20, 30, 40),
  check_key = c(1, 2, 3, 4)
)

# Start with checks, rename its field to match the 
# name in the other table, and then join using the new name.
joined_data <- checks %>% 
  rename(check_key = check_id) %>% 
  inner_join(invoices, by = 'check_key')

print(joined_data)

## # A tibble: 4 × 3
##   check_key check_amount invoice_key
##       <dbl>        <dbl>       <dbl>
## 1         1          100          10
## 2         2          200          20
## 3         3          150          30
## 4         4          250          40

Less common commands

slice_min/max

We can use slice_max and slice_min to find the rows with the highest and lowest values in a field.

slice_min when n = 1

slice_min pulls the rows with the lowest value in the given field. It respects the current grouping level. When pulling a single value, it is basically identical to min(field_name). slice_max is the same, but pulls the largest item instead of the smallest.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Group and then find smallest value
t <- t %>% 
  group_by(region) %>% 
  slice_min(sales, n = 1) # same as summarize(sales = min(sales))

print(t)

## # A tibble: 3 × 2
## # Groups:   region [3]
##   region sales
##   <chr>  <dbl>
## 1 a          0
## 2 b          1
## 3 c          5

slice_min when n > 1

When given n > 1, slice will pull multiple rows.

library(tidyverse)

t <- tibble(
  region = c('a', 'b', 'c', 'a', 'a', 'b'),
  sales = c(0, 10, 5, 3, 2, 1)
)

# Group and then find the 2 smallest items from each group
t <- t %>% 
  group_by(region) %>% 
  slice_min(sales, n = 2) 

print(t)

## # A tibble: 5 × 2
## # Groups:   region [3]
##   region sales
##   <chr>  <dbl>
## 1 a          0
## 2 a          2
## 3 b          1
## 4 b         10
## 5 c          5

relocate

The relocate verb moves columns in a table. It is handy when you are looking for a table to be in a certain order.

Application problem

See problems on GitHub

dplyr2 - shaping