These functions in dplyr allow us to reshape data. See dplyr tutorial for more help.
These two command are similar to SQL’s GROUP BY
or
Excel’s Pivot Table.
First, the group_by
command adds a note on the tibble.
It does not change any of the data.
Second, the summarize
command combines rows. It reads
the note on the tibble set by group_by
. That field will be
used as the labels in my smaller tibble. summarize
then
performs some type of aggregation (such as sum or count).
Both summarize
and mutate
create a new
column. But, summarize
requires that each new field have
function, such as sum
or sd
.
summarize
also reduces the number of rows, one for each
unique value in the group_by
note.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Show the sum of dales by region
t_sum <- t %>%
summarise(sum_of_sales = sum(sales))
print(t_sum)
## # A tibble: 1 × 1
## sum_of_sales
## <dbl>
## 1 21
While summarize
is helpful, we normally want to the
results grouped by a field.
Start by using group_by
, which includes region in our
results. Then use summarize
to show which field we want to
aggregate.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Show the sum of sales by region
t_region_sum <- t %>%
group_by(region) %>%
summarise(sum_of_sales = sum(sales))
print(t_region_sum)
## # A tibble: 3 × 2
## region sum_of_sales
## <chr> <dbl>
## 1 a 5
## 2 b 11
## 3 c 5
We can group by more than one field.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
country = c('USA', 'USA', 'DEN', 'USA', 'USA', 'USA'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Show the sum of dales by region
t_region_sum <- t %>%
group_by(country, region) %>%
summarise(sum_of_sales = sum(sales))
print(t_region_sum)
## # A tibble: 3 × 3
## # Groups: country [2]
## country region sum_of_sales
## <chr> <chr> <dbl>
## 1 DEN c 5
## 2 USA a 5
## 3 USA b 11
There are a number of useful functions.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Show the sum of sales by region
t_region_stats <- t %>%
group_by(region) %>%
summarise(n_of_rows = n(),
sum_of_sales = sum(sales),
median_sales = median(sales),
mean_sales = mean(sales),
max_sales = max(sales),
min_sales = min(sales),
sd_sales = sd(sales),
n_of_distinct_sales = n_distinct(sales))
print(t_region_stats)
## # A tibble: 3 × 9
## region n_of_rows sum_of_sales median_sales mean_sales max_sales min_sales
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a 3 5 2 1.67 3 0
## 2 b 2 11 5.5 5.5 10 1
## 3 c 1 5 5 5 5 5
## # ℹ 2 more variables: sd_sales <dbl>, n_of_distinct_sales <int>
Functions will have issues if there is a NA. Use an argument to remove all NAs
library(tidyverse)
t <- tibble(
region = c('a', 'a', 'b', 'b'),
sales = c(0, 10, 5, NA)
)
# Show the sum of dales by region
t_region_stats <- t %>%
group_by(region) %>%
summarise(n_of_rows = n(),
sum_of_sales = sum(sales, na.rm = TRUE),
median_sales = median(sales, na.rm = TRUE),
mean_sales = mean(sales, na.rm = TRUE),
max_sales = max(sales, na.rm = TRUE),
min_sales = min(sales, na.rm = TRUE),
sd_sales = sd(sales, na.rm = TRUE),
n_of_distinct_sales = n_distinct(sales, na.rm = TRUE))
print(t_region_stats)
## # A tibble: 2 × 9
## region n_of_rows sum_of_sales median_sales mean_sales max_sales min_sales
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a 2 10 5 5 10 0
## 2 b 2 5 5 5 5 5
## # ℹ 2 more variables: sd_sales <dbl>, n_of_distinct_sales <int>
Sometimes we want to remove a group from a tibble. You can call
ungroup
. In this course, we generally create a new tibble
for each summary. But, in the real world, you sometimes want to avoid
duplicating tibbles when working with large datasets.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Group and then ungroup
t <- t %>%
group_by(region) %>%
ungroup()
print(t)
## # A tibble: 6 × 2
## region sales
## <chr> <dbl>
## 1 a 0
## 2 b 10
## 3 c 5
## 4 a 3
## 5 a 2
## 6 b 1
Joins are used to combine multiple tibbles. They are very similar to joins in SQL.
Here is a more thorough guide: Let’s talk about joins
Inner join will return all rows that exist in both tables.
library(tidyverse)
# Create sample tibbles.
checks <- tibble(
vendor_pk = c(2, 2, 2, 3, 5),
amount = c(10, 15, 30, 40, 50)
)
vendors <- tibble(
vendor_pk = c(2, 3, 4, 9),
vendor_name = c("Bob", "Sarah", "Tim", "Andrew"),
)
# Note that we get the same results if we start with either vendor or checks
joined_table <- checks %>%
inner_join(vendors, by = "vendor_pk")
print(joined_table)
## # A tibble: 4 × 3
## vendor_pk amount vendor_name
## <dbl> <dbl> <chr>
## 1 2 10 Bob
## 2 2 15 Bob
## 3 2 30 Bob
## 4 3 40 Sarah
Left (outer) join will return all rows from one table, and matching rows in the second table.
Order matters! When using a left join, whatever table you start with will have all rows returned. Only matching rows from the right will be returned.
library(tidyverse)
# Create sample tibbles.
checks <- tibble(
vendor_pk = c(2, 2, 2, 3, 5),
amount = c(10, 15, 30, 40, 50)
)
vendors <- tibble(
vendor_pk = c(2, 3, 4, 9),
vendor_name = c("Bob", "Sarah", "Tim", "Andrew"),
)
# Note that we get *different* results if we start with either vendor or checks
# Because we start with checks, all checks are included (and only matching vendors)
joined_table <- checks %>%
left_join(vendors, by = "vendor_pk")
print(joined_table)
## # A tibble: 5 × 3
## vendor_pk amount vendor_name
## <dbl> <dbl> <chr>
## 1 2 10 Bob
## 2 2 15 Bob
## 3 2 30 Bob
## 4 3 40 Sarah
## 5 5 50 <NA>
Rename allows you to change the name of a field. This is really handy when each tibble has a different name for a field.
library(tidyverse)
# Create sample tibbles.
checks <- tibble(
check_id = c(1, 2, 3, 4),
check_amount = c(100, 200, 150, 250)
)
invoices <- tibble(
invoice_key = c(10, 20, 30, 40),
check_key = c(1, 2, 3, 4)
)
# Start with checks, rename its field to match the
# name in the other table, and then join using the new name.
joined_data <- checks %>%
rename(check_key = check_id) %>%
inner_join(invoices, by = 'check_key')
print(joined_data)
## # A tibble: 4 × 3
## check_key check_amount invoice_key
## <dbl> <dbl> <dbl>
## 1 1 100 10
## 2 2 200 20
## 3 3 150 30
## 4 4 250 40
We can use slice_max and slice_min to find the rows with the highest and lowest values in a field.
slice_min
pulls the rows with the lowest value in the
given field. It respects the current grouping level. When pulling a
single value, it is basically identical to min(field_name)
.
slice_max
is the same, but pulls the largest item instead
of the smallest.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Group and then find smallest value
t <- t %>%
group_by(region) %>%
slice_min(sales, n = 1) # same as summarize(sales = min(sales))
print(t)
## # A tibble: 3 × 2
## # Groups: region [3]
## region sales
## <chr> <dbl>
## 1 a 0
## 2 b 1
## 3 c 5
When given n > 1
, slice will pull multiple rows.
library(tidyverse)
t <- tibble(
region = c('a', 'b', 'c', 'a', 'a', 'b'),
sales = c(0, 10, 5, 3, 2, 1)
)
# Group and then find the 2 smallest items from each group
t <- t %>%
group_by(region) %>%
slice_min(sales, n = 2)
print(t)
## # A tibble: 5 × 2
## # Groups: region [3]
## region sales
## <chr> <dbl>
## 1 a 0
## 2 a 2
## 3 b 1
## 4 b 10
## 5 c 5
The relocate verb moves columns in a table. It is handy when you are looking for a table to be in a certain order.