course_model

Pandas Grouping and Joining

This tutorial shows how to group and join data using the Pandas library in Python.

Outcomes:

Group data using aggregation and transformation
Join data from multiple DataFrames

Links:

template

Grouping Data

We can group data by one or more columns and then perform aggregate functions on the grouped data. This is done using the groupby() and agg() functions.

In this class, we will always use these two functions together in the same call. This is because groupby() does not return a DataFrame, but rather a GroupBy object that requires an aggregation function to produce a DataFrame.

Group by with groupby()

Not that this is one of the rare functions with a side effect; you do not get a new DataFrame back until you apply an aggregation function.

Arguments:

by: column(s) to group by.
as_index: if True, the grouped columns will be set as the index of the resulting DataFrame. Default is True.
sort: if True, the grouped keys will be sorted. Default is True.

Example:

df.groupby(by='column_name', as_index=False, sort=True)

Aggregate with agg()

There are many ways to call agg(). I will try to consistently use this format, which has fewer surprises.

Each call needs aggregation function, such as sum(), mean(), count(), etc. It is generally a good idea to include a n column that counts the number of rows in each group, which can be done by counting any column (e.g., student) that is not null. Functions include: sum, mean, median, min, max, count, first, last, std, var, etc.

Arguments:

agg(): takes a dictionary where the keys are the names of the new columns and the values are tuples of the form (column_name, aggregation_function).

import pandas as pd

# Create sample data
df = pd.DataFrame({
    "student": ['Bob', 'Bob', 'Sarah', 'Tim', 'John', 'John', 'John'],
    "class": ["Freshman", "Freshman", "Sophomore", "Sophomore", "Junior", "Senior", "Senior"],
    "score": [90, 80, 85, 88, 92, 95, 87]
})

print('Aggregated by student name')
df_result = df.groupby(by='student').agg(
    mean_student_score=('score', 'mean'),
    lowest_student_score=('score', 'min'),
    highest_student_score=('score', 'max'),
    n = ('student', 'count')
)
print(df_result)    

print('Aggregated by class')
df_class_result = df.groupby(by='class').agg(
    mean_student_score=('score', 'mean'),
    n = ('student', 'count')
)
print(df_class_result)

Aggregated by student name
         mean_student_score  lowest_student_score  highest_student_score  n
student                                                                    
Bob               85.000000                    80                     90  2
John              91.333333                    87                     95  3
Sarah             85.000000                    85                     85  1
Tim               88.000000                    88                     88  1
Aggregated by class
           mean_student_score  n
class                           
Freshman                 85.0  2
Junior                   92.0  1
Senior                   91.0  2
Sophomore                86.5  2

Join data

We regularly need to combine data from multiple tables. This is called a join or a merge. In pandas, we can use the merge() function to do this.

Arguments:

how: type of join to perform. Options are ‘left’, ‘right’, ‘outer’, ‘inner’. Default is ‘inner’.
on: column(s) to join on. If not specified, pandas will use columns with the same name in both tables.
left_on: column(s) from the left DataFrame to join on.
right_on: column(s) from the right DataFrame to join on.
suffixes: tuple of string suffixes to apply to overlapping column names in the left and right DataFrames.

# Join data example
import pandas as pd

# Create sample data
df_students = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

df_classes = pd.DataFrame({
    "student_id": [1, 1, 2, 4],
    "name": ["CS101", "Math101", "CS101", "History101"],
})

# Inner join
# Note the order, we want all students with classes, so first is df_students
# The 'how' parameter can be 'inner', 'left', 'right', or 'outer'   
# The left_on and right_on parameters specify the columns to join on
# The suffixes parameter specifies suffixes to add to overlapping column names
df_students_with_classes = pd.merge(df_students, 
                                    df_classes, 
                                    how='inner', 
                                    left_on='id', 
                                    right_on='student_id', 
                                    suffixes=('_student', '_class'))


df_students_with_classes

	id	name_student	student_id	name_class
0	1	Alice	1	CS101
1	1	Alice	1	Math101
2	2	Bob	2	CS101