course_model

Pandas Grouping and Joining

This tutorial shows how to group and join data using the Pandas library in Python.

Outcomes:

Links:

Grouping Data

We can group data by one or more columns and then perform aggregate functions on the grouped data. This is done using the groupby() and agg() functions.

In this class, we will always use these two functions together in the same call. This is because groupby() does not return a DataFrame, but rather a GroupBy object that requires an aggregation function to produce a DataFrame.

Group by with groupby()

Not that this is one of the rare functions with a side effect; you do not get a new DataFrame back until you apply an aggregation function.

Arguments:

Example:

df.groupby(by='column_name', as_index=False, sort=True)

Aggregate with agg()

There are many ways to call agg(). I will try to consistently use this format, which has fewer surprises.

Each call needs aggregation function, such as sum(), mean(), count(), etc. It is generally a good idea to include a n column that counts the number of rows in each group, which can be done by counting any column (e.g., student) that is not null. Functions include: sum, mean, median, min, max, count, first, last, std, var, etc.

Arguments:

import pandas as pd

# Create sample data
df = pd.DataFrame({
    "student": ['Bob', 'Bob', 'Sarah', 'Tim', 'John', 'John', 'John'],
    "class": ["Freshman", "Freshman", "Sophomore", "Sophomore", "Junior", "Senior", "Senior"],
    "score": [90, 80, 85, 88, 92, 95, 87]
})

print('Aggregated by student name')
df_result = df.groupby(by='student').agg(
    mean_student_score=('score', 'mean'),
    lowest_student_score=('score', 'min'),
    highest_student_score=('score', 'max'),
    n = ('student', 'count')
)
print(df_result)    

print('Aggregated by class')
df_class_result = df.groupby(by='class').agg(
    mean_student_score=('score', 'mean'),
    n = ('student', 'count')
)
print(df_class_result)


Aggregated by student name
         mean_student_score  lowest_student_score  highest_student_score  n
student                                                                    
Bob               85.000000                    80                     90  2
John              91.333333                    87                     95  3
Sarah             85.000000                    85                     85  1
Tim               88.000000                    88                     88  1
Aggregated by class
           mean_student_score  n
class                           
Freshman                 85.0  2
Junior                   92.0  1
Senior                   91.0  2
Sophomore                86.5  2

Join data

We regularly need to combine data from multiple tables. This is called a join or a merge. In pandas, we can use the merge() function to do this.

Arguments:

# Join data example
import pandas as pd

# Create sample data
df_students = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

df_classes = pd.DataFrame({
    "student_id": [1, 1, 2, 4],
    "name": ["CS101", "Math101", "CS101", "History101"],
})

# Inner join
# Note the order, we want all students with classes, so first is df_students
# The 'how' parameter can be 'inner', 'left', 'right', or 'outer'   
# The left_on and right_on parameters specify the columns to join on
# The suffixes parameter specifies suffixes to add to overlapping column names
df_students_with_classes = pd.merge(df_students, 
                                    df_classes, 
                                    how='inner', 
                                    left_on='id', 
                                    right_on='student_id', 
                                    suffixes=('_student', '_class'))


df_students_with_classes

id name_student student_id name_class
0 1 Alice 1 CS101
1 1 Alice 1 Math101
2 2 Bob 2 CS101