course_model

Pandas 2: Data mutation

This tutorial shows how to create and remove columns in Pandas.

Outcomes:

Links:

Fix fields with dropping and renaming

We often need to fix fields in a dataframe. Common operations include removing and renaming.

# Sample fields 
import pandas as pd

# Create sample data
df_students = pd.DataFrame({
    'Student Id': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
    'Scr': [85, 90, 78, 10],
    'Extracurricular': ['Basketball', 'Soccer', 'Chess', 'Debate']
})

# Rename all fields to be lowercase and use underscores
# This is a common convention in pandas dataframes
df_students.columns = [col.lower().replace(' ', '_') for col in df_students.columns ]

# Remove fields we do not need in our analysis
df_students = df_students.drop(columns=['extracurricular', 'name', 'class'])

# Rename 'scr' to the easier to understand 'exam_score'
# This can be easier to read if you put columns on a new line
df_students = df_students.rename(
    columns={'scr': 'exam_score'}
)

print(df_students)
   student_id  exam_score
0           1          85
1           2          90
2           3          78
3           4          10

Create a new field

We can create a new field in several ways:

Note that we can add a field directly, or use the assign function. The assign function is often more convenient when adding multiple fields at once. It also works better in method chains, as well as filtered datsets.

# Sample changing values
import pandas as pd
import numpy as np

# Create sample data
df_students = pd.DataFrame({
    'name': ['  alice ', 'bob', 'charlie', 'david'],
    'score': [85, 90, 78, np.nan],
})

# Create a new field with a constant value (same for all rows)
df_students = df_students.assign(passed=True)

# Create a new field based on a condition
# Note that it is either True or False based on the condition
# This uses the same syntax as filtering
# Note that I put the each new field on its own line for readability. This is optional.
df_students = df_students.assign(
    honor_roll = df_students['score'] >= 85
)


# Use a function to create a new field
# Note that if we want to use a string function, use str and then our function.

# Trim whitespace 
df_students = df_students.assign(
    name = df_students['name'].str.strip()
)

# Set to an all capitalized version of the tested field
df_students = df_students.assign(
    name = df_students['name'].str.capitalize()
)

# Round scores to nearest 10
df_students = df_students.assign(
    score_rounded = df_students['score'].round(-1)
)

df_students
name score passed honor_roll score_rounded
0 Alice 85.0 True True 80.0
1 Bob 90.0 True True 90.0
2 Charlie 78.0 True False 80.0
3 David NaN True False NaN

Conditional changes with np.where

A very common operation is to change values based on a condition. This can be done with numpy.where

The syntax is: np.where(condition, x, y)

# Sample changing values
import pandas as pd
import numpy as np

# Create sample data
df_students = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
    'score': [85, 90, 78, np.nan],
})

# Set NA values in a field. 
# You can also test for *not* NA with ~(df['field'].isna()), ...
df_students = df_students.assign(
    score_not_na = np.where(df_students['score'].isna(), 0, df_students['score'])
)

# Set to a constant value.
df_students = df_students.assign(
    is_david = np.where(df_students['name'] == 'david', 1, 0)
)

# Set to a modified version of another field
# Note that we are only adding 5 when the condition is met, otherwise we subtract 5
df_students = df_students.assign(
    score_plus_5 = np.where(
        df_students['class'] == 'Freshman', 
        df_students['score'] + 5, 
        df_students['score'] - 5
    )
)

# Use a string function like contains to check for substrings
df_students = df_students.assign(
    name_contains_a = np.where(
        df_students['name'].str.contains('a'), 
        1,
        0)
)

# Set to a constant value for one of two conditions using | (or) or & (and)
df_students = df_students.assign(
    is_not_david = np.where(
        (df_students['name'] == 'alice') | 
            (df_students['name'] == 'charlie') |
            (df_students['name'] == 'david'), 
        1, 
        0)
)


df_students
name class score score_not_na is_david score_plus_5 name_contains_a is_not_david
0 alice Freshman 85.0 85.0 0 90.0 1 1
1 bob Sophomore 90.0 90.0 0 85.0 0 0
2 charlie Junior 78.0 78.0 0 73.0 1 1
3 david Senior NaN 0.0 1 NaN 1 1

Conditional changes with a dictionary and map

If we have a lot of fixes, it’s often easier to use a dictionary and map the values.

import pandas as pd

# Create sample data
df_students = pd.DataFrame({
    'class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
})

# Change values based on a dictionary mapping
class_mapping = {
    'Freshman': 1,
    'Sophomore': 2,
    'Junior': 3,
    'Senior': 4
}

df_students = df_students.assign(year = df_students['class'].map(class_mapping))

df_students
class year
0 Freshman 1
1 Sophomore 2
2 Junior 3
3 Senior 4