This tutorial shows how to create and remove columns in Pandas.
Outcomes:
| Use & (and) and | (or) to combine multiple conditions |
Links:
We often need to fix fields in a dataframe. Common operations include removing and renaming.
# Sample fields
import pandas as pd
# Create sample data
df_students = pd.DataFrame({
'Student Id': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
'Scr': [85, 90, 78, 10],
'Extracurricular': ['Basketball', 'Soccer', 'Chess', 'Debate']
})
# Rename all fields to be lowercase and use underscores
# This is a common convention in pandas dataframes
df_students.columns = [col.lower().replace(' ', '_') for col in df_students.columns ]
# Remove fields we do not need in our analysis
df_students = df_students.drop(columns=['extracurricular', 'name', 'class'])
# Rename 'scr' to the easier to understand 'exam_score'
# This can be easier to read if you put columns on a new line
df_students = df_students.rename(
columns={'scr': 'exam_score'}
)
print(df_students)
student_id exam_score
0 1 85
1 2 90
2 3 78
3 4 10
We can create a new field in several ways:
Note that we can add a field directly, or use the assign function. The assign function is often more convenient when adding multiple fields at once. It also works better in method chains, as well as filtered datsets.
# Sample changing values
import pandas as pd
import numpy as np
# Create sample data
df_students = pd.DataFrame({
'name': [' alice ', 'bob', 'charlie', 'david'],
'score': [85, 90, 78, np.nan],
})
# Create a new field with a constant value (same for all rows)
df_students = df_students.assign(passed=True)
# Create a new field based on a condition
# Note that it is either True or False based on the condition
# This uses the same syntax as filtering
# Note that I put the each new field on its own line for readability. This is optional.
df_students = df_students.assign(
honor_roll = df_students['score'] >= 85
)
# Use a function to create a new field
# Note that if we want to use a string function, use str and then our function.
# Trim whitespace
df_students = df_students.assign(
name = df_students['name'].str.strip()
)
# Set to an all capitalized version of the tested field
df_students = df_students.assign(
name = df_students['name'].str.capitalize()
)
# Round scores to nearest 10
df_students = df_students.assign(
score_rounded = df_students['score'].round(-1)
)
df_students
| name | score | passed | honor_roll | score_rounded | |
|---|---|---|---|---|---|
| 0 | Alice | 85.0 | True | True | 80.0 |
| 1 | Bob | 90.0 | True | True | 90.0 |
| 2 | Charlie | 78.0 | True | False | 80.0 |
| 3 | David | NaN | True | False | NaN |
A very common operation is to change values based on a condition. This can be done with numpy.where
The syntax is: np.where(condition, x, y)
# Sample changing values
import pandas as pd
import numpy as np
# Create sample data
df_students = pd.DataFrame({
'name': ['alice', 'bob', 'charlie', 'david'],
'class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
'score': [85, 90, 78, np.nan],
})
# Set NA values in a field.
# You can also test for *not* NA with ~(df['field'].isna()), ...
df_students = df_students.assign(
score_not_na = np.where(df_students['score'].isna(), 0, df_students['score'])
)
# Set to a constant value.
df_students = df_students.assign(
is_david = np.where(df_students['name'] == 'david', 1, 0)
)
# Set to a modified version of another field
# Note that we are only adding 5 when the condition is met, otherwise we subtract 5
df_students = df_students.assign(
score_plus_5 = np.where(
df_students['class'] == 'Freshman',
df_students['score'] + 5,
df_students['score'] - 5
)
)
# Use a string function like contains to check for substrings
df_students = df_students.assign(
name_contains_a = np.where(
df_students['name'].str.contains('a'),
1,
0)
)
# Set to a constant value for one of two conditions using | (or) or & (and)
df_students = df_students.assign(
is_not_david = np.where(
(df_students['name'] == 'alice') |
(df_students['name'] == 'charlie') |
(df_students['name'] == 'david'),
1,
0)
)
df_students
| name | class | score | score_not_na | is_david | score_plus_5 | name_contains_a | is_not_david | |
|---|---|---|---|---|---|---|---|---|
| 0 | alice | Freshman | 85.0 | 85.0 | 0 | 90.0 | 1 | 1 |
| 1 | bob | Sophomore | 90.0 | 90.0 | 0 | 85.0 | 0 | 0 |
| 2 | charlie | Junior | 78.0 | 78.0 | 0 | 73.0 | 1 | 1 |
| 3 | david | Senior | NaN | 0.0 | 1 | NaN | 1 | 1 |
If we have a lot of fixes, it’s often easier to use a dictionary and map the values.
import pandas as pd
# Create sample data
df_students = pd.DataFrame({
'class': ['Freshman', 'Sophomore', 'Junior', 'Senior'],
})
# Change values based on a dictionary mapping
class_mapping = {
'Freshman': 1,
'Sophomore': 2,
'Junior': 3,
'Senior': 4
}
df_students = df_students.assign(year = df_students['class'].map(class_mapping))
df_students
| class | year | |
|---|---|---|
| 0 | Freshman | 1 |
| 1 | Sophomore | 2 |
| 2 | Junior | 3 |
| 3 | Senior | 4 |