Principal Component Analysis (PCA)

PCA is a method for reducing the number of columns (features) in a dataset while retaining as much of its variance, and therefore as much information, as possible.

Outcomes

PCA Algorithm

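Principal Component Analysis (PCA) is a technique for reducing dimensionality: it projects the data onto a small number of new axes, called principal components, chosen to capture as much variance as possible.

Each principal component is a unit-length linear combination of the original features, orthogonal to the other components, and ordered so that the first one captures the most variance.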

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create a dataset
df = pd.DataFrame({
    'height': [10, 12, 13, 15, 18, 20, 22, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'weight': [30, 32, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105],
    'age': np.random.randint(20, 60, size=17)  # random, so results differ between runs
})

# PCA is sensitive to feature scale. sklearn's PCA centers the data itself, so standardizing mainly matters when features use different units.
X = df[['height', 'weight', 'age']].values  # shape: (17, 3)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) 
# `X_scaled` has mean ~0 and variance ~1 in each column.

# Fit PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# `pca.components_` is a 2×3 matrix: each row is a principal component (PC1, PC2) and each column corresponds to a feature (`height`, `weight`, `age`).
# `pca.explained_variance_ratio_` gives the fraction of the total variance captured by each principal component, for example `[0.67, 0.33]`.
print("Principal components (directions):")
print(pca.components_)

print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)

Output (values vary between runs because `age` is random):

Principal components (directions):
[[ 0.70317421  0.70333577 -0.10423451]
 [ 0.07527239  0.07213746  0.99455028]]

Explained variance ratio:
[0.66635142 0.32967702]
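
To make the two printed quantities less of a black box, the sketch below computes principal directions and explained variance ratios by hand with NumPy, using an eigendecomposition of the covariance matrix. The helper name `pca_eig` and the synthetic `X_demo` data are hypothetical and exist only for this illustration. scikit-learn computes PCA with an SVD rather than this exact route, so the signs of individual components may differ even when the directions agree.

```python
import numpy as np


def pca_eig(X, n_components):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    # Center each column so the covariance describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    components = eigvecs[:, :n_components].T       # rows are principal directions
    ratio = eigvals[:n_components] / eigvals.sum()  # fraction of variance per component
    scores = X_centered @ components.T              # data projected onto the components
    return components, ratio, scores


# Synthetic data (an assumption for this sketch): two strongly correlated columns plus one independent column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1]

components, ratio, scores = pca_eig(X_demo, n_components=2)
print(components)
print(ratio)
```

The rows of `components` play the same role as `pca.components_`, and `ratio` corresponds to `pca.explained_variance_ratio_`.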

Add PCA results back to a DataFrame

We can now use the PCA results to do further analysis or visualization.

df_pca = pd.DataFrame(
    X_pca,
    columns=['PC1', 'PC2']
)

df_with_pca = pd.concat([df.reset_index(drop=True), df_pca], axis=1)
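
One possible follow-up analysis is to check how much of the standardized data the two components actually retain. The sketch below maps the reduced data back to the original feature space with `inverse_transform` and measures the reconstruction error. It uses its own synthetic data (`X_demo`, an assumption for this example) so that it runs on its own; the same calls should apply to `pca` and `X_scaled` from above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): two nearly redundant columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 1))
X_demo = np.hstack([
    base + 0.05 * rng.normal(size=(40, 1)),
    2 * base + 0.05 * rng.normal(size=(40, 1)),
    rng.normal(size=(40, 1)),
])

X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_std)

# Project to 2 components, then map back to the 3 standardized features.
X_back = pca_demo.inverse_transform(pca_demo.transform(X_std))

mse = np.mean((X_std - X_back) ** 2)
retained = pca_demo.explained_variance_ratio_.sum()

print(f"variance retained:         {retained:.4f}")
print(f"mean squared reconstruction error: {mse:.4f}")
```

Because every standardized column has variance 1, the mean squared reconstruction error equals one minus the retained variance fraction, which ties the "information retained" idea to a concrete number.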

import matplotlib.pyplot as plt

# Plot PCA-transformed data with three scatter calls, one per age group (each matplotlib scatter call takes a single marker style)
plt.figure(figsize=(12, 5))

young = df_with_pca[df_with_pca['age'] < 30]
middle = df_with_pca[(df_with_pca['age'] >= 30) & (df_with_pca['age'] < 45)]
old = df_with_pca[df_with_pca['age'] >= 45]

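# Use one color scale for all three groups so the single colorbar is valid for every point.
total = df_with_pca['height'] + df_with_pca['weight']
vmin, vmax = total.min(), total.max()

plt.scatter(
    young['PC1'], young['PC2'],
    c=young['height'] + young['weight'], vmin=vmin, vmax=vmax,
    marker='.', cmap='viridis', alpha=0.7, label='age < 30'
)
plt.scatter(
    middle['PC1'], middle['PC2'],
    c=middle['height'] + middle['weight'], vmin=vmin, vmax=vmax,
    marker='*', cmap='viridis', alpha=0.7, label='30 ≤ age < 45'
)
plt.scatter(
    old['PC1'], old['PC2'],
    c=old['height'] + old['weight'], vmin=vmin, vmax=vmax,
    marker='o', cmap='viridis', alpha=0.7, label='age ≥ 45'
)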

plt.colorbar(label='height + weight')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Scatter (Age-Based Markers)')
plt.legend()
plt.tight_layout()
plt.show()

Example PCA Process

This image shows the PCA training process on a dataset with three features reduced to two principal components.

[Figure: PCA Process]
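
A three-features-to-two-components process of this kind might be sketched as a fit step followed by a transform step. The sketch assumes that "training" means fitting the scaler and PCA on one dataset and then applying the fitted transformation to new rows. The names `X_train`, `X_new`, and `reducer` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (assumptions): 100 training rows and 5 new rows, 3 features each.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
X_new = rng.normal(size=(5, 3))

# Chain standardization and PCA so both are fitted on the training data only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
reducer.fit(X_train)

# Apply the already-fitted scaling and projection; nothing is re-estimated here.
Z_train = reducer.transform(X_train)
Z_new = reducer.transform(X_new)

print(Z_train.shape)  # (100, 2)
print(Z_new.shape)    # (5, 2)
```

Fitting only on the training rows keeps the new rows from influencing the learned means, scales, and directions.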