PCA is a method for reducing the number of columns (features) in a dataset while retaining as much of its variance as possible.
Outcomes
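Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a dataset by re-expressing it along a new set of axes called principal components.
Each principal component is a linear combination of the original features and is orthogonal to every other component.
The components are ordered so that the first captures the largest share of the variance, the second captures the largest share of what remains, and so on.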
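To make the idea of ordered, orthogonal components concrete, here is a minimal sketch that computes principal components by hand with NumPy. The toy data, the seed, and the variable names are assumptions made for illustration; this is one common way to compute PCA (eigendecomposition of the covariance matrix), not necessarily how any particular library does it internally.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 features, the first two correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.5, size=200),
    rng.normal(size=200),
])

# Center the data, then eigendecompose the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order].T  # each row is one principal component

# Fraction of the total variance carried by each component.
explained_ratio = eigvals / eigvals.sum()

print(components @ components.T)  # close to the identity: rows are orthonormal
print(explained_ratio)            # sorted from largest to smallest
```

The rows of `components` play the same role as the rows of `pca.components_` in scikit-learn, although the sign of each row may differ.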
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Create a dataset
df = pd.DataFrame({
    'height': [10, 12, 13, 15, 18, 20, 22, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'weight': [30, 32, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105],
    'age': np.random.randint(20, 60, size=17)  # unseeded: these values, and the PCA output, change on every run
})
# scikit-learn's PCA centers each feature itself but does not rescale it, so standardize first when features have different units or scales.
X = df[['height', 'weight', 'age']].values # shape: (17, 3)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# `X_scaled` has mean ~0 and variance ~1 in each column.
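As a quick sanity check on what standardization does, the sketch below applies `StandardScaler` to a small made-up array (`X_demo` is an assumption, not the tutorial's data) and compares the result with the manual formula `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 4x2 array with columns on different scales.
X_demo = np.array([
    [10.0, 30.0],
    [20.0, 50.0],
    [40.0, 75.0],
    [70.0, 105.0],
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Each column should now have mean ~0 and standard deviation ~1.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same result computed by hand.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(col_means, col_stds)
print(np.allclose(manual, X_demo_scaled))
```

Note that `StandardScaler` divides by the population standard deviation, so a check using `ddof=1` would come out slightly different from 1.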
# Fit PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# `pca.components_` is a 2×3 matrix. Each row is a principal component (PC1 and PC2), each column corresponds to a feature (`height`, `weight`, `age`).
# `pca.explained_variance_ratio_` gives the fraction of the total variance captured by each principal component; for this data it might look something like `[0.67, 0.33]`.
print("Principal components (directions):")
print(pca.components_)
print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)
Principal components (directions):
[[ 0.70317421 0.70333577 -0.10423451]
[ 0.07527239 0.07213746 0.99455028]]
Explained variance ratio:
[0.66635142 0.32967702]
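The explained variance ratio is also a common guide for deciding how many components to keep. The following sketch fits PCA with all components on hypothetical data (`X_demo`, the seed, and the 0.95 threshold are assumptions chosen for illustration) and picks the smallest number of components whose cumulative ratio reaches the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two strongly correlated features plus one independent feature.
rng = np.random.default_rng(42)
base = rng.normal(size=100)
X_demo = np.column_stack([
    base + rng.normal(scale=0.1, size=100),
    base + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit with all components to see the full variance breakdown.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # arbitrary choice for this sketch
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

# scikit-learn can make the same choice when n_components is a float in (0, 1).
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_demo_scaled)

print(cumulative)
print(n_keep, pca_auto.n_components_)
```

Any threshold is a judgment call; 0.95 is used here only because it is a frequently quoted default.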
We can now use the PCA results to do further analysis or visualization.
df_pca = pd.DataFrame(
    X_pca,
    columns=['PC1', 'PC2']
)
df_with_pca = pd.concat([df.reset_index(drop=True), df_pca], axis=1)
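It can help to see what the transformed columns actually contain. The sketch below, on hypothetical data (`X_demo` and the seed are assumptions), reproduces the PCA scores by projecting the centered data onto `components_` and then uses `inverse_transform` to map the scores back to the original feature space.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 50 samples, 3 features with unequal variances.
rng = np.random.default_rng(1)
mixing = np.array([
    [2.0, 0.0, 0.0],
    [1.5, 0.5, 0.0],
    [0.0, 0.0, 0.3],
])
X_demo = rng.normal(size=(50, 3)) @ mixing

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_demo)

# The scores are the centered data projected onto the component directions.
manual_scores = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

# Map the 2-D scores back to 3-D; the variance of the dropped component is lost.
X_back = pca_demo.inverse_transform(scores)
mse = np.mean((X_demo - X_back) ** 2)

print(np.allclose(scores, manual_scores))
print(mse)
```

The reconstruction error is small when the discarded components carry little variance, which is one way to judge whether two components are enough.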
import matplotlib.pyplot as plt
# Plot the PCA-transformed data with three scatter calls, one per age group
# (a single scatter call accepts only one marker style).
plt.figure(figsize=(12, 5))
young = df_with_pca[df_with_pca['age'] < 30]
middle = df_with_pca[(df_with_pca['age'] >= 30) & (df_with_pca['age'] < 45)]
old = df_with_pca[df_with_pca['age'] >= 45]
# Share one color scale across the three calls so the colorbar is valid for every group.
color_values = df_with_pca['height'] + df_with_pca['weight']
vmin, vmax = color_values.min(), color_values.max()
plt.scatter(
    young['PC1'], young['PC2'],
    c=young['height'] + young['weight'], vmin=vmin, vmax=vmax,
    marker='.', cmap='viridis', alpha=0.7, label='age < 30'
)
plt.scatter(
    middle['PC1'], middle['PC2'],
    c=middle['height'] + middle['weight'], vmin=vmin, vmax=vmax,
    marker='*', cmap='viridis', alpha=0.7, label='30 ≤ age < 45'
)
plt.scatter(
    old['PC1'], old['PC2'],
    c=old['height'] + old['weight'], vmin=vmin, vmax=vmax,
    marker='o', cmap='viridis', alpha=0.7, label='age ≥ 45'
)
plt.colorbar(label='height + weight')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Scatter (Age-Based Markers)')
plt.legend()
plt.tight_layout()
plt.show()

This image shows the three-feature dataset projected onto its first two principal components, with marker style indicating age group and color indicating height + weight.
