Principal Component Analysis (PCA) Math to Code
PCA finds the directions of maximum variance in data by computing the eigenvectors of the covariance matrix, projecting data onto these principal components to reduce dimensionality while preserving the most information.
Mathematical Foundation
PCA decomposes the covariance matrix C = (1/N) X^T X (for centered X) via eigendecomposition C = V \u039b V^T, where columns of V are principal components and \u039b contains eigenvalues (variances).
PCA from Scratch
<pre><code class="language-python">import numpy as np
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Step 1: Center
X_centered = X - X.mean(axis=0)
# Step 2: Covariance matrix
C = np.cov(X_centered.T) # shape (4, 4)
# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)
# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
PC = eigenvectors[:, idx] # principal components as columns
# Step 5: Project
X_pca = X_centered @ PC[:, :2]
print(f"Projected shape: {X_pca.shape}")
print(f"Explained variance ratio: {eigenvalues[:2] / eigenvalues.sum()}")</pre>
PCA with scikit-learn
sklearn's PCA uses SVD internally for numerical stability and exposes explained variance ratios, components, and loadings.
Fitting and Transforming
<pre><code class="language-python">from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Components shape: {pca.components_.shape}") # (2, 4)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.3f}")</pre>
Visualization
<pre><code class="language-python">import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
for cls, name in enumerate(load_iris().target_names):
mask = y == cls
plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.7)
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.title('Iris PCA Projection')
plt.legend(); plt.show()</pre>
Choosing n_components
Use n_components as a float to set the minimum explained variance threshold, or use a scree plot to find the elbow.
Variance-Based Selection
<pre><code class="language-python"># Retain 95% of variance automatically
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(f"Components for 95% variance: {pca_95.n_components_}")</pre>