Principal Component Analysis (PCA): Math to Code

PCA finds the directions of maximum variance in data by computing the eigenvectors of the covariance matrix. Projecting the data onto the leading eigenvectors (the principal components) reduces dimensionality while preserving as much of the variance as possible.

Mathematical Foundation

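PCA decomposes the covariance matrix C = (1/N) X^T X (for centered X) via eigendecomposition C = V Λ V^T, where the columns of V are the principal components and the diagonal matrix Λ contains the eigenvalues, which are the variances along those components. NumPy's np.cov normalizes by 1/(N-1) by default rather than 1/N; this rescales every eigenvalue by the same factor and leaves the eigenvectors and the explained variance ratios unchanged.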
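As a quick sanity check of the decomposition above, the following sketch builds a small synthetic dataset (the data, seed, and mixing matrix are assumptions chosen only for illustration) and confirms three properties numerically: that V Λ V^T reconstructs C, that V is orthogonal, and that the variance of the data projected onto each eigenvector equals the corresponding eigenvalue.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

X_centered = X - X.mean(axis=0)
N = X_centered.shape[0]

# Covariance with the 1/N normalization used in the formula.
C = (X_centered.T @ X_centered) / N

# eigh is the right routine for a symmetric matrix; eigenvalues come back ascending.
eigenvalues, V = np.linalg.eigh(C)

# C = V Λ V^T: the decomposition reconstructs the covariance matrix.
reconstruction_ok = np.allclose(C, V @ np.diag(eigenvalues) @ V.T)

# V is orthogonal: V^T V = I.
orthonormal_ok = np.allclose(V.T @ V, np.eye(3))

# Variance along each eigenvector equals its eigenvalue (np.var uses 1/N by default).
projected_var = (X_centered @ V).var(axis=0)
variance_ok = np.allclose(projected_var, eigenvalues)

print(reconstruction_ok, orthonormal_ok, variance_ok)
```

The same checks should hold with the 1/(N-1) normalization, provided the projected variance is computed with `ddof=1` to match.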

PCA from Scratch

<pre><code class="language-python">import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Step 1: Center
X_centered = X - X.mean(axis=0)

# Step 2: Covariance matrix
C = np.cov(X_centered.T)  # shape (4, 4)

# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
PC = eigenvectors[:, idx]  # principal components as columns

# Step 5: Project
X_pca = X_centered @ PC[:, :2]

print(f"Projected shape: {X_pca.shape}")
print(f"Explained variance ratio: {eigenvalues[:2] / eigenvalues.sum()}")</code></pre>

PCA with scikit-learn

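sklearn's PCA uses SVD internally for numerical stability and exposes the principal components (components_) and the explained variances and their ratios (explained_variance_, explained_variance_ratio_). Loadings are not a separate attribute, but they can be derived from the components and the explained variances.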
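To make the link between the SVD route and the covariance eigendecomposition concrete, here is a minimal NumPy-only sketch. It is not scikit-learn's actual implementation, and the synthetic data, seed, and mixing matrix are assumptions for illustration. It shows that the squared singular values of the centered data, divided by N - 1, match the eigenvalues of the sample covariance, and that the right singular vectors match the eigenvectors up to sign.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 4 features with distinct spreads.
rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.0, 0.5])
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
X = (rng.normal(size=(150, 4)) * scales) @ mixing
X_centered = X - X.mean(axis=0)
N = X_centered.shape[0]

# Route 1: eigendecomposition of the sample covariance (1/(N-1) normalization).
C = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Route 2: thin SVD of the centered data, X_centered = U @ diag(S) @ Vt.
# Singular values are returned in descending order.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Singular values map to variances: lambda_i = s_i^2 / (N - 1).
svd_variances = S**2 / (N - 1)
variances_match = np.allclose(svd_variances, eigenvalues)

# Rows of Vt are the principal directions; each matches an eigenvector up to sign,
# so the absolute value of each pairwise dot product should be 1.
alignment = np.abs(np.sum(Vt.T * eigenvectors, axis=0))
directions_match = np.allclose(alignment, 1.0)

print(variances_match, directions_match)
```

The sign ambiguity is inherent to both decompositions, so projections computed by the two routes can differ by a flipped axis while describing the same subspace.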

Fitting and Transforming

<pre><code class="language-python">from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Components shape: {pca.components_.shape}")  # (2, 4)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.3f}")</code></pre>

Visualization

<pre><code class="language-python">import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
for cls, name in enumerate(load_iris().target_names):
    mask = y == cls
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.7)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Iris PCA Projection')
plt.legend()
plt.show()</code></pre>

Choosing n_components

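Pass n_components as a float between 0 and 1 to keep the smallest number of components whose cumulative explained variance reaches that fraction, or inspect a scree plot of the explained variance per component and look for the elbow.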
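The threshold rule can be sketched by hand with NumPy: compute the explained variance ratio per component, accumulate it, and pick the first component count that reaches the target. The synthetic data, the seed, and the 0.95 threshold below are assumptions for illustration, and this is a simplified version of the idea rather than scikit-learn's exact selection logic. The per-component ratios printed here are the values a scree plot would display.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 6 features with decreasing spread.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X = rng.normal(size=(300, 6)) * scales
X_centered = X - X.mean(axis=0)

# Eigenvalues of the sample covariance, reordered from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

# Explained variance ratio per component and its running total.
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative ratio reaches the threshold.
threshold = 0.95
k = int(np.argmax(cumulative >= threshold)) + 1

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: ratio={r:.3f}  cumulative={c:.3f}")
print(f"Components needed for {threshold:.0%} variance: {k}")
```

Plotting `ratio` against the component index gives the scree plot; the elbow is where the curve flattens and additional components contribute little.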

Variance-Based Selection

<pre><code class="language-python"># Retain 95% of variance automatically
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)

print(f"Components for 95% variance: {pca_95.n_components_}")</code></pre>