Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm iterates between computing expected assignments (E-step) and maximizing parameters given those assignments (M-step) to fit models with latent variables.
The Two EM Steps
EM alternates between two steps until convergence: the E-step computes posterior probabilities of latent variables (soft cluster assignments), and the M-step updates model parameters to maximize the expected log-likelihood.
E-Step: Compute Responsibilities
For GMMs, the E-step computes the responsibility r_{ik} = \u03c0_k \u2115(x_i; \u03bc_k, \u03a3_k) / p(x_i) — the probability that component k generated sample i. This is a Bayes' theorem application using current parameter estimates.
M-Step: Update Parameters
Given responsibilities from the E-step, the M-step updates the parameters analytically: \u03bc_k = \u03a3_i r_{ik} x_i / N_k, \u03a3_k = \u03a3_i r_{ik} (x_i - \u03bc_k)(x_i - \u03bc_k)^T / N_k, and \u03c0_k = N_k / N, where N_k = \u03a3_i r_{ik}.
Convergence Properties
EM is guaranteed to monotonically increase the log-likelihood at every iteration, converging to a local maximum. It never decreases the objective, but convergence to the global maximum is not guaranteed.
Monitoring Convergence
Multiple Restarts
Since EM converges to local optima, running with n_init > 1 starts from multiple random initializations and returns the result with the highest log-likelihood. For GMMs, K-Means initialization (init_params='kmeans') often provides better starting points than random initialization.
EM Beyond GMMs
EM is a general-purpose algorithm for any model with latent variables: Hidden Markov Models (Baum-Welch algorithm), topic models (LDA), and missing data imputation all use EM principles.
EM vs. K-Means
K-Means can be seen as a special case of EM with hard assignments (each point belongs to exactly one cluster) and spherical, equal-variance Gaussians. GMM with EM generalizes this to soft assignments and arbitrary covariance structures, at the cost of more parameters and greater computational expense per iteration.