Gaussian Mixture Models (GMM)

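Gaussian Mixture Models (GMMs) assume the data are generated from a mixture of K Gaussian distributions. The component parameters (mixing weights, means, and covariances) are learned with the Expectation-Maximization (EM) algorithm, which yields soft, probabilistic cluster memberships rather than a single label per point.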


GMM as a Probabilistic Generative Model

The GMM density is p(x) = Σ_{k=1}^{K} π_k · 𝒩(x; μ_k, Σ_k), where the mixing weights π_k are non-negative and sum to 1, and each component 𝒩(x; μ_k, Σ_k) is a multivariate Gaussian with mean μ_k and covariance matrix Σ_k.
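To make the formula concrete, the sketch below evaluates the weighted sum of Gaussians by hand and compares it with the density reported by scikit-learn. The helper name mixture_density and the synthetic data are assumptions made for illustration; the comparison relies on score_samples returning the per-sample log-density log p(x).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative synthetic data (not from the original text)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)


def mixture_density(X, weights, means, covariances):
    """Evaluate p(x) = sum_k pi_k * N(x; mu_k, Sigma_k) for each row of X.

    Hypothetical helper; expects 'full' covariances of shape (K, d, d).
    """
    density = np.zeros(X.shape[0])
    for pi_k, mu_k, sigma_k in zip(weights, means, covariances):
        density += pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=sigma_k)
    return density


manual = mixture_density(X, gmm.weights_, gmm.means_, gmm.covariances_)
sklearn_density = np.exp(gmm.score_samples(X))  # score_samples gives log p(x)

weights_sum = gmm.weights_.sum()
densities_match = np.allclose(manual, sklearn_density)
print(f"Mixing weights sum to: {weights_sum:.6f}")
print(f"Manual density matches scikit-learn: {densities_match}")
```

The loop mirrors the formula term by term: one weighted Gaussian per component, added together.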

Soft vs. Hard Assignments

Unlike K-Means, which assigns each point to exactly one cluster (hard assignment), a GMM assigns each point a probability of belonging to each component. This makes GMMs better suited to overlapping clusters, and the probabilities serve as an uncertainty estimate for cluster membership.
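As a rough illustration of soft assignment, the sketch below computes the membership probabilities with Bayes' rule, r_ik = π_k 𝒩(x_i; μ_k, Σ_k) / Σ_j π_j 𝒩(x_i; μ_j, Σ_j), and compares them with predict_proba. The two overlapping blobs and the 0.9 threshold for calling a point "ambiguous" are arbitrary choices for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two deliberately overlapping blobs (illustrative data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [2.5, 0]],
                  cluster_std=1.0, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Numerator of Bayes' rule: pi_k * N(x_i; mu_k, Sigma_k), one column per component
weighted = np.column_stack([
    w * multivariate_normal.pdf(X, mean=m, cov=c)
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])
# Normalize each row so the memberships of a point sum to 1
responsibilities = weighted / weighted.sum(axis=1, keepdims=True)

probs = gmm.predict_proba(X)   # soft assignments from scikit-learn
hard = gmm.predict(X)          # hard labels

# Points whose most likely component has probability below 0.9
# (0.9 is an arbitrary threshold chosen for this sketch)
ambiguous = probs.max(axis=1) < 0.9
print(f"Manual responsibilities match predict_proba: "
      f"{np.allclose(responsibilities, probs, atol=1e-6)}")
print(f"Ambiguous points: {ambiguous.sum()} of {len(X)}")
```

The hard label from predict is simply the component with the largest probability, so it discards the information about how close the other components were.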

GMM in scikit-learn

GaussianMixture supports four covariance types: 'full', 'tied', 'diag', and 'spherical', offering flexibility from full covariance matrices to isotropic Gaussians.

Fitting and Predicting

<pre><code class="language-python">from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data: three blobs with different spreads
X, _ = make_blobs(n_samples=400, centers=3,
                  cluster_std=[0.5, 1.0, 0.7], random_state=42)

# n_init=5 runs EM from five initializations and keeps the best fit
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      n_init=5, random_state=42)
gmm.fit(X)

# Hard labels: the most probable component for each point
labels = gmm.predict(X)

# Soft probabilities: array of shape (n_samples, n_components)
probs = gmm.predict_proba(X)

print(f"BIC: {gmm.bic(X):.2f}")
print(f"Converged: {gmm.converged_}, Iterations: {gmm.n_iter_}")
</code></pre>

Covariance Types

  • 'full': Each component has its own full covariance matrix (most flexible, most parameters).
  • 'tied': All components share the same covariance matrix.
  • 'diag': Each component has its own diagonal covariance (axis-aligned ellipses).
  • 'spherical': Each component has a single variance (circular clusters).
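One way to see the difference between the four options is to inspect the shape of the fitted covariances_ attribute. The sketch below does this on illustrative two-dimensional data with three components; the data and the shapes dictionary are assumptions made for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data: 2 features, 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
    print(f"{cov_type:>9}: covariances_.shape = {shapes[cov_type]}")

# With K=3 components and d=2 features:
#   full      -> (3, 2, 2)  one d x d matrix per component
#   tied      -> (2, 2)     a single shared d x d matrix
#   diag      -> (3, 2)     d variances per component
#   spherical -> (3,)       one variance per component
```

Fewer stored values means fewer parameters to estimate, which can help when data are scarce but costs flexibility in cluster shape.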

Model Selection: BIC and AIC

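Choose the number of components using BIC (Bayesian Information Criterion) or AIC (Akaike Information Criterion). Both reward goodness of fit and penalize model complexity, and for both a lower value indicates a better model. BIC's penalty grows with the sample size, so it tends to favor simpler models than AIC.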
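For reference, the standard definitions are BIC = -2 ln L + p ln n and AIC = -2 ln L + 2p, where L is the maximized likelihood, p the number of free parameters, and n the number of samples. The sketch below recomputes both by hand for a 'full' covariance model and compares them with the bic and aic methods. The parameter count is derived here for the 'full' case only, and the data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=42)
n, d = X.shape
K = 3
gmm = GaussianMixture(n_components=K, covariance_type="full",
                      random_state=42).fit(X)

# Free parameters for 'full' covariance (assumption: this count applies
# to the 'full' case only):
#   K symmetric d x d covariances + K means + (K - 1) independent weights
n_params = K * d * (d + 1) // 2 + K * d + (K - 1)

# score() returns the mean per-sample log-likelihood, so scale by n
log_likelihood = gmm.score(X) * n

bic_manual = -2 * log_likelihood + n_params * np.log(n)
aic_manual = -2 * log_likelihood + 2 * n_params

print(f"Free parameters: {n_params}")
print(f"BIC manual / sklearn: {bic_manual:.2f} / {gmm.bic(X):.2f}")
print(f"AIC manual / sklearn: {aic_manual:.2f} / {gmm.aic(X):.2f}")
```

Both criteria share the same fit term and differ only in the penalty: p ln n for BIC versus 2p for AIC.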

Selecting n_components

<pre><code class="language-python">import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# X is the data array from the fitting example above
bics = []
for n in range(1, 8):
    gmm_n = GaussianMixture(n_components=n, n_init=5, random_state=42)
    gmm_n.fit(X)
    bics.append(gmm_n.bic(X))

plt.plot(range(1, 8), bics, 'bo-')
plt.xlabel('n_components')
plt.ylabel('BIC')
plt.title('GMM Model Selection via BIC')
plt.show()

# Lower BIC is better; +1 because the candidates start at n_components=1
print(f"Best n_components: {bics.index(min(bics)) + 1}")
</code></pre>