Clustering Text Documents

Text clustering groups similar documents together without labels, enabling topic discovery, content organization, and exploratory text analysis.

Feature Extraction for Text

Text must be converted to numerical vectors before clustering. TF-IDF (Term Frequency-Inverse Document Frequency) is a standard approach: it weights each term by how often it appears in a document, discounted by how many documents in the corpus contain it.
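To make the weighting concrete, here is a minimal sketch on a three-document corpus. The corpus and the variable names (`docs`, `toy_vectorizer`, `toy_tfidf`) are hypothetical and used only for illustration. The vectorizer is left at its defaults, so a term that appears in every document ("the") should receive a lower weight than a term unique to one document ("orbit").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "the rocket launch was delayed",
]

toy_vectorizer = TfidfVectorizer()  # default settings
toy_tfidf = toy_vectorizer.fit_transform(docs)
vocab = toy_vectorizer.vocabulary_  # maps each term to its column index

# Compare two term weights within the first document
row = toy_tfidf[0].toarray().ravel()
w_the = row[vocab["the"]]      # appears in all three documents
w_orbit = row[vocab["orbit"]]  # appears only in the first document
print(f"the: {w_the:.3f}, orbit: {w_orbit:.3f}")
```

Both terms occur once in the first document, so any difference in weight comes from the inverse document frequency component.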

TF-IDF Vectorization

<pre><code class="language-python">from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load a subset of 20 newsgroups
categories = ['rec.sport.baseball', 'sci.space',
              'talk.politics.misc', 'comp.graphics']
news = fetch_20newsgroups(subset='train', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english',
                             min_df=2, max_df=0.95)
X_tfidf = vectorizer.fit_transform(news.data)
print(f"TF-IDF matrix shape: {X_tfidf.shape}")</code></pre>

Clustering with K-Means

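K-Means minimizes Euclidean distance, while text similarity is usually measured with cosine similarity. The two agree once TF-IDF vectors are L2-normalized: for unit-length vectors, squared Euclidean distance is a decreasing function of cosine similarity, so K-Means on normalized vectors behaves much like clustering by cosine similarity.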
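The link between the two measures can be checked numerically. For unit-length vectors a and b, the identity is ||a - b||^2 = 2 - 2 cos(a, b). The sketch below uses random non-negative vectors as stand-ins for TF-IDF rows; the names `a_unit`, `b_unit`, `cosine`, and `sq_dist` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random non-negative vectors standing in for TF-IDF rows
a = rng.random(50)
b = rng.random(50)

# L2-normalize each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = float(a_unit @ b_unit)                  # cosine similarity
sq_dist = float(np.sum((a_unit - b_unit) ** 2))  # squared Euclidean distance

print(f"squared distance: {sq_dist:.6f}")
print(f"2 - 2 * cosine:   {2 - 2 * cosine:.6f}")
```

The two printed values should match up to floating-point error. Note that standard K-Means does not re-normalize its centroids, so the correspondence with cosine-based clustering is close rather than exact.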

Fitting and Evaluating

<pre><code class="language-python">from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

# Normalize for cosine-equivalent K-Means
X_norm = normalize(X_tfidf)

km = KMeans(n_clusters=4, init='k-means++', n_init=5,
            max_iter=100, random_state=42)
labels = km.fit_predict(X_norm)
print(f"ARI: {adjusted_rand_score(news.target, labels):.3f}")</code></pre>

Top Terms per Cluster

<pre><code class="language-python">terms = vectorizer.get_feature_names_out()

# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_id in range(4):
    top_terms = [terms[i] for i in order_centroids[cluster_id, :10]]
    print(f"Cluster {cluster_id}: {', '.join(top_terms)}")</code></pre>

Dimensionality Reduction for Text

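High-dimensional TF-IDF vectors (thousands of dimensions) often benefit from dimensionality reduction before clustering. Reducing the dimensionality speeds up K-Means and can improve cluster quality by merging correlated terms into shared latent dimensions.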

Truncated SVD (LSA) for Sparse Text

<pre><code class="language-python">from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Re-normalize rows after SVD; a pipeline step must be a
# transformer object such as Normalizer
svd = TruncatedSVD(n_components=100, random_state=42)
pipeline = make_pipeline(svd, Normalizer(copy=False))
X_lsa = pipeline.fit_transform(X_tfidf)

km_lsa = KMeans(n_clusters=4, random_state=42, n_init=5)
labels_lsa = km_lsa.fit_predict(X_lsa)
print(f"LSA + K-Means ARI: {adjusted_rand_score(news.target, labels_lsa):.3f}")</code></pre>
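For experimenting without downloading 20 newsgroups, the same reduction can be sketched on a small corpus. The six documents, the choice of three components, and the names `toy_docs`, `lsa`, and `X_small` are assumptions made for illustration. The sketch uses the `Normalizer` transformer to rescale rows to unit length after the SVD and reads the fitted SVD step back from the pipeline to report how much variance the retained components explain.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus with two rough topics, for illustration only
toy_docs = [
    "the rocket reached orbit after launch",
    "the satellite launch reached orbit",
    "the rocket and satellite launch was delayed",
    "the pitcher threw the ball to the catcher",
    "the catcher dropped the ball in the game",
    "the pitcher won the game",
]

tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Reduce to 3 latent dimensions, then rescale each row to unit length
lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=0),
                    Normalizer(copy=False))
X_small = lsa.fit_transform(tfidf)

# Read the fitted SVD step back from the pipeline by its generated name
fitted_svd = lsa.named_steps["truncatedsvd"]
explained = fitted_svd.explained_variance_ratio_.sum()
row_norms = np.linalg.norm(X_small, axis=1)

print(f"Reduced shape: {X_small.shape}")
print(f"Explained variance (3 components): {explained:.2f}")
```

On a real corpus, the explained variance ratio is one rough guide for choosing `n_components`, though the value that gives the best clusters still has to be found empirically.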