Clustering Text Documents
Text clustering groups similar documents together without labels, enabling topic discovery, content organization, and exploratory text analysis.
Feature Extraction for Text
Text must be converted to numerical vectors before clustering. TF-IDF (Term Frequency-Inverse Document Frequency) is a common baseline: it weights each term by how often it occurs in a document, discounted by how many documents in the corpus contain it, so that terms appearing almost everywhere contribute little.
TF-IDF Vectorization
<pre><code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load a four-category subset of the 20 newsgroups training set
categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.misc', 'comp.graphics']
news = fetch_20newsgroups(subset='train', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

# Keep the 5,000 most frequent terms; drop English stop words, terms found in
# fewer than 2 documents, and terms found in more than 95% of documents
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english',
                             min_df=2, max_df=0.95)
X_tfidf = vectorizer.fit_transform(news.data)
print(f"TF-IDF matrix shape: {X_tfidf.shape}")</code></pre>
Clustering with K-Means
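K-Means minimizes squared Euclidean distance, whereas text similarity is usually measured with cosine similarity. The two agree once the TF-IDF vectors are L2-normalized: for unit-length vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so the nearest centroid by Euclidean distance is also the most cosine-similar one. This is why K-Means works reasonably well on normalized TF-IDF vectors.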
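The link between Euclidean distance and cosine similarity on unit-length vectors can be checked numerically. The sketch below uses two random non-negative vectors as stand-ins for TF-IDF rows; the vectors and their dimensionality are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary non-negative vectors standing in for TF-IDF rows
a = rng.random(50)
b = rng.random(50)

# L2-normalize so each vector has unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
sq_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(f"squared Euclidean distance: {sq_euclidean:.6f}")
print(f"2 - 2 * cosine similarity:  {2 - 2 * cosine:.6f}")
```

Both printed values match, so ranking neighbours by Euclidean distance on normalized vectors gives the same order as ranking by cosine similarity.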
Fitting and Evaluating
<pre><code class="language-python">from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

# L2-normalize each row so Euclidean K-Means behaves like cosine-based clustering.
# TfidfVectorizer already applies norm='l2' by default, so this is a safeguard.
X_norm = normalize(X_tfidf)

km = KMeans(n_clusters=4, init='k-means++', n_init=5,
            max_iter=100, random_state=42)
labels = km.fit_predict(X_norm)

# ARI compares the cluster assignments with the true newsgroup labels
print(f"ARI: {adjusted_rand_score(news.target, labels):.3f}")</code></pre>
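Cluster IDs are arbitrary, so an evaluation metric should not depend on which integer each cluster happens to receive. The following sketch, using small hand-made label lists (hypothetical data), illustrates that the adjusted Rand index (ARI) compares partitions rather than label values.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping as true_labels, but with the cluster IDs renamed
same_partition_renamed = [2, 2, 0, 0, 1, 1]

# One point assigned to the wrong group
one_point_moved = [2, 2, 0, 0, 1, 0]

ari_perfect = adjusted_rand_score(true_labels, same_partition_renamed)
ari_partial = adjusted_rand_score(true_labels, one_point_moved)

print(f"Renamed but identical partition: {ari_perfect:.3f}")
print(f"One point misassigned:           {ari_partial:.3f}")
```

An ARI of 1.0 means the two partitions are identical up to renaming, and values near 0 indicate agreement no better than chance.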
Top Terms per Cluster
<pre><code class="language-python">terms = vectorizer.get_feature_names_out()

# For each centroid, sort term indices by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_id in range(km.n_clusters):
    top_terms = [terms[i] for i in order_centroids[cluster_id, :10]]
    print(f"Cluster {cluster_id}: {', '.join(top_terms)}")</code></pre>
Dimensionality Reduction for Text
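High-dimensional TF-IDF vectors (thousands of dimensions) often benefit from dimensionality reduction before clustering: distance computations become cheaper, and projecting onto a smaller set of latent components can suppress noise from rare terms, which often improves cluster quality.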
Truncated SVD (LSA) for Sparse Text
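<pre><code class="language-python">from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Project TF-IDF onto 100 latent components, then rescale rows to unit length.
# Pipeline steps must be transformer objects, so use Normalizer rather than
# the normalize() function.
svd = TruncatedSVD(n_components=100, random_state=42)
pipeline = make_pipeline(svd, Normalizer(copy=False))
X_lsa = pipeline.fit_transform(X_tfidf)

km_lsa = KMeans(n_clusters=4, random_state=42, n_init=5)
labels_lsa = km_lsa.fit_predict(X_lsa)
print(f"LSA + K-Means ARI: {adjusted_rand_score(news.target, labels_lsa):.3f}")</code></pre>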