Topic Modeling with Deep Autoencoders

Topic modeling identifies latent themes in large document collections. Deep autoencoders approach this task by compressing bag-of-words document vectors through a bottleneck layer, learning topic distributions from which the input word distribution can be reconstructed.


Neural Topic Modeling

Neural autoencoders project bag-of-words distributions into low-dimensional latent spaces where bottleneck activations represent topics.

Limitations of LDA

Latent Dirichlet Allocation (LDA) is the most widely used classical algorithm for topic modeling. It uses a Bayesian generative process in which documents are modeled as mixtures of topics, and topics are modeled as distributions over words. However, LDA has several limitations. Inference relies on sampling methods (such as Gibbs sampling) or variational inference, which become computationally expensive on massive datasets. Additionally, standard LDA has no built-in way to incorporate document metadata (such as author or publication date) and does not model non-linear interactions between topics.

Neural topic models based on autoencoders address these limitations. By framing topic modeling as a reconstruction task, they can be trained using stochastic gradient descent, allowing them to scale to millions of documents and integrate metadata easily.

Document Autoencoder Concept

In an autoencoder-based topic model, a document is represented as a high-dimensional Bag-of-Words (BoW) vector \\( \\mathbf{x} \\in \\mathbb{R}^V \\), where \\( V \\) is the vocabulary size. The encoder compresses this vector through one or more hidden layers to a bottleneck layer \\( \\mathbf{z} \\in \\mathbb{R}^K \\), where \\( K \\) is the number of topics (with \\( K \\ll V \\)). The activations of the bottleneck layer represent the document's topic distribution.

The decoder then projects this latent representation back to the vocabulary space, attempting to reconstruct the original BoW vector. By forcing the network to compress and reconstruct the document, the bottleneck layer is constrained to learn the most salient semantic patterns, which correspond to latent topics.
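As a minimal sketch of the input representation described above, the following builds a normalized BoW vector from raw text. The helper names `build_vocab` and `bow_vector`, the whitespace tokenizer, and the two toy documents are assumptions made for illustration; a real pipeline would typically add stop-word removal and vocabulary pruning.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc, vocab):
    """Return normalized word counts as a list of length V = len(vocab)."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        # Dividing by the document length turns counts into a distribution.
        vec[vocab[tok]] = count / total
    return vec


# Toy corpus (hypothetical data)
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocab(docs)
x = bow_vector(docs[0], vocab)
print(len(x), sum(x))
```

Each entry of `x` is the relative frequency of one vocabulary word in the document, so the vector sums to 1 and can be fed to the encoder directly.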

PyTorch Topic Autoencoder

A custom PyTorch autoencoder uses linear layers and activations to reconstruct bag-of-words vectors through a topic bottleneck.

Model Architecture

This PyTorch model defines a topic autoencoder that projects bag-of-words inputs into a latent topic space before reconstructing them:

<pre><code class="language-python">import torch
import torch.nn as nn


class TopicAutoencoder(nn.Module):
    def __init__(self, vocab_size, num_topics):
        super().__init__()
        # Encoder compresses BoW to topics
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, 128),
            nn.ReLU(),
            nn.Linear(128, num_topics),
            nn.Softmax(dim=-1),  # Forces topic distribution to sum to 1
        )
        # Decoder reconstructs BoW from topics
        # No bias in decoder helps map weights directly to word importances per topic
        self.decoder = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, x):
        # x shape: [batch_size, vocab_size] (normalized BoW counts)
        topic_dist = self.encoder(x)  # [batch_size, num_topics]
        reconstructed = self.decoder(topic_dist)  # [batch_size, vocab_size]
        return reconstructed, topic_dist


# Example execution
model = TopicAutoencoder(vocab_size=1000, num_topics=20)
x_bow = torch.rand(4, 1000)  # Batch of 4 documents
x_bow = x_bow / x_bow.sum(dim=-1, keepdim=True)  # Normalize to probability vector
recon, topics = model(x_bow)
print("Topics distribution shape:", topics.shape)  # [4, 20]
print("Reconstruction shape:", recon.shape)  # [4, 1000]</code></pre>

Reconstruction and Sparsity Loss

Training a topic autoencoder requires a reconstruction loss function. Since the input \\( \\mathbf{x} \\) is a normalized probability distribution over words, cross-entropy is typically used: \\( L_{recon} = -\\sum_{i=1}^V x_i \\log(\\hat{x}_i) \\), where \\( \\hat{\\mathbf{x}} \\) is the decoder output normalized with a softmax so that it is a valid distribution over the vocabulary. To keep the latent topics interpretable, we must also prevent the model from spreading weight across every topic for every document. We can enforce this by adding a sparsity regularization term to the loss function.

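Common regularization terms include an L1 penalty on the bottleneck activations, \\( L_{sparse} = \\lambda \\sum_{k=1}^K |z_k| \\), or a KL divergence term that constrains the average topic activation to match a sparse target distribution. The L1 penalty is only informative when the bottleneck is not normalized (for example, with ReLU or sigmoid activations): if \\( \\mathbf{z} \\) comes from a softmax, its entries are non-negative and sum to 1, so \\( \\sum_{k=1}^K |z_k| = 1 \\) for every document and the penalty is constant. In that case, a penalty on the entropy of \\( \\mathbf{z} \\) is one alternative. Either way, the goal is to represent each document using only a small set of active topics, yielding clearer, more distinct topic groupings.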
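The combined loss can be sketched as follows. This is an illustration under stated assumptions, not a definitive training objective: the function name `topic_loss` and the default `sparsity_weight` are hypothetical, and the sparsity term is an entropy penalty swapped in for L1, because the L1 norm of a softmax output always equals 1 and therefore provides no gradient signal. The random tensors stand in for real decoder scores and encoder outputs.

```python
import torch
import torch.nn.functional as F


def topic_loss(decoder_scores, x_bow, topic_dist, sparsity_weight=0.01):
    """Cross-entropy reconstruction loss plus an entropy-based sparsity penalty.

    decoder_scores: [batch, vocab] raw (unnormalized) decoder outputs
    x_bow:          [batch, vocab] normalized BoW targets (rows sum to 1)
    topic_dist:     [batch, topics] softmax bottleneck activations
    """
    # log x_hat, with the softmax applied in a numerically stable way
    log_x_hat = F.log_softmax(decoder_scores, dim=-1)
    recon = -(x_bow * log_x_hat).sum(dim=-1).mean()
    # Low entropy means the document's mass sits on few topics.
    entropy = -(topic_dist * torch.log(topic_dist + 1e-10)).sum(dim=-1).mean()
    return recon + sparsity_weight * entropy, recon, entropy


torch.manual_seed(0)
scores = torch.randn(4, 1000)                      # stand-in decoder outputs
x_bow = torch.rand(4, 1000)
x_bow = x_bow / x_bow.sum(dim=-1, keepdim=True)    # normalized BoW targets
topic_dist = torch.softmax(torch.randn(4, 20), dim=-1)

loss, recon, entropy = topic_loss(scores, x_bow, topic_dist)
l1 = topic_dist.abs().sum(dim=-1)  # equals 1 for every row of a softmax output
print(loss.item(), recon.item(), entropy.item(), l1)
```

With a non-normalized bottleneck (for example, ReLU activations), the entropy term could be replaced by `topic_dist.abs().sum(dim=-1).mean()` to obtain the L1 variant.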

Interpretability and Evaluation

Topic coherence metrics and dimensionality reduction techniques evaluate and visualize the quality of the learned topic spaces.

Topic Coherence and Diversity

To evaluate topic models, we extract the top \\( N \\) words for each topic by ranking that topic's weights in the decoder matrix. We then measure topic quality using coherence and diversity metrics. Topic coherence, commonly measured with Normalized Pointwise Mutual Information (NPMI), scores how strongly the top words of a topic are associated with one another. It compares how often these words co-occur in a reference corpus (such as Wikipedia) with how often they would co-occur by chance. High coherence indicates that the topic represents a clear, logical concept.
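A minimal sketch of NPMI coherence follows, assuming document-level co-occurrence (two words co-occur if they appear in the same reference document). The function name `npmi_coherence`, the toy reference corpus, and the handling of the degenerate case where both words appear in every document are assumptions made for illustration; published implementations often use sliding windows instead.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, reference_docs):
    """Average NPMI over all pairs of top words."""
    doc_sets = [set(doc.lower().split()) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p_ij == 1.0:
            scores.append(0.0)   # assumption: treat the 0/0 case as neutral
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)


# Toy reference corpus (hypothetical data)
reference = [
    "cat dog pet",
    "cat dog vet",
    "stock market bank",
    "bank loan market",
]
coherent = npmi_coherence(["cat", "dog"], reference)
incoherent = npmi_coherence(["cat", "bank"], reference)
print(coherent, incoherent)
```

NPMI ranges from -1 (the words never co-occur) to 1 (the words only ever appear together), so the score of a topic is comparable across corpora of different sizes.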

Topic diversity measures the uniqueness of the learned topics, calculated as the fraction of unique words among the top \\( N \\) words of all topics combined. A model with high coherence but low diversity repeats largely the same words across multiple topics. Balancing both metrics is essential for generating distinct and useful topic profiles.
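The top-word extraction and the diversity score can be sketched together. This assumes the decoder is an `nn.Linear(num_topics, vocab_size)` layer, whose weight tensor PyTorch stores with shape `[vocab_size, num_topics]`, so each column holds one topic's word weights. The helper names and the hand-built weight matrix are hypothetical.

```python
import torch


def top_word_ids(decoder_weight, n_top):
    """Return a [num_topics, n_top] tensor of the highest-weighted word ids.

    decoder_weight: [vocab_size, num_topics], one column per topic.
    """
    return decoder_weight.topk(n_top, dim=0).indices.T


def topic_diversity(top_ids):
    """Fraction of unique word ids among all topics' top words."""
    flat = top_ids.flatten().tolist()
    return len(set(flat)) / len(flat)


# Hand-built weights for 6 words and 2 topics (hypothetical data);
# word 4 is weighted highly by both topics.
weight = torch.tensor([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.9],
    [0.0, 0.8],
    [0.7, 0.7],
    [0.2, 0.2],
])
top_ids = top_word_ids(weight, n_top=3)
diversity = topic_diversity(top_ids)
print(top_ids, diversity)
```

A diversity of 1.0 means no word is shared between topics; here one of the six top-word slots is a repeat, which lowers the score.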

Semantic Space Projection

Once the autoencoder is trained, the encoder can project the entire document collection into the latent topic space. This low-dimensional space can be visualized using dimensionality reduction techniques like t-SNE or UMAP. By projecting the topic vectors into a 2D plane, we can visualize document clusters and relationships.

This visualization allows practitioners to verify that similar documents are grouped together and inspect the transitions between different topic clusters, confirming that the autoencoder has captured the underlying semantic structure of the corpus.
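The projection step might look like the sketch below, which uses scikit-learn's `TSNE`. The Dirichlet-sampled `topic_vectors` are a stand-in for real encoder outputs, and the perplexity value is an arbitrary choice for this small example; both are assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for encoder outputs: 60 documents, 20 topics, rows sum to 1.
topic_vectors = rng.dirichlet(alpha=np.full(20, 0.1), size=60)

# Perplexity must be smaller than the number of documents.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
coords = tsne.fit_transform(topic_vectors)

# The strongest topic per document is a convenient color label for a plot.
dominant_topic = topic_vectors.argmax(axis=1)
print(coords.shape, dominant_topic[:5])
```

Passing `coords[:, 0]` and `coords[:, 1]` to a scatter plot, colored by `dominant_topic`, gives a quick visual check of whether documents with the same dominant topic land near one another.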