Neural Sentiment Analysis with LSTMs

Long Short-Term Memory (LSTM) networks are widely used for sentiment analysis because of their ability to process sequential data and capture long-term dependencies. By passing word embeddings through recurrent gating structures, LSTMs learn context-aware text representations for sequence classification.

Sequential Modeling for Text Classification

LSTMs process sentences step-by-step, accumulating historical context to make a single sequence-level prediction.

Many-to-One Architecture

Sentiment analysis is formulated as a many-to-one sequence classification task, where an input sequence of variable length \\( T \\) is mapped to a single output label (e.g., positive or negative sentiment). An LSTM processes the sequence token by token. At each step \\( t \\), the network receives the embedding of the current token \\( x_t \\) together with the previous hidden state \\( h_{t-1} \\) and cell state \\( c_{t-1} \\), and produces an updated hidden state \\( h_t \\) and cell state \\( c_t \\). Repeating this step carries information forward through the entire sequence.

Once the final token at index \\( T \\) is processed, the final hidden state \\( h_T \\) acts as a compressed vector representation of the entire sentence. This vector is passed to a fully connected output layer that projects the features into class logits. Because the classification decision relies on the entire sequence context, the model can resolve dependencies like negation (e.g., "not good") that bag-of-words models fail to capture.

PyTorch LSTM Sentiment Classifier

This PyTorch code builds a complete sentiment classifier using an Embedding layer, an LSTM, and a classification head:

<pre><code class="language-python">import torch
import torch.nn as nn


class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # batch_first=True makes input/output shapes [batch_size, seq_len, features]
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text shape: [batch_size, seq_len]
        embedded = self.embedding(text)  # [batch_size, seq_len, embed_dim]

        # LSTM forward pass
        # out shape: [batch_size, seq_len, hidden_dim]
        # h_n shape: [1, batch_size, hidden_dim] (for single layer)
        out, (h_n, c_n) = self.lstm(embedded)

        # Extract the final hidden state: h_n is [num_layers, batch_size, hidden_dim]
        final_hidden = h_n[-1]  # [batch_size, hidden_dim]
        logits = self.fc(final_hidden)  # [batch_size, output_dim]
        return logits


# Example instantiation
model = LSTMSentimentClassifier(vocab_size=100, embed_dim=32, hidden_dim=64, output_dim=1)
x_batch = torch.randint(0, 100, (4, 10))  # Batch size of 4, sequence length of 10
out = model(x_batch)  # Shape: [4, 1] - [batch_size, output_dim]
print("Output shape:", out.shape)</code></pre>
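With output_dim=1 the classifier emits one raw logit per sentence. The sketch below shows one common way to turn such logits into probabilities, hard predictions, and a binary cross-entropy loss. The logits and labels are made-up values standing in for model output and annotations, and the 0.5 threshold is an assumption rather than something the architecture prescribes.

```python
import torch
import torch.nn as nn

# Hypothetical logits for a batch of 4 sentences, shape [batch_size, 1]
logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
# Hypothetical labels: 1.0 = positive, 0.0 = negative
labels = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

probs = torch.sigmoid(logits)   # probability of the positive class
preds = (probs > 0.5).long()    # assumed decision threshold of 0.5

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss = nn.BCEWithLogitsLoss()(logits, labels)

print("Predictions:", preds.squeeze(1).tolist())
print("Loss:", loss.item())
```

Feeding raw logits to the loss, rather than probabilities, is numerically more stable than applying a sigmoid followed by a separate binary cross-entropy.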

Overcoming Vanishing Gradients in Text

LSTMs use internal gating networks to regulate information flow, protecting gradients over long sequences.

LSTM Gating Mechanism

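Standard Recurrent Neural Networks (RNNs) suffer from vanishing or exploding gradients when trained on long sequences, often beyond a few dozen steps. This occurs because backpropagation through time multiplies the gradient by the recurrent weight matrix (and the activation derivative) once per time step, so its norm tends to shrink or grow exponentially with sequence length. LSTMs mitigate the vanishing case by introducing a constant error carousel, represented by the cell state \\( c_t \\) and regulated by three gates; exploding gradients are usually handled separately, for example with gradient clipping. In the equations that follow, \\( \\sigma \\) is the logistic sigmoid, \\( [h_{t-1}, x_t] \\) is the concatenation of the previous hidden state and the current input, and \\( \\odot \\) denotes element-wise multiplication. The forget gate \\( f_t \\) decides what information to discard: \\( f_t = \\sigma(\\mathbf{W}_f [h_{t-1}, x_t] + b_f) \\).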

The input gate \\( i_t \\) determines which new information to store: \\( i_t = \\sigma(\\mathbf{W}_i [h_{t-1}, x_t] + b_i) \\). It modulates the candidate values \\( \\tilde{c}_t = \\tanh(\\mathbf{W}_c [h_{t-1}, x_t] + b_c) \\). The cell state is updated via addition, rather than repeated matrix multiplication: \\( c_t = f_t \\odot c_{t-1} + i_t \\odot \\tilde{c}_t \\). Finally, the output gate \\( o_t \\) controls what to write to the hidden state: \\( o_t = \\sigma(\\mathbf{W}_o [h_{t-1}, x_t] + b_o) \\), where \\( h_t = o_t \\odot \\tanh(c_t) \\). Because the update is additive, the gradient along the cell-state path is scaled element-wise by \\( f_t \\) at each step instead of passing through a weight-matrix product, so it decays slowly as long as the forget gate stays close to 1.
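The gate equations above can be checked against PyTorch's own cell. The following is a minimal sketch, not a replacement for nn.LSTM: the helper manual_lstm_step is a hypothetical name, and the dimensions are arbitrary. Note that PyTorch stores separate input and recurrent weight matrices (weight_ih, weight_hh) with two bias vectors, which is equivalent to a single matrix applied to the concatenation \\( [h_{t-1}, x_t] \\) with one bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim, batch = 3, 4, 2

cell = nn.LSTMCell(input_dim, hidden_dim)
x_t = torch.randn(batch, input_dim)
h_prev = torch.randn(batch, hidden_dim)
c_prev = torch.randn(batch, hidden_dim)


def manual_lstm_step(cell, x_t, h_prev, c_prev):
    """One LSTM step written out gate by gate (illustrative helper)."""
    # Pre-activations for all four gates at once: [batch, 4 * hidden_dim]
    gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    # PyTorch orders the stacked blocks as: input, forget, candidate, output
    i_lin, f_lin, g_lin, o_lin = gates.chunk(4, dim=1)

    i_t = torch.sigmoid(i_lin)      # input gate
    f_t = torch.sigmoid(f_lin)      # forget gate
    c_tilde = torch.tanh(g_lin)     # candidate values
    o_t = torch.sigmoid(o_lin)      # output gate

    c_t = f_t * c_prev + i_t * c_tilde   # additive cell-state update
    h_t = o_t * torch.tanh(c_t)          # exposed hidden state
    return h_t, c_t


with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(cell, x_t, h_prev, c_prev)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

h_matches = torch.allclose(h_manual, h_ref, atol=1e-5)
c_matches = torch.allclose(c_manual, c_ref, atol=1e-5)
print("Hidden state matches:", h_matches)
print("Cell state matches:", c_matches)
```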

Hidden State Extraction

While using the final hidden state \\( h_T \\) is the standard method for many-to-one tasks, it can create a bottleneck if the sentence is very long, as the network is forced to compress the entire text into a single vector. An alternative design is global pooling over time. In this setup, we collect all hidden states \\( h_1, h_2, \\dots, h_T \\) and apply a max-pooling or average-pooling operation across the sequence dimension.

Max-pooling extracts the most salient feature activations triggered by specific words anywhere in the sentence (e.g., highly negative words like "terrible"), while average-pooling captures the overall tone. Concatenating the final hidden state with the global max-pooled and average-pooled vectors gives the downstream classifier head a richer representation, at the cost of an input three times the hidden size.
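A minimal sketch of this pooled design follows. The class name PooledLSTMClassifier and all sizes are illustrative assumptions, and the example batch contains no pad tokens; with padded batches, the pad positions would need to be masked out before pooling so they do not influence the max or the mean.

```python
import torch
import torch.nn as nn


class PooledLSTMClassifier(nn.Module):
    """Illustrative classifier: final state + max pooling + average pooling."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Three feature vectors of size hidden_dim are concatenated
        self.fc = nn.Linear(3 * hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)           # [batch, seq_len, embed_dim]
        out, (h_n, _) = self.lstm(embedded)       # out: [batch, seq_len, hidden_dim]

        final_hidden = h_n[-1]                    # [batch, hidden_dim]
        max_pooled = out.max(dim=1).values        # strongest activation per feature
        avg_pooled = out.mean(dim=1)              # average activation per feature

        features = torch.cat([final_hidden, max_pooled, avg_pooled], dim=1)
        return self.fc(features)                  # [batch, output_dim]


model = PooledLSTMClassifier(vocab_size=100, embed_dim=32, hidden_dim=64, output_dim=1)
x_batch = torch.randint(1, 100, (4, 10))  # ids start at 1, so no pad tokens appear
logits = model(x_batch)
print("Output shape:", logits.shape)
```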

Sequence Padding and Pack/Pad Utilities

Variable-length sentences within a batch must be padded to match, and PyTorch provides utilities to avoid processing redundant padding tokens.

Dynamic Batching

To utilize parallel hardware, we group sentences into batches. Because sentences have different lengths, we must pad shorter sentences with dummy tokens (typically <PAD> at index 0) to form a uniform rectangular tensor of shape \\( (B, T_{max}) \\). However, passing this padded tensor directly through an LSTM is both wasteful and inaccurate for the shorter sequences. The recurrent cell keeps running over the pad positions, and even though the pad embeddings are zero, the recurrent weights and biases still change the state at every step. The result is wasted computation and a final hidden state that no longer corresponds to the last real token.

To prevent this, the hidden state used for classification must depend only on actual tokens. We can either apply a sequence mask (for example, selecting each sequence's hidden state at its last real token) or use PyTorch's specialized packing utilities, which run the LSTM only over the valid time steps of each sequence and skip pad tokens entirely.
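The masking option can be sketched as follows, assuming a unidirectional, single-layer LSTM. Random tensors stand in for embedded sentences, and the sizes are arbitrary. The example pads three sequences, runs the LSTM over the padded batch, then uses the lengths to pick each sequence's output at its last real token and compares it with an unpadded reference run.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)
embed_dim, hidden_dim = 4, 6
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Three already-embedded sequences of lengths 5, 3 and 2 (random stand-ins)
seqs = [torch.randn(n, embed_dim) for n in (5, 3, 2)]
lengths = torch.tensor([5, 3, 2])

# Pad to a rectangular tensor of shape [3, 5, embed_dim]
padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)

with torch.no_grad():
    out, (h_n, _) = lstm(padded)  # out: [3, 5, hidden_dim]

    # Index of each sequence's last real token, shaped for gather: [3, 1, hidden_dim]
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, hidden_dim)
    last_valid = out.gather(1, idx).squeeze(1)  # [3, hidden_dim]

    # Reference: run every sequence on its own, without any padding
    reference = torch.stack([lstm(s.unsqueeze(0))[1][0][-1, 0] for s in seqs])

masked_ok = torch.allclose(last_valid, reference, atol=1e-5)
naive_ok = torch.allclose(h_n[-1], reference, atol=1e-5)
print("Masked selection matches unpadded run:", masked_ok)
print("Naive h_n matches unpadded run:", naive_ok)
```

The masked selection fixes correctness but still spends compute on the pad positions; packing addresses both.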

PyTorch Pack Padded Sequence

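PyTorch provides pack_padded_sequence and pad_packed_sequence in torch.nn.utils.rnn. Packing takes a batch sorted by length (longest first), concatenates its valid tokens time step by time step, and records how many sequences are still active at each step, allowing the LSTM to process only valid tokens. By default the batch must already be sorted in descending order of length; passing enforce_sorted=False lets PyTorch sort internally instead. Here is how to incorporate this into a PyTorch model: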

<pre><code class="language-python">import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class PackedLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x, lengths):
        # x shape: [batch_size, max_seq_len]
        # lengths: Tensor of actual lengths, e.g. [10, 7, 5, 2]
        # (must be sorted in descending order because enforce_sorted=True)
        embedded = self.embedding(x)  # [batch_size, max_seq_len, embed_dim]

        # Pack the sequences
        packed_embedded = pack_padded_sequence(
            embedded,
            lengths.cpu(),  # a lengths tensor must live on the CPU
            batch_first=True,
            enforce_sorted=True,
        )

        # Process with LSTM
        packed_out, (h_n, c_n) = self.lstm(packed_embedded)

        # Unpack back to padded form if needed, or use h_n directly
        # output, output_lengths = pad_packed_sequence(packed_out, batch_first=True)

        # h_n contains the hidden state at the last real token of each sequence
        final_hidden = h_n[-1]  # [batch_size, hidden_dim]
        return self.fc(final_hidden)


# Example run
model = PackedLSTMClassifier(vocab_size=10, embed_dim=8, hidden_dim=16)

# Sequences padded to length 5. Batch sorted by actual sequence length descending
x = torch.tensor([
    [1, 2, 3, 4, 5],  # len 5
    [2, 3, 4, 0, 0],  # len 3
    [1, 5, 0, 0, 0],  # len 2
])
lengths = torch.tensor([5, 3, 2])

out = model(x, lengths)
print("Packed output shape:", out.shape)</code></pre>
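To see what packing actually stores, it can help to pack a small batch of token ids directly and inspect the result. The sketch below uses a deliberately unsorted batch with enforce_sorted=False; the token values are arbitrary. The data field holds only the valid tokens, grouped time step by time step after an internal sort by length, and batch_sizes records how many sequences are still active at each step.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Token ids padded with 0; rows are deliberately NOT sorted by length
x = torch.tensor([
    [2, 3, 4, 0, 0],  # len 3
    [1, 2, 3, 4, 5],  # len 5
    [1, 5, 0, 0, 0],  # len 2
])
lengths = torch.tensor([3, 5, 2])

# enforce_sorted=False lets PyTorch sort by length internally
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

packed_tokens = packed.data.tolist()
active_per_step = packed.batch_sizes.tolist()
print("Packed tokens:", packed_tokens)        # [1, 2, 1, 2, 3, 5, 3, 4, 4, 5]
print("Active sequences:", active_per_step)   # [3, 3, 2, 1, 1]

# Unpacking restores the padded layout in the original row order
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print("Round trip equal:", torch.equal(unpacked, x))
print("Recovered lengths:", out_lengths.tolist())
```

The packed tensor holds 10 tokens instead of the 15 positions in the padded batch, which is the computation saved by skipping pad positions.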