Long Short-Term Memory (LSTM) Cell Architecture

The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997, was designed to mitigate the vanishing gradient problem that limits simple RNNs. By separating internal memory (the cell state) from the output (the hidden state), an LSTM can preserve context over far longer spans than a simple RNN, often hundreds of steps.

Cell State vs. Hidden State

The core innovation of the LSTM is the cell state, an internal register that acts as an additive conveyor belt running down the sequence.

The Cell State (C_t)

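The cell state C_t stores long-term information. At each step, a forget gate f_t scales the previous state and an input gate i_t scales a vector of new candidate values g_t, and the two results are added elementwise: C_t = f_t * C_{t-1} + i_t * g_t. The gates themselves are multiplicative; it is the update that is additive. As a result, the gradient flowing backward along the cell line is scaled only by the forget gate, so it decays slowly as long as f_t stays close to 1.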
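The effect of the additive update on gradients can be checked numerically. The following is a minimal sketch, assuming the standard update C_t = f_t * C_{t-1} + i_t * g_t; the gate values are hard-coded for illustration, whereas in a real LSTM they would come from learned sigmoid and tanh layers.

```python
import torch

torch.manual_seed(0)
hidden_size = 4

# Illustrative (hypothetical) gate activations for a single time step.
f_t = torch.full((hidden_size,), 0.9)        # forget gate, close to 1
i_t = torch.full((hidden_size,), 0.1)        # input gate
g_t = torch.tanh(torch.randn(hidden_size))   # candidate values

# Previous cell state; track gradients so we can inspect dC_t/dC_{t-1}.
c_prev = torch.randn(hidden_size, requires_grad=True)

# Additive cell update: scale the old state, then add the gated candidate.
c_t = f_t * c_prev + i_t * g_t

# Backpropagate a gradient of ones through the update.
c_t.sum().backward()

# The gradient reaching c_prev is exactly the forget gate.
print(torch.allclose(c_prev.grad, f_t))
```

With f_t near 1, repeating this step many times shrinks the gradient only gradually, which is the property the cell line is built around.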

The Hidden State (h_t)

The hidden state h_t is a filtered version of the cell state. It serves as the output at step t and as context for the next step. It is computed by passing the cell state through a tanh activation and multiplying the result elementwise by the output gate o_t: h_t = o_t * tanh(C_t).
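To make the relationship between the two states concrete, the sketch below recomputes one step by hand and compares it with PyTorch's nn.LSTMCell. It relies on the weight layout nn.LSTMCell uses, where the four gate projections are stacked in the order input, forget, candidate, output. The tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=10, hidden_size=20)

x_t = torch.randn(3, 10)      # one time step for a batch of 3
h_prev = torch.randn(3, 20)   # previous hidden state
c_prev = torch.randn(3, 20)   # previous cell state

# All four gate pre-activations in one matrix product: [batch, 4 * hidden].
gates = (x_t @ cell.weight_ih.T + cell.bias_ih
         + h_prev @ cell.weight_hh.T + cell.bias_hh)

# Split into input, forget, candidate, and output parts.
i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)
i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
g_t = torch.tanh(g_t)

# Cell state: additive update. Hidden state: gated, squashed view of it.
c_t = f_t * c_prev + i_t * g_t
h_t = o_t * torch.tanh(c_t)

# Reference result from the library implementation.
h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_t, h_ref, atol=1e-5))
print(torch.allclose(c_t, c_ref, atol=1e-5))
```

Note that h_t is bounded to (-1, 1) by the tanh and the sigmoid output gate, while c_t is not bounded in the same way.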

Using LSTMs in PyTorch

PyTorch's nn.LSTM handles batches of sequences and returns both the sequence outputs and the final hidden/cell states.

LSTM Forward Pass

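Unlike a simple RNN, which carries a single hidden state, the LSTM carries a tuple of states: (hidden_state, cell_state). Passing an initial tuple to nn.LSTM is optional; if it is omitted, both states default to zeros.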
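As a minimal sketch of how the state tuple is used, the example below passes an explicit zero state and then carries the returned state from one chunk of a sequence into the next. The sizes and the split point are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 5, 10)  # [batch_size, sequence_length, input_size]

# State tensors are [num_layers * num_directions, batch_size, hidden_size];
# batch_first=True does not change the layout of the states.
h_0 = torch.zeros(1, 3, 20)
c_0 = torch.zeros(1, 3, 20)

# An explicit zero state gives the same result as omitting the state.
out_explicit, _ = lstm(x, (h_0, c_0))
out_full, _ = lstm(x)
print(torch.allclose(out_explicit, out_full))

# Process the sequence in two chunks, feeding the state forward.
out_a, state = lstm(x[:, :2, :])
out_b, _ = lstm(x[:, 2:, :], state)
out_chunked = torch.cat([out_a, out_b], dim=1)
print(torch.allclose(out_chunked, out_full, atol=1e-5))
```

Carrying the state this way is one option for processing long sequences in pieces without resetting the memory between chunks.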

<pre><code class="language-python">import torch
import torch.nn as nn

# Input size=10, Hidden size=20
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

# Input tensor: [batch_size, sequence_length, input_size]
x = torch.randn(3, 5, 10)

# Pass input; returns output sequence and a tuple of final (h_n, c_n)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([3, 5, 20])
print(h_n.shape)     # torch.Size([1, 3, 20]) -> final hidden state
print(c_n.shape)     # torch.Size([1, 3, 20]) -> final cell state</code></pre>
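The two return values overlap: for a single-layer, unidirectional LSTM such as the one above, the last time step of the output sequence should match the final hidden state. A short check, assuming the same sizes as the example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 5, 10)

output, (h_n, c_n) = lstm(x)

# With batch_first=True, the time axis of `output` is dimension 1.
last_step = output[:, -1, :]   # [batch_size, hidden_size]
final_hidden = h_n[-1]         # last layer's final hidden state

print(torch.allclose(last_step, final_hidden))
```

The cell state has no counterpart in the output sequence; only its final value c_n is returned.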