Sequence-to-Sequence (Seq2Seq) Architecture

Sequence-to-Sequence (Seq2Seq) architectures map an input sequence to an output sequence whose length may differ from the input's. This encoder-decoder design is the foundational framework for machine translation, summarization, and speech-to-text models.

Encoder and Decoder Roles

A Seq2Seq model consists of an Encoder network that compresses the input sequence and a Decoder network that generates the target sequence.

The Encoder

The Encoder (typically an LSTM or GRU in classic Seq2Seq models) processes the input sequence one token at a time. Its final state (the hidden and cell states for an LSTM, or the hidden state alone for a GRU) serves as a dense summary called the context vector, which encodes the meaning of the input.
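To make the shape of the context vector concrete, here is a minimal sketch that runs a batch of token ids through an LSTM and a GRU encoder. The sizes (`vocab_size`, `emb_dim`, `hidden_dim`, `batch_size`, `src_len`) are illustrative assumptions, and the inputs are random ids rather than real text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the text above)
vocab_size, emb_dim, hidden_dim = 100, 16, 32
batch_size, src_len = 4, 7

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (batch_size, src_len))  # random token ids
embedded = embedding(src)                                   # [batch, src_len, emb_dim]

# The LSTM returns per-step outputs plus a (hidden, cell) tuple for the last step
lstm_outputs, (h_n, c_n) = lstm(embedded)

# The GRU has no cell state, so its context is the final hidden state alone
gru_outputs, gru_h_n = gru(embedded)

print(lstm_outputs.shape)    # per-step states: [batch, src_len, hidden_dim]
print(h_n.shape, c_n.shape)  # context: [num_layers, batch, hidden_dim] each
print(gru_h_n.shape)         # context: [num_layers, batch, hidden_dim]
```

For a single-layer, unidirectional LSTM, the last time step of `lstm_outputs` matches `h_n`, which is why the final state can stand in for the whole sequence.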

The Decoder

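The Decoder takes the context vector as its initial hidden state. Starting from a start-of-sequence token, it generates the target sequence autoregressively: each predicted token is fed back as the input for the next step. Generation stops when the Decoder predicts a special End-Of-Sequence (<EOS>) token or, in practice, when a maximum length is reached.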
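The generation loop described above can be sketched as greedy decoding, one simple strategy that picks the most likely token at each step. This is a sketch under stated assumptions: the special-token ids `SOS_ID` and `EOS_ID`, the cap `MAX_LEN`, and the helper `greedy_decode` are hypothetical names, the decoder is untrained, and a random tensor stands in for a real encoder state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes and special-token ids, chosen only for illustration
vocab_size, emb_dim, hidden_dim = 50, 16, 32
SOS_ID, EOS_ID = 1, 2
MAX_LEN = 20  # safety cap in case <EOS> is never predicted

dec_emb = nn.Embedding(vocab_size, emb_dim)
decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
fc_out = nn.Linear(hidden_dim, vocab_size)


def greedy_decode(h, c, max_len=MAX_LEN):
    """Generate token ids for one sequence, starting from encoder state (h, c)."""
    token = torch.tensor([[SOS_ID]])  # [batch=1, len=1]
    generated = []
    with torch.no_grad():
        for _ in range(max_len):
            # One decoding step: the state (h, c) is carried to the next step
            out, (h, c) = decoder(dec_emb(token), (h, c))
            logits = fc_out(out[:, -1, :])           # [1, vocab_size]
            next_id = logits.argmax(dim=-1).item()   # most likely next token
            if next_id == EOS_ID:
                break
            generated.append(next_id)
            token = torch.tensor([[next_id]])        # feed the prediction back in
    return generated


# Stand-in for a real encoder's final state: [num_layers, batch, hidden_dim]
h0 = torch.randn(1, 1, hidden_dim)
c0 = torch.randn(1, 1, hidden_dim)
tokens = greedy_decode(h0, c0)
print(len(tokens), tokens)
```

Because the weights are random, the output tokens are meaningless. The point is the control flow: feed back the last prediction, carry the state forward, and stop on `<EOS>` or at the length cap.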

The Information Bottleneck Challenge

While Seq2Seq models are powerful, compressing an entire sequence into a single fixed-size vector leads to information loss on long inputs.

Context Compression Limit

For example, when a 50-word sentence is squeezed into a single 512-dimensional context vector, the network tends to lose details, particularly from the beginning of the sentence, since early tokens must survive the most recurrent updates. This bottleneck motivated attention mechanisms, which let the decoder consult all of the encoder's intermediate hidden states at every decoding step instead of relying on the final state alone.
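As a rough illustration of how a decoder can look back at the encoder states, the sketch below computes a per-step context vector with plain dot-product scoring. That is one of several possible scoring functions and is used here only because it is the simplest. The tensors are random stand-ins, and the names `encoder_states` and `decoder_state` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, src_len, hidden_dim = 2, 50, 512  # sizes echo the example above

# Assumed inputs: all encoder hidden states and one decoder hidden state
encoder_states = torch.randn(batch_size, src_len, hidden_dim)
decoder_state = torch.randn(batch_size, hidden_dim)

# Score each source position by its dot product with the decoder state
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # [batch, src_len]

# Normalize the scores into an attention distribution over source positions
weights = F.softmax(scores, dim=1)

# Weighted sum of encoder states: a fresh context vector for this decoding step
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # [batch, hidden_dim]

print(weights.shape, context.shape)
```

Unlike the single fixed context vector, `context` is recomputed at every decoding step, so information from early source positions stays directly reachable.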

Seq2Seq State Transfer in PyTorch

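The code below shows how the encoder's final hidden and cell states initialize the decoder's state. The forward pass takes the target sequence as the decoder's input, which corresponds to training with teacher forcing.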

<pre><code class="language-python">import torch
import torch.nn as nn

# Vocabulary dimensions
input_vocab, target_vocab = 5000, 6000
emb_dim, hidden_dim = 128, 256


class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_emb = nn.Embedding(input_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

        self.dec_emb = nn.Embedding(target_vocab, emb_dim)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, target_vocab)

    def forward(self, src, trg):
        # src shape: [batch, src_len], trg shape: [batch, trg_len]
        _, (h, c) = self.encoder(self.enc_emb(src))  # Encoder final state

        # Initialize decoder with encoder state
        dec_input = self.dec_emb(trg)
        dec_out, _ = self.decoder(dec_input, (h, c))
        return self.fc_out(dec_out)  # Logits over target vocab</code></pre>
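A forward pass like this is typically trained with teacher forcing: the decoder reads the target shifted right and is scored against the target shifted left. The following is a minimal sketch of one such loss computation, assuming small illustrative sizes, random token ids, and a hypothetical padding id `PAD_ID`. The layers are rebuilt inline so the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative sizes; PAD_ID is a hypothetical padding id
src_vocab, trg_vocab, emb_dim, hidden_dim = 50, 60, 16, 32
PAD_ID = 0

enc_emb = nn.Embedding(src_vocab, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
dec_emb = nn.Embedding(trg_vocab, emb_dim)
decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
fc_out = nn.Linear(hidden_dim, trg_vocab)

src = torch.randint(1, src_vocab, (4, 9))  # [batch, src_len]
trg = torch.randint(1, trg_vocab, (4, 6))  # [batch, trg_len]

# Teacher forcing: feed trg[:, :-1] to the decoder, predict trg[:, 1:]
dec_in, dec_target = trg[:, :-1], trg[:, 1:]

_, (h, c) = encoder(enc_emb(src))              # encoder final state
dec_out, _ = decoder(dec_emb(dec_in), (h, c))  # decoder starts from that state
logits = fc_out(dec_out)                       # [batch, trg_len - 1, trg_vocab]

# CrossEntropyLoss expects [N, classes] logits and [N] class indices
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
loss = criterion(logits.reshape(-1, trg_vocab), dec_target.reshape(-1))
loss.backward()  # gradients reach the encoder through (h, c)

print(logits.shape, loss.item())
```

Since the only link between the two networks is the transferred state, the gradient on the encoder's parameters flows entirely through `(h, c)`.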