Machine Translation with RNNs + Attention

Neural Machine Translation (NMT) uses encoder-decoder networks with attention to map sequences from a source language to a target language. Training uses teacher forcing, while inference relies on search algorithms such as beam search to generate coherent translations.

End-to-End Seq2Seq Architecture

An NMT system maps inputs to outputs using an attention-guided encoder-decoder network, trained with teacher forcing.

Encoder-Decoder Translation Loop

In NMT, the encoder processes the source sentence (e.g., in French) and produces one hidden vector per source token. The decoder is initialized with the encoder's final state and generates the target sentence (e.g., in English) token by token, starting from a special <SOS> token. At each step, the decoder uses attention to retrieve a context vector from the encoder states, combines it with its own state, and predicts a probability distribution over the target vocabulary.
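
A single attention step can be sketched in plain Python. This is a minimal illustration, not the scorer used in the model later in this article: it assumes dot-product scoring for brevity, and the names `softmax` and `attention_context` as well as the example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(dec_state, enc_states):
    """Return (weights, context) for one decoder step.

    Assumption: scores are plain dot products between the decoder
    state and each encoder state.
    """
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states
    context = [sum(w * h[i] for w, h in zip(weights, enc_states)) for i in range(dim)]
    return weights, context


# Three encoder states (one per source token) and one decoder state
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [2.0, 0.0]
weights, context = attention_context(dec_state, enc_states)
print("attention weights:", weights)
print("context vector:", context)
```

The first and third encoder states receive equal, larger weights because they align with the decoder state, so the context vector is dominated by them.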

This loop continues until the decoder generates the end-of-sequence <EOS> token. During training, we use a technique called teacher forcing: instead of feeding the decoder's own predicted token from the previous step as input to the next step, we feed the actual ground-truth token. This prevents errors from compounding early in training, accelerating convergence.
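
In practice, teacher forcing amounts to shifting the reference sentence by one position. The sketch below shows one common way to build the decoder inputs and prediction targets; the helper name `teacher_forcing_pairs` and the example sentence are hypothetical.

```python
SOS, EOS = "<SOS>", "<EOS>"


def teacher_forcing_pairs(target_tokens):
    """Build (decoder_inputs, decoder_targets) for one reference sentence.

    The input at step t is the ground-truth token from step t-1
    (or <SOS> at step 0); the target at step t is the ground-truth
    token at step t (or <EOS> after the last word).
    """
    decoder_inputs = [SOS] + target_tokens
    decoder_targets = target_tokens + [EOS]
    return decoder_inputs, decoder_targets


reference = ["the", "cat", "sleeps"]
inputs, targets = teacher_forcing_pairs(reference)
for step, (x, y) in enumerate(zip(inputs, targets)):
    print(f"step {step}: feed {x!r:10} -> predict {y!r}")
```

The decoder's own predictions never appear in `inputs`, which is exactly what distinguishes training with teacher forcing from inference.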

PyTorch Seq2Seq Model with Attention

This PyTorch code implements an encoder-decoder translation architecture featuring a custom attention layer:

<pre><code class="language-python">import torch
import torch.nn as nn


class Seq2SeqAttention(nn.Module):
    def __init__(self, src_vocab, trg_vocab, embed_dim, hidden_dim):
        super().__init__()
        self.encoder_embed = nn.Embedding(src_vocab, embed_dim)
        self.decoder_embed = nn.Embedding(trg_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # The decoder consumes the token embedding concatenated with the context vector
        self.decoder = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        # Linear layer for attention alignment scoring
        self.attn_align = nn.Linear(hidden_dim * 2, 1)
        self.out_projection = nn.Linear(hidden_dim, trg_vocab)

    def forward(self, src, trg):
        # src shape: [batch_size, src_len]
        # trg shape: [batch_size, trg_len]
        batch_size, trg_len = trg.shape

        # Encode source
        # enc_states: [batch_size, src_len, hidden_dim], enc_hidden: [1, batch_size, hidden_dim]
        enc_states, enc_hidden = self.encoder(self.encoder_embed(src))

        # Initial decoder hidden state
        dec_hidden = enc_hidden
        outputs = []

        for t in range(trg_len):
            # Teacher forcing: the current decoder input is the ground-truth token at position t
            dec_input = trg[:, t].unsqueeze(1)            # [batch_size, 1]
            dec_embedded = self.decoder_embed(dec_input)  # [batch_size, 1, embed_dim]

            # Compute attention alignment scores
            # Take the last-layer hidden state [batch_size, hidden_dim], add a time axis,
            # and repeat it along src_len
            seq_len = enc_states.size(1)
            dh_repeated = dec_hidden[-1].unsqueeze(1).repeat(1, seq_len, 1)

            # Score shape: [batch_size, src_len, 1] (tanh bounds each score to [-1, 1])
            score = torch.tanh(self.attn_align(torch.cat((dh_repeated, enc_states), dim=-1)))
            attn_weights = torch.softmax(score, dim=1)  # [batch_size, src_len, 1]

            # Compute context vector
            context = torch.sum(attn_weights * enc_states, dim=1).unsqueeze(1)  # [batch_size, 1, hidden_dim]

            # Combine context and decoder input
            dec_concat = torch.cat((dec_embedded, context), dim=-1)  # [batch_size, 1, embed_dim + hidden_dim]

            # Decoder step
            dec_out, dec_hidden = self.decoder(dec_concat, dec_hidden)

            # Project to vocabulary logits
            logits = self.out_projection(dec_out.squeeze(1))  # [batch_size, trg_vocab]
            outputs.append(logits)

        return torch.stack(outputs, dim=1)  # [batch_size, trg_len, trg_vocab]


# Example instantiation
model = Seq2SeqAttention(src_vocab=30, trg_vocab=50, embed_dim=16, hidden_dim=32)
src = torch.randint(1, 30, (2, 8))  # Batch size of 2, src seq length of 8
trg = torch.randint(1, 50, (2, 6))  # trg seq length of 6
logits = model(src, trg)
print("Output logits shape:", logits.shape)  # [2, 6, 50]</code></pre>

Search Strategies: Greedy vs. Beam Search

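Generating translations at inference requires a decoding strategy. Exhaustive search over all possible output sequences is intractable, so strategies such as greedy decoding and beam search approximate the most probable sequence.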

Greedy Decoding and Exposure Bias

During inference, the model does not have access to ground-truth target tokens. In greedy decoding, the model generates the sequence by choosing the single most probable token at each step: \\( \\hat{y}_t = \\arg\\max_{y_t} P(y_t \\mid \\hat{y}_{<t}, X) \\). While computationally fast, this approach is not guaranteed to find the most probable full sequence, because the locally best token can lead into a low-probability continuation. If the model makes a mistake at step \\( t \\), the error is fed as input to step \\( t+1 \\), causing subsequent predictions to diverge from the correct translation.

This discrepancy between training (where the model sees ground-truth inputs) and inference (where it sees its own potentially erroneous inputs) is called exposure bias. Greedy decoding often produces repetitive, ungrammatical, or truncated sentences because it cannot back out of early mistakes.
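
The way greedy decoding commits to a locally best token can be seen on a toy example. The sketch below assumes a hypothetical lookup table `TOY_MODEL` in place of a trained decoder; the probabilities are invented purely for illustration.

```python
EOS = "<EOS>"

# Hypothetical toy "model": maps a prefix to next-token probabilities.
# Any prefix not listed is assumed to end the sentence with probability 1.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_token_probs(prefix):
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})


def greedy_decode(max_len=10):
    """Pick the single most probable token at every step."""
    prefix, prob = [], 1.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        prob *= probs[token]
        prefix.append(token)
        if token == EOS:
            break
    return prefix, prob


def sequence_prob(tokens):
    """Probability the toy model assigns to a complete sequence."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= next_token_probs(tokens[:i]).get(tok, 0.0)
    return prob


tokens, prob = greedy_decode()
print("greedy:", tokens, round(prob, 4))
print("alternative:", ["b", "x", EOS], round(sequence_prob(["b", "x", EOS]), 4))
```

Greedy decoding picks "a" first because it is locally more probable, yet the sequence starting with "b" has the higher overall probability, and the greedy decoder has no way to revisit that first choice.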

Beam Search Decoding

Beam search does not remove exposure bias, which is a mismatch between training and inference, but it mitigates the main weakness of greedy decoding: the inability to recover from an early mistake. It does so by maintaining \\( B \\) candidate sequences (the beam) at each step, where \\( B \\) is called the beam width. At step 1, the model computes the probabilities for the first token and keeps the top \\( B \\) candidates. At each subsequent step, the model computes the probabilities of all possible next tokens for each of the \\( B \\) candidates, resulting in \\( B \\times V \\) paths, where \\( V \\) is the target vocabulary size. It then ranks these paths by their cumulative log-probability and retains only the top \\( B \\) paths.
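
The procedure can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: a hypothetical lookup table `TOY_MODEL` stands in for the decoder, finished hypotheses are moved out of the beam as soon as they emit the end token, and no length normalization is applied.

```python
import math

EOS = "<EOS>"

# Hypothetical toy "model": maps a prefix to next-token probabilities.
# Any prefix not listed is assumed to end the sentence with probability 1.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_token_log_probs(prefix):
    probs = TOY_MODEL.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(beam_width, max_len=10):
    """Return (tokens, cumulative log-prob) of the best hypothesis found."""
    beam = [([], 0.0)]  # each entry: (tokens so far, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis in the beam by every possible next token
        candidates = []
        for tokens, score in beam:
            for tok, lp in next_token_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the top `beam_width` paths by cumulative log-probability
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beam.append((tokens, score))
        if not beam:
            break
    return max(finished or beam, key=lambda c: c[1])


for width in (1, 2):
    tokens, score = beam_search(width)
    print(f"B={width}: {tokens} p={math.exp(score):.2f}")
```

With a beam width of 1 the search reduces to greedy decoding and commits to "a"; with a width of 2 the hypothesis starting with "b" survives the first step and ends up with the higher total probability.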

Because every additional token adds a non-positive log-probability, the cumulative score can never increase with length, so the raw score favors shorter sentences. To counter this, we apply length normalization: \\( \\text{Score}(Y) = \\frac{\\sum_{t=1}^{L} \\log P(y_t \\mid y_{<t}, X)}{L^\\alpha} \\), where \\( L \\) is the sequence length and \\( \\alpha \\) is a normalization parameter (typically 0.6 to 0.7). Once every hypothesis in the beam has generated the <EOS> token, the path with the highest normalized score is selected.
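
A small numeric example shows the effect of the normalization. The helper name `normalized_score` and the two hypotheses are hypothetical; the per-token probabilities are invented for illustration.

```python
import math


def normalized_score(log_probs, alpha=0.7):
    """Sum of token log-probabilities divided by length ** alpha."""
    return sum(log_probs) / (len(log_probs) ** alpha)


# A short hypothesis with less confident tokens and a longer one
# with more confident tokens
shorter = [math.log(0.5)] * 3
longer = [math.log(0.6)] * 5

print("raw scores:       ", round(sum(shorter), 3), round(sum(longer), 3))
print("normalized scores:", round(normalized_score(shorter), 3),
      round(normalized_score(longer), 3))
```

The raw sum ranks the shorter hypothesis higher simply because it has fewer terms, while the normalized score ranks the longer one higher. Setting `alpha=0` recovers the raw sum, and `alpha=1` gives the average log-probability per token.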

Evaluation Metrics

NMT systems are evaluated using metrics like BLEU, METEOR, and COMET, which measure how closely the output matches human reference translations.

BLEU Score Mechanics

Bilingual Evaluation Understudy (BLEU) is the standard metric for evaluating machine translation. It measures the precision of n-grams in the generated translation against one or more human reference translations. The score is computed as the geometric mean of modified n-gram precisions \\( p_n \\) (usually from 1-gram to 4-gram), multiplied by a brevity penalty (BP) to penalize translations that are too short:

\\( \\text{BLEU} = \\text{BP} \\cdot \\exp\\left(\\sum_{n=1}^N w_n \\log p_n\\right) \\)

Here \\( w_n \\) are the n-gram weights, typically uniform with \\( w_n = 1/N \\). The brevity penalty is defined as: \\( \\text{BP} = \\begin{cases} 1 & \\text{if } c > r \\\\ e^{(1 - r/c)} & \\text{if } c \\le r \\end{cases} \\), where \\( c \\) is the candidate translation length and \\( r \\) is the reference translation length. While BLEU is useful for measuring translation accuracy, it struggles with synonyms and paraphrases, as it relies on exact n-gram matching.
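
The formula can be traced with a minimal sentence-level sketch. It assumes a single reference, uniform weights, and no smoothing, so it returns 0 whenever any n-gram precision is 0; standard toolkits compute BLEU at corpus level and differ in these details. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(c, r):
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: the geometric mean collapses to zero
    log_mean = sum(math.log(p) for p in precisions) / max_n  # uniform weights 1/N
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print("BLEU-2:", round(bleu(candidate, reference, max_n=2), 4))
print("BLEU-4:", round(bleu(candidate, reference, max_n=4), 4))
```

The single substituted word leaves the unigram precision at 5/6 and the bigram precision at 3/5, but no 4-gram matches, so the unsmoothed BLEU-4 is 0. This sensitivity to one word change is a small instance of the exact-matching limitation described above.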

Alternative Metrics

Because BLEU has limitations, alternative metrics like METEOR and COMET are often used. METEOR matches words based on stem similarity, synonymy (using databases like WordNet), and paraphrasing, providing a more flexible evaluation that correlates better with human judgment. In modern pipelines, reference-based neural evaluation metrics like COMET use pre-trained cross-lingual models to project source, hypothesis, and reference sentences into a shared semantic space.

These neural metrics calculate similarity based on semantic meaning rather than exact word overlap. This allows them to evaluate translation quality accurately even when the model uses different vocabulary or syntax than the reference translation.