Teacher Forcing in Sequence Training

Training a recurrent decoder on its own predictions can be slow and unstable, because an early mistake propagates to every later step. Teacher Forcing mitigates this by feeding the ground-truth token, rather than the model's own prediction, as the input at each training step.


The Principle of Teacher Forcing

During inference, the decoder takes its own predicted token from step t-1 as the input for step t. Teacher Forcing replaces this prediction with the actual ground-truth token during training.
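In practice this usually means the decoder inputs are the target sequence shifted right by one position. The sketch below illustrates that alignment; the function name and the `<sos>` and `<eos>` marker strings are assumptions made for the example, not part of any particular library.

```python
def make_decoder_inputs(target_tokens, sos_token="<sos>"):
    """Return teacher-forced decoder inputs for a target sequence.

    The input at step t is the ground-truth token from step t-1, and the
    first input is a start-of-sequence marker (hypothetical name "<sos>").
    """
    return [sos_token] + list(target_tokens[:-1])


target = ["hello", "world", "<eos>"]
inputs = make_decoder_inputs(target)

# Each pair is (decoder input at step t, token the model should predict at step t).
pairs = list(zip(inputs, target))
```

Here `pairs` lines up `<sos>` with 'hello', 'hello' with 'world', and 'world' with `<eos>`, so the model never sees its own output as an input during training.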

Preventing Error Propagation

Suppose the target output is 'hello world' and the model incorrectly predicts the first word as 'goodbye'. A model trained without teacher forcing would feed 'goodbye' to the next step, so every subsequent prediction is conditioned on a wrong prefix and tends to degrade. Feeding the ground-truth 'hello' instead keeps each step conditioned on the correct history, which keeps training on track.
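The effect can be seen in a toy rollout. The "model" below is only a hypothetical lookup table from the previous token to the next one, with a deliberate mistake on the first step; it stands in for a real decoder purely for illustration.

```python
# Hypothetical toy "model": maps the previous token to the predicted next token.
TOY_MODEL = {
    "<sos>": "goodbye",   # deliberate mistake: the target starts with "hello"
    "hello": "world",
    "world": "<eos>",
    "goodbye": "moon",
    "moon": "landing",
}
TARGET = ["hello", "world", "<eos>"]


def rollout(target, teacher_forcing):
    """Predict one token per target position and return the predictions."""
    prev = "<sos>"
    predictions = []
    for gold in target:
        pred = TOY_MODEL[prev]
        predictions.append(pred)
        # Teacher forcing feeds the ground truth; otherwise feed the prediction.
        prev = gold if teacher_forcing else pred
    return predictions


free_running = rollout(TARGET, teacher_forcing=False)  # ['goodbye', 'moon', 'landing']
forced = rollout(TARGET, teacher_forcing=True)         # ['goodbye', 'world', '<eos>']
```

Without teacher forcing, the single first-step mistake makes all three predictions wrong. With teacher forcing, only the first prediction is wrong and the remaining steps still receive a useful training signal.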

Training Acceleration

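Because every decoder input is known before the forward pass, no step has to wait for the previous step's prediction to be decoded. A recurrent decoder must still carry its hidden state from one step to the next, but the whole ground-truth sequence can be passed to it in a single batched call. Architectures without recurrence, such as decoders built on masked self-attention, can compute all steps in parallel. Training on correct prefixes also tends to give a more stable learning signal, which speeds up convergence.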
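As a rough illustration, the sketch below computes a teacher-forced loss for a toy model whose next-token probabilities depend only on the previous token. The probability table and function name are hypothetical. Because this toy model has no hidden state, its per-step losses are independent and could be evaluated in any order; a recurrent decoder would still need to propagate its hidden state sequentially.

```python
import math

# Hypothetical toy conditional distributions P(next token | previous token).
TOY_PROBS = {
    "<sos>": {"hello": 0.5, "goodbye": 0.5},
    "hello": {"world": 0.8, "<eos>": 0.2},
    "world": {"<eos>": 1.0},
}


def teacher_forced_nll(target, sos_token="<sos>"):
    """Average negative log-likelihood of the target under teacher forcing."""
    # All inputs are known up front: the target shifted right by one position.
    inputs = [sos_token] + list(target[:-1])
    # Each step's loss depends only on ground-truth tokens, never on a prediction.
    step_losses = [-math.log(TOY_PROBS[x][y]) for x, y in zip(inputs, target)]
    return sum(step_losses) / len(step_losses)


loss = teacher_forced_nll(["hello", "world", "<eos>"])
```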

Exposure Bias and Scheduled Sampling

While Teacher Forcing speeds up training, it introduces a mismatch between training and test conditions, known as exposure bias.

The Exposure Bias Problem

At test time, the model conditions on its own predictions, including its errors, which it never saw as inputs during training. A single mistake early in a sequence can therefore push the model into states it was never trained on, and it may perform poorly or generate gibberish for the rest of the sequence.

Scheduled Sampling

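Scheduled sampling addresses exposure bias by randomly choosing, at each decoding step, between the ground-truth token and the model's own prediction. During early epochs, we use teacher forcing almost exclusively (e.g., 100% of the time). As training progresses, we gradually decay this probability, so the model increasingly receives its own predictions as inputs and learns to recover from its own mistakes.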

<pre><code class="language-python">import random


def get_decoder_input(target_token, predicted_token, teacher_forcing_ratio=0.5):
    """Pick the next decoder input for one step of scheduled sampling."""
    # Decide whether to use the ground-truth target or the model's prediction
    if random.random() < teacher_forcing_ratio:
        return target_token  # Teacher forcing
    else:
        return predicted_token  # Autoregressive feedback
</code></pre>
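The function above takes a fixed `teacher_forcing_ratio`, and the decay itself is left to the training loop. Below is a minimal sketch of two possible decay schedules, assuming the ratio is updated once per epoch. The function names, parameters, and default values are illustrative assumptions, not a prescribed recipe.

```python
def linear_decay(epoch, total_epochs, start=1.0, end=0.0):
    """Linearly move the teacher forcing ratio from start to end over training."""
    if total_epochs <= 1:
        return start
    # Clamp progress to [0, 1] so out-of-range epochs stay within bounds.
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * progress


def exponential_decay(epoch, rate=0.9, start=1.0):
    """Multiply the teacher forcing ratio by a constant rate each epoch."""
    return start * rate ** epoch


# Example: ratios for an 11-epoch run with the linear schedule.
ratios = [linear_decay(epoch, total_epochs=11) for epoch in range(11)]
```

The first entry of `ratios` is 1.0 (pure teacher forcing) and the last is 0.0 (pure autoregressive feedback). Each epoch's value would be passed as `teacher_forcing_ratio` when choosing decoder inputs. In practice the floor is often kept above zero so that some ground-truth inputs remain throughout training.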