Using Pre-trained Embeddings in PyTorch/Keras

Pre-trained word embeddings transfer semantic representations learned from massive text corpora to custom neural network architectures. Implementing these embeddings in PyTorch requires loading the weights into memory, mapping vocabularies, and deciding whether to freeze or fine-tune the parameters.


Loading Pre-trained Weights

Loading external embeddings involves building a mapping from a custom vocabulary to the pre-trained vector array and initializing the network layer.

Embedding Initialization

When using pre-trained embeddings like GloVe or Word2Vec, the first step is parsing the text or binary vector files. These files contain words followed by their high-dimensional vector representations (typically 100 to 300 dimensions). We construct a vocabulary mapping from our training corpus, assigning each unique token a unique index. Then, we initialize a weight tensor of shape \\( (V, D) \\), where \\( V \\) is our vocabulary size and \\( D \\) is the embedding dimension. We iterate through our vocabulary, look up each word in the pre-trained dictionary, and copy its vector into the corresponding row of the weight tensor.

If a word in our vocabulary is not present in the pre-trained embeddings, we must handle it as an out-of-vocabulary (OOV) token, which is typically initialized randomly or with zeros. This weight matrix is then passed directly to the embedding layer of the neural network during construction, establishing the starting representations for training.
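The steps above can be sketched as follows. This is a minimal illustration, not a full loader: it assumes a GloVe-style text layout (one token per line followed by its vector components), and the in-memory sample `GLOVE_TEXT`, the toy `vocab`, and the helper names `load_vectors` and `build_weight_matrix` are all hypothetical.

```python
import io
import torch

# Hypothetical GloVe-style text: one token per line, then its vector components.
GLOVE_TEXT = """deep 0.9 -0.1 0.4 0.2
learning -0.2 0.8 0.5 -0.1
curriculum 0.5 0.5 -0.5 -0.5
"""

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> 1-D tensor."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = torch.tensor([float(v) for v in parts[1:]])
    return vectors

def build_weight_matrix(vocab, vectors, dim):
    """Return a (V, D) tensor plus the tokens that had no pre-trained vector."""
    weights = torch.zeros(len(vocab), dim)
    missing = []
    for token, index in vocab.items():
        if token in vectors:
            weights[index] = vectors[token]  # copy the pre-trained row
        else:
            missing.append(token)  # row stays zero until an OOV strategy is applied
    return weights, missing

# Toy vocabulary built from a (hypothetical) training corpus.
vocab = {"<PAD>": 0, "deep": 1, "learning": 2, "<UNK>": 3, "curriculum": 4, "pytorch": 5}
vectors = load_vectors(io.StringIO(GLOVE_TEXT))
weights, missing = build_weight_matrix(vocab, vectors, dim=4)
print(weights.shape, missing)
```

Returning the list of missing tokens alongside the matrix makes it easy to report the OOV rate and to apply a different initialization to those rows afterwards.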

PyTorch Implementation

This PyTorch implementation demonstrates how to instantiate an nn.Embedding layer using a pre-trained weight tensor, configure freezing, and handle the padding index:

<pre><code class="language-python">import torch
import torch.nn as nn

# Simulate loading a pre-trained weight matrix
# Vocabulary size = 5, Embedding dimension = 4
pretrained_weights = torch.tensor([
    [0.1, 0.2, -0.3, 0.4],   # Index 0: &lt;PAD&gt;
    [0.9, -0.1, 0.4, 0.2],   # Index 1: 'deep'
    [-0.2, 0.8, 0.5, -0.1],  # Index 2: 'learning'
    [0.0, 0.0, 0.0, 0.0],    # Index 3: &lt;UNK&gt; (initialized to zero)
    [0.5, 0.5, -0.5, -0.5]   # Index 4: 'curriculum'
])

class CustomEmbedder(nn.Module):
    def __init__(self, weights, freeze=True):
        super().__init__()
        num_embeddings, embedding_dim = weights.shape
        # Load pre-trained weights into the Embedding layer
        self.embedding = nn.Embedding.from_pretrained(
            weights,
            freeze=freeze,  # True: freezes weights; False: fine-tunes weights
            padding_idx=0   # Keeps PAD vector constant during training
        )
        self.fc = nn.Linear(embedding_dim, 2)

    def forward(self, x):
        # x shape: [batch_size, seq_len]
        embedded = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        # Global average pooling over sequence length
        pooled = torch.mean(embedded, dim=1)  # [batch_size, embedding_dim]
        logits = self.fc(pooled)  # [batch_size, 2]
        return logits

# Example usage
model = CustomEmbedder(pretrained_weights, freeze=True)
input_tokens = torch.tensor([[1, 2, 4], [1, 0, 0]])  # Batch of 2 sentences
out = model(input_tokens)
print("Output logits shape:", out.shape)
print("Weights require grad?", model.embedding.weight.requires_grad)</code></pre>

Fine-Tuning vs. Freezing Embeddings

Deciding whether to update pre-trained weights during backpropagation is a key design choice, and it depends largely on how much labeled data is available for the target task.

Optimization Dynamics

When incorporating pre-trained embeddings, we have two primary training strategies: freezing the weights (holding them constant) or fine-tuning them (allowing them to update via backpropagation). Freezing the embedding weights (requires_grad=False) treats the vectors as static feature extractors. This is highly beneficial when the target training dataset is small, as it prevents the model from overfitting the embeddings to the small dataset and preserves the general semantic spaces learned from the massive source corpus.

Conversely, fine-tuning the embedding weights (requires_grad=True) allows the vectors to adapt to the vocabulary usage and contextual nuances of the target task. If the target dataset is large enough, fine-tuning aligns the word representations with the task-specific labels, which can improve performance. The risk is uneven drift: only words that occur in the training batches receive gradient updates, so their vectors move while those of unseen words stay at their pre-trained positions, and on a small dataset the updated vectors can overfit. Either effect can break the semantic alignment that made the pre-trained space useful.
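A small sketch of the two strategies, assuming a random placeholder matrix in place of real pre-trained vectors (the helper `trainable_params` is hypothetical). It also shows that, after a backward pass, only the rows for tokens that appeared in the batch carry a non-zero gradient.

```python
import torch
import torch.nn as nn

weights = torch.randn(5, 4)  # placeholder for a real pre-trained matrix

# from_pretrained uses the given tensor as the layer's weight, so the second
# layer gets its own copy to keep the two layers independent.
frozen = nn.Embedding.from_pretrained(weights, freeze=True)
tunable = nn.Embedding.from_pretrained(weights.clone(), freeze=False)

def trainable_params(module):
    """Count parameters that would be updated by an optimizer."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

frozen_count = trainable_params(frozen)    # no trainable parameters
tunable_count = trainable_params(tunable)  # all 5 * 4 entries are trainable
print(frozen_count, tunable_count)

# Only the rows looked up in the batch (indices 1 and 2) receive gradient.
tunable(torch.tensor([1, 2])).sum().backward()
rows_with_grad = tunable.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(rows_with_grad)

# A frozen layer can be unfrozen later by flipping the flag on its weight.
frozen.weight.requires_grad_(True)
```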

Learning Rate Schedules

To mitigate the risk of disrupting pre-trained representations during fine-tuning, practitioners often apply discriminative fine-tuning, also called differential learning rates. This approach updates the embedding layer with a much smaller learning rate than the rest of the network (e.g., \\( 10^{-5} \\) for the embedding layer and \\( 10^{-3} \\) for the classifier layers). The embeddings can still adjust to task-specific patterns, but each step moves them only slightly, which limits the damage done by the noisy gradients of early training epochs.
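In PyTorch, one way to express this is with optimizer parameter groups. The sketch below assumes a toy `Classifier` module (hypothetical) and a random placeholder weight matrix; the learning rates are the example values from the text, not tuned recommendations.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Toy model: trainable pre-trained embedding followed by a linear head."""
    def __init__(self, weights):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=False)
        self.fc = nn.Linear(weights.shape[1], 2)

    def forward(self, x):
        return self.fc(self.embedding(x).mean(dim=1))

model = Classifier(torch.randn(5, 4))  # placeholder for real pre-trained vectors

# Each dict is a parameter group with its own learning rate.
optimizer = torch.optim.Adam([
    {"params": model.embedding.parameters(), "lr": 1e-5},  # slow: pre-trained
    {"params": model.fc.parameters(), "lr": 1e-3},         # fast: random init
])

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```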

Another common approach is a two-stage training schedule: the embeddings are initially frozen for several epochs so that the randomly initialized downstream layers can converge, and are then unfrozen and trained with a low learning rate for a few final epochs. This keeps the large, noisy gradients produced while the downstream layers are still untrained from corrupting the pre-trained representations.
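The two-stage schedule might look like the following sketch. The data, epoch counts, and learning rates are arbitrary placeholders chosen for illustration, and `run_epochs` is a hypothetical helper rather than a library function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding.from_pretrained(torch.randn(5, 4), freeze=True)
head = nn.Linear(4, 2)
inputs = torch.tensor([[1, 2, 4], [1, 3, 2]])  # toy batch of token indices
targets = torch.tensor([0, 1])
loss_fn = nn.CrossEntropyLoss()

def run_epochs(optimizer, epochs):
    """Run a few full-batch training steps with the given optimizer."""
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = head(embedding(inputs).mean(dim=1))
        loss_fn(logits, targets).backward()
        optimizer.step()

initial = embedding.weight.detach().clone()

# Stage 1: embeddings frozen, only the head is optimized.
run_epochs(torch.optim.Adam(head.parameters(), lr=1e-3), epochs=3)
unchanged_after_stage1 = torch.equal(embedding.weight, initial)

# Stage 2: unfreeze and continue with a small learning rate on the embeddings.
embedding.weight.requires_grad_(True)
stage2_optimizer = torch.optim.Adam([
    {"params": embedding.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
run_epochs(stage2_optimizer, epochs=2)
changed_after_stage2 = not torch.equal(embedding.weight, initial)

print(unchanged_after_stage1, changed_after_stage2)
```

A fresh optimizer is created for stage 2 so that the newly trainable embedding weight is registered with it; the stage 1 optimizer never saw that parameter.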

Handling Out-of-Vocabulary (OOV) Words

Effectively managing words that do not appear in the pre-trained dictionary prevents performance drops during downstream training.

Initialization Strategies

No pre-trained embedding file contains every possible word. Handling Out-of-Vocabulary (OOV) words requires careful initialization. The simplest strategy is zero-initialization, which assigns all OOV words a vector of zeros. While computationally clean, this means OOV words contribute nothing to downstream layers. Another strategy is random uniform or normal initialization, matching the mean and variance of the pre-trained embeddings. This ensures that OOV tokens generate activations, but their random positions in the semantic space can introduce noise.
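As a rough sketch of the random-normal strategy, the snippet below fills the OOV rows with samples whose mean and standard deviation match the pre-trained vectors. The `pretrained` matrix is synthetic, and the boolean mask `found` (which rows have a pre-trained vector) is a hypothetical stand-in for the result of a real vocabulary lookup.

```python
import torch

torch.manual_seed(0)
dim = 50
pretrained = torch.randn(1000, dim) * 0.4 + 0.05  # synthetic stand-in vectors

# Hypothetical lookup result: first 1000 vocabulary rows found, last 200 are OOV.
found = torch.zeros(1200, dtype=torch.bool)
found[:1000] = True

weights = torch.zeros(1200, dim)  # zero-initialization is the starting point
weights[found] = pretrained

# Match the overall mean and standard deviation of the pre-trained vectors.
mean, std = pretrained.mean(), pretrained.std()
num_oov = int((~found).sum())
weights[~found] = torch.randn(num_oov, dim) * std + mean

print(weights[~found].mean().item(), weights[~found].std().item())
```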

A more sophisticated approach is semantic averaging. If an OOV word can be broken into parts or related to existing vocabulary terms, its vector can be initialized as the average of those related terms. In modern systems, hybrid approaches combine word-level pre-trained embeddings with character-level convolutional or recurrent feature extractors to generate representations for unseen words dynamically.
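A minimal sketch of the averaging idea follows. It assumes the OOV word has already been segmented into parts by some external step (for example, a morphological or subword splitter); the tiny `vectors` dictionary and the helper `average_of_parts` are hypothetical.

```python
import torch

# Hypothetical pre-trained vectors for two known word parts.
vectors = {
    "cardio": torch.tensor([1.0, 0.0, 2.0]),
    "vascular": torch.tensor([3.0, 2.0, 0.0]),
}

def average_of_parts(parts, vectors):
    """Average the vectors of the known parts; return None if none are known."""
    known = [vectors[p] for p in parts if p in vectors]
    if not known:
        return None  # caller falls back to zero or random initialization
    return torch.stack(known).mean(dim=0)

# 'cardiovascular' is assumed to be OOV and pre-split into two known parts.
oov_vector = average_of_parts(["cardio", "vascular"], vectors)
print(oov_vector)  # tensor([2., 1., 1.])
```

Returning None when no part is known keeps the fallback decision (zeros or random values) with the caller.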

Domain Adaptation

When applying general pre-trained embeddings (like GloVe trained on Wikipedia) to specialized domains (like clinical medical notes or legal contracts), the OOV rate can soar, and some form of domain adaptation is usually needed. One common strategy is to train word embeddings from scratch on a large unlabeled in-domain corpus, and then align them with the general vectors using orthogonal Procrustes alignment or a similar linear mapping estimated from the words the two vocabularies share.

This vocabulary alignment ensures that the specialized terminology is projected into a vector space that is structurally compatible with the general-purpose embeddings, allowing downstream models to leverage both general semantic knowledge and domain-specific vocabulary structures.
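The Procrustes step can be sketched as below, under the assumption that the rows of the two matrices correspond to the same shared anchor words in the same order. The data is synthetic: the in-domain vectors are constructed as an exact rotation of the general ones, so the recovered mapping should reproduce the general space. The function name `procrustes` is hypothetical.

```python
import torch

def procrustes(source, target):
    """Orthogonal W minimizing the Frobenius norm of (source @ W - target)."""
    u, _, vh = torch.linalg.svd(source.T @ target)
    return u @ vh

torch.manual_seed(0)
# Synthetic vectors for 200 anchor words present in both vocabularies.
general = torch.randn(200, 8, dtype=torch.float64)

# Build an in-domain space that is a pure rotation of the general one.
true_rotation, _ = torch.linalg.qr(torch.randn(8, 8, dtype=torch.float64))
in_domain = general @ true_rotation.T  # so in_domain @ true_rotation == general

w = procrustes(in_domain, general)
aligned = in_domain @ w  # in-domain vectors mapped into the general space

print(torch.allclose(aligned, general, atol=1e-8))
```

With real embeddings the two spaces are not exact rotations of each other, so the alignment is only approximate; once estimated on the shared words, the same matrix can be applied to the in-domain vectors of domain-specific terms.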