Transfer Learning with Pre-trained Models

Transfer learning leverages features learned by a model pre-trained on a large source dataset to solve a target task with limited training data.

Transfer Learning Mechanics

Transfer learning addresses data scarcity by reusing visual feature detectors learned on massive general datasets.

Feature Reuse and Data Scarcity

Training a deep network from scratch requires a large amount of labeled data. When data is scarce, starting from random weights often leads to overfitting and poor generalization. Transfer learning addresses this by starting from a model pre-trained on a large dataset such as ImageNet.

The early layers of pre-trained models capture general visual features (like edges, textures, and shapes) that are useful across different datasets. By reusing these features, the model can learn the target task from significantly less data and with less training compute than training from scratch.

Inductive Transfer and Domain Alignment

The success of transfer learning depends on the similarity between the source domain (e.g., ImageNet) and the target domain (e.g., medical scans). If the domains are similar, the pre-trained features are highly relevant.

If the domains differ significantly (e.g., moving from natural photos to satellite imagery), the pre-trained features may be less relevant. In that case the deeper, more task-specific layers usually need to be fine-tuned so that their features align with the target domain.

Strategy Choices

Developers must choose between feature extraction and fine-tuning depending on the size of the target dataset and how closely it resembles the source data.

Feature Extraction vs. Fine-tuning

There are two main transfer learning strategies: feature extraction and fine-tuning. Feature extraction freezes the weights of the pre-trained network and trains only a new classifier head on top.
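Because a frozen network produces the same features for the same input, feature extraction can be sketched as computing features once and training only the head on them. The example below is a minimal illustration under stated assumptions: the small `backbone` is a hypothetical stand-in for a real pre-trained network, and the data is random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone (in practice, for example,
# a ResNet with its classifier head removed).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze the backbone and switch it to inference mode.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# Toy target data (made up): 16 RGB images of size 32x32, 5 classes.
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 5, (16,))

# Compute the features once; no_grad avoids building an autograd graph.
with torch.no_grad():
    features = backbone(images)  # shape: (16, 8)

# Train only a new linear head on the cached features.
head = nn.Linear(features.shape[1], 5)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(head(features), labels)
    loss.backward()
    optimizer.step()
```

Caching features this way only works when the inputs are not randomly augmented on each epoch. With augmentation, the frozen backbone has to run on every batch.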

Fine-tuning trains the new classifier head and also updates some or all of the pre-trained layers. It is typically used when the target dataset is large or differs significantly from the source dataset, so that the pre-trained features themselves can adapt to the target data.
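One common way to set up fine-tuning is to keep the earliest layers frozen and give the remaining pre-trained layers a smaller learning rate than the new head. The sketch below assumes a hypothetical `TinyNet` in place of a real pre-trained model. The block names and learning rates are illustrative, not recommended values.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pre-trained network with a new head."""

    def __init__(self, num_classes):
        super().__init__()
        self.early = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # generic features
        self.late = nn.Sequential(nn.Linear(16, 16), nn.ReLU())   # task-specific features
        self.head = nn.Linear(16, num_classes)                    # new classifier head

    def forward(self, x):
        return self.head(self.late(self.early(x)))


model = TinyNet(num_classes=5)

# Keep the earliest, most generic block frozen.
for param in model.early.parameters():
    param.requires_grad = False

# Per-parameter-group learning rates: small updates for the pre-trained
# block, larger updates for the randomly initialized head.
optimizer = torch.optim.Adam([
    {"params": model.late.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```

The smaller rate for the pre-trained block reflects the assumption that its weights are already close to useful values, while the head starts from random initialization.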

Classifier Head Replacement

Because the original model's classifier head is designed for the source task (e.g., 1000 classes for ImageNet), it must be replaced with a new linear layer that matches the number of target classes.

The weights of this new layer are initialized randomly. The layer is then trained on the target dataset's labels so that it maps the extracted features to the target classes.

PyTorch Implementation of Transfer Learning

Let's build a transfer learning pipeline in PyTorch, freezing submodules and configuring parameter gradients.

Freezing Weights and Replacing the Classifier

The code below shows how to load a pre-trained ResNet18 model, freeze its weights, replace the final classification layer, and prepare it for training.

<pre><code class="language-python">import torch
import torch.nn as nn
from torchvision import models

# Load ResNet18 with ImageNet weights.
# torchvision >= 0.13 uses the `weights` argument; older releases used pretrained=True.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head (the `fc` layer in ResNet).
# A newly created layer requires gradients by default.
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 5)  # 5 target classes

# Verify which parameters are trainable
for name, param in model.named_parameters():
    if param.requires_grad:
        print("Trainable:", name)  # fc.weight, fc.bias
</code></pre>

In this setup, setting requires_grad = False tells autograd not to compute gradients for the pre-trained parameters, so they stay fixed during training. Only the newly created fc layer remains trainable.
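One detail that requires_grad does not cover is that BatchNorm layers update their running mean and variance during any forward pass in training mode, whether or not their parameters are frozen. The sketch below shows the effect and one way to prevent it. It uses a small hypothetical `backbone` rather than ResNet18, and `freeze_batchnorm_stats` is an illustrative helper, not a library function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a frozen pre-trained backbone containing BatchNorm.
backbone = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),
    nn.BatchNorm2d(4),
    nn.ReLU(),
)
for param in backbone.parameters():
    param.requires_grad = False


def freeze_batchnorm_stats(module):
    """Put every BatchNorm layer in eval mode so running statistics stay fixed."""
    for layer in module.modules():
        if isinstance(layer, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            layer.eval()


bn = backbone[1]
batch = torch.randn(8, 3, 16, 16)

# In training mode the running statistics drift even though weights are frozen.
backbone.train()
before = bn.running_mean.clone()
backbone(batch)
changed_in_train_mode = not torch.equal(before, bn.running_mean)

# With the BatchNorm layers in eval mode the statistics stay fixed.
freeze_batchnorm_stats(backbone)
before = bn.running_mean.clone()
backbone(batch)
changed_after_freeze = not torch.equal(before, bn.running_mean)
```

Calling `train()` on the model puts the BatchNorm layers back into training mode, so a helper like this has to be reapplied after each such call.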

Training Optimization

When constructing the optimizer, pass only the parameters that require gradients. This makes the set of trained weights explicit and keeps the optimizer from iterating over frozen parameters at every step.

<pre><code class="language-python">import torch.optim as optim

# Pass only the trainable parameters to the optimizer
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable_params, lr=0.001)
</code></pre>

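Filtering parameters ensures that optimizer state, such as Adam's moment estimates, is kept only for the trainable parameters. In current PyTorch releases the built-in optimizers skip parameters that have no gradient and create their state lazily, so most of the memory saved by freezing comes from autograd not computing or storing gradients for the frozen layers.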
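A quick way to check that the setup behaves as intended is to run one training step and compare the weights before and after. The sketch below is a minimal check under stated assumptions: a two-layer hypothetical model stands in for the frozen network plus new head, and the data is random.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(10, 8),  # stands in for the frozen pre-trained layers
    nn.ReLU(),
    nn.Linear(8, 5),   # stands in for the new classifier head
)
for param in model[0].parameters():
    param.requires_grad = False

# Same filtering pattern as in the optimizer setup above.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=0.001)
criterion = nn.CrossEntropyLoss()

# Snapshot the weights before the update.
backbone_before = copy.deepcopy(model[0].state_dict())
head_before = copy.deepcopy(model[2].state_dict())

# One training step on random data.
inputs = torch.randn(4, 10)
targets = torch.randint(0, 5, (4,))
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(backbone_before[k], v) for k, v in model[0].state_dict().items()
)
head_changed = any(
    not torch.equal(head_before[k], v) for k, v in model[2].state_dict().items()
)
```

After the step, `backbone_unchanged` and `head_changed` should both be true. If the first is false, some pre-trained parameter was left trainable.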