Fine-tuning Strategies and Learning Rate Differential

Fine-tuning adapts pre-trained network weights to a new target domain. Differential learning rates help preserve the early visual feature detectors while the rest of the network adjusts.


Differential Learning Rates

Using different learning rates across layers protects early visual primitives while updating final classifiers.

Preserving Early Feature Extractors

During fine-tuning, the pre-trained weights are adjusted to fit the target dataset. However, early layers (like Conv1 and Conv2) detect general visual primitives (like edges and colors) that are useful across most visual tasks. Large weight updates can overwrite these learned features, a problem often described as catastrophic forgetting.

To prevent this, developers use a differential learning rate strategy. The early layers are configured with a very small learning rate (e.g., 1e-5 or 1e-6), while the newly initialized classifier head is configured with a larger learning rate (e.g., 1e-3). This allows the classifier head to adapt quickly while early layers are adjusted gently, preventing feature distortion.

Parameter Groups in Optimizers

PyTorch allows configuring different learning rates for different parameters by passing a list of parameter dictionaries (parameter groups) to the optimizer.

Each dictionary specifies a set of parameters and the learning rate to use for them; any option a dictionary omits falls back to the optimizer's default. This gives fine-grained control over optimization, so each part of the network can update at a suitable speed, which often improves convergence and final accuracy.
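As a minimal sketch of how parameter groups interact with optimizer defaults, the snippet below uses two small `nn.Linear` layers as hypothetical stand-ins for a backbone and a head (they are not part of the original pipeline). The first group sets no learning rate and inherits the default, while the second overrides it.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-ins for a pre-trained backbone and a new head.
backbone = nn.Linear(8, 4)
head = nn.Linear(4, 2)

optimizer = optim.SGD(
    [
        {"params": backbone.parameters()},          # no 'lr' key: uses the default below
        {"params": head.parameters(), "lr": 1e-3},  # explicit per-group override
    ],
    lr=1e-5,       # default learning rate for groups that do not set their own
    momentum=0.9,  # other keyword defaults apply to every group unless overridden
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
group_momenta = [group["momentum"] for group in optimizer.param_groups]
print(group_lrs)
print(group_momenta)
```

Keeping the small rate as the default means any parameters added later without an explicit rate are treated cautiously.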

Progressive Unfreezing

Unfreezing layers in stages stabilizes training and shields the pre-trained weights from the large early gradients of a newly initialized head.

Layer-wise Training

Progressive unfreezing is a strategy where layers are unfrozen and trained sequentially, rather than all at once. The training begins by freezing the entire feature extractor and only training the classifier head.

Once the classifier head converges, the deeper convolutional blocks are unfrozen and trained. Gradually, early convolutional blocks are unfrozen in stages. This progressive transition prevents large gradients from the randomly initialized head from destabilizing the pre-trained weights.
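The staged schedule described above might be sketched as follows. The three linear "blocks", the random data, and the fixed five steps per stage are all illustrative assumptions standing in for a real pre-trained network, dataset, and convergence check. The optimizer is built over all parameters; PyTorch optimizers skip parameters that have no gradient, so frozen blocks stay untouched until they are unfrozen.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained network: three blocks plus a new head.
model = nn.ModuleDict({
    "block1": nn.Linear(16, 16),
    "block2": nn.Linear(16, 16),
    "block3": nn.Linear(16, 16),
    "head": nn.Linear(16, 3),
})

def forward(x):
    for name in ("block1", "block2", "block3"):
        x = torch.tanh(model[name](x))
    return model["head"](x)

def set_trainable(module, trainable):
    for param in module.parameters():
        param.requires_grad = trainable

# Stage 0: freeze the whole feature extractor so only the head trains.
for name in ("block1", "block2", "block3"):
    set_trainable(model[name], False)

optimizer = optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
inputs, targets = torch.randn(32, 16), torch.randint(0, 3, (32,))
block1_initial = model["block1"].weight.detach().clone()

# Unfreeze from the deepest block towards the earliest, one stage at a time.
unfreeze_order = [None, "block3", "block2", "block1"]
trainable_per_stage, block1_changed = [], []
for stage_block in unfreeze_order:
    if stage_block is not None:
        set_trainable(model[stage_block], True)
    for _ in range(5):  # placeholder for "train until this stage converges"
        optimizer.zero_grad()
        loss = criterion(forward(inputs), targets)
        loss.backward()
        optimizer.step()
    trainable_per_stage.append(
        [n for n in model if all(p.requires_grad for p in model[n].parameters())]
    )
    block1_changed.append(not torch.equal(model["block1"].weight, block1_initial))

print(trainable_per_stage)
print(block1_changed)
```

In a real pipeline each stage would run until validation loss plateaus, and the newly unfrozen blocks would typically get a smaller learning rate than the head.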

Regularization and Early Stopping

Fine-tuning deep architectures on small datasets increases the risk of overfitting. To mitigate this, developers apply strong regularization during fine-tuning, including weight decay, dropout, and early stopping.

Early stopping monitors the validation loss during training and halts optimization once it stops improving, usually after a set number of epochs without improvement (the patience). This keeps the model from overfitting to the training set, which matters most when the target dataset is small.
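One way to express early stopping is a small helper that tracks the best validation loss seen so far. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values below are hypothetical and chosen for illustration; they are not taken from any particular library.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only.
val_losses = [0.90, 0.70, 0.62, 0.60, 0.63, 0.66, 0.70, 0.75]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

In practice the model weights would be checkpointed whenever a new best loss is recorded, then restored from that checkpoint after stopping.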

PyTorch Implementation of Differential Learning Rates

Let's implement parameter grouping and differential learning rates in a PyTorch training pipeline.

Implementing Parameter Groups in PyTorch

The code below shows how to configure a differential learning rate setup in PyTorch, splitting the model into early feature layers and classification layers.

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a pre-trained ResNet18 (the `weights` argument replaces the
# deprecated `pretrained=True` in torchvision 0.13 and later)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classifier head with a new 10-class layer
model.fc = nn.Linear(model.fc.in_features, 10)

# Collect every parameter that does not belong to the classifier head
features_params = [
    param for name, param in model.named_parameters()
    if not name.startswith("fc.")
]

# Configure the optimizer with one parameter group per learning rate
optimizer = optim.Adam([
    {'params': features_params, 'lr': 1e-5},        # Low learning rate for features
    {'params': model.fc.parameters(), 'lr': 1e-3},  # Higher learning rate for head
])

# Verify the parameter groups
for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i} count: {len(group['params'])}, lr: {group['lr']}")</code></pre>

This configuration divides the parameters into two optimization groups. Backpropagation computes gradients for all of them as usual; when optimizer.step() runs, each group is updated with its own learning rate, so the pre-trained feature extractor changes far more slowly than the classifier head.
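To see the effect of the two learning rates, the sketch below takes a single optimizer step and measures how far each group's weights move. The small `backbone` and `head` modules and the random batch are hypothetical stand-ins used to keep the example self-contained. On its first step, Adam moves each weight by roughly its group's learning rate at most, so the two changes should differ by about two orders of magnitude.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-ins for a pre-trained backbone and a new classifier head.
backbone = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
head = nn.Linear(8, 2)

optimizer = optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

# Snapshot the weights so the size of the update can be measured.
backbone_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()

inputs, targets = torch.randn(16, 8), torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(head(backbone(inputs)), targets)
loss.backward()   # computes gradients only; no weights change here
optimizer.step()  # applies each group's own learning rate

backbone_change = (backbone[0].weight.detach() - backbone_before).abs().max().item()
head_change = (head.weight.detach() - head_before).abs().max().item()
print(f"max backbone change: {backbone_change:.2e}")
print(f"max head change:     {head_change:.2e}")
```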

Implementing Progressive Unfreezing

The helper below shows how to unfreeze a layer programmatically in PyTorch; it would be called between stages of the training loop.

<pre><code class="language-python">def unfreeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = True

# Unfreeze ResNet layer4 block
unfreeze_layer(model.layer4)
print("ResNet Layer4 unfrozen for fine-tuning.")</code></pre>

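Setting requires_grad = True re-enables gradient computation for the parameters in layer4. From the next training step onward, the optimizer updates these convolutional weights alongside the classifier head, provided they belong to one of its parameter groups. This increases the number of trainable parameters; the architecture itself is unchanged.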
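If the optimizer was originally built from the head's parameters only, unfreezing a layer is not enough by itself: the newly trainable parameters also have to be registered with the optimizer. A minimal sketch, using a small hypothetical `nn.Sequential` model in place of the ResNet, could use `add_param_group` for this.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in: one backbone block followed by a classifier head.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
block, head = model[0], model[2]

# Freeze the block and build the optimizer from the head only.
for param in block.parameters():
    param.requires_grad = False
optimizer = optim.Adam(head.parameters(), lr=1e-3)
groups_before = len(optimizer.param_groups)

# Unfreeze the block, then register its parameters with a smaller learning rate.
for param in block.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": list(block.parameters()), "lr": 1e-5})

groups_after = len(optimizer.param_groups)
group_lrs = [group["lr"] for group in optimizer.param_groups]
print(groups_before, groups_after, group_lrs)
```

Adding the block as its own group keeps the differential learning rate setup intact as more layers are unfrozen.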