EfficientNet Architecture Scaling
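EfficientNet scales network depth, width, and input resolution together using a single compound coefficient. At publication, the resulting model family matched or exceeded the ImageNet accuracy of much larger convolutional networks while using far fewer parameters and FLOPS.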
Compound Scaling Principles
Compound scaling increases model depth, width, and input resolution together in fixed proportions rather than adjusting one dimension at a time.
Depth, Width, and Resolution Dimensions
Historically, convolutional networks were scaled by adjusting a single dimension: depth (e.g., adding layers in ResNet), width (e.g., expanding channels in WideResNet), or resolution (e.g., feeding larger images). Each adjustment improves accuracy, but the gains diminish quickly when only one dimension is scaled.
EfficientNet shows empirically that scaling all three dimensions in balance yields better results. For example, if we increase input resolution, the network needs more depth to expand its receptive field and more width to capture the finer-grained patterns that the larger image exposes.
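To make the cost of each dimension concrete, the sketch below counts approximate multiply-adds for a plain stack of convolutions. The helper `conv_stack_flops` and its simplifying assumptions (equal input and output channels, stride 1, constant spatial size) are illustrative and are not part of EfficientNet itself.

```python
def conv_stack_flops(depth, width, resolution, kernel=3):
    """Approximate multiply-adds for `depth` stacked k x k convolutions.

    Assumes every layer has `width` input and output channels and keeps the
    feature map at `resolution` x `resolution` (stride 1, same padding).
    """
    per_layer = kernel * kernel * width * width * resolution * resolution
    return depth * per_layer


base = conv_stack_flops(depth=10, width=64, resolution=128)
deeper = conv_stack_flops(depth=20, width=64, resolution=128)
wider = conv_stack_flops(depth=10, width=128, resolution=128)
larger = conv_stack_flops(depth=10, width=64, resolution=256)

print(deeper / base)  # 2.0: cost grows linearly with depth
print(wider / base)   # 4.0: cost grows quadratically with width
print(larger / base)  # 4.0: cost grows quadratically with resolution
```

Under these assumptions, doubling width or resolution is twice as expensive as doubling depth, which is why a balanced scaling rule has to weight the three dimensions differently.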
The Compound Scaling Coefficient
EfficientNet scales depth (\\(d\\)), width (\\(w\\)), and resolution (\\(r\\)) using a user-specified compound scaling coefficient \\(\\phi\\). The scaling relations are formulated as:
\\(d = \\alpha^\\phi, \\quad w = \\beta^\\phi, \\quad r = \\gamma^\\phi\\)
subject to \\(\\alpha \\times \\beta^2 \\times \\gamma^2 \\approx 2\\) and \\(\\alpha, \\beta, \\gamma \\ge 1\\). Here, \\(\\alpha, \\beta, \\gamma\\) are constants determined by a small grid search on the baseline model (EfficientNet-B0). Because the FLOPS of a convolutional network grow roughly in proportion to \\(d \\times w^2 \\times r^2\\), the constraint means that total FLOPS increase by about \\(2^\\phi\\). By setting \\(\\phi\\), developers can scale the model to match their compute budget (from B0 to B7) while keeping the three dimensions in proportion.
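As a worked example, the sketch below evaluates the scaling relations for a few values of \\(\\phi\\). It assumes the coefficients reported in the EfficientNet paper (\\(\\alpha = 1.2\\), \\(\\beta = 1.1\\), \\(\\gamma = 1.15\\)) and approximates the FLOPS multiplier as \\(d \\times w^2 \\times r^2\\). The function name `compound_multipliers` is hypothetical.

```python
# Coefficients reported for the EfficientNet-B0 grid search (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_multipliers(phi):
    """Return (depth, width, resolution, flops) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS of a conv net scale roughly with depth * width^2 * resolution^2.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_multipliers(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

With these coefficients, \\(\\alpha \\times \\beta^2 \\times \\gamma^2\\) comes to about 1.92, so each unit increase in \\(\\phi\\) roughly doubles the FLOPS. The released B1 to B7 models use rounded depth, width, and resolution settings, so these multipliers are illustrative rather than an exact recipe for each variant.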
MBConv and Squeeze-and-Excitation
EfficientNet is built from mobile inverted bottleneck blocks with embedded channel attention.
MBConv Architecture
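The backbone of EfficientNet is based on mobile inverted bottleneck convolutions (MBConv), similar to MobileNetV2. Each block expands its input with a 1x1 convolution, applies a depthwise convolution, and projects back to a narrow output with another 1x1 convolution. The residual connection joins the narrow ends of the block, which is why the bottleneck is called inverted.
In addition, MBConv integrates Squeeze-and-Excitation (SE) blocks, which reweight channels by modeling the interdependencies between them. This lets the network emphasize informative channels and suppress less useful ones for each input, at inference as well as during training.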
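The block structure described above can be sketched in PyTorch as follows. This is a simplified illustration, not a reference implementation: the class name `MBConvSketch`, the default `expand_ratio=6` and `se_ratio=0.25`, and the fixed 3x3 kernel are assumptions, and details such as stochastic depth are omitted.

```python
import torch
import torch.nn as nn


class MBConvSketch(nn.Module):
    """Simplified MBConv: expand -> depthwise -> squeeze-excite -> project."""

    def __init__(self, in_c, out_c, expand_ratio=6, stride=1, se_ratio=0.25):
        super().__init__()
        mid_c = in_c * expand_ratio
        se_c = max(1, int(in_c * se_ratio))
        # The skip connection is only valid when input and output shapes match.
        self.use_residual = stride == 1 and in_c == out_c

        # 1x1 pointwise convolution widens the narrow input.
        self.expand = nn.Sequential(
            nn.Conv2d(in_c, mid_c, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_c),
            nn.SiLU(),
        )
        # 3x3 depthwise convolution: groups == channels, one filter per channel.
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_c, mid_c, kernel_size=3, stride=stride, padding=1,
                      groups=mid_c, bias=False),
            nn.BatchNorm2d(mid_c),
            nn.SiLU(),
        )
        # Channel attention on the expanded features, written with 1x1 convs.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_c, se_c, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(se_c, mid_c, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 linear projection back to a narrow output (no activation).
        self.project = nn.Sequential(
            nn.Conv2d(mid_c, out_c, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_c),
        )

    def forward(self, x):
        out = self.depthwise(self.expand(x))
        out = out * self.se(out)  # broadcast [B, C, 1, 1] gates over H x W
        out = self.project(out)
        return x + out if self.use_residual else out


block = MBConvSketch(in_c=32, out_c=32)
y = block(torch.randn(2, 32, 56, 56))
print(y.shape)  # torch.Size([2, 32, 56, 56])
```

The projection layer deliberately has no activation, so the narrow output stays linear before it is added to the skip connection.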
Squeeze-and-Excitation Mechanics
A Squeeze-and-Excitation block first squeezes the spatial dimensions using global average pooling, producing one value per channel. This channel-wise vector is passed through a small bottleneck of two fully connected layers: the first reduces the channel dimension, the second restores it, and a final sigmoid maps the result to attention weights between 0 and 1.
These attention weights are applied back to the original feature maps via element-wise multiplication. This excitation step allows the network to dynamically focus on relevant feature channels, improving class discrimination.
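Written out for a feature map \\(x\\) with \\(C\\) channels and spatial size \\(H \\times W\\), the squeeze, excitation, and rescaling steps are shown below. The symbols are chosen here for illustration: \\(W_1\\) and \\(W_2\\) are the weights of the reducing and expanding fully connected layers, \\(\\delta\\) is a ReLU, and \\(\\sigma\\) is the sigmoid.

\\[ z_c = \\frac{1}{HW} \\sum_{i=1}^{H} \\sum_{j=1}^{W} x_c(i, j), \\qquad s = \\sigma\\big(W_2 \\, \\delta(W_1 z)\\big), \\qquad \\tilde{x}_c = s_c \\cdot x_c \\]

Each \\(s_c\\) lies between 0 and 1, so the block can only attenuate a channel relative to the others; it never changes the spatial layout within a channel.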
PyTorch Implementation of Squeeze-and-Excitation
Let's build a custom Squeeze-and-Excitation block in PyTorch to observe channel-wise scaling dynamics.
Custom SE Block
The code below shows how to implement a Squeeze-and-Excitation block in PyTorch, highlighting the squeeze (Global Average Pooling) and excitation (channel attention) steps.
<pre><code class="language-python">import torch
import torch.nn as nn


class SqueezeExcitation(nn.Module):
    def __init__(self, in_c, reduction=16):
        super().__init__()
        # Squeeze: global average pooling down to one value per channel
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck MLP ending in a sigmoid gate
        self.excitation = nn.Sequential(
            nn.Linear(in_c, in_c // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(in_c // reduction, in_c, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Squeeze to [batch, channels]
        squeezed = self.squeeze(x).view(b, c)
        # Excitation: reshape weights to [batch, channels, 1, 1]
        weights = self.excitation(squeezed).view(b, c, 1, 1)
        # Scale features element-wise (weights broadcast over H and W)
        return x * weights


x = torch.randn(2, 64, 32, 32)
se = SqueezeExcitation(in_c=64)
out = se(x)
print("SE Output shape:", out.shape)  # torch.Size([2, 64, 32, 32])</code></pre>
In this block, the squeeze operation averages each channel over its spatial positions, and the excitation MLP turns the resulting vector into one weight per channel. Multiplying the input by these weights preserves its shape while emphasizing the channels the block scores as important.
Model Scaling Diagnostics
When scaling an architecture up, it is worth verifying that activations do not saturate and that the model still fits in device memory. The torchvision.models module provides the EfficientNet variants, optionally with pre-trained weights.
This makes it easy to compare scaling levels (from efficientnet_b0 to efficientnet_b7) and to choose the width, depth, and resolution that fit the target task and memory budget.