MobileNet and Depthwise Separable Convolutions

MobileNet utilizes depthwise separable convolutions to drastically reduce computation and model size, making deep learning feasible on mobile and edge devices.


Depthwise Separable Convolutions

Depthwise separable convolutions split standard convolutions into independent spatial and channel projection phases.

Depthwise Convolution Mechanics

Standard convolutions perform spatial and channel mixing in a single step. Depthwise separable convolutions decompose this operation into two separate steps: a Depthwise Convolution followed by a Pointwise Convolution.

In a depthwise convolution, each input channel is processed on its own. For a \\(C\\)-channel input, we apply \\(C\\) independent 2D filters of size \\(K \\times K\\), one per channel. This performs spatial filtering within each channel but does not mix information across channels, which is where most of the arithmetic savings come from.
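To make the per-channel behaviour concrete, here is a minimal pure-Python sketch of a depthwise convolution on nested lists, assuming stride 1 and no padding. The function name `depthwise_conv2d` and the toy values are illustrative assumptions, not part of any library; real code would use a framework layer instead.

```python
def depthwise_conv2d(x, kernels):
    """Depthwise convolution, stride 1, no padding (illustrative helper).

    x:       list of C channels, each an H x W list of lists.
    kernels: list of C filters, each a K x K list of lists.
    Channel c of the output depends only on channel c of the input.
    """
    assert len(x) == len(kernels), "one filter per input channel"
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        rows = []
        for i in range(h - k + 1):
            row = []
            for j in range(w - k + 1):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        acc += channel[i + di][j + dj] * kernel[di][dj]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out


# Two 3x3 channels and two 2x2 filters (toy values).
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each 2x2 window
    [[1, 1], [1, 1]],  # sums every value in each 2x2 window
]
y = depthwise_conv2d(x, kernels)
print(y[0])  # [[6.0, 8.0], [12.0, 14.0]]
print(y[1])  # [[2.0, 2.0], [2.0, 2.0]]
```

Changing the values in the second input channel leaves `y[0]` untouched, which is exactly the "no cross-channel mixing" property described above.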

Pointwise Convolution Mechanics

After spatial filtering, a pointwise convolution is applied to mix features across channels. A pointwise convolution is a standard 1x1 convolution that projects the depthwise outputs to the desired channel depth.
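As a rough illustration of the channel-mixing step, a 1x1 convolution can be viewed as applying the same small matrix to the channel vector at every pixel. The sketch below uses plain nested lists; `pointwise_conv2d` and the weight values are hypothetical names and numbers chosen for this example.

```python
def pointwise_conv2d(x, weights):
    """1x1 convolution: at each pixel, mix the C_in channel values.

    x:       list of C_in channels, each an H x W list of lists.
    weights: C_out x C_in matrix (list of lists).
    Returns a list of C_out channels, each H x W.
    """
    c_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_row in weights:
        assert len(w_row) == c_in, "each output channel needs C_in weights"
        out.append([
            [sum(w_row[c] * x[c][i][j] for c in range(c_in))
             for j in range(width)]
            for i in range(height)
        ])
    return out


# Two input channels projected to three output channels.
x = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
weights = [
    [1, 0],  # copies input channel 0
    [0, 1],  # copies input channel 1
    [1, 1],  # sums both input channels
]
y = pointwise_conv2d(x, weights)
print(len(y))  # 3
print(y[2])    # [[11, 22], [33, 44]]
```

The spatial size is unchanged; only the channel depth moves from 2 to 3, which is the "projection to the desired channel depth" role of the pointwise step.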

This factorization reduces the computational cost significantly. For kernel size \\(K\\) and \\(C_{out}\\) output channels, a depthwise separable convolution costs a fraction

\\(\\frac{1}{C_{out}} + \\frac{1}{K^2}\\)

of a standard convolution, counted in multiply-adds. With 3x3 kernels the \\(\\frac{1}{K^2}\\) term is \\(\\frac{1}{9}\\), so computation and parameters drop by roughly 8 to 9 times (close to 90%) for typical channel widths, usually with only a minor drop in accuracy. This helps models fit within edge-device memory.
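The ratio can be checked numerically by counting multiply-adds for both layer types. This is a minimal sketch that assumes the output feature map has the same spatial size as the input (stride 1 with "same" padding); the helper names and the layer sizes are illustrative choices.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    # Every output value sums over a k x k window across all input channels.
    return k * k * c_in * c_out * h * w


def separable_conv_macs(c_in, c_out, k, h, w):
    depthwise = k * k * c_in * h * w  # one k x k filter per input channel
    pointwise = c_in * c_out * h * w  # 1x1 mixing across channels
    return depthwise + pointwise


c_in, c_out, k, h, w = 32, 64, 3, 64, 64
std = standard_conv_macs(c_in, c_out, k, h, w)
sep = separable_conv_macs(c_in, c_out, k, h, w)
ratio = sep / std

print(std, sep)                          # 75497472 9568256
print(round(ratio, 4))                   # 0.1267
print(round(1 / c_out + 1 / k ** 2, 4))  # 0.1267
```

The measured ratio matches the closed form because the shared factor of input channels and spatial positions cancels, leaving only the kernel-size term and the output-channel term.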

MobileNet Versions and Features

MobileNet v2 refined the original design by introducing inverted residual shortcuts and linear bottleneck layers.

MobileNet v1 vs. v2 (Inverted Residuals)

MobileNet v1 established the use of depthwise separable convolutions. MobileNet v2 introduced two key improvements: Inverted Residual Blocks and Linear Bottlenecks.

Bottleneck residual blocks in ResNet reduce the channel count with a 1x1 conv, apply a 3x3 convolution, and then expand the channels again. Inverted residual blocks do the opposite: they first expand the channels with a 1x1 conv so that the depthwise convolution extracts features in a high-dimensional space, and then project back to a low-dimensional space. The shortcut connects the narrow ends of the block, so the representation passed between blocks stays compact.
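The difference between the two block shapes is easiest to see by tracing channel widths through each one. The sketch below is only a bookkeeping aid: the helper names, the reduction factor of 4, and the example widths are assumptions chosen for illustration.

```python
def resnet_bottleneck_widths(channels, reduction=4):
    # wide -> narrow (1x1) -> narrow (3x3) -> wide (1x1)
    narrow = channels // reduction
    return [channels, narrow, narrow, channels]


def inverted_residual_widths(in_c, out_c, expansion_factor=6):
    # narrow -> wide (1x1 expand) -> wide (3x3 depthwise) -> narrow (1x1 project)
    hidden = in_c * expansion_factor
    return [in_c, hidden, hidden, out_c]


print(resnet_bottleneck_widths(256))     # [256, 64, 64, 256]
print(inverted_residual_widths(24, 24))  # [24, 144, 144, 24]
```

In the first trace the shortcut joins the wide ends; in the second it joins the narrow ends, which is why the block is called "inverted".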

Linear Bottlenecks and Information Flow

Non-linear activations like ReLU discard negative values, which can destroy information in low-dimensional feature channels.

To address this, MobileNet v2 omits the activation after the projection layer in each block, leaving that layer linear. This avoids discarding information in the narrow bottleneck, where there are too few channels to compensate for values zeroed out by ReLU.
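A small hand-built example can illustrate the argument. It is a toy sketch, not the learned behaviour of any network: the `expand` and `project` matrices below are hypothetical and chosen by hand so that the effect is easy to verify.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]


# Two distinct 2-D feature vectors that differ only in a negative coordinate.
a = [-1.0, 2.0]
b = [-3.0, 2.0]

# ReLU applied directly in the low-dimensional space makes them identical.
print(relu(a), relu(b))  # [0.0, 2.0] [0.0, 2.0]

# Expand to 4-D first: each coordinate gets a positive and a negative copy.
expand = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
ha = relu(matvec(expand, a))
hb = relu(matvec(expand, b))
print(ha)  # [0.0, 1.0, 2.0, 0.0]
print(hb)  # [0.0, 3.0, 2.0, 0.0]

# A linear projection (no activation) recovers the original vectors.
project = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
print(matvec(project, ha))  # [-1.0, 2.0]
print(matvec(project, hb))  # [-3.0, 2.0]
```

With enough channels, ReLU keeps the two inputs distinguishable, and the final linear projection returns to the narrow space without a second round of clipping.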

PyTorch Implementation of Depthwise Separable Convolutions

Let's build a Depthwise Separable block and a MobileNet v2 Inverted Residual block in PyTorch.

Custom Depthwise Separable Layer

The code below shows how to implement a Depthwise Separable Convolution block in PyTorch, highlighting the groups parameter.

<pre><code class="language-python">import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        # Depthwise: groups = in_channels splits convolution channel-by-channel
        self.depthwise = nn.Conv2d(
            in_channels=in_c,
            out_channels=in_c,
            kernel_size=3,
            stride=stride,
            padding=1,
            groups=in_c,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(in_c)
        self.pointwise = nn.Conv2d(
            in_channels=in_c,
            out_channels=out_c,
            kernel_size=1,
            bias=False,
        )
        self.bn2 = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x


x = torch.randn(2, 32, 64, 64)
conv = DepthwiseSeparableConv(in_c=32, out_c=64)
out = conv(x)
print("DWS Conv output shape:", out.shape)  # [2, 64, 64, 64]</code></pre>

By setting groups=in_c, we configure PyTorch to apply filters channel-by-channel without cross-channel mixing, achieving the spatial step of depthwise separable convolution.
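The effect of the groups argument shows up directly in the weight tensor shape, which PyTorch documents for Conv2d as (out_channels, in_channels / groups, kernel height, kernel width). The sketch below reproduces that bookkeeping in plain Python for the 32-to-64 channel layer used above; the helper names are hypothetical, and BatchNorm parameters are left out of the count.

```python
def conv2d_weight_shape(in_c, out_c, k, groups=1):
    # Mirrors the documented Conv2d weight layout for a square kernel.
    assert in_c % groups == 0 and out_c % groups == 0
    return (out_c, in_c // groups, k, k)


def num_weights(shape):
    n = 1
    for dim in shape:
        n *= dim
    return n


depthwise = conv2d_weight_shape(32, 32, 3, groups=32)
pointwise = conv2d_weight_shape(32, 64, 1)
standard = conv2d_weight_shape(32, 64, 3)

print(depthwise)  # (32, 1, 3, 3)
print(pointwise)  # (64, 32, 1, 1)
print(num_weights(depthwise) + num_weights(pointwise))  # 2336
print(num_weights(standard))                            # 18432
```

The second dimension of the depthwise weight is 1: each filter sees exactly one input channel, which is what removes the cross-channel mixing from the spatial step.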

MobileNet v2 Inverted Residual Block

Below is the implementation of a MobileNet v2 Inverted Residual Block, demonstrating the channel expansion, depthwise spatial filtering, and linear projection.

<pre><code class="language-python">import torch
import torch.nn as nn


class InvertedResidualBlock(nn.Module):
    def __init__(self, in_c, out_c, expansion_factor, stride):
        super().__init__()
        self.use_residual = stride == 1 and in_c == out_c
        hidden_dim = int(in_c * expansion_factor)

        self.block = nn.Sequential(
            # 1. Expansion (1x1 Conv)
            nn.Conv2d(in_c, hidden_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU6(inplace=True),
            # 2. Depthwise Conv
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride,
                      padding=1, groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU6(inplace=True),
            # 3. Linear Project (1x1 Conv - NO ACTIVATION)
            nn.Conv2d(hidden_dim, out_c, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_c),
        )

    def forward(self, x):
        if self.use_residual:
            return x + self.block(x)
        else:
            return self.block(x)


x = torch.randn(1, 16, 32, 32)
inverted_res = InvertedResidualBlock(in_c=16, out_c=16, expansion_factor=6, stride=1)
out_inv = inverted_res(x)
print("Inverted Residual shape:", out_inv.shape)  # [1, 16, 32, 32]</code></pre>

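In this code, ReLU6 is used because it caps activations at 6. The bounded output range makes the network more robust under low-precision arithmetic (for example fixed-point or quantized inference) on lightweight hardware. The final projection layer has no activation, which preserves the low-dimensional features. The shortcut is added only when the stride is 1 and the input and output channel counts match, since otherwise the tensor shapes differ.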
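For reference, ReLU6 is simply ReLU clipped at 6. The sketch below defines it in plain Python and, as a rough illustration of why a bounded range suits low-precision hardware, computes the step size of a uniform 8-bit quantizer over [0, 6]. The 8-bit uniform scheme is an assumption made for this example, not something the block above prescribes.

```python
def relu6(x):
    # Clamp to the range [0, 6].
    return min(max(x, 0.0), 6.0)


print([relu6(v) for v in [-2.0, 0.5, 6.0, 11.0]])  # [0.0, 0.5, 6.0, 6.0]

# With a known range of [0, 6], 256 uniform levels give a fixed step size.
step = 6.0 / 255
print(round(step, 4))  # 0.0235
```

Because the range is fixed in advance, the quantization step does not depend on how large the unclipped activations would have grown.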