Deep Convolutional GANs (DCGAN)

Deep Convolutional GANs (DCGANs), introduced by Alec Radford et al. in 2015, established a set of architectural constraints that make deep convolutional networks trainable in the adversarial setting. By using strided convolutions for downsampling and transposed convolutions for upsampling, DCGANs train more stably than earlier convolutional GANs and generate higher-quality images.

Architectural Constraints of DCGAN

DCGAN replaces pooling layers with strided convolutions, removes fully connected hidden layers, and adds batch normalization to stabilize training.

Convolutional Design Rules

Before DCGAN, scaling GANs to deep convolutional architectures was highly unstable, often resulting in divergence and severe mode collapse. Radford et al. identified a set of design rules to stabilize training. First, spatial pooling layers (like max pooling) are removed. In the discriminator, they are replaced with strided convolutions, allowing the network to learn downsampling. In the generator, they are replaced with fractionally-strided (transposed) convolutions, allowing the network to learn spatial upsampling.

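Second, batch normalization is applied in both the generator and the discriminator, except at the generator's output layer and the discriminator's input layer, to stabilize gradient flow and help prevent the generator from collapsing all samples to a single point. Third, fully connected hidden layers are removed: the noise input is projected directly into the first convolutional feature map of the generator, and the last convolutional features of the discriminator feed directly into a single sigmoid output. The paper reports that global average pooling improved stability but slowed convergence, so it is not part of the recommended design. Finally, the generator uses ReLU activations in all layers except the output (which uses Tanh), while the discriminator uses LeakyReLU activations (with a slope of 0.2) in all layers.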
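The effect of the first rule on spatial resolution can be checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below is illustrative only: the helper names conv_out and conv_transpose_out are hypothetical, dilation and output padding are assumed to be absent, and the kernel, stride, and padding values are the ones commonly used in DCGAN implementations.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a standard convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Spatial size after a transposed convolution (no dilation or output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Discriminator-style path: each strided convolution halves the resolution.
down = [32]
for _ in range(3):
    down.append(conv_out(down[-1], kernel=4, stride=2, padding=1))

# Generator-style path: each transposed convolution doubles the resolution.
up = [4]
for _ in range(3):
    up.append(conv_transpose_out(up[-1], kernel=4, stride=2, padding=1))

print(down)  # [32, 16, 8, 4]
print(up)    # [4, 8, 16, 32]
```

With a kernel size of 4, a stride of 2, and a padding of 1, both operations change the resolution by exactly a factor of two, which is why this combination appears so often in DCGAN code.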

Generative Spatial Upsampling

The generator in a DCGAN starts with a 1D noise vector \\( \\mathbf{z} \\) and must upsample it to an image (e.g., \\( 64 \\times 64 \\times 3 \\)). This is achieved using transposed convolution layers (sometimes called fractionally-strided convolutions). A transposed convolution maps input feature maps to larger output feature maps. Conceptually, it inserts zeros between neighboring input values, pads the borders of the resulting grid, and performs a standard convolution over it.
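A one-dimensional, dependency-free sketch can make this concrete. It computes a transposed convolution in two ways: by letting each input value add a scaled copy of the kernel into the output, and by inserting zeros, padding, and running an ordinary sliding-window convolution. The function names are hypothetical, and the sketch assumes a single channel with no padding removed from the output, which corresponds to padding=0 in PyTorch's transposed convolution layers.

```python
def conv_transpose_1d(x, kernel, stride):
    """Transposed convolution as scatter-add: each input stamps a scaled kernel copy."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def zero_insert_then_conv(x, kernel, stride):
    """Same result via zero insertion, border padding, and a sliding-window sum."""
    k = len(kernel)
    # Insert (stride - 1) zeros between neighboring input values.
    dilated = []
    for i, value in enumerate(x):
        dilated.append(value)
        if i < len(x) - 1:
            dilated.extend([0.0] * (stride - 1))
    # Pad (k - 1) zeros on both sides so every kernel tap can reach every input.
    padded = [0.0] * (k - 1) + dilated + [0.0] * (k - 1)
    flipped = kernel[::-1]
    return [
        sum(padded[start + j] * flipped[j] for j in range(k))
        for start in range(len(padded) - k + 1)
    ]


x = [1.0, 10.0]
kernel = [1.0, 2.0, 3.0]
print(conv_transpose_1d(x, kernel, stride=2))      # [1.0, 2.0, 13.0, 20.0, 30.0]
print(zero_insert_then_conv(x, kernel, stride=2))  # [1.0, 2.0, 13.0, 20.0, 30.0]
```

Two input values become five output values, and the middle output (13.0) is the only position where the two kernel copies overlap.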

By learning the parameters of the transposed convolution filter, the network can learn how to interpolate values and reconstruct fine spatial details. However, transposed convolutions can introduce checkerboard artifacts when the kernel size is not divisible by the stride, because neighboring output pixels then receive contributions from different numbers of input positions. Practitioners often mitigate this by replacing the transposed convolution with a resize step (nearest-neighbor or bilinear upsampling) followed by a standard convolution.
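The divisibility condition can be illustrated by counting, for a one-dimensional transposed convolution, how many kernel taps land on each output position. This is a simplified sketch with a hypothetical helper name (overlap_counts); it ignores the learned weights, which can in principle compensate for uneven overlap but often do not in practice.

```python
def overlap_counts(n_inputs, kernel, stride):
    """Count how many kernel taps land on each output position (1D, no padding)."""
    counts = [0] * ((n_inputs - 1) * stride + kernel)
    for i in range(n_inputs):
        for j in range(kernel):
            counts[i * stride + j] += 1
    return counts


# Kernel 3, stride 2: interior positions alternate between 1 and 2 contributions.
print(overlap_counts(5, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1]
# Kernel 4, stride 2: every interior position receives exactly 2 contributions.
print(overlap_counts(5, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

The alternating pattern in the first case is the one-dimensional analogue of a checkerboard; in two dimensions the uneven counts multiply along both axes.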

PyTorch DCGAN Implementation

A PyTorch DCGAN implementation defines the generator and discriminator classes and a weight initialization function.

DCGAN Generator and Discriminator

The following PyTorch code implements the Generator and Discriminator networks for \\( 32 \\times 32 \\) images according to the DCGAN design rules:

<pre><code class="language-python">import torch
import torch.nn as nn


class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim, feature_dim, out_channels):
        super().__init__()
        # Input: [batch_size, latent_dim, 1, 1]
        self.net = nn.Sequential(
            # Input is latent_dim, outputting to feature_dim * 8
            nn.ConvTranspose2d(latent_dim, feature_dim * 8, kernel_size=4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(feature_dim * 8),
            nn.ReLU(True),
            # State shape: [batch_size, feature_dim * 8, 4, 4]
            nn.ConvTranspose2d(feature_dim * 8, feature_dim * 4, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(feature_dim * 4),
            nn.ReLU(True),
            # State shape: [batch_size, feature_dim * 4, 8, 8]
            nn.ConvTranspose2d(feature_dim * 4, feature_dim * 2, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(feature_dim * 2),
            nn.ReLU(True),
            # State shape: [batch_size, feature_dim * 2, 16, 16]
            nn.ConvTranspose2d(feature_dim * 2, out_channels, kernel_size=4, stride=2, padding=1, bias=False),
            nn.Tanh()
            # Output shape: [batch_size, out_channels, 32, 32]
        )

    def forward(self, z):
        # z shape: [batch_size, latent_dim, 1, 1]
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    def __init__(self, in_channels, feature_dim):
        super().__init__()
        # Input shape: [batch_size, in_channels, 32, 32]
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feature_dim, kernel_size=4, stride=2, padding=1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # State shape: [batch_size, feature_dim, 16, 16]
            nn.Conv2d(feature_dim, feature_dim * 2, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(feature_dim * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # State shape: [batch_size, feature_dim * 2, 8, 8]
            nn.Conv2d(feature_dim * 2, feature_dim * 4, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(feature_dim * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # State shape: [batch_size, feature_dim * 4, 4, 4]
            nn.Conv2d(feature_dim * 4, 1, kernel_size=4, stride=1, padding=0, bias=False),
            nn.Sigmoid()
            # Output shape: [batch_size, 1, 1, 1]
        )

    def forward(self, img):
        # Flatten [batch_size, 1, 1, 1] to [batch_size, 1]
        return self.net(img).view(-1, 1)</code></pre>

Weight Initialization

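The DCGAN paper initializes all weights from a zero-centered normal distribution with a standard deviation of 0.02 rather than relying on a framework default. The function below applies this to convolutional and transposed convolutional layers. For batch normalization layers it follows a convention from common PyTorch implementations: the scale parameters are drawn around 1.0 with the same standard deviation, and the biases are set to zero. It is typically applied to a model with model.apply(weights_init_dcgan):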

<pre><code class="language-python">def weights_init_dcgan(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        # Matches Conv2d and ConvTranspose2d
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        # Scale centered at 1.0, bias at 0
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)</code></pre>

The small standard deviation of 0.02 keeps initial activations small, which helps keep the Tanh and Sigmoid outputs away from saturation and keeps gradients well behaved during early epochs.
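The class-name matching used above can also be written with isinstance checks, which avoids accidental matches on unrelated class names. The following is a minimal sketch, not the paper's code: the function name init_dcgan and the three-layer stand-in network toy are hypothetical, and the printed statistics only confirm that the sampled weights are close to the requested distribution.

```python
import torch
import torch.nn as nn


def init_dcgan(module):
    """Variant of the DCGAN initializer that dispatches on layer type."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.normal_(module.weight, mean=1.0, std=0.02)
        nn.init.zeros_(module.bias)


torch.manual_seed(0)
# Hypothetical stand-in network; any nn.Module containing these layer types works.
toy = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2),
)
toy.apply(init_dcgan)  # apply() visits every submodule recursively

conv_weight = toy[0].weight.detach()
print(f"conv mean={conv_weight.mean().item():.4f}, std={conv_weight.std().item():.4f}")
```

Because apply() recurses through all submodules, the same call works unchanged on the generator and discriminator classes defined earlier.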

Evaluation and Applications

DCGAN latent spaces support vector arithmetic, and GAN outputs are commonly evaluated using the Inception Score and the Fréchet Inception Distance.

Latent Space Arithmetic

One of the most notable discoveries in the DCGAN paper is that the learned latent space is structured geometrically, supporting vector arithmetic. If we find latent vectors that generate specific features, we can perform vector addition and subtraction to manipulate the output. For example, if we average the latent vectors of several images of a "smiling woman" (vector \\( \\mathbf{z}_{sw} \\)), subtract the average vector of "neutral women" (\\( \\mathbf{z}_{nw} \\)), and add the average vector of "neutral men" (\\( \\mathbf{z}_{nm} \\)):

\\( \\mathbf{z}_{new} = \\mathbf{z}_{sw} - \\mathbf{z}_{nw} + \\mathbf{z}_{nm} \\)

Passing this new vector \\( \\mathbf{z}_{new} \\) through the generator yields an image of a "smiling man." This suggests that the model does not merely memorize training samples but learns a structured, continuous semantic representation of the data distribution.
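The arithmetic itself is simple element-wise averaging, subtraction, and addition. The sketch below uses plain Python lists in place of tensors and made-up three-dimensional vectors so that the result can be checked by hand; the function names and all numeric values are hypothetical, and in practice the resulting vector would be reshaped and passed to a trained generator.

```python
def mean_vector(vectors):
    """Average a list of equal-length latent vectors element-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def latent_arithmetic(z_sw, z_nw, z_nm):
    """Compute z_new = z_sw - z_nw + z_nm element-wise."""
    return [sw - nw + nm for sw, nw, nm in zip(z_sw, z_nw, z_nm)]


# Made-up 3-dimensional latent codes; the DCGAN paper uses 100-dimensional codes.
smiling_women = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral_women = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
neutral_men = [[-2.0, 0.0, 1.0], [-2.0, 0.0, 3.0]]

z_new = latent_arithmetic(
    mean_vector(smiling_women),
    mean_vector(neutral_women),
    mean_vector(neutral_men),
)
print(z_new)  # [-2.0, 2.0, 2.0]
```

Averaging several exemplars per concept before doing the arithmetic matters: single latent vectors tend to give unstable results, while averages are more reliable.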

Evaluation Metrics

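Evaluating GANs is challenging because there is no explicit likelihood function. Two widely used metrics, both introduced after the DCGAN paper, are the Inception Score (IS) and the Fréchet Inception Distance (FID). The Inception Score evaluates the quality and diversity of generated images by passing them through a pre-trained ImageNet classifier (Inception-V3). It computes the KL divergence between the predicted label distribution for each image (which should be sharp, indicating high quality) and the marginal label distribution across all images (which should be close to uniform, indicating high diversity), averages this divergence over the generated images, and exponentiates the result: \\( \\mathrm{IS} = \\exp\\left( \\mathbb{E}_{\\mathbf{x}} \\left[ D_{\\mathrm{KL}}\\left( p(y \\mid \\mathbf{x}) \\,\\|\\, p(y) \\right) \\right] \\right) \\). Higher scores are better.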
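Under its usual definition, the Inception Score is the exponential of the average KL divergence between each per-image label distribution and the marginal label distribution. The sketch below computes it from hand-written probability vectors rather than real Inception-V3 outputs; the function name and the toy distributions are hypothetical, and practical implementations typically use tens of thousands of images.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over per-image class distributions."""
    n_images = len(probs)
    n_classes = len(probs[0])
    # Marginal label distribution p(y): average of the per-image predictions.
    marginal = [sum(p[c] for p in probs) / n_images for c in range(n_classes)]
    total_kl = 0.0
    for p in probs:
        # Terms with p[c] == 0 contribute nothing to the KL sum.
        total_kl += sum(
            p[c] * math.log(p[c] / marginal[c]) for c in range(n_classes) if p[c] > 0
        )
    return math.exp(total_kl / n_images)


sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_but_collapsed = [[1.0, 0.0, 0.0]] * 3
print(round(inception_score(sharp_and_diverse), 6))    # 3.0
print(round(inception_score(sharp_but_collapsed), 6))  # 1.0
```

The score ranges from 1 (every image receives the same prediction) up to the number of classes (confident predictions spread evenly over all classes), so a mode-collapsed generator scores poorly even when each image looks sharp.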

The Fréchet Inception Distance (FID) improves on IS by comparing the distribution of generated images with the distribution of real images. It extracts feature representations from an intermediate layer of the Inception network for both real and generated images, fits a multivariate Gaussian to each set of features, and calculates the Fréchet distance between the two Gaussians: \\( d^2 = ||\\mu_r - \\mu_g||^2 + \\text{Tr}(\\Sigma_r + \\Sigma_g - 2(\\Sigma_r \\Sigma_g)^{1/2}) \\). Lower FID scores indicate that the generated images are statistically more similar to the real dataset, and a score of 0 means the two fitted Gaussians are identical.
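The formula can be worked through in a simplified setting. The sketch below assumes diagonal covariance matrices, in which case the matrix square root is element-wise and the trace term reduces to a sum of squared differences of standard deviations. This is a simplification for illustration, not how FID is computed in practice: real implementations use full covariance matrices of Inception features and a general matrix square root (for example scipy.linalg.sqrtm). The function name and the input values are hypothetical.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two Gaussians with diagonal covariances.

    With diagonal covariances the trace term equals the sum over dimensions of
    var_r + var_g - 2 * sqrt(var_r * var_g), i.e. (sqrt(var_r) - sqrt(var_g)) ** 2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + trace_term


# Identical Gaussians: distance is zero.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
# Means differ by (3, 4) and one variance differs: 25 from the means, 1 from the trace.
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [3.0, 4.0], [1.0, 1.0]))  # 26.0
```

The second example shows the two ways a generator can be penalized: its feature means can be shifted away from the real data, or its feature spread can differ from the real data.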