Filters/Kernels for Feature Extraction (Edges, Shapes)
Convolutional kernels function as spatial feature detectors; while classical computer vision used hand-designed kernels to locate edges, CNNs learn these filters automatically through backpropagation.
Classical Hand-Crafted Kernels
Before deep learning, computer vision relied on manually designed kernels to detect features.
Edge Detection with Sobel and Prewitt Filters
Before deep learning, computer vision relied on manually designed kernels to detect features. The Sobel filter is a classic edge detection kernel. The vertical Sobel kernel is defined as:
\\(K_{vert} = \\begin{bmatrix} -1 & 0 & 1 \\\\ -2 & 0 & 2 \\\\ -1 & 0 & 1 \\end{bmatrix}\\)
When convolved with an image, this kernel computes the horizontal gradient of pixel intensities, highlighting vertical edges where intensities change rapidly. Similarly, horizontal Sobel and Prewitt filters detect horizontal boundaries. These filters are useful for highlighting structural edges in images, but they are limited because they cannot adapt to different datasets or learn more complex features.
Gaussian Blur and Sharpening Kernels
Other classical filters include Gaussian blur kernels, which smooth images by averaging neighboring pixel values, and sharpening kernels, which amplify high-frequency details. These operations are used for preprocessing images to reduce noise or enhance features before classification.
While these hand-crafted filters were effective for simple tasks, they required significant manual tuning and could not capture semantic details. Deep learning models replace these manual filters with learnable weights, discovering optimal kernels during training.
Learnable Filters in CNNs
Modern neural networks replace manual kernel configurations with data-driven weight optimization.
Gradient-Based Filter Optimization
In a CNN, the values of the kernels are parameters that are initialized randomly. During training, the network computes the gradient of the loss with respect to these weights. Backpropagation updates the kernel values using gradient descent, tuning them to detect patterns that help minimize the classification error.
This optimization process allows the model to learn filters that are tailored to the training dataset. Instead of using hand-designed edge detectors, the model learns the exact patterns—such as specific angles, color transitions, or textures—that are most useful for the target task.
Hierarchical Feature Representation
Stacked convolutional layers form a hierarchical representation of visual features. Early layers learn simple filters that detect edges, corners, and color gradients. Middle layers combine these simple features to detect textures and local parts (such as wheels or eyes).
Deep layers combine the middle-level features to represent complete semantic shapes (like cars or faces). This hierarchical learning allows the model to build complex representations from simple pixel inputs, mimicking the human visual cortex.
Custom Kernel Application in PyTorch
Let's look at how to load manual weights into a Conv2d layer to observe its feature extraction behavior.
Injecting Manual Weights into nn.Conv2d
We can load manual weights (like a Sobel filter) into a PyTorch convolutional layer to observe how it extracts features. This is done by wrapping the manual matrix in a tensor and assigning it to the layer's weight parameter.
In this code, we disable gradient tracking for the weights by setting requires_grad=False. This locks the manual filter values in place, allowing us to use the layer as a deterministic preprocessing step.
Visualizing Filter Outputs
The output of this layer is a feature map where high positive or negative values correspond to edges. By visualizing these outputs, we can confirm that the manual Sobel filter has highlighted the vertical boundaries in the input tensor.
In a trained CNN, we can visualize the learned filters in the same way. This analysis helps interpret the model's decisions, confirming that it has learned to focus on relevant features rather than training noise.