Digital Image Representation (RGB Matrices)
Digital images are represented in computer systems as multi-dimensional arrays, in which color channels record red, green, and blue light intensities across a 2D spatial grid.
Pixel Grids and Channels
Every image consists of a grid of spatial coordinates (pixels), with each pixel mapped to numerical values representing light intensities.
RGB Tensor Structure
A color image is represented as a 3D array (a rank-3 tensor) with three channels holding the Red, Green, and Blue (RGB) intensities. Each channel is a 2D grid of intensity values that, for 8-bit images, range from 0 (none of that color) to 255 (full intensity of that color). Each coordinate \\((y, x)\\) identifies a pixel, and its value in a given channel is a scalar giving the intensity of that channel. For example, a \\(224 \\times 224\\) RGB image is represented as a tensor of shape \\((3, 224, 224)\\) in PyTorch.
Color space representations allow digital screens to mix primary colors to render a wide range of visible colors. Standard RGB is an additive color model: combining maximum intensities of red, green, and blue produces pure white, while the absence of all light produces black. In machine learning, these integer intensities are typically scaled to floating-point values between 0.0 and 1.0, or standardized using a per-channel mean and standard deviation, to stabilize neural network optimization.
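The additive model and the channel structure can be illustrated with a very small array. The following is a minimal sketch using NumPy; the 2x2 image and its variable names are made up for illustration.

```python
import numpy as np

# A tiny 2x2 RGB image in (height, width, channels) layout, 8 bits per channel.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = (255, 0, 0)      # pure red
image[0, 1] = (255, 255, 0)    # red + green mix additively to yellow
image[1, 0] = (255, 255, 255)  # all three channels at maximum: white
# image[1, 1] is left at (0, 0, 0): no light, so black

# Each channel on its own is a 2D grid of intensities.
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

# Scale the integer intensities to floats in [0.0, 1.0].
image_float = image.astype(np.float32) / 255.0
```

Here each of red, green, and blue has shape (2, 2), and image_float keeps the (2, 2, 3) shape with values between 0.0 and 1.0.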
Grayscale and Alternative Color Spaces
Grayscale images are represented as a 2D matrix (a single channel), where each pixel value is a light intensity. Alternative color spaces include HSV (Hue, Saturation, Value) and YCbCr (luma plus blue-difference and red-difference chroma). These color spaces separate brightness information from color information, which is helpful in classical computer vision, for example when an algorithm should respond to hue regardless of lighting.
When feeding images into deep networks, the choice of color space depends on the task. While RGB is the standard for most general computer vision models, YCbCr is widely used in image and video compression algorithms, and HSV can be helpful for tasks that require color-based segmentation. Converting between color spaces is typically handled using preprocessing libraries like OpenCV or Pillow.
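As a rough illustration of separating brightness from color, the sketch below collapses an RGB array to a single luma channel and converts one pixel to HSV. It assumes the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and uses Python's standard colorsys module; the helper name rgb_to_gray is hypothetical, and in practice a library routine such as OpenCV's cvtColor or Pillow's Image.convert would normally be used instead.

```python
import colorsys

import numpy as np


def rgb_to_gray(image_hwc):
    """Collapse an (H, W, 3) RGB array into an (H, W) luma array.

    Uses the BT.601 weights; other standards use slightly different ones.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Matrix product over the last axis: a weighted sum of R, G, and B per pixel.
    return image_hwc.astype(np.float32) @ weights


image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = (255, 255, 255)  # white
image[0, 1] = (255, 0, 0)      # pure red

gray = rgb_to_gray(image)  # shape (2, 2): one intensity per pixel

# colorsys works on single pixels with components in [0.0, 1.0].
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)  # pure red
```

The white pixel maps to a luma of about 255, the red pixel to about 76, and pure red in HSV has hue 0.0 with full saturation and value.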
Channel Formats in Deep Learning
Different deep learning frameworks expect the channel dimension at different positions in the tensor shape.
Channel-First vs. Channel-Last Layouts
Channel-Last (HWC) structures tensors as \\((Height, Width, Channels)\\), which is common in NumPy, OpenCV, and TensorFlow. Channel-First (CHW) structures tensors as \\((Channels, Height, Width)\\), which is the default format for PyTorch.
The choice of memory layout has performance implications for GPU training. PyTorch stores tensors in contiguous NCHW order by default, so each channel's spatial grid occupies a contiguous block of memory. Modern NVIDIA GPUs with Tensor Cores often run convolutions faster in the channels-last (NHWC) memory format, particularly with mixed precision, because that layout matches what the underlying kernels expect and avoids internal layout conversions.
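In PyTorch, the channels-last memory format changes how a tensor is laid out in memory without changing its logical shape. The following is a small sketch of that distinction; the tensor sizes are arbitrary, and whether the layout actually speeds up a given model depends on the hardware and the operators involved.

```python
import torch

# A batch of two 3-channel 4x5 images in the default NCHW layout.
x = torch.randn(2, 3, 4, 5)
default_strides = x.stride()  # channel planes are stored one after another

# Same logical shape, but channels become the fastest-varying dimension in memory.
x_cl = x.to(memory_format=torch.channels_last)
channels_last_strides = x_cl.stride()

# Indexing is unchanged: x_cl[n, c, h, w] still refers to the same element.
same_values = torch.equal(x, x_cl)
```

The shape stays (2, 3, 4, 5) in both cases; only the strides differ, with the channel stride becoming 1 in the channels-last tensor.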
PyTorch Layout Conversion
PyTorch provides the permute and transpose methods to convert between Channel-First and Channel-Last formats: permute reorders all dimensions at once, while transpose swaps two of them. Conversion is necessary when images are loaded with OpenCV (which returns HWC arrays, with the channels in BGR order) and passed to PyTorch models (which expect CHW).
The permutation does not copy the underlying data buffer; it returns a view of the same storage with reordered strides. This makes the operation cheap, but the result is no longer contiguous in memory. Operations that require contiguous memory, such as view(), fail on such a tensor, so calling contiguous() afterwards (which does copy the data into the new order) is sometimes required.
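The view-versus-copy behavior can be checked directly. This is a minimal sketch in which a synthetic NumPy array stands in for an image loaded from disk; the variable names are illustrative only.

```python
import numpy as np
import torch

# Stand-in for a loaded image: (H, W, C) = (4, 5, 3), uint8.
hwc = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
t_hwc = torch.from_numpy(hwc)

# permute reorders the dimensions by changing strides; no data is copied.
t_chw = t_hwc.permute(2, 0, 1)
shares_memory = t_chw.data_ptr() == t_hwc.data_ptr()

# The permuted view is not contiguous, so flattening it with view() raises.
try:
    t_chw.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# contiguous() materializes a copy stored in CHW order.
t_chw_c = t_chw.contiguous()
```

After contiguous(), the tensor has the usual CHW strides and view() works again, at the cost of one copy of the data.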
Preprocessing and Normalization Pipelines
Preprocessing transformations prepare raw pixel values for optimization inside the neural network.
Integer Scaling and Standardization
For 8-bit images, raw pixel values range from 0 to 255. Before feeding them to a neural network, these values are typically scaled to the range \\([0.0, 1.0]\\) by dividing by 255.0. This scaling keeps input features in a small, stable range, which helps prevent large activations and helps gradient descent converge faster.
Standardizing the scaled images with a per-channel mean and standard deviation is another best practice. In transfer learning, models are initialized with weights pre-trained on datasets like ImageNet, and input images must be normalized with the same statistics that were used during pre-training (for ImageNet, commonly mean \\([0.485, 0.456, 0.406]\\) and std \\([0.229, 0.224, 0.225]\\) in RGB order) so that the inputs match the distribution the model expects.
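The two steps, scaling and then per-channel standardization, can be written out by hand. The sketch below uses NumPy and the ImageNet statistics quoted above; the function name preprocess and the random test image are assumptions made for illustration.

```python
import numpy as np

# Per-channel statistics in RGB order, for inputs already scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_hwc_uint8):
    """Scale to [0, 1], standardize per channel, and return a CHW float32 array."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    # The (3,) statistics broadcast across the last (channel) axis.
    standardized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # Reorder (H, W, C) -> (C, H, W) for a channel-first model.
    return np.transpose(standardized, (2, 0, 1))


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
out = preprocess(image)  # shape (3, 224, 224)
```

Note that the statistics are applied after dividing by 255.0; applying them to raw 0 to 255 values would produce inputs far outside the range the pre-trained weights expect.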
Data Loading and Pipeline Integration
In PyTorch, these normalization operations are integrated into the data pipeline using torchvision.transforms. Transforms are chained together using Compose and applied to the dataset's raw images during data loading.
Using pre-defined transformation pipelines ensures consistency between the training and inference environments. A mismatch in normalization parameters between training and testing shifts the input distribution the model sees and can cause a drop in classification accuracy.