Tensors: N-Dimensional Data Structures
Real-world AI data often has more than two dimensions. A single color video, for example, is 4D (Height, Width, Color, Time). Deep learning represents such data with tensors: arrays that can have any number of dimensions.
The Concept of Rank and Shape
Tensors are the core data structure of modern deep learning frameworks. They generalize scalars, vectors, and matrices to an arbitrary number of dimensions.
Tensor Rank (Order)
The Rank of a tensor refers to its number of dimensions (axes). This is not the same as matrix rank in linear algebra, which counts a matrix's linearly independent rows or columns. Common cases:
- Rank-0: Scalar (a single number: `5.0`)
- Rank-1: Vector (a 1D array of numbers: `[1.0, 2.0, 3.0]`)
- Rank-2: Matrix (a 2D grid: `[[1, 2], [3, 4]]`)
- Rank-3: 3D array (a cube of numbers)
- Rank-N: High-dimensional arrays.
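The rank list above can be checked with a short sketch. It assumes NumPy as a stand-in for a deep learning framework (the text does not prescribe a library), and the variable names are illustrative only.

```python
import numpy as np

scalar = np.array(5.0)                 # rank-0: a single number
vector = np.array([1.0, 2.0, 3.0])     # rank-1: a 1D array
matrix = np.array([[1, 2], [3, 4]])    # rank-2: a 2D grid
cube = np.zeros((2, 2, 2))             # rank-3: a cube of numbers

# ndim reports the number of axes, i.e. the tensor rank.
ranks = [t.ndim for t in (scalar, vector, matrix, cube)]
print(ranks)  # [0, 1, 2, 3]

# Matrix rank is a different notion: the number of linearly independent rows.
singular = np.array([[1, 2], [2, 4]])   # second row is 2x the first
print(singular.ndim)                    # 2 (two axes, so tensor rank 2)
print(np.linalg.matrix_rank(singular))  # 1 (only one independent row)
```

The last two lines show why the two meanings of "rank" should not be confused: the same array has two axes but only one linearly independent row.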
Tensor Shape
The Shape of a tensor is a tuple of integers specifying the size of the tensor along each axis. For example, a matrix with 3 rows and 5 columns has a shape of `(3, 5)`. A 3D tensor representing a grid of numbers with 4 layers, 3 rows, and 2 columns has a shape of `(4, 3, 2)`.
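As a minimal sketch of the two shapes just described, again assuming NumPy as the array library:

```python
import numpy as np

matrix = np.zeros((3, 5))     # 3 rows, 5 columns
grid = np.zeros((4, 3, 2))    # 4 layers, 3 rows, 2 columns

print(matrix.shape)     # (3, 5)
print(grid.shape)       # (4, 3, 2)

# The rank is the length of the shape tuple.
print(len(grid.shape))  # 3

# The total number of elements is the product of the sizes: 4 * 3 * 2.
print(grid.size)        # 24
```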
Representing Complex Data Types
Tensors let deep learning frameworks represent many kinds of structured, multi-dimensional real-world data with a single data structure.
Image Representations (Rank-2 to Rank-4 Tensors)
- Single Grayscale Image: Rank-2 tensor of shape `(Height, Width)`, containing pixel intensity values.
- Single Color Image: Rank-3 tensor of shape `(Height, Width, Channels)` (RGB channels).
- Batch of Color Images: Rank-4 tensor of shape `(BatchSize, Height, Width, Channels)` (channels-last, the TensorFlow default) or `(BatchSize, Channels, Height, Width)` (channels-first, the PyTorch default), representing an entire batch of data passed to a Convolutional Neural Network (CNN) in a single training step.
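The image shapes listed above can be sketched as follows. NumPy is assumed as the array library, and the sizes (28 by 28 pixels, 3 channels, batch of 8) are illustrative values, not taken from the text.

```python
import numpy as np

height, width, channels, batch_size = 28, 28, 3, 8  # illustrative sizes

gray = np.zeros((height, width), dtype=np.uint8)             # rank-2
color = np.zeros((height, width, channels), dtype=np.uint8)  # rank-3
batch_nhwc = np.zeros((batch_size, height, width, channels), dtype=np.uint8)

# Reorder the axes from (Batch, Height, Width, Channels)
# to (Batch, Channels, Height, Width).
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))

print(gray.ndim, color.ndim, batch_nhwc.ndim)  # 2 3 4
print(batch_nchw.shape)                        # (8, 3, 28, 28)
```

The transpose only changes the order of the axes; both layouts hold the same pixel values.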
Text and Video Representations (Rank-3 and Rank-5)
- Batch of Word Sequences: Rank-3 tensor of shape `(BatchSize, SequenceLength, EmbeddingDimension)`, representing sequences of text tokens, each embedded as a vector in a high-dimensional semantic space.
- Video Data: Rank-5 tensor of shape `(BatchSize, Frames, Height, Width, Channels)`, representing a batch of video clips where each clip contains multiple color image frames sequenced over time.
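A rough sketch of both shapes, assuming NumPy and made-up sizes. The embedding table here is random and purely hypothetical; in a real model it would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Text: look up one embedding vector per token id.
vocab_size, embedding_dim = 100, 16   # illustrative sizes
batch_size, seq_len = 4, 10
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

embedded = embedding_table[token_ids]  # integer-array indexing
print(embedded.shape)                  # (4, 10, 16)

# Video: a batch of clips, each a sequence of color frames.
frames, height, width, channels = 12, 32, 32, 3
video_batch = np.zeros(
    (batch_size, frames, height, width, channels), dtype=np.uint8
)
print(video_batch.ndim)                # 5

# Indexing the first two axes selects one frame of one clip.
first_frame = video_batch[0, 0]
print(first_frame.shape)               # (32, 32, 3)
```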
Tensor Operations: Reshaping, Slicing, and Broadcasting
Changing how a tensor is shaped or indexed without needlessly copying its underlying memory is a fundamental skill in implementing deep learning models.
Reshaping and View Operations
In PyTorch or TensorFlow, we often need to flatten a tensor or otherwise change its shape. For example, before feeding convolutional features into a fully connected layer, we flatten a tensor of shape `(BatchSize, 16, 7, 7)` into a Rank-2 tensor of shape `(BatchSize, 784)`, since 16 × 7 × 7 = 784. In PyTorch, `.view()` changes only the shape metadata and never copies, but it requires a compatible (typically contiguous) memory layout. `.reshape()` returns a view when it can and falls back to a copy when it cannot. For contiguous tensors, both are therefore extremely fast.
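The flattening step can be sketched as below. NumPy is assumed here as a stand-in: its `reshape` returns a view when the memory layout allows it and a copy otherwise, and `np.shares_memory` makes the difference visible. The batch size of 2 is an illustrative value.

```python
import numpy as np

batch_size = 2  # illustrative value
features = np.arange(batch_size * 16 * 7 * 7, dtype=np.float32)
features = features.reshape(batch_size, 16, 7, 7)

# Flatten everything except the batch axis: 16 * 7 * 7 = 784.
flat = features.reshape(batch_size, -1)
print(flat.shape)                             # (2, 784)
print(np.shares_memory(features, flat))       # True: only metadata changed

# After reordering axes the data is no longer contiguous in the new order,
# so reshaping has to copy the values into a fresh buffer.
transposed = features.transpose(0, 2, 3, 1)   # shape (2, 7, 7, 16)
flat_copy = transposed.reshape(batch_size, -1)
print(transposed.flags["C_CONTIGUOUS"])       # False
print(np.shares_memory(features, flat_copy))  # False: a copy was made
```

The second case is why a cheap reshape cannot be taken for granted after a transpose or permute.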
Broadcasting: Implicit Dimension Extension
When performing arithmetic between tensors of different shapes, frameworks apply broadcasting rules: shapes are aligned from the trailing axis, and two sizes are compatible when they are equal or one of them is 1. If you add a Rank-1 tensor of shape `(3,)` to a Rank-2 tensor of shape `(4, 3)`, the framework treats the 1D tensor as if it were repeated along the missing row dimension, so both operands behave as shape `(4, 3)` and element-wise addition happens without actually duplicating the data in memory.