Datasets and DataLoaders in PyTorch

PyTorch provides the Dataset and DataLoader classes for building clean, efficient data pipelines. They separate data storage and preprocessing from model training, and they support batching, shuffling, and parallel loading.


Data Pipeline Abstractions

PyTorch separates dataset storage from batch scheduling using distinct class abstractions.

The Dataset Class

The torch.utils.data.Dataset class represents a collection of data samples. It abstracts the storage location, so the samples can live in memory, on a local disk, or in remote storage such as cloud buckets.

To create a custom (map-style) dataset, we subclass Dataset and typically implement three methods: __init__() (to set up data paths and transforms), __len__() (to return the dataset size), and __getitem__() (to load and preprocess the sample at a given index).

The DataLoader Class

The torch.utils.data.DataLoader class wraps a dataset and turns individual samples into batches. It handles batch creation, shuffling, and optional parallel loading with multiple worker processes.

Parameters such as batch_size, shuffle, and num_workers control this behavior. When num_workers is greater than zero, the DataLoader starts worker processes (not threads) that load and preprocess upcoming batches while the GPU executes training steps, which reduces the chance that data loading becomes the bottleneck.
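As a minimal sketch of these parameters, the snippet below builds a loader over a toy in-memory dataset. The helper name make_loader and the dataset sizes are illustrative assumptions, not part of PyTorch. It runs with num_workers=0 (loading in the main process) so that it behaves the same on every platform; raising the value is what enables parallel loading.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=32, num_workers=0):
    """Hypothetical helper: num_workers > 0 loads batches in worker processes."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Pinned host memory can speed up host-to-GPU copies; only useful with CUDA.
        pin_memory=torch.cuda.is_available(),
    )


# Toy in-memory dataset: 256 samples with 8 features each (illustrative sizes).
toy_dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

# num_workers=0 loads in the main process; raise it to load in parallel.
loader = make_loader(toy_dataset, batch_size=32, num_workers=0)
features, labels = next(iter(loader))
print(features.shape, labels.shape)
```

A reasonable starting point for num_workers depends on the machine and on how expensive __getitem__ is, so it is usually tuned by measurement rather than fixed in advance.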

Data Customization and Collation

Custom pipelines support dynamic data loading and custom collation for variable-length inputs.

Dynamic Sample Loading

For large datasets (such as high-resolution images), loading every sample into RAM at once can exhaust memory. Implementing __getitem__ so that it reads each file from disk only when that index is requested avoids this.

With this lazy loading, only the samples for the batches currently being assembled or prefetched are held in memory. Memory usage then depends on the batch size and the number of workers rather than on the dataset size, so the pipeline can scale to very large datasets.
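The idea can be sketched as follows, assuming each sample is stored as its own tensor file. The class name LazyTensorDataset and the file layout are hypothetical; a real image dataset would decode image files instead of calling torch.load.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class LazyTensorDataset(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # The file is read from disk only when this index is requested.
        return torch.load(self.file_paths[idx])


# Write a few small tensors to a temporary directory to stand in for real files.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(4):
    path = os.path.join(tmp_dir, f"sample_{i}.pt")
    torch.save(torch.full((3,), float(i)), path)
    paths.append(path)

lazy_dataset = LazyTensorDataset(paths)
sample = lazy_dataset[2]
print(len(lazy_dataset), sample)
```

Only the list of paths is stored on the dataset object, so constructing it stays cheap no matter how large the files are.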

Collate Functions

The DataLoader uses a collate function (collate_fn) to combine a list of samples into a batch. The default collate function stacks tensors along a new batch dimension, which fails when the samples in a batch have different shapes (such as token sequences of varying lengths).

We can write a custom collate function to handle padding, truncate sequences, or merge dictionaries, providing flexibility for complex input structures.
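A padding collate function might look like the sketch below. The name pad_collate, the toy sequences, and the padding value of 0 are assumptions chosen for illustration. It relies on torch.nn.utils.rnn.pad_sequence and also returns the original lengths, which downstream code often needs for masking.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Pad variable-length 1-D sequences to the longest one in the batch."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# Toy data: a list of (sequence, label) pairs works as a map-style dataset.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
padded, lengths, labels = next(iter(loader))
print(padded)
print(lengths)
```

Because padding happens per batch, each batch is only as wide as its longest sequence rather than the longest sequence in the whole dataset.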

PyTorch Implementation

We can implement a custom dataset and wrap it in a DataLoader to verify batching and shuffling behavior.

Coding a Custom Dataset Class

Here is a complete PyTorch implementation of a custom dataset that generates synthetic data:

<pre><code class="language-python">import torch
from torch.utils.data import Dataset, DataLoader


class SyntheticDataset(Dataset):
    def __init__(self, num_samples, input_dim):
        # Generate random inputs and labels
        self.x = torch.randn(num_samples, input_dim)
        self.y = torch.randint(0, 2, (num_samples, 1)).float()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # Return a single sample
        return self.x[idx], self.y[idx]


# Instantiate dataset with 100 samples
dataset = SyntheticDataset(num_samples=100, input_dim=5)
print("Dataset size:", len(dataset))</code></pre>

In this code, we subclass Dataset and implement the three methods described earlier. The dataset holds its tensors in memory and returns an individual (input, label) pair by integer indexing.

Wrapping with DataLoader

We wrap our custom dataset in a DataLoader to automate batching, shuffling, and iteration:

<pre><code class="language-python"># Construct DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=False)

# Iterate through one batch
for batch_idx, (batch_x, batch_y) in enumerate(dataloader):
    print(f"Batch {batch_idx} | X shape: {batch_x.shape} | Y shape: {batch_y.shape}")
    break  # Inspect only the first batch</code></pre>

This loader yields batches of 16 samples. Because the dataset has 100 samples and drop_last is False, an epoch consists of seven batches, the last of which holds the remaining 4 samples. With shuffle=True, the sample order is re-randomized at the start of each epoch, which helps keep the model from picking up spurious patterns tied to the order of the data.
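To check the batch counts and the shuffling behavior, the sketch below uses a stand-in dataset of 100 integers so that the sample order is easy to inspect. The helper epoch_order and the seed value are illustrative assumptions. Passing a seeded torch.Generator to the DataLoader is one way to make a shuffled epoch reproducible.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset with 100 samples, mirroring the sizes used above.
data = TensorDataset(torch.arange(100))


def epoch_order(seed):
    """Return the sample order of one shuffled epoch for a given seed."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(data, batch_size=16, shuffle=True, generator=generator)
    return torch.cat([batch[0] for batch in loader]).tolist()


# Compare the number of batches per epoch with and without the partial batch.
keep_last = DataLoader(data, batch_size=16, drop_last=False)
drop_last = DataLoader(data, batch_size=16, drop_last=True)
print(len(keep_last), len(drop_last))

# Two epochs built from the same seed visit the samples in the same order.
order_a = epoch_order(seed=0)
order_b = epoch_order(seed=0)
print(order_a == order_b)
```

Setting drop_last=True discards the final partial batch, which can be useful when a layer or loss expects a fixed batch size.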