FastText and Subword Information

FastText is an extension of the Word2Vec skip-gram model that incorporates subword information by representing words as bags of character n-grams. This approach allows the model to learn rich representations for morphologically complex words and handle unseen out-of-vocabulary terms effectively.

Subword Representation Mechanics

By breaking words down into character n-grams, FastText captures internal spelling patterns, allowing it to share representations across related words.

Character n-grams

In standard word representation models like Word2Vec and GloVe, each word is assigned a distinct vector, treating words as atomic entities. This approach ignores morphological similarities (e.g., "run", "running", and "runner" are treated as completely independent). FastText solves this by representing each word as a bag of character n-grams. For instance, if we select a range of n-grams from 3 to 6, the word "where" is represented by its n-grams: "<wh", "whe", "her", "ere", "re>" (where "<" and ">" are special boundary markers indicating the start and end of the word). The word itself is also included in the set.

This design allows the model to capture suffix and prefix information, which is critical for understanding word meanings. The final vector representation of a word is computed as the sum of the vector representations of its constituent character n-grams. This means that words sharing common prefixes or suffixes will naturally share parameter space and have similar vectors, even if one of the words is extremely rare in the training corpus.

Out-Of-Vocabulary (OOV) Handling

One of the most severe limitations of word-level models is their inability to handle out-of-vocabulary (OOV) words. When a word is encountered at test time that was not present in the training vocabulary, traditional models must assign it a generic "<UNK>" token or a random vector, destroying semantic information. FastText handles OOV words gracefully by exploiting subword structures. If a word is unseen, the model decomposes it into its character n-grams and sums the vectors of those n-grams that were observed during training.

For example, if the word "suboptimality" is OOV but the model has seen "sub-", "optimal", and "-ity" in other contexts, the resulting vector will capture a composite meaning of the parts. This property is crucial for domain-specific text (such as medical or technical documents) where new terms are frequently formed by combining existing prefixes and roots. By leveraging subword information, the model generalizes to unseen words far better than word-level alternatives.

Mathematical Formulation and Training

The scoring function in FastText calculates word similarity by summing the dot products of the constituent n-gram vectors with context word vectors.

Objective Function and Skip-gram Extension

FastText generalizes the skip-gram architecture. In the standard skip-gram model, the score of a word \\( w \\) and a context word \\( c \\) is computed using the dot product of their vectors: \\( s(w, c) = \\mathbf{u}_w^T \\mathbf{v}_c \\). In FastText, this scoring function is modified to include subword information. Let \\( \\mathcal{G}_w \\subset \\{1, \\dots, G\\} \\) be the set of character n-grams appearing in word \\( w \\). We associate a vector \\( \\mathbf{z}_g \\) with each n-gram \\( g \\) in our dictionary. The score of a word-context pair is then defined as the sum of the dot products of the n-gram vectors and the context vector:

\\( s(w, c) = \\sum_{g \\in \\mathcal{G}_w} \\mathbf{z}_g^T \\mathbf{v}_c \\)

During training, the model optimizes the standard skip-gram objective with negative sampling. For a given target word \\( w_t \\) and context word \\( w_c \\), the loss function minimizes the negative log-likelihood of the binary decision: \\( \\log(1 + e^{-s(w_t, w_c)}) + \\sum_{n \\in \\mathcal{N}} \\log(1 + e^{s(w_t, n)}) \\), where \\( \\mathcal{N} \\) is a set of negative samples drawn from the noise distribution. This ensures that the n-gram vectors learn parameters that align with target semantic contexts.

PyTorch Implementation

The following PyTorch code demonstrates how to implement a subword embedding lookup layer that mimics the core mechanics of FastText's word representation calculation:

<pre><code class="language-python">import torch import torch.nn as nn class FastTextEmbeddingLookup(nn.Module): def __init__(self, vocab_size, num_ngrams, embedding_dim): super().__init__() # Standard word embeddings for known words self.word_embeddings = nn.Embedding(vocab_size, embedding_dim) # Subword n-gram embeddings self.ngram_embeddings = nn.Embedding(num_ngrams, embedding_dim) def forward(self, word_idx, ngram_indices): """ Args: word_idx: Tensor of shape [batch_size] containing word indices ngram_indices: List of Tensors, where each Tensor contains the n-gram indices for a single word. """ batch_size = word_idx.size(0) embeddings = [] for i in range(batch_size): # Fetch the word vector w_vec = self.word_embeddings(word_idx[i]) # [embedding_dim] # Fetch and average/sum the n-gram vectors ng_idxs = ngram_indices[i] if len(ng_idxs) > 0: ng_vecs = self.ngram_embeddings(ng_idxs) # [num_word_ngrams, embedding_dim] ng_sum = torch.sum(ng_vecs, dim=0) # [embedding_dim] word_rep = w_vec + ng_sum else: word_rep = w_vec embeddings.append(word_rep) # Stack into [batch_size, embedding_dim] return torch.stack(embeddings, dim=0) # Example execution model = FastTextEmbeddingLookup(vocab_size=10, num_ngrams=100, embedding_dim=16) target_word = torch.tensor([1, 2]) # N-gram indices for each word in the batch ngrams = [torch.tensor([12, 45, 87]), torch.tensor([5, 9])] out = model(target_word, ngrams) # Shape: [2, 16] - [batch_size, embedding_dim] print("Output shape:", out.shape)</pre>

Design Trade-offs and Practical Considerations

Using subword information introduces memory overhead and structural trade-offs, particularly regarding hash collisions and morphological language profiles.

Memory vs. Accuracy

Although character n-grams provide massive representation benefits, they drastically increase the vocabulary size. A corpus with 100,000 unique words can contain millions of unique character n-grams. To constrain memory usage, FastText uses a hashing trick (specifically the Fowler-Noll-Vo FNV-1a hash function) to map all n-grams to a fixed-size table of size \\( K \\) (typically \\( K = 2,000,000 \\)). This limits the memory footprint of the embedding layer but introduces hash collisions, where completely different n-grams are mapped to the same index and share the same vector representation.

In practice, the model's accuracy is robust to moderate numbers of collisions because the final word representation is a sum of many n-gram vectors, which averages out the noise. However, setting the hash table size too small leads to severe collisions and drop in performance. Practitioners must balance the size of \\( K \\) and the n-gram length parameters against the physical RAM limitations of the target deployment environment.

Morphological Richness

The benefits of FastText vary significantly depending on the language's linguistic structure. For morphologically rich languages like Turkish, Finnish, German, and Russian, where words are modified by stacking prefixes and suffixes to indicate grammatical relations, FastText outperforms Word2Vec by large margins. In these languages, the vocabulary is sparse because of the combinatorial explosion of word forms, but the underlying n-grams are highly repetitive.

For structurally simpler languages like English, which rely more on word order and prepositions than inflections, the performance gap between Word2Vec and FastText is narrower. FastText's primary benefit in English is its robust handling of spelling errors and colloquial subwords (such as hashtags and abbreviations on social media platforms), where words are frequently altered in non-standard ways.