NLP Tokenization Techniques (Word, BPE, WordPiece)
Neural networks cannot process raw text strings directly; text must first be split into tokens, and each token is then mapped to an integer ID. Tokenization is the preprocessing step that performs this translation, and the choice of tokenizer determines the trade-off between vocabulary size and sequence length.
Word vs. Character Tokenization
Early models split text on whitespace or into individual characters. Each approach sits at one extreme of the trade-off between vocabulary size and sequence length.
Word-Level Tokenization
Word tokenization treats each unique word as a token. This yields short sequences but produces very large vocabularies (often over 100,000 entries). Any word not seen during training is Out-Of-Vocabulary (OOV) and is typically replaced by a single unknown token, which discards its meaning.
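The OOV problem can be illustrated with a minimal sketch. The function names, the lowercase-and-split rule, and the `<unk>` token below are assumptions chosen for illustration, not a specific library's behavior.

```python
def build_word_vocab(corpus, unk_token="<unk>"):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {unk_token: 0}  # ID 0 is reserved for unseen words (an assumption)
    for sentence in corpus:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode_words(text, vocab, unk_token="<unk>"):
    """Map each word to its ID, falling back to the unknown token."""
    return [vocab.get(word, vocab[unk_token]) for word in text.lower().split()]


corpus = ["the cat sat", "the dog sat"]
word_vocab = build_word_vocab(corpus)
word_ids = encode_words("the bird sat", word_vocab)
print(word_ids)  # "bird" was never seen, so it maps to the <unk> ID
```

Here the sequence stays short (one ID per word), but everything the model could have learned about "bird" is lost because it collapses to the same ID as every other unseen word.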
Character-Level Tokenization
Character tokenization uses individual characters as tokens, resulting in a small vocabulary (e.g., ~100 characters) and no OOV issues for text drawn from that character set. However, sequences become much longer, which makes it harder for models to capture long-range context.
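A short sketch makes the length cost concrete. The helper names are hypothetical, and the vocabulary is built from the example sentence itself purely for illustration.

```python
def build_char_vocab(corpus):
    """Assign an integer ID to every distinct character in the corpus."""
    chars = sorted(set("".join(corpus)))
    return {ch: i for i, ch in enumerate(chars)}


def encode_chars(text, vocab):
    """Encode text as one ID per character (assumes every character is in vocab)."""
    return [vocab[ch] for ch in text]


text = "tokenization is fun"
char_vocab = build_char_vocab([text])
char_ids = encode_chars(text, char_vocab)
num_words = len(text.split())

print(len(char_vocab), len(char_ids), num_words)
```

For this sentence the vocabulary has only 12 entries, but the model must process 19 tokens instead of the 3 that a word-level tokenizer would produce.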
Subword Tokenization: BPE and WordPiece
Modern NLP models use subword tokenization to bridge this gap: frequent words stay whole, while rare words are broken into common subword pieces. This largely avoids OOV issues while keeping the vocabulary compact.
Byte Pair Encoding (BPE)
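BPE starts with a character vocabulary and iteratively merges the most frequent adjacent token pair, adding each merged symbol to the vocabulary until a target vocabulary size is reached. For example, the rare word 'unbelievable' might be split into the common subwords ['un', 'believable']. BPE is used by GPT models; GPT-2 and later apply it at the byte level.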
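The merge loop described above can be sketched in a few lines. This is a toy version under stated assumptions: the word-frequency table is made up, words are pre-split, and production details such as end-of-word markers, byte-level handling, and tie-breaking rules are omitted. The function names are hypothetical.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} table."""
    words = {word: list(word) for word in word_freqs}  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        words = {w: merge_symbols(s, best) for w, s in words.items()}
    return merges


def encode_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(word_freqs, num_merges=3)
seen = encode_bpe("hugs", merges)
unseen = encode_bpe("bug", merges)
print(merges, seen, unseen)
```

With this toy table the learned merges are ('u', 'g'), ('u', 'n'), and ('h', 'ug'). The word "bug" never appears in training, yet it still encodes as ['b', 'ug'] rather than an unknown token, which is the property that makes subword vocabularies robust to unseen words.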
WordPiece and SentencePiece
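WordPiece (used by BERT) is similar to BPE, but it chooses each merge to maximize the likelihood of the training data rather than by raw pair frequency. SentencePiece is not a separate merge algorithm but a tokenizer library that implements BPE and unigram language model segmentation. It treats the input as a raw stream of Unicode characters, with whitespace encoded as an ordinary symbol ('▁'), so it needs no language-specific whitespace-splitting rules.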
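The difference between the two selection criteria can be seen on a toy example. The sketch below assumes the commonly cited approximation of the WordPiece criterion, score(a, b) = count(ab) / (count(a) * count(b)); the exact procedure used to build BERT's vocabulary may differ, and the '##' continuation prefix is left out for brevity. All function names and the frequency table are hypothetical.

```python
from collections import Counter


def count_symbols_and_pairs(word_freqs):
    """Count single symbols and adjacent symbol pairs, weighted by word frequency."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return symbol_counts, pair_counts


def best_pair_by_frequency(pair_counts):
    """BPE-style choice: the pair that occurs most often."""
    return max(pair_counts, key=pair_counts.get)


def best_pair_by_score(symbol_counts, pair_counts):
    """WordPiece-style choice (assumed approximation): pair count
    normalized by the counts of its two parts."""
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])
    return max(pair_counts, key=score)


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
symbol_counts, pair_counts = count_symbols_and_pairs(word_freqs)
bpe_choice = best_pair_by_frequency(pair_counts)
wordpiece_choice = best_pair_by_score(symbol_counts, pair_counts)
print(bpe_choice, wordpiece_choice)
```

On this table the frequency criterion picks ('u', 'g'), the most common pair, while the normalized score picks ('g', 's'): 's' is rare and always follows 'g', so merging them raises the likelihood more than merging a pair whose parts are individually common.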