Text Generation with Character-Level RNNs
Character-level language models generate text by predicting the next character in a sequence. By working at the character level, these networks maintain a small vocabulary size and can generate novel words, though they require longer sequences to capture context.
The Character-Level Language Model
A character language model treats text as a sequence of character tokens. Given a history of characters, the network predicts a probability distribution over the entire character vocabulary.
Vocabulary and Tokenization
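The character vocabulary consists of every unique character in the corpus: letters, digits, punctuation, and whitespace. For English text this is typically around 100 symbols, far fewer than a word-level vocabulary, which can exceed 50,000 tokens. The smaller vocabulary keeps the embedding and output layers small, since their sizes scale with the number of tokens.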
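As a minimal sketch of character-level tokenization, the snippet below builds a vocabulary from a toy string and maps characters to integer ids and back. The corpus and the names `char_to_id`, `id_to_char`, `encode`, and `decode` are illustrative assumptions, not part of any particular library.

```python
# Hypothetical toy corpus; in practice this would be the full training text.
corpus = "hello world"

# The vocabulary is the sorted set of unique characters in the corpus.
vocab = sorted(set(corpus))

# Lookup tables between characters and integer ids.
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    """Convert a string into a list of character ids."""
    return [char_to_id[ch] for ch in text]

def decode(ids):
    """Convert a list of character ids back into a string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode(corpus)
roundtrip = decode(ids)
```

Sorting the character set makes the id assignment reproducible across runs, which matters when a saved model is later reloaded with the same vocabulary.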
Training Objective
The model is trained with cross-entropy loss. At each step, the network receives the current character (together with its recurrent hidden state) and predicts the character that actually follows it in the training text. Minimizing this loss is equivalent to maximizing the likelihood of the training text.
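The objective can be illustrated without a network: targets are the input sequence shifted by one position, and the per-step loss is the negative log-probability the model assigns to the true next character. This is a sketch in plain Python; `make_pairs`, `cross_entropy`, and the example numbers are hypothetical and chosen only for illustration.

```python
import math

def make_pairs(ids):
    """Inputs are all ids but the last; targets are the same ids shifted by one."""
    return ids[:-1], ids[1:]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true next character."""
    return -math.log(probs[target])

# Hypothetical sequence of character ids.
inputs, targets = make_pairs([3, 1, 4, 1, 5])

# Hypothetical predicted distribution over a 3-character vocabulary.
probs = [0.7, 0.2, 0.1]
loss_if_target_likely = cross_entropy(probs, 0)    # model favored the true character
loss_if_target_unlikely = cross_entropy(probs, 2)  # model gave it little probability
```

The loss is small when the model puts high probability on the character that actually occurs and grows as that probability shrinks. In a real training loop the per-step losses would be averaged over the sequence and the batch.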
Sampling and Softmax Temperature
During text generation, the model outputs a vector of logits at each step, one per character in the vocabulary. We control the creativity of the generated text by dividing these logits by a temperature before sampling.
Temperature Scaling Math
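To adjust creativity, we divide the logits z_i by a temperature T > 0 before applying softmax: p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}. With T = 1, the model's distribution is unchanged. As T \to 0, the distribution concentrates on the largest logit, which amounts to greedy argmax decoding and tends to produce highly repetitive text. As T \to \infty, the distribution approaches uniform, producing random, chaotic output.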
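The effect of T can be checked numerically. The following is a minimal pure-Python sketch of the temperature-scaled softmax; the function name `softmax_with_temperature` and the example logits are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-character vocabulary
cold = softmax_with_temperature(logits, T=0.1)    # nearly all mass on the largest logit
plain = softmax_with_temperature(logits, T=1.0)   # ordinary softmax
hot = softmax_with_temperature(logits, T=100.0)   # close to uniform
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow when T is small and the scaled logits become large.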
PyTorch Sampling Implementation
We use torch.multinomial to sample characters from the adjusted probability distribution.
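One way this step might look in PyTorch is sketched below. The helper `sample_next_char`, the example logits, and the seeded generator are illustrative assumptions; a real generation loop would obtain `logits` from the trained model at each step and feed the sampled id back in as the next input.

```python
import torch

def sample_next_char(logits, temperature=1.0, generator=None):
    """Sample one character id from a 1-D tensor of logits."""
    # Scale the logits by the temperature, then normalize to probabilities.
    probs = torch.softmax(logits / temperature, dim=-1)
    # Draw a single index according to those probabilities.
    return torch.multinomial(probs, num_samples=1, generator=generator).item()

# Hypothetical logits over a 3-character vocabulary.
logits = torch.tensor([2.0, 1.0, 0.0])

# A seeded generator makes the draw reproducible (optional).
gen = torch.Generator().manual_seed(0)
next_id = sample_next_char(logits, temperature=0.8, generator=gen)

# With a very low temperature, sampling behaves like greedy argmax.
greedy_like_id = sample_next_char(logits, temperature=0.001)
```

Sampling with torch.multinomial rather than taking the argmax is what lets the temperature matter: at moderate T, lower-probability characters are still chosen occasionally, which adds variety to the output.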