Named Entity Recognition (NER) with BiLSTMs

Named Entity Recognition (NER) is a sequence labeling task in which each token in a sentence is assigned a tag representing an entity type, or a tag marking it as a non-entity. Bidirectional LSTMs (BiLSTMs) are well suited to this task because they read the text in both directions, so each word's representation reflects both its left and its right context.


Sequence Labeling and Many-to-Many Architecture

NER requires classifying every individual token in a sequence using joint forward and backward sequence context.

NER Task Formulation

Named Entity Recognition is formulated as a many-to-many sequence labeling problem. Given an input sequence of words \\( x_1, x_2, \\dots, x_T \\), the goal is to predict a sequence of labels \\( y_1, y_2, \\dots, y_T \\), one per word. Because an entity can cover several tokens (for example "New York City"), models use tagging schemes that encode span boundaries. The most common is the BIO (Begin, Inside, Outside) scheme, where a label like B-LOC marks the first token of a location entity, I-LOC marks each following token of the same entity, and O marks non-entity tokens.
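As a small illustration of the BIO scheme, the sketch below converts entity spans into per-token tags. The function name spans_to_bio and the (start, end, label) span format with an exclusive end index are assumptions made for this example, not part of any particular library.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans is a list of (start, end, label) tuples over token indices,
    with end exclusive. This format is an assumption of the sketch.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Alice", "moved", "to", "New", "York", "City", "."]
spans = [(0, 1, "PER"), (3, 6, "LOC")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The input and output sequences have the same length, which is what makes the task many-to-many: "New York City" becomes B-LOC, I-LOC, I-LOC.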

Alternatively, the BILOU (Begin, Inside, Last, Unit, Outside) scheme adds explicit tags for the last token of a multi-token entity (L) and for single-token entities (U), so span boundaries are marked on both sides. A classifier that evaluates each token in isolation cannot resolve such tags reliably; sequence labeling requires an architecture that represents each token relative to its local and global context.
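To make the difference between the two schemes concrete, here is a minimal sketch that converts a BIO sequence to BILOU, where L marks the last token of a multi-token entity and U marks a single-token entity. The helper name bio_to_bilou is hypothetical, and the sketch assumes the input is a well-formed BIO sequence.

```python
def bio_to_bilou(tags):
    """Convert a well-formed BIO tag sequence to BILOU."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bilou.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            bilou.append(f"I-{label}" if continues else f"L-{label}")
    return bilou


bio = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
bilou_tags = bio_to_bilou(bio)
print(bilou_tags)
```

Here the single-token person entity becomes U-PER and the final token of the location becomes L-LOC, so the end of every entity is tagged explicitly.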

Bidirectional Contextualization

Unidirectional LSTMs only capture context from the left of the current word. In NER, context to the right is often just as important for identifying entities. For example, in the sentence "Green is a good player," the word "Green" is a person's name (PER), whereas in "Green apples are delicious," it is an ordinary adjective and not an entity. A unidirectional LSTM processing "Green" has not yet seen "player" or "apples," making disambiguation difficult.

A Bidirectional LSTM solves this by running two independent LSTM layers over the input sequence: a forward LSTM that processes tokens from left-to-right (producing states \\( \\overrightarrow{h}_t \\)), and a backward LSTM that processes tokens from right-to-left (producing states \\( \\overleftarrow{h}_t \\)). At each step \\( t \\), the hidden representations are concatenated: \\( h_t = [\\overrightarrow{h}_t; \\overleftarrow{h}_t] \\). This combined representation contains information from both the past and the future, providing a context-rich feature vector for classifying the token.
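The concatenation can be inspected directly in PyTorch. The sketch below uses small, arbitrary sizes chosen only for illustration. It relies on the layout of nn.LSTM with bidirectional=True and batch_first=True, where the first half of the last output dimension holds the forward states and the second half holds the backward states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

H = 4  # hidden size per direction (arbitrary for this sketch)
lstm = nn.LSTM(input_size=3, hidden_size=H, bidirectional=True, batch_first=True)

x = torch.randn(1, 5, 3)        # [batch_size, seq_len, input_size]
out, (h_n, _) = lstm(x)         # out: [1, 5, 2 * H], h_n: [2, 1, H]

forward_states = out[..., :H]   # forward LSTM state at every step
backward_states = out[..., H:]  # backward LSTM state at every step

# The forward LSTM finishes at the last token, the backward LSTM at the first.
fwd_ok = torch.allclose(forward_states[:, -1], h_n[0])
bwd_ok = torch.allclose(backward_states[:, 0], h_n[1])

# Change only the last token and look at the representation of the first token.
x2 = x.clone()
x2[:, -1] += 1.0
out2, _ = lstm(x2)
forward_unchanged = torch.allclose(out[:, 0, :H], out2[:, 0, :H])
backward_changed = not torch.equal(out[:, 0, H:], out2[:, 0, H:])

print(out.shape, fwd_ok, bwd_ok, forward_unchanged, backward_changed)
```

Editing the last token leaves the forward half of the first token's vector untouched but alters its backward half, which is the "future" information a unidirectional model lacks.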

BiLSTM-NER Implementation

Implementing a token-level classification network in PyTorch requires outputting tag predictions for every sequence step and masking padded positions when computing the loss.

Model Architecture

The following PyTorch code defines a Bidirectional LSTM network that projects concatenated hidden states to tag logits for each sequence token:

<pre><code class="language-python">import torch
import torch.nn as nn


class BiLSTMNER(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # bidirectional=True doubles the output dimensions of the LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True, batch_first=True)

        # Linear layer projects concatenated [forward; backward] states to tag space
        self.hidden_to_tag = nn.Linear(hidden_dim, len(tag_to_ix))

    def forward(self, sentence):
        # sentence shape: [batch_size, seq_len]
        embeds = self.embedding(sentence)  # [batch_size, seq_len, embedding_dim]

        # lstm_out shape: [batch_size, seq_len, hidden_dim]
        lstm_out, _ = self.lstm(embeds)

        # Project to tag space
        # logits shape: [batch_size, seq_len, num_tags]
        logits = self.hidden_to_tag(lstm_out)
        return logits


# Example execution
tags = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "PAD": 5}
model = BiLSTMNER(vocab_size=50, tag_to_ix=tags, embedding_dim=16, hidden_dim=32)
x = torch.randint(1, 50, (2, 8))  # Batch size of 2, sequence length of 8
logits = model(x)
print("Logits shape:", logits.shape)  # Should be [2, 8, 6]</code></pre>

Token-Level Cross-Entropy

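Training a sequence labeling model requires computing a classification loss for each token. In PyTorch, this is typically done with nn.CrossEntropyLoss, which expects the class dimension to come second, so the logits are usually flattened from [batch_size, seq_len, num_tags] to [batch_size * seq_len, num_tags] and the targets to [batch_size * seq_len]. Because batches are padded to a uniform length, padded positions must not influence the loss. To ensure this, the target at each padded position is set to the pad label, and the loss function is configured with ignore_index set to that label's index (e.g., ignore_index=5).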

With this configuration, positions whose target is the pad label are ignored: they contribute nothing to the loss or to its gradient. With the default mean reduction, the loss is averaged over the remaining (non-ignored) tokens in the batch, so the model updates its weights based only on actual words and entities.
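The masking behavior can be checked with a small sketch. The logits here are random stand-ins for model output, and the pad label index 5 follows the tag dictionary used earlier. The manual computation averages the negative log-likelihood over non-pad positions only, which should match what nn.CrossEntropyLoss returns with ignore_index set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

PAD_TAG = 5
num_tags = 6

# Random stand-in for model output: [batch_size, seq_len, num_tags]
logits = torch.randn(2, 4, num_tags)
# Gold tags; the second sentence has two padded positions.
targets = torch.tensor([[1, 2, 0, 0],
                        [3, 0, PAD_TAG, PAD_TAG]])

# Flatten batch and time so the class dimension comes second.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TAG)
loss = loss_fn(logits.view(-1, num_tags), targets.view(-1))

# Manual equivalent: per-token negative log-likelihood, averaged over
# the positions that are not padding.
mask = targets != PAD_TAG
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = nll[mask].mean()

print(loss.item(), manual_loss.item(), int(mask.sum()))
```

Six of the eight positions are real tokens, so the average is taken over six terms, not eight.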

Enhancing NER with CRF and Char Embeddings

Adding a Conditional Random Field output layer and character-level feature extractors improves the consistency of predicted label sequences and the handling of words outside the vocabulary.

The Role of Conditional Random Fields (CRF)

A standard BiLSTM tagger makes each tag decision independently: given the hidden state for a token, it outputs a probability distribution over tags using Softmax. This approach ignores dependencies between adjacent labels. For example, in BIO tagging, an I-PER (Inside Person) tag directly after a B-LOC (Begin Location) tag is invalid, because an I- tag may only continue an entity of the same type. Softmax has no mechanism to enforce these constraints, so the model can produce invalid label sequences.

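To address this, a Conditional Random Field (CRF) is added as a final layer. Instead of predicting tags independently, a CRF models the joint probability of the entire tag sequence. It maintains a transition matrix \\( \\mathbf{T} \\), where \\( T_{i,j} \\) is the score of transitioning from tag \\( i \\) to tag \\( j \\); the score of a path is the sum of the BiLSTM's per-token tag scores and the transition scores along it. During training, the model maximizes the likelihood of the correct path relative to all possible paths, with the normalizing sum over paths computed by the forward algorithm. At inference, the Viterbi algorithm finds the highest-scoring path. Because invalid transitions receive low learned scores (or can be explicitly forbidden), the final predictions tend to respect sequence-level constraints.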
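The following pure-Python sketch shows how transition scores change the decoded sequence. The emission and transition values are made up for illustration; in a real model the emissions would be the BiLSTM's logits and the transitions would be learned. The sketch omits start and end transitions for brevity. It computes the normalizer over all paths with the forward recursion and finds the best path with Viterbi.

```python
import math

TAGS = ["O", "B-PER", "I-PER"]

# Hypothetical emission scores: one row per token, one column per tag.
emissions = [
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 1.5],
    [0.2, 0.0, 1.0],
]

# Hypothetical transition scores: transitions[i][j] scores tag i -> tag j.
# O -> I-PER is heavily penalized because it is invalid in BIO.
transitions = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, 0.0, 1.0],    # from B-PER
    [0.0, 0.0, 0.5],    # from I-PER
]


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(path):
    """Sum of emission and transition scores along one tag path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition():
    """Forward algorithm: log of the summed exp-scores of all paths."""
    n = len(TAGS)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n)])
            + emissions[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)


def viterbi():
    """Return the highest-scoring tag path and its score."""
    n = len(TAGS)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Independent per-token argmax, as a plain Softmax tagger would decode.
greedy = [max(range(len(TAGS)), key=lambda j: row[j]) for row in emissions]
greedy_tags = [TAGS[j] for j in greedy]   # O, I-PER, I-PER (invalid in BIO)

best_path, best_score = viterbi()
best_tags = [TAGS[j] for j in best_path]  # O, B-PER, I-PER

log_z = log_partition()
print("greedy :", greedy_tags)
print("viterbi:", best_tags)
print("log-probability of best path:", best_score - log_z)
```

The per-token argmax places I-PER directly after O, while Viterbi decoding selects B-PER for the second token because the O to I-PER transition is penalized. The training objective for the gold path would be its path score minus log_z.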

Character-Level Features

NER is highly sensitive to out-of-vocabulary words, such as rare surnames or newly coined brand names. A word-embedding lookup table has no dedicated vector for such words and typically maps them all to a single unknown-word embedding, which discards their identity. To address this, hybrid BiLSTM architectures extract character-level features. For each word, its characters are passed through a character-level CNN or BiLSTM to generate a spelling feature vector.

This character-based representation is then concatenated with the word-level embedding before being fed to the main sequence model. Character features capture structural clues (e.g., words ending in "-stein" are often names, and capitalized words in the middle of a sentence are often entities) that allow the model to recognize unseen entities based on their spelling patterns.
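A minimal sketch of the character-CNN variant is shown below. The class name CharCNNWordEncoder and all sizes are assumptions for illustration. For simplicity the max-pooling also runs over padded character positions, which a fuller implementation might mask out.

```python
import torch
import torch.nn as nn


class CharCNNWordEncoder(nn.Module):
    """Concatenates a word embedding with a character-CNN spelling vector."""

    def __init__(self, num_chars, char_dim, num_filters, vocab_size, word_dim,
                 kernel_size=3):
        super().__init__()
        self.char_embedding = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.char_conv = nn.Conv1d(char_dim, num_filters, kernel_size,
                                   padding=kernel_size // 2)
        self.word_embedding = nn.Embedding(vocab_size, word_dim, padding_idx=0)

    def forward(self, word_ids, char_ids):
        # word_ids: [batch_size, seq_len]
        # char_ids: [batch_size, seq_len, max_word_len]
        batch_size, seq_len, word_len = char_ids.shape

        # Treat every word as its own short character sequence.
        chars = self.char_embedding(char_ids.reshape(-1, word_len))
        # Conv1d expects [N, channels, length], so move char_dim to dim 1.
        conv = self.char_conv(chars.transpose(1, 2))
        # Max over character positions gives one spelling vector per word.
        char_feats = conv.max(dim=2).values
        char_feats = char_feats.view(batch_size, seq_len, -1)

        words = self.word_embedding(word_ids)
        # [batch_size, seq_len, word_dim + num_filters]
        return torch.cat([words, char_feats], dim=-1)


encoder = CharCNNWordEncoder(num_chars=30, char_dim=8, num_filters=10,
                             vocab_size=50, word_dim=16)
word_ids = torch.randint(1, 50, (2, 8))      # 2 sentences, 8 words each
char_ids = torch.randint(1, 30, (2, 8, 12))  # up to 12 characters per word
features = encoder(word_ids, char_ids)
print("Feature shape:", features.shape)
```

The sequence-level BiLSTM that consumes these features would then take word_dim + num_filters as its input size (26 in this example) in place of the word embedding dimension alone.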