Question Answering Systems (Pre-LLM)

Before Large Language Models (LLMs) became dominant, question answering (QA) systems relied on extractive span prediction, memory networks, and retriever-reader pipelines. Rather than generating free-form text, these systems select an answer from a context document or a candidate set using small task-specific output layers.

Extractive Question Answering Formulation

Extractive QA models locate answers by predicting the start and end boundary indices of the answer span within a context document.

Span Prediction

In extractive reading comprehension (popularized by datasets like SQuAD), a model receives a question \\( q \\) and a context document \\( c \\). The task is to locate the answer within the context, represented as a contiguous span of text. Instead of generating words, the model predicts two indices: the start index \\( s \\) and the end index \\( e \\) of the answer span within the context document. Let the tokenized context document be represented as a sequence of vectors \\( \\mathbf{h}_1, \\mathbf{h}_2, \\dots, \\mathbf{h}_N \\) output by an encoder.

We introduce two learnable parameter vectors: \\( \\mathbf{w}_{start} \\) and \\( \\mathbf{w}_{end} \\). The probability of token \\( i \\) being the start of the answer span is a softmax over all context positions: \\( P_{start}(i) = \\frac{e^{\\mathbf{w}_{start}^T \\mathbf{h}_i}}{\\sum_j e^{\\mathbf{w}_{start}^T \\mathbf{h}_j}} \\). The end token probability \\( P_{end}(i) \\) is computed the same way using \\( \\mathbf{w}_{end} \\). The model is trained by minimizing the cross-entropy losses for the gold start and end positions (typically summed or averaged). At inference, it searches for the pair \\( (s, e) \\) that maximizes \\( P_{start}(s) \\times P_{end}(e) \\) subject to the constraint \\( s \\le e \\le s + L_{max} \\), where \\( L_{max} \\) is the maximum allowed answer length.
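The training loss and the constrained span search described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `span_loss` and `best_span`, the exhaustive double loop, and the logit values are all assumptions made for the example. A real system would run this on batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def span_loss(start_logits, end_logits, gold_start, gold_end):
    """Sum of the cross-entropy terms for the gold start and end positions."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    return -math.log(p_start[gold_start]) - math.log(p_end[gold_end])

def best_span(start_logits, end_logits, max_len):
    """Exhaustive search for (s, e) maximizing P_start(s) * P_end(e)
    subject to s <= e <= s + max_len."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        last_end = min(s + max_len, len(p_end) - 1)
        for e in range(s, last_end + 1):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

# Hypothetical logits for a 6-token context (made-up values for illustration)
start_logits = [0.1, 2.0, 0.3, 0.0, -1.0, 0.5]
end_logits = [0.0, 0.2, 0.1, 3.0, 0.4, 2.5]

span, score = best_span(start_logits, end_logits, max_len=3)
loss = span_loss(start_logits, end_logits, gold_start=1, gold_end=3)
print("Best span:", span, "score:", round(score, 4), "loss:", round(loss, 4))
```

With these values the highest-scoring start (token 1) and end (token 3) fall within the length limit, so the search returns them directly. A tighter limit forces the search to trade off start and end probability.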

PyTorch Span Predictor

This PyTorch module demonstrates the head architecture used to predict start and end answer token boundaries from context features:

<pre><code class="language-python">import torch
import torch.nn as nn


class QASpanPredictor(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Single linear layer projecting to 2 outputs (start and end logits)
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, context_representations):
        # context_representations shape: [batch_size, seq_len, hidden_dim]
        # Pass through linear projection
        # logits shape: [batch_size, seq_len, 2]
        logits = self.qa_outputs(context_representations)

        # Split into start and end logits
        # Each has shape: [batch_size, seq_len]
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        return start_logits, end_logits


# Example run
predictor = QASpanPredictor(hidden_dim=32)
# Simulate context token encodings (batch of 2, 10 tokens, 32 channels)
context_feats = torch.randn(2, 10, 32)
start, end = predictor(context_feats)
print("Start logits shape:", start.shape)  # [2, 10]
print("End logits shape:", end.shape)      # [2, 10]</code></pre>

Memory Networks and Reasoning

Memory Networks store facts in an external memory and use multi-hop retrieval to answer complex questions.

End-to-End Memory Networks

For question answering tasks that require reasoning over multiple facts (like the bAbI dataset), standard recurrent networks struggle because they must compress every fact into a fixed-size hidden state, which makes reliable storage and retrieval of details difficult. End-to-End Memory Networks (Sukhbaatar et al., 2015) address this by writing facts to an external memory array. The model converts each input fact \\( x_i \\) into a memory vector \\( m_i \\) and an output vector \\( c_i \\) using two embedding matrices.

The question \\( q \\) is embedded into a query vector \\( u \\). The model computes a match score between the query and each memory vector as a dot product, then normalizes the scores with a softmax over all memories to get attention weights: \\( p_i = \\text{Softmax}(u^T m_i) \\). The response vector \\( o \\) is the sum of the output vectors weighted by these attention weights: \\( o = \\sum_i p_i c_i \\). Finally, the query and response are combined (in Sukhbaatar et al., summed and passed through a final weight matrix and softmax) to predict the answer, so which facts are retrieved depends on the question content.
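A single memory lookup can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are hand-written 2-dimensional vectors rather than learned ones, and the names `memory_hop`, `predict`, and the answer matrix `W` are hypothetical. The prediction step assumes the sum-then-project form of combining the query and response.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_hop(u, memories, outputs):
    """One lookup: p_i = softmax(u . m_i), o = sum_i p_i * c_i."""
    p = softmax([dot(u, m) for m in memories])
    dim = len(outputs[0])
    o = [sum(p_i * c[d] for p_i, c in zip(p, outputs)) for d in range(dim)]
    return p, o

def predict(u, o, W):
    """Score each candidate answer (one row of W) against the sum o + u."""
    combined = [a + b for a, b in zip(u, o)]
    return softmax([dot(row, combined) for row in W])

# Hand-written toy vectors (assumptions for illustration, not learned values)
u = [1.0, 0.0]                                      # query embedding
memories = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]    # m_i
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # c_i
W = [[1.0, 0.0], [0.0, 1.0]]                        # hypothetical answer matrix

p, o = memory_hop(u, memories, outputs)
answer_probs = predict(u, o, W)
print("Attention:", [round(x, 3) for x in p])
print("Answer distribution:", [round(x, 3) for x in answer_probs])
```

Because the query aligns with the first memory vector, almost all attention mass goes to the first fact, and the response vector is close to that fact's output vector.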

Multi-Hop Reasoning

Some questions cannot be answered by retrieving a single fact. For example, to answer "Where is the milk?" given "John took the milk" and "John went to the kitchen," the model must perform a two-step lookup. Multi-hop memory networks stack multiple memory layers (or hops) to support this. The output of the first hop \\( o^1 \\) is added to the initial query \\( u^1 \\) to form a new query: \\( u^2 = u^1 + o^1 \\).

This updated query is passed to the next memory layer, allowing the model to attend to different facts (e.g., finding John's location in the second hop after identifying him as the one holding the milk in the first). Stacking hops lets the network chain evidence across several facts rather than relying on a single lookup.
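The hop-by-hop query update can be illustrated with the milk example. The sketch below makes several simplifying assumptions: the 3-dimensional vectors (one axis each for milk, John, and kitchen) are hand-built so the effect is visible, the same memory and output vectors are reused at every hop, and the name `multi_hop` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_hop(u, memories, outputs, hops):
    """Repeat the memory lookup, updating the query after each hop."""
    attentions = []
    for _ in range(hops):
        p = softmax([dot(u, m) for m in memories])
        o = [sum(p_i * c[d] for p_i, c in zip(p, outputs))
             for d in range(len(u))]
        u = [a + b for a, b in zip(u, o)]  # next query = current query + response
        attentions.append(p)
    return u, attentions

# Axes: [milk, john, kitchen]. Toy vectors chosen by hand for illustration.
query = [1.0, 0.0, 0.0]            # "Where is the milk?"
memories = [
    [3.0, 0.0, 0.0],               # "John took the milk" (matched via milk)
    [0.0, 3.0, 0.0],               # "John went to the kitchen" (matched via John)
]
outputs = [
    [0.0, 2.0, 0.0],               # first fact contributes "John"
    [0.0, 0.0, 2.0],               # second fact contributes "kitchen"
]

final_query, attn = multi_hop(query, memories, outputs, hops=2)
for hop, weights in enumerate(attn, start=1):
    print("Hop", hop, "attention:", [round(w, 3) for w in weights])
```

In this toy setup the first hop attends mostly to the milk fact, which adds a John component to the query. The second hop then shifts attention to the fact about John's location.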

Open-Domain QA Pipelines

Open-domain systems combine a sparse or dense document retriever with a neural reader to answer questions from large corpora.

Retriever-Reader Framework

When answering questions from a large database (like Wikipedia) rather than a short context document, reading the entire database with a neural network is computationally infeasible. Open-domain QA systems solve this using a two-stage Retriever-Reader pipeline. The Retriever stage is a lightweight search component that filters the database to find the top \\( K \\) documents relevant to the question. Traditional systems use sparse search algorithms like TF-IDF or BM25, which match keyword overlaps between the question and documents.

The Reader stage is a deep neural model (like a span predictor) that processes only the retrieved documents, extracting the final answer span. This two-stage design balances retrieval speed with extraction accuracy, allowing the system to scale to millions of documents.
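The two-stage flow can be sketched with a small BM25 retriever and a placeholder reader. Everything here is illustrative: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the three-sentence corpus, and especially `toy_reader`, which is a stand-in for a neural span predictor and only counts word overlap.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, k):
    """Retriever stage: indices of the top-k documents by BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def toy_reader(query, passage):
    """Placeholder reader (NOT a neural model): returns the first token of
    the passage and the number of query tokens found in it as a score."""
    tokens = tokenize(passage)
    overlap = sum(1 for t in tokenize(query) if t in tokens)
    return tokens[0], overlap

def answer(query, docs, reader, k=2):
    """Reader stage runs only on the k retrieved documents."""
    candidates = [reader(query, docs[i]) for i in retrieve(query, docs, k)]
    return max(candidates, key=lambda c: c[1])

docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
top = retrieve("capital of france", docs, k=2)
result = answer("capital of france", docs, toy_reader, k=2)
print("Retrieved:", top, "Answer:", result)
```

The point of the design is that the expensive component sees only `k` passages per question, regardless of how large the corpus grows.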

Dense Passage Retrieval (DPR)

Traditional keyword-based retrievers struggle with synonyms and semantic meaning. Dense Passage Retrieval (DPR) solves this by replacing BM25 with a dual-encoder architecture. DPR uses two separate encoders: a question encoder \\( E_Q \\) and a passage encoder \\( E_P \\). Both map text to continuous vector spaces (typically 768 dimensions). The relevance score between question \\( q \\) and passage \\( p \\) is computed as the dot product of their vectors: \\( \\text{Sim}(q, p) = E_Q(q)^T E_P(p) \\).
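Once both encoders have produced vectors, ranking reduces to dot products. The sketch below assumes the encoder outputs are already available and uses made-up 3-dimensional vectors in place of real 768-dimensional ones. The function name `top_k_passages` is hypothetical, and the brute-force loop stands in for an approximate index.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_passages(q_vec, passage_vecs, k):
    """Rank passages by the dot product between question and passage vectors."""
    scores = [dot(q_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Made-up vectors standing in for encoder outputs
q_vec = [0.9, 0.1, 0.0]
passage_vecs = [
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
]

results = top_k_passages(q_vec, passage_vecs, k=2)
for index, score in results:
    print("passage", index, "score", round(score, 3))
```

Because passage vectors do not depend on the question, they can be computed once offline, and only the question needs encoding at query time.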

The model is trained using contrastive learning, maximizing the similarity of matching pairs while minimizing the similarity of negative samples. Once trained, passage vectors are indexed using fast vector search libraries (like FAISS), allowing the system to perform semantic retrieval in milliseconds.
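One common way to write such a contrastive objective is the negative log-likelihood of the matching passage under a softmax over similarity scores. The sketch below assumes in-batch negatives, meaning each question treats the other questions' passages in the batch as its negative samples. That choice, the function name `contrastive_loss`, and the toy vectors are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(q_vecs, p_vecs):
    """Mean negative log-likelihood of the positive passage.
    p_vecs[i] is the positive for q_vecs[i]; all other passages in the
    batch act as negatives for that question."""
    total = 0.0
    for i, q in enumerate(q_vecs):
        sims = [dot(q, p) for p in p_vecs]
        m = max(sims)
        # log of the softmax normalizer, computed stably
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]
    return total / len(q_vecs)

# Toy batch: two questions, two passages (made-up vectors)
q_vecs = [[2.0, 0.0], [0.0, 2.0]]
aligned_p = [[2.0, 0.0], [0.0, 2.0]]   # each question matches its own passage
swapped_p = [[0.0, 2.0], [2.0, 0.0]]   # positives point the wrong way

loss_aligned = contrastive_loss(q_vecs, aligned_p)
loss_swapped = contrastive_loss(q_vecs, swapped_p)
print("aligned:", round(loss_aligned, 4), "swapped:", round(loss_swapped, 4))
```

The loss is small when each question scores highest against its own passage and large when a negative outscores the positive, which is the pressure that pulls matching pairs together during training.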