Word2Vec: CBOW and Skip-Gram Architectures
Word2Vec, developed by Mikolov et al. at Google, is a landmark framework for learning word embeddings from large text corpora. By training shallow neural networks on context-prediction tasks, Word2Vec learns dense vector representations that capture syntactic and semantic patterns.
CBOW and Skip-Gram Architectures
Word2Vec provides two distinct training architectures based on local sliding context windows: Continuous Bag of Words (CBOW) and Skip-Gram.
Continuous Bag of Words (CBOW)
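CBOW predicts a target word from its surrounding context words. The context vectors are averaged, so word order inside the window is ignored (hence "bag of words"). For example, given the context ['the', 'cat', 'on', 'the'], the model predicts the target word 'sat'. CBOW trains faster than Skip-Gram and works well for frequent words.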
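As a minimal sketch of how such (context, target) examples could be produced, the snippet below slides a window over a tokenized sentence. The function name `cbow_pairs`, the example sentence, and the window size of 2 are illustrative assumptions, not part of Word2Vec itself.

```python
def cbow_pairs(tokens, window=2):
    """Return a list of (context_words, target_word) pairs for CBOW.

    `window` is the number of words taken on each side of the target
    (an assumed setting for this sketch).
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        context = left + right
        if context:                              # skip empty contexts
            pairs.append((context, target))
    return pairs


# Hypothetical example sentence, chosen to match the context in the text.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = cbow_pairs(sentence, window=2)
print(pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')
```

Near the sentence boundaries the window is simply truncated, so the first and last words get shorter contexts.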
Skip-Gram Architecture
Skip-Gram reverses the CBOW task: it uses a single target word to predict each of its surrounding context words. Because every target-context pair becomes a separate training example, Skip-Gram is slower to train than CBOW, but it tends to perform better on rare words and to capture finer-grained semantic detail.
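The following sketch shows one way to enumerate Skip-Gram training pairs from a tokenized sentence. The helper name `skipgram_pairs`, the example sentence, and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return a list of (target_word, context_word) pairs for Skip-Gram."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                  # window start, clipped at 0
        hi = min(len(tokens), i + window + 1)    # window end (exclusive)
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical example sentence.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
sg_pairs = skipgram_pairs(sentence, window=2)
sat_pairs = [p for p in sg_pairs if p[0] == "sat"]
print(sat_pairs)
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

One target word yields several training examples here, which is part of why Skip-Gram takes more training steps than CBOW over the same text.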
Negative Sampling Optimization
Computing a softmax over a large vocabulary at every training step is expensive, because the normalization term sums over every word in the vocabulary. Word2Vec avoids this cost by using negative sampling.
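For reference, the full softmax for Skip-Gram can be written as below. The notation is an assumption borrowed from the usual Word2Vec presentation: v_{w} is the input vector of word w, v'_{w} is its output vector, w_I is the input (target) word, w_O is the context word, and V is the vocabulary size.

```latex
p(w_O \mid w_I) =
  \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator contains one term per vocabulary word, so evaluating it (and its gradient) costs time proportional to V for every training pair.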
Simplifying the Output Layer
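Negative sampling converts the multi-class classification task into a set of binary logistic regressions. For each observed target-context pair (the positive sample), the model is also given k words drawn at random from the vocabulary (the negative samples, typically 5 to 20) and is trained to distinguish the true context word from the noise. The loss for one pair is L = -\log \sigma({v'_{w_O}}^{\top} v_{w_I}) - \sum_{i=1}^{k} \log \sigma(-{v'_{w_{N_i}}}^{\top} v_{w_I}), where v_{w_I} is the input vector of the target word, v'_{w_O} is the output vector of the true context word, v'_{w_{N_i}} is the output vector of the i-th negative sample, and \sigma is the logistic sigmoid.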
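To make the loss concrete, here is a small worked sketch in plain Python. The two-dimensional vectors are made-up toy values and the function names are hypothetical. The sketch only evaluates the formula for one positive pair and k = 2 negatives.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """L = -log sigma(v'_pos . v_in) - sum_i log sigma(-v'_neg_i . v_in)."""
    loss = -math.log(sigmoid(dot(v_out_pos, v_in)))       # positive pair term
    for v_neg in v_out_negs:                              # one term per negative
        loss -= math.log(sigmoid(-dot(v_neg, v_in)))
    return loss


# Toy vectors (assumed values for illustration only).
v_in = [1.0, 0.0]

# Case 1: the true context vector points the same way as the input vector,
# and the negatives point away from it. The loss is small.
loss_good = negative_sampling_loss(v_in, [2.0, 0.0], [[-2.0, 0.0], [-2.0, 1.0]])

# Case 2: the roles are swapped, so the model is "wrong". The loss is large.
loss_bad = negative_sampling_loss(v_in, [-2.0, 0.0], [[2.0, 0.0], [2.0, 1.0]])

print(loss_good < loss_bad)  # True
```

Minimizing this loss pushes the dot product of true pairs up and the dot product of sampled noise pairs down, touching only k + 1 output vectors per step instead of the whole vocabulary.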
Skip-Gram Training Concept
Training inputs are structured as target-context pairs of word indices. The PyTorch code below is a conceptual sketch of a Skip-Gram model that scores such pairs.
<pre><code class="language-python">import torch
import torch.nn as nn


class SkipGramModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128):
        super().__init__()
        # Target and context embeddings are separate tables
        self.w_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.v_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, target, context):
        # target, context: [batch_size] tensors of word indices
        v_t = self.w_embeddings(target)   # Target vectors:  [batch_size, embed_dim]
        u_c = self.v_embeddings(context)  # Context vectors: [batch_size, embed_dim]
        # Compute dot product to measure similarity: [batch_size]
        return torch.sum(v_t * u_c, dim=1)
</code></pre>
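As a usage sketch, the snippet below runs one training step that combines this model with the negative sampling loss. The class is repeated so the snippet runs on its own. The vocabulary size, batch contents, learning rate, and the uniform sampling of negatives are all assumptions made for illustration. The original Word2Vec work draws negatives from a smoothed unigram distribution rather than uniformly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGramModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128):
        super().__init__()
        self.w_embeddings = nn.Embedding(vocab_size, embed_dim)  # target table
        self.v_embeddings = nn.Embedding(vocab_size, embed_dim)  # context table

    def forward(self, target, context):
        v_t = self.w_embeddings(target)
        u_c = self.v_embeddings(context)
        return torch.sum(v_t * u_c, dim=1)


torch.manual_seed(0)
vocab_size, k = 50, 5                      # toy sizes (assumed)
model = SkipGramModel(vocab_size=vocab_size, embed_dim=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# A made-up batch of 4 positive (target, context) index pairs.
target = torch.tensor([3, 7, 7, 12])
context = torch.tensor([4, 6, 8, 11])

# k negatives per positive pair, sampled uniformly (a simplification).
negatives = torch.randint(0, vocab_size, (target.size(0), k))

pos_score = model(target, context)                         # [batch]
# Repeat each target k times so it lines up with its flattened negatives.
neg_score = model(target.repeat_interleave(k),
                  negatives.reshape(-1))                   # [batch * k]

# Negative sampling loss, averaged over the positive pairs in the batch.
loss = -(F.logsigmoid(pos_score).sum()
         + F.logsigmoid(-neg_score).sum()) / target.size(0)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

Reusing `forward` for both positive and negative pairs keeps the model unchanged. Only the sign inside `logsigmoid` differs between the two kinds of pair.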