The Problem with One-Hot Encoded Words

Representing words as one-hot vectors is the simplest way to convert categorical text into numeric inputs. However, the approach scales poorly: it produces extremely high-dimensional, sparse vectors, and those vectors cannot represent semantic relationships between words.

Sparsity and the Curse of Dimensionality

A one-hot vector represents a word as a binary vector of size V (vocabulary size), where a single index contains a 1 and all other positions contain 0.
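The definition above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a prescribed implementation: the five-word `vocab` mapping and the `one_hot` helper are hypothetical names introduced here, and the index assigned to each word is arbitrary.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary; the words and their indices are illustrative only.
vocab = {"cat": 0, "dog": 1, "car": 2, "tree": 3, "house": 4}
V = len(vocab)  # vocabulary size


def one_hot(word: str) -> torch.Tensor:
    """Return a length-V float vector with a 1 at the word's index and 0 elsewhere."""
    index = torch.tensor(vocab[word])  # int64 scalar, as F.one_hot expects
    return F.one_hot(index, num_classes=V).to(torch.float32)


v_dog = one_hot("dog")
print(v_dog)  # tensor([0., 1., 0., 0., 0.])
```

With a real vocabulary, V would be in the tens of thousands and every vector would still contain exactly one nonzero entry.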

High Dimensionality

If a vocabulary has 50,000 words, each word is represented by a 50,000-dimensional vector containing 49,999 zeros. This sparsity wastes memory when the vectors are stored densely, and it forces the network's first weight matrix to have one row per vocabulary word.
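To make the cost concrete, the arithmetic for the 50,000-word example can be worked through directly. The byte sizes below are assumptions (dense float32 storage for the vector, a 64-bit integer for the bare index); other storage formats would change the numbers but not the conclusion.

```python
V = 50_000             # vocabulary size from the example above
BYTES_PER_FLOAT32 = 4  # assumption: one-hot vector stored densely as float32
BYTES_PER_INT64 = 8    # assumption: word index stored as a 64-bit integer

zeros_per_word = V - 1                        # entries that carry no information
dense_bytes_per_word = V * BYTES_PER_FLOAT32  # bytes for one dense one-hot vector
index_bytes_per_word = BYTES_PER_INT64        # bytes for the index alone
ratio = dense_bytes_per_word // index_bytes_per_word

print(zeros_per_word)        # 49999
print(dense_bytes_per_word)  # 200000
print(ratio)                 # 25000
```

Under these assumptions, a single dense one-hot vector takes about 200 KB to convey the same information as one 8-byte integer.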

Memory Scaling Limits

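As the vocabulary grows, the memory required to store a single one-hot vector grows linearly with V, and a dense table holding one vector per vocabulary word grows as V². Multiplying a one-hot vector by a weight matrix is equivalent to looking up a single row of that matrix, so the computation itself can be done cheaply. The representation, however, encodes no similarity between words: any semantic grouping has to be learned in the weights, because the inputs provide none.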
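The equivalence between the matrix product and a row lookup can be checked numerically. This is a small sketch: the sizes `V` and `d`, the random matrix `W`, and the chosen `index` are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)  # reproducible random weights

V, d = 5, 3            # hypothetical vocabulary size and output dimension
W = torch.randn(V, d)  # weight matrix with one row per vocabulary word

index = 2              # hypothetical word index
one_hot = torch.zeros(V)
one_hot[index] = 1.0

via_matmul = one_hot @ W  # full vector-matrix product, shape (d,)
via_lookup = W[index]     # direct row selection, shape (d,)

print(torch.allclose(via_matmul, via_lookup))  # True
```

The product touches every row of `W` only to discard all but one of them, which is why the multiplication is usually replaced by indexing in practice.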

Lack of Semantic Similarity

Because one-hot vectors are orthogonal, they contain no information about how words relate to each other.

Orthogonality

The dot product of any two distinct one-hot vectors is always 0. Consequently, the cosine similarity between 'cat' and 'kitten' is exactly the same as the similarity between 'cat' and 'refrigerator' (both are 0). Nothing in the input tells a model that 'cat' and 'kitten' are related, so what it learns about one word does not automatically carry over to the other.

Zero Similarity in PyTorch

This code shows that distinct one-hot vectors have a dot product of zero, and therefore a cosine similarity of zero.

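<pre><code class="language-python">import torch

# One-hot representations for 'cat', 'dog', and 'car'
v_cat = torch.tensor([1, 0, 0], dtype=torch.float32)
v_dog = torch.tensor([0, 1, 0], dtype=torch.float32)
v_car = torch.tensor([0, 0, 1], dtype=torch.float32)

# The dot product of distinct one-hot vectors is 0: they are orthogonal
similarity = torch.dot(v_cat, v_dog)
print(similarity.item())  # 0.0

# 'cat' is no closer to 'dog' than it is to 'car'
print(torch.dot(v_cat, v_car).item())  # 0.0
</code></pre>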