Information Theory: What is a "Bit"?
Information theory was founded in 1948 by Claude Shannon, who sought to mathematically define the limits of communication and data compression. At the core of this theory is the formal definition of information as the reduction of uncertainty. The fundamental unit of information is the 'bit' (binary digit). Understanding how information is quantified, coded, and compressed is essential for understanding loss functions (like Cross-Entropy), autoencoders, generative models, and how neural networks compress representations in their latent layers.
We define a bit not as a storage slot in hardware, but as the amount of information required to resolve a choice between two equally likely alternatives. By analyzing data sources as random processes emitting symbols, we can define the absolute limits of lossless compression. We will explore self-information, Shannon's Source Coding Theorem, expected information, and information bottlenecks in deep learning.
Information and Uncertainty
Information theory defines information not by its meaning, but by the reduction of uncertainty. The less probable an event is, the more information it carries when it occurs.
Surprise is the core currency of information: receiving a message that contains highly predictable contents yields very little news, while receiving a rare signal yields high information.
Self-Information ($I(x)$)
The self-information (or Shannon information) of an event $x$ with probability $P(x)$ is defined as the negative logarithm of its probability:
$$I(x) = -\log_2 P(x) = \log_2 \left( \frac{1}{P(x)} \right)$$
If an event is guaranteed ($P(x) = 1$), its self-information is $0$. An event with a low probability carries high self-information because its occurrence is highly surprising. For instance, rolling an 8-sided die yields a probability $P(x) = 1/8$, so the self-information gained is $-\log_2(1/8) = 3$ bits.
Proof of the Additive Property and why the Log Scale is Used
The logarithmic scale is chosen to satisfy the property of additivity for independent events. If two events $A$ and $B$ are independent, their joint probability is $P(A \cap B) = P(A)P(B)$. The information gained from observing both events must equal the sum of their individual information values. Let us prove that the log scale is the unique continuous scaling that satisfies this additive property. We require a function $I(p)$ such that:
$$I(p_A \cdot p_B) = I(p_A) + I(p_B)$$
Let $f(x) = I(e^x)$. Then the multiplicative relation becomes additive in the exponents: $f(x + y) = f(x) + f(y)$. The only continuous solution to Cauchy's functional equation $f(x+y) = f(x)+f(y)$ is a linear function: $f(x) = cx$. Substituting back:
$$I(p) = c \ln(p)$$
To make information positive (since $p \le 1 \implies \ln(p) \le 0$), we choose a negative constant $c = -1/\ln(b)$, yielding:
$$I(p) = -\log_b(p)$$
Using base 2 logarithms measures information in bits (binary digits), while natural logarithms (base $e$) measure information in nats. We convert between them using: $1 \text{ nat} = \frac{1}{\ln(2)} \approx 1.4427 \text{ bits}$.
Defining the "Bit" and Coding Theory
A bit is the fundamental unit of information, representing the quantity of information gained from a choice between two equally likely outcomes.
Coding theory explores how to map source symbols into binary strings of minimal length, bounded by the statistical properties of the source.
Binary Decisions and Bits
When flipping a fair coin ($P(\text{heads}) = 0.5$), the self-information is $-\log_2(0.5) = 1$ bit. Thus, $1$ bit represents the capacity required to store or transmit a single binary decision. If we have $K$ equally likely outcomes, the information gained is $log_2 K$ bits, representing the minimum number of binary questions needed to identify the outcome. It serves as the physical benchmark for representing categorical choices.
Kraft-McMillan Inequality and Shannon's Source Coding Theorem
Coding theory assigns binary codewords to symbols. To ensure that a stream of code can be uniquely decoded without delimiter markers (known as a prefix-free code), the lengths of the codewords $l_1, \dots, l_M$ must satisfy the Kraft-McMillan Inequality:
$$\sum_{i=1}^{M} 2^{-l_i} \le 1$$
Shannon's Source Coding Theorem establishes the absolute physical limit of lossless data compression. It states that for a random variable $X$ with entropy $H(X)$, the average length $L = \sum P(x_i) l_i$ of any prefix-free code used to represent $X$ satisfies:
$$L \ge H(X)$$
This means we cannot compress messages from a source below its entropy without losing information. All compression algorithms (like Huffman coding, arithmetic coding, or ZIP compression) are practical implementations designed to approach this Shannon limit.
Entropy as Expected Information
Self-information measures the surprise of a single event. To measure the average surprise of a complete system, we calculate the expected value of self-information.
Expected self-information represents the average uncertainty of a random source, showing how much information we expect to gain per emitted symbol.
Shannon Entropy ($H(X)$)
For a discrete random variable $X$ with probability distribution $P(X)$, the Shannon Entropy $H(X)$ is the expectation of the self-information:
$$H(X) = E[I(X)] = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
It quantifies the average uncertainty of the random variable. For a binary source where the probability of success is $p$, the binary entropy function is:
$$H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
This function peaks at $p=0.5$ with $H_b(0.5) = 1$ bit, representing maximum uncertainty, and is $0$ at $p=0$ and $p=1$, representing complete certainty.
Nats vs. Bits in Deep Learning
While information theory typically uses base 2 (bits), machine learning libraries (like PyTorch and TensorFlow) use natural logarithms (base $e$), measuring entropy in nats. The natural log simplifies differentiation during gradient descent. The conversion is:
$$1 \text{ nat} = \frac{1}{\ln(2)} \approx 1.4427 \text{ bits}$$
When calculating Cross-Entropy Loss, we use the natural log: $-\sum y_i \ln \hat{y}_i$. This makes the derivatives much cleaner to calculate on GPUs during backpropagation.
Information Bottlenecks and Multi-Variable Information Theory
Neural networks can be viewed as information transmission channels that filter and compress representations.
By forcing inputs through structural constraints, networks discard irrelevant noise and preserve only features critical for predicting labels.
Joint Entropy, Conditional Entropy, and Mutual Information
To analyze information in multi-variable systems, we define:
1. Joint Entropy: The total uncertainty of two random variables $X$ and $Y$: $H(X, Y) = -\sum_{x,y} P(x, y) \log_2 P(x, y)$.
2. Conditional Entropy: The remaining uncertainty of $Y$ given we know $X$: $H(Y|X) = -\sum_{x,y} P(x, y) \log_2 P(y|x)$. Let us prove the chain rule for entropy: $H(X, Y) = H(X) + H(Y|X)$:
$$H(X, Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y) = -\sum_{x,y} P(x,y) \log_2 [P(x)P(y|x)]$$
$$= -\sum_{x,y} P(x,y) \log_2 P(x) - \sum_{x,y} P(x,y) \log_2 P(y|x)$$
Since $\sum_y P(x,y) = P(x)$, the first term simplifies to $-\sum_x P(x) \log_2 P(x) = H(X)$, proving the theorem.
3. Mutual Information: The amount of information shared between $X$ and $Y$, or the uncertainty reduction of one variable given the other:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
This symmetric measure quantifies any dependency (linear or non-linear) between variables, serving as the basis for non-linear feature selection in ML.
The Information Bottleneck and VAE Latent Compression
Proposed by Naftali Tishby, the Information Bottleneck posits that deep networks train by balancing two objectives: compression and prediction. For input $X$ and target $Y$, a hidden representation $T$ is learned by minimizing the mutual information $I(X; T)$ (discarding input noise) while maximizing $I(T; Y)$ (retaining information about the label):
$$\min_T \left( I(X; T) - \beta I(T; Y) \right)$$
where $\beta$ balances the trade-off. This principle is realized in Variational Autoencoders (VAEs). By forcing inputs $X$ through a low-dimensional bottleneck layer $Z$, the network compresses data. The VAE loss balances reconstruction loss and a Kullback-Leibler divergence term that regularizes the latent space distribution, forcing it to align with a simple Gaussian prior.