BELL LABS - MURRAY HILL, NEW JERSEY - 1943
The United States needed to send President Roosevelt's voice across the Atlantic. In real time. Without the Nazis hearing a single word. The problem required a new kind of mathematics, one that did not yet exist.
The question Claude Shannon had to answer first: what is information, exactly?
In 1943, Bell Labs was the most productive research institution in the world. Its scientists had pioneered the negative-feedback amplifier, coaxial transmission, and the coding techniques that underpinned the entire telephone network; the transistor and the solar cell were still years away. Now the U.S. military needed something far more difficult: a voice communication system so secure that it could carry top-secret conversations between Allied commanders across an ocean patrolled by submarines.
Claude Shannon, twenty-seven years old and already one of the most original minds at Bell Labs, was brought onto the project to analyse its security. The system, SIGSALY, digitised the human voice, encrypted each sample with a one-time key, and transmitted the result as noise indistinguishable from static. It was the first perfectly secure voice communication system ever built.
But building SIGSALY forced Shannon to confront a deeper question. If you are going to encrypt information, you need to know how much of it there is. If you are going to transmit it over a noisy channel, you need to know the minimum rate at which you can do so without loss. Neither question had a mathematical answer. Shannon spent the next five years building one.
A brief aside
In 1943, Alan Turing visited Bell Labs on a liaison mission from Bletchley Park. He and Shannon ate lunch together in the Bell Labs cafeteria several times. They discussed machines, communication, and what it would mean for a machine to think. Neither left a detailed record of the conversations. We know only that two years later, Shannon wrote a classified paper proving mathematically why ciphers like Enigma could be broken, answering, in a sense, the question Turing's Bombe had been implicitly asking.
"Information is the resolution of uncertainty."
The people behind it
Claude Shannon
1916 – 2001
Mathematician · Bell Labs
In a single paper — 'A Mathematical Theory of Communication' (1948) — Shannon founded information theory, defined entropy, proved the source and channel coding theorems, and established the mathematical foundations for every digital communication system built since.
"Shannon's 1948 paper is considered one of the most important scientific papers of the twentieth century. Its concepts underpin the internet, mobile phones, data compression, and modern AI."
David Huffman
1925 – 1999
Electrical Engineer · MIT
As a graduate student in 1952, Huffman was given a choice: sit the final exam, or find an optimal prefix-free code. He found the code. The Huffman coding algorithm reaches within 1 bit of Shannon's entropy lower bound and is still used in JPEG, MP3, and PDF compression today.
"Huffman later said he likely would not have discovered the algorithm if he had not been racing against a deadline."
Richard Hamming
1915 – 1998
Mathematician · Bell Labs
Shannon's colleague at Bell Labs. Invented error-correcting codes (Hamming codes) after becoming frustrated that the Bell Labs computer would stop and ring a bell when it hit an error on weekends — when no operator was present to restart it.
"Hamming codes are still used in computer memory (ECC RAM). The principle extends to the Reed-Solomon codes that protect data on CDs, DVDs, and space probes."
What does it mean to say a message contains information?
Shannon's insight was this: a message is informative in proportion to how surprising it is. If you already knew what the message said, it tells you nothing. If it was completely unexpected, it tells you a great deal.
More formally: the information content of an outcome x with probability p(x) is I(x) = −log₂ p(x) bits. A fair coin flip carries 1 bit. A fair die roll carries log₂ 6 ≈ 2.58 bits. A certain outcome carries 0 bits: no surprise, no information.
The entropy of a source is the average information content across all outcomes, the expected surprise:
Shannon Entropy
H(X) = −Σ p(x) log₂ p(x)
measured in bits per symbol
The logarithm is not arbitrary; it is forced by a requirement that makes perfect intuitive sense: information should be additive.
If you flip two fair coins, you learn the result of the first plus the result of the second. Two independent events should carry the sum of their individual information. For independent events, p(x, y) = p(x) · p(y), so:
I(x, y) = −log₂ p(x)p(y) = −log₂ p(x) − log₂ p(y) = I(x) + I(y)
The log turns multiplication into addition. That is exactly the same insight Turing used when he invented decibans, and the same reason modern machine learning works in log-probabilities.
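A quick numeric check of that additivity, using the definition I(x) = −log₂ p(x) from above:

```python
import math

def info_bits(p: float) -> float:
    """Self-information of an outcome with probability p, in bits."""
    return -math.log2(p)

coin = info_bits(1 / 2)              # one fair coin flip: 1 bit
die = info_bits(1 / 6)               # one fair die roll: ~2.585 bits
both = info_bits((1 / 2) * (1 / 6))  # joint outcome of the two independent events

print(coin, die, both)
assert abs(both - (coin + die)) < 1e-9  # information adds for independent events
```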
H = 0
complete certainty
One outcome is guaranteed. The “message” tells you nothing you didn't already know.
H is maximised
all outcomes equally likely
Maximum uncertainty = maximum information per message. For 26 letters: H = log₂(26) ≈ 4.7 bits.
H(English) ≈ 4.1
structured, not random
English has predictable patterns (Q is almost always followed by U), so its entropy is below the theoretical maximum.
Entropy Calculator
[Interactive widget: type text, hover a bar to see each letter's frequency and its contribution to entropy.]
H(X) = −Σ p(x) log₂ p(x) — English text averages ~4.1 bits/letter. Random has ~4.7. Repetitive text approaches 0.
Shannon's 1945 theorem, declassified in 1949, settled the question once and for all
In 1945, Shannon wrote a classified memorandum that used his new entropy framework to prove something fundamental about cryptography. It was not declassified until 1949, when it appeared as 'Communication Theory of Secrecy Systems'.
Shannon's Perfect Secrecy Theorem (1945)
A cipher is perfectly secure (unbreakable regardless of computational power) if and only if the entropy of the key is at least as large as the entropy of the message:
H(K) ≥ H(M)
The only cipher that satisfies this is the one-time pad: a key that is truly random, as long as the message, and used exactly once. SIGSALY was a voice one-time pad. It was perfectly secure.
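The one-time pad is simple to sketch in code. A minimal example, assuming a truly random key the same length as the (toy) message; the same XOR that encrypts also decrypts:

```python
import secrets

def otp(message: bytes, key: bytes) -> bytes:
    """One-time pad: XOR each message byte with a key byte. The key must be
    truly random, as long as the message, and never reused."""
    assert len(key) == len(message)
    return bytes(m ^ k for m, k in zip(message, key))

message = b"ATTACK AT DAWN"                # toy plaintext
key = secrets.token_bytes(len(message))    # fresh random key, used exactly once

ciphertext = otp(message, key)
recovered = otp(ciphertext, key)           # XOR with the same key inverts it

print(ciphertext.hex())
print(recovered)                           # b'ATTACK AT DAWN'
```

Without the key, the ciphertext is consistent with every possible message of the same length: that is perfect secrecy.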
How much entropy did a Naval Enigma key have? The day settings specified: rotor order (3 from 5), initial positions, plugboard pairs, and ring settings. The theoretical keyspace was about 10²³, roughly 77 bits of entropy if fully exploited.
A typical message was thirty letters of German military prose. English prose has about 4.1 bits per letter; German military prose is somewhat more predictable. Call it 3.5 bits per letter:
H(M) ≈ 30 × 3.5 = 105 bits
So even before operator errors, the key entropy (77 bits) is less than the message entropy (105 bits). Shannon's theorem says: already breakable in principle.
But operators made it far worse. They reused the same settings all day. They opened weather reports with WETTER. They signed off with HEIL HITLER. They chose wheel positions like AAA. Each predictable pattern reduced the effective key entropy far below 77 bits, while the redundancy of German prose remained constant.
The theorem in one sentence
Turing's Bayesian decoder exploited the gap between the entropy Enigma operators actually used and the entropy their messages contained. Shannon proved this gap cannot be closed without a key as long as the message itself. Enigma's key was reused daily. The war was over before daily keys could be made long enough.
Two theorems that define the boundaries of all communication
On 1 July 1948, the Bell System Technical Journal published “A Mathematical Theory of Communication” by C.E. Shannon. In 55 pages, Shannon answered two questions that engineers had been asking for decades without a satisfying answer.
You cannot compress a source below its entropy.
If a source emits symbols with entropy H bits per symbol, you need at least H bits per symbol to represent it without loss. Any compression that goes below H must, on average, throw information away.
English prose has entropy around 4.1 bits per letter. A fixed-length ASCII code uses 8 bits. Shannon's theorem says the compression floor is 4.1 bits, not 0. You can approach it; you cannot beat it.
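A rough empirical sketch using Python's standard-library zlib (DEFLATE, which combines LZ77 matching with Huffman coding): the compressed size per character tracks the entropy of the source. The figures are approximate, since DEFLATE adds container overhead.

```python
import random
import string
import zlib

def compressed_bits_per_char(text: str) -> float:
    """Bits per character after DEFLATE compression (includes format overhead)."""
    raw = text.encode("ascii")
    return 8 * len(zlib.compress(raw, level=9)) / len(raw)

repetitive = "A" * 10_000   # source entropy ~0 bits/letter
uniform = "".join(random.choice(string.ascii_uppercase) for _ in range(10_000))

print(f"repetitive: {compressed_bits_per_char(repetitive):.2f} bits/char")  # near 0
print(f"uniform:    {compressed_bits_per_char(uniform):.2f} bits/char")     # near log2(26) ~ 4.7
```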
Every noisy channel has a capacity C, measured in bits per second, such that reliable communication is possible at any rate below C and impossible above it.
For the most important case, a channel with bandwidth B Hz and signal-to-noise ratio S/N, the capacity is:
Shannon-Hartley Theorem
C = B log₂(1 + S/N)
C in bits/s, B in Hz; the hard upper bound on any communication system
This formula is why your internet service provider quotes speeds in bits per second. It is the reason 5G requires wider bandwidth than 4G. It governs satellite communication, deep-space probes, and fibre-optic cables. Engineers do not argue with Shannon's bound; they design up to it.
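Plugging numbers into the bound (a sketch; the 3 kHz bandwidth and 30 dB SNR are the textbook figures for an analogue telephone line):

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Classic analogue telephone line: ~3 kHz bandwidth, ~30 dB signal-to-noise ratio
snr = 10 ** (30 / 10)            # 30 dB is a power ratio of 1000
c = shannon_capacity(3000, snr)
print(f"{c:,.0f} bits/s")        # ~29,900 bits/s
```

This is why analogue voiceband modems could never go far beyond 30 kbit/s without changing the channel itself.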
Shannon proved the floor exists. Huffman showed how to reach it.
Shannon's source coding theorem said the entropy H bits per symbol was the compression floor. But it was an existence proof: it showed the floor was reachable without telling you how to reach it. In 1952, a graduate student named David Huffman found the algorithm.
The idea is beautifully simple. Frequent symbols should get short codes; rare symbols should get long ones. Build a binary tree bottom-up:
Start with each symbol as a leaf node, labelled with its frequency.
Repeatedly take the two lowest-frequency nodes and merge them into a new parent node whose frequency is their sum.
Assign 0 to one branch and 1 to the other.
The code for each symbol is the sequence of 0s and 1s on the path from root to leaf.
The resulting code is prefix-free: no codeword is a prefix of another. This means you can decode a bitstream unambiguously without knowing where one codeword ends and the next begins; you simply follow the tree.
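Prefix-freeness is what makes unambiguous streaming decode possible. A minimal sketch with a small hypothetical code (illustrative symbols, not derived from real letter frequencies):

```python
# A hypothetical prefix-free code: no codeword is a prefix of another
CODE = {"E": "0", "T": "10", "A": "110", "N": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstream: str) -> str:
    """Scan left to right; emit a symbol as soon as the buffer matches a
    codeword. Prefix-freeness guarantees there is never any ambiguity."""
    out, buf = [], ""
    for bit in bitstream:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "stream ended mid-codeword"
    return "".join(out)

encoded = "".join(CODE[s] for s in "NEAT")  # 111 + 0 + 110 + 10
print(decode(encoded))                      # NEAT
```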
The expected code length of a Huffman code is at most H(X) + 1 bits per symbol: within 1 bit of the theoretical floor.
Huffman Codes — Top 8 English Letters
[Interactive widget: the Huffman tree for the eight most frequent English letters. Entropy ≈ 2.96 bits; average code length ≈ 3.00 bits; about 1.3% above the entropy floor.]
Frequent letters (E, T, A) get short codes. Rare ones (J, X, Q) get long codes. Shannon proved you cannot do better than the entropy floor on average. Huffman codes get within 1 bit of it.
Where Huffman codes live today
Huffman coding is a core component of DEFLATE (used in ZIP, gzip, PNG), JPEG, and MP3 compression. It is in every PDF you have ever opened. The browser decompressing this page very likely used it.
Three implementations: entropy, compression, and a noisy channel
The formula in seven lines. Compare English, random, and repetitive text.
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol."""
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return 0.0
    n = len(letters)
    counts = Counter(letters)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# English prose (a 35-letter sample runs high; long prose averages ~4.1)
print(f"English: {entropy('the quick brown fox jumps over the lazy dog'):.3f} bits")  # ~4.5

# Random-looking (all 26 letters, each exactly once)
print(f"Random:  {entropy('xkqzjvbwpyfuhmolcnrsdtigea'):.3f} bits")  # log2(26) ~ 4.7

# Highly repetitive
print(f"AAAA...: {entropy('A' * 50):.3f} bits")  # 0.0

Build the tree, assign the codes, verify that average code length matches the entropy floor within 1 bit.
import heapq
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffNode:
    freq: float
    symbol: Optional[str] = field(compare=False, default=None)
    left: Optional["HuffNode"] = field(compare=False, default=None)
    right: Optional["HuffNode"] = field(compare=False, default=None)

def build_tree(freqs: dict[str, float]) -> HuffNode:
    heap = [HuffNode(freq=f, symbol=s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, HuffNode(freq=lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

def build_codes(node: HuffNode, prefix="", codes=None) -> dict[str, str]:
    if codes is None:
        codes = {}
    if node.symbol is not None:
        codes[node.symbol] = prefix or "0"
    else:
        build_codes(node.left, prefix + "0", codes)
        build_codes(node.right, prefix + "1", codes)
    return codes

def entropy_from_freqs(freqs: dict[str, float]) -> float:
    """Entropy floor for a (possibly unnormalised) frequency table."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())

# English letter frequencies (%), top 16
ENG = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3, "H": 6.1,
       "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2}

tree = build_tree(ENG)
codes = build_codes(tree)

# Average code length vs. the entropy floor
total = sum(ENG.values())
avg_len = sum((f / total) * len(codes[s]) for s, f in ENG.items())
print(f"Avg code length: {avg_len:.3f} bits/letter")                  # -> ~3.85
print(f"Entropy floor:   {entropy_from_freqs(ENG):.3f} bits/letter")  # -> ~3.82

Simulate a Binary Symmetric Channel. Measure the error rate with and without a simple repetition code. Verify against Shannon's capacity formula.
import math
import random

def bsc_channel(bits: list[int], error_rate: float) -> list[int]:
    """Binary Symmetric Channel: flip each bit with probability error_rate."""
    return [b ^ (1 if random.random() < error_rate else 0) for b in bits]

def encode_repetition(bits: list[int], n: int = 3) -> list[int]:
    """Repetition code: send each bit n times."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(received: list[int], n: int = 3) -> list[int]:
    """Majority vote over each group of n bits."""
    decoded = []
    for i in range(0, len(received), n):
        block = received[i:i + n]
        decoded.append(1 if sum(block) > n // 2 else 0)
    return decoded

# Simulate: 100 bits at 10% error rate
message = [random.randint(0, 1) for _ in range(100)]

# Raw channel
received_raw = bsc_channel(message, error_rate=0.10)
raw_errors = sum(a != b for a, b in zip(message, received_raw))

# With rate-1/3 repetition code
encoded = encode_repetition(message, n=3)
received_coded = bsc_channel(encoded, error_rate=0.10)
decoded = decode_repetition(received_coded, n=3)
coded_errors = sum(a != b for a, b in zip(message, decoded))

print(f"Without coding:       {raw_errors}/100 bit errors")
print(f"With repetition code: {coded_errors}/100 bit errors")

# Shannon's theorem: capacity of a BSC with p = 0.10 is 1 - H(p)
p = 0.10
capacity = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)
print(f"BSC capacity at p=0.10: {capacity:.4f} bits per channel use")  # -> 0.5310

Notebook 01
Entropy & Information
Derive H from first principles. Compute it on real text. Verify the additivity property.
Notebook 02
Source Coding
Build Huffman codes from scratch. Verify the entropy bound.
Notebook 03
Channel Capacity
Simulate noisy channels. Verify Shannon's capacity formula experimentally.
# Get started
$ git clone https://github.com/vivekatsuperset/lutchet
$ uv sync && uv run jupyter lab
Shannon's 1948 concepts are the mathematical core of every large language model
When you use ChatGPT, Claude, or Gemini, you are interacting with a system trained by minimising a loss function called cross-entropy. This is Shannon's entropy, applied to the difference between what the model predicts and what actually appeared.
Let p be the true distribution of the next token and q be the model's prediction. The cross-entropy is:
H(p, q) = −Σ p(x) log₂ q(x)
Minimised when q = p, i.e. when the model's predictions match reality
In practice, for a language model predicting the next token from a training corpus, p is 1 for the true next token and 0 for all others, so the sum collapses to −log₂ q(true token). Training minimises the surprise the model expresses at the actual next word.
Perplexity is cross-entropy exponentiated:
PPL = 2^H(p, q)
Intuitively, perplexity is the "effective branching factor": how many equally likely choices the model is choosing between at each token position. A model with PPL = 10 is as uncertain at each step as if it were choosing uniformly among 10 possibilities. State-of-the-art LLMs have perplexity below 10 on standard benchmarks.
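Both quantities in a few lines, for a single next-token prediction (toy vocabulary and probabilities, not from any real model):

```python
import math

def cross_entropy_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Toy vocabulary; the true next token is "mat" (one-hot target)
p = {"mat": 1.0, "hat": 0.0, "dog": 0.0}
q = {"mat": 0.70, "hat": 0.20, "dog": 0.10}  # model's predicted distribution

loss = cross_entropy_bits(p, q)   # one-hot target: collapses to -log2 q("mat")
ppl = 2 ** loss                   # perplexity: effective branching factor

print(f"loss = {loss:.3f} bits, perplexity = {ppl:.3f}")  # ppl = 1/0.70 ~ 1.43
```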
The Kullback-Leibler divergence measures the extra bits needed when you encode a source distributed as p using a code optimised for q:
D_KL(p ∥ q) = Σ p(x) log₂ (p(x) / q(x))
KL divergence is the loss in RLHF (reinforcement learning from human feedback), the technique used to align language models with human preferences. The penalty term keeps the aligned model from drifting too far from the base model's distribution.
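The definition in code, with a check of the identity D_KL(p ∥ q) = H(p, q) − H(p), on toy distributions:

```python
import math

def kl_divergence_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # "true" distribution
q = {"a": 0.4, "b": 0.4, "c": 0.2}   # approximating distribution

kl = kl_divergence_bits(p, q)
h_p = -sum(px * math.log2(px) for px in p.values())              # entropy H(p)
h_pq = -sum(px * math.log2(q[x]) for x, px in p.items())         # cross-entropy H(p, q)

print(f"D_KL = {kl:.4f} bits")
assert kl >= 0                          # Gibbs' inequality: never negative
assert abs(kl - (h_pq - h_p)) < 1e-9    # the "extra bits" interpretation
```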
| Shannon (1948) | Modern ML (2024) |
|---|---|
| Entropy H(X) | Training objective for every language model |
| Cross-entropy H(p, q) | Loss function: measures prediction quality |
| KL divergence D_KL(p ∥ q) | RLHF alignment penalty; VAE training objective |
| Channel capacity C | Theoretical limit on how well any model can learn a distribution |
| Source coding theorem | Compression in gzip, JPEG, MP3, every file format |
Notebook 03: Channel Capacity & Modern ML
Implements cross-entropy loss from scratch, plots KL divergence as a training curve, and shows that GPT-style token prediction is a direct descendant of Shannon's noisy channel model.
notebooks/03_channel_and_ml.ipynb
Shannon published his paper in 1948. He spent the following decade riding a unicycle through the Bell Labs hallways, juggling, and mostly ignoring the field he had founded.
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
The loss function that trains every language model, the compression that shrinks every file, the error correction that protects every data transmission: all of it traces to a wartime question about sending a president's voice across an ocean without being heard.