BELL LABS - MURRAY HILL, NEW JERSEY - 1943
The United States needed to send President Roosevelt's voice across the Atlantic. In real time. Without the Nazis hearing a single word. The problem required a new kind of mathematics, one that did not yet exist.
The question Claude Shannon had to answer first: what is information, exactly?
In 1943, Bell Labs was the most productive research institution in the world. Its scientists had pioneered the negative-feedback amplifier, coaxial transmission, and the coding techniques that underpinned the entire telephone network; the transistor and the solar cell were still years away. Now the U.S. military needed something far more difficult: a voice communication system so secure that it could carry top-secret conversations between Allied commanders across an ocean patrolled by submarines.
Claude Shannon, twenty-seven years old and already one of the most original minds at Bell Labs, was brought onto the project to analyse its security. The system, SIGSALY, digitised the human voice, encrypted each sample with a one-time key, and transmitted the result as noise indistinguishable from static. It was the first perfectly secure voice communication system ever built.
But building SIGSALY forced Shannon to confront a deeper question. If you are going to encrypt information, you need to know how much of it there is. If you are going to transmit it over a noisy channel, you need to know the minimum rate at which you can do so without loss. Neither question had a mathematical answer. Shannon spent the next five years building one.
A brief aside
In 1943, Alan Turing visited Bell Labs on a liaison mission from Bletchley Park. He and Shannon ate lunch together in the Bell Labs cafeteria several times. They discussed machines, communication, and what it would mean for a machine to think. Neither left a detailed record of the conversations. We know only that two years later, Shannon wrote a classified paper proving mathematically why ciphers like Enigma could be broken, answering, in a sense, the question Turing's Bombe had been implicitly asking.
"Information is the resolution of uncertainty."
The people behind it
Claude Shannon
1916 – 2001
Mathematician · Bell Labs
In a single paper — 'A Mathematical Theory of Communication' (1948) — Shannon founded information theory, defined entropy, proved the source and channel coding theorems, and established the mathematical foundations for every digital communication system built since.
"Shannon's 1948 paper is considered one of the most important scientific papers of the twentieth century. Its concepts underpin the internet, mobile phones, data compression, and modern AI."
David Huffman
1925 – 1999
Electrical Engineer · MIT
As a graduate student in 1952, Huffman was given a choice: sit the final exam, or find an optimal prefix-free code. He found the code. The Huffman coding algorithm reaches within 1 bit of Shannon's entropy lower bound and is still used in JPEG, MP3, and PDF compression today.
"Huffman later said he likely would not have discovered the algorithm if he had not been racing against a deadline."
Richard Hamming
1915 – 1998
Mathematician · Bell Labs
Shannon's colleague at Bell Labs. Invented error-correcting codes (Hamming codes) after becoming frustrated that the Bell Labs computer would stop and ring a bell when it hit an error on weekends — when no operator was present to restart it.
"Hamming codes are still used in computer memory (ECC RAM). The principle extends to the Reed-Solomon codes that protect data on CDs, DVDs, and space probes."
What does it mean to say a message contains information?
Shannon's insight was this: a message is informative in proportion to how surprising it is. If you already knew what the message said, it tells you nothing. If it was completely unexpected, it tells you a great deal.
More formally: the information content of an outcome x with probability p(x) is I(x) = −log₂ p(x) bits. A fair coin flip carries 1 bit. A fair die roll carries log₂ 6 ≈ 2.58 bits. A certain outcome carries 0 bits: no surprise, no information.
The entropy of a source is the average information content across all outcomes, the expected surprise:
Shannon Entropy
H(X) = −Σ p(x) log₂ p(x)
measured in bits per symbol
The logarithm is not arbitrary; it is forced by a requirement that makes perfect intuitive sense: information should be additive.
If you flip two fair coins, you learn the result of the first plus the result of the second. Two independent events should carry the sum of their individual information. For independent events, p(x, y) = p(x) · p(y), so:
I(x, y) = −log₂ p(x)p(y) = −log₂ p(x) − log₂ p(y) = I(x) + I(y)
The log turns multiplication into addition. That is exactly the same insight Turing used when he invented decibans, and the same reason modern machine learning works in log-probabilities.
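A quick numeric check of that additivity, using the definition I(x) = −log₂ p(x) from above:

```python
import math

def info_bits(p: float) -> float:
    """Self-information of an outcome with probability p, in bits."""
    return -math.log2(p)

coin = info_bits(1 / 2)              # one fair coin flip: 1 bit
die = info_bits(1 / 6)               # one fair die roll: ~2.585 bits
both = info_bits((1 / 2) * (1 / 6))  # joint outcome of the two independent events

print(coin, die, both)
assert abs(both - (coin + die)) < 1e-9  # information adds for independent events
```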
H = 0
complete certainty
One outcome is guaranteed. The “message” tells you nothing you didn't already know.
H is maximised
all outcomes equally likely
Maximum uncertainty = maximum information per message. For 26 letters: H = log₂(26) ≈ 4.7 bits.
H(English) ≈ 4.1
structured, not random
English has predictable patterns (Q is almost always followed by U), so its entropy is below the theoretical maximum.
Entropy Calculator
[Interactive widget: type text, hover a bar to see each letter's frequency and its contribution to entropy.]
H(X) = −Σ p(x) log₂ p(x) — English text averages ~4.1 bits/letter. Random has ~4.7. Repetitive text approaches 0.
Shannon's 1945 theorem, declassified in 1949, settled the question once and for all
In 1945, Shannon wrote a classified memorandum that used his new entropy framework to prove something fundamental about cryptography. It was not declassified until 1949, when it appeared as 'Communication Theory of Secrecy Systems'.
Shannon's Perfect Secrecy Theorem (1945)
A cipher is perfectly secure (unbreakable regardless of computational power) if and only if the entropy of the key is at least as large as the entropy of the message:
H(K) ≥ H(M)
The only cipher that satisfies this is the one-time pad: a key that is truly random, as long as the message, and used exactly once. SIGSALY was a voice one-time pad. It was perfectly secure.
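The one-time pad is simple to sketch in code. A minimal example, assuming a truly random key the same length as the (toy) message; the same XOR that encrypts also decrypts:

```python
import secrets

def otp(message: bytes, key: bytes) -> bytes:
    """One-time pad: XOR each message byte with a key byte. The key must be
    truly random, as long as the message, and never reused."""
    assert len(key) == len(message)
    return bytes(m ^ k for m, k in zip(message, key))

message = b"ATTACK AT DAWN"                # toy plaintext
key = secrets.token_bytes(len(message))    # fresh random key, used exactly once

ciphertext = otp(message, key)
recovered = otp(ciphertext, key)           # XOR with the same key inverts it

print(ciphertext.hex())
print(recovered)                           # b'ATTACK AT DAWN'
```

Without the key, the ciphertext is consistent with every possible message of the same length: that is perfect secrecy.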
How much entropy did a Naval Enigma key have? The day settings specified: rotor order (3 from 5), initial positions, plugboard pairs, and ring settings. The theoretical keyspace was about 10²³, roughly 77 bits of entropy if fully exploited.
A typical message was thirty letters of German military prose. English prose has about 4.1 bits per letter; German military prose is somewhat more predictable. Call it 3.5 bits per letter:
H(M) ≈ 30 × 3.5 = 105 bits
So even before operator errors, the key entropy (77 bits) is less than the message entropy (105 bits). Shannon's theorem says: already breakable in principle.
But operators made it far worse. They reused the same settings all day. They opened weather reports with WETTER. They signed off with HEIL HITLER. They chose wheel positions like AAA. Each predictable pattern reduced the effective key entropy far below 77 bits, while the redundancy of German prose remained constant.
The theorem in one sentence
Turing's Bayesian decoder exploited the gap between the entropy Enigma operators actually used and the entropy their messages contained. Shannon proved this gap cannot be closed without a key as long as the message itself. Enigma's key was reused daily. The war was over before daily keys could be made long enough.
Two theorems that define the boundaries of all communication
On 1 July 1948, the Bell System Technical Journal published “A Mathematical Theory of Communication” by C.E. Shannon. In 55 pages, Shannon answered two questions that engineers had been asking for decades without a satisfying answer.
You cannot compress a source below its entropy.
If a source emits symbols with entropy H bits per symbol, you need at least H bits per symbol to represent it without loss. Any compression that goes below H must, on average, throw information away.
English prose has entropy around 4.1 bits per letter. A fixed-length ASCII code uses 8 bits. Shannon's theorem says the compression floor is 4.1 bits, not 0. You can approach it; you cannot beat it.
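A rough empirical sketch using Python's standard-library zlib (DEFLATE, which combines LZ77 matching with Huffman coding): the compressed size per character tracks the entropy of the source. The figures are approximate, since DEFLATE adds container overhead.

```python
import random
import string
import zlib

def compressed_bits_per_char(text: str) -> float:
    """Bits per character after DEFLATE compression (includes format overhead)."""
    raw = text.encode("ascii")
    return 8 * len(zlib.compress(raw, level=9)) / len(raw)

repetitive = "A" * 10_000   # source entropy ~0 bits/letter
uniform = "".join(random.choice(string.ascii_uppercase) for _ in range(10_000))

print(f"repetitive: {compressed_bits_per_char(repetitive):.2f} bits/char")  # near 0
print(f"uniform:    {compressed_bits_per_char(uniform):.2f} bits/char")     # near log2(26) ~ 4.7
```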
Every noisy channel has a capacity C, measured in bits per second, such that reliable communication is possible at any rate below C and impossible above it.
For the most important case, a channel with bandwidth B Hz and signal-to-noise ratio S/N, the capacity is:
Shannon-Hartley Theorem
C = B log₂(1 + S/N)
C in bits/s, B in Hz; the hard upper bound on any communication system
This formula is why your internet service provider quotes speeds in bits per second. It is the reason 5G requires wider bandwidth than 4G. It governs satellite communication, deep-space probes, and fibre-optic cables. Engineers do not argue with Shannon's bound; they design up to it.
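Plugging numbers into the bound (a sketch; the 3 kHz bandwidth and 30 dB SNR are the textbook figures for an analogue telephone line):

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Classic analogue telephone line: ~3 kHz bandwidth, ~30 dB signal-to-noise ratio
snr = 10 ** (30 / 10)            # 30 dB is a power ratio of 1000
c = shannon_capacity(3000, snr)
print(f"{c:,.0f} bits/s")        # ~29,900 bits/s
```

This is why analogue voiceband modems could never go far beyond 30 kbit/s without changing the channel itself.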
Shannon proved the floor exists. Huffman showed how to reach it.
Shannon's source coding theorem said the entropy H bits per symbol was the compression floor. But it was an existence proof: it showed the floor was reachable without telling you how to reach it. In 1952, a graduate student named David Huffman found the algorithm.
The idea is beautifully simple. Frequent symbols should get short codes; rare symbols should get long ones. Build a binary tree bottom-up:
Start with each symbol as a leaf node, labelled with its frequency.
Repeatedly take the two lowest-frequency nodes and merge them into a new parent node whose frequency is their sum.
Assign 0 to one branch and 1 to the other.
The code for each symbol is the sequence of 0s and 1s on the path from root to leaf.
The resulting code is prefix-free: no codeword is a prefix of another. This means you can decode a bitstream unambiguously without knowing where one codeword ends and the next begins; you simply follow the tree.
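Prefix-freeness is what makes unambiguous streaming decode possible. A minimal sketch with a small hypothetical code (illustrative symbols, not derived from real letter frequencies):

```python
# A hypothetical prefix-free code: no codeword is a prefix of another
CODE = {"E": "0", "T": "10", "A": "110", "N": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstream: str) -> str:
    """Scan left to right; emit a symbol as soon as the buffer matches a
    codeword. Prefix-freeness guarantees there is never any ambiguity."""
    out, buf = [], ""
    for bit in bitstream:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "stream ended mid-codeword"
    return "".join(out)

encoded = "".join(CODE[s] for s in "NEAT")  # 111 + 0 + 110 + 10
print(decode(encoded))                      # NEAT
```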
The expected code length of a Huffman code is at most H(X) + 1 bits per symbol: within 1 bit of the theoretical floor.
Huffman Codes — Top 8 English Letters
[Interactive widget: the Huffman tree for the eight most frequent English letters. Entropy ≈ 2.96 bits; average code length ≈ 3.00 bits; about 1.3% above the entropy floor.]
Frequent letters (E, T, A) get short codes. Rare ones (J, X, Q) get long codes. Shannon proved you cannot do better than the entropy floor on average. Huffman codes get within 1 bit of it.
Where Huffman codes live today
Huffman coding is a core component of DEFLATE (used in ZIP, gzip, PNG), JPEG, and MP3 compression. It is in every PDF you have ever opened. The browser decompressing this page very likely used it.
Three implementations: entropy, compression, and a noisy channel
The formula in seven lines. Compare English, random, and repetitive text.
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol."""
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return 0.0
    n = len(letters)
    counts = Counter(letters)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# English prose (a 35-letter sample runs high; long prose averages ~4.1)
print(f"English: {entropy('the quick brown fox jumps over the lazy dog'):.3f} bits")  # ~4.5

# Random-looking (all 26 letters, each exactly once)
print(f"Random:  {entropy('xkqzjvbwpyfuhmolcnrsdtigea'):.3f} bits")  # log2(26) ~ 4.7

# Highly repetitive
print(f"AAAA...: {entropy('A' * 50):.3f} bits")  # 0.0

Build the tree, assign the codes, verify that average code length matches the entropy floor within 1 bit.
import heapq
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffNode:
    freq: float
    symbol: Optional[str] = field(compare=False, default=None)
    left: Optional["HuffNode"] = field(compare=False, default=None)
    right: Optional["HuffNode"] = field(compare=False, default=None)

def build_tree(freqs: dict[str, float]) -> HuffNode:
    heap = [HuffNode(freq=f, symbol=s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, HuffNode(freq=lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

def build_codes(node: HuffNode, prefix="", codes=None) -> dict[str, str]:
    if codes is None:
        codes = {}
    if node.symbol is not None:
        codes[node.symbol] = prefix or "0"
    else:
        build_codes(node.left, prefix + "0", codes)
        build_codes(node.right, prefix + "1", codes)
    return codes

def entropy_from_freqs(freqs: dict[str, float]) -> float:
    """Entropy floor for a (possibly unnormalised) frequency table."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())

# English letter frequencies (%), top 16
ENG = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3, "H": 6.1,
       "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2}

tree = build_tree(ENG)
codes = build_codes(tree)

# Average code length vs. the entropy floor
total = sum(ENG.values())
avg_len = sum((f / total) * len(codes[s]) for s, f in ENG.items())
print(f"Avg code length: {avg_len:.3f} bits/letter")                  # -> ~3.85
print(f"Entropy floor:   {entropy_from_freqs(ENG):.3f} bits/letter")  # -> ~3.82

Simulate a Binary Symmetric Channel. Measure the error rate with and without a simple repetition code. Verify against Shannon's capacity formula.
import math
import random

def bsc_channel(bits: list[int], error_rate: float) -> list[int]:
    """Binary Symmetric Channel: flip each bit with probability error_rate."""
    return [b ^ (1 if random.random() < error_rate else 0) for b in bits]

def encode_repetition(bits: list[int], n: int = 3) -> list[int]:
    """Repetition code: send each bit n times."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(received: list[int], n: int = 3) -> list[int]:
    """Majority vote over each group of n bits."""
    decoded = []
    for i in range(0, len(received), n):
        block = received[i:i + n]
        decoded.append(1 if sum(block) > n // 2 else 0)
    return decoded

# Simulate: 100 bits at 10% error rate
message = [random.randint(0, 1) for _ in range(100)]

# Raw channel
received_raw = bsc_channel(message, error_rate=0.10)
raw_errors = sum(a != b for a, b in zip(message, received_raw))

# With rate-1/3 repetition code
encoded = encode_repetition(message, n=3)
received_coded = bsc_channel(encoded, error_rate=0.10)
decoded = decode_repetition(received_coded, n=3)
coded_errors = sum(a != b for a, b in zip(message, decoded))

print(f"Without coding:       {raw_errors}/100 bit errors")
print(f"With repetition code: {coded_errors}/100 bit errors")

# Shannon's theorem: capacity of a BSC with p = 0.10 is 1 - H(p)
p = 0.10
capacity = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)
print(f"BSC capacity at p=0.10: {capacity:.4f} bits per channel use")  # -> 0.5310

Notebook 01
Entropy & Information
Derive H from first principles. Compute it on real text. Verify the additivity property.
Notebook 02
Source Coding
Build Huffman codes from scratch. Verify the entropy bound.
Notebook 03
Channel Capacity
Simulate noisy channels. Verify Shannon's capacity formula experimentally.
# Get started
$ git clone https://github.com/vivekatsuperset/lutchet
$ uv sync && uv run jupyter lab
Shannon's 1948 concepts are the mathematical core of every large language model
When you use ChatGPT, Claude, or Gemini, you are interacting with a system trained by minimising a loss function called cross-entropy. This is Shannon's entropy, applied to the difference between what the model predicts and what actually appeared.
Let p be the true distribution of the next token and q be the model's prediction. The cross-entropy is:
H(p, q) = −Σ p(x) log₂ q(x)
Minimised when q = p, i.e. when the model's predictions match reality
In practice, for a language model predicting the next token from a training corpus, p is 1 for the true next token and 0 for all others, so the sum collapses to −log₂ q(true token). Training minimises the surprise the model expresses at the actual next word.
Perplexity is cross-entropy exponentiated:
PPL = 2^H(p, q)
Intuitively, perplexity is the "effective branching factor": how many equally likely choices the model is choosing between at each token position. A model with PPL = 10 is as uncertain at each step as if it were choosing uniformly among 10 possibilities. State-of-the-art LLMs have perplexity below 10 on standard benchmarks.
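Both quantities in a few lines, for a single next-token prediction (toy vocabulary and probabilities, not from any real model):

```python
import math

def cross_entropy_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Toy vocabulary; the true next token is "mat" (one-hot target)
p = {"mat": 1.0, "hat": 0.0, "dog": 0.0}
q = {"mat": 0.70, "hat": 0.20, "dog": 0.10}  # model's predicted distribution

loss = cross_entropy_bits(p, q)   # one-hot target: collapses to -log2 q("mat")
ppl = 2 ** loss                   # perplexity: effective branching factor

print(f"loss = {loss:.3f} bits, perplexity = {ppl:.3f}")  # ppl = 1/0.70 ~ 1.43
```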
The Kullback-Leibler divergence measures the extra bits needed when you encode a source distributed as p using a code optimised for q:
D_KL(p ∥ q) = Σ p(x) log₂ (p(x) / q(x))
KL divergence is the loss in RLHF (reinforcement learning from human feedback), the technique used to align language models with human preferences. The penalty term keeps the aligned model from drifting too far from the base model's distribution.
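The definition in code, with a check of the identity D_KL(p ∥ q) = H(p, q) − H(p), on toy distributions:

```python
import math

def kl_divergence_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # "true" distribution
q = {"a": 0.4, "b": 0.4, "c": 0.2}   # approximating distribution

kl = kl_divergence_bits(p, q)
h_p = -sum(px * math.log2(px) for px in p.values())              # entropy H(p)
h_pq = -sum(px * math.log2(q[x]) for x, px in p.items())         # cross-entropy H(p, q)

print(f"D_KL = {kl:.4f} bits")
assert kl >= 0                          # Gibbs' inequality: never negative
assert abs(kl - (h_pq - h_p)) < 1e-9    # the "extra bits" interpretation
```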
| Shannon (1948) | Modern ML (2024) |
|---|---|
| Entropy H(X) | Training objective for every language model |
| Cross-entropy H(p, q) | Loss function: measures prediction quality |
| KL divergence D_KL(p ∥ q) | RLHF alignment penalty; VAE training objective |
| Channel capacity C | Theoretical limit on how well any model can learn a distribution |
| Source coding theorem | Compression in gzip, JPEG, MP3, every file format |
Notebook 03: Channel Capacity & Modern ML
Implements cross-entropy loss from scratch, plots KL divergence as a training curve, and shows that GPT-style token prediction is a direct descendant of Shannon's noisy channel model.
notebooks/03_channel_and_ml.ipynb
Shannon published his paper in 1948. He spent the following decade riding a unicycle through the Bell Labs hallways, juggling, and mostly ignoring the field he had founded.
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
The loss function that trains every language model, the compression that shrinks every file, the error correction that protects every data transmission: all of it traces to a wartime question about sending a president's voice across an ocean without being heard.