SOURCE CHANNEL DECODED

BELL LABS - MURRAY HILL, NEW JERSEY - 1943

The United States needed to send President Roosevelt's voice across the Atlantic. In real time. Without the Nazis hearing a single word. The problem required a new kind of mathematics, one that did not yet exist.

The question Claude Shannon had to answer first: what is information, exactly?

A lesson in information theory and code

In 1943, Bell Labs was the most productive industrial research institution in the world. Its scientists would go on to invent the transistor and the solar cell, and had already developed the coding techniques that underpinned the entire telephone network. Now the U.S. military needed something far more difficult: a voice communication system so secure that it could carry top-secret conversations between Allied commanders across an ocean patrolled by enemy submarines.

Claude Shannon, twenty-seven years old and already one of the most original minds at Bell Labs, was brought onto the project to analyse its security. The system, SIGSALY, digitised the human voice, encrypted each sample with a one-time key, and transmitted the result as noise indistinguishable from static. It was the first perfectly secure voice communication system ever built.

But building SIGSALY forced Shannon to confront a deeper question. If you are going to encrypt information, you need to know how much of it there is. If you are going to transmit it over a noisy channel, you need to know the minimum rate at which you can do so without loss. Neither question had a mathematical answer. Shannon spent the next five years building one.

A brief aside

In early 1943, Alan Turing visited Bell Labs on a liaison mission from Bletchley Park. He and Shannon ate lunch together in the Bell Labs cafeteria several times. They discussed machines, communication, and what it would mean for a machine to think. Neither left a detailed record of the conversations. We know only that two years later, Shannon wrote a classified paper proving mathematically why ciphers like Enigma could be broken, answering, in a sense, the question Turing's Bombe had been implicitly asking.

"Information is the resolution of uncertainty."

The people behind it

Three figures you should know

Claude Shannon

1916 – 2001

Mathematician · Bell Labs

In a single paper — 'A Mathematical Theory of Communication' (1948) — Shannon founded information theory, defined entropy, proved the source and channel coding theorems, and established the mathematical foundations for every digital communication system built since.

"Shannon's 1948 paper is considered one of the most important scientific papers of the twentieth century. Its concepts underpin the internet, mobile phones, data compression, and modern AI."

David Huffman

1925 – 1999

Electrical Engineer · MIT

As a graduate student in 1952, Huffman was given a choice: sit the final exam, or solve an open problem by finding an optimal prefix-free code. He solved the problem. The Huffman coding algorithm comes within 1 bit of Shannon's entropy lower bound and is still used in JPEG, MP3, and PDF compression today.

"Huffman later said he likely would not have discovered the algorithm if he had not been racing against a deadline."

Richard Hamming

1915 – 1998

Mathematician · Bell Labs

Shannon's colleague at Bell Labs. Invented error-correcting codes (Hamming codes) after becoming frustrated that the Bell Labs computer would stop and ring a bell when it hit an error on weekends — when no operator was present to restart it.

"Hamming codes are still used in computer memory (ECC RAM). The principle extends to the Reed-Solomon codes that protect data on CDs, DVDs, and space probes."
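Hamming's scheme is compact enough to sketch. The following is a hedged illustration of the classic Hamming(7,4) code (my construction for this page, not Hamming's original relay circuit): three parity bits protect four data bits, and recomputing the parities at the receiver yields the exact position of any single flipped bit.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into 7 bits, parity at positions 1, 2, 4 (1-based)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    """Recompute the parities; the syndrome is the 1-based position of a single error."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]  # extract the data bits

word = [1, 0, 1, 1]
sent = hamming74_encode(word)
sent[5] ^= 1                   # simulate a single-bit channel error
assert hamming74_correct(sent) == word
```

The receiver never needs to stop and ring a bell: a single flipped bit is located and repaired automatically, which is exactly the property ECC RAM relies on.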

Section 2

Entropy: Measuring Surprise

What does it mean to say a message contains information?

Shannon's insight was this: a message is informative in proportion to how surprising it is. If you already knew what the message said, it tells you nothing. If it was completely unexpected, it tells you a great deal.

More formally: the information content of an outcome x with probability p(x) is I(x) = −log₂ p(x) bits. A fair coin flip carries 1 bit. A fair die roll carries log₂ 6 ≈ 2.58 bits. A certain outcome carries 0 bits: no surprise, no information.

The entropy of a source is the average information content across all outcomes, the expected surprise:

Shannon Entropy

H(X) = −Σ p(x) log₂ p(x)

measured in bits per symbol

Why log?

The logarithm is not arbitrary; it is forced by a requirement that makes perfect intuitive sense: information should be additive.

If you flip two fair coins, you learn the result of the first plus the result of the second. Two independent events should carry the sum of their individual information. For independent events, p(x, y) = p(x) · p(y), so:

I(x, y) = −log₂ [p(x) p(y)] = −log₂ p(x) − log₂ p(y) = I(x) + I(y)

The log turns multiplication into addition. That is exactly the same insight Turing used when he invented decibans, and the same reason modern machine learning works in log-probabilities.
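The additivity is easy to check numerically. A small sketch (my example, not from the paper):

```python
import math

def surprisal(p: float) -> float:
    """Information content of an outcome with probability p, in bits."""
    return -math.log2(p)

# A fair coin flip carries 1 bit; a fair die roll carries ~2.585 bits
assert surprisal(0.5) == 1.0
print(f"{surprisal(1/6):.3f}")   # ≈ 2.585

# Two independent coin flips: p = 1/2 * 1/2 = 1/4, and the surprisals add
assert surprisal(0.5 * 0.5) == surprisal(0.5) + surprisal(0.5)  # 2 bits
```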

H = 0

complete certainty

One outcome is guaranteed. The “message” tells you nothing you didn't already know.

H is maximised

all outcomes equally likely

Maximum uncertainty = maximum information per message. For 26 letters: H = log₂(26) ≈ 4.7 bits.

H(English) ≈ 4.1

structured, not random

English has predictable patterns (Q is almost always followed by U), so its entropy is below the theoretical maximum.

Entropy Calculator

[Interactive element: type text and see its per-letter entropy against the log₂(26) = 4.70-bit maximum for 26 equiprobable letters. The 54-letter sample shown scores 4.509 bits/letter, 95.9% of the theoretical maximum, with one bar per letter showing its frequency and contribution to entropy.]

H(X) = −Σ p(x) log₂ p(x) —  English text averages ~4.1 bits/letter. Random has ~4.7. Repetitive text approaches 0.

Section 2b

Perfect Secrecy: Why Enigma Was Breakable

Shannon's 1945 theorem, declassified in 1949, settled the question once and for all

In 1945, Shannon wrote a classified paper, A Communication Theory of Secrecy Systems, that used his new entropy framework to prove something fundamental about cryptography. It was not declassified until 1949.

Shannon's Perfect Secrecy Theorem (1945)

A cipher is perfectly secure, unbreakable regardless of computational power, if and only if the entropy of the key is at least as large as the entropy of the message:

H(K) ≥ H(M)

The only cipher that satisfies this is the one-time pad: a key that is truly random, as long as the message, and used exactly once. SIGSALY was a voice one-time pad. It was perfectly secure.

Why Enigma Fell Short

How much entropy did a Naval Enigma key have? The day settings specified: rotor order (3 from 5), initial positions, plugboard pairs, and ring settings. The theoretical keyspace was about 10²³, roughly 77 bits of entropy if fully exploited.

A typical message was thirty letters of German military prose. English prose has about 4.1 bits per letter; German military prose is somewhat more predictable. Call it 3.5 bits per letter:

H(message) ≈ 30 × 3.5 = 105 bits

So even at its theoretical best, the key entropy (77 bits) falls short of the message entropy (105 bits). Shannon's theorem says: already breakable in principle.
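The arithmetic behind those numbers, as a quick check (the 3.5 bits/letter figure is the estimate above, not a measured value):

```python
import math

keyspace = 1e23                      # approximate Naval Enigma daily keyspace
key_entropy = math.log2(keyspace)    # ≈ 76.4 bits: "roughly 77"

msg_letters = 30
bits_per_letter = 3.5                # estimated entropy of German military prose
message_entropy = msg_letters * bits_per_letter

print(f"Key:     {key_entropy:.1f} bits")      # 76.4 bits
print(f"Message: {message_entropy:.1f} bits")  # 105.0 bits
assert key_entropy < message_entropy           # Shannon: breakable in principle
```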

But operators made it far worse. They reused the same settings all day. They opened weather reports with WETTER. They signed off with HEIL HITLER. They chose wheel positions like AAA. Each predictable pattern reduced the effective key entropy far below 77 bits, while the redundancy of German prose remained constant.

The theorem in one sentence

Turing's Bayesian decoder exploited the gap between the entropy Enigma operators actually used and the entropy their messages contained. Shannon proved this gap cannot be closed without a key as long as the message itself. Enigma's key was reused daily. The war was over before daily keys could be made long enough.

Section 3

The 1948 Paper

Two theorems that define the boundaries of all communication

On 1 July 1948, the Bell System Technical Journal published “A Mathematical Theory of Communication” by C.E. Shannon. In 55 pages, Shannon answered two questions that engineers had been asking for decades without a satisfying answer.

Theorem 1: Source Coding

You cannot compress a source below its entropy.

If a source emits symbols with entropy H bits per symbol, you need at least H bits per symbol to represent it without loss. Any compression that goes below H must, on average, throw information away.

English prose has entropy around 4.1 bits per letter. A fixed-length ASCII code uses 8 bits. Shannon's theorem says the compression floor is 4.1 bits, not 0. You can approach it; you cannot beat it.

Theorem 2: Channel Capacity

Every noisy channel has a capacity C, measured in bits per second, such that reliable communication is possible at any rate below C and impossible above it.

For the most important case, a channel with bandwidth B Hz and signal-to-noise ratio S/N, the capacity is:

Shannon-Hartley Theorem

C = B log₂(1 + S/N)

C in bits/s, B in Hz — the hard upper bound on any communication system

This formula is why your internet service provider quotes speeds in bits per second. It is the reason 5G requires wider bandwidth than 4G. It governs satellite communication, deep-space probes, and fibre-optic cables. Engineers do not argue with Shannon's bound; they design up to it.
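A quick numerical sketch (the line parameters are illustrative, not from the text): a telephone-grade channel with 3 kHz of bandwidth and a 30 dB signal-to-noise ratio.

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley: C = B * log2(1 + S/N), with S/N given in decibels."""
    snr_linear = 10 ** (snr_db / 10)         # 30 dB -> S/N = 1000
    return bandwidth_hz * math.log2(1 + snr_linear)

# Telephone-grade line: 3 kHz bandwidth, 30 dB SNR
print(f"{shannon_capacity(3000, 30):,.0f} bits/s")   # ≈ 29,902 bits/s
```

That bound is close to what dial-up modems eventually achieved, which is no coincidence: they were engineered up to Shannon's limit.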

Before Shannon (1948)

  • Engineers thought adding error correction always cost capacity
  • There was no known limit on how much noise could be removed
  • Communication was an engineering art, not a mathematical science

After Shannon (1948)

  • There is a hard capacity limit; no engineering can exceed it
  • Below capacity, arbitrarily reliable transmission is provably possible
  • Information theory became a mathematical discipline

Section 4

Compression: Huffman Codes

Shannon proved the floor exists. Huffman showed how to reach it.

Shannon's source coding theorem said the entropy H bits per symbol was the compression floor. But it was an existence proof: it showed the floor was reachable without telling you how to reach it. In 1952, a graduate student named David Huffman found the algorithm.

The Algorithm

The idea is beautifully simple. Frequent symbols should get short codes; rare symbols should get long ones. Build a binary tree bottom-up:

  1. Start with each symbol as a leaf node, labelled with its frequency.

  2. Repeatedly take the two lowest-frequency nodes and merge them into a new parent node whose frequency is their sum.

  3. Assign 0 to one branch and 1 to the other.

  4. The code for each symbol is the sequence of 0s and 1s on the path from root to leaf.

The resulting code is prefix-free: no codeword is a prefix of another. This means you can decode a bitstream unambiguously without knowing where one codeword ends and the next begins, just follow the tree.

The expected code length of a Huffman code is at most H + 1 bits per symbol, within 1 bit of the theoretical floor.

Huffman Codes — Top 8 English Letters

[Interactive tree diagram: the Huffman tree built from the eight most frequent English letters by repeatedly merging the two lowest-frequency nodes. Following the 0/1 branches from root to leaf gives each letter its code; E (12.7%) sits nearest the root. For this alphabet: entropy 2.96 bits, average code length 3.00 bits, just 1.3% above the entropy floor.]

Frequent letters (E, T, A) get short codes. Rare ones (J, X, Q) get long codes. Shannon proved you cannot do better than the entropy floor on average. Huffman codes get within 1 bit of it.

Where Huffman codes live today

Huffman coding is a core component of DEFLATE (used in ZIP, gzip, PNG), JPEG, and MP3 compression. It is in every PDF you have ever opened. The browser decompressing this page very likely used it.

Section 5

Python: Build It

Three implementations: entropy, compression, and a noisy channel

Part A: Shannon Entropy

The formula in seven lines. Compare English, random, and repetitive text.

entropy.py Notebook 01
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol."""
    text = text.upper()
    text = [c for c in text if c.isalpha()]
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# English prose (a short pangram is nearly uniform; long prose settles near 4.1)
print(f"English:  {entropy('the quick brown fox jumps over the lazy dog'):.3f} bits")  # ~4.68

# Random-looking
print(f"Random:   {entropy('xkqzjvbwpyfuhmolcnrsdtigea'):.3f} bits")                   # ~4.7

# Highly repetitive
print(f"AAAA...:  {entropy('A' * 50):.3f} bits")                                        # 0.0

Part B: Huffman Encoder

Build the tree, assign the codes, verify that average code length matches the entropy floor within 1 bit.

huffman.py Notebook 02
import heapq
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffNode:
    freq: float
    symbol: Optional[str] = field(compare=False, default=None)
    left:  Optional["HuffNode"] = field(compare=False, default=None)
    right: Optional["HuffNode"] = field(compare=False, default=None)

def build_tree(freqs: dict[str, float]) -> HuffNode:
    heap = [HuffNode(freq=f, symbol=s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, HuffNode(freq=lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

def build_codes(node: HuffNode, prefix="", codes=None) -> dict[str, str]:
    if codes is None:
        codes = {}
    if node.symbol is not None:
        codes[node.symbol] = prefix or "0"
    else:
        build_codes(node.left,  prefix + "0", codes)
        build_codes(node.right, prefix + "1", codes)
    return codes

def entropy_from_freqs(freqs: dict[str, float]) -> float:
    """Entropy of the normalised frequency table, in bits per symbol."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())

# English letter frequencies (%)
ENG = {"E":12.7,"T":9.1,"A":8.2,"O":7.5,"I":7.0,"N":6.7,"S":6.3,"H":6.1,
       "R":6.0,"D":4.3,"L":4.0,"C":2.8,"U":2.8,"M":2.4,"W":2.4,"F":2.2}

tree  = build_tree(ENG)
codes = build_codes(tree)

# Average code length: within 1 bit of the entropy floor, as the theorem promises
total = sum(ENG.values())
avg_len = sum((f / total) * len(codes[s]) for s, f in ENG.items())
print(f"Avg code length: {avg_len:.3f} bits/letter")
print(f"Entropy floor:   {entropy_from_freqs(ENG):.3f} bits/letter")

Part C: Noisy Channel

Simulate a Binary Symmetric Channel. Measure the error rate with and without a simple repetition code. Verify against Shannon's capacity formula.

channel.py Notebook 03
import math
import random

def bsc_channel(bits: list[int], error_rate: float) -> list[int]:
    """Binary Symmetric Channel: flip each bit with probability error_rate."""
    return [b ^ (1 if random.random() < error_rate else 0) for b in bits]

def encode_repetition(bits: list[int], n: int = 3) -> list[int]:
    """Repetition code: send each bit n times."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(received: list[int], n: int = 3) -> list[int]:
    """Majority vote over each group of n bits."""
    decoded = []
    for i in range(0, len(received), n):
        block = received[i:i+n]
        decoded.append(1 if sum(block) > n // 2 else 0)
    return decoded

# Simulate: 100 bits at 10% error rate
message = [random.randint(0, 1) for _ in range(100)]

# Raw channel
received_raw   = bsc_channel(message, error_rate=0.10)
raw_errors     = sum(a != b for a, b in zip(message, received_raw))

# With rate-1/3 repetition code
encoded        = encode_repetition(message, n=3)
received_coded = bsc_channel(encoded, error_rate=0.10)
decoded        = decode_repetition(received_coded, n=3)
coded_errors   = sum(a != b for a, b in zip(message, decoded))

print(f"Without coding: {raw_errors}/100 bit errors")
print(f"With repetition code: {coded_errors}/100 bit errors")

# Shannon's theorem: capacity of BSC with p=0.10
p = 0.10
capacity = 1 + p * math.log2(p) + (1-p) * math.log2(1-p)
print(f"BSC capacity at p=0.10: {capacity:.4f} bits per channel use")  # -> 0.5310

Notebook 01

Entropy & Information

Derive H from first principles. Compute it on real text. Verify the additivity property.

Notebook 02

Source Coding

Build Huffman codes from scratch. Verify the entropy bound.

Notebook 03

Channel Capacity

Simulate noisy channels. Verify Shannon's capacity formula experimentally.

# Get started

$ git clone https://github.com/vivekatsuperset/lutchet

$ uv sync && uv run jupyter lab

Section 6

Information Theory in Modern AI

Shannon's 1948 concepts are the mathematical core of every large language model

When you use ChatGPT, Claude, or Gemini, you are interacting with a system trained by minimising a loss function called cross-entropy. This is Shannon's entropy, applied to the difference between what the model predicts and what actually appeared.

Cross-Entropy Loss

Let p be the true distribution of the next token and q be the model's prediction. The cross-entropy is:

H(p, q) = −Σ p(x) log₂ q(x)

Minimised when q = p — i.e., when the model's predictions match reality

In practice, for a language model predicting the next token from a training corpus, p is 1 for the true next token and 0 for all others, so the sum collapses to −log₂ q(true token). Training minimises the surprise the model expresses at the actual next word.
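A minimal sketch of that collapse, using a toy vocabulary and made-up probabilities (not from any real model):

```python
import math

def cross_entropy(p: dict, q: dict) -> float:
    """H(p, q) = -sum p(x) log2 q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Model's predicted distribution over a toy 4-token vocabulary
q = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# One-hot target: the true next token was "cat"
p = {"the": 0.0, "cat": 1.0, "sat": 0.0, "mat": 0.0}

# With a one-hot target, the sum collapses to -log2 q("cat")
assert cross_entropy(p, q) == -math.log2(q["cat"])
print(cross_entropy(p, q))   # 2.0 bits of surprise at the actual token
```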

Perplexity

Perplexity is cross-entropy exponentiated:

PPL = 2^H(p, q)

Intuitively, perplexity is the “effective branching factor”: how many equally likely choices the model is choosing between at each token position. A model with PPL = 10 is as uncertain at each step as if it were choosing uniformly among 10 possibilities. State-of-the-art LLMs achieve perplexity below 10 on standard benchmarks.
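The branching-factor intuition checks out numerically (illustrative values, not benchmark results):

```python
import math

def perplexity(cross_entropy_bits: float) -> float:
    """PPL = 2^H when the cross-entropy H is measured in bits."""
    return 2 ** cross_entropy_bits

# Uniform uncertainty over n outcomes has H = log2(n) bits, so PPL = n exactly
assert math.isclose(perplexity(math.log2(10)), 10.0)

print(perplexity(2.0))   # 4.0: like choosing uniformly among 4 tokens
```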

KL Divergence

The Kullback-Leibler divergence measures the extra bits needed when you encode a source distributed as p using a code optimised for q:

D_KL(p ∥ q) = Σ p(x) log₂ (p(x) / q(x))

KL divergence supplies the penalty term in RLHF (reinforcement learning from human feedback), the technique used to align language models with human preferences. The penalty keeps the aligned model from drifting too far from the base model's distribution.
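A small sketch of the divergence itself, with toy distributions standing in for a base and a fine-tuned model (real RLHF applies a per-token KL against the reference policy; this is only the bit-counting idea):

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """D_KL(p || q) = sum p(x) log2(p(x)/q(x)): extra bits paid for coding p with q's code."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

base    = {"yes": 0.5, "no": 0.5}   # reference model's distribution
aligned = {"yes": 0.7, "no": 0.3}   # distribution after fine-tuning

penalty = kl_divergence(aligned, base)
print(f"{penalty:.4f} bits")        # small positive drift from the base model

assert kl_divergence(base, base) == 0.0   # identical distributions: zero divergence
```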

Shannon (1948) → Modern ML (2024)

  • Entropy H(X) → the floor of every language model's training loss
  • Cross-entropy H(p, q) → loss function: measures prediction quality
  • KL divergence D_KL(p ∥ q) → RLHF alignment penalty; VAE training objective
  • Channel capacity C → theoretical limit on how well any model can learn a distribution
  • Source coding theorem → compression in gzip, JPEG, MP3, every file format

Notebook 03: Channel Capacity & Modern ML

Implements cross-entropy loss from scratch, plots KL divergence as a training curve, and shows that GPT-style token prediction is a direct descendant of Shannon's noisy channel model.

notebooks/03_channel_and_ml.ipynb

Shannon published his paper in 1948. He spent the following decade riding a unicycle through the Bell Labs hallways, juggling, and mostly ignoring the field he had founded.

"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
- Claude Shannon, 1948

The loss function that trains every language model, the compression that shrinks every file, the error correction that protects every data transmission: all of it traces to a wartime question about sending a president's voice across an ocean without being heard.