SOURCE CHANNEL DECODED

BELL LABS - MURRAY HILL, NEW JERSEY - 1943

The United States needed to send President Roosevelt's voice across the Atlantic. In real time. Without the Nazis hearing a single word. The problem required a new kind of mathematics, one that did not yet exist.

The question Claude Shannon had to answer first: what is information, exactly?

In 1943, Bell Labs was the most productive research institution in the world. Its scientists had developed the coding and modulation techniques that underpinned the entire telephone network; the transistor and the solar cell would follow in the decade after the war. Now the U.S. military needed something far more difficult: a voice communication system so secure that it could carry top-secret conversations between Allied commanders across an ocean patrolled by submarines.

Claude Shannon, twenty-seven years old and already one of the most original minds at Bell Labs, was brought in to analyse the system's security. That system was SIGSALY: it digitised the human voice, encrypted each sample with a one-time key, and transmitted the result as noise indistinguishable from static. It was the first perfectly secure voice communication system ever built.

But building SIGSALY forced Shannon to confront a deeper question. If you are going to encrypt information, you need to know how much of it there is. If you are going to transmit it over a noisy channel, you need to know the minimum rate at which you can do so without loss. Neither question had a mathematical answer. Shannon spent the next five years building one.

A brief aside

In early 1943, Alan Turing visited Bell Labs on a liaison mission from Bletchley Park. He and Shannon ate lunch together in the Bell Labs cafeteria several times. They discussed machines, communication, and what it would mean for a machine to think. Neither left a detailed record of the conversations. We know only that two years later, Shannon completed a classified paper proving mathematically why ciphers like Enigma could be broken, answering, in a sense, the question Turing's Bombe had been implicitly asking.

"Information is the resolution of uncertainty."

The people behind it

Three figures you should know

Claude Shannon

1916 – 2001

Mathematician · Bell Labs

In a single paper, 'A Mathematical Theory of Communication' (1948), Shannon founded information theory, defined entropy, proved the source and channel coding theorems, and established the mathematical foundations for every digital communication system built since.

"Shannon's 1948 paper is considered one of the most important scientific papers of the twentieth century. Its concepts underpin the internet, mobile phones, data compression, and modern AI."

David Huffman

1925 – 1999

Electrical Engineer · MIT

As a graduate student in 1952, Huffman was given a choice: take the final exam, or solve the open problem of finding an optimal prefix-free code. He found the code. The Huffman coding algorithm reaches within 1 bit of Shannon's entropy lower bound and is still used in JPEG, MP3, and PDF compression today.

"Huffman later said he likely would not have discovered the algorithm if he had not been racing against a deadline."

Richard Hamming

1915 – 1998

Mathematician · Bell Labs

Shannon's colleague at Bell Labs. Invented error-correcting codes (Hamming codes) after becoming frustrated that the Bell Labs computer would stop and ring a bell when it hit an error on weekends, when no operator was present to restart it.

"Hamming codes are still used in computer memory (ECC RAM). The principle extends to the Reed-Solomon codes that protect data on CDs, DVDs, and space probes."

The first insight

What Is Information?

The answer turns out to be: surprise.

Shannon's central insight sounds almost too simple: a message is informative in proportion to how surprising it is. If you already knew what the message would say, it tells you nothing. If it was completely unexpected, it tells you a great deal.

Consider two events. A coin flip: could go either way. The sun rising tomorrow: essentially guaranteed. If someone sends you a message telling you which side the coin landed on, that message is genuinely informative: you couldn't have known beforehand. If someone sends you a message telling you the sun rose, the message carries almost no information at all. You already knew.

High surprise = high information

The coin came up heads. A fair coin could have gone either way. Learning the outcome tells you something real.

No surprise = no information

The sun rose this morning. You knew it would. The message resolves no uncertainty and carries no information.

Shannon called this quantity entropy, borrowing the term from thermodynamics where it measures disorder. A source with high entropy produces lots of surprise per message. A source with low entropy is predictable: you can often guess what's coming next.
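Shannon's intuition has an exact mathematical form: the surprise of an outcome with probability p is -log2(p) bits, and entropy is the average surprise across all of a source's outcomes. A minimal sketch (the probabilities are illustrative):

```python
import math

def surprise(p):
    """Self-information in bits: rarer events carry more information."""
    return -math.log2(p)

def entropy(probs):
    """Average surprise over all outcomes of a source."""
    return sum(p * surprise(p) for p in probs if p > 0)

# A fair coin: maximum uncertainty for two outcomes.
print(surprise(0.5))           # 1.0 bit
print(entropy([0.5, 0.5]))     # 1.0 bit on average

# A near-certain event, like the sun rising: almost no information.
print(surprise(0.999999))      # ~0.0000014 bits

# A heavily biased coin is predictable, so its entropy is low.
print(entropy([0.9, 0.1]))     # ~0.469 bits
```

The fair coin sits at the maximum for two outcomes; the biased coin, being mostly predictable, carries less than half a bit per flip.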

English is a low-entropy language. If someone sends you the letters TH, you can be fairly confident the next letter is E. Q is almost always followed by U. Sentences have structure. That structure means English carries less information per letter than a truly random sequence of letters would: about 4.1 bits per letter, compared to a theoretical maximum of 4.7.
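That 4.1 figure can be checked directly. Sketching the first-order calculation in Python, using published single-letter frequency estimates (the exact values vary slightly by corpus):

```python
import math

# Approximate English letter frequencies, in percent (corpus-dependent).
freq = {
    'E': 12.7, 'T': 9.1, 'A': 8.2, 'O': 7.5, 'I': 7.0, 'N': 6.7,
    'S': 6.3, 'H': 6.1, 'R': 6.0, 'D': 4.3, 'L': 4.0, 'C': 2.8,
    'U': 2.8, 'M': 2.4, 'W': 2.4, 'F': 2.2, 'G': 2.0, 'Y': 2.0,
    'P': 1.9, 'B': 1.5, 'V': 1.0, 'K': 0.8, 'J': 0.15, 'X': 0.15,
    'Q': 0.1, 'Z': 0.07,
}
total = sum(freq.values())
probs = [f / total for f in freq.values()]

h_english = -sum(p * math.log2(p) for p in probs)
h_uniform = math.log2(26)

print(f"English (first-order): {h_english:.2f} bits/letter")  # ~4.18
print(f"Uniform random:        {h_uniform:.2f} bits/letter")  # ~4.70
```

This first-order figure counts only single-letter frequencies. Accounting for context, the TH-followed-by-E kind of structure, drives the true entropy of English lower still.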

The connection to Enigma

The predictability of German military prose, its low entropy, is precisely what made Enigma breakable. The patterns Shannon later measured mathematically are the same patterns Turing exploited at Bletchley. A truly random key applied to a truly random message would have been unbreakable. Neither the key nor the message was truly random. The entropy gap was the vulnerability.

The second insight

Compression: Saying More with Less

Frequent things should get short codes. Rare things can afford long ones.

If English letters were equally common, you'd want to assign them all equal-length codes. But they're not. E appears in roughly 13% of all English text. Z appears in less than 0.1%. A code that gives E and Z the same length is wasting space, paying the same price for a common event as for a rare one.

The smarter approach: give frequent letters short codes and rare letters long ones. This is exactly what Morse code does. E gets a single dot. The most common letters are the shortest. Rare letters like Q and Z get long sequences of dots and dashes. Samuel Morse didn't derive this from a theorem: he counted letter frequencies in a printer's type drawer. But the intuition was right.

In 1952, a graduate student named David Huffman proved you could find the optimal assignment, the one that minimises average code length across all letters, given their actual frequencies. His algorithm, now called Huffman coding, builds a tree of decisions. The most common letter ends up at the shallowest branch: short code. The rarest letter ends up at the deepest branch: long code. The result is provably optimal.
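A compact sketch of the algorithm, using made-up frequencies: repeatedly merge the two least frequent nodes into one, prefixing '0' to the codes on one side and '1' to the codes on the other, until a single tree remains.

```python
import heapq
import math

def huffman_codes(freqs):
    """Build an optimal prefix-free code by merging the two rarest nodes until one tree remains."""
    # Heap entries: (total frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ''}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Illustrative frequencies: E common, Z rare.
freqs = {'E': 0.40, 'T': 0.25, 'A': 0.20, 'Q': 0.10, 'Z': 0.05}
codes = huffman_codes(freqs)
print(codes)   # E gets a 1-bit code; Q and Z get 4-bit codes

avg_len = sum(freqs[s] * len(codes[s]) for s in freqs)
floor = -sum(p * math.log2(p) for p in freqs.values())
print(f"average length {avg_len:.2f} bits vs entropy floor {floor:.2f} bits")
```

For these frequencies the average code length comes out around 2.10 bits per symbol against an entropy of about 2.04, inside the one-bit gap that Huffman's proof guarantees.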

The Huffman algorithm is still inside every JPEG image, every MP3 file, every PDF. When you compress a document and it comes out smaller, some version of this idea is almost certainly at work.

Shannon proved something more fundamental: there is a floor. No matter how clever your compression scheme, you cannot represent a source using fewer bits than its entropy. You can approach it. You cannot beat it.

This is Shannon's source coding theorem. It is a limit, not a recipe. It tells you the best you can ever do. Huffman coding gets within about one bit per symbol of that limit. Modern algorithms like LZ77 and arithmetic coding get closer still. But none of them can cross Shannon's floor.

The third insight

Noisy Channels: Designing Around Error

Errors are inevitable. Reliable communication is not.

Every communication channel introduces noise. A cable picks up electrical interference. A radio signal weakens over distance. A satellite link crosses tens of thousands of miles of space. Bits get flipped. Before Shannon, engineers thought this was a fundamental barrier: you could reduce errors by sending more slowly, but you could never eliminate them.

The obvious workaround is repetition. If you're not sure your message arrived, say it three times. If the recipient gets 1 1 0, two votes say 1, so they go with 1. The majority rules. This works: errors drop dramatically, but it costs three times the bandwidth to send the same information.
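A quick simulation of the repeat-three scheme, assuming a channel that flips each bit with probability 0.1 (the parameters are illustrative):

```python
import random

def send_noisy(bit, flip_prob, rng):
    """Transmit one bit over a channel that flips it with probability flip_prob."""
    return bit ^ (rng.random() < flip_prob)

def send_repeated(bit, flip_prob, rng):
    """Send the bit three times; the receiver takes a majority vote."""
    votes = sum(send_noisy(bit, flip_prob, rng) for _ in range(3))
    return 1 if votes >= 2 else 0

rng = random.Random(0)
trials = 100_000
raw_errors = sum(send_noisy(1, 0.1, rng) != 1 for _ in range(trials))
rep_errors = sum(send_repeated(1, 0.1, rng) != 1 for _ in range(trials))

print(f"raw error rate:           {raw_errors / trials:.3f}")  # ~0.100
print(f"majority-vote error rate: {rep_errors / trials:.3f}")  # ~0.028
```

Majority voting fails only when two or more of the three copies flip, so the error rate drops from 10% to roughly 2.8%, but at triple the bandwidth cost.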

Shannon proved something that shocked everyone who heard it. You do not need to pay three-to-one. For any noisy channel, there exists a maximum rate, which he called capacity, such that if you transmit below that rate, you can make errors arbitrarily rare. Not small. Arbitrarily rare. Approaching zero. Without slowing down to a crawl.

What this means in plain language

Imagine a telephone line that randomly garbles 10% of all bits. Shannon proved that there is a specific rate, around 53% of the raw line speed, at which you can transmit information with errors so rare they are negligible. Below that rate: reliable. Above it: reliability collapses no matter what you do. The rate is fixed by physics and mathematics, not by the cleverness of your engineering.
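The 53% figure comes from Shannon's capacity formula for a binary symmetric channel, C = 1 - H(p), where H(p) is the binary entropy of the flip probability. A sketch:

```python
import math

def binary_entropy(p):
    """Entropy of a coin with bias p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Shannon capacity of a binary symmetric channel, as a fraction of the raw rate."""
    return 1 - binary_entropy(flip_prob)

print(f"{bsc_capacity(0.1):.3f}")   # 0.531: ~53% of the raw line speed
print(f"{bsc_capacity(0.0):.3f}")   # 1.000: a noiseless channel runs at full rate
print(f"{bsc_capacity(0.5):.3f}")   # 0.000: pure noise, nothing gets through
```

Note the endpoints: a perfect channel loses nothing, and a channel that flips bits half the time carries no information at all, since the output is independent of the input.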

His colleague Richard Hamming immediately went to work building codes that could achieve this. Hamming codes use a small number of extra "parity" bits to detect and correct errors. A handful of extra bits can protect a message far larger than themselves: a much better exchange rate than repeating everything three times.
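A sketch of the smallest such code, Hamming(7,4): three parity bits protect four data bits, and the pattern of failed parity checks (the syndrome) points directly at any single flipped bit.

```python
def hamming_encode(d):
    """Encode 4 data bits into 7 bits with 3 parity bits (Hamming(7,4))."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_decode(c):
    """Correct any single flipped bit, then recover the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # check over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # check over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
code = hamming_encode(data)
code[2] ^= 1                    # noise flips one bit in transit
print(hamming_decode(code))     # [1, 0, 1, 1]: the error is found and corrected
```

Three extra bits protect four data bits and correct any single error, a far better exchange rate than the seven extra bits that triple repetition would cost for the same four.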

Hamming codes are still in your computer's memory today. Every time ECC RAM silently corrects a bit that was flipped by a stray cosmic ray, it is running a version of the mathematics Hamming worked out in the Bell Labs cafeteria, after getting frustrated that the computer kept stopping on weekends when no one was there to restart it.

Where it lives now

The Legacy

Shannon published "A Mathematical Theory of Communication" in 1948, in two parts in the Bell System Technical Journal. In that one paper he invented a new scientific discipline, proved its two central theorems, and laid the mathematical foundation for everything that followed. The internet, JPEG, MP3, WiFi, 5G, satellite communication, deep-space probes: all of them operate within limits Shannon described in 1948.

The ideas have also migrated into artificial intelligence in ways Shannon could not have anticipated. When you train a neural network, the core operation is this: given what the network predicted, and what actually happened, calculate how wrong it was. The standard way to measure that wrongness is called cross-entropy loss.

In plain English: the network is trying to be less surprised by the data. If it predicts that the next word in a sentence is probably "the", and the next word turns out to be "the", the network is unsurprised, with low loss. If it predicted "the" but the next word was "serendipity", the network is very surprised, with high loss. Training pushes the network toward lower surprise, which means better predictions.
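Cross-entropy loss is Shannon's surprise applied to the model's own predictions: the loss on an observed outcome is -log of the probability the model assigned to it. A minimal sketch with made-up probabilities (real training uses natural logs and averages over many examples, but the idea is identical):

```python
import math

def cross_entropy_loss(predicted_probs, actual):
    """Surprise at the actual outcome: -log2 of the probability assigned to it."""
    return -math.log2(predicted_probs[actual])

# Hypothetical next-word distribution from a language model;
# the remaining probability mass is spread over other words (omitted).
prediction = {'the': 0.85, 'a': 0.10, 'serendipity': 0.0001}

print(cross_entropy_loss(prediction, 'the'))           # ~0.23 bits: low loss
print(cross_entropy_loss(prediction, 'serendipity'))   # ~13.3 bits: high loss
```

An outcome the model expected costs a fraction of a bit; an outcome it gave one chance in ten thousand costs over thirteen bits. Gradient descent pushes the second number down.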

Every large language model, every AI chatbot, every image generator, every text autocomplete, is trained this way. Shannon's measure of surprise is the error signal that shapes the model. The mathematics is exactly what he wrote down in 1948.

Shannon did his work before transistors, before computers, before the internet. He was measuring something so fundamental about the nature of communication that the technology has changed a dozen times and the mathematics has not needed to change at all.

He spent his later years juggling in the Bell Labs hallways and riding a unicycle. He built mechanical mice that could learn to navigate mazes. He invested early in tech companies and became wealthy enough to stop worrying about anything except what interested him. He died in 2001, having watched the world he mathematically described be built, piece by piece, around him.

Go deeper

Shannon's mathematics is beautiful, and within reach.

The technical lesson derives Shannon entropy from scratch, builds Huffman codes in Python, simulates noisy channels, and connects it all to the cross-entropy loss that trains modern AI. No prior knowledge required.