BLETCHLEY PARK - BUCKINGHAMSHIRE, ENGLAND - 1941
Every morning a stack of intercepted messages arrived. Thousands of letters. Meaningless. The men and women of Hut 8 had eighteen hours to find the needle, before midnight reset the haystack.
The haystack contained approximately 10²³ possible configurations.
The Battle of the Atlantic was the longest continuous military campaign of World War II. German U-boats, guided by encrypted radio traffic, were sinking Allied convoys faster than they could be replaced. Winston Churchill later wrote that the only thing that ever truly frightened him was the U-boat peril.
Every tactical order, every fleet position, every weather report was encrypted on the Enigma machine, a device the German high command believed to be mathematically unbreakable. They were right that brute force would fail. They were wrong about everything else.
"The Enigma was not broken by brilliance alone. It was broken by the systematic application of a 178-year-old theorem about conditional probability."
The people who broke it
Alan Turing
1912 – 1954
Mathematician · Head of Hut 8
Led the effort to break Naval Enigma. Designed the Bombe machine. Independently invented sequential Bayesian inference, calling it Banburismus, years before it appeared in the academic statistics literature.
"The weight-of-evidence framework Turing built at Bletchley is now a cornerstone of modern statistical inference."
Joan Clarke
1917 – 1996
Cryptanalyst · Deputy head of Hut 8
One of the finest cryptanalysts at Bletchley Park. Worked directly alongside Turing breaking Naval Enigma. Despite outperforming many of her male colleagues, she was officially graded as a 'linguist', the only category that permitted a woman to be paid at her level.
"Turing argued personally for her promotion and later said she was indispensable to the work of Hut 8."
I.J. Good
1916 – 2009
Statistician · Turing's assistant at Hut 8
Worked directly alongside Turing on Banburismus. After the war, he was the first to publish a rigorous account of Turing's Bayesian methods. Later pioneered Bayesian analysis in academic statistics.
"Introduced the term 'weight of evidence' and formalized the deciban framework that Turing used informally."
Hugh Alexander
1909 – 1974
Cryptanalyst · Head of Hut 8 from 1942
Two-time British chess champion who became one of the most effective operational cryptanalysts of the war. Succeeded Turing as head of Hut 8 and led the day-to-day breaking of Naval Enigma for the rest of the war. Later headed cryptanalysis at GCHQ.
"While Turing built the theory and the tools, Alexander ran the operation, turning mathematical methods into military intelligence under daily deadline pressure."
Gordon Welchman
1906 – 1985
Mathematician · Head of Hut 6
Broke Army and Air Force Enigma. Invented the 'diagonal board', a crucial enhancement to the Bombe that made operational codebreaking possible at scale.
"Without Welchman's diagonal board, the Bombe would have been too slow to be operationally useful."
Why brute force was impossible, and what the codebreakers could exploit
An Enigma machine looks like a typewriter. Press a key and a lamp lights up somewhere else on the keyboard — that is the encrypted letter. The path the signal takes depends on three things fitted into the machine:
Rotors
Three rotors chosen from a set of five, each wired to scramble the alphabet differently. They step forward with every keypress — the rightmost turns on every letter, the middle turns when the right notch is reached, the left turns more rarely. Like an odometer, but for substitution ciphers.
Reflector
After passing through the three rotors, the signal hits a fixed reflector that bounces it back through all three rotors again along a different path. This is what makes Enigma self-reciprocal: the same settings encrypt and decrypt. It also means a letter can never encrypt to itself — the fatal flaw.
Plugboard
Before and after the rotors, the signal passes through a plugboard that swaps 10 pairs of letters. The plugboard alone contributes more than 150 trillion configurations — the majority of Enigma's total keyspace.
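The 150-trillion figure is straightforward to verify: choose which 20 of the 26 letters are plugged, then pair them into 10 unordered pairs. A quick sketch using only the Python standard library:

```python
from math import comb, factorial

# 10 cables: choose the 20 plugged letters, then pair them into
# 10 unordered pairs (divide out the 10! pair orderings and the
# 2^10 within-pair swaps).
plugboard_settings = comb(26, 20) * factorial(20) // (factorial(10) * 2**10)
print(f"{plugboard_settings:,}")  # 150,738,274,937,250
```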
Each day, every Enigma operator in the German military received a printed settings sheet specifying which rotors to use, what ring settings to apply, and which 10 pairs of letters to connect on the plugboard. At midnight, the sheet changed. Every break Hut 8 achieved expired in eighteen hours. The work started over the next morning.
The combined effect of rotor choice, rotor order, starting positions, ring settings, and plugboard wiring produces roughly 10²³ possible configurations. A machine checking one setting per microsecond would need more than three billion years to exhaust them all. Brute force was never an option.
Many German operators ended their messages with HEIL HITLER. The codebreakers knew this. Aligned against the last ten letters of the ciphertext, those ten known plaintext letters became a crib — a suspected piece of plaintext at a known position.
Here is what made the crib devastating: because Enigma's reflector means a letter can never encrypt to itself, any setting where H maps to H, E maps to E, or any other crib letter maps to itself is immediately, provably impossible. A ten-character crib typically eliminated more than 99% of all candidate settings before a single probability had been calculated.
The constraint in one sentence
For a candidate setting to survive, no letter in the crib may align with the same letter in the ciphertext. This is not a heuristic — it is an absolute consequence of the reflector's design. Settings that violate it get a likelihood of zero and vanish from the posterior immediately.
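The constraint fits in a few lines of Python. The intercept below is made up for illustration; the rule is exactly the one stated above: reject any crib alignment where some crib letter coincides with the ciphertext letter at the same position.

```python
def alignment_possible(ciphertext: str, crib: str, offset: int) -> bool:
    """Enigma never encrypts a letter to itself, so any alignment where a
    crib letter equals the ciphertext letter in the same slot is impossible."""
    return all(crib[i] != ciphertext[offset + i] for i in range(len(crib)))

# Drag the crib along a (made-up) intercept; keep only surviving offsets.
ciphertext = "QWETTARHEILX"
crib = "WETTER"
viable = [k for k in range(len(ciphertext) - len(crib) + 1)
          if alignment_possible(ciphertext, crib, k)]
print(viable)  # the handful of offsets that survive the self-encryption test
```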
What remained after constraint elimination was still a large number — but small enough to score with probability. That is where Bayes' theorem came in.
How do you find a needle in a haystack when the haystack contains a hundred sextillion pieces of hay?
You don't search it. You eliminate it, using evidence and probability.
Reframing codebreaking as a question about probability, before writing a single formula
Here is the precise question the codebreakers faced each morning:
“Given this intercepted ciphertext, what is the probability that each possible Enigma setting produced it?”
That is a question of posterior probability. It has three ingredients:
Prior
P(setting)
How plausible is each setting before we see the message? With no information: all equally likely.
Likelihood
P(message | setting)
If this setting were correct, how probable is the message we observed?
Posterior
P(setting | message)
After seeing the message: updated probability that this setting is correct.
Before we write a formula, let's build the intuition. Each square below represents a candidate Enigma setting.
Before any evidence
All ~10²³ Enigma settings are equally plausible. We have no reason to prefer any one of them.
Derived from the Enigma problem, then proven on a simpler cipher
The probability that both a setting H is correct and we observe message E can be written two ways using the product rule:

P(H and E) = P(H | E) · P(E) = P(E | H) · P(H)

Since both expressions equal the same thing, divide both sides by P(E):

Bayes' Theorem

P(H | E) = P(E | H) · P(H) / P(E)

where P(E) = Σᵢ P(E | Hᵢ) · P(Hᵢ) sums over all hypotheses
| Term | Name | In our problem |
|---|---|---|
| P(H) | Prior | Uniform — all possible settings equally likely before any evidence |
| P(E | H) | Likelihood | If this setting is correct, how probable is the observed ciphertext? |
| P(H | E) | Posterior | Updated probability after seeing the ciphertext |
| P(E) | Normaliser | Total probability summed across all hypotheses (makes it sum to 1) |
Enigma is too complex to trace by hand. Let's first prove the theorem works on something you can follow letter by letter: a Caesar cipher, where the keyspace is just 26 possible shifts.
We intercept "KHOOR". We suspect it's English. We know nothing else. Click through the letters and watch the posterior converge on the correct shift.
Intercept: "KHOOR". Unknown shift. Each letter we reveal updates our probability over all 26 possible shifts. The correct shift is 3 (decrypts to "HELLO").
Uniform prior — all 26 shifts equally likely.
The formula has two inputs you control: how strongly you believed the hypothesis before, and how decisively the evidence favours it.
Adjust the sliders and watch Bayes' theorem update in real time.
Prior P(H)
10.0%
Likelihood ratio
10×
Posterior P(H|E)
52.6%
The evidence has substantially updated our belief.
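The arithmetic behind the sliders is three steps: convert the prior to odds, multiply by the likelihood ratio, convert back to a probability. A minimal sketch, using the slider defaults above (10% prior, 10× likelihood ratio):

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes in odds form: posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

print(f"{posterior(0.10, 10):.1%}")  # -> 52.6%
```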
Turing's sequential extension, and the invention of the ban
The Caesar demo showed Bayesian updating with 26 hypotheses. Enigma has 10²³. Turing needed to combine many small pieces of evidence without multiplying chains of vanishingly small probabilities.
His solution: work in log-odds. In odds form, Bayesian updating is:

O(H | E) = [P(E | H) / P(E | ¬H)] × O(H)

Taking log₁₀ of both sides:

log₁₀ O(H | E) = log₁₀ [P(E | H) / P(E | ¬H)] + log₁₀ O(H)

Multiplication becomes addition. Each piece of evidence adds to the tally.
Turing called the unit a ban (named after Banbury, where the scoring sheets were printed). One ban = log₁₀(10) = 1, a factor of 10 in the odds. A deciban is one-tenth of a ban. When a setting's tally crossed 30 decibans (3 bans = 1000:1 odds), the team accepted it. Watch that process below:
Each letter of the crib "WETTER" adds weight of evidence in decibans (Turing's unit). Addition — not multiplication. The 3-ban threshold (30 decibans) is the acceptance line.
What Banburismus actually was
Banburismus was the specific procedure for determining which day-key was used for Naval Enigma by comparing pairs of messages sent on the same settings. Each shared letter added decibans to the score. When the tally crossed the threshold, the result was fed into the Bombe. I.J. Good described it in 1979 as "the first serious application of sequential Bayesian analysis to a real problem."
From the maths to working code: three parts, progressively deeper
Bayes' theorem in two lines of NumPy, applied letter-by-letter to crack the Caesar cipher from Section 3.
import numpy as np
# English letter frequencies (A=0 ... Z=25)
ENG_FREQ = np.array([
0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.020, 0.061,
0.070, 0.002, 0.008, 0.040, 0.024, 0.067, 0.075, 0.019,
0.001, 0.060, 0.063, 0.091, 0.028, 0.010, 0.024, 0.002,
0.020, 0.001,
])
def bayesian_update(prior: np.ndarray, likelihoods: np.ndarray) -> np.ndarray:
    """One step of Bayes: posterior = (likelihoods * prior) / sum."""
    unnormalised = likelihoods * prior
    return unnormalised / unnormalised.sum()

# -- Caesar cipher example --------------------------------------------------
# Intercepted: "KHOOR" (= "HELLO" encrypted with shift 3).
# Hypotheses: shift in {0, 1, ..., 25}. Prior: uniform.
prior = np.ones(26) / 26

for cipher_letter in "KHOOR":
    c = ord(cipher_letter) - ord("A")
    # P(cipher_letter | shift=k) = English freq of the decrypted letter
    likelihoods = ENG_FREQ[[(c - k) % 26 for k in range(26)]]
    prior = bayesian_update(prior, likelihoods)

best_shift = prior.argmax()
print(f"Most probable shift: {best_shift}")  # -> 3
print(f"Decrypts to: {''.join(chr((ord(c)-ord('A')-best_shift)%26+ord('A')) for c in 'KHOOR')}")  # -> HELLO

Now in log-odds (decibans). Each letter adds to the score. This is exactly Turing's insight, translated into Python.
import math
def to_decibans(likelihood_ratio: float) -> float:
    """Turing's unit: 10 * log10(likelihood ratio)."""
    return 10 * math.log10(likelihood_ratio)

def banburismus(ciphertext: str) -> list[float]:
    """
    Sequential Bayesian updating in log-odds (decibans).
    Returns the deciban score for each of the 26 possible Caesar shifts.
    Addition, not multiplication -- that is Turing's key insight.
    """
    N = 26
    # Prior: uniform -> odds of 1:(N-1), expressed in decibans so the
    # starting value and the increments share the same unit
    log_odds = [to_decibans(1 / (N - 1))] * N
    for cipher_letter in ciphertext.upper():
        c = ord(cipher_letter) - ord("A")
        for shift in range(N):
            plain = (c - shift) % 26
            p_given_H = ENG_FREQ[plain]  # ENG_FREQ from the previous snippet
            p_given_not_H = 1 / N
            # ADD decibans -- not multiply probabilities
            log_odds[shift] += to_decibans(p_given_H / p_given_not_H)
    return log_odds

scores = banburismus("KHOOR")
best = max(range(26), key=lambda k: scores[k])
print(f"Highest score: shift {best} ({scores[best]:.1f} decibans)")  # -> shift 3

Real Enigma machine, real rotor search, crib-dragging with constraint elimination. The Enigma machine below lets you generate ciphertext to feed into the decoder.
from src.enigma.machine import EnigmaMachine, EnigmaConfig
from src.bayes.decoder import BayesianDecoder
# The message -- encrypted with unknown settings.
# Operators opened every weather report with "WETTER": the perfect crib.
CIPHERTEXT = "ABCXYZPQRLMNOPQDEFGHIJKSTUVWRST" # replace with real intercept
# Decoder: search rotors I-III, all 26^3 positions, fixed reflector UKW-B
decoder = BayesianDecoder(rotor_choices=["I", "II", "III"])
# Run: Enigma constraint eliminates >99% instantly, Bayesian scoring resolves the rest
results = decoder.decode(CIPHERTEXT, crib="WETTER", top_n=5, verbose=True)
# Best result
print(results[0].decrypted) # -> most probable plaintext
print(results[0].config.rotors) # -> recovered rotor order
print(results[0].window) # -> recovered starting positions

Encrypt something, then paste the ciphertext into Notebook 03 and watch the Bayesian decoder recover the settings.
Click keys or type on your keyboard. Notice how the same letter never produces itself.
Notebook 01
The Enigma Machine
Build all components from scratch. Verify with real test cases.
Notebook 02
Bayes' Theorem
Derive, simulate, visualise. Caesar and Enigma worked examples.
Notebook 03
Cracking Enigma
Run the Bayesian decoder. Watch the posterior collapse.
# Get started
$ git clone https://github.com/vivekatsuperset/lutchet
$ uv sync && uv run jupyter lab
For undergrads: numerical stability, information theory, and why this is everywhere in modern ML
Enigma's prior probability for any single setting is about 10⁻²³. Multiplying 30 such numbers together produces values that underflow to zero in floating-point arithmetic. Working in log-odds, where we add instead of multiply, sidesteps this entirely. This is why modern machine learning frameworks compute log-probabilities by default.
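You can watch the underflow happen. The sketch below multiplies thirty probabilities of 10⁻²³ (float64 hits exactly zero by the fifteenth step) and then does the same accumulation in log space, where it stays stable:

```python
import math

p = 1e-23            # prior probability of one Enigma setting
product, log_sum = 1.0, 0.0
for _ in range(30):              # thirty pieces of evidence
    product *= p                 # underflows to exactly 0.0
    log_sum += math.log10(p)     # stays a perfectly ordinary float

print(product)   # 0.0 -- the information is gone
print(log_sum)   # about -690 -- the information survives
```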
Claude Shannon published his theory of information in 1948, eight years after Turing built the ban system. The two frameworks measure the same thing in different units:
| Turing (1940) | Shannon (1948) |
|---|---|
| 1 ban (a factor of 10 in the odds) | log₂(10) ≈ 3.32 bits of information |
| Decibans of evidence | Bits of mutual information |
| Prior log-odds to posterior log-odds | Entropy reduction H(X) to H(X|E) |
| 30-deciban acceptance threshold | ≈ 10 bits of evidence (3 bans × 3.32) |
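The conversion in the table's first row is one line of code. A quick sketch:

```python
import math

def decibans_to_bits(db: float) -> float:
    """1 ban = log2(10) ≈ 3.32 bits, so divide by 10 bans and rescale."""
    return (db / 10) * math.log2(10)

print(f"{decibans_to_bits(10):.2f} bits per ban")          # ~3.32
print(f"{decibans_to_bits(30):.2f} bits at the threshold")  # ~9.97
```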
Naive Bayes Classifier
Bletchley: Score each Enigma setting by multiplying per-letter likelihoods
Today: Spam filters, text classification — same per-word likelihood multiplication
Logistic Regression
Bletchley: Turing's deciban tally: additive log-likelihood scores
Today: The logit function is log-odds. Learning is adding log-likelihoods.
Sequential Testing
Bletchley: Add evidence until 3-ban threshold — then accept
Today: A/B testing with early stopping. Wald's SPRT (1945) independently formalised the same sequential-evidence idea.
Language Models
Bletchley: P(next letter | rotor setting and position)
Today: P(next token | all previous tokens) — sequential probabilistic prediction at scale
Notebook 04: Advanced Topics
Log-odds in depth, sequential Bayesian updating, Shannon entropy, KL divergence, and implementing a Naive Bayes language classifier that descends directly from Turing's methods.
notebooks/04_advanced.ipynb

Thomas Bayes died in 1761. Alan Turing died in 1954. Neither lived to see their ideas recognised as two expressions of the same truth.
"Sometimes it is the people no one imagines anything of who do the things that no one can imagine."
The theorem on your syllabus, the unit in Turing's notebook, the weights in a modern neural network: they are the same idea, found independently, by people trying to reason carefully under uncertainty.