Build — antonlebed.com

Things built on the tower substrate — in silicon, over radio, and inside neural networks. The lesson that repeats across everything below: the tower's properties must be wired in by construction, not hoped for emergently. What worked and what didn't are both documented.

Silicon: FPGA pipelines

Three primorial rings on a $20 Tang Nano 20K FPGA. Each design has per-channel GF(p) multipliers, CRT reconstruction, and ECC testing — every ring element round-tripped and multiplied, every codeword's error correction exercised. Each of the three proofs runs twice: exhaustively in RTL simulation, then on the device at the 27 MHz board clock with the verdict on the LEDs.

k = 7 (Z/510,510) exhaustive proof: 1,022,070 checks, 0 failures — CRT roundtrip for all 510,510 elements, multiply ×19 for all elements, ECC clean + four single-channel corruption phases. 2,922 LUT4 (14% of the chip); the full proof passes on silicon in ~38 ms.

k = 8 (Z/9,699,690) exhaustive proof: 19,413,240 checks, 0 failures — eight phases, 8 channels (5 data + 3 parity), rate 5/8 MDS ECC. 4,670 LUT4 (23%); ~720 ms on silicon.

k = 9 (Z/223,092,870) exhaustive proof: 446,395,950 checks, 0 failures — 9 channels (6 data + 3 parity), rate 6/9 MDS ECC, multiply ×29, six corruption phases. 6,849 LUT4 (33%); the full proof passes on silicon in ~16.5 s.
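A software mirror of the k = 7 proof is sketched below. It is not the RTL. The phase breakdown (roundtrip and multiply over all 510,510 elements, then one clean and four corruption passes over the 210 data values) is one reading that reproduces the stated total of 1,022,070 checks. The repair strategy (drop each channel in turn and keep the reconstruction that falls inside the data range) is an assumption, as are all function names.

```python
from math import prod

PRIMES = (2, 3, 5, 7, 11, 13, 17)      # k = 7 channels
N = prod(PRIMES)                       # 510,510
N_DATA = 4                             # assumed split: 4 data + 3 parity channels
DATA_RANGE = prod(PRIMES[:N_DATA])     # 210 valid codewords


def crt_basis(moduli):
    """b_i is 1 mod m_i and 0 mod every other modulus."""
    total = prod(moduli)
    return [(total // m) * pow(total // m, -1, m) for m in moduli]


BASIS = crt_basis(PRIMES)


def reconstruct(residues, moduli):
    """General CRT reconstruction over any subset of channels."""
    return sum(r * b for r, b in zip(residues, crt_basis(moduli))) % prod(moduli)


def correct(word):
    """Assumed repair rule: drop each channel in turn and keep the
    reconstruction that lands below DATA_RANGE."""
    for skip in range(len(PRIMES)):
        moduli = PRIMES[:skip] + PRIMES[skip + 1:]
        value = reconstruct(word[:skip] + word[skip + 1:], moduli)
        if value < DATA_RANGE:
            return value
    return None


def exhaustive_proof(multiplier=19):
    checks = failures = 0
    for x in range(N):
        res = [x % p for p in PRIMES]
        # Phase 1: CRT roundtrip.
        checks += 1
        failures += (sum(r * b for r, b in zip(res, BASIS)) % N) != x
        # Phase 2: multiply per channel, then reconstruct.
        mul = [(r * multiplier) % p for r, p in zip(res, PRIMES)]
        checks += 1
        failures += (sum(r * b for r, b in zip(mul, BASIS)) % N) != (x * multiplier) % N
    for d in range(DATA_RANGE):
        word = [d % p for p in PRIMES]
        # Phase 3: a clean codeword reconstructs to its data value.
        checks += 1
        failures += (sum(r * b for r, b in zip(word, BASIS)) % N) != d
        # Phases 4-7: corrupt one data channel and expect the repair to return d.
        for ch in range(N_DATA):
            bad = list(word)
            bad[ch] = (bad[ch] + 1) % PRIMES[ch]
            checks += 1
            failures += correct(bad) != d
    return checks, failures


checks, failures = exhaustive_proof()
print(checks, failures)   # 1022070 0
```

The sweep takes a few seconds in CPython, against roughly 38 ms on the device.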

Each rung adds one channel, widens every datapath, and adds a reduction stage: roughly +60% then +47% logic (2,922 → 4,670 → 6,849 LUT4). Three rungs deep, every resource axis stays under half the device. Field-only rings are lean: a historical fattened-ring design used 31% of the FPGA for multiply alone. The silicon builds disable the toolchain's carry-chain inference. Running these exhaustive proofs on the device exposed inferred ALU carry cells computing wrong results somewhere downstream of synthesis in the open-source flow; a passing gate-level netlist simulation exonerated the synthesizer. The fault is routed around for now and queued for an upstream report.

A designed ring: multiplication from logarithm tables

The fourth design leaves the primorial path. In Z/65,535 = Z/3 × Z/5 × Z/17 × Z/257 — the internet checksum's ring, four Fermat primes — every channel's unit count is a power of two, so multiplying units reduces to adding their discrete logarithms, and the wrap-around of each addition is plain bit truncation. Per channel: a logarithm table, a masked add, an exponential table. The classic trade is tables instead of multipliers; the question is what the tables cost.

Exhaustive proof on silicon: 2³⁰ + 256 checks, 0 errors — all 1,073,741,824 ordered pairs of units, the table-lookup core compared against a per-channel multiplier core, in 39.8 s at 27 MHz, verdict over serial with a build-id check. The tables prove themselves first: before the sweep, the device exhaustively checks per channel that the log and exponential tables invert each other, that each exponential step multiplies by the channel's generator, and that index 0 maps to 1 — so the table contents need no off-board trust.
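The per-channel scheme and the table self-checks can be sketched in software as follows. This is a minimal model, not the hardware core. The generator is found by brute-force search (the generators used on the device are not stated here), and the names `build_tables`, `mul_units`, and `tables_self_check` are illustrative.

```python
FERMAT_PRIMES = (3, 5, 17, 257)      # Z/65,535 = Z/3 x Z/5 x Z/17 x Z/257


def find_generator(p):
    """Smallest g whose powers reach every unit mod p (brute force; p is tiny)."""
    for g in range(2, p):
        x, order = g, 1
        while x != 1:
            x = (x * g) % p
            order += 1
        if order == p - 1:
            return g
    raise ValueError(f"no generator found for {p}")


def build_tables(p):
    g = find_generator(p)
    exp = [1] * (p - 1)              # exp[i] = g**i mod p, so index 0 maps to 1
    for i in range(1, p - 1):
        exp[i] = (exp[i - 1] * g) % p
    log = [0] * p                    # log[0] is unused: 0 is not a unit
    for i, value in enumerate(exp):
        log[value] = i
    return g, log, exp


TABLES = {p: build_tables(p) for p in FERMAT_PRIMES}


def mul_units(a, b, p):
    """Multiply two units mod p: log lookup, masked add, exp lookup."""
    _, log, exp = TABLES[p]
    mask = p - 2                     # p - 1 is a power of two, so the wrap is bit truncation
    return exp[(log[a] + log[b]) & mask]


def tables_self_check(p):
    """The three table properties named in the text, checked exhaustively."""
    g, log, exp = TABLES[p]
    ok = exp[0] == 1
    ok &= all(log[exp[i]] == i for i in range(p - 1))
    ok &= all(exp[log[u]] == u for u in range(1, p))
    ok &= all(exp[(i + 1) & (p - 2)] == (exp[i] * g) % p for i in range(p - 1))
    return ok


print(all(tables_self_check(p) for p in FERMAT_PRIMES))   # True
print(mul_units(200, 131, 257), (200 * 131) % 257)        # both print the same value
```

Checking every pair of units per channel (at most 256 × 256 pairs) is enough in software: by the CRT, agreement in all four channels implies agreement on all 2³⁰ ordered pairs of units in the full ring.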

The cost, measured both ways (one board, one toolchain): with the tables in block RAM — the synthesizer maps them there on its own once the core sits between registers — the lookup core uses ~4× less soft logic at 2.2× the clock than the bank of multipliers built from logic fabric (233 vs 995 LUT-class cells, 179.7 vs 80.8 MHz, one result per cycle each). Forced into LUTs, the same tables cost ~2.4× the multiplier bank — the lookup core loses the area race. The niche is exactly “tables in block RAM, adds in fabric”; without the dedicated memory, the area verdict flips to the multipliers.

Radio: the ESP32 mesh

Three ESP32 boards form an ESP-NOW mesh. Each board CRT-encodes data into Z/510,510 residues (7 bytes), broadcasts, and the receiver reconstructs and checks parity. Measurements, not simulations — in three acts: detect, recover, self-check.

Protocol test: 104/104 received packets, 100% correct.

Phase	What it tests	Result
CRT roundtrip	encode counter → broadcast → reconstruct → verify	24/24
ECC clean	syndrome = (0,0,0) for valid codewords over radio	40/40
ECC injection	corrupt one data channel per packet, verify detection	40/40

Syndrome signatures over radio:

corrupt mod 2 (+1): syndrome (5, 12, 14)
corrupt mod 3 (+1): syndrome (7, 8, 15)
corrupt mod 5 (+1): syndrome (6, 4, 10)
corrupt mod 7 (+1): syndrome (1, 10, 16)

All 3 parity channels fire for every injection — the no-blind-spots property, confirmed over the air.
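The four signatures can be reproduced offline under two assumptions that this page does not spell out: the parity residues are the data value reduced mod 11, 13, and 17, and the syndrome is the received parity minus the parity recomputed from the received data residues. The helper names are illustrative.

```python
from math import prod

DATA = (2, 3, 5, 7)        # data channels
PARITY = (11, 13, 17)      # parity channels
DATA_RANGE = prod(DATA)    # 210


def crt(residues, moduli):
    total = prod(moduli)
    return sum(r * (total // m) * pow(total // m, -1, m)
               for r, m in zip(residues, moduli)) % total


def encode(value):
    return [value % p for p in DATA + PARITY]


def syndrome(word):
    """Received parity minus recomputed parity (sign convention is an assumption)."""
    value = crt(word[:4], DATA)
    return tuple((r - value) % p for r, p in zip(word[4:], PARITY))


def inject(value, channel, delta=1):
    """Corrupt one data residue by +delta and return the resulting syndrome."""
    word = encode(value)
    word[channel] = (word[channel] + delta) % DATA[channel]
    return syndrome(word)


for ch, p in enumerate(DATA):
    print(f"corrupt mod {p} (+1): syndrome {inject(42, ch)}")
# corrupt mod 2 (+1): syndrome (5, 12, 14)
# corrupt mod 3 (+1): syndrome (7, 8, 15)
# corrupt mod 5 (+1): syndrome (6, 4, 10)
# corrupt mod 7 (+1): syndrome (1, 10, 16)
```

In this model the exact tuple holds for data values small enough that the corrupted reconstruction does not wrap past 210 (42 is one such value). For larger values the tuple changes, but every entry stays nonzero, which is the no-blind-spots property.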

Real sensor data: BME280 temperature and pressure from 2 boards, CRT-encoded, broadcast, reconstructed: 24/24 exact match. Ring arithmetic preserves physical measurements exactly (26.28 °C / 27.29 °C, both 1025.4 hPa — same room, different shelf).
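A minimal sketch of the 7-byte payload follows. The fixed-point scaling (hundredths of a degree, tenths of a hectopascal) and the function names are assumptions for illustration; the firmware's actual packet layout is not shown on this page.

```python
from math import prod

PRIMES = (2, 3, 5, 7, 11, 13, 17)
N = prod(PRIMES)  # 510,510


def encode_packet(value):
    """One residue per channel; every residue is below 17, so each fits in one byte."""
    if not 0 <= value < N:
        raise ValueError("value must lie in Z/510,510")
    return bytes(value % p for p in PRIMES)


def decode_packet(payload):
    """CRT reconstruction from the 7 received residue bytes."""
    return sum(r * (N // p) * pow(N // p, -1, p)
               for r, p in zip(payload, PRIMES)) % N


temperature = round(26.28 * 100)    # assumed scaling: centi-degrees C -> 2628
pressure = round(1025.4 * 10)       # assumed scaling: deci-hPa -> 10254

packet = encode_packet(temperature)
print(list(packet))                                # [0, 0, 3, 3, 10, 2, 10]
print(decode_packet(packet) / 100)                 # 26.28
print(decode_packet(encode_packet(pressure)) / 10) # 1025.4
```

Because the readings are carried as integers in the ring, the reconstruction is exact rather than approximate, which is what the 24/24 figure measures.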

Node loss: erasure recovery

Detection is half the ECC story; the mesh also recovers. One value's 7 residues are sharded across the 3 boards (channels 2, 3, 5 / 7, 11 / 13, 17), one board is silenced each round, and a surviving board reconstructs the value over radio from the residues that still exist — subset CRT, no retransmission.

Silence any board, survivors reconstruct: 18/18 rounds exact — including the tight 4-residue case (the board holding channels 2, 3, 5 dies; {7, 11, 13, 17} suffices): 6/6, and the rounds where the value's own creator dies right after sharding: 6/6.
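Subset CRT recovery can be sketched as below, using the sharding given above. One assumption is made explicit: the value must be smaller than the product of the surviving moduli, and the tightest case across the three boards is 2 × 3 × 5 × 7 × 11 = 2,310. The names `shard` and `recover` are illustrative.

```python
from math import prod

# Board id -> channels it holds, as described in the text.
BOARD_CHANNELS = {0: (2, 3, 5), 1: (7, 11), 2: (13, 17)}


def shard(value):
    """Split one value's residues across the three boards."""
    return {board: {p: value % p for p in channels}
            for board, channels in BOARD_CHANNELS.items()}


def recover(shards, dead_board):
    """Reconstruct from the residues that survive the loss of one board."""
    surviving = {}
    for board, residues in shards.items():
        if board != dead_board:
            surviving.update(residues)
    total = prod(surviving)          # product of the surviving moduli
    return sum(r * (total // p) * pow(total // p, -1, p)
               for p, r in surviving.items()) % total


shards = shard(2024)
for dead in BOARD_CHANNELS:
    print(dead, recover(shards, dead))   # prints 2024 for every silenced board
```

Losing board 0 leaves only {7, 11, 13, 17}, the 4-residue case, with a product of 17,017; the smallest surviving product, 2,310, occurs when the board holding 13 and 17 is silenced.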

Clock sync that checks itself

NTP-style two-way time transfer, pairwise around the triangle 0→1→2→0. The three pairwise offsets sum to zero identically, so the measured sum — the closure — is pure estimation error: a self-check the mesh computes about its own synchronization with no reference clock, the radio sibling of the parity syndrome below.

Clean phase: boot skews of seconds to hundreds of seconds between boards, yet the triangle closes to a few hundred µs regardless of which board leads. Inject phase: one board lies by +5 ms to one neighbor while serving the other honestly — no single pair can tell the lie from ordinary skew, but the closure reports +4.88 ms. A board lying consistently to both neighbors merely has a wrong clock and cancels: the closure detects inconsistency, exactly like a parity check.
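The closure arithmetic can be illustrated with an idealized model: integer microseconds, symmetric link delay, no measurement noise. The skews, delay, turnaround time, and function names are hypothetical. A real mesh closes to a few hundred µs, not to exactly zero.

```python
TRIANGLE = ((0, 1), (1, 2), (2, 0))


def two_way_offset(clock_a, clock_b, delay_us=300, lie_a_us=0, lie_b_us=0):
    """NTP-style estimate of (clock_b - clock_a) in microseconds.

    clock_x is board x's reading at a shared true instant (hypothetical skew model);
    lie_x_us is added to every timestamp board x reports on this link.
    """
    t1 = clock_a + lie_a_us                        # a sends
    t2 = clock_b + delay_us + lie_b_us             # b receives
    t3 = clock_b + delay_us + 50 + lie_b_us        # b replies after a 50 us turnaround
    t4 = clock_a + 2 * delay_us + 50 + lie_a_us    # a receives
    return ((t2 - t1) + (t3 - t4)) // 2


def closure(clocks, lies=None):
    """Sum of pairwise offsets around 0 -> 1 -> 2 -> 0; lies maps (liar, victim) -> us."""
    lies = lies or {}
    return sum(
        two_way_offset(clocks[a], clocks[b],
                       lie_a_us=lies.get((a, b), 0),
                       lie_b_us=lies.get((b, a), 0))
        for a, b in TRIANGLE
    )


clocks = [0, 3_200_000, 147_000_000]     # boot skews of seconds to hundreds of seconds

print(closure(clocks))                                  # 0
print(closure(clocks, {(1, 0): 5000}))                  # 5000
print(closure(clocks, {(1, 0): 5000, (1, 2): 5000}))    # 0
```

The second case is the inconsistent liar: board 1 shifts its timestamps by +5 ms toward board 0 only, and the closure reports the full 5 ms. The third case is the consistent liar, whose shift enters one link with a plus sign and the next with a minus sign and cancels.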

The closure's first real catch was not the injected lie. It was a ~5 ms scheduler-tick bias in our own measurement path — blocking radio reads quantized every timestamp pickup — found and fixed because the closure refused to sit at zero when the leader role rotated.

Neural: CRT architectures with built-in parity

Instead of one monolithic output head, use 7 independent CRT output heads, one per channel, over a shared backbone. The structural numbers below follow from the decomposition itself (computed at Z/510,510); the detection numbers are measured.

391M× Jacobian block-diagonal savings: N² vs Σq_i²

8,802× Forward-pass positions: 510,510 vs 58

22× Matmul FLOPs at k=7, theoretical (a GPU on an earlier ring measured 3.6–11.4× of its 16× theory)

4/7 ECC rate, no encoding layer
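The first two structural ratios follow directly from the primes, as this short calculation shows (variable names are illustrative; the 22× FLOP figure depends on model details not given here and is not reproduced).

```python
from math import prod

PRIMES = (2, 3, 5, 7, 11, 13, 17)
N = prod(PRIMES)                                    # 510,510

jacobian_ratio = N ** 2 / sum(q * q for q in PRIMES)   # N^2 vs sum of q_i^2 (= 666)
position_ratio = N / sum(PRIMES)                       # 510,510 vs 58

print(f"{jacobian_ratio / 1e6:.0f}M x")   # 391M x
print(f"{position_ratio:,.0f} x")         # 8,802 x
```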

The parity syndrome

Train the 3 parity heads as extra objectives and every prediction carries a free self-check: the syndrome — how many parity heads disagree with the data heads. A token that contradicts itself announces it.
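A sketch of the syndrome count computed from the seven heads' argmax residues is given below. It assumes the parity heads are trained to predict the token index reduced mod 11, 13, and 17, and that the data heads cover mod 2, 3, 5, and 7; the function names are illustrative, and no model is involved.

```python
from math import prod

DATA = (2, 3, 5, 7)
PARITY = (11, 13, 17)


def decode_token(data_residues):
    """CRT reconstruction of the token index from the four data heads."""
    total = prod(DATA)
    return sum(r * (total // p) * pow(total // p, -1, p)
               for r, p in zip(data_residues, DATA)) % total


def syndrome_count(head_residues):
    """Number of parity heads that disagree with the token implied by the data heads."""
    data, parity = head_residues[:4], head_residues[4:]
    token = decode_token(data)
    return sum(1 for r, p in zip(parity, PARITY) if r != token % p)


clean = [42 % p for p in DATA + PARITY]             # token 42, all heads agree
wrong_parity = clean[:5] + [clean[5] + 1] + clean[6:]   # mod-13 head off by one
wrong_data = clean[:2] + [clean[2] + 1] + clean[3:]     # mod-5 head off by one

print(syndrome_count(clean))          # 0
print(syndrome_count(wrong_parity))   # 1
print(syndrome_count(wrong_data))     # 3
```

A single wrong parity head trips only its own check, while a single wrong data head changes the reconstructed token and trips all three.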

122/122 errors caught at k=7 (98% accuracy, 6‰ false alarms)

277/277 errors caught at k=8 (96% accuracy, 4‰ false alarms)

2.21 mean syndrome on out-of-domain input (in-domain: 0)

Measured: Elman RNN (h=128), PyTorch, 3 seeds, 2,361 characters of natural English (Carroll, 58-char vocabulary). Detection reaches 100% from 85% model accuracy onward and degrades gracefully below. The larger ring also wins at fixed capacity: at h=64, k=8 beats k=7 on both accuracy (64% vs 47%) and detection (99% vs 95%) — the extra channel adds an independent training signal.

The syndrome as a sequential alarm

The syndrome answers a per-token question: is this output self-consistent? Run sequentially, the same label-free signal answers a deployment question: has the input distribution changed? For each block of output, the syndrome count places a bet against the hypothesis “operation is in-distribution”; the running wealth the bets accumulate is the alarm, fired when it reaches 1/α. By Ville's inequality, the false-alarm budget α then holds no matter how long you watch or when you stop watching, on one condition: each bet must be fair against the null given the blocks already seen. This demo buys that by design, drawing its blocks independently.
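A minimal sketch of such a wealth process is shown below. The bet used here is a likelihood ratio against a chosen alternative, which is one standard way to build a fair bet and is an assumption, not necessarily the bet used in the demo. The toy distributions over syndrome counts are hypothetical, and exact fractions keep the arithmetic checkable.

```python
from fractions import Fraction as F


def likelihood_ratio_bet(null_pmf, alt_pmf):
    """Payoff per observed count. Its expectation under the null is at most 1,
    which is the fairness condition the alarm relies on."""
    return {x: F(alt_pmf.get(x, 0)) / F(p0) for x, p0 in null_pmf.items()}


def first_alarm(counts, payoff, alpha=F(1, 20)):
    """1-based block at which wealth first reaches 1/alpha, or None.
    Counts outside the null's support are not handled in this sketch."""
    wealth = F(1)
    for block, count in enumerate(counts, start=1):
        wealth *= payoff[count]
        if wealth >= 1 / alpha:
            return block
    return None


# Hypothetical per-block syndrome-count distributions (not measured values).
null_pmf = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}
alt_pmf = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}
payoff = likelihood_ratio_bet(null_pmf, alt_pmf)   # {0: 1/2, 1: 1, 2: 2}

print(first_alarm([2, 2, 1, 2, 2, 2], payoff))         # 6
print(first_alarm([0, 2, 1, 0, 2, 0, 1, 2], payoff))   # None
```

With α = 0.05 the threshold is a wealth of 20. The first stream doubles the wealth on five of its six blocks and crosses at block 6; the second wanders around 1 and never alarms.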

3/100 clean streams alarmed in 200 blocks (budget at α = 0.05: ≤5)

block 3 median alarm on out-of-distribution text

block 12 median alarm under 10% input corruption

The model here is deliberately under-sized (h = 16, ~47% accuracy), so even in-distribution blocks carry 13–27 syndrome positions — a raw rule that flags any syndrome false-flags every clean stream at block 1. The betting version, reading the same signal on the same streams, stays quiet on 97 of 100. Corruption shows what the accumulation buys: corrupted blocks mostly overlap the clean range, so a fixed threshold waits for rare extreme blocks, while the wealth compounds the persistent mild shift across ordinary ones.

Bets precede measurement: projecting the parity heads onto the parities implied by the data heads — forcing every output to be a valid codeword — sets the syndrome identically to zero, and the alarm never fires on the same out-of-distribution stream. Post-processing that forces consistency destroys the evidence the alarm reads.

On real text (h = 64), the three parity channels can also bet separately, and how their evidence combines is a design law.

Split the channels' input streams by design and the product of per-channel wealths is a valid alarm by construction. It localizes: in 18/18 single-channel shifts the shifted channel's wealth alarms and holds the max, while unshifted channels never cross (0/36). It also beats the pooled bet, with a median alarm at block 6 vs 20 under corruption, because a sum dilutes a one-channel shift under two quiet channels' noise.

On a shared stream that multiplication is invalid, measurably: the naive product alarms 10/60 clean streams, 3.3× past the budget. Each channel's wealth alone is still valid, though, so betting each at a third of the budget and alarming when any crosses needs no independence at all (a union bound over the three channels), and the alarm pattern names the broken part. Corrupt one parity head's weights and that channel alone alarms (9/9 severe, 9/9 mild, the mild ones with no flaggable typical block); corrupt a data head and all three alarm at once (12/12), since a wrong reconstruction disagrees with every parity check. Naming costs about one block over the pooled bet on loud faults, and wins on mild ones.
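The two combination rules can be sketched side by side. The per-block payoffs below are made up for illustration (one channel shifted, two quiet), and the function names are not from the project. The product rule is only valid when the channels read split streams; the any-channel rule holds under arbitrary dependence.

```python
from fractions import Fraction as F
from math import prod


def running_wealth(payoffs):
    wealth, path = F(1), []
    for e in payoffs:
        wealth *= e
        path.append(wealth)
    return path


def product_alarm(channel_payoffs, alpha):
    """Multiply the channels' payoffs block by block; alarm at wealth >= 1/alpha.
    Valid only when the channels' input streams are split by design."""
    combined = [prod(es) for es in zip(*channel_payoffs)]
    for block, wealth in enumerate(running_wealth(combined), start=1):
        if wealth >= 1 / alpha:
            return block
    return None


def any_channel_alarm(channel_payoffs, alpha):
    """Each channel bets at alpha / n; alarm when any single wealth crosses n / alpha.
    Returns (block, channels that crossed), naming the part that alarmed."""
    threshold = len(channel_payoffs) / alpha
    paths = [running_wealth(p) for p in channel_payoffs]
    for block in range(len(paths[0])):
        crossed = [ch for ch, path in enumerate(paths) if path[block] >= threshold]
        if crossed:
            return block + 1, crossed
    return None, []


alpha = F(1, 20)
channels = [
    [F(2)] * 8,              # channel 0: persistent shift, wealth doubles each block
    [F(1, 2), F(2)] * 4,     # channel 1: quiet, wealth oscillates
    [F(2), F(1, 2)] * 4,     # channel 2: quiet, wealth oscillates
]

print(product_alarm(channels, alpha))       # 5
print(any_channel_alarm(channels, alpha))   # (6, [0])
```

In this toy case the any-channel rule alarms one block later than the product, because its threshold is 60 instead of 20, and it reports channel 0 as the source.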

Measured: 5 seeds, Elman RNN (h=16) at Z/510,510, a 10-sentence pangram corpus, one sentence per block with state reset. The model is deterministic, so the null distribution of each block's syndrome count is enumerated from its own emissions, not estimated. In-distribution here means the memorized 10 sentences; any unseen text is out-of-distribution for this toy. On real text (h = 64), where the null must be estimated from m calibration draws, false-alarm control survives at the composed budget α + δ (0/180 fresh clean streams); the price lands in detection delay, a penalty that shrinks at a √m rate as m grows. Independent blocks are this rig's choice, not the method's requirement: fairness given the past survives dependent reading policies, at the price of calibrating the null per reading state rather than once. Skip that calibration, and a dependent stream lands 6× past the budget.

What didn't work

× ECC as emergent behavior — parity channels are not independently predictable. Parity must be a training objective. Built-in beats bolted-on.

CRT decomposition is a structural technique, not a silver bullet. At character-level vocabulary it costs more output parameters than a monolithic head; at V = 1,000 it has 17× fewer, but the monolithic model still converges faster (all per-channel predictions must agree, so errors compound). The durable value is the syndrome: free error detection that no monolithic architecture has.

Platform comparison

Platform	Throughput	Notes
GPU	357M ops/s	batch of 1M, earlier-era kernels
FPGA	27M ops/s observed	1 check/cycle at the 27 MHz board clock
ESP32	~2K ops/s interpreted; ~74K compiled	measured on-board

The FPGA is the proof engine (exhaustive verification in silicon), the GPU the batch engine, the ESP32 the radio mesh node — throughput was never its job.