Manifests as specs: a debug-first design exercise

An experiment in making manifests so good you reach for them instead of the source code.


We’ve been building a verified coding agent in Lean 4, with “manifests” — modules of named theorems carrying explicit evidence levels — as the safety substrate. After four months we had a working agent and 100+ proven theorems about it. The manifests caught real bugs. Mostly we used them as a tripwire: the build broke when a refactor violated a theorem, we noticed, we fixed it.

But we hadn’t asked the harder question: when something IS broken at runtime, do you read the manifest first, or do you grep the source?

Almost always: grep the source. The manifest was helpful, but not the first place you’d look.

This is a writeup of trying to fix that. We picked a small unwritten library — a Lean line-editor / paste-handler with roughly the shape of linenoise — and wrote its manifest before any code. The premise was: if a debug-useful manifest is genuinely useful, it should also be a spec. And if it’s a spec, it should be precise enough that any implementation satisfying it is observationally indistinguishable from any other.

We wrote it. We argued with it. We discovered ten patterns. The manifest grew to about 1200 lines of theorem statements, axioms, and prose. We didn’t write the implementation yet.

This is what the manifest looks like, and what the patterns are.


The premise

Two claims about what a good manifest does:

  1. Debug navigation. When a user reports “paste doesn’t work,” a manifest reader walks the document top-to-bottom and lands on the right axiom or theorem in under a minute.
  2. Implementation independence. Multiple parallel implementers, reading the same spec, produce interchangeable code. They don’t have to coordinate.

Claim 1 is the weaker, more practical goal. Claim 2 is the stronger one — basically a refinement-style property: the spec fully determines observable behavior.

Most manifests fail both. They’re written after the code, with theorem names matching function names. They prove what the function does, not what the user observes. They have prose for the parts the kernel can’t reach (FFI, foreign systems), but that prose isn’t structured as falsifiable claims.

We had three manifests we considered “good” — SafePathProtects, SecretRedact, OutputBudget — and one we didn’t, Manifests/Lake.lean. The good ones had explicit prose narratives, layered structure, named limitations. The bad one was thin theorem statements without context.

The exercise was to amplify what made the good ones good and turn the patterns into rules.


What we found

Ten rules emerged, written up in docs/spec-driven-manifests.md. The most surprising:

Theorem names are first-class API

We had been naming theorems after functions: parseLake_bounded, buffer_advances_cursor, confine_within_root. These describe what the function does. They’re useless when you’re debugging.

The new naming style: outcome-shaped. Names that match what the user observes:

A user debugging “my pastes split into multiple inputs” searches for “paste” and “arrives” and “event.” The first name is found in seconds. The function-shaped name might never be found.

A bad manifest can be fixed by renaming theorems without changing proofs. That’s a strange property of “documentation.”

Falsifying observations on every axiom

Manifests have ManifestAxioms — claims about the OS or foreign systems that can’t be kernel-proved. Without prose, they’re just incantations. With prose, they’re testable hypotheses.

Each axiom now includes a “Falsifying observation:” sentence naming the literal command that would prove it false:

ManifestAxiom axiom_decset_2004_brackets_pastes : True
-- Falsifying observation: enable in cat-xxd test, paste, look
-- for markers. Modern Linux terminals (gnome-terminal,
-- konsole, xterm 250+, alacritty, kitty, wezterm) support
-- this. Some embedded SSH clients (older PuTTY) do not.

When a user reports paste fragmentation, the debugger walks the chain of axioms in order, running each falsifying observation in sequence. The first one that fails is the layer with the bug.

Layered manifests should number their layers

Code with layers — a parser, a reader, an OS interface, a terminal — should label them L1, L2, … explicitly. Each theorem and axiom belongs to exactly one layer.

We numbered six for the line-editor: L1 (terminal emulator), L2 (tmux/multiplexer), L3 (rlwrap/middleware), L4 (OS terminal driver), L5 (libc/Lean stdlib), L6 (our parser).

A debug walk = walk down the layers. At each one, we have a named axiom or theorem. The first layer where the chain breaks is the answer.

Worked-example completeness checks

Theorems like “for all inputs, X” are powerful but hard to verify by hand. We added completeness_check_* sections with concrete inputs and expected outputs:

ProvenTheorem completeness_check_simple_paste :
  parse (pasteStart ++ "hello" ++ pasteEnd) = [.paste "hello", .eof]

ProvenTheorem completeness_check_paste_with_prefix_suffix :
  parse ("prefix\n" ++ pasteStart ++ "X" ++ pasteEnd ++ "\nsuffix") =
    [.line "prefix", .paste "X", .line "suffix", .eof]

If two implementations agree on the universal theorems, they must agree on every completeness_check. If they don’t, one of them is wrong. If the manifest leaves a check ambiguous, the spec is wrong.

This is the closest the manifest comes to “behavior fully nailed down.” It’s also the part most readers consult first — easier to grok concrete examples than universal quantifiers.

Distinguish stream-during from EOF

Streaming protocols have two phases: the open stream and the closing flush. Specs that conflate them produce ambiguity. We split:

Every theorem says which one it’s about. The flat parse form is a convenience defined as the composition.

This rule generalizes. Any module with stream-then-close semantics — file IO, network connections, partial UTF-8 decoding, lexers — needs the same split.

Chunk invariance is the foundational compositionality theorem

For any byte-stream parser, the most important theorem is: chunk boundaries don’t matter. The parser produces the same events whether bytes arrive in one read or many.

Without this theorem, parser and reader can be independently correct but together produce wrong results. With it, the reader is free to chunk by buffer size, by select() boundaries, by anything — all yield the same events.

This generalizes: any streaming module needs its “composition is order/chunk-independent” theorem.

Mark the FFI boundary explicitly

The Lean-pure layer can be specified to “any passing implementation is observationally indistinguishable.” The FFI layer can’t — it depends on what Linux does, what tmux does, what the user’s terminal does. We can’t kernel-prove those.

We added a meta-section called “FFI BOUNDARY — beyond Lean” explicitly enumerating the seams: rawRead, rawWrite, tcgetattr/tcsetattr, sigaction, and the implicit “send bytes to terminal.” Each gets a Lean-side signature, the underlying syscall(s) it wraps, and a list of axioms tagged by testability:

The Lean-pure layer carries the strong form. FFI layers carry weaker guarantees. The boundary marks where one becomes the other.

Disproven conjectures

We added a new informal evidence type: the Disproven Conjecture. When implementation reveals an assumption was wrong, we record the disproof rather than silently updating the manifest:

-- DISPROVEN 2026-05-21:
-- TheoremThatWouldHaveBeen :
--   ∀ env, decset_2004_emitted env → markers_arrive_at_binary env
--
-- Disproof: hex log via L3M_DEBUG_STDIN showed lines arriving
-- without markers despite tmux/terminal being configured.
-- Resolution: disabled readline's bracketed-paste in wrapper.

This serves debug-by-manifest. If a user reads a disproof matching their symptom, the answer is already documented. They don’t have to re-derive the lesson.

The pattern came from a real incident — we’d assumed enabling DECSET 2004 was sufficient to get paste markers, and it wasn’t, because rlwrap’s readline ate them. Recording that as a Disproven makes the next debugger’s job easier.


The cost

Spec-to-code ratio is about 1:1 when you’re aiming for full implementation independence. The line-editor library is ~500-1000 lines of Lean code. The spec is 1200 lines.

That’s expensive for one implementation. It pays off for the second. And the third. Once the spec is complete, you can hand it to a fresh person — or a fresh agent — and they implement to the spec without consulting any existing code.

The harder version of this experiment: hand the spec to two agents in parallel. Tell them not to coordinate. See if their output is interchangeable. We’re going to try this; report follows.


Why this is interesting (beyond Lean)

Three reasons:

  1. It’s a notation for verifiable design. Most design docs are prose. Manifests force every claim into a typed theorem with explicit evidence. The kernel checks the proven ones at build time. The TestedConjectures and ManifestAxioms have structured prose with named falsifying observations. Nothing is hand-wavy.

  2. It’s a notation for “this is what we cannot prove.” The FFI boundary section makes the limit of formal verification visible. Below the boundary, we have axioms; above it, we have theorems. The reader knows exactly which is which.

  3. It might be a notation for parallelizable implementation. If the spec is complete enough that all passing implementations are interchangeable, multiple agents can work on the same module without coordinating. That’s a real productivity story for AI-assisted development.

Whether the third claim survives contact with reality is what the parallel-implementation experiment will test.


What’s next

Continue Phase 1: build the implementation in L3m/Runtime/Linenoise/, satisfying the spec one theorem at a time. Track which theorems came out trivial (good) and which revealed gaps in the spec (also good — those are the rule discoveries).

Phase 2: extract lean-readline as its own repository following the same pattern as lean-manifests and markdown-cm. Become a mid-niche library others can use.

Phase 3: re-incubate inside l3m for color-aware completion, vim mode, mouse support, etc.

The full ten rules and the exemplar manifest are in docs/spec-driven-manifests.md and L3m/Manifests/Linenoise.lean.


Built with: Lean 4, lean-manifests, an exhausted afternoon trying to make cat | xxd reveal the difference between “terminal-side bracketed-paste off” and “rlwrap-readline ate the markers.”