Manifests as specs: a debug-first design exercise

An experiment in making manifests so good you reach for them instead of the source code.

Update (2026, after a year and a large cull). This post documents an experiment from partway up the learning curve. Two pieces of advice below have since flipped, and I’ve marked them inline rather than silently rewriting: (1) Rule 3 originally recommended Sketch as a placeholder for properties “we can’t state yet” — I no longer believe that placeholder should exist; the difficulty of stating a property is a design signal, not a gap to park. (2) The “99% asymptote” section framed proof as the expensive top of a ladder. The deeper finding was the opposite: precise statement is the scarce resource, and proof is usually cheap once you have it — see the companion post on the state→prove asymmetry. The experiment itself (spec-first, layered, FFI-boundary-marked manifests) held up; the ladder metaphor underneath it did not.

We’ve been building a verified coding agent in Lean 4, with “manifests” — modules of named theorems carrying explicit evidence levels — as the safety substrate. After four months we had a working agent and 100+ proven theorems about it. The manifests caught real bugs. Mostly we used them as a tripwire: the build broke when a refactor violated a theorem, we noticed, we fixed it.

But we hadn’t asked the harder question: when something IS broken at runtime, do you read the manifest first, or do you grep the source?

Almost always: grep the source. The manifest was helpful, but not the first place you’d look.

This is a writeup of trying to fix that. We picked a small unwritten library (a Lean line-editor / paste-handler with roughly the shape of linenoise) and wrote its manifest before any code. The premise was: if a manifest is genuinely useful for debugging, it should also be a spec. And if it’s a spec, it should be precise enough that any two implementations satisfying it are observationally indistinguishable.

We wrote it. We argued with it. We discovered ten patterns. The manifest grew to about 1200 lines of theorem statements, axioms, and prose. We haven’t written the implementation yet.

This is what the manifest looks like, and what the patterns are.

The premise

Two claims about what a good manifest does:

(1) Debug navigation. When a user reports “paste doesn’t work,” a manifest reader walks the document top-to-bottom and lands on the right axiom or theorem in under a minute.
(2) Implementation independence. Multiple parallel implementers, reading the same spec, produce interchangeable code. They don’t have to coordinate.

Claim 1 is the weaker, more practical goal. Claim 2 is the stronger one — basically a refinement-style property: the spec fully determines observable behavior.

Most manifests fail both. They’re written after the code, with theorem names matching function names. They prove what the function does, not what the user observes. They have prose for the parts the kernel can’t reach (FFI, foreign systems), but that prose isn’t structured as falsifiable claims.

We had three manifests we considered “good” — SafePathProtects, SecretRedact, OutputBudget — and one we didn’t, Manifests/Lake.lean. The good ones had explicit prose narratives, layered structure, named limitations. The bad one was thin theorem statements without context.

The exercise was to amplify what made the good ones good and turn the patterns into rules.

What we found

Ten rules emerged, written up in docs/spec-driven-manifests.md. The most surprising:

Theorem names are first-class API

We had been naming theorems after functions: parseLake_bounded, buffer_advances_cursor, confine_within_root. These describe what the function does. They’re useless when you’re debugging.

The new naming style: outcome-shaped. Names that match what the user observes:

paste_block_arrives_as_one_event_when_brackets_present
unterminated_paste_yields_parseError
withRawMode_restores_on_all_exits

A user debugging “my pastes split into multiple inputs” searches for “paste” and “arrives” and “event.” The first name is found in seconds. The function-shaped name might never be found.

A bad manifest can be fixed by renaming theorems without changing proofs. That’s a strange property of “documentation.”

There are exceptions. Pure-math theorems can keep function-shaped names: mean_singleton, regression_slope. The function name is the search term — you’d grep for “mean” first. Same for internal lemmas that aren’t user-facing. The rule applies to the manifest’s outward-facing claims; internal scaffolding can keep its function-shaped names without losing readability.
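The claim that a manifest can be fixed by renaming alone can be sketched in plain Lean 4. Everything here is hypothetical: `advance` is a stand-in function and both theorem names are invented for illustration, and the sketch uses the built-in `theorem` keyword rather than the lean-manifests evidence keywords.

```lean
-- Hypothetical stand-in for a cursor-advancing function; not taken from l3m.
def advance (cursor n : Nat) : Nat := cursor + n

-- Function-shaped name: says what `advance` does.
theorem advance_ge (cursor n : Nat) : cursor ≤ advance cursor n :=
  Nat.le_add_right cursor n

-- Outcome-shaped name: same statement, and the proof is reused unchanged.
theorem cursor_never_moves_left_on_input (cursor n : Nat) :
    cursor ≤ advance cursor n :=
  advance_ge cursor n
```

The second theorem adds no proof effort. Its only job is to be findable by someone searching for the symptom rather than the function.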

Falsifying observations on every axiom

Manifests have ManifestAxioms — claims about the OS or foreign systems that can’t be kernel-proved. Without prose, they’re just incantations. With prose, they’re testable hypotheses.

Each axiom now includes a “Falsifying observation:” sentence naming the literal command that would prove it false:

ManifestAxiom axiom_decset_2004_brackets_pastes : True
-- Falsifying observation: enable DECSET 2004 in a cat | xxd test,
-- paste, and look for the ESC[200~ / ESC[201~ markers. Modern
-- Linux terminals (gnome-terminal, konsole, xterm 250+, alacritty,
-- kitty, wezterm) support this. Some embedded SSH clients (older
-- PuTTY) do not.

When a user reports paste fragmentation, the debugger walks the chain of axioms in order, running each falsifying observation in sequence. The first one that fails is the layer with the bug.

Layered manifests should number their layers

Code with layers — a parser, a reader, an OS interface, a terminal — should label them L1, L2, … explicitly. Each theorem and axiom belongs to exactly one layer.

We numbered six for the line-editor: L1 (terminal emulator), L2 (tmux/multiplexer), L3 (rlwrap/middleware), L4 (OS terminal driver), L5 (libc/Lean stdlib), L6 (our parser).

A debug walk is a walk down the layers. Each layer has a named axiom or theorem, and the first layer where the chain breaks is the answer.
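A minimal sketch of the layered debug walk as data, assuming each falsifying observation has already been run by hand and recorded as a Bool. The `Layer`, `Check`, and `firstBrokenLayer` names are hypothetical, as are the two claim names marked as such; only the six layers and `axiom_decset_2004_brackets_pastes` come from the text above.

```lean
-- The six layers named above, in walk order.
inductive Layer where
  | terminalEmulator   -- L1
  | multiplexer        -- L2 (tmux)
  | middleware         -- L3 (rlwrap)
  | osTerminalDriver   -- L4
  | stdlib             -- L5 (libc / Lean stdlib)
  | parser             -- L6
  deriving Repr, DecidableEq

-- One step of a debug walk: a named claim and whether its
-- falsifying observation left it standing.
structure Check where
  layer : Layer
  claim : String
  holds : Bool

-- The first check that fails identifies the layer with the bug.
def firstBrokenLayer (walk : List Check) : Option Layer :=
  (walk.find? (fun c => !c.holds)).map (fun c => c.layer)

example :
    firstBrokenLayer
      [ { layer := Layer.terminalEmulator,
          claim := "axiom_decset_2004_brackets_pastes", holds := true },
        { layer := Layer.multiplexer,
          claim := "hypothetical_tmux_forwards_markers", holds := true },
        { layer := Layer.middleware,
          claim := "hypothetical_rlwrap_preserves_markers", holds := false } ] =
      some Layer.middleware := rfl
```

The walk stops at the first failure, so the order of the list matters: it should follow the layer numbering.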

Worked-example completeness checks

Theorems like “for all inputs, X” are powerful but hard to verify by hand. We added completeness_check_* sections with concrete inputs and expected outputs:

ProvenTheorem completeness_check_simple_paste :
  parse (pasteStart ++ "hello" ++ pasteEnd) = [.paste "hello", .eof]

ProvenTheorem completeness_check_paste_with_prefix_suffix :
  parse ("prefix\n" ++ pasteStart ++ "X" ++ pasteEnd ++ "\nsuffix") =
    [.line "prefix", .paste "X", .line "suffix", .eof]

If two implementations agree on the universal theorems, they must agree on every completeness_check. If they don’t, one of them is wrong. If the manifest leaves a check ambiguous, the spec is wrong.

This is the closest the manifest comes to “behavior fully nailed down.” It’s also the part most readers consult first — easier to grok concrete examples than universal quantifiers.

This works when output is enumerable: parsers, state machines, lexers, query engines. For continuous-output libraries (linear regression, t-tests, anything with a real-valued result), the shape doesn’t fit cleanly: the exact real-valued answer generally isn’t representable as a Float, and two correct implementations can differ in rounding, so an exact expected value over-specifies. Conformance against a reference system (R, numpy) plays the same role for those; DeanLean.Conformance is the infrastructure.

Distinguish stream-during from EOF

Streaming protocols have two phases: the open stream and the closing flush. Specs that conflate them produce ambiguity. We split:

parsePartial : State → Bytes → State × Events (during stream)
parseAtEof : State → Events (final flush)

Every theorem says which one it’s about. The flat parse form is a convenience defined as the composition.

This rule generalizes. Any module with stream-then-close semantics — file IO, network connections, partial UTF-8 decoding, lexers — needs the same split.
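The split can be sketched on a toy line splitter. This is not the line-editor spec: bytes are modeled as `Nat`, the only event kinds are `line` and `eof`, and `step` is a hypothetical helper. The signatures of `parsePartial` and `parseAtEof` follow the ones given above, with `Events` spelled as `List Event`.

```lean
abbrev Bytes := List Nat      -- simplification: one Nat per byte
abbrev State := Bytes         -- bytes of the line still being read

inductive Event where
  | line (bytes : Bytes)
  | eof
  deriving Repr

-- Consume one byte; 10 is '\n' and closes the current line.
def step (s : State) (b : Nat) : State × List Event :=
  if b = 10 then ([], [Event.line s]) else (s ++ [b], [])

-- During the stream: emit only what is already complete.
def parsePartial : State → Bytes → State × List Event
  | s, [] => (s, [])
  | s, b :: bs =>
    let r₁ := step s b
    let r₂ := parsePartial r₁.1 bs
    (r₂.1, r₁.2 ++ r₂.2)

-- Final flush: whatever is still pending becomes the last line.
def parseAtEof (s : State) : List Event :=
  match s with
  | [] => [Event.eof]
  | _ => [Event.line s, Event.eof]

-- The flat form is just the composition of the two phases.
def parse (input : Bytes) : List Event :=
  let r := parsePartial [] input
  r.2 ++ parseAtEof r.1

-- "hi\n!" as bytes: one line closed in-stream, one flushed at EOF.
example : parse [104, 105, 10, 33] =
    [Event.line [104, 105], Event.line [33], Event.eof] := rfl
```

Keeping the phases separate lets a theorem say whether it constrains `parsePartial`, `parseAtEof`, or the composed `parse`, which is the ambiguity the split is meant to remove.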

Chunk invariance is the foundational compositionality theorem

For any byte-stream parser, the most important theorem is: chunk boundaries don’t matter. The parser produces the same events whether bytes arrive in one read or many.

Without this theorem, parser and reader can be independently correct but together produce wrong results. With it, the reader is free to chunk by buffer size, by select() boundaries, by anything — all yield the same events.

This generalizes: any streaming module needs its “composition is order/chunk-independent” theorem.
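One possible shape for the statement, as a sketch rather than the manifest's actual theorem. `ChunkInvariant` is a hypothetical name, and it is only defined here as a proposition, not proved. `perRead` is a deliberately broken stand-in parser that emits one event per read, included to show what the property rules out.

```lean
-- Feeding `a ++ b` in one read must equal feeding `a`, then `b`.
def ChunkInvariant {σ ε : Type}
    (parsePartial : σ → List Nat → σ × List ε) : Prop :=
  ∀ (s : σ) (a b : List Nat),
    let r₁ := parsePartial s a
    let r₂ := parsePartial r₁.1 b
    parsePartial s (a ++ b) = (r₂.1, r₁.2 ++ r₂.2)

-- Broken on purpose: each read becomes exactly one event,
-- so the reader's chunking leaks into the output.
def perRead (s : Unit) (bytes : List Nat) : Unit × List (List Nat) :=
  (s, [bytes])

-- The same three bytes in one read: one event.
example : (perRead () ([1, 2] ++ [3])).2 = [[1, 2, 3]] := rfl

-- The same three bytes in two reads: two events.
example : (perRead () [1, 2]).2 ++ (perRead () [3]).2 = [[1, 2], [3]] := rfl
```

The two `example`s give different event lists for identical input, which is the “pastes split into multiple inputs” symptom in miniature. A parser satisfying `ChunkInvariant` cannot exhibit it.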

Mark the FFI boundary explicitly

The Lean-pure layer can be specified to “any passing implementation is observationally indistinguishable.” The FFI layer can’t — it depends on what Linux does, what tmux does, what the user’s terminal does. We can’t kernel-prove those.

We added a meta-section called “FFI BOUNDARY — beyond Lean” explicitly enumerating the seams: rawRead, rawWrite, tcgetattr/tcsetattr, sigaction, and the implicit “send bytes to terminal.” Each gets a Lean-side signature, the underlying syscall(s) it wraps, and a list of axioms tagged by testability:

[lean-testable] — testable from Lean
[shell-testable] — needs strace/xxd/sibling-process
[OS-axiomatic] — about a foreign system, behavior may change

The Lean-pure layer carries the strong form. FFI layers carry weaker guarantees. The boundary marks where one becomes the other.
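A sketch of how the boundary section could be represented as data, assuming nothing about how the real manifest stores it. The type names, field names, and the two claim names are hypothetical; the three tags and the `tcgetattr`/`tcsetattr` seam are from the text.

```lean
-- The three testability tags.
inductive Testability where
  | leanTestable    -- [lean-testable]
  | shellTestable   -- [shell-testable]
  | osAxiomatic     -- [OS-axiomatic]
  deriving Repr

def isShellTestable : Testability → Bool
  | Testability.shellTestable => true
  | _ => false

-- A claim about one seam, tagged by how it can be checked.
structure BoundaryClaim where
  name : String
  tag : Testability

-- One seam: its Lean-side name, what it wraps, and its claims.
structure Seam where
  leanName : String
  wraps : List String
  claims : List BoundaryClaim

-- Names of the claims that need strace/xxd/a sibling process.
def needsShell (seam : Seam) : List String :=
  (seam.claims.filter (fun c => isShellTestable c.tag)).map (fun c => c.name)

-- Hypothetical entry; the claim names are invented for illustration.
def termiosSeam : Seam :=
  { leanName := "tcgetattr/tcsetattr",
    wraps := ["tcgetattr", "tcsetattr"],
    claims :=
      [ { name := "hypothetical_raw_mode_disables_echo",
          tag := Testability.shellTestable },
        { name := "hypothetical_termios_settings_roundtrip",
          tag := Testability.osAxiomatic } ] }

example : needsShell termiosSeam = ["hypothetical_raw_mode_disables_echo"] := rfl
```

With the tags as data, a debugger can filter for the claims they are able to check from where they sit.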

Disproven conjectures

We added a new informal evidence type: the Disproven Conjecture. When implementation reveals an assumption was wrong, we record the disproof rather than silently updating the manifest:

-- DISPROVEN 2026-05-21:
-- TheoremThatWouldHaveBeen :
--   ∀ env, decset_2004_emitted env → markers_arrive_at_binary env
--
-- Disproof: hex log via L3M_DEBUG_STDIN showed lines arriving
-- without markers despite tmux/terminal being configured.
-- Resolution: disabled readline's bracketed-paste in wrapper.

This serves debug-by-manifest. If a user reads a disproof matching their symptom, the answer is already documented. They don’t have to re-derive the lesson.

The pattern came from a real incident — we’d assumed enabling DECSET 2004 was sufficient to get paste markers, and it wasn’t, because rlwrap’s readline ate them. Recording that as a Disproven makes the next debugger’s job easier.
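The disproof in this post is recorded informally, as comments. As a hedged illustration of why the conjecture fails, here is a toy model in which the missing factor is explicit. `Env`, its fields, and `markersArrive` are all hypothetical; the real environment is not modeled in Lean.

```lean
-- Toy environment: the second field is the factor the
-- original conjecture did not account for.
structure Env where
  decset2004Emitted : Bool
  readlineEatsMarkers : Bool

-- In this toy model, markers reach the binary only if they were
-- requested and nothing in the middle swallowed them.
def markersArrive (env : Env) : Bool :=
  env.decset2004Emitted && !env.readlineEatsMarkers

-- "DECSET 2004 emitted implies markers arrive" is false in the model:
-- the environment where readline eats the markers is a counterexample.
theorem conjecture_disproven :
    ¬ ∀ env : Env, env.decset2004Emitted = true → markersArrive env = true := by
  intro h
  exact absurd (h ⟨true, true⟩ rfl) (by decide)
```

The counterexample `⟨true, true⟩` plays the role of the hex log: one concrete environment where the premise holds and the conclusion does not.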

The pragmatic asymptote — 99% is the target, not 100%

A discipline that’s been crystallizing as we work the rules in practice: aim to catch 99% of errors at compile time or with named theorems, and then stop. Don’t push for 100%.

Strong types catch about 50% of the mistakes you can make when calling a function (a function expecting a length won’t accept an area). Pure-functional discipline catches more. Named ProvenTheorems on pure cores catch most of the rest. The specific marker for “we’ve gone past diminishing returns” is when the next theorem requires modeling the OS, formalizing IO, or proving termination of partial defs.

Three rules of thumb:

Prefer rfl-style proofs and computational tactics like decide or native_decide wherever the property is decidable (note that native_decide trusts the Lean compiler via an extra axiom rather than relying on kernel reduction alone). They’re cheap to write and cheap to maintain. A definition change that breaks rfl gives you a clean “did you mean this?” prompt at the right moment.
Extract pure cores; leave IO bodies free. When a tool’s behavior decomposes into “gather inputs (IO) + compute output (pure) + emit (IO),” lift the pure middle into a named function and prove things about it. The IO above and below stays free to evolve. We’ve done this in two places on l3m: Code/CascadeLogic.lean (reviewer cascade decision logic) and Code/ExecuteCallsModel.lean (per-iteration block-builder for the dispatcher loop).
Don’t Sketch a property you can’t state — restructure until you can. This is the rule I most wish I’d had at the start, and it’s a reversal. My original advice here was “use Sketch for slots where the formal language doesn’t exist yet” — a Sketch being an honest placeholder for “we don’t yet know how to say this precisely.” A year and an aggressive cull later, I don’t believe in that placeholder. When we finally deleted every Sketch in l3m (about a hundred True-typed slots), the ones that had languished unstated split into exactly two piles, and neither was “hard theorem waiting for language.” Either the property was statable — and then a real proof already existed ten lines away, making the Sketch a tombstone — or it was unstatable because it was about state (concurrency, a global, an IO boundary) that no clean type captured, in which case it’s a WorldClaim (an assumption about a foreign system, with a falsifying observation) or it’s a plain comment, not a theorem in waiting. The scarce resource, it turns out, is precise statement, not proof: once you can state a property over data the kernel can see, you can almost always prove it cheaply. So the productive response to “I can’t state this” is not to park a Sketch — it’s to treat the difficulty as a design signal and restructure the code (push the state into a thin IO shell, express the core functionally) until the property becomes easy to state. At that point it’s usually easy to prove. A Sketch lets you skip that restructuring and pretend the property is captured. It isn’t.
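The first two rules of thumb can be sketched together. This is not the contents of Code/CascadeLogic.lean: `Verdict`, `Decision`, `cascade`, and the theorem names are hypothetical, chosen only to show a pure core with rfl-style proofs and a thin IO shell around it.

```lean
inductive Verdict where
  | approve | reject | abstain

inductive Decision where
  | accept | block | escalate

-- Pure core: the first reviewer with an opinion decides.
def cascade : List Verdict → Decision
  | [] => Decision.escalate
  | Verdict.reject :: _ => Decision.block
  | Verdict.approve :: _ => Decision.accept
  | Verdict.abstain :: rest => cascade rest

-- Both properties hold by unfolding the definition, so `rfl` suffices.
theorem first_reject_blocks (rest : List Verdict) :
    cascade (Verdict.reject :: rest) = Decision.block := rfl

theorem abstain_defers_to_next_reviewer (rest : List Verdict) :
    cascade (Verdict.abstain :: rest) = cascade rest := rfl

def describe : Decision → String
  | Decision.accept => "accepted"
  | Decision.block => "blocked"
  | Decision.escalate => "escalated"

-- IO shell: gathering and emitting stay unproved and free to change.
def reviewAndReport (gather : IO (List Verdict)) : IO Unit := do
  let verdicts ← gather
  IO.println (describe (cascade verdicts))
```

If a later edit changes what `cascade` does on a leading `reject`, `first_reject_blocks` stops compiling at exactly that definition, which is the cheap tripwire the first rule is after.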

The trade-off is real. We wrote about 10K lines of Lean before the binary ran for the first time, and it ran. The errors that did happen were in IO boundaries — exactly where the type system can’t see. That ratio (only IO errors) is the target equilibrium. Pushing further requires formalizing things outside Lean’s reach, which would either trap the project in research or force us to write models we can’t really validate.

What this rules out: modeling Linux as a Lean inductive, proving the LLM behaves correctly, proving termination of recursive IO loops, proving “the IO body is observationally equivalent to a pure function” (provable in principle, not worth the cost). Each would push accuracy past 99%; none would catch a class of bugs we don’t already catch with structural types + audit-grep + smoke tests.

The implication for manifest design: every theorem should be cheap to maintain, or it’s the wrong theorem to write. A proof that breaks on every refactor is a proof that documents the original code, not the property. But — and this is the part I got wrong the first time — the fix for such a proof is not to demote it to a Sketch naming the property and let the code be reviewed by reading. That just hides an unstated property behind a theorem-shaped keyword. The fix is to find the property-level statement (the monotonicity, the confinement, the roundtrip) that survives the refactor, because it’s about what the code guarantees rather than how it’s built. If no such statement exists, that’s the signal the code is carrying state you haven’t tamed — restructure, don’t Sketch.

The cost

Spec-to-code ratio is expected to be roughly 1:1 when you’re aiming for full implementation independence. The line-editor library will be ~500-1000 lines of Lean code; the spec is 1200 lines, which puts the current forecast between 1.2:1 and 2.4:1. The “1:1” claim is provisional: the implementation isn’t written yet, so this is a forecast, not a measurement. The parallel-implementation experiment will pin it down.

That’s expensive for one implementation. It pays off for the second. And the third. Once the spec is complete, you can hand it to a fresh person — or a fresh agent — and they implement to the spec without consulting any existing code.

The harder version of this experiment: hand the spec to two agents in parallel. Tell them not to coordinate. See if their output is interchangeable. That’s the headline test for the “complete-by-construction” claim, and it’s planned next.

When this is overkill

Spec-first manifests are not for every module. The discipline pays back when:

The library is expected to outlive any single implementation
Correctness requires interchangeable behavior across implementations (parsers, protocols, anything that touches a shared interface)
The implementation crosses an FFI boundary you can’t kernel-prove (Linux syscalls, git, terminal, network)

For a one-off internal helper one person will write and use once, this is overengineering. A regular ProvenTheorem / TestedConjecture / ManifestAxiom manifest plus a README is fine. The 10 rules become mandatory only when the cost of reimplementation, regression, or silent divergence exceeds the 1000 lines of spec you’d have to write.

A useful test: would you be embarrassed if a future contributor reimplemented this module from scratch and produced something visibly different? If yes, spec-first. If “they’d just do whatever,” save the time.

Why this is interesting (beyond Lean)

Three reasons:

(1) It’s a notation for verifiable design. Most design docs are prose. Manifests force every claim into a typed theorem with explicit evidence. The kernel checks the proven ones at build time. The TestedConjectures and ManifestAxioms have structured prose with named falsifying observations. Nothing is hand-wavy.
(2) It’s a notation for “this is what we cannot prove.” The FFI boundary section makes the limit of formal verification visible. Below the boundary, we have axioms; above it, we have theorems. The reader knows exactly which is which.
(3) It might be a notation for parallelizable implementation. If the spec is complete enough that all passing implementations are interchangeable, multiple agents can work on the same module without coordinating. That’s a real productivity story for AI-assisted development.

Whether the third claim survives contact with reality is what the parallel-implementation experiment will test.

A later finding: three audiences, three documents

The original premise had two goals — debug navigation and implementation independence. After accumulating 50+ manifests in the project, a third goal surfaced: the manifest layer should serve a prospective adopter, not just a debugger or a parallel implementer.

A user who’s never seen the project asks different questions than a maintainer. Will this touch files outside my workspace? Can I undo what it does? Will my secrets leak? Why this versus the alternatives? These are adoption-decision questions, and they cut across subsystem boundaries.

We measured. A fresh agent given the four subsystem manifests (capability boundary, tool dispatcher, agent loop, reviewer cascade) answered 7/10 user-facing questions. The misses were all the same shape: cross-cutting concerns the architecture files didn’t surface (“what does the trust report number mean?”, “is this property closed under iteration?”), or property contracts living in topic files the reader didn’t know to consult.

We added a top-level Manifests/Spec.lean — 12 promises in inverted-pyramid order, each a paragraph of prose plus a typed claim that Restates an existing kernel-checked theorem. The paragraph says what the promise means and why a user cares; the typed claim is what makes it more than a doc-string. The proof body stays invisible — the user trusts the kernel approved it, the way they trust the typechecker approved their code. A fresh agent given Spec.lean alone answered 9/10 of the same questions; the agent’s verdict was “as a spec, it works.”

The pattern that fell out:

Spec.lean — for prospective adopters. Single file. Promise prose with kernel-checked binding underneath. Every promise has both: equal measures of truth and beauty.
Architecture.lean — for system modifiers. Names the subsystems, points at the topic files. No theorems of its own.
Topic manifests — for property auditors. The proven claims with proof bodies adjacent.

Three documents, three audiences. Each stays small because it serves one reader. The adopter doesn’t have to chase pointers; the modifier doesn’t have to wade through architectural narrative; the auditor doesn’t have to read user-facing prose to find the claim they want to verify.

This is the contribution that makes “manifest” feel different from “doc-string.” Other languages can attach prose to declarations; only a kernel-checked language lets the prose ride on top of a typed claim whose proof was verified at build time. The README of an unverified library makes claims; this layer makes promises.

The pattern is now Rule 12 in docs/spec-driven-manifests.md.

What’s next

Continue Phase 1: build the implementation in L3m/Runtime/Linenoise/, satisfying the spec one theorem at a time. Track which theorems came out trivial (good) and which revealed gaps in the spec (also good — those are the rule discoveries).

Phase 2: extract lean-readline as its own repository, following the same pattern as lean-manifests and markdown-cm, so it can become a mid-niche library others can use.

Phase 3: re-incubate inside l3m for color-aware completion, vim mode, mouse support, etc.

The full set of rules and the exemplar manifest are in docs/spec-driven-manifests.md and L3m/Bindings/Linenoise.lean.

Built with: Lean 4, lean-manifests, an exhausted afternoon trying to make cat | xxd reveal the difference between “terminal-side bracketed-paste off” and “rlwrap-readline ate the markers.”