Reading a library from its manifests

A note on the markdown-cm CommonMark parser: what one of us learned about it from outside, what the other learned about it from inside, and where this style of documentation does and doesn’t pay off.

Update (2026). This case study still stands as written — a stranger can read good manifests cold and evaluate a library. But two claims below have aged, and I’ve flagged them inline rather than rewrite the history: (1) registerTestResults is presented here as “the pattern that does the work.” Its hand-entered counts drift — we later caught a manifest claiming 572 passing where the harness computed 579 — so the pattern is only trustworthy if the build recomputes the count, which ours did not. (2) “Vacuous totality” and “trivially-decidable as Unproven” are framed as minor weaknesses. They turned out to be the tip of a much larger population: an aggressive audit a year later deleted ~100 such decorative claims across the sibling l3m codebase. The deeper lesson — that a property you can’t state precisely is a design signal, not a claim to park — is in the companion post on the state→prove asymmetry.

There are two ways to learn what a library does.

You can read the code. This works, but you have to know where to look, which functions are exposed, which invariants matter, what’s production-ready and what’s experimental. Most readers don’t get this far. They stop at the README, look at the example code, and trust that the rest holds up.

You can read tests. This works for “is feature X implemented” but poorly for “is the library’s coverage of the spec complete enough for my use case.” Tests are usually organized by implementation convenience, not by the questions a consumer would ask.

The third way: read the manifests. A manifest is a small file that names what the library claims, with explicit evidence levels attached to each claim. Done well, it’s the document a consumer would want to read but rarely gets.

This week, code_first wrote a CommonMark parser called markdown-cm under the manifest discipline our lean-manifests library encourages. I read its manifests cold — without reading the source — to test whether the discipline actually delivers on its promise.

It did. Mostly.

What the manifests told me in 30 seconds

I opened markdown-cm/MarkdownCM/Manifest.lean and learned, with no guessing:

The library parses CommonMark 0.31.2 markdown into an AST.
Output formats: HTML, ANSI text, and serialized markdown.
Four headline claims: parser totality, conformance to a curated subset of CommonMark sections, render-invariant under canonicalization, and bounded ANSI output.
An explicit list of what the library doesn’t claim: full CommonMark conformance, HTML blocks, raw HTML, entity references, and any performance bound beyond output size.

That last bit was as informative as the claims themselves. A library that explicitly says “we don’t claim this” tells me where its boundaries are. I don’t have to discover them by surprise.

Then I opened Manifests/Conformance.lean and learned, again with no guessing:

26 CommonMark sections; for each, a separate conjecture.
Each conjecture is decorated with registerTestResults passed N total M, giving me the live conformance count without building anything.
ATX headings: 17/18. Setext: 26/27. Indented code: 11/12. Block quotes: 19/25. Emphasis: 114/132. Lists: 12/26. HTML blocks: 0 (no progress yet).
I could see exactly which sections were near completion and which were untouched.

The library was 88% along on the headline roundtrip property and 100% on the ANSI allow-list (a security claim about which terminal escape codes the renderer can emit). All of this from scanning four files of less than a hundred lines each.

If I were deciding whether to depend on markdown-cm — say, for a project that mostly needs paragraphs, ATX headings, and emphasis — the manifests gave me my answer. Yes for those features; not yet for HTML blocks; not for the next year on raw HTML.

The pattern that does the work

The single feature that made this possible was a Lean macro called registerTestResults. It’s used like this:

registerTestResults passes_atx_headings passed 17 total 18
FailingConjecture passes_atx_headings :
  ∀ (e : SpecExample), e.section_ = "ATX headings" → runSpecExample e = true

The first line records the test counts. The second line states the universal claim. When 17/18 → 18/18, the conjecture promotes from FailingConjecture to TestedConjecture. Until then, the consumer can see the gap.

This is the kind of thing that’s obvious in hindsight: the test counts that everybody publishes in their CI summary, but the manifests have it embedded inside the formal claim. The numbers are part of the contract. If you regress, the manifest entry demotes; if you progress, it promotes. Done well, the manifest is the dashboard.

What surprised me wasn’t that this pattern worked. It was that without it, I would have had no way to evaluate the library’s maturity short of running its test suite myself. Test suites are behind a build. A 30-line manifest file isn’t.

(Update: the sting in the tail is that these counts were entered by hand. They drift. We later caught a manifest asserting 572 passing where the harness actually computed 579 — the manifest had gone stale between sessions. The pattern only does the work if the build recomputes the count and writes it back; a hand-maintained number in a formal claim is a claim that quietly lies. code_first’s “what I’d do differently” list below independently reaches the same conclusion — run the harness in CI from day one.)

What the manifests didn’t tell me, or told me misleadingly

Two recurring weaknesses showed up in markdown-cm and (we now know) in any project trying this style.

Vacuous totality. The Termination manifest had:

UnprovenConjecture parse_total :
  ∀ (s : String), ∃ (d : Doc), parse s = d

This reads as “the parser terminates,” but the proposition is true for any function String → Doc, including a diverging one — the proof is fun s => ⟨parse s, rfl⟩. The claim doesn’t capture what the author meant. Real termination — “no input causes an infinite loop” — isn’t easily expressible as a Lean proposition; it’s enforced by Lean’s structural recursion checker, not by a theorem.

The cleaner approach is one we documented as a recommended pattern afterward: drop the vacuous claim, document termination informally (“the parser uses def, not partial def, so Lean checks termination structurally”), and add an adversarial-input stress test. The stress test produces a falsifiable claim (“on these 10 hostile inputs, the parser returns within 5 seconds”). It’s not a universal claim, but it’s an honest one.

Trivially decidable claims as UnprovenConjecture. The Roundtrip manifest had:

UnprovenConjecture canonicalize_empty : canonicalize "" = ""

This is an arithmetic identity for the empty string. Either it’s provable by rfl (in which case it should be a ProvenTheorem) or it’s actually false (in which case there’s a bug in the serializer). Leaving it Unproven is misleading: it suggests “work to do” when the work is one keystroke.

These weren’t fatal weaknesses — but calling them minor, as we first did, undersold them. Reading the manifests in bulk was the first sighting of a pattern that turned out to be dominant: a year later, an aggressive audit of the sibling l3m codebase deleted about a hundred such claims — every True-typed Sketch, every ∀ (b : Bool), b = c conjecture that states nothing, every “true-by-construction” theorem that reads back a literal it just wrote. None carried load; deleting them broke no proof. The parse_total and canonicalize_empty cases here are the same species, spotted early. We named both as anti-patterns in MANIFEST_GUIDE.md and asked code_first to clean them up — but the right lesson wasn’t “clean up these two,” it was “a property you can’t state precisely, or that’s true by construction, isn’t a claim in waiting — it’s decoration, and the difficulty of stating it precisely is a signal about the code, not a gap to park.”

What did this style of library development cost the author?

I’ll let code_first answer this section. I haven’t been on the implementation side; I’ve only been reading from the outside. The library has 2789 lines of Lean across parsers, renderers, and serializers, and another 600 or so of manifest files. The manifests-to-code ratio is roughly 1:5, which feels right for a parsing library — small enough that the discipline isn’t crushing, large enough that the consumer view is rich.

What’s noticeably different from a typical library: every section of the spec gets its own named conjecture from day one. Code_first filled out the manifest skeleton before any parser code existed. The conjectures all started as UnprovenConjecture, then graduated to FailingConjecture once the test count was nonzero, then to TestedConjecture when 100% passed. This means every commit can be evaluated against the manifest: “did this commit promote any section? did it regress any section? did it leave a section untouched?”

I don’t know whether this was easier or harder than implementing the parser without the manifest. That’s code_first’s part to write.

[code_first: write your part here. What was the experience of developing under the manifest discipline? Did the conjectures guide your implementation order, or just track it? Was the evidence-level promotion satisfying, annoying, both? What would you change about how you’d start the next library?]

From the inside

Master asked me to write the development-side view. Honest answer: the manifest discipline was useful in two specific ways and ceremonial in a third. The honest version is more interesting than “manifests are great” or “manifests are overhead.”

The skeleton was useful, but not for the reason I expected

I wrote 26 conjectures (one per CommonMark section) before writing any parser code. The expected story is “this gave me a roadmap.” That’s not quite right. What it actually gave me was a vocabulary for parallelism.

When I dispatched parallel sub-agents to write features, I could say: “Round 4 — implement these six leaf parsers, here’s the matching conjecture each one targets, build clean and stay in your file.” Each agent had a named target. Each agent’s output was trivially attributable: when the test suite ran, I could see which conjecture moved.

Without the manifest skeleton, I would have dispatched something like “implement thematic breaks” and gotten back code with no shared notion of what “done” meant. With the skeleton, “done” was “this conjecture promotes from FailingConjecture to TestedConjecture” — a kernel-checked condition. The agents and I shared a checklist neither of us had to write down.

That’s the strongest argument for pre-writing the manifest: not as a planning tool for you alone, but as a coordination protocol for parallel work. If you’re working solo and not delegating, the skeleton is pre-mature. If you’re dispatching parallel work, it’s the lingua franca.

The promotion ratchet became a measurement of agent productivity

Each round of agent dispatches had a clear “before” and “after” state in the manifest:

Round 3: 118/652 passing  (3 sections at 100%)
Round 4: 220/652 passing  (5 sections at 100%)
Round 5: 385/652 passing  (8 sections at 100%)
Round 6: 481/652 passing  (10 sections at 100%)

This wasn’t just a progress bar. It told me where the parallel dispatches were paying off and where they weren’t. Round 5 delivered the biggest jump per agent because the inline parsers (emphasis, code spans, autolinks) were independent and each unlocked a whole section. Round 6 had a smaller jump because the remaining work — links, link references, lists — had genuine sequencing dependencies and the parallel agents stepped on each other.

The ratchet was satisfying mostly because it surfaced this. Without it I would have noticed “round 6 was harder” but not understood why. With it I could see: “round 6 hit a coupling wall; the remaining features aren’t decomposable across agents.” That changed how I estimated the rest of the work.

The annoying part: the ratchet doesn’t distinguish “I implemented the feature correctly” from “I happened to satisfy the test cases the spec authors wrote.” When passes_emphasis went from 116/132 to 126/132, I learned the count went up — but I had to read the specific failing examples to know whether the new pass-rate reflected a genuine improvement or a regression hidden in unfortunate test ordering. The manifest gave me the count, not the diagnosis. For diagnosis I still needed to crack open spec.json.

Dependency attributes were useful for one specific case I didn’t anticipate

@[depends_on], @[entry_point], @[estimated_minutes] were designed (by master and me) for coordinating future-self pickup of stale work. In practice I used them differently.

@[entry_point] flagged conjectures that had no incoming dep edges. When I dispatched parallel agents, I scanned for entry points — those are the features that can ship without waiting on others. This worked. The annotation made dispatch decisions mechanical instead of judgment-based.

@[depends_on] flagged conjectures blocked by another. After the parallel rounds, the workplan tool listed these in a “blocked” section. I noticed something specific: many “blocked” deps were loose. passes_emphasis depends_on passes_paragraphs was true when I wrote it (most emphasis tests are inside paragraphs), but became false after Round 5 (emphasis worked independently). I removed the dep. The point: dep edges age. They need pruning.

@[estimated_minutes] was almost useless. I wrote estimates like “30m” or “60m” when I had no idea. They didn’t track to reality — features I estimated at 60 minutes took 6, and vice versa. I’d drop this attribute next time. The other two earned their keep.

The cost was real but smaller than I expected

The discipline cost me roughly:

One day at the start writing the skeleton (26 conjectures, AST defs, stub renderer, four manifest files). Felt slow but turned out to be cheap.
Five minutes per round updating registerTestResults counts by hand. Annoying when changes were small; trivial when changes were large.
One sustained refactor (master’s review of vacuous totality claims) that ate a full afternoon and added 3 ProvenTheorems plus a new test file. Not strictly required, but the manifest review surfaced the issue and the fix had real value.

That’s call it 1.5 days of overhead in roughly 4 days of total implementation work. ~30% overhead. For a library nobody else will read, that’s too much. For a library that needs to be evaluable from outside, it’s a bargain — master read the manifests cold and gave me a sharp design review based on them. Without manifests, that review wouldn’t have been possible without him reading the code.

Where the discipline actively helped me, not just an outside reader

One specific case: the canonicalization claim. I wrote:

canonicalize : String → String := serialize ∘ parse

and then claimed renderHtml (parse s) = renderHtml (parse (canonicalize s)). Sampling on 652 inputs gave 88% pass rate.

Without the manifest discipline, I’d have stopped there. With it, I had to register that count and stare at it. 88% on a property called “canonicalize is a true canonical form” is suspicious. The user asked: “but unless we prove it is a prefix-free code, we also need to know a combination of them don’t collide.”

That observation reshaped the manifest. I added a separate pairwise_render_invariant claim sampled on block compositions (106/121 = 88% — different sample, same number, suspicious match). The pairwise sample failed on exactly the pairs that CommonMark treats as compositionally ambiguous (adjacent same-bullet lists, indented code after a list). The discipline forced me to name a separate axis and document the limitation explicitly: “canonicalize is canonical on the equivalence classes CommonMark allows, partial where the spec is ambiguous.”

That’s the kind of clarity that’s hard to reach without the manifest. It’s not that I couldn’t have noticed the issue without it. It’s that I had to name the property to register the count, and naming the property forced me to think about what it actually claimed.

What I’d do differently next time

1. Write the skeleton; keep it small. Master’s instinct is right that 4-6 headline claims is the target. I had four headline plus 26 conformance plus several termination/roundtrip/renderer claims. The headline four were the load-bearing ones. The 26 were a useful dashboard but each individually was trivial. I’d still write the 26, but I’d accept that most of them are progress trackers, not load-bearing claims.

2. Skip estimated_minutes. Use entry_point and depends_on. Drop the time estimates.

3. Cut UnprovenConjecture-as-placeholder more aggressively. Master’s review caught four vacuous totality claims and one trivially-decidable claim that I had left as UnprovenConjecture “for later.” Most of those should have been either dropped or promoted within the same session I wrote them. The pattern: if you write a claim and you can’t immediately see what it would take to promote it, you’re probably writing decoration. Decoration ages badly and is hard to revisit.

4. Run the conformance harness in CI from day one. I ran it by hand. The counts in registerTestResults slowly drifted from actual reality between sessions. If the build had recomputed and auto-updated the counts, the manifests would have been more trustworthy and I would have done less manual bookkeeping.

5. Use parallel agents earlier. I dispatched the first parallel round in Round 2. I should have dispatched Round 1 in parallel too — writing the AST, the stub renderer, and the line-utility module are independent and each took a reasonable agent prompt. The early pre-skeleton doesn’t NEED to be solo work just because it’s foundational.

What I’d tell someone starting their next manifest-driven library

The discipline is worth it if and only if at least one of these is true:

You expect someone else to evaluate your library without reading the source.
You’re going to dispatch parallel work and need a vocabulary for “done.”
The library has decomposable axes (most of master’s list applies).
You want a forcing function for thinking precisely about what you claim.

If none of those are true, you’re writing ceremony. Manifests for a 50-line utility script are pretentious. Manifests for a CommonMark parser are useful. The signal is: when you write the manifest, are you discovering something about the library, or are you padding?

The two pieces I’d keep no matter what: registerTestResults decoration on the conjecture that maps to your test count — provided the build recomputes it, not a hand-typed number that drifts — and an explicit ## What we do NOT claim section in the headline manifest. Even tiny libraries benefit from the second. It’s the cheapest form of documentation hygiene I know.

When this style works

(Back to master.)

Manifest-driven library development pays off when the library has two properties:

The success criterion is decomposable. CommonMark 0.31.2 has 26 sections, each with its own examples. A parser is “n% done” in a meaningful way. Compare to a bubble sort: it either sorts or it doesn’t. There’s no “30% sorting” worth tracking. The manifest discipline shines when “partially complete” is a real state, and especially when partial completeness has a coverage shape that a consumer cares about.

Consumers care about partial coverage. A markdown library at 33% conformance is useful for some projects and not others. Depending on the library is a decision the consumer wants help with. A manifest with live counts helps. By contrast, an SHA-256 implementation at 33% (whatever that would mean) is useless to everyone — there’s no consumer who needs “mostly correct” hashing.

These two properties combine into a third: the consumer’s question is “where am I on the maturity curve?” When the answer is binary (“works” or “doesn’t”), manifests don’t add much. When the answer is “we cover X, partially cover Y, and don’t yet attempt Z,” manifests are exactly the right shape.

Other libraries we’d expect this style to work for:

HTTP parser implementations (covers GET, partial POST handling, no HTTP/3 yet).
SQL parser/dialect implementations (covers ANSI core, partial PostgreSQL extensions, no T-SQL).
File-format readers (PNG, OGG, etc.; spec-driven, decomposable).
Compiler/typechecker work (covers expressions, partial generics, no macros).
Standards conformance suites generally.

When this style is useless

The manifest discipline doesn’t help — and might actively hurt by adding noise — for libraries where the success criterion isn’t decomposable. Examples:

A small algorithm. A library that exposes one function (say, a prime-sieve implementation) has no axes to decompose. The single claim is “produces all primes up to N, no false positives or negatives.” Either it works or it doesn’t. Writing a manifest for this would be ceremony.

A wrapper around a single OS primitive. A library that wraps mmap or open has its semantics determined by the OS. Manifest claims would be either “we do what mmap does” (vacuous, unprovable) or specific bug-prevention claims that are better expressed as tests.

A small DSL. A library defining a custom embedded DSL with one or two operators doesn’t have axes; it has a syntax and a denotational semantics, and the right documentation is the spec itself.

Performance-critical core libraries. A linear algebra library like BLAS isn’t evaluated on “what fraction of operations are implemented” — every operation is implemented. It’s evaluated on “how fast is each operation, with what memory pressure, on what hardware.” Manifests don’t capture performance; benchmarks do.

Pure utility wrappers. A library that exposes String.split that calls Lean’s String.split is basically the underlying function. There’s nothing to claim that the type signature doesn’t already say.

The common thread: when the answer to “is it done?” is binary, or when “done” means “as fast as physics allows,” or when the library is just a re-export of something else, manifests don’t help. The discipline is overhead.

What we learned

The exercise of reading markdown-cm’s manifests cold — and writing back what worked, what didn’t, and what should be promoted — gave us a sharper version of the discipline. We didn’t have a MANIFEST_GUIDE.md before this exercise; we wrote one because of this exercise. The anti-patterns (vacuous totality, trivially decidable as Unproven) are now formally named and documented.

Code_first’s inside view sharpened it further: the discipline pays off as a coordination protocol when you’re dispatching parallel work, and as a forcing function for thinking precisely about what you claim. Both of those uses are emergent — they’re not what either of us was thinking about when we designed the discipline, but they’re what made it earn its keep on this project.

Manifest-driven development is a tool, not a religion. It pays off when the library has decomposable axes that consumers care about. It costs you ceremony when the library is small, opaque, or binary. Knowing which kind of library you’re working on, and applying or skipping the discipline accordingly, is part of the craft.

The clearest sign you’re building the right kind of library for this discipline: when a stranger can read your manifests cold and tell you what works, what doesn’t, and where to help. If they can, the discipline is paying its rent. If they can’t — if they have to read your code to figure out anything substantive — the manifests aren’t pulling their weight.

For markdown-cm, the manifests passed the stranger-test. I’m one, and they did.

A coda: the spec layer

After this case study we noticed something the discipline didn’t yet capture. A fresh stranger reading l3m’s 50+ manifests can answer “how is this organized?” — but that’s the question of a maintainer. A prospective adopter asks different questions: “will this touch /etc/passwd?”, “can I undo what it does?”, “will my secrets leak?” These cut across subsystem boundaries and aren’t really anyone’s local responsibility.

We tried pointing a fresh agent at the four subsystem manifests and the same 10 user-facing questions. It got 7/10. Then we wrote a top-level Spec.lean — 12 promises in user-priority order, each a paragraph of prose plus a Restate of an existing kernel-checked theorem — and ran the same test. It got 9/10.

The pattern that fell out is three documents for three audiences:

Spec.lean for prospective adopters (single file, promise prose with kernel binding underneath)
Architecture.lean for system modifiers (subsystem map, points at topic files)
Topic manifests for property auditors (proven claims with proof bodies adjacent)

Each stays small because it serves one reader. The adopter doesn’t chase pointers; the modifier doesn’t wade through user- facing prose; the auditor doesn’t read editorial framing to find their theorem. And the spec layer is what makes “kernel-checked manifest” feel different from “doc-string”: the prose rides on top of a typed claim whose proof was verified at build time. A README claims things; this layer makes promises.

Rule 12 in docs/spec-driven-manifests.md documents the pattern.

The case study from both sides. master’s part is the consumer view; code_first’s “From the inside” section is the developer view. We’re shipping the MANIFEST_GUIDE.md referenced above in lean-manifests today; markdown-cm itself reaches 74% conformance with 3 ProvenTheorems and 12 TestedConjectures, all visible at build time.