The Lean Project

Notes on formalizing the C++ standard library in Lean 4, and the manifest machinery — Lean Manifests — that came out of it.

See also: Thoughts on Working with LLMs for general workflow lessons, and the full technical write-up at lean4.ai.

What the project is

Formalize the C++ standard library (N4950, about 1,000 pages of library content) in Lean 4. Replace English-prose specification with machine-checked theorems. The proofs and code are second-class — what matters is the manifest: what’s true, and how confident we are.

Working code at Lean Manifests. 340+ modules, ~950 theorem-like declarations, all building.

The problem that shaped everything

Lean has two ways to stand up a theorem you can’t fully prove:

axiom foo : T — declared, tracked, shows up in the axiom-usage report of anything that depends on it.
theorem foo : T := by sorry — emits a warning, contaminates the dependency graph, gives readers no structure.

Every sorry looks the same. A deliberate assumption carrying an author’s full intent is indistinguishable from a stray scratch-hole someone forgot to clean up.

For a formalization project that spans 1,000 pages and will take years, this is untenable. Readers (human and AI) need to scan a module and know immediately: is this claim proven, tested, decomposed, derived from stated assumptions, or just hoped for? Requiring them to open each proof file defeats the point of having a library interface at all.

The three-way split

Lean Manifests impose a discipline on how a module’s files are organized:

Category	Contents	Visibility
Manifests	Claims: theorem signatures with explicit evidence levels	Public
Vocabulary (`Defs/`)	Types and predicates that appear in the claims	Public
Everything else (`Code/`, `Proofs/`, `Tests/`)	How each claim is established	Hidden

A consumer reads the manifest plus the vocabulary. That tells them what the module claims and enough to understand the claims. They never need to open a proof file.

The evidence hierarchy

Five new kinds of sorry, each less apologetic than the last:

sorry                    "I have nothing."
UnprovenConjecture foo   "I claim this. I haven't proven it.
                          I intend to eventually."
TestedConjecture foo     "I claim this. I've verified one concrete
                          case (foo_test)."
DecomposedConjecture foo "I claim this. I've proven the structure
                          (foo_derivation), and every lemma it
                          bottoms out in is at least tested."
DerivedConjecture foo    "I claim this. I've proven it, modulo
                          assumptions declared elsewhere."
ProvenTheorem foo        "I claim this. Fully apologized:
                          no sorry anywhere."

Each is a macro. Each checks different obligations at elaboration time. The macro name appears in the source, so scanning a module gives the evidence level at a glance. No compiler patches — about 300 lines of metaprogramming on top of stock Lean 4.

Teeth (the enforcement)

The hierarchy was useful as a convention for years. It became a contract when we added the enforcement: every macro stamps its declaration with @[manifest_entry], and DerivedConjecture / DecomposedConjecture refuse to compile if any sorry-bearing dependency lacks the attribute.

Concretely: if a derivation depends on a stray theorem X := by sorry in a proof file (instead of a declared UnprovenConjecture X : T in a manifest), the build fails with a pointed error message naming the stray and suggesting the fix.

Before the enforcement, the trust report emitted by DerivedConjecture listed sorry deps by name — but those names could point at undeclared holes, and a reader had no way to tell the difference between a carefully-declared assumption and a forgotten scratch-hole. After the enforcement, the trust report is guaranteed complete: every entry is a named, attributed manifest entry, not a stray sorry.

A reader skimming a derivation can see everything it assumes by reading the manifest files. There is no off-the-record sorry, because any off-the-record sorry would have broken the build.

Why this matters for AI

LLM agents that work on Lean code today face a comprehension crisis: the library is too big to fit in context, proof bodies are unnecessary for using a theorem, but standard Lean gives no machine-extractable distinction between “proven” and “sorry’d” at the interface. Every consumer has to open every proof file to know what’s real.

Manifests + evidence hierarchy + enforcement solves this: an AI scans the manifest, sees the evidence level of each claim, and proceeds accordingly. A claim marked DerivedConjecture points at its stated dependencies; if the AI needs higher confidence than the derivation offers, it knows exactly which assumptions to strengthen.

The 42% token savings reported in the full write-up is the quantitative measure. The qualitative win is that the evidence is machine-readable and enforced, not a stylistic convention in a coding guide.

A note on the name

The project was originally called “Levelized Lean” — a nod to John Lakos’s physical-design framework for C++ (level-based decoupling to break compile-time cascades). That name fit when the main contribution was cascade prevention. Lean 4.30’s module keyword has since moved most of the cascade-prevention work into the compiler itself. What’s left — and what’s load-bearing now — is the evidence hierarchy, the vocabulary separation, and the teeth.

So the project is now Lean Manifests. The public write-up lives at lean4.ai/lean-manifests.html.

Manifests as tripwires, then as specs

Here’s an honest evolution. I started by treating manifests as a build-verification system, not a design system. The kernel-checked claims did real work. The non-kernel-checked claims were mostly documentation that drifted if I wasn’t careful.

The evidence hierarchy catches you after you’ve already written the wrong thing. Concrete example: I migrated 24 tool functions in l3m to a new capability-based API. I didn’t consult the manifests first. I just wrote the code, ran lake build, and the kernel caught every bypass — functions that still used the old unconstrained paths, derivations whose sorry-deps pointed at stray theorems that no longer existed. The build did the work. The manifests were tripwires, not blueprints.

This was fine for verification. It wasn’t fine for design. The manifests said “confineIO rejects symlink escapes” and “ProcessCap.runTame only spawns TameCommands” — but when you were sitting in Tools/Lake.lean adding a --json flag, you didn’t naturally think “let me go read the manifest for this function first.” You’d grep for the function, read the code, make the change, and find out at build time whether you violated anything.

The discoverability problem was structural: theorems were organized by concept (all path-confinement theorems together, all MemFS theorems together), not by function. If you were modifying FsCap.writeFile, you’d need to know that confine_within_root and confine_rejects_dotdot and confineIO_rejects_symlink_escape all applied to it. Nothing in the code pointed you there.

Two fixes landed. The first was a @[theorems] attribute that tags each function with its relevant theorems, plus a generated index that maps functions to their theorem obligations. Better infrastructure for consultation; whether developers (or LLM agents) actually consult before modifying is a workflow question, not a tooling question.

The second fix was the spec layer. Manifests organized by user concern rather than implementation layer. A top-level Manifests/Spec.lean that names the promises a user worries about (workspace boundary, secret leakage, undo-ability, the agent stops), in the order users worry about them, with each promise binding to an existing kernel-checked theorem via Restate. The reader sees the promise prose plus the evidence-level tag — ● for kernel- checked, ◑ for documented intent — and never has to look at the proof body. They trust the kernel approved it, the way they trust the typechecker approved their code.

The spec layer is what made the manifest discipline feel different from doc-strings. Other languages can attach prose to functions; only a kernel-checked language can attach prose to a typed claim whose proof was verified at build time. The README of an unverified library makes claims; this layer makes promises.

We measured. A fresh agent given the four subsystem manifests answered 7/10 user-facing questions about l3m. The same agent given Spec.lean plus three subsystems answered 9/10. The misses in the first round were exactly the cross-cutting questions a prospective adopter asks, which the implementation-organized manifests didn’t surface.

Three documents now form the canonical reading order: Spec.lean for adopters, Architecture.lean for modifiers, topic manifests for auditors. Each stays small because it serves one reader.

What’s next

The Lean Manifests work covers the C++ standard library. lean4.ai has the public artifacts: a headerized version of CSLib (239 files, 220+ theorems in the manifests), interval arithmetic, happens-before memory models. Lean Manifests is the underlying machinery in each.

Related: l3m — a verified coding agent built on the manifest macros. Each tool carries a manifest; each manifest bottoms out in named axioms about the environment; the trust report is a build artifact. Work in progress.