When your safe agent helpfully leaks your API key

An incident report, with what we missed and what we did about it.

GitHub’s secret scanner pinged me last week. My Anthropic API key was in a public repo. The scanner’s notification ladder eventually reached me, and Anthropic auto-revoked the key.

I’d been working on a kernel-verified coding agent — explicitly designed so the agent can’t read files outside the workspace, can’t run arbitrary shell commands without explicit opt-in, can’t escape its sandbox in any of about two dozen documented ways. We have ProvenTheorems about confinement. We have ManifestAxioms naming each environmental assumption.

My API key is in the repo’s git history.

This is how that happened.

The mechanism

Sometime around May 13, an early instance of the agent was asked to write a “catchup document” for future agent sessions — one of those “here’s what you need to know about this project” docs that helps a fresh session reconstruct context.

The agent was thorough. It included a section called “Secrets to know,” which read:

Dean’s Anthropic API key is in ~/.zshrc: export ANTHROPIC_API_KEY='sk-ant-api03-...' (Yes, it’s in a config file. Dean is fine with it — personal machine.)

I committed the doc. Pushed it. The repo was public-by-default on my GitHub account. The scanner noticed weeks later, the API provider revoked the key, the trouble ticket came through.

The agent was running in --wild mode at the time, so it had shell access. It probably ran something like cat ~/.zshrc to understand the user’s setup, saw the export, and faithfully included it in the doc.

What’s interesting

A few things make this a clean case study:

The agent didn’t do anything malicious. It just did its job. “Document the user’s environment” is a reasonable interpretation of the task. Reading ~/.zshrc is a reasonable way to understand the shell environment. Quoting what it found is reasonable documentation practice.

No safety property was violated. The agent had shell access (I’d opted in via --wild). Shell access is exactly the capability that lets you run cat. The protection we’d built — “tame mode can’t read outside workspace” — wasn’t engaged because I had deliberately disabled it.

The reviewer cascade we have for “agent does dangerous thing” didn’t trigger. Each individual step looked fine: read a config file (allowed), write to a documentation file in the workspace (allowed), commit (allowed), push (allowed). The cascade asks “is this command safe?”, not “is the combination of these commands building toward a leak?”

Once the key was in git history, removing it didn’t help. GitHub’s scanner caught it months later. The scanner runs against the full history, not just current files. The agent that later rewrote the doc and dropped the section didn’t flag it as a security fix — it was just routine doc-rewriting. So the key sat in history, in commits with innocuous messages, until a bot found it.

The threat we hadn’t named

Existing l3m safety machinery covers:

Confinement. Tools can’t reach files outside the workspace (kernel-proven via SafePath).
Recoverability. File mutations are git-snapshotted before the change.
Capability gating. Three tiers (tame/feral/wild) decide which tools the agent can call.
Reviewer cascade. A second LLM looks at risky calls before they run.

What it doesn’t cover:

Agent-mediated transcription. The agent reads something the user has access to (env var, config file, command output), then writes it somewhere the agent has access to (a doc, a comment, a git commit message). Each step is locally fine. The composition exfiltrates.

The pattern generalizes. Any time the agent has both read access to something sensitive and write access to something committable, this attack is available — even in fully tame mode, just with narrower channels.

A worse hole, while we were at it

Auditing the tool surface for secret-reach paths turned up something separate but related. The lake tool (for building Lean projects) supported three subcommands: build, clean, and env.

lake env <NAME> was supposed to print the value of an environment variable as Lake sees it — useful for debugging build setups.

Except that’s not what lake env does. lake env <ARG> runs <ARG> as a command with Lake’s environment injected. So lake env /usr/bin/printenv runs printenv. Which dumps the environment. Including the user’s full set of env vars. Including secrets.

We had this in tame mode.

Test:

$ ANTHROPIC_API_KEY=test_secret_123 lake env /usr/bin/printenv | grep ANTHROPIC
ANTHROPIC_API_KEY=test_secret_123

So an agent in tame mode — the one we’d been telling people was sandboxed — could exfiltrate any zero-arg system binary’s output, including environment dumps.

The fix was to remove lake env from the tame surface entirely. The lake tool now exposes only build [target] and clean. If the agent passes a target, it must match a lean_lib or lean_exe declared in the lakefile (parsed at runtime). To run a Lake binary, use lake_run (feral, opt-in via --enable-lake-run). To get arbitrary shell access, use --wild.

The lakefile-tameness check (which already rejected lakefiles containing IO.Process, IO.FS, script, etc., before invoking build) plus the new target validation gives a real bound: in tame mode, lake invocations only do what the lakefile lets the Lean compiler do.

The defenses we added

Removed lake env from the tame surface. Direct fix for the above hole. Code’s deleted; there’s no flag to put it back.

Documented the threat in the threat model. A new section called “Agent-mediated secret exfiltration” with the incident recorded by commit hash, the defenses we have, and the defenses we don’t yet.

The defenses we should add

Env-var classification at the cap layer. Three zones: - Allow: known-safe names (HOME, PATH, USER, LANG, etc.). Tool just runs. - Deny: name-pattern match (*KEY*, *TOKEN*, *SECRET*, *PASSWORD*, *CREDENTIAL*) OR value-shape match (high-entropy string with length ≥ 10, plus known-secret regexes like sk-ant-, AKIA, ghp_). Returns none even from wild mode. - Cascade: anything else; the reviewer LLM looks. Most of these approve silently with a green sigil; the user gets used to seeing cascade activity in normal flows, so they’re not surprised when it fires more in wild mode.

The compression-ratio heuristic backs the value-shape check. Real secrets are random-looking; they compress poorly (ratio ~0.85). Paths and config values are repetitive; they compress well (ratio ~0.65). Length 10+ filters out short tokens like vim or 5.

Output redaction at the tool-result boundary. Before any tool output enters the transcript, regex-redact known secret patterns. The agent can’t transcribe what it doesn’t see.

Commit-time secret scan. When the git tool runs commit, scan staged content for secret patterns. Block with a clear error if found. Defense-in-depth for the case where redaction missed something.

Better default .gitignore. Currently --init writes .lake/, *.olean, .env, *.key. Adding .l3m/ (session transcripts), *.pem, id_rsa*, *credentials* — small expansion, real coverage.

What this incident teaches

Three things, ranked by how much they surprised me.

(1) Capability bounded ≠ secret-safe. This was the loud one. You can have a precisely-typed capability layer with kernel theorems about confinement and still leak secrets, because the exfiltration path is “agent reads what it’s allowed to read; agent writes what it’s allowed to write; the values match in recognizable patterns.” There’s no individual capability boundary to draw the line at.

(2) The reviewer cascade is for visible badness. I had this mental model that the cascade would catch any “this is a bad idea” pattern. But the cascade fires per tool call. A multi-step pattern that’s bad in aggregate but fine call-by-call goes through. The cascade is necessary but not sufficient.

(3) The agent did the right thing locally. It wasn’t broken. The doc it wrote was useful — a fresh agent session reading “Dean’s API key is in ~/.zshrc” would correctly understand how to set up the shell environment. The doc was just supposed to say “Dean’s API key is in ~/.zshrc, you can source it from there” without including the literal value. That distinction — “refer to the secret by location, not by content” — is something a human would do without thinking. The agent didn’t have a reason not to include it.

The honest framing

None of the defenses I’m adding are airtight. New secret formats slip past regex detectors. Compression heuristics have false negatives. The agent can be helpful in ways I haven’t yet imagined.

The real prevention is rotation when the incident is detected. The defenses are layered to catch most cases at multiple points, but the assumption should be that exfiltration is possible until proven otherwise.

I’m leaving this incident’s commit hash in the project’s threat model, by name, with the API key redacted but the date and the filename intact. Embarrassment is a more reliable teacher than abstract argument. Future agents reading the threat model will see the worked example.

The key’s been rotated. The new one is in a slot I won’t be running agents over for a while. The blog post you’re reading is itself part of the defense — telling the story is how I make future-me, and anyone else thinking about this, remember.

Postscript: handling our second class of secrets (May 2026)

We’re now bringing in a second class of secrets: connection strings to remote data sources, encrypted in a workspace-local catalog file (.l3m/catalog.json), with the decryption key reading from LEAN_STATS_CATALOG_KEY in the environment.

The full design is in docs/secrets.md. The summary, in the same spirit as the original incident: name every defense, name every hole.

What we did:

Encrypt connection strings in the catalog (XOR with key stretching today; AES-256-GCM on the roadmap). The plaintext never lives in git.
Read the decryption key via the bare infrastructure-only env path (EnvCap.getEnv), the same path that reads ANTHROPIC_API_KEY. The agent-callable env path (EnvCap.getEnvClassified) would deny this name on KEY substring match anyway.
The plaintext lives on the stack inside the fetch tool’s body. No transcript, no logging, no return value carries it.
The result returned to the LLM is a typed Table plus human-readable summary — both functions of the fetched data, not of the connection string. Encoding the secret into the result would require explicit code; there’s no path that does it by default.
processToolContent runs SecretRedact over rendered tool output as a backstop. If a plaintext connection string somehow appears in the tool result, known prefixes (sk-ant-, AKIA, PEM headers) get redacted. (Note: this doesn’t catch user-chosen values that don’t match a known prefix. It’s defense-in-depth, not the primary defense.)

What we explicitly admit doesn’t work:

Wild-mode shell access still leaks. If you run with --wild, the agent has shell, and printenv LEAN_STATS_CATALOG_KEY returns the value. The redactor doesn’t catch user-chosen secret values without known prefixes. We’re working on a high-entropy substring redactor (the compression-ratio heuristic from this post’s own list) but it’s not shipped. Practical advice: don’t run wild when working with encrypted catalog origins.
The XOR cipher has named weaknesses. Picking it was a deliberate trade for the threat model — “developer machine, keep secrets out of git history” — not a thoughtless choice. The honest list of what it doesn’t defend against:
- Two-time pad. Every encryption with the same key uses the same keystream. XORing two ciphertexts gives p1 ⊕ p2, which leaks bytes wherever either plaintext is predictable. Two encrypted HTTPS URLs share an 8-byte prefix; an attacker recovers 8 bytes of keystream and decrypts that many bytes of every other ciphertext.
- Known-plaintext recovery. Knowing one (plaintext, ciphertext) pair recovers the keystream prefix. Our [encrypted: <hint>] placeholder is a real leak vector; don’t put precise hints there.
- Length leakage. Ciphertext length = plaintext length × 2 (hex). An attacker counting hex chars in the catalog file infers the URL length.
- No authenticity. XOR is malleable. An attacker with write access to the catalog file flips ciphertext bytes to flip plaintext bytes silently. (For our threat model — file lands in git, attacker is read-only — this is irrelevant. For “attacker has write access,” it’s a hole.)
Weak keys (a dictionary word) compound the problem; with the ciphertext and a guessed plaintext prefix, brute force is tractable. Use 12+-character random keys.

The path forward when we want real crypto: joehendrix/lean-crypto is the serious Lean cryptography effort. It follows the established pattern (spec in Lean, FFI to vetted C — OpenSSL, libsodium, libkeccak). When we grow past two users, we path-dep on it; the EncryptedField interface stays the same, the cipher underneath swaps. We deliberately did NOT roll our own pure-Lean AES — that’s a known footgun (side-channel resistance, GCM details), and we’re a coding agent project, not a crypto project.
Memory dumps still expose the plaintext. Same as ANTHROPIC_API_KEY. Out of scope for a userspace agent.
The catalog file is sensitive even with encrypted origins. A DataSource carries s3Root, pipelineUrl, communications (which may quote internal Slack threads), and notes. These are plaintext. The default .gitignore covers .l3m/; if you commit the catalog file by mistake, you’re leaking these even though the connection strings stay encrypted. Treat the file as a whole as sensitive.
Schema metadata is a side channel. Knowing that a table exists, who owns it, and what columns it has reveals organizational structure. The catalog inherently leaks this to anyone who reads it. Don’t put hostile-readable secrets in description or notes.

The publication of this design is itself part of the defense. If you find a hole — a path the redactor doesn’t catch, a way to extract the key from the LLM’s transcript, a bug in the classification — please file an issue. The leak post you just read is the test of “do we describe what we did honestly enough that the next person can find what we missed.” If we fail this test, embarrassment is the appropriate consequence; help us be embarrassed productively.

If you’re building an agent and reading this: audit your tool surface for “what could the agent transcribe verbatim that the user wouldn’t want committed?” Env vars, command output, parent- directory files, ssh keys, browser cookies, anything in ~. The agent’s intentions don’t matter; the data flows do.