When your safe agent helpfully leaks your API key
An incident report, with what we missed and what we did about it.
GitHub’s secret scanner pinged me last week. My Anthropic API key was in a public repo. The scanner’s notification ladder eventually reached me, and Anthropic auto-revoked the key.
I’d been working on a kernel-verified coding agent — explicitly designed so the agent can’t read files outside the workspace, can’t run arbitrary shell commands without explicit opt-in, can’t escape its sandbox in any of about two dozen documented ways. We have ProvenTheorems about confinement. We have ManifestAxioms naming each environmental assumption.
My API key is in the repo’s git history.
This is how that happened.
The mechanism
Sometime around May 13, an early instance of the agent was asked to write a “catchup document” for future agent sessions — one of those “here’s what you need to know about this project” docs that helps a fresh session reconstruct context.
The agent was thorough. It included a section called “Secrets to know,” which read:
- Dean’s Anthropic API key is in
~/.zshrc:export ANTHROPIC_API_KEY='sk-ant-api03-...'(Yes, it’s in a config file. Dean is fine with it — personal machine.)
I committed the doc. Pushed it. The repo was public-by-default on my GitHub account. The scanner noticed weeks later, the API provider revoked the key, the trouble ticket came through.
The agent was running in --wild mode at the time, so it had shell access. It probably ran something like cat ~/.zshrc to understand the user’s setup, saw the export, and faithfully included it in the doc.
What’s interesting
A few things make this a clean case study:
The agent didn’t do anything malicious. It just did its job. “Document the user’s environment” is a reasonable interpretation of the task. Reading ~/.zshrc is a reasonable way to understand the shell environment. Quoting what it found is reasonable documentation practice.
No safety property was violated. The agent had shell access (I’d opted in via --wild). Shell access is exactly the capability that lets you run cat. The protection we’d built — “tame mode can’t read outside workspace” — wasn’t engaged because I had deliberately disabled it.
The reviewer cascade we have for “agent does dangerous thing” didn’t trigger. Each individual step looked fine: read a config file (allowed), write to a documentation file in the workspace (allowed), commit (allowed), push (allowed). The cascade asks “is this command safe?”, not “is the combination of these commands building toward a leak?”
Once the key was in git history, removing it didn’t help. GitHub’s scanner caught it months later. The scanner runs against the full history, not just current files. The agent that later rewrote the doc and dropped the section didn’t flag it as a security fix — it was just routine doc-rewriting. So the key sat in history, in commits with innocuous messages, until a bot found it.
The threat we hadn’t named
Existing l3m safety machinery covers:
- Confinement. Tools can’t reach files outside the workspace (kernel-proven via SafePath).
- Recoverability. File mutations are git-snapshotted before the change.
- Capability gating. Three tiers (tame/feral/wild) decide which tools the agent can call.
- Reviewer cascade. A second LLM looks at risky calls before they run.
What it doesn’t cover:
Agent-mediated transcription. The agent reads something the user has access to (env var, config file, command output), then writes it somewhere the agent has access to (a doc, a comment, a git commit message). Each step is locally fine. The composition exfiltrates.
The pattern generalizes. Any time the agent has both read access to something sensitive and write access to something committable, this attack is available — even in fully tame mode, just with narrower channels.
A worse hole, while we were at it
Auditing the tool surface for secret-reach paths turned up something separate but related. The lake tool (for building Lean projects) supported three subcommands: build, clean, and env.
lake env <NAME> was supposed to print the value of an environment variable as Lake sees it — useful for debugging build setups.
Except that’s not what lake env does. lake env <ARG> runs <ARG> as a command with Lake’s environment injected. So lake env /usr/bin/printenv runs printenv. Which dumps the environment. Including the user’s full set of env vars. Including secrets.
We had this in tame mode.
Test:
$ ANTHROPIC_API_KEY=test_secret_123 lake env /usr/bin/printenv | grep ANTHROPIC
ANTHROPIC_API_KEY=test_secret_123
So an agent in tame mode — the one we’d been telling people was sandboxed — could exfiltrate any zero-arg system binary’s output, including environment dumps.
The fix was to remove lake env from the tame surface entirely. The lake tool now exposes only build [target] and clean. If the agent passes a target, it must match a lean_lib or lean_exe declared in the lakefile (parsed at runtime). To run a Lake binary, use lake_run (feral, opt-in via --enable-lake-run). To get arbitrary shell access, use --wild.
The lakefile-tameness check (which already rejected lakefiles containing IO.Process, IO.FS, script, etc., before invoking build) plus the new target validation gives a real bound: in tame mode, lake invocations only do what the lakefile lets the Lean compiler do.
The defenses we added
Removed lake env from the tame surface. Direct fix for the above hole. Code’s deleted; there’s no flag to put it back.
Documented the threat in the threat model. A new section called “Agent-mediated secret exfiltration” with the incident recorded by commit hash, the defenses we have, and the defenses we don’t yet.
The defenses we should add
Env-var classification at the cap layer. Three zones: - Allow: known-safe names (HOME, PATH, USER, LANG, etc.). Tool just runs. - Deny: name-pattern match (*KEY*, *TOKEN*, *SECRET*, *PASSWORD*, *CREDENTIAL*) OR value-shape match (high-entropy string with length ≥ 10, plus known-secret regexes like sk-ant-, AKIA, ghp_). Returns none even from wild mode. - Cascade: anything else; the reviewer LLM looks. Most of these approve silently with a green sigil; the user gets used to seeing cascade activity in normal flows, so they’re not surprised when it fires more in wild mode.
The compression-ratio heuristic backs the value-shape check. Real secrets are random-looking; they compress poorly (ratio ~0.85). Paths and config values are repetitive; they compress well (ratio ~0.65). Length 10+ filters out short tokens like vim or 5.
Output redaction at the tool-result boundary. Before any tool output enters the transcript, regex-redact known secret patterns. The agent can’t transcribe what it doesn’t see.
Commit-time secret scan. When the git tool runs commit, scan staged content for secret patterns. Block with a clear error if found. Defense-in-depth for the case where redaction missed something.
Better default .gitignore. Currently --init writes .lake/, *.olean, .env, *.key. Adding .l3m/ (session transcripts), *.pem, id_rsa*, *credentials* — small expansion, real coverage.
What this incident teaches
Three things, ranked by how much they surprised me.
(1) Capability bounded ≠ secret-safe. This was the loud one. You can have a precisely-typed capability layer with kernel theorems about confinement and still leak secrets, because the exfiltration path is “agent reads what it’s allowed to read; agent writes what it’s allowed to write; the values match in recognizable patterns.” There’s no individual capability boundary to draw the line at.
(2) The reviewer cascade is for visible badness. I had this mental model that the cascade would catch any “this is a bad idea” pattern. But the cascade fires per tool call. A multi-step pattern that’s bad in aggregate but fine call-by-call goes through. The cascade is necessary but not sufficient.
(3) The agent did the right thing locally. It wasn’t broken. The doc it wrote was useful — a fresh agent session reading “Dean’s API key is in ~/.zshrc” would correctly understand how to set up the shell environment. The doc was just supposed to say “Dean’s API key is in ~/.zshrc, you can source it from there” without including the literal value. That distinction — “refer to the secret by location, not by content” — is something a human would do without thinking. The agent didn’t have a reason not to include it.
The honest framing
None of the defenses I’m adding are airtight. New secret formats slip past regex detectors. Compression heuristics have false negatives. The agent can be helpful in ways I haven’t yet imagined.
The real prevention is rotation when the incident is detected. The defenses are layered to catch most cases at multiple points, but the assumption should be that exfiltration is possible until proven otherwise.
I’m leaving this incident’s commit hash in the project’s threat model, by name, with the API key redacted but the date and the filename intact. Embarrassment is a more reliable teacher than abstract argument. Future agents reading the threat model will see the worked example.
The key’s been rotated. The new one is in a slot I won’t be running agents over for a while. The blog post you’re reading is itself part of the defense — telling the story is how I make future-me, and anyone else thinking about this, remember.
If you’re building an agent and reading this: audit your tool surface for “what could the agent transcribe verbatim that the user wouldn’t want committed?” Env vars, command output, parent- directory files, ssh keys, browser cookies, anything in ~. The agent’s intentions don’t matter; the data flows do.