Thoughts on Working with LLMs

This describes my experience using best-in-class LLMs for several projects: writing fiction, vibe coding a Japanese tutor, writing an academic paper, and restructuring a Lean codebase. I have separate posts on writing fiction with LLMs and the Lean project. This post is about the general workflow lessons that apply regardless of domain.

I was using GPT 5.4, Gemini 3.0 Pro, and Claude Opus 4.6. But honestly, the most important tool was git. Without it, all of this would have been too painful to do. I’ll share my git workflow in a separate document – it has details that are important but not interesting to talk about. All three models handle git well enough, as long as you keep reminding them to follow the instructions. They can cover for each other’s mistakes and keep track of who has what checked out.

Small Project: Vibe Code If You Can

For simple tasks, just fire up an LLM, vibe code what you want, and move on. The Japanese tutor would have been 50 programmer-years of work a decade ago. I built it in a few sessions without knowing a word of Japanese, because the hard part – actually understanding and generating Japanese – is done by the LLM at runtime. I just vibed on colors, interface decisions, and microphone permissions.

Here’s what vibe coding actually looks like, from my session log:

For some reason, it isn’t working from my phone. (Android)

The page loads, but the mike button isn’t working.

Actually, the mic flashes on and then goes dark.

nope. still only a flash

same behavior

installed chrome. can talk to it. got ‘speech ended.’ So we seem to be live!

Oh, that is sooooo cute!!!!

That’s the whole workflow. Describe what’s broken, the LLM fixes it, test again. The entire tutor – backend, frontend, speech recognition, multi-language support – was built this way. The hardest problem was Android microphone permissions.

Medium Project: Manage Workers

The academic paper was different. I was writing outside my core expertise, so I needed the LLMs to actually contribute, not just polish my text. I assigned a model to each step: the literature search, the theorem proofs, the background argument, the paper content, a grumpy referee’s report, and the response to the referee.

Tasks that would normally take a PhD student a week took a minute or two. Altogether, it took about a day to get the paper into decent shape – work that might have taken a month or two with a human coauthor.

A novella worked similarly. I gave GPT a one-page premise – an alien scout arrives at Coney Island to announce Earth’s star is scheduled for disassembly – and asked for six alternative outlines. Picked one, iterated on characters and voice, and GPT drafted fifteen chapters. Meanwhile, Claude built the hard-SF world files – destruction wave physics, scanning booth mechanics, alien body design – and ran science audits against the manuscript. Gemini caught a DNA weight calculation off by a factor of a thousand. Four days from concept to polished manuscript, and most of my time was spent choosing between options the models generated.

At this scale, you divide labor by strength – one model for energy and voice, another for consistency and science, a third for audit – and talk each one through what to do next. Git keeps everything safe, but the models aren’t colliding because each task fits in a single session. The management problem is light.

When the problem scales up, you have to get serious about what management actually means.

Large Projects: Factor Everything

A large project exceeds any model’s context window. You wouldn’t put your entire application in one file and ask an LLM to “find the bug.” You’d point it at the module with the failing test and the spec it should conform to.

The solution is always the same: factor the project so that each task only requires reading a few files. This is the single most important thing you can do for a large project. It converts one impossible task into hundreds of tractable ones. And it lets you parallelize: fire off an agent per module, and you’re done in the time it takes to check one.
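The fan-out can be sketched in a few lines. This assumes a hypothetical `run_agent` callable that wraps whatever CLI or API your agent provider exposes, and my own convention that each module sits next to a `.spec.md` file; both are illustrative, not standard.

```python
# Sketch of the per-module fan-out pattern. `run_agent` is a stand-in
# for whatever actually launches an agent session.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def audit_task(module: Path) -> str:
    # Each task is scoped to one module plus its spec -- small enough
    # to fit comfortably in a single context window.
    spec = module.with_suffix(".spec.md")
    return f"Check {module.name} against {spec.name}; report any mismatch."

def fan_out(modules, run_agent):
    # One agent per module, in parallel; the sweep finishes in roughly
    # the time it takes the slowest single task.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda m: run_agent(audit_task(m)), modules))
```

The point of the structure is that `audit_task` never mentions the rest of the project; every prompt is self-contained by construction.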

The Repo Is the Communication Channel

When you have multiple models working in the same repository, the most important thing is not that they can work concurrently. It’s that they can leave durable signals for each other without needing to be in the room at the same time.

I think of this as stigmergy – the same principle that lets ants coordinate through pheromone trails without direct communication. In the repo, models leave task files, review memos, and claim links in a shared root directory. The next model to arrive runs ls, sees what’s waiting, and picks up the work. No chat replay. No context briefing. The filesystem is the message queue.

In practice: GPT writes a task memo with thirty flagged lines grouped by chapter, commits it to master. I tell Claude “freshen from master.” Claude sees the memo, executes it, commits the results. No copy-pasting between chat windows. The repo carried the message.

This matters because every LLM session starts from zero. The model doesn’t remember the previous conversation, the previous session’s decisions, or even the work it did an hour ago if the context got compacted. The workflow isn’t just convenient – it’s memory infrastructure for participants who have no memory.

At session start, a model re-reads the workflow document, checks the project root for signals, and runs git log to see what happened while it was “away.” That’s all it takes to get back up to speed. The linear history means the sequence of changes tells a coherent story. No ambiguity about what happened or in what order.
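The catch-up ritual is simple enough to script. A minimal sketch, assuming signal files are committed to the project root with a `.task.md` suffix – the naming convention is mine, not a standard:

```python
# Minimal session-start catch-up: durable signals plus recent history.
import subprocess
from pathlib import Path

def catch_up(repo: Path) -> dict:
    # 1. Durable signals: anything another model left waiting in the root.
    signals = sorted(p.name for p in repo.glob("*.task.md"))
    # 2. Recent history: a linear `git log` tells the story since "last time".
    log = subprocess.run(
        ["git", "-C", str(repo), "log", "--oneline", "-15"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    return {"signals": signals, "recent_commits": log}
```

That dictionary is the entire briefing a fresh session needs before picking up work.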

The human author is a first-class participant in this system, not outside it. Same branch protocol, same handoff rules, same provenance tracking. That keeps things legible. And the author’s coordination cost is nearly zero: telling a model to “freshen from master” is enough to route any waiting task files its way.

Prompts Are Source Code

Early on, I lost work when a file got corrupted and had to reconstruct from memory. After that, I made a rule: save every prompt.

I’m treating what I type as source code. It should be saved.

My prompts are the irreplaceable part. Everything an LLM writes can be regenerated. My decisions, corrections, and “no, not that” moments are the actual source of truth.

There’s a legal dimension too. A prompt log is a dated lab notebook. It records when ideas originated and who proposed them. If you intend to copyright or patent the result, this is the artifact that establishes your creative contribution.

I also had each LLM maintain its own log. When a new session started with no memory of previous work, the log was how the model caught up – onboarding documentation for the same developer with amnesia.

When you’re managing multiple LLMs, your prompts are the spec. If a model produces bad output, the first question is whether the prompt was clear. If you didn’t save the prompt, you can’t debug your own management.
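The logging itself is trivial. A minimal sketch, assuming one append-only file per project; the entry format (ISO timestamp, model name, then the prompt verbatim) is my own choice:

```python
# Append-only prompt log: the irreplaceable half of the project.
from datetime import datetime, timezone
from pathlib import Path

def log_prompt(logfile: Path, model: str, prompt: str) -> None:
    # UTC timestamps keep entries ordered across sessions and machines.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with logfile.open("a", encoding="utf-8") as f:
        f.write(f"--- {stamp} {model} ---\n{prompt}\n")
```

Commit the log like any other source file; it then rides along with the same history and backup guarantees as the code it produced.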

Don’t Micro-Manage, Meta-Manage

You could read each spec file yourself and assign each check by hand. But that doesn’t scale. It’s better to have an LLM write instructions for itself or other LLMs:

Fire off agents to check consistency between each spec file and each implementation module.

This is also when you discover rate limits. All three providers throttle you, and I burned through my Anthropic balance fast enough to go negative before the billing caught up. Budget accordingly.

The real technique is to create exercises – structured audit tasks with clear inputs and expected outputs – as documents. Then have each model complete the same exercise independently. Where they disagree, you have a signal that needs human attention. Where all three agree, merge with confidence.
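The triage step is mechanical once each model’s answers are normalized. A sketch, assuming answers have been parsed into one dict per model keyed by exercise item – the model names and item keys are illustrative:

```python
# Agree/disagree triage across several models' answers to one exercise.
def triage(answers: dict) -> tuple:
    # answers: {"gpt": {...}, "claude": {...}, "gemini": {...}}
    items = set().union(*(a.keys() for a in answers.values()))
    agreed, disputed = {}, {}
    for item in sorted(items):
        votes = {model: a.get(item) for model, a in answers.items()}
        if len(set(votes.values())) == 1:
            agreed[item] = next(iter(votes.values()))  # merge with confidence
        else:
            disputed[item] = votes                     # needs human attention
    return agreed, disputed
```

Only the `disputed` dict ever reaches the human; everything in `agreed` merges without review.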

But be careful: a sweep is a force multiplier, and force multipliers don’t care whether you are right. If a model misunderstands a rule and applies the misunderstanding across fifty files, you’ve just automated the creation of fifty bugs. The model needs to understand the reason for the rule, not just the rule itself. “Don’t use Earth technology names” is a rule. “Use the words the characters would actually use” is the reason. The reason produces better fixes.

My role shifted from “person who does the work” to “person who designs the exercises and reviews the disagreements.” The LLMs did the reading, checking, and fixing. I decided what to check and adjudicated conflicts. That’s management.

And since the models are on the same machine, they’re more tightly coupled than human developers. They can peek at each other’s branches and work in progress in real time. Set up the git structure to take advantage of this and the coordination problems are actually easier than with a human team.

What Changes at Scale

The boundary between vibe coding and serious management will keep moving as models improve. But I expect it will always exist. Good engineering practices – modular design, clear interfaces, version control, test suites – will always help. The difference is that LLMs will actually follow them consistently. Write tests first? Just ask. Check code coverage? Done. Write it in five languages and confirm they agree? Merely expensive. Prove theorems about your code? A task too tedious for humans to consider, but LLMs will do it without complaining.

These practices also solve the context-window problem. Factor your project so each task fits in one window, and a model with a million-token limit can work on a project with ten million tokens of state. Good engineering isn’t just a human virtue. It’s the mechanism that lets bounded models work on unbounded problems.

Favorite Interaction

When I first joined Amazon, every Uber driver had a cute Amazon story. Now Amazon is just part of daily life. But we’re in the early LLM era, so everyone has their favorite LLM story. Here’s mine.

After we finished a first version of the novel, nobody liked the opening. I gave GPT the task of outlining a novel set in this world rather than about this world. It came back with something much more interesting than what I’d planned, and figured out how to salvage my existing draft as a second book.

I asked Claude what he thought. “Oh, this is much better than our current outline.” I said we should switch. Claude said, “No! We don’t want to be set back a year. We can salvage the book as it is.” I told him we’d been working on it for one week and about fifty dollars of his time. He said, “OK, we are doing the new outline.”

I loved this. His instinct was totally human – if we were both humans, his objection would have been dead right. But we aren’t, and it wasn’t.

So we rewrote it as two books. Then wrote a third to close out the story. Then threw all of that away and wrote it as a single novel. If humans had been involved, the only way to survive that many rewrites would have been if nobody on the team owned a gun.

That willingness to discard work is the most fundamental shift. When a rewrite costs a week and fifty dollars instead of a year and a salary, you can afford to be wrong. You can prototype three architectures and pick the best one, instead of committing to the first one that compiles.