Writing Fiction with LLMs

This is a companion to Thoughts on Working with LLMs. That post covers the general workflow lessons. This one is about what I learned writing twelve short stories, two novellas, and a novel-scale rewrite with GPT, Gemini, and Claude.

Each Model Has a Personality. Use That.

When you’re working with multiple models on the same project, you quickly discover they aren’t interchangeable.

GPT has energy. It rewrote an opening chapter with punch and structural tightness. But it broke the world rules – it had characters writing letters in a world where writing doesn’t exist. Energy without discipline.

Claude maintains consistency. I trusted him with the world files and long-running continuity. When GPT’s rewrite arrived, Claude identified exactly which lines violated the rules while acknowledging the structural improvement. (I call Claude “him,” Gemini “her,” and GPT “it” – it helps keep my pronouns flexible for when people start truly personifying them.)

Gemini does audits. When Claude and GPT disagreed about how aggressively to restructure the opening, Gemini arbitrated with a detailed memo taking the best of both positions. Separately, she ran a hard-SF science audit across the world files, caught a thermodynamics error in the terraforming backstory, and verified the orbital mechanics.

I landed on a principle:

World-building is a democracy. Outlining is a competition. Drafting is a dictatorship. Editing is a parliament.

The key insight is that different models fail differently. Run the same audit with all three and they find different bugs. The union of their findings beats any individual run.

Short Stories: Just Have Fun

For short stories, you can use the small-project approach. Fire off questions, have the LLMs edit, repeat. You can read the whole story in one sitting and so can they. Continuity, structure, consistency of tone – none of these are problems at this scale.

I wrote twelve stories this way. The process was pure enjoyment: you create a story that you personally enjoy reading.

A Novella: The Middle Ground

For a novella set in the present day, the heavy machinery described below wasn’t needed. I wrote one – an alien scout arrives at Coney Island ahead of a stellar disassembly wave – using GPT for all the chapter drafting and Claude for science and continuity.

Present-day Earth gives you all the world-building for free. No invented culture to police, no naming conventions to enforce, no multi-chapter arguments about whether ceramic exists. A single alien narrator meant one voice to maintain, not three models competing over tone.

It still wasn’t a vibe code. GPT generated six alternative novel outlines and I picked the best one. We iterated on characters and voice rules before any chapters were drafted. Claude built the hard-SF world files and ran science audits. But each task fit in a single session, and nobody’s work collided.

The novel is where the full apparatus set in.

The Cold Reader

The most useful technique I discovered was the cold reader. Spin up a fresh instance with no project context and ask it to read your work cold.

I did this constantly. “Fire off a new Claude and see if it can reverse-engineer the world. If so, we’ve been too preachy.” “Do a cold read of the first few chapters for hook.”

Cold readers told me the novel felt “claustrophobic, like a play with a stage” because we had no ranging shots. One rated the world 4/10 for naturalness and called the animals “emotional support drones with fur.” A science reader caught that we had tidal tables varying by day on a planet with no moon.

The power of the cold reader is that it doesn’t know what you intended – only what’s on the page. Your regular collaborator has been part of every design decision and reads your intentions into the text. The cold reader won’t.

But a cold read is not “read it again.” I learned to structure them: parallel agents by chapter range, each looking for a specific failure category – world-rule violations, continuity errors, prose tics, register breaks, name collisions. The results need triage. A cold read produces dozens of flags, many of which are false positives (something already handled by an earlier fix) or minor. The real catches are the ones every prior pass missed. Without triage, you just get a long list of complaints.

A cold read of 100k words costs a few dollars and tells you things your collaborator never will.

Simulated Editors Need a Steering Wheel

Late in the novel revision I tried a different kind of cold reader: a simulated version of a tough editorial reader. Not “find continuity errors.” Read the editorial letter first, read the manuscript, and tell me what that editor would still hate.

The most useful editorial pressure came from Manny Frishberg, a speculative-fiction writer and editor whose critique pushed hardest on voice, human unevenness, and the places where the prose sounded too machine-smooth. After that, I used his comments as a calibration target for cold-reader passes: would this still bother a tough human editor?

Three different models converged on the same problem: the prose was competent, but too many paragraphs ended with the same clever turn. Jokes landed in final position. Characters with different backgrounds started sharing a house voice. The narration kept stepping back to explain what the scene had already done.

That was useful, but only after triage. If every flagged joke gets cut, the book becomes clean and dead. The better rule was: keep humor that only works in this world, in this scene, from this character’s angle. Move comic phrasing out of the last sentence when it starts sounding like a rimshot. A detachable joke is usually wrong; a joke that depends on Arden’s biology, theology, furniture, food, or social machinery has earned its rent.

The same pass helped with point of view. We deliberately over-separated the characters first – Carol counting, Sarah structuring, Amy scanning exits, Mark performing, Tom handling materials, Lisa preserving context. Then we toned down the places where the contrast sounded mechanical. Voice did not come from giving everyone catchphrases. It came from letting each character notice different evidence under pressure.

Factor a Novel Like a Codebase

A novel with 500k words of backstory exceeds any model’s context window. The solution is the same as factoring a codebase: break it into pieces so each task only requires reading a few files.

I split the text into chapters and the world rules into separate files – physics, biology, culture, naming conventions. Then “check the novel for physics errors” becomes dozens of small tasks: read the physics file, read one chapter, find where they disagree.

This factoring made one thing spectacularly useful: cascading consistency checks. One design rule – the world has no writing system – created over fifty edits across twenty-three chapters. “Clipboard” became “recording cord.” “Letter” became “knotted cord by courier.” “Publications” became “faculty lectures.” An LLM reads the rule file, reads one chapter, flags every violation. A second model verifies the fixes don’t introduce new problems. A third does a cold read to check naturalness.

The novel had rules about physics (no seasons because no axial tilt), biology (designed ecosystem), culture (no writing, specific funeral customs), and naming (places get 2-3 syllables, people get 1-2). Each rule spawned its own audit pass. A human editor would miss half of these. Three LLMs in sequence missed almost none.

Sweeps Amplify Errors Too

An early audit flagged “chalk” as an Earth-technology term that didn’t belong in the novel’s world. The model dutifully replaced it with “pigment stick” across four chapters. All board contexts. All wrong.

The correction was immediate: “Artists don’t care where their colors come from. ‘Pigment stick’ sounds like a museum label. On boards, it should be chalk. On paper, charcoal.” Fixing it required touching seventeen chapters in the other direction.

The lesson: a model doing a multi-file sweep needs to understand the reason for the rule, not just the rule. “No Earth technology” is a rule. “Use the words the characters would actually use” is the reason. The reason produces better fixes.

This was a recurring pattern. “Pigment stick” was technically correct. “Chalk” was in-world. “Berry-pigment” was technically sourced. “Red” was what an artist would say. A sweep can enforce consistency with world files. It cannot enforce voice. That remained my job.

The Annotation Protocol

When a sweep finds a violation in a voice-heavy paragraph, the model has two bad options: rewrite the paragraph and ruin the voice, or skip it and lose the flag. We found a third option: annotate without rewriting.

Models injected Markdown footnotes flagging the issue and wrote explanatory memos. This decoupled identification from execution: one model swept for violations, another – or the author – handled the voice-sensitive repair.

The footnotes tracked status – CONFIRMED, UNCONFIRMED, or CONTRADICTED against the world files. A cold-read agent could check every footnote against the current world state and flag drift. UNCONFIRMED footnotes prevented the slow creep where a plausible invention gets treated as canon because nobody remembers who introduced it.

The footnotes stripped from the reader build. The audit trail lived in the source without cluttering the text.

Workspace Collisions

With three models working through git branches, the management problems were remarkably familiar.

GPT wrote a task file into Claude’s workspace directory – the equivalent of one developer pushing to another’s feature branch. Two models independently tightened the same chapters and created merge conflicts.

The solution: clear ownership. Each model gets its own workspace (.workspace/claude/, .workspace/gpt/, .workspace/gemini/). Shared files require explicit claiming before editing. Code review happens by reading from another model’s branch, not by checking out their workspace. Keep commits small and single-purpose so cherry-picks stay clean.

The workspace privacy rule sounds like politeness. It’s actually structural. If one model could edit another’s files, provenance becomes ambiguous. If GPT could edit Claude’s log, Claude’s externalized memory becomes unreliable. The boundary is what makes each workspace trustworthy.

One difference from human teams: the models are more tightly coupled. Human developers don’t usually know what their colleagues have checked out. LLMs on the same machine can peek at each other’s branches in real time. Set up the git structure to take advantage of this and the coordination problems are actually easier than with humans.