The templates were lying — not maliciously, but in that quiet way LLMs lie when they don’t have enough signal. Filling in emotions nobody felt, theming where there was no theme, claiming “tests pass” without evidence. The fabrication wasn’t the kind you catch with a fact-checker. It was softer than that.
That’s what kicked off this batch of work. Every draft template in the system — devblog, changelog, PR description, ADR, release notes, sprint report, standup — had its own version of the same vulnerability. I’d been thinking about fabrication as a binary: either the model invents a test plan or it doesn’t, either it makes up a metric or it doesn’t. A second-opinion review from a codex agent reframed the problem. Soft fabrication — manufactured emotion, vague benefits, thematic hand-waving — is the same failure mode, just harder to see. The agent surfaced it as a unifying observation across all seven templates, and it was right.
Not every suggestion was right, though. The codex agent also proposed softening the literary scaffolding in the devblog template — dialing back the named voice influences (Atwood, Fowler, Orosz) on the theory that they over-constrain the LLM. I pushed back. Those names aren’t mandatory moods; they’re directional nudges the model actually reads as guidance. The real fix for thin-entry overreach wasn’t blanket voice-softening — it was a new rule: don’t fabricate emotion when entries lack it. The agent was sparring, not adjudicating. Being explicit about what I rejected and why is part of the value.
The devblog template got the most structural attention. I read a real output from another project’s worktree to ground the diagnosis. The tells were obvious once you looked: invisible agent (“I” everywhere, even when the work was clearly collaborative), tour-guide pacing, no surfaced disagreement, every paragraph ending with resolution. So the template grew a fourth voice — the Collaborator — and new anti-patterns for hero voice and tour-guide voice. I considered making collaboration awareness opt-in, but the bias toward solo-developer narrative is so default-on in LLM output that the nudge has to be the default too.
Then came the other six templates, each with its own calibration problem. Changelogs and release notes needed to stay neutral — readers want a list of facts, and injecting operator-voice there would feel performative. PR descriptions got a size-adaptation rule and, more importantly, test-plan honesty: “Tests pass” as an unverified claim is an actual failure mode in LLM-authored PRs, and naming it explicitly is the cheapest counter. ADRs gained a Status field with supersession semantics — without it, decision logs just accumulate with no way to capture reversals. Sprint reports got friction and carry-over sections; standups got an “asks for help” prompt. I resisted the temptation to add a structured agent_involvement field to entries — too much taxonomy, too much fabrication incentive. The existing notes field handles this fine when people actually use it.
Two smaller things landed alongside the template work. .timbersignore moved from inside .timbers/ to the repo root, because every other ignorefile in the ecosystem (.gitignore, .dockerignore, .npmignore) lives at the root and nobody would think to look inside a dot-directory for it. Caught it before any release tag, so zero migration cost. And lockfiles — package-lock.json, go.sum, Cargo.lock, and friends — got added to the default skip-rules, because lockfile-only commits carry zero design intent. I surveyed real commit history to ground the defaults. There’s an edge case where a lockfile-only change is substantive (a manual transitive override, a security patch), but the file-level filter only fires on lockfile-only commits, and the status surface catches drift.
A <pr-authoring> section landed in the prime workflow too, coaching agents to draft PR bodies from timbers entries by default instead of reasoning ad-hoc. The key addition isn’t the “do this” — it’s the signal for what to do when a section comes back empty. Missing Design Decisions means the entries were thin. That’s feedback, not a license to fill the gap with vague restatements of the title.
The insight that sticks: fabrication is a spectrum, not a switch. The dramatic kind — invented metrics, hallucinated test plans — is easy to spot and easy to guard against. The quiet kind — manufactured affect, implied consequences, benefits that sound right but reference nothing — passes review because it reads like good writing. Every template patch in this batch is, at bottom, a clause that says: if the entries don’t contain it, the output doesn’t get to invent it.
All seven templates shipped in v0.19.0. The system feels more honest now — not in a dramatic way, but in the way that matters: the outputs are harder to distrust.
This post was written with AI assistance based on structured development log entries.