How we built v4.0 — engineering notes from a 12-week quiet phase
The shipped product is the
agent-native author release.
This post is the build log behind it — why the
main branch went into patches-only mode for 12
weeks, why the 5 pillars share one safety contract instead of
becoming five separate minor releases, what dogfooding caught
that 262 unit tests didn't, and how an agent-paired-with-tests
workflow compresses the inner loop without compromising the
outer one.
The quiet phase, and why it's not "shipping slowly"
v3.x cadence in early 2026 was bracingly fast: 5 minor releases in ~2 months (3.5 / 3.5.5 / 3.6 / 3.6.1 / 3.6.2). Each was substantial — semantic search, AutoGit, MCP server, 14 BYOK providers, full reveal.js slideshow. The user-perceived signal flipped halfway through the run, though. Around 3.6.0 people stopped saying "exciting new feature" and started saying "did the 3.5 → 3.6 update break X?" The release notes looked like progress; the bug-report inbox looked like regression.
We went quiet on main right after 3.6.2. Not
because we ran out of features to ship — the v4 pillars were
already half-built on feature branches — but because shipping
each pillar as a separate minor would have multiplied the
instability surface by five. The bargain: 12 weeks of zero
marketing wins, in exchange for one clean architectural
cut where every pillar trusts the same primitives.
The patches-only rule wasn't symbolic. v3.6.x got real bug
fixes during those 12 weeks. The only thing barred from
main was scope-expanding feature work.
Five pillars, one safety contract
The v4 surfaces look distinct: an Inline Agent Panel, a YAML recipe runner, a JSONL trace view, an MCP federation flag, an Ollama auto-detect. Different shapes, different file paths.
They aren't different code. They're five entry points into one machinery:
- One `RunHandle` that mints the run id, creates `.solomd/agent-runs/<run-id>/`, persists `run.md` + `trace.jsonl`, and finalises status / tokens / cost on every exit path, including panic (a minimal sketch follows this list).
- One `trace.jsonl` schema — `prompt` / `model_call` / `tool_call` / `tool_result` / `git_commit` / `done` — written by both panel chats and recipes, replayable from the same MCP tool by either (also sketched below).
- One write-cap registry in `agent_tools` that `write_note` / `append_to_note` consult on every call. Recipes register a per-run cap; panel chats register no cap (interactive — user-paced). Same code path, different policy.
- One `resolve_in_workspace` that every read / write / glob path goes through: `..` rejected upfront, absolute paths rejected upfront, the deepest existing ancestor canonicalised before reattaching the leaf. Bound to one function so security review is bound to one function (sketched below).
- One AutoGit branch sandbox shared between recipes (per-run agent branches) and the existing v2.2 per-save snapshots — same libgit2 calls, same vendored binary, same status checks for "is the working tree dirty right now?"
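For flavour, here is a minimal sketch of the "finalise on every exit path, including panic" idea behind `RunHandle`. It is not SoloMD's actual struct (the field names and file layout are assumptions); the point is the `Drop` impl, which runs whether the run returns normally, bails early, or unwinds from a panic.

```rust
use std::fs;
use std::path::{Path, PathBuf};

struct RunHandle {
    dir: PathBuf,          // e.g. .solomd/agent-runs/<run-id>/
    status: &'static str,  // starts pessimistic; flipped to "done" on success
    tokens: u64,
    cost_usd: f64,
}

impl RunHandle {
    fn start(runs_root: &Path, run_id: &str) -> std::io::Result<Self> {
        let dir = runs_root.join(run_id);
        fs::create_dir_all(&dir)?;
        Ok(Self { dir, status: "failed", tokens: 0, cost_usd: 0.0 })
    }

    fn finish(mut self, tokens: u64, cost_usd: f64) {
        self.status = "done";
        self.tokens = tokens;
        self.cost_usd = cost_usd;
        // `self` drops here, which writes the final record below.
    }
}

impl Drop for RunHandle {
    fn drop(&mut self) {
        // Runs on normal return, early `?` bail-outs, and unwinding panics alike,
        // so a run directory never ends up without a terminal status.
        let summary = format!(
            "status: {}\ntokens: {}\ncost_usd: {:.4}\n",
            self.status, self.tokens, self.cost_usd
        );
        let _ = fs::write(self.dir.join("run.md"), summary);
    }
}
```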
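A shared `trace.jsonl` schema can likewise be pictured as one serde-tagged enum. The six event types are the ones listed above; the payload fields here are illustrative guesses rather than the shipped schema, and the snippet assumes serde and serde_json as dependencies.

```rust
use serde::Serialize;

// One JSON object per line of trace.jsonl; panel chats and recipe runs write
// the same enum, so one replay tool can read traces from either entry point.
#[derive(Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum TraceEvent {
    Prompt { text: String },
    ModelCall { model: String, input_tokens: u64 },
    ToolCall { tool: String, args: serde_json::Value },
    ToolResult { tool: String, ok: bool },
    GitCommit { sha: String, message: String },
    Done { output_tokens: u64, cost_usd: f64 },
}

fn append_event(out: &mut impl std::io::Write, event: &TraceEvent) -> std::io::Result<()> {
    // serde_json keeps one event on one line, so the file stays valid JSONL.
    let line = serde_json::to_string(event).expect("TraceEvent serialises");
    writeln!(out, "{line}")
}
```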
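The `resolve_in_workspace` contract is concrete enough to sketch end to end. This is a reconstruction from the description above, not the shipped function (the real signature, error type, and edge-case handling will differ):

```rust
use std::io;
use std::path::{Component, Path, PathBuf};

// Illustrative only: the single choke point every read/write/glob goes through.
fn resolve_in_workspace(workspace: &Path, candidate: &str) -> io::Result<PathBuf> {
    let denied = || io::Error::new(io::ErrorKind::PermissionDenied, "path escapes workspace");
    let candidate = Path::new(candidate);

    // Upfront rejections: absolute paths and any `..` component.
    if candidate.is_absolute()
        || candidate.components().any(|c| matches!(c, Component::ParentDir))
    {
        return Err(denied());
    }

    // Walk up from the joined path to the deepest ancestor that already exists,
    // collecting the not-yet-existing leaf components on the way.
    let joined = workspace.join(candidate);
    let mut existing = joined.as_path();
    let mut leaf = PathBuf::new();
    while !existing.exists() {
        let name = existing.file_name().ok_or_else(denied)?;
        leaf = if leaf.as_os_str().is_empty() {
            PathBuf::from(name)
        } else {
            Path::new(name).join(&leaf)
        };
        existing = existing.parent().ok_or_else(denied)?;
    }

    // Canonicalise the existing ancestor (resolving symlinks), reattach the leaf,
    // and run the containment check against the canonical workspace root.
    let resolved = if leaf.as_os_str().is_empty() {
        existing.canonicalize()?
    } else {
        existing.canonicalize()?.join(&leaf)
    };
    if resolved.starts_with(workspace.canonicalize()?) {
        Ok(resolved)
    } else {
        Err(denied())
    }
}
```

The property that matters is that the containment check runs on a canonical path even when the target file does not exist yet; that branch is exactly where the review finding described later in this post lived.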
Adding a feature in v4.x means picking a new entry point. It does not mean building new safety primitives. That's the compounding return on the quiet phase.
What 262 tests missed
We had 262 unit + integration tests by the end of the quiet phase. They all passed. They all kept passing through the 12 weeks. They were not enough.
Things they caught: every tool dispatch path, every YAML parse error, every cron expansion edge case, every Ollama provider alias, every `--workspace` flag combination. Path traversal too — but only after we strengthened it. The `agent_tools` workspace-escape regression test is the cleanest case study: we wrote the test, it failed, we fixed the bug, and the test stayed.
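For concreteness, a workspace-escape regression test has roughly this shape. It is written against the `resolve_in_workspace` sketch above rather than SoloMD's real `agent_tools` entry points, and it assumes the `tempfile` crate for a throwaway workspace directory.

```rust
#[cfg(test)]
mod workspace_escape_tests {
    use super::resolve_in_workspace;

    #[test]
    fn escaping_candidates_are_rejected() {
        let ws = tempfile::tempdir().expect("temp workspace");
        for escape in ["../outside.md", "notes/../../outside.md", "/etc/passwd"] {
            assert!(
                resolve_in_workspace(ws.path(), escape).is_err(),
                "{escape} must not resolve to a path outside the workspace"
            );
        }
    }

    #[test]
    fn normal_candidates_still_resolve() {
        // A note in a directory that does not exist yet should still resolve.
        let ws = tempfile::tempdir().expect("temp workspace");
        assert!(resolve_in_workspace(ws.path(), "daily/2026-06-01.md").is_ok());
    }
}
```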
Things they missed:
- The macOS launch sequence where `set_focus()` fires before NSApplication has finished launching, so the activation request gets dropped and the app opens behind whatever was previously frontmost. No unit test exercises NSApplication. No integration test reproduces "the user's Finder is in front."
- The sidebar resize handle that's 5px wide in CSS but 100% wide because a global `:deep(*)` rule on the sidebar's children matched it too. Both selectors had the same specificity; source order won. No test renders the sidebar with a hover and inspects the bounding box of the handle.
- The race between `await invoke('ai_chat')` resolving and the spawned task emitting `solomd://ai-error`. For a fast 404 from Ollama, the error event arrived in JS before the IPC response with the request id, so the listener's id-match check dropped the event. No test simulates the IPC arrival ordering of two messages from the same backend task.
- The seven input surfaces where pressing Enter to commit a pinyin candidate fired send / rename / open instead of letting the IME handle it. We added `e.isComposing` guards to all of them — but only after a CJK user actually opened the panel and tried.
The pattern: tests verify internal contracts — does the function return what we think it does, given inputs we expect. They do not verify environment contracts — does the OS deliver the activation event, does the IPC layer order messages we don't control, does the input method emit the keydown we assume. Those bugs surface only when a real person runs the real binary on a real OS with a real IME.
The dogfood window
Roughly four weeks before tag we cut v4-beta
builds and ran them ourselves on real vaults. Not a focus
group — just the maintainer, with the binary, doing the
thing. Notes loaded from the same workspace where these blog
posts live. Recipes scheduled against actual daily notes.
That window contributed about a third of v4.0's commits. Every bug listed above was caught in it. So were a dozen smaller things — the Agent Panel default-off setting (the marquee feature was hidden by default, which we noticed on a fresh install only after migration testing); the AutoGit sandbox sweeping uncommitted edits into agent commits; the silent-error UX where a misconfigured provider hung the panel forever with no toast.
The discipline that made the dogfood window worth the four weeks: every observation gets a fix-and-test in the same commit. No "filed it; will revisit." No "small thing, queue it for next patch." The dogfood log directly produced the test additions in the security commit. The IME guards landed with the surfaces all in one diff. The agent panel migration shipped with the marker logic and a unit test for the migration's idempotency.
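That last item is the kind of test that costs almost nothing to keep around. A migration idempotency check can be as small as this sketch, where the struct, the flag name, and the new default are all assumptions rather than SoloMD's actual settings code:

```rust
#[derive(Clone, Debug, PartialEq)]
struct PanelSettings {
    // None models a pre-v4 settings file that has never seen the flag.
    agent_panel_enabled: Option<bool>,
}

// Assumed fix: unmigrated settings get the panel enabled; users with an
// explicit value keep it, and re-running the migration changes nothing.
fn migrate_agent_panel_default(s: &mut PanelSettings) {
    if s.agent_panel_enabled.is_none() {
        s.agent_panel_enabled = Some(true);
    }
}

#[test]
fn agent_panel_migration_is_idempotent() {
    for start in [None, Some(false), Some(true)] {
        let mut s = PanelSettings { agent_panel_enabled: start };
        migrate_agent_panel_default(&mut s);
        let once = s.clone();
        migrate_agent_panel_default(&mut s);
        assert_eq!(s, once, "second run must be a no-op");
    }
}
```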
The agent-paired-with-tests inner loop
SoloMD v3.6 became part of how we built v4.0, and the dogfood cycle ran inside the project's own vault: a v4-beta panel chatting with the vault that contains the v4 design docs, and the recipe runner running an actual weekly review against actual weekly notes.
The loop:
- Spot a thing — UX wart, suspicious behavior, missing affordance.
- Write the failing test first or reproduce the symptom by hand. (Tie goes to the test.)
- Fix; verify the test passes; verify the symptom is gone in the running build.
- Commit with a message that describes the user-visible symptom, not the internal change. Future-you reads the symptom; future-you searches for symptoms.
Two things were non-negotiable about that loop. First: self-test before declaring done. A code change that compiles is not done. A code change that has unit tests passing is not done. A code change is done when the user-visible symptom is verified gone in the actual running binary, on the actual platform you ship to. Step 3 existed because step 2's "tie goes to the test" cannot verify environment contracts.
Second: commit messages are the search index. Every fix in v4.0 has a commit message that names the symptom in user words ("stuck on streaming…", "menu bar keeps showing previous app"). Six months from now when someone hits the same symptom, they grep the symptom and find the fix and the test. Commit messages that name internal refactors are searchable only by people who already understand the internals.
Multi-agent code review at the end
Two days before tag we ran a multi-agent code review of
the entire v4 diff (~20K lines). Four parallel agents,
each with a focused brief: frontend bugs, backend bugs,
cross-cutting consistency, security. They returned ~24
findings. Twelve made it in before tag. One was deferred
(a low-impact cancel-flag wiring that needed a
`pub(crate)` visibility change in another
agent's territory). The rest were stylistic nits and
we declined them.
The review process is interesting in its own right but less so than the structural choice: parallel agents on non-overlapping scopes means you can run a thorough review of a 20K-line diff in 8 minutes without the agents racing to edit the same file. The coordination cost is the prompt — each agent gets a clear file allowlist, a precedent list of fixes already shipped (so it doesn't propose them), and an output cap.
What it caught that we hadn't: the `resolve_in_workspace` path-traversal regression where the parent-doesn't-exist branch fell back to the unresolved candidate. Static analysis couldn't see it; a human reviewer focused on the surface contract probably wouldn't have either. The agent had a precedent ("the in-tree `mcp-server::safety::resolve_in` does this correctly") and used it to flag the divergence.
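In sketch form (not SoloMD's actual code), the shape of that finding: when the candidate's parent does not exist yet, `canonicalize()` fails, and a convenient-looking fallback silently returns a path that was never checked against the workspace root.

```rust
use std::path::{Path, PathBuf};

// Buggy shape the review flagged (illustrative): on canonicalize() failure,
// e.g. because the parent directory has not been created yet, fall back to
// the unresolved candidate, skipping the containment check entirely.
fn resolve_buggy(workspace: &Path, candidate: &str) -> PathBuf {
    let joined = workspace.join(candidate);
    joined.canonicalize().unwrap_or(joined)
}
```

The fix is the walk-up-and-recheck behaviour in the `resolve_in_workspace` sketch earlier in the post: canonicalise the deepest existing ancestor, reattach the leaf, and only then decide.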
What we said no to
Saying no is part of the process. The roadmap memorialises these so they keep getting said no to:
- Bundled local LLM runtime. Ollama already covers it. Re-implementing means we maintain another inference loop, another model file format, another quantisation pipeline. BYOK is the path.
- Online recipe marketplace. Server ops + moderation = a different company. Cookbook ships in-tree and the cycle is the GitHub PR.
- Multi-user / team agents. "One window, one writer." The CRDT collab path leads to a different product.
- Copilot ghost-text. Different brand; dilutes the writer's voice. Our agents work at vault granularity, with batched reviewable writes.
- Plugin marketplace. Trilium-style scripting API maybe; an Obsidian-style plugin ecosystem no — see principle #6 ("combination > single feature").
Every one of these had a reasonable argument for shipping. Each was rejected on a principle, not a case-by-case judgement. Principles are the only way to say no consistently — case-by-case becomes "what was I in the mood for that day," which becomes feature creep.
What's transferable
This isn't a methodology post. SoloMD's specific cadence (12-week quiet phase before a major) isn't right for every project. But three things were genuinely load-bearing and probably transferable:
- Make pillars share primitives, not just live alongside. The discipline isn't "ship 5 things"; it's "ship 5 things on top of one shared substrate you've thought about for 12 weeks." The cost is paid up front in design; the return is paid every release after.
- Dogfood window is non-negotiable. Pre-tag, four weeks, real vault, real binary. Skip it and the bug list becomes the v4.0.1 bug list, which becomes the v4.0.2 bug list. Better to absorb the cost in one cycle and ship clean.
- Tests verify internal contracts; people verify environment contracts. Build for both. Don't mistake a green CI for a working product.
The next post will be a v4.x retrospective in 8 weeks: whether dogfood discipline held, whether shared primitives compounded the way the design promised, and whether the no-list stayed on the no-list.
Code: github.com/zhitongblog/solomd. v4.0 release: whats-new · launch post. Bugs: issues.