From Vibe Coding to Spec-Driven Development | Tom Kennes

From Vibe Coding to Spec-Driven Development

TL;DR
Remember that vibe-coded macOS app? It grew to 20,000 lines of Swift, 61 releases, and one very persistent human. Vibe coding stopped working around feature #5, so I switched to spec-driven development: a constitution with core principles, plus numbered specs per feature. The AI finally had something to follow instead of guess.
 


In late January, I wrote about vibe coding a macOS app, a little side-panel that polls Azure DevOps for pipeline statuses and open PRs. The whole thing was built by babysitting Claude while doing other work, and it turned out to be a nice proof-of-concept for AI-assisted development.

What I did not expect was that I’d still be working on it two months later. What started as a simple ADO dashboard grew into something with an RSS reader, a full embedded terminal, live meeting transcription, Obsidian integration, and an AI-powered meeting summariser. At the time of writing: 20,000 lines of Swift across 50 files, 61 releases. Still just one person and an AI assistant.

In this post I want to share what happened between that first vibe-coded prototype and a tool I now actually rely on daily. I ran into a debugging problem that required 13 steps to solve, I discovered a methodology that made AI-assisted coding a lot more predictable, and I learned where the limits still are.

 

The Problem with Vibe Coding

In the first post, I concluded that “good software development practices turned out to be key” (SOLID, Agile, DRY). I still believe that. But as the codebase grew, the vibe-coding approach started to creak.

Adding the RSS reader was the turning point. It’s not a complicated feature: subscribe to feeds, parse XML, show articles. But when I asked the AI to build it, the result was fine-ish. Some things worked, others didn’t, and debugging the gaps was surprisingly painful because the code didn’t follow any particular architecture. Each prompt produced something slightly different, and the AI had no memory of why previous decisions were made.

I wrote about good abstractions being critical for AI-assisted development, and this was the moment where that became very concrete for me. The AI was struggling not because the task was hard, but because there was no shared structure to work within.

That’s when I found Spec Kit.

 

Spec Kit: Structure for AI-Assisted Development

Spec Kit is GitHub’s open-source toolkit for what they call Spec-Driven Development. The idea is straightforward: instead of prompting an AI with “build me an RSS reader,” you first write a specification. A spec consists of user stories with priorities, acceptance scenarios in Given/When/Then format, a technical plan (file structure, data models, API design), and a task breakdown referencing specific files.

The tooling adds slash commands to your AI assistant (/speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement) that guide this process. There’s also a “constitution” file where you define your project’s principles: we use SwiftUI, services are singletons with @Published properties, secrets go in the Keychain.

I was skeptical at first. It felt like over-engineering for a side project. But the results were pretty immediate!

When I re-did the RSS reader through a proper spec (six user stories, each with acceptance criteria, and a plan specifying XMLParser with a delegate pattern and a FeedStore model), the AI produced 2,100 lines across 5 files that just worked. Not mostly-worked, but actually worked. The Given/When/Then scenarios effectively became test cases the AI could verify against.
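The link between scenarios and tests is mechanical enough to sketch. Here’s a toy Python illustration; the spec format below is invented for the example (it is not Spec Kit’s actual file layout), but it shows how Given/When/Then blocks decompose directly into setup, action, and assertion:

```python
import re

SPEC = """
Scenario: Subscribe to a feed
  Given an empty feed list
  When the user adds https://example.com/rss.xml
  Then the feed list contains 1 entry

Scenario: Remove a feed
  Given a feed list with 1 entry
  When the user removes that feed
  Then the feed list is empty
"""

def parse_scenarios(text):
    """Split a spec into (name, given, when, then) tuples."""
    scenarios = []
    for block in re.split(r"\n(?=Scenario:)", text.strip()):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        name = lines[0].removeprefix("Scenario:").strip()
        # Map each step keyword (given/when/then) to its text.
        steps = {l.split(" ", 1)[0].lower(): l.split(" ", 1)[1] for l in lines[1:]}
        scenarios.append((name, steps["given"], steps["when"], steps["then"]))
    return scenarios

for name, given, when, then in parse_scenarios(SPEC):
    print(f"test: {name!r} | setup: {given} | action: {when} | assert: {then}")
```

Each parsed scenario maps onto a test skeleton: setup from Given, action from When, assertion from Then. That is the sense in which a well-written spec doubles as a verification checklist.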

And it scaled. When I added an embedded terminal (spec 008), the plan referenced the constitution. When I added terminal search (spec 013), it built on the terminal spec. Each feature was consistent with everything before it because the principles were explicit, not implicit. As I noted in an earlier post on Agents.md files, context files that are written by the developer tend to improve outcomes. The constitution is exactly that: developer-written context that the AI actually follows.

The process also scaled down. Terminal search took about 30 minutes start to finish, clickable URLs maybe 15. Not every feature needs a 500-line specification. Sometimes the spec is just three acceptance scenarios and a plan that says “use SwiftTerm’s built-in SearchService.”

 

What Got Built

The side-panel that started as an ADO pipeline watcher now has quite a bit more going on: an RSS reader with OPML import/export and category management, a full zsh terminal embedded via SwiftTerm with ANSI colors, search, clickable URLs, font zoom, and multi-tab support. It has live meeting transcription using WhisperKit for fully on-device speech recognition, capturing both system audio and microphone simultaneously, with speaker attribution. There’s an AI meeting summariser that sends transcripts to Azure OpenAI for action items, and Obsidian integration for quick notes and transcript export.

Here’s what that growth trajectory looked like. The chart tracks cumulative lines of code produced alongside the running token consumption. The dips correspond to major refactoring sessions: net negative on line count but positive for architecture.

Takeaway: growth is not linear. The dips show that AI-generated code requires regular architectural intervention, just like human-written code.

The daily commit data tells a richer story. Feature days (green) show clean growth: mostly insertions, few deletions. Refactoring days (orange) nearly always have deletions matching insertions, indicating restructuring rather than net growth. Half the project’s active days were refactoring days, and this is with AI writing the code. AI-generated code needs the same architectural discipline as human-written code, possibly more, because the AI writes fast but doesn’t maintain structural coherence across features unless you enforce it.

Takeaway: half the project’s active days were refactoring days. AI writes code fast, but keeping it coherent is still a human job.

Some features were specced but never built. Deferred transcription, a clipboard manager, terminal-to-AI piping. The spec process makes it cheap to explore ideas on paper, which also means you sometimes decide not to build them. I consider that a feature of the process rather than a shortcoming.

The build pipeline itself deserves a mention. make release auto-bumps the version, runs SwiftLint, builds, installs to ~/Applications/, commits, and tags. 61 releases in 7 weeks, roughly 1.2 per day. When releasing costs nothing, you release constantly.

 

Where Things Got Hard

 

The 13-Step Scroll Bug

The most interesting technical problem was a scroll bug in the embedded terminal. Scrolling up in TUI applications produced garbled, corrupted display. Scrolling down was fine.

That asymmetry turned out to be the critical clue, but it took me 13 steps to get there!

Some context first. The app embeds a terminal using SwiftTerm, a Swift library that emulates a terminal inside a native macOS view. Under the hood, it communicates with a shell (like zsh) through a PTY (pseudo-terminal): a kernel-level pair of endpoints that lets the app pretend to be a physical terminal. One end talks to the shell, the other to the UI. When you type a character, the app writes it to the PTY. When the shell produces output, the app reads it from the PTY and renders it on screen. If you’ve ever used a terminal emulator like iTerm2, Alacritty, or the built-in Terminal.app, they all work this way.

TUI applications (think htop, vim, or file browsers like ranger) take this further. They send ANSI escape sequences through the PTY to control the terminal: move the cursor, change colors, scroll the display. These are standardised byte sequences starting with ESC [ (called CSI, Control Sequence Introducer). For example, CSI 3 A means “move cursor up 3 lines.” The terminal emulator has to parse and execute each of these correctly, or the display breaks.
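A toy parser makes the wire format concrete. This Python sketch is illustrative only (a real emulator also handles intermediate bytes, private modes, and sequences split across reads), but it shows how the parameters and final byte are pulled out of a CSI sequence:

```python
import re

# CSI: ESC [ <numeric params separated by ;> <final letter>
CSI = re.compile(rb"\x1b\[([0-9;]*)([A-Za-z])")

def parse_csi(data):
    """Extract (final byte, params) pairs from a raw byte stream.
    Missing parameters default to 1, per the usual CSI convention."""
    out = []
    for params, final in CSI.findall(data):
        nums = [int(p) for p in params.split(b";") if p] or [1]
        out.append((final.decode(), nums))
    return out

# "cursor up 3", "cursor to row 2 col 5", "scroll down 1 (default)"
print(parse_csi(b"\x1b[3A\x1b[2;5H\x1b[T"))
# [('A', [3]), ('H', [2, 5]), ('T', [1])]
```

Every one of those final letters dispatches to a different buffer operation inside the emulator, which is exactly where the bug described below lived.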

It started with basic event delivery. SwiftUI was intercepting scroll events before they reached the terminal view, which I fixed by installing an NSEvent local monitor, essentially a low-level hook that catches input events before SwiftUI’s gesture system eats them. Then the trackpad was flooding the PTY at 125 Hz (the MacBook trackpad’s native reporting rate), overwhelming the TUI app with redraw requests faster than it could process them. I tried several throttling approaches, including one inspired by how Alacritty (a popular GPU-accelerated terminal) handles scroll input, which accumulates fractional pixel deltas before converting to discrete scroll lines. Each approach improved things slightly but didn’t solve the core problem.
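The accumulation idea is simple to sketch. This Python version is illustrative (the threshold value and names are mine, not Alacritty’s or SwiftTerm’s actual code): fractional pixel deltas build up in a residual, and whole scroll lines are emitted only when enough movement has accumulated, so a 125 Hz stream of tiny deltas no longer floods the PTY:

```python
class ScrollAccumulator:
    """Accumulate fractional pixel deltas, emitting whole scroll lines
    only once enough movement has built up (Alacritty-style)."""

    def __init__(self, pixels_per_line=15.0):
        self.pixels_per_line = pixels_per_line  # illustrative threshold
        self.residual = 0.0

    def feed(self, delta_pixels):
        """Return the number of whole lines to scroll for this event."""
        self.residual += delta_pixels
        lines = int(self.residual / self.pixels_per_line)
        self.residual -= lines * self.pixels_per_line
        return lines

acc = ScrollAccumulator()
# A high-rate trackpad sends many tiny deltas; most produce no scroll line.
deltas = [4.0] * 10  # 40 px total across 10 events
emitted = [acc.feed(d) for d in deltas]
print(emitted)  # mostly zeros; whole lines appear only occasionally
```

Ten input events collapse into two scroll lines, which is the whole point: the TUI app sees discrete, digestible scroll commands instead of a firehose.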

I also tried swizzling SwiftTerm’s scrollWheel handler. Swizzling is an Objective-C runtime technique where you replace a method’s implementation at runtime by swapping function pointers in the class’s method table. It’s a last-resort escape hatch: useful when a library doesn’t expose the method you need to override (in this case, SwiftTerm’s scroll handler wasn’t marked open for subclassing). That eliminated a race condition but still didn’t fix the garbling. I tried scheduling triple delayed redraws after each scroll event, but that was treating symptoms rather than root causes.

Eventually I went to the maintainers of ov, who confirmed their application was sending correct escape sequences. Then I wrote Python scripts that spawned a PTY directly, sent raw escape sequences, and captured the byte output. That’s when the picture became clear.
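A stripped-down version of that kind of probe (reconstructed here for illustration, not the original script) needs only the standard library: open a PTY pair, write raw bytes on the application side, and read them back on the emulator side to see exactly what goes over the wire:

```python
import os
import pty

# Open a raw PTY pair: 'master' is the end a terminal emulator reads,
# 'slave' is the end a shell or TUI application writes to.
master, slave = pty.openpty()

# Pretend to be a TUI app: emit CSI 3 T ("scroll down 3 lines").
os.write(slave, b"\x1b[3T")

# Pretend to be the emulator: capture the exact bytes off the wire.
data = os.read(master, 64)
print(repr(data))  # -> b'\x1b[3T'

os.close(master)
os.close(slave)
```

Because nothing in between rewrites the bytes, this isolates the question cleanly: if the sequence on the wire is correct, the bug has to be in how the emulator executes it.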

When you scroll up in a TUI, the application responds by sending CSI T (SD, Scroll Down), and yes, the naming is counterintuitive: the content scrolls down to reveal what’s above. SwiftTerm handled CSI T with a cell-by-cell copyFrom operation, iterating through every cell in the scroll region and copying it one at a time. Its sibling command CSI S (SU, Scroll Up) used splice, an efficient operation that moves whole line objects in one step. The copyFrom approach was not only slower but also non-atomic: if a display refresh fired mid-copy, you’d see a half-rendered screen where some lines had moved and others hadn’t.
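The difference between the two strategies is easy to demonstrate outside Swift. This Python sketch (names and grid are illustrative, not SwiftTerm’s code) implements a scroll-down by n lines both ways: as a single whole-line splice, and as a cell-by-cell copy that a mid-loop refresh could observe half-finished:

```python
def scroll_down_splice(lines, n, blank="·" * 4):
    """Scroll the region down by n: move whole line objects in one step,
    inserting blank lines at the top (the CSI S-style splice approach)."""
    return [blank] * n + lines[:-n]

def scroll_down_cell_copy(lines, n, blank_char="·"):
    """The cell-by-cell equivalent: copy every cell individually, bottom
    to top, then blank the top rows. Same result, but non-atomic: a
    display refresh firing mid-loop sees a half-moved screen."""
    grid = [list(row) for row in lines]
    for row in range(len(grid) - 1, n - 1, -1):
        for col in range(len(grid[row])):
            grid[row][col] = grid[row - n][col]
    for row in range(n):
        for col in range(len(grid[row])):
            grid[row][col] = blank_char
    return ["".join(row) for row in grid]

screen = ["aaaa", "bbbb", "cccc", "dddd"]
print(scroll_down_splice(screen, 1))
print(scroll_down_cell_copy(screen, 1))  # identical result, many more steps
```

Both produce the same final screen; the splice just gets there in one atomic move, which is why the garbling only appeared on the cell-copy path.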

The fix was adding a marginMode check to cmdScrollDown, exactly mirroring what cmdScrollUp already did. Three lines of code! Finding those three lines required understanding the full stack from macOS trackpad hardware through NSEvent, SwiftUI’s hosting layer, the PTY, escape sequence parsing, buffer manipulation, and CoreText rendering.

The AI was helpful at each individual step: reasoning about event rates, explaining escape sequences, analysing SwiftTerm’s source. But the systematic “narrow the suspect list” methodology was human-driven. The AI couldn’t hold the full mental model of 8 abstraction layers simultaneously, so I had to do the protocol analysis manually.

After fixing the root cause there were 6 more iterations of sensitivity tuning. The key insight was that scroll responsiveness comes from a low trigger threshold and high send frequency, not from burst count. Sending multiple events per cycle just floods the PTY all over again.

 

What Actually Works and What Doesn’t

AI-assisted development through specs works remarkably well for feature implementation. Well-structured acceptance criteria produce consistent, correct code. The RSS reader, a permission wizard, the Obsidian integration: these came out of spec sessions largely complete and working.

It also works well for research and investigation. The embedded terminal investigation compared five approaches with a structured decision matrix. The WhisperKit security analysis mapped every data structure the library exposes, confirming the audio never leaves the device. These were collaborative: I described the question, the AI researched and drafted, I corrected and expanded.

Where it falls apart is deep debugging across abstraction layers. The AI can analyse individual layers but can’t hold the full picture. Platform-specific quirks like TCC behaviour, NSPanel lifecycle, and SwiftUI-AppKit bridging need to be discovered by running the actual app on actual hardware. The AI can suggest approaches, but many of the bugs I hit weren’t in any training data because they’re specific to macOS 26.

The AI is also bad at knowing when to stop. It’ll happily generate specs for 20 features. Deciding which 8 to actually build is product judgment, and that requires a human.

 

Where the AI Reasons Hardest

One thing I was curious about was how much reasoning the AI actually does, and how that varies by the kind of work. VS Code stores Copilot chat sessions locally, so I went digging. Across the project’s lifetime: 22 chat sessions, 685 conversational turns, roughly 450 MB of session data.

The most revealing pattern was how differently the AI behaves depending on the type of task.

Takeaway: debugging consumes 4× more tokens per turn than implementation, and takes 6× longer. The type of work determines how hard the AI has to think.

Debugging is by far the most reasoning-intensive activity at 1,739 tokens per turn, compared to 424 for straightforward implementation. It also takes 6× longer per turn (25 minutes vs. 4 minutes) and consumes 3.4× as many thinking tokens relative to output. This matches what I experienced with the scroll bug: the AI can reason about individual layers well, but when a problem spans multiple abstraction layers, it generates enormous volumes of speculative reasoning as it tries to narrow down where the issue might be.

Research sits in the middle at 1,267 tokens per turn, but there were only 14 research turns total, small investigative bursts that paid off quickly. Refactoring is the leanest at 461 tokens per turn: structured, well-scoped transformations where the AI knows exactly what to do.

The pattern generalises. When the task is well-defined and self-contained, the AI reasons efficiently. When the task is open-ended or crosses boundaries, the reasoning cost explodes. This has practical implications: if you can decompose a problem into smaller, well-scoped pieces before handing it to the AI, you’ll get faster and cheaper results than asking it to figure out the scope itself.

 

What Makes a Turn Expensive?

Not all turns are created equal. Some consume a handful of tokens, others burn through thousands. The difference comes down to how many tool rounds the AI performs: reading files, writing code, running checks.

Takeaway: 15% of turns consume 41% of all tokens. The expensive turns are where the AI reads, writes, and validates across multiple files.

Turns with 20+ tool rounds consume 2,085 tokens on average and account for 41.3% of all token usage, despite being only 15% of turns. These are the heavy-lifting moments: the AI opening multiple files to understand context, writing code across several locations, then checking that everything compiles. Below 5 tool rounds, token costs stay manageable (174–498 per turn). Those lighter turns tend to be simple questions, single-file edits, or quick lookups.

The implication is that token cost isn’t driven by how much you ask the AI to write, it’s driven by how much the AI needs to read and verify in order to write confidently. A one-line fix in a file the AI already understands costs almost nothing. The same one-line fix in unfamiliar code can trigger a chain of file reads and cross-references that burns through tokens fast.

 

The Thinking Model Shift

Midway through the project I switched from Claude Opus 4.5 to Claude Opus 4.6, a model that reasons significantly more before responding. The shift is clearly visible in the data.

Takeaway: thinking-heavy models spend more tokens reasoning before responding. The code they produce is noticeably more consistent as a result.

Under Opus 4.5, the thinking-to-output ratio sat around 0.28: for every output token, the model did about a quarter token of reasoning. Under Opus 4.6, that ratio jumped to 2.1, meaning the model now reasons more than twice as much as it writes. On some days the ratio peaked at 5.3×, five tokens of reasoning for every token of output.

The trade-off is straightforward: thinking-heavy models use more tokens per turn, but the output requires fewer corrections and less back-and-forth. Whether that trade-off is worth it depends on your tolerance for iteration. In my experience, paying for reasoning up front was cheaper than paying for rework after the fact.

 

Conversation Length and Session Structure

Two patterns in the data affect how you structure AI-assisted work sessions.

First, context inflation. As conversations grow longer, the AI consumes more tokens per turn because it has more history to process on every request.

Takeaway: after turn 50, per-turn token costs nearly double. Long conversations get expensive.

The first 10 turns average 581 tokens per turn. After turn 50, that jumps to roughly 1,100 tokens, nearly 2× the startup rate. By turn 90–99 it reaches 1,342 tokens. This isn’t the model getting verbose. It’s the accumulated conversation context forcing the model to process more information on every request. At some point it becomes cheaper to start fresh than to continue.

Second, conversation warmup. Every new session pays a startup tax: the AI has to re-read files, re-establish context, and orient itself before producing useful output.

Takeaway: short sessions waste up to 78% of time on warmup. The sweet spot is 30–80 turns.

For short sessions (6–9 turns), warmup can consume 47–78% of total session time. For longer sessions (100+ turns), warmup drops below 3% because productive time amortises it. But long sessions hit the context inflation problem from above.

These two forces pull in opposite directions: long sessions waste tokens on inflated context, short sessions waste time on warmup. From this data, the sweet spot appears to be sessions of 30–80 turns. Long enough to amortise the startup cost, short enough to keep per-turn costs manageable.
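As a back-of-the-envelope check, here’s a toy cost model in Python. The early-turn (581) and late-turn (~1,342 at turn ~95) figures come from the measurements above; the linear growth and the warmup cost of roughly 10,000 tokens are assumptions of the model, so treat the exact minimum as illustrative rather than a prescription:

```python
def avg_cost_per_turn(session_turns, warmup_tokens=10_000,
                      base=581.0, inflation=8.5):
    """Toy model: average token cost per turn for a session of given length.

    Assumptions (illustrative, fitted loosely to the post's numbers):
    - a fixed warmup cost paid once per session (~10k tokens, a guess),
    - per-turn cost grows roughly linearly with conversation depth:
      581 tokens early on, ~1,342 by turn ~95 -> slope ~(1342-581)/90 ≈ 8.5.
    """
    turn_costs = [base + inflation * t for t in range(session_turns)]
    return (warmup_tokens + sum(turn_costs)) / session_turns

for n in (10, 40, 80, 150):
    print(n, round(avg_cost_per_turn(n)))
```

With these numbers the average cost per turn bottoms out somewhere in the 40–60 turn range, consistent with the 30–80 sweet spot in the data, though the exact minimum moves around with the warmup estimate.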

 

Reflections

In the abstractions post I asked: “Can anyone, human or AI, work with this without hand-holding?” That question turned out to be the right framing for this whole project.

Vibe coding, prompting loosely and hoping for the best, works fine for prototypes. For anything beyond that, you need structure. Not heavyweight process, but enough shared context that the AI can produce consistent output. A constitution, specs with acceptance criteria, a file-level task breakdown.

There is a gap between “AI can write code” and “AI can build software”, and that gap is exactly what the spec process fills. The code still needs a human, the product decisions still need a human, the deep debugging still needs a human. But the volume of correct, consistent code that one person can produce in 7 weeks would have been unthinkable not long ago.

If you’re interested in the project, the source code, or the spec process, feel free to reach out. You can find my contact information in the footer.