
Steve Yegge’s “8 levels” chart gets repeated online as a ladder of tools: IDE → agent → orchestrator. But that might miss what Steve was actually trying to show.
The levels are a day‑to‑day operating model for engineers: how much you trust the agent, how you review, and where you spend attention—on code diffs, on agent actions, or on orchestration (task decomposition, coordination, and verification). That’s why the same engineer can look “Level 2” on Monday (tight review, cautious changes) and “Level 6” on Friday (multiple parallel agents) depending on risk, context, and deadlines.
Below is a practical, engineer-facing interpretation of the levels—plus a “taste/time-horizon” perspective from my conversation with Steve, and a set of proven tips & tricks from Addy Osmani’s orchestration patterns and Steve’s own maintainer workflows.
Traditional dev work centers on producing and reviewing code. Agentic work gradually moves you toward supervising a production line:
If there’s one phrase that describes the ladder, it’s:
Diff reviewer → agent supervisor → team orchestrator.
This becomes increasingly rare in fast-moving teams—not because it’s “bad,” but because throughput norms shift.
This is “AI as a better assistant,” not “AI as a worker.”
This is where teams often see big wins—and also where “silent quality drift” can start if verification is weak.
This level is less about code and more about supervision: “Is the agent doing the right things?”
You’re no longer “coding in the editor.” You’re specifying outcomes, then verifying results.
Steve’s warning here is real: multiplexing can become addictive because “there’s always another agent you can spin up.”
This is where people say: “I accidentally messaged the wrong agent” and “How do I coordinate this?”
This is where Addy Osmani’s “orchestrator model” clicks: your job becomes less “writing software” and more “building the production line that builds software.”
In our discussion, one theme kept returning: the gap between locally plausible output and globally good engineering. And to be clear: agents have improved a lot since that conversation—especially at execution (multi-file edits, wiring systems together, iterating on errors, running workflows). But that progress doesn’t eliminate the gap; it changes where it shows up. When generation gets cheaper and faster, the cost of a wrong direction compounds sooner—which is why humans remain the long-term compass.
Concretely, as agents get stronger, more output means more need for judgment: it’s easier to ship plausible changes faster than a team can sense long-term consequences. Time-horizon thinking becomes product quality, not just code quality (“is this the right abstraction for the next 6 months?”). And context remains the hard limit—agents don’t naturally carry your organization’s full history and constraints unless you force the loop with specs, reviews, retros, and quality gates. The failure mode shifts from “can’t do it” to “can do it in the wrong direction,” which is exactly where human taste matters most.
This “taste” topic is close to me personally — I’ve written about it before in the context of platform engineering as omakase: in fine dining, omakase is ultimate trust (“I’ll leave it up to you”), but it only works because the chef has taste built through years of practice and constant feedback. It’s a useful analogy for the agent era: as we delegate more execution, the job shifts toward curating outcomes and earning trust through judgment and verification.
A moment from my conversation with Steve that stayed with me: when we talked about what AI can’t reliably replicate yet, we kept returning to time. Models can be very strong “in the now,” but engineers build judgment from continuous experience: years of seeing what breaks, what slows teams down, and what kinds of shortcuts create future pain. That’s why senior engineers don’t just review what changed—they evaluate whether the change will make the system easier or harder to evolve next month. As agents push us up the levels (from diff review to supervising actions to orchestrating teams), this long‑horizon “taste” becomes the most important human contribution: deciding what to build, what not to build, and what quality bar is worth paying for.
This explains why engineers, when asked about tech debt, don't answer only "how it is now." They also answer:
At higher levels of AI adoption, this matters more, not less—because mistakes compound faster when generation is cheap.
Addy’s key line is that the bottleneck shifts from generation to verification. If you’re running multiple agents, assume:
Practical rule: set a WIP limit—don’t run more agents than you can review meaningfully.
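The WIP limit can be enforced mechanically rather than by willpower. Here is a minimal, illustrative sketch (not from Addy's or Steve's tooling) using an `asyncio.Semaphore` to cap how many agents run in parallel; `run_agent` is a hypothetical stand-in for whatever actually launches an agent on a task:

```python
import asyncio

# Cap on in-flight agents: don't run more than you can review meaningfully.
MAX_CONCURRENT_AGENTS = 3

async def run_agent(task: str) -> str:
    """Hypothetical placeholder for launching a real coding agent."""
    await asyncio.sleep(0.01)  # stands in for the agent doing work
    return f"diff for {task!r} ready for review"

async def run_with_wip_limit(tasks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_AGENTS)

    async def guarded(task: str) -> str:
        # Blocks here whenever MAX_CONCURRENT_AGENTS are already in flight,
        # so the queue of unreviewed output can never outrun you.
        async with sem:
            return await run_agent(task)

    return await asyncio.gather(*(guarded(t) for t in tasks))

results = asyncio.run(run_with_wip_limit([f"task-{i}" for i in range(6)]))
```

The semaphore is the whole trick: generation capacity is effectively infinite, so the limit has to live on your side of the loop, sized to your review bandwidth rather than to how many agents you could spin up.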
Three reliable gates from Addy’s playbook:
This solves the “everyone edits everything” problem. It also makes it easier to:
From Steve’s Survival 3.0 framing, friction kills adoption—and, with it, survival. His Beads/Gas Town approach is essentially:
That’s Agent UX as strategy: reduce retries and misunderstandings.
Steve’s Vibe Maintainer flips the usual OSS maintainer default:
Use the levels as a situational tool, not a badge, e.g.:
In other words: maturity is not “more YOLO.” Maturity is knowing when to be YOLO and when to be surgical.