In software engineering, tests and code exist in tension. Unit tests verify the program is correct. The program, in turn, validates that the tests make sense. They reinforce each other. Neither is complete without the other.
I’ve been applying this same adversarial principle to creative work with AI, and it’s producing noticeably better results than single-agent prompting.
The single-agent problem
A single AI agent thinks linearly, one token at a time. Ask it to build a landing page, and it’ll produce something reasonable. But a good landing page isn’t just one skill. It’s copywriting, web design, conversion rate optimization, brand compliance, marketing strategy, and sometimes legal considerations, all at once.
No single pass through a context window can hold all of those disciplines in focus simultaneously. The agent will nail the copy but forget the CRO fundamentals, or get the design right but drift off brand voice. Something always slips.
Setting up the adversarial team
The fix is to stop asking one agent to do everything and instead assemble a team where each member brings a deep specialization.
I have a main orchestrator agent spawn sub-agents (or use Claude Code’s team features), and each team member pulls in a dedicated skill file loaded with context for its domain. A copywriting agent might have 500 examples from top copywriters, excerpts from books, and your favorite and least favorite pieces of copy. A web design agent has example pages, layout patterns, and accessibility standards. A branding agent carries your full brand guidelines, voice documentation, and imagery specs.
These skill files can be massive and detailed. That’s the point. You’re front-loading each agent’s short-term memory with deep expertise before it ever looks at your work. I touched on this idea of building AI-operable systems in a previous post, and the skill file approach takes it even further.
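One way to picture the wiring is as a roster of specialists, each defined by a skill file and the dimensions it cares about. This is an illustrative sketch in Python, not any particular tool's API; the names, file paths, and rubric items are placeholders for your own material.

```python
# Illustrative sketch only: each specialist is a skill file plus the rubric
# dimensions it scores against. Paths and names are placeholders.
from dataclasses import dataclass

@dataclass
class Specialist:
    name: str           # e.g. "copywriter", "cro", "brand"
    skill_file: str     # the domain dossier loaded into its context window
    rubric: list[str]   # the dimensions this specialist scores

TEAM = [
    Specialist("copywriter", "skills/copywriting.md",
               ["headline clarity", "voice", "objection handling"]),
    Specialist("designer", "skills/web-design.md",
               ["layout", "visual hierarchy", "accessibility"]),
    Specialist("cro", "skills/cro.md",
               ["value prop above the fold", "CTA strength", "social proof"]),
    Specialist("brand", "skills/brand-guidelines.md",
               ["voice consistency", "visual identity", "imagery"]),
]
```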
The rubric and scoring loop
Each specialized agent receives the current draft of whatever you’re building and evaluates it through its own lens. The CRO agent, for example, might score against a rubric like:
- Is the value proposition clear above the fold?
- Are CTAs bold with action-oriented copy?
- Are social proof elements (ratings, testimonials) visible?
- Where is the pricing positioned?
- Is there urgency (countdown timer, limited availability)?
- Is the page scannable with clear visual hierarchy?
It scores each dimension, produces an overall rating out of 10, and returns the score along with its top recommendations for improvement.
Every agent does this independently, through its own lens. The copywriter scores the writing. The designer scores the layout and visuals. The brand agent checks voice and visual consistency. Each one comes back with a number and a list of suggestions.
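Sketched as a data structure (the field names are mine, not a prescribed format), each review boils down to per-dimension scores, an overall number, and a ranked list of fixes:

```python
# What each specialist hands back to the orchestrator. Field names are
# illustrative; what matters is a number plus actionable recommendations.
from dataclasses import dataclass, field

@dataclass
class Review:
    specialist: str                   # e.g. "cro"
    dimension_scores: dict[str, int]  # each rubric item scored 1-10
    overall: int                      # overall rating out of 10 through this lens
    recommendations: list[str] = field(default_factory=list)  # most impactful first

# What a CRO review might look like after one pass:
example = Review(
    specialist="cro",
    dimension_scores={"value prop above the fold": 8, "CTA strength": 6, "social proof": 7},
    overall=7,
    recommendations=[
        "Move the testimonial strip above the pricing table",
        "Rewrite the primary CTA with action-oriented copy",
    ],
)
```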
Convergence through conflict
This is where it gets interesting. These agents don’t naturally agree. Good copywriting might clash with brand voice. Bold CRO tactics might conflict with clean design sensibility. Compliance requirements can undercut persuasive copy.
They’re in genuine tension, just like real team members with different expertise.
The orchestrator’s job is to synthesize:
- Collect all scores. If any agent scores below 9 out of 10, another iteration is needed.
- Read the feedback from all agents and identify the most impactful changes.
- Revise the deliverable, balancing competing recommendations.
- Send it back out for another round of scoring.
Each cycle tightens the work. The copy gets sharper and the design gets more intentional. Objections get handled. Details that a single-pass agent would miss get caught by one specialist or another.
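Put together with the Specialist and Review sketches above, that loop is only a few lines. The ask_specialist and revise_draft helpers here are hypothetical stand-ins for model calls, and the 9-out-of-10 threshold and round cap are just the numbers I happen to use:

```python
THRESHOLD = 9   # keep iterating until every specialist scores at least this
MAX_ROUNDS = 5  # safety cap so the loop can't run forever

def ask_specialist(s: Specialist, draft: str) -> Review:
    """Placeholder: send the draft plus s.skill_file to the model, parse a Review back."""
    ...

def revise_draft(draft: str, feedback: list[str]) -> str:
    """Placeholder: one more model call that applies the highest-impact feedback."""
    ...

def refine(draft: str, team: list[Specialist]) -> str:
    for _ in range(MAX_ROUNDS):
        reviews = [ask_specialist(s, draft) for s in team]   # each lens scores independently
        if all(r.overall >= THRESHOLD for r in reviews):
            return draft                                     # every specialist is satisfied
        # Lowest-scoring lenses first: their recommendations point at the biggest gaps.
        feedback = [rec for r in sorted(reviews, key=lambda r: r.overall)
                    for rec in r.recommendations]
        draft = revise_draft(draft, feedback)
    return draft
```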
The GAN connection
This plays on one of my favorite concepts in AI: the generative adversarial network. In a classic GAN, one model generates images while a second model tries to determine if each image is real or AI-generated. They train against each other. The generator improves because the discriminator keeps catching it, and the discriminator improves because the generator keeps getting better at fooling it.
What makes GANs clever is that they create a self-improving feedback loop without needing manually labeled training data. The adversarial structure itself is the training signal.
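For readers who haven't seen one, the training loop is compact. This is a generic PyTorch sketch of the textbook pattern, assuming a generator G, a discriminator D that outputs a single logit, an optimizer for each, and a latent dimension of 100; it isn't any specific implementation.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, g_opt, d_opt, real_images, latent_dim=100):
    batch = real_images.size(0)

    # Discriminator: learn to separate real images from generated ones.
    z = torch.randn(batch, latent_dim)
    fakes = G(z).detach()  # detach so this step doesn't update the generator
    d_loss = (F.binary_cross_entropy_with_logits(D(real_images), torch.ones(batch, 1))
              + F.binary_cross_entropy_with_logits(D(fakes), torch.zeros(batch, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: learn to produce images the discriminator labels as real.
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    return d_loss.item(), g_loss.item()
```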
What I’m describing with agent teams operates at a higher level: LLMs in role-based scenarios providing structured feedback to each other. But the principle is the same: tension between evaluators and creators drives quality upward through iteration.
What this actually looks like
Over the past couple weeks, I’ve used this pattern for:
- Landing pages for my business. Multiple sales pages where CRO, copywriting, brand, and design agents each scored and refined the work through several iteration cycles.
- A full blog redesign pulling in SEO, marketing strategy, brand identity, and web design as separate evaluation lenses. I’ve been applying this kind of growth engineering with Claude Code across a lot of my marketing work.
- A short playbook on using AI for business, where editorial, subject matter, and audience-fit agents each had their say.
- Software where domain expertise agents (say, one that understands CPG accounting) worked alongside a coding agent to build something neither could have built alone.
In each case, the final product had a completeness that single-pass generation just doesn’t produce. You notice it. Fewer holes, fewer “oh we forgot about that” moments.
The cost
Let’s be honest about the trade-offs. This approach burns through tokens. A landing page might take 30 to 40 minutes of agent runtime with multiple research phases, iteration loops, browser screenshots for visual verification, and re-scoring cycles.
That’s a lot compared to a single prompt that returns something in 30 seconds. But 30 minutes for a landing page that’s been reviewed by the equivalent of five specialists? I’ll take that trade every time.
You’re trading tokens for quality assurance. The same way a real team costs time and money to review each other’s work, the agent team costs compute. But the output is closer to what a real team would produce.
Same brain, different books
I keep coming back to this thought. These agents are fundamentally the same model. Claude is Claude, whether it’s playing the copywriter or the CRO specialist. The difference is what you loaded into its context window before it started working.
It’s like having the same person walk into the room, but each time they’ve just finished reading five different books. The copywriting agent just absorbed every example and principle you could fit in. The brand agent just re-read your entire brand bible. They bring different perspectives because they’re primed with different information, not because they’re different intelligences.
That framing is why I think this works so well. You’re giving the same capable reasoner different source material to reason from, and the disagreements that emerge are real, not manufactured.
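In code, the difference between two specialists can be as small as which file gets passed as the system prompt. Here's a minimal sketch with the Anthropic Python SDK; the model string and file paths are placeholders, so substitute whatever you actually run:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def specialist_review(skill_file: str, draft: str) -> str:
    # Same model every call; only the "book it just read" (the system prompt) changes.
    system_prompt = open(skill_file).read()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": f"Score and review this draft:\n\n{draft}"}],
    )
    return response.content[0].text

draft = open("landing-page.html").read()  # whatever deliverable you're iterating on
copy_feedback = specialist_review("skills/copywriting.md", draft)
brand_feedback = specialist_review("skills/brand-guidelines.md", draft)
```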
Running a company of agents
Working this way is starting to feel less like programming and more like management. You delegate work, wait for feedback, reconcile conflicting opinions, make a call, and send it back for another round. I wrote about the early stages of this shift in The AI CEO, and it’s accelerating faster than I expected.
In some of these cases, you’re delegating to a team. It won’t be long before you’re delegating to departments. Fully AI departments with dozens or hundreds of agents that have been sub-delegated to operate on specific pieces of a larger project.
I’m already routinely running five to ten agents against the same deliverable. Scale that up and you start to see the shape of something that looks a lot like an org chart, except every box is an AI agent with a specialized skill set.
Try it
If your AI tool of choice supports agents and sub-agents, try this. Even a rough version works (there’s a sketch of the prompt setup after the list):
- Pick a deliverable: a landing page, a blog post, a piece of code.
- Identify three or four disciplines that matter for quality. Copy, design, SEO, whatever fits.
- Create a skill prompt for each discipline, as detailed as you can make it.
- Have each specialist score the work on a 1 to 10 rubric with specific recommendations.
- Iterate until every specialist scores a 9 or above.
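Here's what steps 3 and 4 can look like in their roughest form, sketched in Python. Everything inside the strings is illustrative; paste in your own material and dimensions.

```python
# Rough version of steps 3-4: a skill prompt plus a scoring instruction.
SEO_SKILL = """You are an SEO specialist. Work only from the principles and
examples below:
<paste your best SEO material, checklists, and examples here>"""

SCORING_INSTRUCTION = """Score the draft below on a 1-10 rubric for your
discipline: one score per dimension, an overall score out of 10, and your
top three recommendations, most impactful first."""

def review_prompt(skill: str, draft: str) -> str:
    # One specialist's full prompt: expertise first, then the ask, then the work.
    return f"{skill}\n\n{SCORING_INSTRUCTION}\n\n---\n\n{draft}"
```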
You’ll burn more tokens and it’ll take longer. But I haven’t gone back to single-pass generation for anything that matters. Once you’ve seen what a team of agents produces compared to one agent winging it, the difference is hard to unsee. The same idea that makes software testing indispensable, that adversarial pressure produces better results, turns out to work just as well when the thing being tested is a creative deliverable instead of a codebase.
