Technical · February 11, 2026 · 10 min read

How We Built a Three-Agent Editorial Panel (and Why It Works)

By Carlos Jorge & Tom Meredith

Isometric illustration of three geometric reviewer shapes voting on a document

We replaced a single confidence score with three specialized AI agents that vote on every piece of content before publication. Here is the architecture, the consensus algorithm, the failure modes, and what it costs.

The Problem with Confidence Scores

Our first approach to autonomous content quality was a formula-based confidence score from 0.0 to 1.0. Simple rules: above 0.8, auto-publish. Between 0.6 and 0.8, ask Claude. Below 0.6, save as draft.

Three problems emerged immediately:

  • Single-dimensional. One number cannot capture whether content is factually accurate, complete for families, AND written in the right voice. A guide can score 0.85 overall while having incorrect lift prices.
  • Arbitrary thresholds. Why 0.8? Why not 0.75? Every threshold felt like a guess.
  • No self-improvement. Content that failed the threshold just failed. There was no mechanism to fix it and try again.

We needed something that evaluated content from multiple perspectives and could iterate on failures.

The Architecture

```
              Content Generated
                      |
                      v
+----------------------------------------------+
|          APPROVAL PANEL (Parallel)           |
|                                              |
|  +----------+  +----------+  +----------+    |
|  |TrustGuard|  |FamilyVal |  |VoiceCoach|    |
|  |  Agent   |  |  Agent   |  |  Agent   |    |
|  +----+-----+  +----+-----+  +----+-----+    |
|       |             |             |          |
|       v             v             v          |
|   [APPROVE]     [IMPROVE]     [APPROVE]      |
|                                              |
|  VOTE: 2 approve, 1 improve                  |
|  THRESHOLD: 2/3 majority                     |
|  RESULT: APPROVED                            |
+----------------------------------------------+
                      |
      | 2/3 approved  --> PUBLISH
      | <2/3 approved --> ITERATE (up to 3x)
           +-- Combine issues from all agents
           +-- Apply improvements via Claude
           +-- Re-run approval panel
           +-- Max 3 iterations --> DRAFT if still <2/3
```

Three agents run in parallel. Each votes APPROVE, IMPROVE, or REJECT. A 2/3 majority is required for publication. If the content does not pass, the agents' feedback is aggregated and used to improve the content, then the panel runs again. Maximum 3 iterations. If it still fails after 3 rounds, the content is saved as a draft for human review.

Why Three Agents with Different Perspectives

Each agent maps to a specific family need:

TrustGuard: "Can families trust this?"

TrustGuard cross-references every claim against the research sources. Are the lift counts correct? Do the prices match current data? Are safety warnings appropriate? It flags anything it cannot verify.

This agent protects against the most dangerous failure mode of AI content: confident misinformation. Families are making expensive trip decisions based on these guides. A wrong childcare age range or an outdated lift price is not just a quality issue. It is a trust violation.

FamilyValue: "Can families use this to plan?"

FamilyValue evaluates completeness against a strict checklist. Is the childcare section detailed enough? Are beginner terrain options described? Would a parent reading this have enough information to make a booking decision?

It checks for 9 required sections: quick take, family metrics, getting there, where to stay, lift tickets, on-mountain, off-mountain, ski calendar, and FAQs. If a section is vague or missing specific details (named hotels, actual prices, specific restaurants), FamilyValue votes IMPROVE.
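A completeness check like FamilyValue's is simple to sketch. The section keys below mirror the nine required sections named above, but the helper name and the 200-character "detailed enough" threshold are illustrative, not our production values:

```python
# Illustrative sketch of a FamilyValue-style completeness check.
REQUIRED_SECTIONS = [
    "quick_take", "family_metrics", "getting_there", "where_to_stay",
    "lift_tickets", "on_mountain", "off_mountain", "ski_calendar", "faqs",
]

def missing_sections(guide: dict) -> list[str]:
    """Return required sections that are absent or too thin to be useful."""
    MIN_CHARS = 200  # illustrative threshold for "detailed enough"
    return [
        key for key in REQUIRED_SECTIONS
        if len(guide.get(key, "")) < MIN_CHARS
    ]
```

In production the check is semantic (does the section name hotels and prices?), not a character count, but the shape is the same: a vote of IMPROVE whenever this list is non-empty.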

VoiceCoach: "Does this encourage, not intimidate?"

VoiceCoach enforces editorial voice. Travel writing has a tendency toward breathless hype. VoiceCoach ensures every guide sounds like a knowledgeable friend, not a marketing department. Warm, accessible, substance over style.

The three perspectives are deliberately non-overlapping. TrustGuard does not care about voice. VoiceCoach does not care about price accuracy. FamilyValue does not care about tone. This separation of concerns means each agent can be deeply focused on its domain.

The Consensus Algorithm

The voting logic is straightforward:

  • 3/3 approve: Publish immediately.
  • 2/3 approve: Publish. The improving agent's feedback is logged but does not block publication.
  • 1/3 approve, 2 improve: Iterate. Combine feedback from both improving agents, apply improvements via Claude, re-run the panel.
  • 0/3 approve: Iterate with all feedback combined.
  • Any reject: Treat as a strong improve signal. Iterate.
  • 3 iterations, still <2/3 approve: Save as draft with all agent notes attached.

The iteration loop is the key innovation. Content does not just pass or fail. It improves.
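The decision rules above reduce to a small function. This is a sketch of the logic, not our production code:

```python
def panel_decision(verdicts: list[str], iteration: int,
                   max_iterations: int = 3) -> str:
    """Apply the 2/3-majority rules: publish on 2+ approvals with no
    reject, iterate otherwise, and fall back to draft after the cap."""
    approvals = verdicts.count("approve")
    if approvals >= 2 and "reject" not in verdicts:
        return "publish"
    if iteration >= max_iterations:
        return "draft"    # saved for human review with all agent notes
    return "iterate"      # combine feedback, improve via Claude, re-run
```

Note the reject clause: even with two approvals, a single REJECT blocks publication, because a reject is treated as a strong improve signal.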

The Code Pattern

The approval primitive is structured as three layers:

Layer 1: Atomic evaluation functions. Each agent has its own evaluation function that takes content, sources, and resort data, returning a structured result:

```
EvaluationResult:
    agent_name: str         # "TrustGuard", "FamilyValue", "VoiceCoach"
    verdict: str            # "approve", "improve", "reject"
    confidence: float       # 0.0-1.0
    issues: list[str]       # Specific problems found
    suggestions: list[str]  # How to fix
    reasoning: str          # Overall assessment
```

Layer 2: Panel orchestration. Runs all three evaluations in parallel, tallies votes, and determines whether the 2/3 threshold is met.

```
PanelResult:
    votes: list[EvaluationResult]
    approved: bool
    approve_count: int
    improve_count: int
    reject_count: int
    combined_issues: list[str]
    combined_suggestions: list[str]
```
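A minimal sketch of this orchestration layer, assuming each agent exposes an async evaluation function (the function names and the simplified dict result shape here are illustrative):

```python
import asyncio

async def run_panel(content, sources, evaluators):
    """Run all agent evaluations concurrently and tally the votes."""
    votes = await asyncio.gather(*(ev(content, sources) for ev in evaluators))
    approve = sum(1 for v in votes if v["verdict"] == "approve")
    return {
        "votes": votes,
        # 2/3 majority, and any reject blocks publication
        "approved": approve >= 2 and all(v["verdict"] != "reject" for v in votes),
        "combined_issues": [i for v in votes for i in v["issues"]],
    }
```

Running the three evaluations with `asyncio.gather` rather than sequentially keeps panel latency close to a single Claude call.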

Layer 3: Approval loop. Orchestrates the iteration cycle: run panel, check result, improve content if needed, re-run, up to 3 iterations.

```
ApprovalLoopResult:
    final_content: dict
    approved: bool
    iterations: int
    panel_history: list[PanelResult]
    final_issues: list[str]
```
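The loop itself can be sketched with the panel and the improver injected as plain functions, which is also what makes it easy to test with fakes. The parameter names and simplified result dict are illustrative:

```python
def approval_loop(content, sources, run_panel, improve, max_iterations=3):
    """Run the panel, improve on failure, re-run; draft after the cap."""
    history = []
    for i in range(1, max_iterations + 1):
        result = run_panel(content, sources)
        history.append(result)
        if result["approved"]:
            return {"final_content": content, "approved": True,
                    "iterations": i, "panel_history": history}
        # Feed the combined agent feedback into an improvement pass
        content = improve(content, result["combined_issues"])
    return {"final_content": content, "approved": False,
            "iterations": max_iterations, "panel_history": history}
```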

This layered structure means each piece is independently testable. You can unit-test TrustGuard's evaluation without running the full panel. You can test the iteration logic with mock evaluation results.

Budget Controls

The entire pipeline runs on approximately $5 per day. Here is the per-resort cost breakdown:

  • Research (3 API sources: semantic search, Brave, Tavily): ~$0.20
  • Content generation (Claude): ~$0.80
  • Approval panel (3 agents, 1-3 iterations): $0.15-$0.60
  • Content improvement (if needed): $0.10-$0.30
  • Decision calls (Claude Sonnet for lighter tasks): ~$0.05
  • Total per resort: ~$1.30-$1.95

Budget enforcement happens at three points: before the pipeline starts (can we afford any work today?), before each resort (can we afford one more?), and after each API call (log the cost for tracking). When the daily budget is exhausted, processing stops. The budget resets the next day.
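In sketch form, the gate at all three checkpoints is the same tiny object. This assumes a simple in-memory tracker; a production version would persist spend across runs:

```python
class DailyBudget:
    """Minimal budget gate; the $5 default cap follows the figures above."""
    def __init__(self, daily_limit_usd: float = 5.00):
        self.limit = daily_limit_usd
        self.spent = 0.0

    def can_afford(self, estimated_cost: float) -> bool:
        # Checked before the pipeline starts and before each resort
        return self.spent + estimated_cost <= self.limit

    def record(self, actual_cost: float) -> None:
        # Logged after every API call for tracking
        self.spent += actual_cost
```

The same check runs at pipeline start (estimated cost of one resort), before each additional resort, and the `record` call after every API response keeps the estimate honest.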

Each panel run costs approximately $0.15-$0.20 for the three Claude calls. With worst-case 3 iterations, that is $0.45-$0.60 for the editorial review of a single resort guide. Compare that to a human editor at $30-$50 per article.

Failure Modes (What Happens When Agents Disagree)

The panel is designed for disagreement. Here are the real failure modes we have encountered:

Stale source data. Twice, TrustGuard approved content where a resort had changed ownership and the new operator had not updated their website. All research sources agreed on the outdated information. A human would have caught the discrepancy from context clues the agents missed. Our solution: content mentioning safety-critical information (avalanche conditions, childcare licensing, medical facilities) gets a human spot-check. About 30 minutes per week on top of an otherwise autonomous process.

Voice vs. completeness tension. VoiceCoach sometimes voted IMPROVE on content that FamilyValue had approved because detailed information read as dry. The iteration loop resolved this naturally: the improved version kept the detail but rewrote it in a warmer tone. The agents' different priorities created better content than either perspective alone.

Iteration cap reached. In hundreds of articles, the 3-iteration cap has been reached twice, both times due to insufficient research data rather than editorial issues. When the underlying information is thin, no amount of rewriting fixes it. These drafts got flagged for human review, and in both cases the correct action was to wait for better source data before publishing.

The Discovery Connection

This architecture did not emerge from a whiteboard. It came from talking to families.

Our methodology is rooted in continuous discovery. Before we built the editorial panel, we interviewed families about what they needed from a ski resort guide. The consistent feedback: accuracy matters more than polish (TrustGuard), specific details matter more than general impressions (FamilyValue), and tone matters because travel writing that sounds like a brochure erodes trust (VoiceCoach).

The three agents are not arbitrary. They map directly to what families told us they needed. Discovery shaped the architecture.

What This Pattern Means for Other Industries

The three-agent editorial panel is a reusable pattern. The specific agents change based on the domain:

  • Real estate: FactChecker (are the square footage and taxes correct?), BuyerValue (can someone make a viewing decision?), BrandVoice (does it match the agency's tone?)
  • E-commerce: AccuracyGuard (are specs and prices correct?), CustomerFocus (does it answer buying questions?), ToneCoach (is it persuasive without being pushy?)
  • Healthcare: MedicalAccuracy (are the facts clinically sound?), Accessibility (can a patient understand it?), Empathy (does the tone support rather than alarm?)

The architecture stays the same: multiple specialized agents, parallel evaluation, 2/3 majority vote, iterative improvement, and a draft fallback. The domain knowledge lives in the agent prompts, not the orchestration code.
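That separation can be made concrete: an orchestration helper that never changes, with a prompt dictionary carrying all the domain knowledge. The prompts, names, and `evaluate` signature below are illustrative:

```python
# Hypothetical: swap the panel's domain by swapping prompts only.
REAL_ESTATE_PANEL = {
    "FactChecker": "Verify square footage, taxes, and listing facts...",
    "BuyerValue":  "Could a buyer decide to book a viewing from this?",
    "BrandVoice":  "Does this match the agency's tone guidelines?",
}

def build_panel(prompts: dict, evaluate):
    """Return one evaluator per agent; only the prompt varies per domain."""
    return [lambda content, p=prompt, name=n: evaluate(name, p, content)
            for n, prompt in prompts.items()]
```

Moving from ski resorts to real estate or e-commerce means writing three new prompts, not new orchestration code.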

Build Your Own

If your team produces high volumes of structured content, this pattern probably applies. Describe your content bottleneck and get a free Automation Blueprint that maps what an autonomous pipeline could look like for your use case.

Related Case Study

SnowThere

Fully autonomous agent pipeline. Research, generate, review, publish daily. Three-agent editorial panel ensures quality. 116 resorts across 16 countries, zero editors.

116 Resorts, 0 Editors. Read the full story

Have a similar challenge?

Describe your bottleneck and get a free Automation Blueprint in 60 seconds.