2026-06-12 · 10 min · multi agent systems

How I Built a Multi-Agent System That Runs Itself (And What Almost Broke It)

A practical breakdown of building a real multi-agent system in production — how I orchestrated seven AI agents, what coordination patterns I used, and the three failure modes nobody warns you about.

By Harel Asaf·AI Builder·Tel Aviv

Running seven AI agents in production taught me one thing fast: a multi-agent system is not a smarter chatbot. It's closer to a small company, and it fails for the same reasons small companies do: bad communication, unclear ownership, and one person trying to do everything at once.

I'll tell you exactly how I built mine, what coordination patterns I used, and the three failure modes that cost me the most time.

What the System Actually Does

The system runs harelasaf.com end-to-end. Seven named agents, each with a defined domain:

Aria — web design, SEO, GEO. Writes and ships articles daily (this one included).
Martin — infrastructure. Cloud Run health checks, deploys, GitHub commits.
Jams — social distribution. LinkedIn posts, content calendar.
Albert — finance. Expense tracking, client invoices.
Ben — prototypes. Ships new demos to the /prototypes section.
Rio — research. Nightly intelligence scans, competitive monitoring.
Vision — orchestration. Morning briefing, backlog routing, strategic decisions.

Each agent wakes on a schedule, reads a shared memory store, does its work, writes back. None of them are "general AI." They're narrow. That's the whole point.

The full thing runs on Google Cloud Run, triggered by Cloud Scheduler, with WhatsApp (Green API) as the human-facing channel. When I want to talk to the system, I just text it.

Why Multi-Agent Instead of One Big Agent

I spent two weeks trying to build a single agent that did everything. It failed. Claude is capable, but it isn't consistently capable across 15 different domains in a single context window. The real problem was context pollution.

When you ask one agent to write an article and check server health and track expenses, the context window fills up with noise. The article-writing instructions start bleeding into the finance logic. You get weird hybrid outputs that are mediocre at everything.

More importantly: a single agent has no concept of ownership. When Aria writes something, she knows it's hers. She'll check for duplicates, apply my voice DNA, run the Sansa editorial gate. If you collapse all of that into one agent, those guardrails stop feeling like "this is my job" and start feeling like "optional steps I might skip."

Specialization isn't just about performance. It's about accountability.

The Architecture: Shared Memory Plus Event-Driven Handoffs

The coordination layer is embarrassingly simple. Three components:

1. A shared memory store — flat JSON/text files on a persistent volume. Every agent reads from it, writes to it. The memory is typed: learning, decision, feedback. Agents don't share a runtime. They share a state.

2. A backlog: a lightweight task queue (text file plus IDs). Any agent can write a task for another agent. Vision processes the backlog at 07:00 Israel time every morning and routes unassigned items.

3. A clock — Cloud Scheduler fires cron jobs at defined times. Aria runs at 06:30. Martin runs at 07:00. Rio runs at 22:00. No agent polls; they wake up, do their thing, go back to sleep.
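As a rough illustration of the typed memory store in component 1, here is a minimal sketch. It assumes one JSON object per line in a single file; the article only says "flat JSON/text files", so the file layout, the field names, and the helpers `append_entry` and `read_entries` are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional

# The three entry types named in the article.
VALID_TYPES = ("learning", "decision", "feedback")


@dataclass
class MemoryEntry:
    """One note in the shared store (field names are assumptions)."""
    agent: str  # who wrote the entry, e.g. "rio"
    type: str   # one of VALID_TYPES
    text: str   # the note itself
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_entry(store: Path, entry: MemoryEntry) -> None:
    """Append one typed entry as a single JSON line."""
    if entry.type not in VALID_TYPES:
        raise ValueError(f"unknown memory type: {entry.type!r}")
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")


def read_entries(store: Path, entry_type: Optional[str] = None) -> List[Dict]:
    """Read all entries, optionally keeping only one type."""
    if not store.exists():
        return []
    lines = store.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    if entry_type is not None:
        entries = [e for e in entries if e["type"] == entry_type]
    return entries
```

Rejecting unknown types at write time is what keeps the store queryable later: an agent that only cares about feedback can call `read_entries(store, "feedback")` instead of scanning freeform text.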

There's no message bus. No Kafka. No orchestration framework. Just files, cron, and a WhatsApp webhook. I deliberately kept the infrastructure layer boring because complexity at the infrastructure layer kills you in production.

The Three Failure Modes Nobody Warned Me About

1. Loop Amplification

I had Martin and Vision both checking the same health metric. Martin would find an issue, write it to memory. Vision would read the memory an hour later, find the same issue, write it again. Martin would read that entry, think it was a new issue, write a third entry.

By end of day I had 14 "critical alerts" about a single 200ms latency spike that had already resolved itself.

The fix: every memory entry gets a unique ID and a resolved_at field. Before writing, an agent checks whether an unresolved entry with that ID already exists. It sounds obvious in retrospect. It cost me a full day.
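A minimal sketch of that check might look like the following. It assumes the ID is derived from the issue itself (source plus metric), so that two agents looking at the same problem arrive at the same ID; the article does not say how IDs are generated, and `issue_id`, `report_issue`, and `resolve_issue` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def issue_id(source: str, metric: str) -> str:
    """Build a stable ID for an issue (assumed scheme, not from the article)."""
    return f"{source}:{metric}"


def report_issue(entries: List[Dict], entry_id: str, agent: str, text: str) -> bool:
    """Write an entry only if no unresolved entry with this ID exists.

    Returns True when a new entry was written, False when it was skipped.
    """
    for e in entries:
        if e["id"] == entry_id and e["resolved_at"] is None:
            return False
    entries.append(
        {"id": entry_id, "agent": agent, "text": text, "resolved_at": None}
    )
    return True


def resolve_issue(
    entries: List[Dict], entry_id: str, when: Optional[datetime] = None
) -> int:
    """Stamp resolved_at on open entries with this ID; return how many changed."""
    stamp = (when or datetime.now(timezone.utc)).isoformat()
    changed = 0
    for e in entries:
        if e["id"] == entry_id and e["resolved_at"] is None:
            e["resolved_at"] = stamp
            changed += 1
    return changed
```

With this shape, Vision re-reading Martin's alert produces no second entry, while a genuinely new spike after resolution still gets recorded.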

2. Scope Creep at the Agent Level

Aria started writing articles. Then Aria started trying to do SEO audits. Then Aria started creating LinkedIn posts because "I'm already writing content." Three weeks later, Aria's definition file was 4,000 words and her outputs were inconsistent.

The rule I enforce now: every agent has a one-sentence job description. If Aria can't justify an action in terms of that one sentence, she routes it to the backlog instead of doing it herself. Jams does LinkedIn. That's it. Aria does not touch LinkedIn.
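The rule above relies on the agent judging its own action against a one-sentence description. A cruder, deterministic stand-in is an explicit allowlist per agent, sketched below. The action names, the `AGENT_SCOPES` table, and the backlog record shape are illustrative assumptions, not the system's real configuration.

```python
from typing import Dict, List, Set

# Hypothetical allowlist: which actions each agent may perform itself.
AGENT_SCOPES: Dict[str, Set[str]] = {
    "aria": {"write_article"},
    "jams": {"linkedin_post"},
}


def dispatch(agent: str, action: str, backlog: List[Dict]) -> str:
    """Run in-scope actions; push everything else to the backlog unassigned."""
    if action in AGENT_SCOPES.get(agent, set()):
        return "execute"
    # Out of scope: record the request so the orchestrator can route it.
    backlog.append({"requested_by": agent, "action": action, "assignee": None})
    return "routed_to_backlog"
```

An allowlist is blunter than a judgment call, but it fails closed: anything not listed ends up in the backlog rather than being done by the wrong agent.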

Agents that try to do everything become agents that do nothing well.

3. Missing Clock Awareness

This one is subtle. Agents don't have a real-time clock unless you give them one. Claude (the underlying model) knows it was trained up to a certain date, but it doesn't know today's date when it runs.

For Aria, this matters a lot. An article with publishedAt: "2026-01-01" when it's actually June looks stale to Google's freshness signals. I learned this by checking the frontmatter on three early articles and finding they all had wrong publication dates.

The fix: the cron scheduler injects a CURRENT_DATE environment variable into every Cloud Run invocation. Every agent reads it at startup and treats it as authoritative. Simple. It took 20 minutes to implement, and I lost a week before I figured out why I needed it.
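On the agent side, reading that variable could be as small as the sketch below. It assumes CURRENT_DATE is an ISO date such as 2026-06-12 (the article does not state the format), and the helper names are hypothetical. The sketch refuses to run without the variable instead of falling back to a guess.

```python
import os
from datetime import date
from typing import Mapping, Optional


def current_date(env: Optional[Mapping[str, str]] = None) -> date:
    """Return the injected date; fail loudly if it is missing."""
    env = os.environ if env is None else env
    raw = env.get("CURRENT_DATE")
    if not raw:
        raise RuntimeError("CURRENT_DATE is not set; refusing to guess the date")
    # Assumes YYYY-MM-DD; fromisoformat raises ValueError on anything else.
    return date.fromisoformat(raw)


def frontmatter_date(env: Optional[Mapping[str, str]] = None) -> str:
    """Render the publishedAt line for an article's frontmatter."""
    return f'publishedAt: "{current_date(env).isoformat()}"'
```

Passing `env` explicitly makes the function easy to test; in production the default reads from the process environment.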

What I'd Do Differently

I'd define agent boundaries before writing a single line of agent instructions. I went in the other direction: I shipped agents fast and then tried to carve out their territories afterward. That's like hiring seven people and then writing their job descriptions a month later. The overlap is brutal.

I'd also build the shared memory schema on day one. The three memory types (learning, decision, feedback) were added midway through, which meant all the early entries were untyped and unqueryable. I have perfectly good production insights from April 2025 that are effectively lost because they're buried in freeform text.

And I'd wire a duplicate-detection gate to every content-producing agent from the start. Aria shipped three articles about Claude Code skills before I caught it. Now there's a hard list of covered keywords that gets checked before any draft starts. That gate has saved me at least six embarrassing duplicates since June.
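A keyword gate of that kind can be sketched in a few lines. The normalization rule (lowercase, punctuation collapsed to spaces) and the function names are assumptions; the real gate may match topics more loosely than exact keywords.

```python
import re
from typing import Iterable, List


def normalize(keyword: str) -> str:
    """Lowercase and collapse punctuation so near-identical keywords match."""
    return re.sub(r"[^a-z0-9]+", " ", keyword.lower()).strip()


def find_duplicates(draft_keywords: Iterable[str], covered: Iterable[str]) -> List[str]:
    """Return the draft keywords that already appear in the covered list."""
    covered_set = {normalize(k) for k in covered}
    return [k for k in draft_keywords if normalize(k) in covered_set]


def gate(draft_keywords: Iterable[str], covered: Iterable[str]) -> str:
    """Reject the draft before any writing starts if a keyword is covered."""
    return "rejected" if find_duplicates(draft_keywords, covered) else "approved"
```

Exact matching after normalization catches "claude-code skills" against "Claude Code skills" but not paraphrases, so a human review step for rejected drafts still earns its keep.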

The Part That Actually Works Remarkably Well

The shared memory layer, when used correctly, makes agents genuinely smarter over time. Rio scans for competitive signals at night. She writes a learning entry: "Gartner published a new report on agent sprawl — enterprises deploying 50+ agents seeing 40% coordination overhead." Next morning, Jams reads that entry and uses it as the hook for a LinkedIn post. Two days later, Aria reads Jams's post feedback and notices which framing resonated. That feedback becomes a learning entry. Aria's next article benefits from it.

This is not magic. It's just good note-taking, systematized. But systematized note-taking that runs automatically every day, without anyone having to remember to do it, turns out to be surprisingly powerful.

LLM Cost Lens and ctxauditor — Why Production Numbers Matter

Two of my prototypes live at the intersection of multi-agent systems and observability. LLM Cost Lens (live at /prototypes/llm-cost-lens) was built because I was spending $340/month on API calls without knowing which agent was responsible. Three agents were burning 70% of the budget. The tool gives you a per-agent, per-model cost breakdown in real time.
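The core of a per-agent, per-model breakdown is a simple aggregation over call logs. The sketch below is not LLM Cost Lens itself; the log record fields, the model names, and the prices in `PRICES` are placeholders chosen only to make the arithmetic visible.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per million tokens: (input, output).
PRICES: Dict[str, Tuple[float, float]] = {
    "model-a": (3.0, 15.0),
    "model-b": (0.5, 2.0),
}


def cost_breakdown(calls: Iterable[Dict]) -> Dict[Tuple[str, str], float]:
    """Sum cost per (agent, model) from logged token counts."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for c in calls:
        price_in, price_out = PRICES[c["model"]]
        cost = (c["input_tokens"] * price_in + c["output_tokens"] * price_out) / 1_000_000
        totals[(c["agent"], c["model"])] += cost
    return dict(totals)


def share_by_agent(breakdown: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Convert the breakdown into each agent's fraction of total spend."""
    per_agent: Dict[str, float] = defaultdict(float)
    for (agent, _model), cost in breakdown.items():
        per_agent[agent] += cost
    total = sum(per_agent.values())
    return {a: c / total for a, c in per_agent.items()} if total else {}
```

The only hard requirement is that every API call is logged with the agent's name attached; without that tag, the monthly bill stays a single opaque number.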

ctxauditor (/prototypes/ctxauditor) was built because context windows were filling up in weird ways: a conversation that should have been 8,000 tokens came in at 40,000. The tool visualizes what's actually in the context and flags redundant memory reads. I use it weekly.
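To make "redundant memory reads" concrete, here is a rough sketch of one way to flag them: hash each context segment and report exact repeats. This is an illustration of the idea, not how ctxauditor works internally. The segment shape is hypothetical, and the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer.

```python
import hashlib
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def audit(segments: List[Dict]) -> Dict:
    """Flag segments whose text already appeared earlier in the context."""
    first_seen: Dict[str, int] = {}
    redundant = []  # (index of repeat, index of first occurrence)
    total = wasted = 0
    for i, seg in enumerate(segments):
        tokens = estimate_tokens(seg["text"])
        total += tokens
        digest = hashlib.sha256(seg["text"].encode("utf-8")).hexdigest()
        if digest in first_seen:
            redundant.append((i, first_seen[digest]))
            wasted += tokens
        else:
            first_seen[digest] = i
    return {"total_tokens": total, "wasted_tokens": wasted, "redundant": redundant}
```

Exact-match hashing only catches verbatim repeats, such as the same memory file being loaded twice; overlapping or paraphrased content would need a fuzzier comparison.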

Neither tool exists because I planned them. Both exist because production broke in a specific way and I needed to see what was happening. That's the honest origin story of most useful developer tools.

What I'd Tell Someone Starting Today

Start with two agents, not seven. One that does research. One that does output. Get them talking to each other through a shared text file. Ship that. See what breaks. Add a third agent when you have a clear domain that neither of the first two should own.

The agents themselves are the easy part — Claude is remarkably good at staying in a role when you give it clear instructions and context. The hard part is the coordination layer: who owns what, how do they hand off, and what happens when two agents contradict each other.

That last one, contradictions, is still unsolved for me. I have a manual escalation path to Vision, who brings the contradiction to me for a decision. It works. But it's also a bottleneck. I'm building an automated conflict resolution protocol now. Ask me in three months how that went.

FAQ

What is a multi-agent system?

A multi-agent system is an architecture where multiple AI agents, each with a defined role and domain, work together toward a shared goal. Each agent handles its own context and tools. Coordination happens through shared memory, task queues, or message passing — not by combining everything into a single large prompt.

How many agents do you need to start building a multi-agent system?

Two. A research agent and an output agent is a complete, functional multi-agent system. Start there. Add a third agent only when you have a clearly distinct domain that neither of the first two should own. Seven agents was the result of two years of adding agents one at a time, not a design I started with.

What is the difference between a multi-agent system and a single large agent?

A single agent handles everything in one context window — which gets polluted fast as the domain expands. A multi-agent system keeps contexts narrow and roles accountable. The tradeoff: you gain consistency and specialization, but you introduce coordination overhead. That overhead only pays off past a certain complexity threshold.

What infrastructure do you need for a multi-agent system?

Less than you think. My system runs on Google Cloud Run (serverless, scales to zero), Cloud Scheduler for cron triggers, and Green API for the WhatsApp channel. The shared memory is flat files on a persistent volume. No Kubernetes, no message bus. Total infrastructure cost, not counting model API calls: under $40/month.

What is agent orchestration?

Agent orchestration is the process of coordinating multiple AI agents — deciding which agent handles which task, how agents hand off work, and what happens when agents conflict. In my system, Vision is the orchestrator: it reads the backlog every morning and routes tasks to the right agent.
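The morning routing pass can be sketched as a lookup from task kind to owner. The `kind` field, the `ROUTES` table, and the task record shape are assumptions made for illustration; the owners follow the agent domains listed earlier in the article. Anything without an obvious owner is returned for escalation rather than guessed at.

```python
from typing import Dict, List

# Hypothetical mapping from task kind to owning agent.
ROUTES: Dict[str, str] = {
    "article": "aria",
    "deploy": "martin",
    "linkedin": "jams",
    "invoice": "albert",
    "prototype": "ben",
    "research": "rio",
}


def route_backlog(tasks: List[Dict]) -> List[str]:
    """Assign unowned tasks in place; return IDs that need a human decision."""
    escalated = []
    for task in tasks:
        if task.get("assignee"):
            continue  # already owned, leave it alone
        owner = ROUTES.get(task["kind"])
        if owner is None:
            escalated.append(task["id"])
        else:
            task["assignee"] = owner
    return escalated
```

Keeping the routing table as plain data means adding an agent is a one-line change, and the escalation list doubles as a signal that a domain has no owner yet.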

How do you prevent agents from doing duplicate work?

With a dedup gate: before any content-producing agent starts work, it checks the incoming task against a published index of covered topics. If the keyword or topic is already covered, the task is rejected and sent back to the backlog for human review. For other work types, unique IDs on every memory entry prevent loop amplification.

Can multi-agent systems run autonomously without human oversight?

Partially. Routine tasks — daily articles, health checks, social posts — run fully autonomously. Strategic decisions and contradictions between agents get escalated to a human (me, via WhatsApp) before execution. Full autonomy on everything is a goal, not a current reality. The escalation path is what makes the autonomy safe.

What is the biggest mistake people make when building multi-agent systems?

Defining agent boundaries after the fact. If you ship agents fast and carve out territories later, you get overlap, contradictions, and agents doing each other's jobs inconsistently. Write each agent's one-sentence job description before you write a single instruction. If you can't write it in one sentence, the agent's scope is too broad.
