Script First, AI Second: How We Run 20 Agents for $1 a Day

Our first attempt at building AI agents cost us $150 in four hours.
That is not a typo. One hundred and fifty dollars. In an afternoon. On a system that was supposed to save us money.
The mistake was obvious in hindsight. We had handed the AI model every decision. Every API call, every data formatting step, every routing choice, every classification task. The model was brilliant at all of it. It was also wildly expensive at all of it, because we were paying premium token rates for work that a Python script could do for free.
That afternoon changed how we build everything. Twelve months later, we run 20 agents across three OpenClaw instances for about $1 a day in AI credits. Same quality of output. Same operational coverage. Ninety-nine percent cheaper.
Here is the architecture that got us there.
The Expensive Lesson
When you first start building AI agents, the temptation is to let the AI do everything. It feels like magic. You describe what you want in plain English, the model figures out how to do it, and the result appears. No coding. No flowcharts. No debugging. Just vibes and tokens.
The problem is that most of what an agent does on a daily basis is not reasoning. It is plumbing.
Checking whether a team member logged time today is not an AI problem. It is a database query. Formatting a Slack message with the results is not an AI problem. It is string concatenation. Sending a reminder at 3:30 PM is not an AI problem. It is a cron job. Pulling contact information from Hunter.io, validating it with ZeroBounce, and organizing it in a structured format is not an AI problem. It is a series of API calls with error handling.
When we let the AI model handle all of that, we were paying for reasoning on tasks that required zero reasoning. Every token spent on "look up this API endpoint and format the JSON response" was a token wasted on something a ten-line Python function could do faster, cheaper, and more reliably.
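To make that concrete, here is a minimal sketch of the scripted path for the time tracking check described above. The field names and function names are hypothetical; the point is that nothing in it needs a model:

```python
def missing_time_entries(entries: list[dict], team: list[str], today: str) -> list[str]:
    """Return team members with no time logged for the given date.
    Assumes `entries` are rows like {"user": ..., "date": ...} from a DB query."""
    logged = {entry["user"] for entry in entries if entry["date"] == today}
    return sorted(member for member in team if member not in logged)

def format_reminder(missing: list[str]) -> str:
    """Build the reminder text. Pure string work, zero tokens."""
    if not missing:
        return "Everyone has logged time today."
    return f"Reminder: no time logged today for {', '.join(missing)}."
```

Wire that to a cron entry at 3:30 PM and the entire compliance check runs end to end without spending a single token.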
The $150 afternoon was the cost of learning that lesson. It was worth every penny because we never made that mistake again.
The Architecture: Scripts Handle the Predictable, AI Handles the Unpredictable
Every agent we build now follows the same pattern. We start in Claude Code and build out the core functionality as deterministic Python scripts. The logic is explicit. The behavior is predictable. There are no hallucinations because there is no model involved in the mechanical parts.
Then we bring AI in surgically, only for the parts that genuinely require intelligence.
Here is how that breaks down in practice:
Scripted (zero AI cost):
- API calls to ClickUp, Front, Gmail, Google Drive, HubSpot, and every other integration
- Data formatting, cleaning, and transformation
- Routing logic (which agent handles which type of request)
- Scheduling and cron jobs (morning briefings, end-of-day summaries, compliance checks)
- SLA timer calculations and escalation triggers
- Spam classification using rule-based pattern matching
- File organization and folder management
- Health checks and service monitoring
- Log aggregation and audit trail generation
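The routing bullet above is a good example of how little code the scripted side needs. A sketch, with made-up keywords and agent names rather than our production routing table, might look like:

```python
# Ordered keyword rules; first match wins. Keywords and agent names
# are illustrative, not our actual routing configuration.
ROUTES = [
    ("invoice", "billing_agent"),
    ("contract", "legal_agent"),
    ("meeting", "scheduling_agent"),
]

def route(subject: str) -> str:
    """Deterministic routing: no model call, no token cost."""
    lowered = subject.lower()
    for keyword, agent in ROUTES:
        if keyword in lowered:
            return agent
    # Only genuinely ambiguous requests fall through to an AI-backed triage step.
    return "triage_agent"
```

Every request that matches a rule is routed for free; the model only sees the leftovers.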
AI-powered (costs tokens, but worth it):
- Drafting email replies that match a specific person's writing style
- Analyzing meeting transcripts for action items, sentiment, and risk signals
- Generating briefings that synthesize information from multiple sources
- Evaluating whether a prospect contact is a genuine decision maker
- Diagnosing novel system failures that do not match known patterns
- Content generation (blog drafts, summaries, reports)
- Complex classification tasks where rules cannot capture the nuance
The ratio in our system is roughly 80/20. Eighty percent of what our agents do every day is scripted. Twenty percent involves an AI model. That ratio is why the daily cost dropped from "this will bankrupt us" to about a dollar.
Tiered Models: Not All AI Tasks Are Equal
Even within the 20% that requires AI, not every task needs the same model.
We run a tiered routing strategy that matches the complexity of the task to the cost of the model:
Free (local models via Ollama): Summarizing text, cleaning scraped data, chunking documents for our knowledge base, basic classification, embedding generation. We run Qwen 2.5 and nomic-embed-text locally on the same Mac Studio that hosts everything else. These models handle thousands of operations a day at zero marginal cost. When Qwen 3.5 dropped recently, we found it performing at roughly the level of mid-tier cloud models from six months ago. That is a significant step up for a model running on local hardware for free.
Cheap (Gemini Flash): Standard classification, email triage, template-driven content fills, and any task where speed matters more than nuance. Costs as low as $0.30 per million input tokens. We use this tier for high-volume, moderate-complexity work that needs more intelligence than a local model but does not justify a premium model.
Mid-tier (Gemini Pro): Email drafts, meeting briefings, coordination across agents, and tasks that require genuine reasoning but are not client-facing. This is the workhorse tier for most of the AI-powered 20%.
Premium (reserved for high-stakes output): Client-facing email drafts, complex diagnostic analysis, and anything where getting it wrong has real consequences. Used sparingly and only after a cheaper model has already compressed and organized the context.
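In code, the routing between tiers can be as simple as a lookup keyed on task type and stakes. This is an illustrative sketch with invented task-type names, not our exact rules:

```python
# Task-type sets are examples, not an exhaustive production taxonomy.
FREE_TASKS = {"summarize", "clean", "chunk", "embed", "basic_classify"}
CHEAP_TASKS = {"triage", "classify", "template_fill"}

def pick_tier(task_type: str, client_facing: bool) -> str:
    """Choose the cheapest tier that can handle the task."""
    if client_facing:
        return "premium"      # high-stakes output only
    if task_type in FREE_TASKS:
        return "free"         # local models via Ollama
    if task_type in CHEAP_TASKS:
        return "cheap"        # e.g. Gemini Flash
    return "mid"              # e.g. Gemini Pro, the workhorse tier
```

The default lands on the mid tier; nothing reaches the premium tier unless the stakes justify it.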
The core principle: cheap models gather and organize. Expensive models judge and create.
Context Distillation: The Hidden Cost Multiplier
The single biggest cost driver in AI agent systems is not which model you use. It is how much context you send it.
A typical agent task might involve: the current email thread (20 messages), the client's communication history (dozens of previous interactions), relevant meeting transcripts, the client's project status from ClickUp, and any previous corrections the human made to similar drafts. If you feed all of that raw to a premium model, you are paying for the model to read through pages of context before it even starts thinking about the actual task.
Our system distills context before it reaches the expensive model. A cheap model (or a script, when the extraction is mechanical) reads all of that raw input and produces a focused brief: who the email is from, what they want, what the relevant history is, what tone previous corrections suggest, and what constraints apply. The premium model only sees that brief.
This keeps our premium model costs 70 to 95% lower than they would be if we fed everything raw. The quality of the output does not suffer because the distillation step preserves everything the model actually needs to do its job. It just strips out the noise.
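When the extraction is mechanical, the distillation step itself can be a script. A rough sketch, assuming a simple message schema with "sender", "subject", and "body" keys:

```python
def distill_thread(messages: list[dict], keep_last: int = 3) -> str:
    """Compress a long email thread into a short brief for the premium model.
    Assumes each message is a dict with "sender", "subject", and "body" keys."""
    latest = messages[-1]
    brief = [
        f"From: {latest['sender']}",
        f"Ask: {latest['subject']}",
        "Recent turns:",
    ]
    for message in messages[-keep_last:]:
        # Truncate long bodies; a cheap model would summarize instead of slicing.
        brief.append(f"- {message['sender']}: {message['body'][:120]}")
    return "\n".join(brief)
```

A 20-message thread collapses to a handful of lines; the premium model reads the brief, not the archive.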
If you are running agents and your token costs are climbing, look at your context sizes before you look at your model choices. Compressing input is almost always a bigger lever than downgrading models.
The Alternative Architecture: AI-Driven SOPs
Our approach is not the only way to do this.
We have spoken with another agency running a comparable number of agents on OpenClaw who took a fundamentally different path. Instead of scripting the deterministic logic, they built their agents around AI-driven standard operating procedures. Each agent receives a detailed SOP: "this is how we handle time tracking compliance," "this is what happens after a meeting ends," "this is the process for triaging a new client email." The AI interprets and executes against those instructions, making more autonomous decisions within the boundaries of the SOP.
Their agents write daily self assessments. They document what worked, what failed, and what they would do differently. Over time, those assessments feed back into the system and the agents genuinely improve. One of their agent logs noted: "Attempted time entry creation with non existent properties without schema validation first. Should have tested single record before batch operation to catch property errors." That is a real learning loop.
The tradeoffs are clear:
The SOP approach gives you flexibility and self improvement. Agents can adapt to novel situations without new code. The self assessment loop means they get better over time. And the SOP format is accessible to non developers who can read and edit plain English instructions.
The scripted approach gives you predictability and cost control. Agents do exactly what you told them to do, every time. There are no surprise behaviors, no token spikes from an agent deciding to "think harder" about a routine task, and no risk of the model misinterpreting an SOP in a creative but wrong way.
Both approaches produce working agency operations systems. Both serve real clients. The right choice depends on your tolerance for unpredictability, your budget for AI credits, and whether you have someone on the team who can write Python.
We chose scripts because we wanted the lowest possible operating cost and the highest possible predictability. We are not against the SOP approach. We just learned the hard way that when you give AI models more decision surface, you pay for it in tokens and in the occasional surprise.
What This Means for Your Agency
If you are thinking about building AI agents for your agency, here are the practical takeaways:
Start with the scripts, not the AI. Before you involve any model, ask yourself: does this task actually require intelligence, or does it require execution? If a series of API calls and some conditional logic can handle it, script it. Save the AI for the parts where a human would need to think.
Tier your models like you tier your team. You would not assign a senior strategist to format a spreadsheet. Do not assign a premium AI model to summarize a meeting transcript. Match the cost of the model to the complexity of the task.
Compress before you spend. Every token you send to a premium model costs money. Distill your context first. A cheap model or a simple script can extract the relevant information and throw away the noise before the expensive model ever sees it.
Measure your actual costs from day one. We track token usage, cost per agent, cost per task, and cost per day across every instance. If you do not measure it, you cannot optimize it. The dashboard does not need to be fancy. A daily log that shows which agents consumed how many tokens is enough to spot the problems.
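The tracking itself does not need more than a list and an append. A minimal sketch; the Flash rate matches the figure quoted earlier, and a real price table would come from your providers' pricing pages:

```python
from collections import defaultdict

# $ per million input tokens. Only the Flash rate is grounded in this post;
# fill in the rest from your providers' published pricing.
PRICE_PER_M = {"gemini-flash": 0.30}

class CostLog:
    """Daily token and cost log, enough to spot which agents are expensive."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def record(self, agent: str, model: str, input_tokens: int) -> None:
        cost = (input_tokens / 1_000_000) * PRICE_PER_M.get(model, 0.0)
        self.rows.append({"agent": agent, "model": model,
                          "tokens": input_tokens, "cost": cost})

    def cost_per_agent(self) -> dict[str, float]:
        totals: dict[str, float] = defaultdict(float)
        for row in self.rows:
            totals[row["agent"]] += row["cost"]
        return dict(totals)
```

Dump `cost_per_agent()` to a daily log and the token spikes show up the day they start, not on the invoice.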
Do not be afraid of the hybrid. Some agents in our system are 95% scripted with a tiny AI component for the one step that needs reasoning. Others are more balanced. There is no rule that says every agent has to follow the same architecture. Match the approach to the task.
The $150 afternoon felt like a disaster at the time. In retrospect, it was the most valuable four hours we spent on the entire project. It forced us to think about AI agents as an engineering problem, not a prompting problem. And that shift in thinking is the difference between a system that costs a fortune to run and one that costs a dollar a day.
AgencyBoxx runs 50+ services on dedicated hardware for a fraction of what most agencies spend on a single SaaS subscription. Book a Walkthrough to see the architecture in action.