
Agentic systems: human-in-the-loop done right

A practical pattern for SMBs shipping LLM workflows without accepting 3-15% silent error rates.

2026-04-17 · 8 min read · AISO-DEV · agentic-systems · playbook · human-in-the-loop

TL;DR

Current LLM error rates on real business data run 3-15% depending on task.
Shipping fully autonomous agents to SMBs on day one is how AI programs blow up.
The pattern: input → LLM + tools → human review queue → action. Log every step.
Auto-approve only after an observed workflow has stable accuracy over a meaningful sample.
Build the review UI on day one. It’s not optional - it’s the product.

Why we don’t ship fully autonomous agents

LLMs hallucinate. Retrieval misses. Tools fail. The error rate on serious production data is somewhere between 3% and 15% - and that’s with careful prompting.

For a 10-person SMB processing 200 support tickets a week, a 5% error rate means 10 bad replies a week. That’s a fire. Worse - the errors are invisible. By the time you notice, you’ve trained your customers to distrust you. That risk is exactly what an AI readiness audit is built to surface before code gets written.
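The arithmetic behind that figure, extended across the 3-15% range quoted above, as a small sketch. The function name is illustrative, and the 200-tickets-a-week volume is the example from the paragraph, not a benchmark.

```python
def expected_bad_replies(tickets_per_week: int, error_rate: float) -> float:
    """Expected number of wrong replies per week at a given error rate."""
    return tickets_per_week * error_rate


# Same ticket volume as the example above, at the low, middle, and high end.
for rate in (0.03, 0.05, 0.15):
    bad = expected_bad_replies(200, rate)
    print(f"{rate:.0%} error -> about {bad:.0f} bad replies per week")
```

Even the low end of the range is several wrong replies every week, which is why the review step comes first.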

So we don’t ship “the AI handles it.” We ship “the AI drafts it, a human ships it, and we measure.” Then, where the data supports it, we let specific branches auto-approve. This is the spine of how we approach AI agent development on every SMB engagement.

The pattern (five parts)

1. Input gate

Every agent has one job. Not “handle customer issues.” Rather: “triage inbound support tickets into [billing / technical / sales / other] and draft a reply in the right category.”

Narrow the input. Narrow the output. Everything after this step gets easier.
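One way to make "narrow" concrete is to type both ends of the job. The sketch below assumes a ticket arrives as a plain dict. The field names, the `gate` function, and the length limit are hypothetical choices for illustration, not part of any particular helpdesk API.

```python
from dataclasses import dataclass
from enum import Enum


class Category(str, Enum):
    """The only labels the agent is allowed to output."""
    BILLING = "billing"
    TECHNICAL = "technical"
    SALES = "sales"
    OTHER = "other"


@dataclass(frozen=True)
class TicketInput:
    ticket_id: str
    customer_id: str
    body: str


MAX_BODY_CHARS = 8_000  # assumed limit; tune to your own tickets


def gate(raw: dict) -> TicketInput:
    """Accept only well-formed inbound tickets; reject everything else early."""
    missing = [k for k in ("ticket_id", "customer_id", "body") if not raw.get(k)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    body = str(raw["body"]).strip()
    if not body:
        raise ValueError("empty ticket body")
    if len(body) > MAX_BODY_CHARS:
        raise ValueError("ticket body too long for this agent")
    return TicketInput(str(raw["ticket_id"]), str(raw["customer_id"]), body)
```

Anything that fails the gate goes to a human untouched rather than into the model.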

2. LLM step (with tools)

The model does the reasoning and optionally calls tools:

Search your knowledge base.
Look up the customer record.
Read the last N tickets from this customer.
Check inventory, order status, whatever’s relevant.

The output is a structured draft - never a free-text reply. Structured output forces the model to commit to fields you can validate.

Example for support triage:

{
  "category": "billing",
  "sub_reason": "failed_payment",
  "confidence": 0.86,
  "draft_reply": "…",
  "recommended_action": "resend_invoice",
  "sources": ["ticket_1023", "customer_482.billing_history"]
}
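A minimal sketch of validating that draft before it reaches the queue, using only the standard library. The field names come from the example above. The specific checks (allowed categories, confidence between 0 and 1, at least one source) are assumptions about what you would want to enforce.

```python
import json
from dataclasses import dataclass

ALLOWED_CATEGORIES = {"billing", "technical", "sales", "other"}


@dataclass(frozen=True)
class Draft:
    category: str
    sub_reason: str
    confidence: float
    draft_reply: str
    recommended_action: str
    sources: list[str]


def parse_draft(raw_json: str) -> Draft:
    """Parse model output and fail loudly if it is not a usable draft."""
    data = json.loads(raw_json)
    draft = Draft(**data)  # raises TypeError on missing or unexpected fields
    if draft.category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {draft.category!r}")
    if not isinstance(draft.confidence, (int, float)) or not 0.0 <= draft.confidence <= 1.0:
        raise ValueError("confidence must be a number between 0 and 1")
    if not draft.draft_reply.strip():
        raise ValueError("empty draft reply")
    if not draft.sources:
        raise ValueError("draft cites no sources")
    return draft
```

A draft that fails validation is a logged failure, not something for the reviewer to untangle.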

3. Review queue (this is where people get it wrong)

The draft lands in a review UI, not an inbox.

Rules for the review UI:

One item on screen at a time. Not a list.
Big, visible Approve / Edit / Reject buttons.
The draft pre-populated in the actual tool (the helpdesk reply field, the CRM note, the email template).
Source citations inline - the reviewer can click the ticket, the KB article, the customer record without leaving the queue.
One-second keyboard shortcuts for approve / reject.
Logged decision + reviewer ID + timestamp.

The metric that matters: time to approve. If it’s under 15 seconds per item, people stay in flow. Over 45 seconds, they abandon the queue. Build the UI to the sub-15-second bar.
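A small sketch of tracking that metric against the two bars. Using the median, and the three status labels, are assumptions made here for illustration. The 15- and 45-second thresholds are the ones stated above.

```python
from statistics import median

FLOW_BAR_SECONDS = 15.0     # under this, reviewers stay in flow
ABANDON_BAR_SECONDS = 45.0  # over this, reviewers abandon the queue


def review_speed(seconds_per_item: list[float]) -> str:
    """Classify a reviewer session by its median time to approve."""
    if not seconds_per_item:
        raise ValueError("no review timings recorded")
    typical = median(seconds_per_item)
    if typical < FLOW_BAR_SECONDS:
        return "in_flow"
    if typical > ABANDON_BAR_SECONDS:
        return "at_risk"
    return "needs_work"
```

The median is used so that one interrupted review does not distort the picture. A percentile would work just as well.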

4. Action + logging

Approved drafts execute (send the email, update the CRM, post the Slack message). Rejected drafts log the reject reason. Edited drafts log the diff - that’s training data for the next prompt iteration.

Everything writes to a single event log you can query:

agent_id | input_id | prompt_version | output | reviewer | decision | latency | cost

You need this log because the answer to “is the agent working?” is not a vibe - it’s a 30-day accuracy chart filtered by reviewer.
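A sketch of that log as a single SQLite table, with the per-reviewer query behind the 30-day chart. The columns mirror the line above. The `decided_at` timestamp, the millisecond and USD units, and the three decision labels are assumptions added so the query can work. The values in the usage lines are placeholders.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_events (
    agent_id       TEXT NOT NULL,
    input_id       TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    output         TEXT NOT NULL,
    reviewer       TEXT NOT NULL,
    decision       TEXT NOT NULL CHECK (decision IN ('approved', 'edited', 'rejected')),
    latency_ms     INTEGER NOT NULL,
    cost_usd       REAL NOT NULL,
    decided_at     TEXT NOT NULL
)
"""


def log_event(conn, agent_id, input_id, prompt_version, output,
              reviewer, decision, latency_ms, cost_usd, decided_at):
    """Append one review decision to the event log."""
    conn.execute(
        "INSERT INTO agent_events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (agent_id, input_id, prompt_version, output, reviewer,
         decision, latency_ms, cost_usd, decided_at.isoformat()),
    )


def approval_rate_by_reviewer(conn, now, days=30):
    """Share of drafts approved as-is, per reviewer, over the trailing window."""
    since = (now - timedelta(days=days)).isoformat()
    rows = conn.execute(
        """
        SELECT reviewer, AVG(decision = 'approved'), COUNT(*)
        FROM agent_events
        WHERE decided_at >= ?
        GROUP BY reviewer
        """,
        (since,),
    ).fetchall()
    return {reviewer: (rate, n) for reviewer, rate, n in rows}


# Usage with placeholder values; all timestamps are UTC so ISO strings sort correctly.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
now = datetime(2026, 4, 17, tzinfo=timezone.utc)
log_event(conn, "triage", "ticket_1023", "v3", "{}", "reviewer_a",
          "approved", 900, 0.004, now - timedelta(days=1))
print(approval_rate_by_reviewer(conn, now))
```

The same table works unchanged in Postgres apart from the connection and parameter style.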

5. Evaluation + auto-approval

After two to four weeks of shadow running, you have enough decisions to compute per-branch accuracy.

Category classification: 98% accurate across 500 items → auto-approve the category, still review the draft reply.
Draft reply quality: reviewer approves as-is 92% of the time on “failed_payment” sub-reason → auto-approve that narrow branch with spot-check sampling.
Another sub-reason at 78%: keep in review. Work on the prompt.

Auto-approval is a decision driven by data, not aspiration. And it’s always reversible - a single failure pattern can put a branch back into review in seconds.
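The gate itself can be a few lines. The sketch below assumes you already count reviewed and approved-as-is items per branch. The 90% accuracy bar, the 500-item minimum, and the 5% spot-check rate are illustrative defaults, not recommendations. Set them to whatever your own data supports.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchStats:
    reviewed: int        # items a human has reviewed on this branch
    approved_as_is: int  # of those, approved without edits

    @property
    def accuracy(self) -> float:
        return self.approved_as_is / self.reviewed if self.reviewed else 0.0


def can_auto_approve(stats: BranchStats, min_accuracy: float = 0.90,
                     min_sample: int = 500) -> bool:
    """A branch qualifies only with enough volume AND enough accuracy."""
    return stats.reviewed >= min_sample and stats.accuracy >= min_accuracy


def route(stats: BranchStats, spot_check_rate: float = 0.05, rng=random) -> str:
    """Decide where the next draft on this branch goes."""
    if not can_auto_approve(stats):
        return "review"
    # Qualified branches still send a random slice to a human.
    return "spot_check" if rng.random() < spot_check_rate else "auto_approve"
```

Because the decision is recomputed from the stats each time, a branch whose accuracy drops falls back into review without any manual switch.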

What a week looks like

Monday. 200 tickets through the agent over the weekend. Reviewer queue at 160 (auto-approved 40 billing-resends). Reviewer clears the queue in under 90 minutes.

Friday. Weekly review of the evaluation dashboard. Accuracy has dropped on two sub-reasons (a new seasonal SKU is confusing the model). The prompt gets a context update. A/B test the update against last week’s prompt on the shadow set before rolling it to production.
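A sketch of that Friday comparison, assuming the shadow set is a list of (ticket text, reviewer-confirmed category) pairs and each prompt version is wrapped in a function that returns a category. `accuracy`, `should_roll_out`, and `min_gain` are hypothetical names for illustration.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]      # (ticket text, reviewer-confirmed category)
Predict = Callable[[str], str]  # the agent running one prompt version


def accuracy(predict: Predict, shadow_set: Sequence[Example]) -> float:
    """Fraction of shadow-set items where the prediction matches the reviewer."""
    if not shadow_set:
        raise ValueError("shadow set is empty")
    hits = sum(1 for text, expected in shadow_set if predict(text) == expected)
    return hits / len(shadow_set)


def should_roll_out(current: Predict, candidate: Predict,
                    shadow_set: Sequence[Example], min_gain: float = 0.0) -> bool:
    """Roll out only if the candidate beats the current prompt by more than min_gain."""
    gain = accuracy(candidate, shadow_set) - accuracy(current, shadow_set)
    return gain > min_gain
```

On a small shadow set a tiny gain can be noise, so a non-zero `min_gain` is a reasonable guard.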

That’s the loop. Boring, measurable, and safe.

Common mistakes we see

Review UI as a spreadsheet. A list of 400 rows with a “Send” button is not a review UI. It’s a guilt generator. Reviewers tune out. Quality drops. You blame “the agent.”
No structured output. The model returns a paragraph, the reviewer has to read the whole thing, decide, maybe rewrite. Latency per item climbs. Build the structure.
No logging. You can’t tune what you can’t measure. Every approval, reject, and edit writes to the log. Non-negotiable.
One prompt, frozen. Prompts drift. Models change. Workflows evolve. Version your prompts like code.
Auto-approving too early. 200 items is not a sample. 2,000 with consistent accuracy across sub-branches is.
Ignoring cost. Per-call cost matters at scale. Log it. Pick the cheapest model that hits the quality bar per branch.
Shipping the dashboard later. The dashboard is the product for the operator. Ship it week one.
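The cost point lends itself to a short sketch: given per-branch results from the log, pick the cheapest model that clears the bar. The model labels and numbers in the usage lines are made-up placeholders, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelResult:
    model: str                # a label from your own log
    accuracy: float           # approved-as-is rate on this branch
    cost_per_call_usd: float  # average logged cost per call


def cheapest_passing(results: Sequence[ModelResult],
                     quality_bar: float) -> Optional[ModelResult]:
    """Return the lowest-cost model meeting the bar, or None if none does."""
    passing = [r for r in results if r.accuracy >= quality_bar]
    return min(passing, key=lambda r: r.cost_per_call_usd) if passing else None


# Placeholder numbers for one branch, purely illustrative.
candidates = [
    ModelResult("small-model", accuracy=0.93, cost_per_call_usd=0.001),
    ModelResult("large-model", accuracy=0.97, cost_per_call_usd=0.020),
]
choice = cheapest_passing(candidates, quality_bar=0.92)
print(choice.model if choice else "keep in review")
```

Run it per branch: a classification branch and a reply-drafting branch will rarely land on the same model.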

Auto-approval is the destination, not the starting line. When clients bring us in for AI implementation services, this is the curve we plan against: shadow run, accuracy gates, narrow auto-approval, then expand.

Tooling notes

Model orchestration: LangGraph, Haystack, or plain code + typed schemas. We’ve shipped all three. For under ~10 steps with clear branching, plain code + typed schemas wins.
Review UI: Astro + a simple queue component works fine. Or embed the queue inside the tool people already use (Slack, Linear, the helpdesk).
Evaluation: Braintrust, Langfuse, or a Postgres + Metabase combo. Don’t sleep on the Postgres + Metabase path - it’s cheap and honest.
Models: Claude for complex reasoning, GPT for general-purpose tasks, smaller models for classification. Don’t overpay for classification tasks.

Our own agent stack (since people ask)

The AISO Orchestrator runs 17 agent types across research, copywriting, engineering, QA, and operations. Every non-trivial action routes to either an auto-approver skill or a human. Three auto-approve stages exist (safe SEO changes, non-content code changes, clearly technical specs). Every other output gets a human sign-off - in practice, Greg or the relevant specialist. The pattern above is the pattern we run against ourselves.

Got a workflow that should be an agent? Scope an agent build →

Not sure if it’s agent-ready? Free AI Readiness Audit →
