Evaluating AI agents for business operations

How to tell a real agent from a chat box — separation of duties, evidence trails, budgeting, and the questions that filter the noise.

10 minute read


Every product launched in 2026 calls itself "agentic." Most are not. The difference matters because what you're actually buying could sit anywhere between "rebranded autocomplete" and "an autonomous worker that can move money."

This is the framework we use to tell them apart.

The two-axis map

Plot every "AI feature" on two axes:

  • Authority — does the AI just suggest, or does it act?
  • Accountability — is there a separate party that verifies the action?

That gives you four quadrants:

                     Low authority       High authority
No accountability    Copilot             Loose cannon
Verifier required    Reviewable draft    Real agent

You want products in the bottom-right. Most products are in the top-left. The top-right is what makes the news when it goes wrong.
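
The map reduces to a two-question lookup. The sketch below is illustrative only: the function name `classify` and its two boolean parameters are assumptions made for this example, not part of any product API.

```python
def classify(acts: bool, independently_verified: bool) -> str:
    """Place an AI feature on the two-axis map.

    acts: True if the AI takes actions itself (high authority),
          False if it only suggests (low authority).
    independently_verified: True if a separate party verifies the action.
    """
    quadrants = {
        (False, False): "Copilot",
        (True, False): "Loose cannon",
        (False, True): "Reviewable draft",
        (True, True): "Real agent",
    }
    return quadrants[(acts, independently_verified)]


print(classify(acts=True, independently_verified=False))  # Loose cannon
```

Answering the two questions honestly for each feature on a vendor's spec sheet is usually enough to place it.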

What real agent capability looks like

Six properties, in priority order.

1. Named roles, not "the AI"

A real agent system has roles you can name:

  • Shipper — does the work.
  • Verifier — checks it.
  • Triager — sorts the inbound.
  • Scribe — writes the artifacts.
  • Planner — sequences the work.

If the product just calls everything "the AI," ask which role you're getting. If they can't answer, the system doesn't have roles — it has one undifferentiated model with prompts.
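
One way to picture named roles is as data: each role carries its own prompt and its own list of permitted tools. This is a minimal sketch under stated assumptions. The `Role` class, the prompt text, and every tool name except `task_update` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    system_prompt: str
    allowed_tools: frozenset

    def can_call(self, tool: str) -> bool:
        """A role may only call tools on its own allowlist."""
        return tool in self.allowed_tools


# Hypothetical prompts and tool names, for illustration only.
ROLES = {
    "shipper": Role("shipper", "Do the assigned work.",
                    frozenset({"task_update", "file_write"})),
    "verifier": Role("verifier", "Check work submitted for review.",
                     frozenset({"task_update", "file_read"})),
    "triager": Role("triager", "Sort inbound items.",
                    frozenset({"inbox_list", "task_create"})),
    "scribe": Role("scribe", "Write the artifacts.",
                   frozenset({"file_write"})),
    "planner": Role("planner", "Sequence the work.",
                    frozenset({"task_create", "task_list"})),
}

print(ROLES["scribe"].can_call("task_update"))  # False
```

If a vendor cannot produce something equivalent to this table for their product, there is probably one model behind every label.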

2. Separation of duties

The shipper can't approve the shipper. The verifier is a different identity, ideally a different model or a different prompt with a different system context. This isn't a process — it's a property of the platform. The MCP server should refuse to mark work done when the actor is the same one that moved it to review.

If a single agent can finish its own work, you don't have separation of duties — you have a checkbox.
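
A structural gate of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the implementation of any particular MCP server: `Task`, `move_to_review`, `mark_done`, and the actor IDs are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


class SeparationOfDutiesError(Exception):
    """Raised when an actor tries to approve work it submitted itself."""


@dataclass
class Task:
    title: str
    status: str = "in_progress"
    submitted_by: Optional[str] = None  # actor that moved the task to review


def move_to_review(task: Task, actor: str) -> None:
    task.status = "review"
    task.submitted_by = actor


def mark_done(task: Task, actor: str) -> None:
    if task.status != "review":
        raise ValueError("only tasks in review can be marked done")
    # The gate: the approver must differ from the submitter.
    if actor == task.submitted_by:
        raise SeparationOfDutiesError(f"{actor} cannot approve its own work")
    task.status = "done"


task = Task("Reconcile invoices")
move_to_review(task, actor="shipper-1")
try:
    mark_done(task, actor="shipper-1")
except SeparationOfDutiesError as err:
    print(err)  # shipper-1 cannot approve its own work
mark_done(task, actor="verifier-1")
print(task.status)  # done
```

The point of the sketch is where the check lives: in the function that changes state, so no client or UI can skip it.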

3. An evidence trail you can read after the fact

Every action the agent took should be queryable: which model, which tool, what arguments, what return value, what cost. The auditors aren't going to accept "we trust the agent." They'll accept a log.

The minimum bar: a per-session timeline of every autonomous action, recording the tool called and what each call cost.
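
As a rough illustration of what "queryable" can mean, the sketch below stores each action as a row in an in-memory SQLite database and reads back a session timeline and its total cost. The table name `agent_actions`, its columns, and the sample rows are assumptions invented for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent_actions (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        actor TEXT NOT NULL,
        model TEXT NOT NULL,
        tool TEXT NOT NULL,
        arguments TEXT NOT NULL,   -- JSON-encoded tool arguments
        result TEXT NOT NULL,      -- JSON-encoded return value
        cost_usd REAL NOT NULL
    )
    """
)


def record_action(session_id, actor, model, tool, arguments, result, cost_usd):
    """Append one autonomous action to the trail."""
    conn.execute(
        "INSERT INTO agent_actions "
        "(session_id, actor, model, tool, arguments, result, cost_usd) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (session_id, actor, model, tool,
         json.dumps(arguments), json.dumps(result), cost_usd),
    )


def session_timeline(session_id):
    """Return every action in a session, in the order it happened."""
    return conn.execute(
        "SELECT actor, model, tool, arguments, result, cost_usd "
        "FROM agent_actions WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()


def session_cost(session_id):
    """Return the total recorded cost of a session in dollars."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM agent_actions "
        "WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return total


# Made-up sample data.
record_action("s1", "triager-1", "model-a", "inbox_list",
              {"limit": 10}, {"count": 3}, 0.25)
record_action("s1", "shipper-1", "model-a", "task_update",
              {"id": 7, "status": "review"}, {"ok": True}, 0.50)

for row in session_timeline("s1"):
    print(row)
print(session_cost("s1"))  # 0.75
```

Whatever the real schema looks like, an auditor should be able to answer "who did what, with which model, at what cost" with a query rather than a support ticket.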

4. Budgets the agent can't exceed

A real agent system has a budget — in dollars per session, tokens per task, or tool-calls per goal. The system refuses to continue past the cap and asks for supervision. If your AI feature can run indefinitely with no cap, it's a foot-gun.
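
A cap like this can be enforced with a small guard that every tool call must pass through. The following is a minimal sketch, assuming the cost of a call is known (or estimated) before it is charged. `SessionBudget`, `BudgetExceeded`, and the limits shown are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when the next action would push a session past its cap."""


class SessionBudget:
    def __init__(self, max_usd: float, max_tool_calls: int) -> None:
        self.max_usd = max_usd
        self.max_tool_calls = max_tool_calls
        self.spent_usd = 0.0
        self.tool_calls = 0

    def charge(self, cost_usd: float) -> None:
        """Account for one tool call, or refuse if it would exceed a cap."""
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached; escalate to a human")
        if self.spent_usd + cost_usd > self.max_usd:
            raise BudgetExceeded("dollar cap reached; escalate to a human")
        self.spent_usd += cost_usd
        self.tool_calls += 1


budget = SessionBudget(max_usd=1.00, max_tool_calls=10)
budget.charge(0.50)
budget.charge(0.25)
try:
    budget.charge(0.50)  # 0.75 + 0.50 would exceed the 1.00 cap
except BudgetExceeded as err:
    print(f"stopped: {err}")
print(budget.spent_usd)  # 0.75
```

The refused call leaves the counters untouched, so the session can resume from the same point once a human raises the cap or approves the step.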

5. Goals, not tasks

A copilot lives in a chat window and reacts to the last message. An agent has a durable goal that persists across sessions. "Triage the inbound queue daily" is a goal. "Summarize this email" is a task. The presence of goals is the cleanest tell between the two.
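
The difference shows up in what gets stored. A task vanishes with the chat; a goal is a record that outlives the session. The sketch below is one hypothetical way to persist a goal between runs. The `Goal` fields, the JSON file, and the dates are assumptions for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Goal:
    description: str
    schedule: str
    completed_runs: list = field(default_factory=list)


def save_goal(goal: Goal, path: Path) -> None:
    path.write_text(json.dumps(asdict(goal)))


def load_goal(path: Path) -> Goal:
    return Goal(**json.loads(path.read_text()))


store = Path(tempfile.mkdtemp()) / "goal.json"

# Session 1: create the goal and record one run.
goal = Goal("Triage the inbound queue", schedule="daily")
goal.completed_runs.append("2026-01-05")
save_goal(goal, store)

# Session 2 (conceptually a later process): the goal and its history persist.
resumed = load_goal(store)
resumed.completed_runs.append("2026-01-06")
save_goal(resumed, store)

print(load_goal(store).completed_runs)  # ['2026-01-05', '2026-01-06']
```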

6. Inspectable prompts and tools

You should be able to see the system prompts, the tool definitions, and the policy file that governs what the agent can do. Vendors that hide the prompts hide the failure modes. If you can't read it, you can't reason about it.

How to test a vendor's claim

A practical evaluation, in four steps.

Step 1 — Read the audit log

Ask the vendor for a sample audit log from a real workflow. Look for: actor IDs, model names, tool calls with arguments, costs, and a clear before/after delta. If the "log" is a line like "INFO: AI processed task", walk away.
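
That checklist can be turned into a quick script to run over a vendor's sample. This is a hedged sketch: the field names in `REQUIRED_FIELDS` are assumptions, and a real log will use its own keys, so map them before drawing conclusions.

```python
# Hypothetical field names; adapt them to the vendor's actual log schema.
REQUIRED_FIELDS = {
    "actor_id", "model", "tool", "arguments", "cost_usd", "before", "after",
}


def missing_fields(entry: dict) -> set:
    """Return the checklist fields that a log entry does not contain."""
    return REQUIRED_FIELDS - set(entry)


good = {
    "actor_id": "shipper-1",
    "model": "model-a",
    "tool": "task_update",
    "arguments": {"id": 7, "status": "review"},
    "cost_usd": 0.5,
    "before": {"status": "in_progress"},
    "after": {"status": "review"},
}
bad = {"level": "INFO", "message": "AI processed task"}

print(missing_fields(good))       # set()
print(len(missing_fields(bad)))   # 7
```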

Step 2 — Try to make the agent approve its own work

Sign in as one user. Mark a task in progress. Move it to review. Try to move it to done as the same identity. The platform should refuse. If it doesn't, the verification gate is a UI convention, not a property of the system.

Step 3 — Look for the budget knob

Find the place where you set "this role can spend at most $X per session" or "at most Y tool calls per goal." If there isn't one, the vendor's confidence in the agent is unearned.

Step 4 — Read the prompts

If the prompts are inspectable, read them. They tell you exactly what the agent is told to do, what tools it can call, and which guardrails are nominal and which are structural. If the prompts are hidden, the failure modes are hidden with them.

Red flags

  • "Just trust the AI." No.
  • "Magic" or "intelligence" in the spec sheet. Spec sheets do not contain magic.
  • A single AI module that does everything. That's marketing, not architecture.
  • No way to disable the AI per-record or per-workflow. You will need to.
  • Inference is mandatory in the vendor's cloud. Your customer data shouldn't be the price of admission.

What Excellent's agent layer looks like

For comparison, the version we've built:

  • Five named roles — Shipper, Verifier, Triager, Scribe, Planner — each with a system prompt and a tool allowlist.
  • Structural separation of duties — the MCP server refuses task_update to done if the actor matches the shipper. The gate is in the database, not the UI.
  • Per-session Lens analytics — every tool call, every model, every cost, queryable in SQL.
  • Per-role budgets — a worker's run is capped in dollars; over-budget triggers human review.
  • Durable goals — agents pursue persistent goals across sessions, not chat-shaped tasks.
  • Inspectable everything — prompts, tools, and policy files are part of your local workspace, not vendor-side.

You don't have to take our word for any of that — every one of those bullets is visible in the product and the docs.


Next up: the "Evaluating an AI back office" comparisons score specific vendors against this same six-point framework, so you can apply it without redoing the homework.
