№ 009 · customer service · routing · filed May '26

Building the escalation classifier in an afternoon.

Fifty labelled tickets, one prompt, one router — that's the whole build.

This is the setup for an escalation classifier built as a single Anthropic prompt. Not an ML project. An afternoon.

What you'll have when you finish: a labels.csv with 50 hand-labeled tickets, a 30-line classifier prompt with hard JSON schema output, a webhook endpoint that routes every new ticket to bot/human/engineering, a Postgres routing log, and a Langfuse dashboard showing routing accuracy week-over-week.

Accounts you'll need: console.anthropic.com · plain.com or usepylon.com for the ticketing webhook · supabase.com for Postgres · langfuse.com for accuracy tracking. Total cost at 200 tickets/day: ~$5/mo in API calls.

01

The stack.

  • 01 · Anthropic API: Sonnet 4.6, JSON output · daily
  • 02 · 50 labelled tickets: from the last month · once
  • 03 · Plain or Pylon: ticketing webhook target · daily
  • 04 · Postgres: routing decisions log · daily
  • 05 · Langfuse: accuracy tracking · weekly
02

How to apply it.

  1. 01 · 45 min

    Pull 50 tickets, label by hand.

    Last 30 days. Random sample. Three severity buckets. Three is the right number — not five, not seven.

    routine → bot
    Docs answers, how-to, FAQ. The deflection bucket.

    urgent → human
    Enterprise, churn-risk language, billing disputes, repeat tickets within 24h.

    on_fire → engineering
    Prod down, data loss, security report, legal language, "need a human now."

    Label by the action you actually took, not the action you wished you'd taken. Reality beats best practice every time.
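A minimal sketch of what labels.csv and a sanity check on it could look like. The column names (ticket_id, text, severity), the helper load_labels, and the sample rows are assumptions for illustration; only the severity-to-route mapping comes from the three buckets above.

```python
import csv
import io
from collections import Counter

# Mapping taken from the three buckets above: severity -> route.
ROUTE_FOR = {"routine": "bot", "urgent": "human", "on_fire": "engineering"}


def load_labels(source):
    """Read labelled tickets from a file-like object and validate each row.

    Assumed (hypothetical) columns: ticket_id, text, severity.
    The route is derived from severity so the two can never disagree.
    """
    rows = []
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(csv.DictReader(source), start=2):
        severity = (row.get("severity") or "").strip()
        if severity not in ROUTE_FOR:
            raise ValueError(f"line {line_no}: unknown severity {severity!r}")
        text = (row.get("text") or "").strip()
        if not text:
            raise ValueError(f"line {line_no}: empty ticket text")
        rows.append({
            "ticket_id": row["ticket_id"],
            "text": text,
            "severity": severity,
            "route": ROUTE_FOR[severity],
        })
    return rows


# Tiny invented sample standing in for the real labels.csv.
SAMPLE_CSV = """ticket_id,text,severity
T-1,How do I export my data to CSV?,routine
T-2,We were double charged and are considering cancelling.,urgent
T-3,Production API is returning 500s for all customers.,on_fire
"""

labels = load_labels(io.StringIO(SAMPLE_CSV))
bucket_counts = Counter(row["severity"] for row in labels)
print(bucket_counts)
```

With the real file, the same function can be fed `open("labels.csv", newline="")`. Checking the bucket counts is a quick way to spot a sample in which one bucket is barely represented.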

  2. 02 · 60 min

    Write the prompt — JSON output, hard schema.

    30 lines. Define the three buckets explicitly. List the labeled tickets as examples. Force JSON output: { severity, route, confidence, reasoning }. Reasoning is one sentence; it's for your audit, not the model.
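One way the prompt assembly and the "hard schema" check could be sketched. The wording of the prompt, the function names build_prompt and parse_decision, and the rule that route must match severity are assumptions, not the author's exact prompt. The call to the Anthropic API is deliberately left out; the string returned by build_prompt would be sent as the user message, and the reply text passed to parse_decision.

```python
import json

# Mapping taken from the three buckets in step 1.
ROUTE_FOR = {"routine": "bot", "urgent": "human", "on_fire": "engineering"}

BUCKETS = """\
routine -> bot: docs answers, how-to, FAQ.
urgent -> human: enterprise, churn-risk language, billing disputes, repeat tickets within 24h.
on_fire -> engineering: prod down, data loss, security report, legal language, "need a human now"."""

EXPECTED_KEYS = {"severity", "route", "confidence", "reasoning"}


def build_prompt(examples, ticket_text):
    """Assemble bucket definitions, labelled examples, the schema, and the ticket."""
    shots = "\n\n".join(
        f'Ticket: {ex["text"]}\nSeverity: {ex["severity"]}' for ex in examples
    )
    return (
        "Classify the support ticket into exactly one severity bucket.\n\n"
        f"Buckets:\n{BUCKETS}\n\n"
        f"Labelled examples:\n{shots}\n\n"
        "Respond with JSON only, using exactly these keys: "
        "severity, route, confidence, reasoning. "
        "reasoning is one sentence.\n\n"
        f"Ticket: {ticket_text}"
    )


def parse_decision(raw):
    """Parse the model's reply and enforce the schema; raise ValueError otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError(f"reply must have exactly the keys {sorted(EXPECTED_KEYS)}")
    if data["severity"] not in ROUTE_FOR:
        raise ValueError(f"unknown severity {data['severity']!r}")
    if data["route"] != ROUTE_FOR[data["severity"]]:
        raise ValueError("route does not match severity")
    # The guide does not specify a type for confidence, so it is not checked here.
    return data


prompt = build_prompt(
    [{"text": "How do I reset my password?", "severity": "routine"}],
    "Our dashboard has been down for an hour.",
)
decision = parse_decision(
    '{"severity": "on_fire", "route": "engineering", '
    '"confidence": 0.9, "reasoning": "Reports an ongoing outage."}'
)
print(decision["route"])
```

Rejecting a malformed reply outright, rather than guessing a route from it, keeps bad output out of the routing log.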

  3. 03 · 30 min

    Test on a holdout of 10.

    Ten more tickets you didn't include in the prompt. Run them. You already know the answer — that's the point.

    Target: 9/10 right. If lower, the labels in your prompt aren't clean. Fix the labels, not the prompt structure.
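The holdout check can be sketched as a small scoring loop. Everything here is illustrative: evaluate_holdout is a hypothetical helper, keyword_stub stands in for the real prompt-backed classifier so the example runs offline, and the three tickets are invented (the guide uses ten).

```python
def evaluate_holdout(holdout, classify):
    """Run the classifier on held-out tickets; return (number correct, misses)."""
    misses = []
    for row in holdout:
        predicted = classify(row["text"])
        if predicted != row["severity"]:
            misses.append({
                "ticket_id": row["ticket_id"],
                "expected": row["severity"],
                "predicted": predicted,
            })
    return len(holdout) - len(misses), misses


def keyword_stub(text):
    """Stand-in classifier for this demo only; the real one calls the prompt."""
    lowered = text.lower()
    if "down" in lowered or "data loss" in lowered:
        return "on_fire"
    if "refund" in lowered or "cancel" in lowered:
        return "urgent"
    return "routine"


# Invented holdout rows; none of them appear in the prompt's examples.
holdout = [
    {"ticket_id": "H-1", "text": "How do I change my avatar?", "severity": "routine"},
    {"ticket_id": "H-2", "text": "Our whole site is down since the deploy.", "severity": "on_fire"},
    {"ticket_id": "H-3", "text": "Third ticket today about the same login issue.", "severity": "urgent"},
]

correct, misses = evaluate_holdout(holdout, keyword_stub)
print(f"{correct}/{len(holdout)} correct")
for miss in misses:
    print(miss)
```

The list of misses matters more than the score: each miss points at the kind of labelled example the prompt is missing or has mislabelled.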

  4. 04 · 45 min

    Wire the webhook.

    Plain or Pylon webhook on new ticket → your endpoint → classifier → routing decision back to the ticketing tool. Plain's API takes one line. Pylon's takes two.

    Log every decision to Postgres before acting. The log is what saves you the first time the model is wrong.
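A sketch of the log-first ordering, with the database write and the ticketing call injected as plain functions so the order is easy to see and test. The payload fields (ticket_id, text), the routing_log table, and the names handle_new_ticket, log_decision, and apply_route are all hypothetical; real Plain or Pylon payloads and API calls will look different.

```python
from datetime import datetime, timezone

# Hypothetical Postgres table for the routing log.
ROUTING_LOG_DDL = """
CREATE TABLE IF NOT EXISTS routing_log (
    id          BIGSERIAL PRIMARY KEY,
    ticket_id   TEXT NOT NULL,
    severity    TEXT NOT NULL,
    route       TEXT NOT NULL,
    confidence  TEXT,              -- type is not specified in the guide
    reasoning   TEXT,
    decided_at  TIMESTAMPTZ NOT NULL
);
"""


def handle_new_ticket(payload, classify, log_decision, apply_route):
    """Handle one new-ticket webhook payload: classify, log, then route."""
    ticket_id = payload["ticket_id"]
    decision = classify(payload["text"])
    record = {
        "ticket_id": ticket_id,
        **decision,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    # Log first. If the write raises, apply_route below never runs.
    log_decision(record)
    apply_route(ticket_id, decision["route"])
    return record


# In-memory stand-ins that record the order of side effects.
events = []
record = handle_new_ticket(
    {"ticket_id": "T-42", "text": "Checkout is down for everyone."},
    classify=lambda text: {
        "severity": "on_fire",
        "route": "engineering",
        "confidence": 0.9,
        "reasoning": "Reports an outage.",
    },
    log_decision=lambda rec: events.append(("log", rec["ticket_id"])),
    apply_route=lambda ticket_id, route: events.append(("route", ticket_id, route)),
)
print(events)
```

In a real endpoint, log_decision would be an INSERT into the routing log and apply_route the call back to the ticketing tool; the ordering inside handle_new_ticket stays the same.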

  5. 05 · 2 weeks

    Daily review for two weeks.

    Each morning: pull yesterday's classifications. Read 10 at random. Flag anything wrong. Update the labels in the prompt, not the prompt structure.

    After two weeks the classifier stabilizes. Stop reviewing daily; switch to weekly audits.
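The morning review can be sketched as a random draw plus a tally. The helper names sample_for_review and review_accuracy, the row shape, and the generated ticket IDs are assumptions; in practice the rows would come from yesterday's entries in the routing log.

```python
import random


def sample_for_review(rows, k=10, seed=None):
    """Pick up to k of yesterday's classifications uniformly at random."""
    rng = random.Random(seed)
    if len(rows) <= k:
        return list(rows)
    return rng.sample(rows, k)


def review_accuracy(wrong_flags):
    """wrong_flags holds one boolean per reviewed ticket, True where it was misrouted."""
    if not wrong_flags:
        return None
    right = len(wrong_flags) - sum(wrong_flags)
    return right / len(wrong_flags)


# Invented stand-in for yesterday's 200 logged decisions.
yesterday = [{"ticket_id": f"T-{i}", "route": "bot"} for i in range(200)]

batch = sample_for_review(yesterday, k=10, seed=7)
print([row["ticket_id"] for row in batch])

# Suppose the reviewer flagged one of the ten as wrong.
flags = [False] * 9 + [True]
print(review_accuracy(flags))
```

Sampling at random, rather than reading the first ten of the day, avoids reviewing the same time window every morning.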

03

What we stopped doing.

  • × ML pipelines for what a prompt does. No training set is needed beyond the 50 examples.
  • × Confidence thresholds. Claude is already doing this internally. Adding your own threshold just adds noise.
  • × Six severity levels. Three is right.
  • × Re-training weekly. The classifier is stable once dialled. Update labels, not structure.
  • × Routing decisions before logging them. Log first. Act second.
  • × Hiding the decision from the agent. Show them the model's reasoning. They catch issues you'd miss.
04

The take.

Routing is one decision repeated thousands of times. One prompt does it. The build takes an afternoon (the four timed steps above add up to three hours) and pays for itself in the first week.

Steal one thing: the daily 10-ticket review. It's the only difference between a classifier that holds at 94% and one that drifts to 70% by month three.


Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.