№ 009 · customer service · routing · filed May '26

Building the escalation classifier in an afternoon.

Fifty labeled tickets, one prompt, one router: that's the whole build.

Austin AI Guy · 11 min read · ships in ~4 hrs

This is the setup for an escalation classifier built as a single Anthropic prompt. Not an ML project. An afternoon.

What you'll have when you finish: a labels.csv with 50 hand-labeled tickets, a 30-line classifier prompt with hard JSON schema output, a webhook endpoint that routes every new ticket to bot/human/engineering, a Postgres routing log, and a Langfuse dashboard showing routing accuracy week-over-week.

Accounts you'll need: console.anthropic.com · plain.com or usepylon.com for the ticketing webhook · supabase.com for Postgres · langfuse.com for accuracy tracking. Total cost at 200 tickets/day: ~$5/mo in API calls.

The stack.

01 · Anthropic API: Sonnet 4.6, JSON output · daily
02 · 50 labelled tickets: from the last month · once
03 · Plain or Pylon: ticketing webhook target · daily
04 · Postgres: routing decisions log · daily
05 · Langfuse: accuracy tracking · weekly

How to apply it.

01 · 45 min

Pull 50 tickets, label by hand.

Last 30 days. Random sample. Three severity buckets. Three is the right number — not five, not seven.

routine → bot
Docs answers, how-to, FAQ. The deflection bucket.

urgent → human
Enterprise, churn-risk language, billing disputes, repeat tickets within 24h.

on_fire → engineering
Prod down, data loss, security report, legal language, "need a human now."

Label by the action you actually took, not the action you wished you'd taken. Reality beats best practice every time.
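
One way to keep the sample honest is to let code pick the tickets. A minimal sketch, assuming the last 30 days of tickets are already exported into an array. `shuffle` and `sampleAndSplit` are illustrative names, not part of any ticketing API. The 50/10 split assumes you draw the step 03 holdout in the same pass.

```typescript
// Fisher-Yates shuffle on a copy, so the input array is left untouched.
function shuffle<T>(items: readonly T[], rand: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Draw a random sample, then split it into prompt examples and a holdout set.
function sampleAndSplit<T>(
  tickets: readonly T[],
  promptSize = 50,
  holdoutSize = 10,
  rand: () => number = Math.random,
): { forPrompt: T[]; holdout: T[] } {
  const needed = promptSize + holdoutSize;
  if (tickets.length < needed) {
    throw new Error(`need at least ${needed} tickets, got ${tickets.length}`);
  }
  const picked = shuffle(tickets, rand).slice(0, needed);
  return {
    forPrompt: picked.slice(0, promptSize),
    holdout: picked.slice(promptSize),
  };
}
```

Drawing both sets from one shuffle guarantees that no holdout ticket also appears in the prompt.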

02 · 60 min

Write the prompt — JSON output, hard schema.

30 lines. Define the three buckets explicitly. List the labeled tickets as examples. Force JSON output: { severity, route, confidence, reasoning }. Reasoning is one sentence; it's for your audit, not the model.
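
A "hard schema" only holds if something checks it on the way out. A minimal sketch of that check, assuming the model's reply arrives as a raw string. `parseDecision` and the `Decision` type are illustrative names. The field names and allowed values come from the schema described above.

```typescript
type Severity = "routine" | "urgent" | "on_fire";
type Route = "bot" | "human" | "engineering";

interface Decision {
  severity: Severity;
  route: Route;
  confidence: number;
  reasoning: string;
}

const SEVERITIES: readonly string[] = ["routine", "urgent", "on_fire"];
const ROUTES: readonly string[] = ["bot", "human", "engineering"];

// Parse the model's raw text and check it against the four-field schema.
// Returns null instead of throwing, so the caller decides what to do with a bad reply.
function parseDecision(raw: string): Decision | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const { severity, route, confidence, reasoning } = data as Record<string, unknown>;
  if (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1) return null;
  if (typeof route !== "string" || ROUTES.indexOf(route) === -1) return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  if (typeof reasoning !== "string") return null;
  return {
    severity: severity as Severity,
    route: route as Route,
    confidence,
    reasoning,
  };
}
```

A `null` result is a signal worth logging in its own right: it means the reply broke the output contract, which is a different failure from a wrong label.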

03 · 30 min

Test on a holdout of 10.

Ten more tickets you didn't include in the prompt. Run them. You already know the answer — that's the point.

Target: 9/10 right. If lower, the labels in your prompt aren't clean. Fix the labels, not the prompt structure.
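
Scoring the holdout by hand works for ten tickets, but a few lines of code make the misses easier to read. A minimal sketch, assuming you have already run the classifier and paired each prediction with your own label. `HoldoutResult` and `scoreHoldout` are illustrative names.

```typescript
type Route = "bot" | "human" | "engineering";

interface HoldoutResult {
  ticketId: string;
  expected: Route;  // the route you took by hand
  predicted: Route; // the route the classifier returned
}

// Compare predictions with hand labels and report accuracy against the target.
function scoreHoldout(results: readonly HoldoutResult[], target = 0.9) {
  const misses = results.filter((r) => r.predicted !== r.expected);
  const accuracy =
    results.length === 0 ? 0 : (results.length - misses.length) / results.length;
  return { accuracy, passed: accuracy >= target, misses };
}
```

The `misses` list is the useful part: each one points at a label in the prompt that is ambiguous or missing.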

04 · 45 min

Wire the webhook.

Plain or Pylon webhook on new ticket → your endpoint → classifier → routing decision back to the ticketing tool. Plain's API takes one line. Pylon's takes two.

Log every decision to Postgres before acting. The log is what saves you the first time the model is wrong.

05 · 2 weeks

Daily review for two weeks.

Each morning: pull yesterday's classifications. Read 10 at random. Flag anything wrong. Update the labels in the prompt, not the prompt structure.

After two weeks the classifier stabilizes. Stop reviewing daily; switch to weekly audits.

What we stopped doing.

× ML pipelines for what a prompt does. No training set is needed beyond the 50 examples.
× Confidence thresholds. Claude is already doing this internally. Adding your own threshold just adds noise.
× Six severity levels. Three is right.
× Re-training weekly. The classifier is stable once dialled. Update labels, not structure.
× Routing decisions before logging them. Log first. Act second.
× Hiding the decision from the agent. Show them the model's reasoning. They catch issues you'd miss.

The take.

Routing is one decision repeated thousands of times. One prompt does it. The 4-hour build pays for itself in the first week.

Steal one thing: the daily 10-ticket review. It's the only difference between a classifier that holds at 94% and one that drifts to 70% by month three.

Decision rules: three signals worth acting on.

If you see: Accuracy < 88% on weekly sample
Do this: Add 10 fresh labeled examples to the prompt and redeploy. Keep the old version in git.
Don't do this: Rewrite the whole prompt.

If you see: Urgent rate > 5% of total tickets
Do this: Tighten the urgent definition in the prompt with two narrower examples.
Don't do this: Re-route urgent to bot to reduce visible volume.

If you see: Confidence < 0.6 routes climbing
Do this: The override is doing its job; let confidence drive to human. Read those tickets weekly.
Don't do this: Lower the confidence threshold.
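
The three signals can be computed straight from the routing log. A minimal sketch, assuming you pass in one week of logged decisions plus the tally from your review sample. `WeeklyInput` and `weeklySignals` are illustrative names. The 88%, 5% and 0.6 cut-offs are the ones in the rules above.

```typescript
interface LoggedDecision {
  severity: "routine" | "urgent" | "on_fire";
  confidence: number;
}

interface WeeklyInput {
  reviewedCorrect: number; // reviewed tickets the classifier got right
  reviewedTotal: number;   // reviewed tickets in the weekly sample
  decisions: readonly LoggedDecision[]; // every decision logged that week
}

function weeklySignals(week: WeeklyInput) {
  const total = week.decisions.length;
  const accuracy =
    week.reviewedTotal === 0 ? null : week.reviewedCorrect / week.reviewedTotal;
  const urgentRate =
    total === 0 ? 0 : week.decisions.filter((d) => d.severity === "urgent").length / total;
  // "Climbing" needs last week's number too, so return the count for comparison.
  const lowConfidenceCount = week.decisions.filter((d) => d.confidence < 0.6).length;
  return {
    accuracy,
    urgentRate,
    lowConfidenceCount,
    addFreshExamples: accuracy !== null && accuracy < 0.88,
    tightenUrgentDefinition: urgentRate > 0.05,
  };
}
```

Store each week's output next to the log so the low-confidence count can be compared week over week.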

Three drop-ins. The labeling template, the classifier prompt, the webhook handler.

The labeling template.

CSV format. 50 rows. Spend 45 minutes here, save quarters later.

labels.csv — 50 rows minimum

ticket_id,subject,first_message,tier,severity,route,why_this_label
TKT-1001,Login broken,"Can't log in. Tried 3 times.",free,routine,bot,"docs answer, no impact"
TKT-1002,Production down,"Our entire app is 500ing.",enterprise,on_fire,engineering,"prod outage on paying account"
TKT-1003,Billing question,"Why was I charged $90?",pro,urgent,human,"billing dispute, real money"
TKT-1004,How do I export?,"Looking for the export button.",free,routine,bot,"docs answer"
TKT-1005,Migration broken,"Just upgraded plan and now all my data is gone.",pro,on_fire,engineering,"data loss claim, escalate"
...

RULES
  - 50 rows minimum, 100 ideal.
  - Random sample across last 30 days. Don't cherry-pick.
  - Label by what you ACTUALLY did, not best practice.
  - severity: routine | urgent | on_fire
  - route:    bot | human | engineering
  - why_this_label: one sentence. Used in the prompt.
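
A quick lint pass catches typos in the labels before they reach the prompt. A minimal sketch, assuming the CSV is already parsed into row objects (the parser itself is not shown). `LabelRow` and `checkLabels` are illustrative names. It flags a severity and route pair that disagrees with the one-to-one mapping used in this guide. Treat that as a prompt for a second look, since the rule is still to label what you actually did.

```typescript
// Only the columns this check reads; the other CSV columns are ignored.
interface LabelRow {
  ticket_id: string;
  severity: string;
  route: string;
  why_this_label: string;
}

// The severity -> route mapping from this guide.
const ROUTE_FOR = new Map<string, string>([
  ["routine", "bot"],
  ["urgent", "human"],
  ["on_fire", "engineering"],
]);

// Returns a list of human-readable problems; an empty list means the file is clean.
function checkLabels(rows: readonly LabelRow[], minRows = 50): string[] {
  const problems: string[] = [];
  if (rows.length < minRows) {
    problems.push(`only ${rows.length} rows, need at least ${minRows}`);
  }
  for (const row of rows) {
    const expected = ROUTE_FOR.get(row.severity);
    if (expected === undefined) {
      problems.push(`${row.ticket_id}: unknown severity "${row.severity}"`);
    } else if (row.route !== expected) {
      problems.push(`${row.ticket_id}: ${row.severity} should route to ${expected}, got "${row.route}"`);
    }
    if (row.why_this_label.trim() === "") {
      problems.push(`${row.ticket_id}: why_this_label is empty`);
    }
  }
  return problems;
}
```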

The classifier prompt.

30 lines, JSON output. Examples embedded from labels.csv.

You are a customer-service ticket classifier. Output JSON
in this exact shape — nothing else:

{
  "severity":   "routine" | "urgent" | "on_fire",
  "route":      "bot" | "human" | "engineering",
  "confidence": 0.0 - 1.0,
  "reasoning":  "<one sentence, max 20 words>"
}

DEFINITIONS
───────────
on_fire
  Production down, data loss, security report, legal language,
  billing dispute over $1k, or explicit "need human now."

urgent
  Enterprise (ARR > $50k/yr), repeat ticket within 24 hours,
  churn risk language, billing dispute of $1k or less, or
  anything that mentions a competitor by name in a frustrated
  context.

routine
  Everything else. Docs answers, how-to, FAQ, feature requests.

ROUTE
  on_fire   -> engineering
  urgent    -> human
  routine   -> bot

EXAMPLES — past tickets and the correct call
────────────────────────────────────────────
[paste 50 rows from labels.csv here, formatted as:]

  Ticket: "{first_message}"
  Tier: {tier}
  Correct severity: {severity}
  Correct route: {route}
  Why: {why_this_label}

INPUT
─────
Ticket: "{{ticket_body}}"
Tier:   {{tier}}
ARR:    ${{arr_usd}}
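
Pasting 50 rows by hand invites copy errors, so the EXAMPLES block is worth generating. A minimal sketch, assuming rows are already parsed from labels.csv. `renderExamples` and `fillInput` are illustrative names. The layout and the `{{...}}` placeholders are the ones in the prompt above.

```typescript
interface LabelRow {
  first_message: string;
  tier: string;
  severity: string;
  route: string;
  why_this_label: string;
}

// Format one labeled ticket in the layout the EXAMPLES section expects.
function renderExample(row: LabelRow): string {
  return [
    `  Ticket: "${row.first_message}"`,
    `  Tier: ${row.tier}`,
    `  Correct severity: ${row.severity}`,
    `  Correct route: ${row.route}`,
    `  Why: ${row.why_this_label}`,
  ].join("\n");
}

// Join all examples with a blank line between them.
function renderExamples(rows: readonly LabelRow[]): string {
  return rows.map(renderExample).join("\n\n");
}

// Fill the INPUT placeholders. Replacer functions are used so that "$" sequences
// in a ticket body are inserted literally instead of being read as patterns.
function fillInput(
  template: string,
  input: { ticket_body: string; tier: string; arr_usd: number },
): string {
  return template
    .replace("{{ticket_body}}", () => input.ticket_body)
    .replace("{{tier}}", () => input.tier)
    .replace("{{arr_usd}}", () => String(input.arr_usd));
}
```

The handler further down sends the ticket as the user message instead of filling the INPUT section. Either approach can work, as long as you pick one and test the holdout with it.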

The webhook handler.

Edge function or serverless. Log first, route second.

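// webhook.ts: ticket router
// alreadySeen, db, ticketingTool, queueFor and classifierPrompt are your own
// helpers and are assumed to be defined elsewhere in the project.

import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

export async function POST(req: Request): Promise<Response> {
  const ticket = await req.json();

  // 01: idempotency. Webhooks retry, so skip a ticket version already handled.
  const key = `${ticket.id}:${ticket.updated_at}`;
  if (await alreadySeen(key)) return new Response("dupe", { status: 200 });

  // 02: classify
  // Use the model you validated on the holdout (the stack list names Sonnet 4.6).
  const decision = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 200,
    system: classifierPrompt,
    messages: [{ role: "user", content: JSON.stringify(ticket) }],
  });

  // The reply is a list of content blocks; only a text block carries the JSON.
  const block = decision.content[0];
  if (!block || block.type !== "text") {
    return new Response("classifier returned no text", { status: 502 });
  }
  const result = JSON.parse(block.text); // throws on malformed JSON

  // 03: log BEFORE acting
  await db.routing_log.insert({
    ticket_id:  ticket.id,
    severity:   result.severity,
    route:      result.route,
    confidence: result.confidence,
    reasoning:  result.reasoning,
    raw_prompt: classifierPrompt.length, // prompt length in characters, for drift check
    at:         new Date(),
  });

  // 04: act
  await ticketingTool.assign({
    ticket_id: ticket.id,
    queue:     queueFor(result.route),
    tags:      [`auto:${result.severity}`],
  });

  return new Response("ok", { status: 200 });
}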
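
The handler calls `queueFor` without showing it, and the decision rules mention an override that sends low-confidence tickets to a human. A minimal sketch of both, under stated assumptions: the queue names are placeholders for whatever your ticketing tool uses, and `effectiveRoute` with its 0.6 floor is one reading of that override, not a confirmed part of the original handler.

```typescript
type Route = "bot" | "human" | "engineering";

// Hypothetical queue names; replace with the identifiers your ticketing tool uses.
const QUEUES: Record<Route, string> = {
  bot: "queue-bot",
  human: "queue-support",
  engineering: "queue-engineering",
};

// Map a route to the queue the ticket should land in.
function queueFor(route: Route): string {
  return QUEUES[route];
}

// Below the confidence floor, send the ticket to a human regardless of the model's route.
function effectiveRoute(route: Route, confidence: number, floor = 0.6): Route {
  return confidence < floor ? "human" : route;
}
```

If you use the override, log both the model's route and the effective route, so the weekly audit can see how often the override fired.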


Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.


