Building the escalation classifier in an afternoon.
Fifty labelled tickets, one prompt, one router — that's the whole build.
This is the setup for an escalation classifier built as a single Anthropic prompt. Not an ML project. An afternoon.
What you'll have when you finish: a labels.csv with 50 hand-labeled tickets, a 30-line classifier prompt with hard JSON schema output, a webhook endpoint that routes every new ticket to bot/human/engineering, a Postgres routing log, and a Langfuse dashboard showing routing accuracy week-over-week.
Accounts you'll need: console.anthropic.com · plain.com or usepylon.com for the ticketing webhook · supabase.com for Postgres · langfuse.com for accuracy tracking. Total cost at 200 tickets/day: ~$5/mo in API calls.
The stack.
- 01. Anthropic API: Sonnet 4.6, JSON output (daily)
- 02. 50 labelled tickets from the last month (once)
- 03. Plain or Pylon: ticketing webhook target (daily)
- 04. Postgres: routing decisions log (daily)
- 05. Langfuse: accuracy tracking (weekly)
How to apply it.
- 01 (45 min)
Pull 50 tickets, label by hand.
Last 30 days. Random sample. Three severity buckets. Three is the right number — not five, not seven.
routine → bot
Docs answers, how-to, FAQ. The deflection bucket.
urgent → human
Enterprise, churn-risk language, billing disputes, repeat tickets within 24h.
on_fire → engineering
Prod down, data loss, security report, legal language, "need a human now."
Label by the action you actually took, not the action you wished you'd taken. Reality beats best practice every time.
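Before the labels go anywhere near a prompt, it helps to check them mechanically. The sketch below assumes each row carries a `ticket_id`, a `severity`, and a `route` (the row shape and the `checkLabel` name are illustrative, not part of any tool) and flags rows whose route does not match the three-bucket mapping above.

```typescript
type Severity = "routine" | "urgent" | "on_fire";
type Route = "bot" | "human" | "engineering";

// The bucket-to-queue mapping from the three buckets above.
const ROUTE_FOR: Record<Severity, Route> = {
  routine: "bot",
  urgent: "human",
  on_fire: "engineering",
};

// Hypothetical row shape for this sketch.
interface LabelRow {
  ticket_id: string;
  severity: string;
  route: string;
}

// Returns a list of problems; an empty list means the row is clean.
function checkLabel(row: LabelRow): string[] {
  if (!Object.prototype.hasOwnProperty.call(ROUTE_FOR, row.severity)) {
    return [`${row.ticket_id}: unknown severity "${row.severity}"`];
  }
  const expected = ROUTE_FOR[row.severity as Severity];
  if (row.route !== expected) {
    return [`${row.ticket_id}: ${row.severity} should route to ${expected}, got ${row.route}`];
  }
  return [];
}
```

Running this over all 50 rows before pasting them into the prompt catches typos like `on-fire` or a severity paired with the wrong queue.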
- 02 (60 min)
Write the prompt — JSON output, hard schema.
30 lines. Define the three buckets explicitly. List the labeled tickets as examples. Force JSON output: { severity, route, confidence, reasoning }. Reasoning is one sentence; it's for your audit, not the model.
- 03 (30 min)
Test on a holdout of 10.
Ten more tickets you didn't include in the prompt. Run them. You already know the answer — that's the point.
Target: 9/10 right. If lower, the labels in your prompt aren't clean. Fix the labels, not the prompt structure.
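Scoring the holdout is simple enough to do by eye, but a few lines make it repeatable. This is a minimal sketch; the `HoldoutResult` shape and function names are assumptions, and it compares severity only.

```typescript
// Hypothetical shape: one row per holdout ticket.
interface HoldoutResult {
  ticket_id: string;
  expected: string;  // the severity you assigned by hand
  predicted: string; // the severity the classifier returned
}

// Counts correct calls and lists the ticket ids that were missed.
function scoreHoldout(results: HoldoutResult[]): { correct: number; total: number; misses: string[] } {
  const misses = results
    .filter((r) => r.expected !== r.predicted)
    .map((r) => r.ticket_id);
  return { correct: results.length - misses.length, total: results.length, misses };
}

// 9 of 10 is the target named above, so the default threshold is 0.9.
function meetsTarget(correct: number, total: number, target = 0.9): boolean {
  return total > 0 && correct / total >= target;
}
```

The `misses` list is the useful part: those are the tickets whose nearest labeled examples need a second look.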
- 04 (45 min)
Wire the webhook.
Plain or Pylon webhook on new ticket → your endpoint → classifier → routing decision back to the ticketing tool. Plain's API takes one line. Pylon's takes two.
Log every decision to Postgres before acting. The log is what saves you the first time the model is wrong.
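One thing worth guarding between the classifier and the log is the shape of the model's reply. The sketch below checks that the text parses as JSON and matches the { severity, route, confidence, reasoning } schema; `parseDecision` is a hypothetical helper, and what to do with a `null` (retry, send to a human, drop) is a policy choice this sketch leaves open.

```typescript
type Severity = "routine" | "urgent" | "on_fire";
type Route = "bot" | "human" | "engineering";

interface Decision {
  severity: Severity;
  route: Route;
  confidence: number;
  reasoning: string;
}

const SEVERITIES: readonly string[] = ["routine", "urgent", "on_fire"];
const ROUTES: readonly string[] = ["bot", "human", "engineering"];

// Returns a Decision, or null if the text is not valid JSON in the expected shape.
function parseDecision(text: string): Decision | null {
  let raw: unknown;
  try {
    raw = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const d = raw as Record<string, unknown>;
  if (typeof d.severity !== "string" || SEVERITIES.indexOf(d.severity) === -1) return null;
  if (typeof d.route !== "string" || ROUTES.indexOf(d.route) === -1) return null;
  if (typeof d.confidence !== "number" || d.confidence < 0 || d.confidence > 1) return null;
  if (typeof d.reasoning !== "string") return null;
  return {
    severity: d.severity as Severity,
    route: d.route as Route,
    confidence: d.confidence,
    reasoning: d.reasoning,
  };
}
```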
- 05 (2 weeks)
Daily review for two weeks.
Each morning: pull yesterday's classifications. Read 10 at random. Flag anything wrong. Update the labels in the prompt, not the prompt structure.
After two weeks the classifier stabilizes. Stop reviewing daily; switch to weekly audits.
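"Read 10 at random" only works if the 10 are actually random, not the first 10 in the table. A minimal sketch of the draw, assuming yesterday's classifications are already loaded into an array (the `sampleForReview` name is illustrative):

```typescript
// Returns up to n items chosen uniformly at random, without modifying the input.
// rand is injectable so the draw can be made deterministic when needed.
function sampleForReview<T>(items: T[], n = 10, rand: () => number = Math.random): T[] {
  const pool = items.slice();
  const take = Math.min(n, pool.length);
  // Partial Fisher-Yates shuffle: only the first `take` slots need to be settled.
  for (let i = 0; i < take; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, take);
}
```

On a day with fewer than 10 tickets it simply returns all of them.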
What we stopped doing.
- × ML pipelines for what a prompt does. No training set is needed beyond the 50 examples.
- × Confidence thresholds. Claude is already doing this internally. Adding your own threshold just adds noise.
- × Six severity levels. Three is right.
- × Re-training weekly. The classifier is stable once dialled. Update labels, not structure.
- × Routing decisions before logging them. Log first. Act second.
- × Hiding the decision from the agent. Show them the model's reasoning. They catch issues you'd miss.
The take.
Routing is one decision repeated thousands of times. One prompt does it. The 4-hour build pays for itself in the first week.
Steal one thing: the daily 10-ticket review. It's the only difference between a classifier that holds at 94% and one that drifts to 70% by month three.
After 30 days of clean routing, these earn their place.
Customer-tier weighting.
Pass ARR and contract tier to the classifier. Enterprise lowers the bar for "urgent." This is honest — it's contractual. Log the weighting in the trace so audits stay clean.
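Passing tier and ARR comes down to filling the `{{ticket_body}}`, `{{tier}}`, and `{{arr_usd}}` placeholders in the prompt's INPUT block. A small sketch of that fill step follows; `fillTemplate` and `TicketContext` are hypothetical names, and unknown placeholders are deliberately left in place so they are easy to spot in the trace.

```typescript
// Hypothetical shape for the per-ticket values the prompt expects.
interface TicketContext {
  ticket_body: string;
  tier: string; // e.g. "free", "pro", "enterprise"
  arr_usd: number;
}

// The INPUT block from the classifier prompt, with its three placeholders.
const INPUT_TEMPLATE = 'Ticket: "{{ticket_body}}"\nTier: {{tier}}\nARR: ${{arr_usd}}';

// Replaces each {{name}} with its value; unrecognized names are left untouched.
function fillTemplate(template: string, ctx: TicketContext): string {
  const values: Record<string, string> = {
    ticket_body: ctx.ticket_body,
    tier: ctx.tier,
    arr_usd: String(ctx.arr_usd),
  };
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match
  );
}
```

Logging the filled block alongside the decision is one way to keep the weighting visible in the trace.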
Multi-label in one call.
Combine severity + topic + sentiment in a single classifier output. One Anthropic call, three labels back. Saves cost. Saves latency. Each label is independently auditable.
Shadow-mode for new categories.
When you add a new severity or topic, run it in shadow for a week — classifier outputs it, but routing ignores it. Promote to live routing only after 90% accuracy on the new category.
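One way to read "routing ignores it" is to keep the shadow label in a separate field that the router never looks at. The sketch below assumes a hypothetical `shadow_topic` field and a hand-audited list of predictions for the promotion check; both are illustrations, not a prescribed schema.

```typescript
// Hypothetical output shape: the live label drives routing, the shadow label is only logged.
interface LabeledOutput {
  severity: string;
  shadow_topic?: string;
}

const QUEUE_FOR: Record<string, string> = {
  routine: "bot",
  urgent: "human",
  on_fire: "engineering",
};

// Routing reads only the live severity; shadow fields never influence the queue.
function routeFor(output: LabeledOutput): string | null {
  return Object.prototype.hasOwnProperty.call(QUEUE_FOR, output.severity)
    ? QUEUE_FOR[output.severity]
    : null;
}

// Promotion gate: the share of audited shadow predictions that matched the human call.
function readyToPromote(audited: { predicted: string; actual: string }[], threshold = 0.9): boolean {
  if (audited.length === 0) return false;
  const correct = audited.filter((a) => a.predicted === a.actual).length;
  return correct / audited.length >= threshold;
}
```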
Auto-PR when accuracy drops.
Wire your audit script to open a PR against the prompt file when weekly accuracy drops below 90%. The PR includes the missed tickets and proposed label updates. Human approves; ship.
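The audit half of this can be sketched without touching any Git hosting API: compute weekly accuracy, and only when it falls below the threshold produce the text a PR would carry. `AuditRow` and `buildPrBody` are hypothetical names; actually opening the PR is left to whatever tooling you already use.

```typescript
// Hypothetical shape: one row per ticket reviewed in the weekly audit.
interface AuditRow {
  ticket_id: string;
  predicted: string;
  actual: string;
}

// Returns a PR description when accuracy is below the threshold, otherwise null.
function buildPrBody(rows: AuditRow[], threshold = 0.9): string | null {
  if (rows.length === 0) return null;
  const missed = rows.filter((r) => r.predicted !== r.actual);
  const accuracy = (rows.length - missed.length) / rows.length;
  if (accuracy >= threshold) return null;
  const header = [`Weekly accuracy: ${(accuracy * 100).toFixed(1)}%`, "Missed tickets:"];
  const lines = missed.map((r) => `- ${r.ticket_id}: predicted ${r.predicted}, actual ${r.actual}`);
  return header.concat(lines).join("\n");
}
```

Proposed label updates still need a human to write or approve them; the sketch only assembles the evidence.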
Five symptoms with the fix that works.
№ 01 False positives: too many "urgent."
№ 02 False negatives on enterprise.
№ 03 Misroutes after a tool update.
№ 04 Latency climbed.
№ 05 Cost climbing.
ticket_id + version. Reject duplicates.
Three drop-ins. The labeling template, the classifier prompt, the webhook handler.
The labeling template.
CSV format. 50 rows. Spend 45 minutes here, save quarters later.
labels.csv: 50 rows minimum
ticket_id,subject,first_message,tier,severity,route,why_this_label
TKT-1001,Login broken,"Can't log in. Tried 3 times.",free,routine,bot,"docs answer, no impact"
TKT-1002,Production down,"Our entire app is 500ing.",enterprise,on_fire,engineering,"prod outage on paying account"
TKT-1003,Billing question,"Why was I charged $90?",pro,urgent,human,"billing dispute, real money"
TKT-1004,How do I export?,"Looking for the export button.",free,routine,bot,"docs answer"
TKT-1005,Migration broken,"Just upgraded plan and now all my data is gone.",pro,on_fire,engineering,"data loss claim, escalate"
...
RULES
- 50 rows minimum, 100 ideal.
- Random sample across last 30 days. Don't cherry-pick.
- Label by what you ACTUALLY did, not best practice.
- severity: routine | urgent | on_fire
- route: bot | human | engineering
- why_this_label: one sentence. Used in the prompt.
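Each row eventually becomes one example block in the prompt, in the Ticket / Tier / Correct severity / Correct route / Why layout the classifier prompt uses. A minimal sketch of that rendering, assuming the rows are already parsed into objects keyed by the CSV column names (the function names are illustrative):

```typescript
// Fields taken from the labels.csv header; subject and ticket_id are not used in the prompt.
interface LabelRow {
  first_message: string;
  tier: string;
  severity: string;
  route: string;
  why_this_label: string;
}

// Renders one labeled ticket in the example layout the prompt expects.
function renderExample(row: LabelRow): string {
  return [
    `Ticket: "${row.first_message}"`,
    `Tier: ${row.tier}`,
    `Correct severity: ${row.severity}`,
    `Correct route: ${row.route}`,
    `Why: ${row.why_this_label}`,
  ].join("\n");
}

// Joins all examples with a blank line between them, ready to paste into the prompt.
function renderExamples(rows: LabelRow[]): string {
  return rows.map(renderExample).join("\n\n");
}
```

Generating the examples from the CSV keeps the file as the single source of truth, so a label fix is a one-line change followed by a re-render.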
The classifier prompt.
30 lines, JSON output. Examples embedded from labels.csv.
You are a customer-service ticket classifier. Output JSON
in this exact shape — nothing else:
{
"severity": "routine" | "urgent" | "on_fire",
"route": "bot" | "human" | "engineering",
"confidence": 0.0 - 1.0,
"reasoning": "<one sentence, max 20 words>"
}
DEFINITIONS
───────────
on_fire
Production down, data loss, security report, legal language,
billing dispute over $1k, or explicit "need human now."
urgent
Enterprise (ARR > $50k/yr), billing dispute of $1k or less,
repeat ticket within 24 hours, churn risk language, or anything
that mentions a competitor by name in a frustrated context.
routine
Everything else. Docs answers, how-to, FAQ, feature requests.
ROUTE
on_fire -> engineering
urgent -> human
routine -> bot
EXAMPLES — past tickets and the correct call
────────────────────────────────────────────
[paste 50 rows from labels.csv here, formatted as:]
Ticket: "{first_message}"
Tier: {tier}
Correct severity: {severity}
Correct route: {route}
Why: {why_this_label}
INPUT
─────
Ticket: "{{ticket_body}}"
Tier: {{tier}}
ARR: ${{arr_usd}}
The webhook handler.
Edge function or serverless. Log first, route second.
// webhook.ts: ticket router
import Anthropic from "@anthropic-ai/sdk";

// alreadySeen, db, ticketingTool, queueFor and classifierPrompt are
// app-specific and assumed to be defined elsewhere in the project.
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function POST(req: Request) {
  const ticket = await req.json();

  // 01: idempotency, one decision per ticket version
  const key = `${ticket.id}:${ticket.updated_at}`;
  if (await alreadySeen(key)) return new Response("dupe", { status: 200 });

  // 02: classify
  const decision = await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 200,
    system: classifierPrompt,
    messages: [{ role: "user", content: JSON.stringify(ticket) }],
  });
  // content is a list of blocks; only a text block carries the JSON
  const block = decision.content[0];
  if (!block || block.type !== "text") throw new Error("classifier returned no text block");
  const result = JSON.parse(block.text);

  // 03: log BEFORE acting
  await db.routing_log.insert({
    ticket_id: ticket.id,
    severity: result.severity,
    route: result.route,
    confidence: result.confidence,
    reasoning: result.reasoning,
    raw_prompt: classifierPrompt.length, // prompt length, for drift check
    at: new Date(),
  });

  // 04: act
  await ticketingTool.assign({
    ticket_id: ticket.id,
    queue: queueFor(result.route),
    tags: [`auto:${result.severity}`],
  });
  return new Response("ok", { status: 200 });
}

Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.