The overnight research agent that doesn't run away.
Hard limits, tool gates, and the kill-switch we wish we'd added day one.
This is the setup for a research agent that runs unattended and stops itself if anything looks wrong. Five tools, one goal, three hard limits, three independent kill-switches, a full trace.
What you'll have when you finish: a scheduled overnight job that reads 40 sources, summarizes the new material to a JSON schema, drops a brief in your inbox by 6am, costs ~$2/run, and never bills past $5.
Accounts you'll need: console.anthropic.com · inngest.com · browserbase.com · supabase.com · langfuse.com. All free or under $30/mo to start.
The stack — five tools, ranked.
- 01 Anthropic API: Sonnet 4.6 + tool use (daily)
- 02 Inngest: durable queue, dedup, retry (daily)
- 03 Browserbase + Playwright: sandboxed browsing (daily)
- 04 Supabase Edge Functions: tool endpoints (weekly)
- 05 Langfuse: traces, cost, replay (daily)
Anthropic does the thinking. Inngest holds the job and survives a crash. Browserbase keeps the agent off your machine. Supabase hosts the tool functions. Langfuse is how you sleep at night.
How to apply it.
01 goal
Write the goal as a contract.
One sentence. One verb. One scope. One output format. Example:
"Read the 40 URLs in source_list.json, summarize any article published in the last 24 hours that mentions one of the 12 tickers in tickers.json, output one JSON object per match to store_summary()."
Now write three explicit success checks: (a) at least one valid record OR an explicit "no matches" record with timestamp, (b) every output parses against your JSON schema, (c) total output under 2,000 tokens.
If any check fails, the agent calls flag_for_review() and stops. "Done" isn't a state for an LLM; these three checks are.
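The three checks can be run in code after the agent finishes, so a failure triggers flag_for_review() without relying on the model's own judgment. The sketch below is one way to do it, under assumptions: the record shape mirrors the store_summary fields, `isValidRecord` is a stand-in for real JSON-schema validation, and the token count is a rough 4-characters-per-token estimate rather than a real tokenizer. The names `checkRun`, `isValidRecord`, and `estimateTokens` are hypothetical.

```typescript
// Hypothetical record shape, mirroring the store_summary fields.
interface SummaryRecord {
  ticker: string;
  summary: string;
  source_url: string;
  published_at: string;
}

type CheckResult = { ok: true } | { ok: false; reason: string };

const REQUIRED_FIELDS = ["ticker", "summary", "source_url", "published_at"] as const;

// Stand-in for schema validation: every required field is a non-empty string.
function isValidRecord(value: unknown): value is SummaryRecord {
  if (typeof value !== "object" || value === null) return false;
  const rec = value as Record<string, unknown>;
  return REQUIRED_FIELDS.every((f) => typeof rec[f] === "string" && rec[f] !== "");
}

// Rough estimate (about 4 characters per token). An assumption, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// noMatchesAt is the timestamp of an explicit "no matches" record, or null.
function checkRun(records: unknown[], noMatchesAt: string | null): CheckResult {
  // (a) at least one record, or an explicit timestamped "no matches" record
  if (records.length === 0 && noMatchesAt === null) {
    return { ok: false, reason: "no records and no explicit no-matches record" };
  }
  // (b) every record passes validation
  const bad = records.findIndex((r) => !isValidRecord(r));
  if (bad !== -1) {
    return { ok: false, reason: `record ${bad} failed schema validation` };
  }
  // (c) total output under 2,000 tokens
  const tokens = estimateTokens(JSON.stringify(records));
  if (tokens >= 2000) {
    return { ok: false, reason: `output is roughly ${tokens} tokens` };
  }
  return { ok: true };
}
```

When `checkRun` returns `ok: false`, the `reason` string is what you would hand to flag_for_review().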
02 tools
Build the tool set — five tools, hard limits.
Define exactly five tool functions in your Supabase Edge Functions project. The full vocabulary the agent has:
- fetch_url(url): fetch a single URL
- parse_article(html): extract title, body, published_at (throws if body < 100 chars)
- search_internal(query): check for duplicates
- store_summary(json): write a validated record
- flag_for_review(reason): stop the run, surface to a human
Allow-list the domains in fetch_url. Add a regex check at the top of the function: the URL must match your allow-list or the function throws. This single line of code prevents the largest class of agent failures.
Register the schema with Anthropic by passing the tool definitions in the tools parameter of each Messages API request. The exact JSON schema is in the Build-along tab, under "The tool schema."
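Here is a minimal sketch of the allow-list guard at the top of fetch_url. It swaps the regex for hostname parsing with the standard `URL` class, because a regex over the raw string is easy to fool with inputs like `https://example.com.evil.test/`. The host names and the `assertAllowed` helper are placeholders, not part of the original setup.

```typescript
// Hypothetical allow-list; replace with the domains in your own source list.
const ALLOWED_HOSTS = new Set(["example.com", "news.example.org"]);

// Throws unless rawUrl is an https URL whose host is on the allow-list.
function assertAllowed(rawUrl: string): URL {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    throw new Error(`not a valid URL: ${rawUrl}`);
  }
  if (url.protocol !== "https:") {
    throw new Error(`blocked protocol: ${url.protocol}`);
  }
  // Exact host match only, so look-alike hosts and userinfo tricks are rejected.
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`blocked host: ${url.hostname}`);
  }
  return url;
}
```

Call `assertAllowed(url)` as the first statement of fetch_url and only fetch the `URL` it returns. If you follow redirects, each hop would need the same check.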
03 limits
Hard limits — three numbers.
Any one trips and the run terminates with a partial result, not silently. The Inngest function carries these as run-level config so a code change can't accidentally raise them.
Max steps: 30
Tool calls + LLM turns combined. Past this, the goal is wrong, not the model.
Max wall-clock: 20 min
Most overnight jobs finish in 6–8 min. 20 is the kill point.
Max cost: $5
Per run. The 3am bill is what you're really afraid of.
04 switches
Kill-switches — three independent ones.
Independent means each can fire without the others. If you wire them through a single check, you have one switch, not three.
Budget: per-run + per-day
Code-level check that aborts the run. Not a provider alert.
Content: PII & copied text
Regex on output. Catches the run that quietly went off-script.
Time: window or wait
Run started after the window closes? Skip. Try tomorrow's slot.
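One way to keep the three switches independent is to give each its own predicate and evaluate all of them separately. The sketch below does that, with assumptions: only the $5 per-run cap is from this guide, while the per-day cap, the 04:00 to 06:00 UTC window, and the 200-character copy threshold are made-up values. The copied-text check uses a plain substring scan rather than a regex.

```typescript
interface RunSnapshot {
  runCostUsd: number;
  dayCostUsd: number; // spend across all of today's runs, tracked by you
  output: string;
  sources: string[]; // fetched article bodies, for the copied-text check
  now: Date;
}

// The $5 per-run cap comes from the limits; everything else here is assumed.
const MAX_RUN_USD = 5;
const MAX_DAY_USD = 10;
const WINDOW_START_HOUR = 4;
const WINDOW_END_HOUR = 6;
const MAX_COPIED_CHARS = 200;

const EMAIL = /[\w.+-]+@[\w-]+\.[a-z]{2,}/i;

// True if output shares a run of at least n characters with source.
function sharesLongRun(output: string, source: string, n: number): boolean {
  for (let i = 0; i + n <= output.length; i++) {
    if (source.includes(output.slice(i, i + n))) return true;
  }
  return false;
}

const switches: Record<string, (s: RunSnapshot) => boolean> = {
  budget: (s) => s.runCostUsd > MAX_RUN_USD || s.dayCostUsd > MAX_DAY_USD,
  content: (s) =>
    EMAIL.test(s.output) ||
    s.sources.some((src) => sharesLongRun(s.output, src, MAX_COPIED_CHARS)),
  time: (s) => {
    const hour = s.now.getUTCHours();
    return hour < WINDOW_START_HOUR || hour >= WINDOW_END_HOUR;
  },
};

// Evaluate every switch on its own; one firing never masks another.
function firedSwitches(s: RunSnapshot): string[] {
  return Object.entries(switches)
    .filter(([, fire]) => fire(s))
    .map(([name]) => name);
}
```

Any non-empty result from `firedSwitches` aborts the run, and the list of names tells you which switch did it.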
05 trace
Wire tracing on every call.
Sign up at langfuse.com, create a project, copy the public key and secret into your .env.
Install the SDK (npm i langfuse) and wrap every LLM call and every tool call:
langfuse.trace({ name, input, output, metadata: { run_id, goal_hash, date } })
Tag every trace with three fields: run_id (uuid per run), goal_hash (hash of the goal contract, so you can diff runs after editing the prompt), date.
This is non-optional. If something looks off in the morning brief, the trace is the only debugging surface you have.
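The three tags can be built once per run and reused on every trace. This sketch covers only the tag construction and leaves the Langfuse call itself out. The helper names `goalHash` and `buildTraceTags`, the 12-character hash length, and the UTC date are assumptions.

```typescript
import { createHash, randomUUID } from "node:crypto";

interface TraceTags {
  run_id: string;
  goal_hash: string;
  date: string;
}

// Short SHA-256 of the goal contract text; editing the prompt changes the hash.
function goalHash(goalContract: string): string {
  return createHash("sha256").update(goalContract.trim()).digest("hex").slice(0, 12);
}

// Build the tags once at the start of a run.
function buildTraceTags(goalContract: string, now: Date = new Date()): TraceTags {
  return {
    run_id: randomUUID(),
    goal_hash: goalHash(goalContract),
    date: now.toISOString().slice(0, 10), // YYYY-MM-DD in UTC
  };
}
```

Pass the returned object as `metadata` on each trace so every call in a run shares one run_id and one goal_hash.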
What we stopped doing.
- × Adding tools "in case." Every tool widens the action space. The agent will find the bad combination at 3am.
- × Letting it browse the open web. Always allow-list. The internet is full of pages that look like prompts.
- × Running without a budget cap. Provider alerts are after the fact.
- × Multi-agent for single-agent jobs. Most "swarm" setups are one prompt with extra steps.
- × Skipping traces because "it'll be fine." The traces are the product. The summary is the by-product.
- × Long system prompts. Anything over 800 words and the model starts inventing flexibility you didn't grant.
The take.
An agent that runs unattended is a contract with your future self. One goal. Bounded tools. Three kill-switches. Full trace. Everything else is decoration that breaks first.
If you only steal one thing, make it the allow-list on outbound calls. It's the cheapest safeguard and the one that prevents the largest class of failures.
Don't touch the upgrades below until your basic agent has run clean for two weeks. They earn their place after you've proven the routine.
Plan with Haiku, act with Sonnet.
Use Claude Haiku 4.5 for the planning loop and Sonnet 4.6 only for the steps that hit tools. Drops cost by ~40% on long runs. The planner doesn't need the full model — it needs decisiveness.
Wire this up by setting the model parameter per request: route the "what should we do next?" call to Haiku and the "execute this tool" call to Sonnet. Two models, one API key, one job.
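The routing can be as small as a lookup from step kind to model. The sketch below builds the request body only and makes no API call. The model IDs are placeholders to fill in from Anthropic's model list, and the choice to send tool definitions only on the "act" call is an assumption about how you split the loop.

```typescript
type StepKind = "plan" | "act";

interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

interface Message {
  role: "user" | "assistant";
  content: string;
}

// Placeholder IDs: look up the exact model strings before using this.
const MODEL_BY_STEP: Record<StepKind, string> = {
  plan: "<haiku-model-id>",
  act: "<sonnet-model-id>",
};

// Builds a Messages API request body for one step of the loop.
function buildRequest(step: StepKind, messages: Message[], tools: ToolDef[]) {
  return {
    model: MODEL_BY_STEP[step],
    max_tokens: 1024,
    messages,
    // Assumption: the planner only picks the next move, so it gets no tools.
    ...(step === "act" ? { tools } : {}),
  };
}
```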
Schedule with dedup keys.
Inngest cron + a dedup key like research-agent:{date}:{goal_hash} means re-runs of the same job within the window are silent no-ops. Saves you from the day you accidentally trigger the cron three times.
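A small sketch of the key itself, plus an in-memory stand-in that shows the no-op behavior. The `seen` set only illustrates what the queue does for you and is not how Inngest stores keys. `dedupKey` and `shouldRun` are hypothetical names, and the date is taken in UTC by assumption.

```typescript
// Same date + same goal hash gives the same key, so re-triggers collide.
function dedupKey(date: Date, goalHash: string): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `research-agent:${day}:${goalHash}`;
}

// Illustration only: an in-memory version of "second trigger is a no-op".
const seen = new Set<string>();

function shouldRun(key: string): boolean {
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

Because the goal hash is part of the key, editing the goal contract on the same day produces a new key and a fresh run.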
Replay traces for prompt iteration.
Langfuse stores the full call chain. When you change the system prompt, replay the last seven nights against the new version — same inputs, new prompt. You can quantify the change before it touches production.
Add a human gate at one place.
Not five. One. The single point where, if the agent isn't sure, it pages you instead of guessing. Confidence thresholds are a footgun — a hard "ask" prompt is honest.
The right place: just before the agent writes externally — sends an email, posts to Slack, files an issue. Read-only steps never need a human gate. Write steps almost always do.
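A sketch of a single gate placed in front of write tools only. The tool names in `WRITE_TOOLS` are hypothetical (none of them is in the five-tool set above), and `askHuman` stands for whatever paging mechanism you use.

```typescript
// Hypothetical external-write tools; read-only tools are never gated.
const WRITE_TOOLS = new Set(["send_email", "post_to_slack", "file_issue"]);

// Resolves true if a human approved the call. How you page them is up to you.
type Approver = (toolName: string, input: unknown) => Promise<boolean>;

async function callWithGate<T>(
  toolName: string,
  input: unknown,
  run: () => Promise<T>,
  askHuman: Approver,
): Promise<T> {
  if (WRITE_TOOLS.has(toolName)) {
    const approved = await askHuman(toolName, input);
    if (!approved) throw new Error(`human declined ${toolName}`);
  }
  return run();
}
```

Routing every tool call through `callWithGate` keeps the gate in one place: reads pass straight through, writes wait for an answer.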
Graduate agents into functions.
After 30 days of stable traces, the steps the agent always does the same way can become deterministic function calls. The "agent" shrinks to the steps that actually need judgment. Cost drops, latency drops, reliability climbs.
Agents don't fail by erroring. They fail by quietly succeeding at the wrong thing. Five symptoms, with the cause underneath and the fix that works.
№ 01 Cost spiked overnight.
№ 02 Agent solves the wrong problem.
№ 03 Same task running three times in parallel.
… idempotency_key: research:{date} to the function trigger. Inngest collapses dupes inside the window.
№ 04 Browserbase rate-limited.
… fetch_url to 5 concurrent. Latency goes up by a minute. Reliability goes up by a lot.
№ 05 Tool returns empty after a site change.
… parse_article. If output is under 100 chars, throw instead of returning empty. The agent will flag the source for review.
Three drop-in scaffolds. The goal contract, the tool schema, the run-level guardrails. Paste them into your agent and you're 70% done.
The goal contract.
One sentence + success criteria. Put this at the top of your system prompt.
GOAL
────
[ONE SENTENCE. ONE VERB. ONE SCOPE. ONE OUTPUT FORMAT.]

Example: "Read the 40 URLs in source_list.json, summarize any article published in the last 24 hours that mentions one of the 12 tickers in tickers.json, output one JSON object per matching article to store_summary()."

SUCCESS CRITERIA (all must be true)
──────────────────────────────────────
1. At least one valid summary written, OR an explicit "no matches found" record with timestamp.
2. Every summary JSON parses against schema.json.
3. Total output under 2000 tokens.
4. Every fetch_url call hit a whitelisted domain.

FAILURE BEHAVIOR
────────────────
If any success criterion fails, call flag_for_review() with the reason and STOP. Do not retry. Do not "be helpful."
The tool schema.
Five tools, JSON-schema shape. Anthropic tool-use format. Nothing else.
{
  "tools": [
    {
      "name": "fetch_url",
      "description": "Fetch a single URL. URL MUST match domain allow-list.",
      "input_schema": {
        "type": "object",
        "properties": { "url": { "type": "string" } },
        "required": ["url"]
      }
    },
    {
      "name": "parse_article",
      "description": "Extract title, body, published_at from HTML. Throws if body < 100 chars.",
      "input_schema": {
        "type": "object",
        "properties": { "html": { "type": "string" } },
        "required": ["html"]
      }
    },
    {
      "name": "search_internal",
      "description": "Search prior summaries for duplicates.",
      "input_schema": {
        "type": "object",
        "properties": { "query": { "type": "string" } },
        "required": ["query"]
      }
    },
    {
      "name": "store_summary",
      "description": "Write a summary record. Validated against schema.",
      "input_schema": {
        "type": "object",
        "properties": {
          "ticker": { "type": "string" },
          "summary": { "type": "string" },
          "source_url": { "type": "string" },
          "published_at": { "type": "string" }
        },
        "required": ["ticker", "summary", "source_url", "published_at"]
      }
    },
    {
      "name": "flag_for_review",
      "description": "Stop the run and surface a reason for human review.",
      "input_schema": {
        "type": "object",
        "properties": { "reason": { "type": "string" } },
        "required": ["reason"]
      }
    }
  ]
}
The run-level guardrails.
Inngest function wrapper. Three hard limits, three kill-switches. Copy and edit the constants.
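// research-agent.ts: Inngest function
import { Inngest, NonRetriableError } from "inngest";

// Assumed client setup; reuse your existing Inngest client if you have one.
const inngest = new Inngest({ id: "research-agent-app" });

// Placeholder for one LLM turn plus any tool call it triggers. Implement it
// against the Anthropic API and have it report the turn's cost in USD.
declare function callClaude(args: { runId: string; costUsd: number }): Promise<{
  cost: number;
  output: string;
  done: boolean;
  result?: unknown;
}>;

const LIMITS = {
  maxSteps: 30,
  maxWallClockMs: 20 * 60 * 1000, // 20 minutes
  maxCostUsd: 5.0,
};

const KILL = {
  budget: (cost: number) => cost > LIMITS.maxCostUsd,
  content: (out: string) => /[\w.+-]+@[\w-]+\.[a-z]{2,}/i.test(out), // PII smell: email address
  // True outside the run window. Assumes a 04:00-06:00 window (4am cron,
  // brief due by 6am) and that the cron schedule and this clock are both UTC.
  time: () => {
    const hour = new Date().getUTCHours();
    return hour < 4 || hour >= 6;
  },
};

export const researchAgent = inngest.createFunction(
  { id: "research-agent", concurrency: 1 },
  { cron: "0 4 * * 1-5" }, // 4am, Mon-Fri
  async () => {
    // Time switch: a run that starts outside the window skips until the next slot.
    if (KILL.time()) return { skipped: "outside run window" };

    const runId = crypto.randomUUID();
    const startedAt = Date.now();
    let costUsd = 0;
    let stepCount = 0;

    while (stepCount < LIMITS.maxSteps) {
      // NonRetriableError keeps Inngest from retrying a run that hit a cap.
      if (Date.now() - startedAt > LIMITS.maxWallClockMs) throw new NonRetriableError("time cap");
      const turn = await callClaude({ runId, costUsd });
      costUsd += turn.cost;
      stepCount += 1;
      // Check the budget right after adding the turn's cost, before another call is made.
      if (KILL.budget(costUsd)) throw new NonRetriableError("budget cap");
      if (KILL.content(turn.output)) throw new NonRetriableError("content gate");
      if (turn.done) return turn.result;
    }
    throw new NonRetriableError("step cap");
  }
);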
Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.