№ 002 · agents · long-running workflows · filed May '26

The overnight research agent that doesn't run away.

Hard limits, tool gates, and the kill-switch we wish we'd added day one.

Austin AI Guy · 14 min read

This is the setup for a research agent that runs unattended and stops itself if anything looks wrong. Five tools, one goal, three hard limits, three independent kill-switches, a full trace.

What you'll have when you finish: a scheduled overnight job that reads 40 sources, summarizes the new material to a JSON schema, drops a brief in your inbox by 6am, costs ~$2/run, and never bills past $5.

Accounts you'll need: console.anthropic.com · inngest.com · browserbase.com · supabase.com · langfuse.com. All free or under $30/mo to start.

The stack — five tools, ranked.

01 Anthropic API — Sonnet 4.6 + tool use · daily
02 Inngest — durable queue, dedup, retry · daily
03 Browserbase + Playwright — sandboxed browsing · daily
04 Supabase Edge Functions — tool endpoints · weekly
05 Langfuse — traces, cost, replay · daily

Anthropic does the thinking. Inngest holds the job and survives a crash. Browserbase keeps the agent off your machine. Supabase hosts the tool functions. Langfuse is how you sleep at night.

How to apply it.

01 goal

Write the goal as a contract.

One sentence. One verb. One scope. One output format. Example:

"Read the 40 URLs in source_list.json, summarize any article published in the last 24 hours that mentions one of the 12 tickers in tickers.json, output one JSON object per match to store_summary()."

Now write three explicit success checks: (a) at least one valid record OR an explicit "no matches" record with timestamp, (b) every output parses against your JSON schema, (c) total output under 2,000 tokens.

If any check fails, the agent calls flag_for_review() and stops. "Done" isn't a state for an LLM — these three checks are.
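
The three checks can be sketched as one plain function. This is a minimal sketch, not the author's implementation: the record fields follow the store_summary tool, while checkRun, isValidRecord, and the shape of the "no matches" record are hypothetical, and the token count is assumed to be supplied by the caller.

```typescript
// Hypothetical record shapes; field names follow the store_summary tool.
type SummaryRecord = {
  ticker: string;
  summary: string;
  source_url: string;
  published_at: string;
};
type NoMatchRecord = { no_matches: true; timestamp: string };
type OutputRecord = SummaryRecord | NoMatchRecord;

const MAX_OUTPUT_TOKENS = 2000;

// Stand-in for real JSON-schema validation: every required field is a non-empty string.
function isValidRecord(rec: OutputRecord): boolean {
  if ("no_matches" in rec) {
    return rec.no_matches === true && rec.timestamp.length > 0;
  }
  const fields = [rec.ticker, rec.summary, rec.source_url, rec.published_at];
  return fields.every((f) => typeof f === "string" && f.length > 0);
}

// Returns null when all three checks pass, otherwise the reason to pass to flag_for_review().
function checkRun(records: OutputRecord[], outputTokens: number): string | null {
  if (records.length === 0) {
    return "check (a): no summary and no explicit no-matches record";
  }
  if (!records.every(isValidRecord)) {
    return "check (b): a record failed schema validation";
  }
  if (outputTokens >= MAX_OUTPUT_TOKENS) {
    return "check (c): output is not under 2,000 tokens";
  }
  return null;
}
```

Returning a reason string rather than a boolean means the same value can go straight into flag_for_review(reason).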

02 tools

Build the tool set — five tools, hard limits.

Define exactly five tool functions in your Supabase Edge Functions project. The full vocabulary the agent has:

— fetch_url(url) — fetch a single URL
— parse_article(html) — extract title, body, published_at (throws if body < 100 chars)
— search_internal(query) — check for duplicates
— store_summary(json) — write a validated record
— flag_for_review(reason) — stop the run, surface to a human

Allow-list the domains in fetch_url. Add a regex check at the top of the function: the URL must match your allow-list or the function throws. This single line of code prevents the largest class of agent failures.
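
A minimal sketch of that guard, under stated assumptions: the domains (example.com, example.org) and the name assertAllowedUrl are hypothetical, and the https-only rule is an extra assumption. The regex runs against the parsed hostname rather than the raw string, so lookalikes such as example.com.evil.test do not slip through.

```typescript
// Hypothetical allow-list: example.com and example.org, plus their subdomains.
const ALLOWED_HOST = /^(?:[a-z0-9-]+\.)*(?:example\.com|example\.org)$/;

// Throws unless the URL is https and its hostname matches the allow-list.
function assertAllowedUrl(rawUrl: string): URL {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    throw new Error(`fetch_url: not a valid URL: ${rawUrl}`);
  }
  if (parsed.protocol !== "https:") {
    throw new Error(`fetch_url: only https is allowed: ${rawUrl}`);
  }
  // URL lower-cases the hostname, so the regex does not need the i flag.
  if (!ALLOWED_HOST.test(parsed.hostname)) {
    throw new Error(`fetch_url: domain not on allow-list: ${parsed.hostname}`);
  }
  return parsed;
}
```

Calling assertAllowedUrl(url) as the first statement of fetch_url keeps the check ahead of any network activity.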

Register the schema with Anthropic by passing it as the tools parameter on each API request. The exact JSON schema appears under "The tool schema" below.

03 limits

Hard limits — three numbers.

Any one trips and the run terminates with a partial result, not silently. The Inngest function carries these as run-level config so a code change can't accidentally raise them.

Max steps
30

Tool calls + LLM turns combined. Past this, the goal is wrong, not the model.

Max wall-clock
20 min

Most overnight jobs finish in 6–8 min. 20 is the kill point.

Max cost
$5

Per run. The 3am bill is what you're really afraid of.

04 switches

Kill-switches — three independent ones.

Independent means each can fire without the others. If you wire them through a single check, you have one switch, not three.

Budget
per-run + per-day

Code-level check that aborts the run. Not a provider alert.

Content
PII & copied text

Regex on output. Catches the run that quietly went off-script.

Time
window or wait

Run started after the window closes? Skip. Try tomorrow's slot.
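
One way to keep the switches independent is to write each as its own predicate over its own inputs and evaluate all of them on every check. The sketch below rests on assumptions: the $10 per-day cap and the 4am to 6am window are illustrative values, the content check only covers email-shaped strings (not copied text), and RunState and firedSwitches are hypothetical names.

```typescript
// The $5 per-run cap is from this guide; the per-day cap is an assumed value.
const PER_RUN_USD = 5;
const PER_DAY_USD = 10;
// Assumed run window in server-local hours: start inclusive, end exclusive.
const WINDOW = { startHour: 4, endHour: 6 };

type RunState = { runCostUsd: number; dayCostUsd: number; output: string; now: Date };

// Each switch reads only its own inputs and shares no state with the others.
const switches: Record<string, (s: RunState) => boolean> = {
  budget: (s) => s.runCostUsd > PER_RUN_USD || s.dayCostUsd > PER_DAY_USD,
  // Email-shaped strings as a rough PII signal; a real check would cover more patterns.
  content: (s) => /[\w.+-]+@[\w-]+\.[a-z]{2,}/i.test(s.output),
  time: (s) => s.now.getHours() < WINDOW.startHour || s.now.getHours() >= WINDOW.endHour,
};

// Evaluates every switch and returns the names of all that fired.
function firedSwitches(state: RunState): string[] {
  return Object.keys(switches).filter((name) => switches[name](state));
}
```

Because firedSwitches never short-circuits, a run that is both over budget and leaking an email address reports both reasons in the trace.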

05 trace

Wire tracing on every call.

Sign up at langfuse.com, create a project, copy the public key and secret into your .env.

Install the SDK (npm i langfuse) and wrap every LLM call and every tool call:

langfuse.trace({ name, input, output, metadata: { run_id, goal_hash, date } })

Tag every trace with three fields: run_id (uuid per run), goal_hash (hash of the goal contract — so you can diff runs after editing the prompt), date.
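
A minimal sketch of building those three tags, assuming the run ID is generated elsewhere and passed in. The helper names goalHash and traceTags are hypothetical, and the hash is a dependency-free 32-bit FNV-1a stand-in over UTF-16 code units; a real setup would more likely use SHA-256.

```typescript
// Stand-in hash (32-bit FNV-1a) so the sketch has no dependencies.
function goalHash(goal: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < goal.length; i++) {
    h ^= goal.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, keeping 32 bits.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

type TraceTags = { run_id: string; goal_hash: string; date: string };

// Builds the metadata object attached to every trace in a run.
function traceTags(runId: string, goal: string, now: Date): TraceTags {
  return {
    run_id: runId,
    goal_hash: goalHash(goal),
    date: now.toISOString().slice(0, 10), // UTC calendar date, YYYY-MM-DD
  };
}
```

Any edit to the goal text changes goal_hash, which makes runs before and after a prompt change easy to separate when filtering traces.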

This is non-optional. If something looks off in the morning brief, the trace is the only debugging surface you have.

What we stopped doing.

× Adding tools "in case." Every tool widens the action space. The agent will find the bad combination at 3am.
× Letting it browse the open web. Always allow-list. The internet is full of pages that look like prompts.
× Running without a budget cap. Provider alerts are after the fact.
× Multi-agent for single-agent jobs. Most "swarm" setups are one prompt with extra steps.
× Skipping traces because "it'll be fine." The traces are the product. The summary is the by-product.
× Long system prompts. Anything over 800 words and the model starts inventing flexibility you didn't grant.

The take.

An agent that runs unattended is a contract with your future self. One goal. Bounded tools. Three kill-switches. Full trace. Everything else is decoration that breaks first.

If you only steal one thing, make it the allow-list on outbound calls. It's the cheapest safeguard and the one that prevents the largest class of failures.

Don't touch the five upgrades below until your basic agent has run clean for two weeks. They earn their place after you've proven the routine.

Plan with Haiku, act with Sonnet.

Use Claude Haiku 4.5 for the planning loop and Sonnet 4.6 only for the steps that hit tools. Drops cost by ~40% on long runs. The planner doesn't need the full model — it needs decisiveness.

Wire this through your tool-use schema by routing the "what should we do next?" call to Haiku and the "execute this tool" call to Sonnet. One API key, two model IDs, one job.
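
The routing itself can be as small as a lookup. In this sketch the model ID strings are deliberate placeholders (substitute the exact IDs from your Anthropic console), and CallKind and modelFor are hypothetical names.

```typescript
// Placeholder model IDs: substitute the exact strings from your Anthropic console.
const MODELS = {
  planner: "<haiku-model-id>",
  executor: "<sonnet-model-id>",
} as const;

type CallKind = "plan" | "execute";

// Planning calls go to the cheaper model, tool-executing calls to the larger one.
function modelFor(kind: CallKind): string {
  return kind === "plan" ? MODELS.planner : MODELS.executor;
}
```

The result goes into the model field of each request, so both kinds of call can share the same client.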

Schedule with dedup keys.

Inngest cron + a dedup key like research-agent:{date}:{goal_hash} means re-runs of the same job within the window are silent no-ops. Saves you from the day you accidentally trigger the cron three times.
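
To make the no-op behavior concrete, here is a small sketch of the key format with an in-memory stand-in for the queue's dedup window. Inngest handles the real deduplication; dedupKey, claimRun, and the seen map are hypothetical and only illustrate the idea.

```typescript
// Builds the dedup key in the research-agent:{date}:{goal_hash} format.
function dedupKey(date: string, goalHash: string): string {
  return `research-agent:${date}:${goalHash}`;
}

// In-memory stand-in for the queue's dedup window.
const seen: Record<string, true> = {};

// Returns true the first time a key is claimed and false for every repeat.
function claimRun(key: string): boolean {
  if (seen[key]) return false;
  seen[key] = true;
  return true;
}
```

Because the goal hash is part of the key, editing the goal contract produces a new key, so a changed job can still run on the same date.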

Replay traces for prompt iteration.

Langfuse stores the full call chain. When you change the system prompt, replay the last seven nights against the new version — same inputs, new prompt. You can quantify the change before it touches production.

Add a human gate at one place.

Not five. One. The single point where, if the agent isn't sure, it pages you instead of guessing. Confidence thresholds are a footgun — a hard "ask" prompt is honest.

The right place: just before the agent writes externally — sends an email, posts to Slack, files an issue. Read-only steps never need a human gate. Write steps almost always do.
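
A sketch of that single gate, assuming the agent later gains tools that write externally. The tool names send_email, post_to_slack, and file_issue are hypothetical (none of the five tools in this guide writes externally), as are gate and GateDecision.

```typescript
// Hypothetical tool names for external writes.
const EXTERNAL_WRITE_TOOLS = ["send_email", "post_to_slack", "file_issue"];

type GateDecision =
  | { action: "run" }
  | { action: "ask_human"; question: string };

// Read-only and internal tools run directly; external writes always ask first.
function gate(toolName: string, summaryOfAction: string): GateDecision {
  if (EXTERNAL_WRITE_TOOLS.indexOf(toolName) === -1) {
    return { action: "run" };
  }
  return { action: "ask_human", question: `Approve ${toolName}? ${summaryOfAction}` };
}
```

The decision depends only on the tool name, not on a model-reported confidence score, which matches the hard "ask" described above.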

Graduate agents into functions.

After 30 days of stable traces, the steps the agent always does the same way can become deterministic function calls. The "agent" shrinks to the steps that actually need judgment. Cost drops, latency drops, reliability climbs.

Agents don't fail by erroring. They fail by quietly succeeding at the wrong thing. Five symptoms, with the cause underneath and the fix that works.

№ 01 Cost spiked overnight.

Symptom: $50 bill from one run instead of $2.

Cause: No per-run cost ceiling. The agent looped on a tool call that returned partial data.

Fix: Add a hard cost cap at the Inngest function level. Not a provider alert, but a code-level check that aborts the run.
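
A code-level cap needs a running cost figure. The sketch below assumes you compute it from the token counts the Anthropic API reports in each response's usage field (input_tokens, output_tokens). The prices are parameters you supply, and turnCostUsd and addCostOrAbort are hypothetical helper names.

```typescript
// Token counts as reported in the usage field of an API response.
type Usage = { input_tokens: number; output_tokens: number };
// Prices in USD per million tokens; supply current values yourself.
type Prices = { inputPerMTok: number; outputPerMTok: number };

// Cost of a single model call in USD.
function turnCostUsd(usage: Usage, prices: Prices): number {
  return (
    (usage.input_tokens / 1e6) * prices.inputPerMTok +
    (usage.output_tokens / 1e6) * prices.outputPerMTok
  );
}

// Adds a turn's cost to the running total and throws once the cap is crossed.
function addCostOrAbort(totalUsd: number, turnUsd: number, capUsd: number): number {
  const next = totalUsd + turnUsd;
  if (next > capUsd) {
    throw new Error(`budget cap: $${next.toFixed(2)} > $${capUsd.toFixed(2)}`);
  }
  return next;
}
```

Checking after every turn means the overshoot is bounded by the cost of one call rather than by however long the loop runs.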

№ 02 Agent solves the wrong problem.

Symptom: Output is well-formed but answers a question you didn't ask.

Cause: Goal sentence is too loose. Or the success criteria don't include "answered the actual question."

Fix: Rewrite the goal with a literal example output. Add a success check: output must reference one of the 12 tickers, verbatim.
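
The verbatim check might look like the sketch below. It assumes "verbatim" means a case-sensitive match that is not embedded in a longer alphanumeric token. The tickers in the usage note are made up, and mentionsTicker is a hypothetical name.

```typescript
// Escapes regex metacharacters so tickers like "ACM.B" match literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if the summary contains at least one ticker as a standalone token.
function mentionsTicker(summary: string, tickers: string[]): boolean {
  return tickers.some((ticker) => {
    const pattern = new RegExp(
      `(^|[^A-Za-z0-9])${escapeRegExp(ticker)}($|[^A-Za-z0-9])`,
    );
    return pattern.test(summary);
  });
}
```

With tickers ["ACME", "ACM.B"], the summary "ACME beat estimates" passes, while "acme beat estimates" and "ACMEX fund rose" do not.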

№ 03 Same task running three times in parallel.

Symptom: Inngest dashboard shows duplicate runs. Three briefs in the inbox.

Cause: No dedup key on the cron. A retry fired before the first run finished.

Fix: Add idempotency_key: research:{date} to the function trigger. Inngest collapses dupes inside the window.

№ 04 Browserbase rate-limited.

Symptom: Half the sources fail with 429. Brief is half-empty.

Cause: Agent fetched all 40 in parallel. Too aggressive.

Fix: Throttle fetch_url to 5 concurrent. Latency goes up by a minute. Reliability goes up by a lot.
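
One simple way to cap concurrency at five is to fetch in fixed batches. This is a sketch with hypothetical names (chunk, fetchThrottled, FetchResult), and it takes the fetch function as a parameter rather than assuming any Browserbase API. A sliding-window pool would be faster, but batching is easier to reason about.

```typescript
type FetchResult<R> = { url: string; ok: boolean; value?: R; error?: string };

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Runs fetchOne over all URLs, at most `limit` at a time, one batch after another.
// Failures are captured per URL so one 429 does not sink the whole batch.
async function fetchThrottled<R>(
  urls: string[],
  fetchOne: (url: string) => Promise<R>,
  limit = 5,
): Promise<FetchResult<R>[]> {
  const results: FetchResult<R>[] = [];
  for (const batch of chunk(urls, limit)) {
    const settled = await Promise.all(
      batch.map((url) =>
        fetchOne(url).then(
          (value): FetchResult<R> => ({ url, ok: true, value }),
          (err): FetchResult<R> => ({ url, ok: false, error: String(err) }),
        ),
      ),
    );
    results.push(...settled);
  }
  return results;
}
```

Results come back in the same order as the input URLs, with failed fetches marked ok: false so the agent can flag them for review.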

№ 05 Tool returns empty after a site change.

Symptom: Parser returns no content for two specific sources. Agent doesn't notice.

Cause: Site redesigned. Selectors are stale.

Fix: Add a min-content check to parse_article. If output is under 100 chars, throw rather than return empty. The agent will flag the source for review.
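
The guard is small enough to sketch in full. The Article shape follows the parse_article description (title, body, published_at). requireContent is a hypothetical name, and trimming whitespace before counting is an assumption.

```typescript
type Article = { title: string; body: string; published_at: string };

const MIN_BODY_CHARS = 100;

// Throws instead of returning a near-empty article, so a stale selector surfaces as an error.
function requireContent(article: Article, sourceUrl: string): Article {
  const length = article.body.trim().length;
  if (length < MIN_BODY_CHARS) {
    throw new Error(
      `parse_article: body is ${length} chars (< ${MIN_BODY_CHARS}) for ${sourceUrl}`,
    );
  }
  return article;
}
```

Putting the source URL in the error message means the flag_for_review reason already names the site whose selectors need attention.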

Three drop-in scaffolds. The goal contract, the tool schema, the run-level guardrails. Paste them into your agent and you're 70% done.

The goal contract.

One sentence + success criteria. Put this at the top of your system prompt.

GOAL
────
[ONE SENTENCE. ONE VERB. ONE SCOPE. ONE OUTPUT FORMAT.]

Example:
"Read the 40 URLs in source_list.json, summarize any
article published in the last 24 hours that mentions one
of the 12 tickers in tickers.json, output one JSON object
per matching article to store_summary()."

SUCCESS CRITERIA — all must be true
──────────────────────────────────────
1. At least one valid summary written, OR an explicit
   "no matches found" record with timestamp.
2. Every summary JSON parses against schema.json.
3. Total output under 2000 tokens.
4. Every fetch_url call hit an allow-listed domain.

FAILURE BEHAVIOR
────────────────
If any success criterion fails, call flag_for_review()
with the reason and STOP. Do not retry. Do not "be helpful."

The tool schema.

Five tools, JSON-schema shape. Anthropic tool-use format. Nothing else.

{
  "tools": [
    {
      "name": "fetch_url",
      "description": "Fetch a single URL. URL MUST match domain allow-list.",
      "input_schema": {
        "type": "object",
        "properties": { "url": { "type": "string" } },
        "required": ["url"]
      }
    },
    {
      "name": "parse_article",
      "description": "Extract title, body, published_at from HTML. Throws if body < 100 chars.",
      "input_schema": {
        "type": "object",
        "properties": { "html": { "type": "string" } },
        "required": ["html"]
      }
    },
    {
      "name": "search_internal",
      "description": "Search prior summaries for duplicates.",
      "input_schema": {
        "type": "object",
        "properties": { "query": { "type": "string" } },
        "required": ["query"]
      }
    },
    {
      "name": "store_summary",
      "description": "Write a summary record. Validated against schema.",
      "input_schema": {
        "type": "object",
        "properties": {
          "ticker": { "type": "string" },
          "summary": { "type": "string" },
          "source_url": { "type": "string" },
          "published_at": { "type": "string" }
        },
        "required": ["ticker","summary","source_url","published_at"]
      }
    },
    {
      "name": "flag_for_review",
      "description": "Stop the run and surface a reason for human review.",
      "input_schema": {
        "type": "object",
        "properties": { "reason": { "type": "string" } },
        "required": ["reason"]
      }
    }
  ]
}

The run-level guardrails.

Inngest function wrapper. Three hard limits, three kill-switches. Copy and edit the constants.

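// research-agent.ts: Inngest function

import { randomUUID } from "node:crypto";
import { Inngest } from "inngest";
// callClaude is your own wrapper around the Anthropic API (not shown here).
// It is assumed to resolve to { cost, output, done, result } for one agent turn.
import { callClaude } from "./call-claude";

const inngest = new Inngest({ id: "research-agent-app" });

const LIMITS = {
  maxSteps: 30,
  maxWallClockMs: 20 * 60 * 1000,   // 20 minutes
  maxCostUsd: 5.00,
};

const KILL = {
  budget:  (cost: number) => cost > LIMITS.maxCostUsd,
  content: (out: string)  => /[\w.+-]+@[\w-]+\.[a-z]{2,}/i.test(out), // PII smell
  // Fires outside the run window. 4am to 6am server time is assumed here from
  // the 4am cron and the 6am delivery target; edit to match your schedule.
  time:    (now = new Date()) => now.getHours() < 4 || now.getHours() >= 6,
};

export const researchAgent = inngest.createFunction(
  { id: "research-agent", concurrency: 1 },
  { cron: "0 4 * * 1-5" },           // 4am, Mon-Fri
  async ({ event, step }) => {
    // Started outside the window? Skip and wait for tomorrow's slot.
    if (KILL.time()) return { skipped: "outside run window" };

    const runId = randomUUID();
    const startedAt = Date.now();
    let costUsd = 0;
    let stepCount = 0;

    while (stepCount < LIMITS.maxSteps) {
      if (Date.now() - startedAt > LIMITS.maxWallClockMs) throw new Error("time cap");
      // Checked before each turn, so the final turn can overshoot the cap by its own cost.
      if (KILL.budget(costUsd)) throw new Error("budget cap");

      const turn = await callClaude({ runId, costUsd });
      costUsd += turn.cost;
      stepCount += 1;

      if (KILL.content(turn.output)) throw new Error("content gate");
      if (turn.done) return turn.result;
    }

    throw new Error("step cap");
  }
);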


Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.
