№ 002 · agents · long-running workflows · filed May '26

The overnight research agent that doesn't run away.

Hard limits, tool gates, and the kill-switch we wish we'd added day one.

This is the setup for a research agent that runs unattended and stops itself if anything looks wrong. Five tools, one goal, three hard limits, three independent kill-switches, a full trace.

What you'll have when you finish: a scheduled overnight job that reads 40 sources, summarizes the new material to a JSON schema, drops a brief in your inbox by 6am, costs ~$2/run, and never bills past $5.

Accounts you'll need: console.anthropic.com · inngest.com · browserbase.com · supabase.com · langfuse.com. All free or under $30/mo to start.

01

The stack — five tools, ranked.

  • 01 Anthropic API: Sonnet 4.6 + tool use (used daily)
  • 02 Inngest: durable queue, dedup, retry (used daily)
  • 03 Browserbase + Playwright: sandboxed browsing (used daily)
  • 04 Supabase Edge Functions: tool endpoints (used weekly)
  • 05 Langfuse: traces, cost, replay (used daily)

Anthropic does the thinking. Inngest holds the job and survives a crash. Browserbase keeps the agent off your machine. Supabase hosts the tool functions. Langfuse is how you sleep at night.

02

How to apply it.

  1. 01 · goal

    Write the goal as a contract.

    One sentence. One verb. One scope. One output format. Example:

    "Read the 40 URLs in source_list.json, summarize any article published in the last 24 hours that mentions one of the 12 tickers in tickers.json, output one JSON object per match to store_summary()."

    Now write three explicit success checks: (a) at least one valid record OR an explicit "no matches" record with timestamp, (b) every output parses against your JSON schema, (c) total output under 2,000 tokens.

    If any check fails, the agent calls flag_for_review() and stops. "Done" isn't a state for an LLM — these three checks are.
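
    The three checks can be sketched as one function that returns the list of failures. This is a minimal sketch, not the author's implementation: the record fields (ticker, url, summary, published_at), the no_matches shape, and the 4-characters-per-token estimate are all assumptions made for illustration. A real build would validate against your actual JSON schema and count tokens with a real tokenizer.

    ```typescript
    // Hypothetical record shapes; the article does not specify the schema fields.
    interface SummaryRecord {
      ticker: string;
      url: string;
      summary: string;
      published_at: string; // ISO 8601
    }

    interface NoMatchRecord {
      no_matches: true;
      timestamp: string; // ISO 8601
    }

    type OutputRecord = SummaryRecord | NoMatchRecord;

    const MAX_OUTPUT_TOKENS = 2000;

    // Rough estimate (about 4 characters per token). An assumption, not a tokenizer.
    function estimateTokens(text: string): number {
      return Math.ceil(text.length / 4);
    }

    function isIsoDate(value: unknown): boolean {
      return typeof value === "string" && !Number.isNaN(Date.parse(value));
    }

    // Check (b): a hand-rolled stand-in for real JSON Schema validation.
    function isValidRecord(value: unknown): value is OutputRecord {
      if (typeof value !== "object" || value === null) return false;
      const r = value as Record<string, unknown>;
      if (r.no_matches === true) return isIsoDate(r.timestamp);
      return (
        typeof r.ticker === "string" &&
        typeof r.url === "string" &&
        typeof r.summary === "string" &&
        isIsoDate(r.published_at)
      );
    }

    // Returns the failed checks; an empty list means the run passed.
    function checkRun(outputs: unknown[]): string[] {
      const failures: string[] = [];
      if (!outputs.some(isValidRecord)) {
        failures.push("a: no valid record and no explicit no-matches record");
      }
      if (!outputs.every(isValidRecord)) {
        failures.push("b: an output does not match the schema");
      }
      if (estimateTokens(JSON.stringify(outputs)) >= MAX_OUTPUT_TOKENS) {
        failures.push("c: total output is not under 2,000 tokens");
      }
      return failures;
    }
    ```

    If checkRun returns anything other than an empty list, the run would call flag_for_review() with the joined failures as the reason and stop.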

  2. 02 · tools

    Build the tool set — five tools, hard limits.

    Define exactly five tool functions in your Supabase Edge Functions project. This is the agent's entire vocabulary:

    fetch_url(url) — fetch a single URL
    parse_article(html) — extract title, body, published_at (throws if body < 100 chars)
    search_internal(query) — check for duplicates
    store_summary(json) — write a validated record
    flag_for_review(reason) — stop the run, surface to a human

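    Allow-list the domains in fetch_url. Add a regex check at the top of the function: the URL must match your allow-list or the function throws. Run the check against the parsed hostname rather than the raw URL string, so a lookalike such as example.com.evil.net cannot slip through. This one check prevents the largest class of agent failures: the agent reading a page it was never meant to read.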
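
    A minimal sketch of that check, assuming a hypothetical allow-list of two placeholder hosts (example.com and news.example.org are stand-ins, not the article's sources). It applies the regex to the parsed hostname instead of the raw string, and refuses redirects so an allowed page cannot bounce the agent to an unlisted host. The https-only rule and the redirect policy are extra assumptions beyond what the article specifies.

    ```typescript
    // Hypothetical allow-list; replace with the hosts in your own source list.
    const ALLOWED_HOSTS: RegExp[] = [
      /^(www\.)?example\.com$/,
      /^news\.example\.org$/,
    ];

    // Throws unless the URL is https and its hostname matches the allow-list.
    function assertAllowed(rawUrl: string): URL {
      const url = new URL(rawUrl); // throws TypeError on malformed input
      if (url.protocol !== "https:") {
        throw new Error(`blocked protocol: ${url.protocol}`);
      }
      if (!ALLOWED_HOSTS.some((re) => re.test(url.hostname))) {
        throw new Error(`blocked host: ${url.hostname}`);
      }
      return url;
    }

    // fetch_url tool body: the allow-list check runs before any network call.
    async function fetchUrl(rawUrl: string): Promise<string> {
      const url = assertAllowed(rawUrl);
      // redirect: "error" makes fetch reject instead of following a redirect.
      const res = await fetch(url.href, { redirect: "error" });
      if (!res.ok) {
        throw new Error(`fetch failed with status ${res.status}`);
      }
      return await res.text();
    }
    ```

    Matching on url.hostname is the design choice that matters here: a pattern tested against the whole URL can be satisfied by a query string or a subdomain prefix that merely contains an allowed name.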

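    Register the five tools with the Anthropic Messages API by passing their JSON schemas in the tools parameter of each request. The model then replies with tool_use blocks that name the tool to call and its input. The exact JSON schema is in the Build-along tab.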
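
    For orientation, here is roughly what the five definitions could look like. The name, description, and input_schema fields are the shape the Messages API expects for each tool; everything inside them (the descriptions, the one-parameter inputs, the parameter types) is an assumption for illustration, not the schema from the Build-along tab.

    ```typescript
    // Shape of one entry in the Messages API `tools` array.
    interface ToolDef {
      name: string;
      description: string;
      input_schema: {
        type: "object";
        properties: Record<string, { type: string; description?: string }>;
        required: string[];
      };
    }

    // Helper for tools that take exactly one required parameter.
    function singleParamTool(
      name: string,
      description: string,
      param: string,
      type = "string",
    ): ToolDef {
      return {
        name,
        description,
        input_schema: {
          type: "object",
          properties: { [param]: { type } },
          required: [param],
        },
      };
    }

    // Illustrative definitions only; parameter names follow the list above.
    const TOOLS: ToolDef[] = [
      singleParamTool("fetch_url", "Fetch a single allow-listed URL.", "url"),
      singleParamTool("parse_article", "Extract title, body, and published_at from HTML.", "html"),
      singleParamTool("search_internal", "Check whether a summary already exists.", "query"),
      singleParamTool("store_summary", "Write a validated summary record.", "json", "object"),
      singleParamTool("flag_for_review", "Stop the run and surface it to a human.", "reason"),
    ];
    ```

    TOOLS would then be sent as the tools field of each request, and nothing outside this array is callable by the model.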

  3. 03 · limits

    Hard limits — three numbers.

    If any one of them trips, the run terminates loudly with a partial result instead of failing silently. The Inngest function carries these as run-level config so a code change can't accidentally raise them.

    Max steps

    30

    Tool calls + LLM turns combined. Past this, the goal is wrong, not the model.

    Max wall-clock

    20 min

    Most overnight jobs finish in 6–8 min. 20 is the kill point.

    Max cost

    $5

    Per run. The 3am bill is what you're really afraid of.
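
    A minimal sketch of a guard for the three numbers, meant to be called before every LLM turn and every tool call. The RunState shape, the LimitExceeded class, and the idea of passing the current time in as an argument are assumptions for illustration; how you track cost per call depends on your own accounting.

    ```typescript
    interface RunLimits {
      readonly maxSteps: number;
      readonly maxWallClockMs: number;
      readonly maxCostUsd: number;
    }

    // Frozen so nothing at runtime can raise a limit mid-run.
    const LIMITS: RunLimits = Object.freeze({
      maxSteps: 30,
      maxWallClockMs: 20 * 60 * 1000,
      maxCostUsd: 5,
    });

    // Hypothetical bookkeeping the run loop keeps up to date.
    interface RunState {
      steps: number; // tool calls + LLM turns combined
      startedAtMs: number;
      costUsd: number;
    }

    class LimitExceeded extends Error {
      readonly limit: keyof RunLimits;
      constructor(limit: keyof RunLimits, detail: string) {
        super(`${limit} exceeded: ${detail}`);
        this.name = "LimitExceeded";
        this.limit = limit;
      }
    }

    // Throws as soon as any one limit is reached.
    function enforceLimits(
      state: RunState,
      nowMs: number,
      limits: RunLimits = LIMITS,
    ): void {
      if (state.steps >= limits.maxSteps) {
        throw new LimitExceeded("maxSteps", `${state.steps} steps`);
      }
      const elapsedMs = nowMs - state.startedAtMs;
      if (elapsedMs >= limits.maxWallClockMs) {
        throw new LimitExceeded("maxWallClockMs", `${elapsedMs} ms elapsed`);
      }
      if (state.costUsd >= limits.maxCostUsd) {
        throw new LimitExceeded("maxCostUsd", `$${state.costUsd.toFixed(2)} spent`);
      }
    }
    ```

    The run loop would catch LimitExceeded, store whatever records it already has as the partial result, and record which limit tripped.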

  4. 04 · switches

    Kill-switches — three independent ones.

    Independent means each can fire without the others. If you wire them through a single check, you have one switch, not three.

    Budget

    per-run + per-day

    Code-level check that aborts the run. Not a provider alert.

    Content

    PII & copied text

    Regex on output. Catches the run that quietly went off-script.

    Time

    window or wait

    Run started after the window closes? Skip. Try tomorrow's slot.
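
    One way to keep the three switches independent is to write each as its own function with its own inputs, sharing no state. The sketch below assumes a few things the article does not state: a $10 per-day cap, a 01:00 to 05:00 UTC run window, and two illustrative PII patterns (email addresses and US SSN-like numbers). It covers PII only; copied-text detection would need the source text as a second input and is left out.

    ```typescript
    // null means "keep going"; a string is the reason to stop.
    type Verdict = string | null;

    // Budget switch: code-level caps checked before every paid call.
    // The per-day cap value is an assumption.
    function budgetSwitch(
      runCostUsd: number,
      dayCostUsd: number,
      perRunCapUsd = 5,
      perDayCapUsd = 10,
    ): Verdict {
      if (runCostUsd >= perRunCapUsd) return `per-run budget reached: $${runCostUsd}`;
      if (dayCostUsd >= perDayCapUsd) return `per-day budget reached: $${dayCostUsd}`;
      return null;
    }

    // Content switch: regex on output. Patterns are illustrative, not exhaustive.
    const PII_PATTERNS: RegExp[] = [
      /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, // email address
      /\b\d{3}-\d{2}-\d{4}\b/, // US SSN-like number
    ];

    function contentSwitch(output: string): Verdict {
      const hit = PII_PATTERNS.find((re) => re.test(output));
      return hit ? `output matched PII pattern ${hit.source}` : null;
    }

    // Time switch: skip a run that starts outside the window.
    // Window hours are assumptions; end hour is exclusive.
    function timeSwitch(
      now: Date,
      windowStartHourUtc = 1,
      windowEndHourUtc = 5,
    ): Verdict {
      const hour = now.getUTCHours();
      const inWindow = hour >= windowStartHourUtc && hour < windowEndHourUtc;
      return inWindow ? null : `started outside the run window at ${now.toISOString()}`;
    }
    ```

    Each switch would be called from a different point in the run: timeSwitch once at start, budgetSwitch before every LLM or tool call, contentSwitch before store_summary. A bug in one call site then leaves the other two intact.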

  5. 05 · trace

    Wire tracing on every call.

    Sign up at langfuse.com, create a project, and copy the public key and secret key into your .env.

    Install the SDK (npm i langfuse) and wrap every LLM call and every tool call:

    langfuse.trace({ name, input, output, metadata: { run_id, goal_hash, date } })

    Tag every trace with three fields: run_id (uuid per run), goal_hash (hash of the goal contract — so you can diff runs after editing the prompt), date.

    This is non-optional. If something looks off in the morning brief, the trace is the only debugging surface you have.
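
    A minimal sketch of the tagging and wrapping, kept independent of the SDK so it runs on its own. The helper names (goalHash, newRunTags, traced), the 16-character truncated SHA-256, and the whitespace normalization are assumptions for illustration. The sink parameter stands in for the langfuse.trace call shown above.

    ```typescript
    import { createHash, randomUUID } from "node:crypto";

    interface TraceTags {
      run_id: string;
      goal_hash: string;
      date: string; // YYYY-MM-DD, UTC
    }

    // Hash of the goal contract. Whitespace is normalized so that reflowing
    // the prompt does not change the hash, but any wording change does.
    function goalHash(goal: string): string {
      const normalized = goal.trim().replace(/\s+/g, " ");
      return createHash("sha256").update(normalized, "utf8").digest("hex").slice(0, 16);
    }

    // Build the three tags once per run and reuse them on every trace.
    function newRunTags(goal: string, now: Date = new Date()): TraceTags {
      return {
        run_id: randomUUID(),
        goal_hash: goalHash(goal),
        date: now.toISOString().slice(0, 10),
      };
    }

    interface TraceEvent {
      name: string;
      input: unknown;
      output: unknown;
      metadata: TraceTags;
    }

    // In production this would forward the event to langfuse.trace(...).
    type TraceSink = (event: TraceEvent) => void;

    // Wrap any LLM or tool call so both success and failure are traced.
    async function traced<T>(
      sink: TraceSink,
      tags: TraceTags,
      name: string,
      input: unknown,
      fn: () => Promise<T>,
    ): Promise<T> {
      try {
        const output = await fn();
        sink({ name, input, output, metadata: tags });
        return output;
      } catch (err) {
        sink({ name, input, output: { error: String(err) }, metadata: tags });
        throw err;
      }
    }
    ```

    Recording the failure path matters as much as the success path: the call that threw is usually the one you will be looking for in the morning.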

03

What we stopped doing.

  • × Adding tools "in case." Every tool widens the action space. The agent will find the bad combination at 3am.
  • × Letting it browse the open web. Always allow-list. The internet is full of pages that look like prompts.
  • × Running without a budget cap. Provider alerts are after the fact.
  • × Multi-agent for single-agent jobs. Most "swarm" setups are one prompt with extra steps.
  • × Skipping traces because "it'll be fine." The traces are the product. The summary is the by-product.
  • × Long system prompts. Anything over 800 words and the model starts inventing flexibility you didn't grant.

04

The take.

An agent that runs unattended is a contract with your future self. One goal. Bounded tools. Three kill-switches. Full trace. Everything else is decoration that breaks first.

If you only steal one thing, make it the allow-list on outbound calls. It's the cheapest safeguard and the one that prevents the largest class of failures.


Need this done for you? The author works on this exact thing with audit clients at austinaiguy.com.