LLM cost control — defense in depth

This is the canonical map of how memQL bounds LLM spend so that a runaway loop cannot burn unbounded money regardless of which path triggers it (epic memql#1141).

Why "defense in depth"

Earlier protections were all rate limiters that self-heal: the per-fingerprint loop breaker resets after its cooldown, and the rate ceiling drains its window and re-admits. That is correct for a transient spike, but it means a genuinely stuck loop — one that varies its request body and paces just under the rate ceiling — trickles spend forever (you'd see 429 ... rate ceiling ... blocked repeating indefinitely while money keeps leaving). Every per-path fix (per-turn breaker, produceArtifact re-delegation cap, …) patched one loop; the runaway reappeared via another.

The fix is layered. The higher layers are graceful (throttle, converge, park); the bottom layer, Layer 0, is a hard kill-switch that latches and never drains. Because the kill-switch lives at the single chokepoint every chat/messages call passes through and is path-agnostic (it counts calls, not callers), a brand-new runaway path nobody anticipated is bounded automatically.
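The difference between a self-healing limiter and a latch can be sketched in a few lines. This is an illustrative model only, not the si_guard.go code: the type names, the cumulative cap of 50, and the one-call-per-second pacing are assumptions chosen for the demonstration. It is a single-goroutine sketch with no locking.

```go
package main

import (
	"fmt"
	"time"
)

// windowLimiter is a self-healing limiter: when the window elapses, the
// counter drains and calls are admitted again.
type windowLimiter struct {
	max         int
	window      time.Duration
	windowStart time.Time
	count       int
}

func (w *windowLimiter) Allow(now time.Time) bool {
	if now.Sub(w.windowStart) >= w.window {
		w.windowStart, w.count = now, 0 // the window drains
	}
	if w.count >= w.max {
		return false
	}
	w.count++
	return true
}

// latch is a kill-switch: once the cumulative cap is crossed it never
// re-admits, no matter how much time passes.
type latch struct {
	maxTotal int
	total    int
	tripped  bool
}

func (l *latch) Allow() bool {
	if l.tripped {
		return false
	}
	if l.total >= l.maxTotal {
		l.tripped = true
		return false
	}
	l.total++
	return true
}

// simulate runs a loop that paces itself at one call per second for n
// seconds, which stays under a 20 calls / 10 s ceiling.
func simulate(n int) (limiterAdmitted, latchAdmitted int) {
	start := time.Unix(0, 0)
	wl := &windowLimiter{max: 20, window: 10 * time.Second, windowStart: start}
	lt := &latch{maxTotal: 50} // hypothetical cumulative cap
	for i := 0; i < n; i++ {
		now := start.Add(time.Duration(i) * time.Second)
		if wl.Allow(now) {
			limiterAdmitted++
		}
		if lt.Allow() {
			latchAdmitted++
		}
	}
	return limiterAdmitted, latchAdmitted
}

func main() {
	a, b := simulate(1000)
	fmt.Println("rate limiter admitted:", a, "latch admitted:", b)
}
```

Under these assumptions the paced loop is never throttled by the window limiter, while the latch stops it after a fixed number of calls.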

The layers

| Layer | Mechanism | Scope | Self-heals? | Where |
| --- | --- | --- | --- | --- |
| 0. Kill-switch | cumulative call + est-$ latch → terminal 402, never drains | process (per-scope: #1144) | no (latches) | component/memql/si_guard.go |
| 1. Rate ceiling | calls/window → synthetic 429 | process, per-lane | yes (drains) | si_guard.go |
| 1. Loop breaker | identical-request repeats → 429 | per-fingerprint | yes (cooldown) | si_guard.go |
| 2. Automation budget | executions/window → skip; bounded fail-open | process, per-automation | yes (window) | component/automations/budget.go, cluster_guard.go |
| 3. Loop terminal conditions | per-turn iteration / convergence / wallclock caps | per invocation | n/a | agent + planner loops |
| 4. Per-scope budget | cumulative latch per space / plan-lineage | per conversation/plan | no (latches) | si_guard.go via context (#1144) |

Layer 0 is the backstop behind every other layer: even when a higher layer is generous or a loop's terminal condition is loose, the cumulative kill-switch caps total spend.

Layer 0 — the kill-switch (si_guard.go)

guardedHTTPClient wraps the *http.Client of all four chat provider builds (OpenAI + Anthropic, stream + non-stream), so every chat/messages completion leaves the process through one guardedTransport. There each call is checked in order: kill-switch latch → loop breaker → rate ceiling → cumulative accounting. Crossing a cumulative cap latches the breaker open permanently: every further call returns a terminal 402 (non-retryable in both vendor SDKs, so it surfaces as a clear terminal error to the agent/planner loop) and makes no vendor request. The latch does not drain until the process restarts (or the guard is reset).
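A minimal sketch of the chokepoint idea, assuming a wrapper around http.RoundTripper. The names guardSketch, synthetic, and runDemo are hypothetical stand-ins, not the real guardedTransport, and only the kill-switch and cumulative call accounting are modelled; the loop breaker and rate ceiling are marked with a comment where they would sit.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync"
)

type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// guardSketch is a hypothetical stand-in for guardedTransport.
type guardSketch struct {
	base     http.RoundTripper
	maxCalls int // 0 means unlimited, mirroring MEMQL_LLM_MAX_TOTAL_CALLS

	mu      sync.Mutex
	calls   int
	latched bool
}

// synthetic builds a locally generated response; no vendor request is made.
func synthetic(req *http.Request, code int, msg string) *http.Response {
	if req.Body != nil {
		req.Body.Close() // a RoundTripper is expected to close the request body
	}
	return &http.Response{
		StatusCode: code,
		Status:     fmt.Sprintf("%d %s", code, http.StatusText(code)),
		Proto:      "HTTP/1.1", ProtoMajor: 1, ProtoMinor: 1,
		Header:  http.Header{"Content-Type": []string{"text/plain"}},
		Body:    io.NopCloser(strings.NewReader(msg)),
		Request: req,
	}
}

func (g *guardSketch) RoundTrip(req *http.Request) (*http.Response, error) {
	g.mu.Lock()
	// Kill-switch latch: checked first, never drains.
	if g.latched {
		g.mu.Unlock()
		return synthetic(req, http.StatusPaymentRequired, "kill-switch latched"), nil
	}
	// The loop breaker and rate ceiling would be checked here and return 429.
	// Cumulative accounting: crossing the cap latches permanently.
	if g.maxCalls > 0 && g.calls >= g.maxCalls {
		g.latched = true
		g.mu.Unlock()
		return synthetic(req, http.StatusPaymentRequired, "kill-switch latched"), nil
	}
	g.calls++
	g.mu.Unlock()
	return g.base.RoundTrip(req)
}

// runDemo sends n requests through the guard against a fake vendor.
func runDemo(n, maxCalls int) (codes []int, vendorHits int) {
	base := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		vendorHits++
		return synthetic(r, http.StatusOK, "ok"), nil
	})
	client := &http.Client{Transport: &guardSketch{base: base, maxCalls: maxCalls}}
	for i := 0; i < n; i++ {
		resp, err := client.Post("http://vendor.invalid/v1/messages", "application/json", strings.NewReader(`{}`))
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		codes = append(codes, resp.StatusCode)
	}
	return codes, vendorHits
}

func main() {
	codes, hits := runDemo(5, 3)
	fmt.Println(codes, "vendor requests:", hits)
}
```

Because the guard sits in the transport, callers need no changes: they see an ordinary HTTP response with status 402 once the cap is crossed.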

The $ figure is a deliberately conservative upper bound estimated from the request only (size + max_tokens, priced at an Opus-tier rate) — the response usage is never visible at the RoundTripper. Over-estimating is the safe direction: it latches sooner.
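As a worked example of a request-only estimate, the sketch below prices the request size as input tokens and max_tokens as output tokens, using the default rates from the environment reference (15.0 and 75.0 USD per million tokens). The 4-bytes-per-token conversion and the function name estimateUSD are assumptions for illustration; the source does not state how size is converted to tokens.

```go
package main

import "fmt"

const (
	inputPerMillion  = 15.0 // default of MEMQL_LLM_COST_INPUT_PER_MILLION
	outputPerMillion = 75.0 // default of MEMQL_LLM_COST_OUTPUT_PER_MILLION
	bytesPerToken    = 4    // assumption: rough bytes-per-token heuristic
)

// estimateUSD returns an upper-bound cost for one call, computed from the
// request alone: the body is priced as input, max_tokens as output (as if
// the model used its whole output allowance).
func estimateUSD(requestBytes, maxTokens int) float64 {
	inputTokens := (requestBytes + bytesPerToken - 1) / bytesPerToken // round up
	return float64(inputTokens)*inputPerMillion/1e6 +
		float64(maxTokens)*outputPerMillion/1e6
}

func main() {
	// 40,000 bytes -> 10,000 input tokens -> $0.15
	// 4,096 max_tokens -> $0.3072
	fmt.Printf("$%.4f\n", estimateUSD(40000, 4096)) // prints $0.4572
}
```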

Process caps default to 0 (unlimited) — a since-boot cumulative latch has no single value safe for both a tiny local run and a long-lived production node, so the process backstop is opt-in. The on-by-default conservative protection is the per-scope latch (Layer 4). Local repro: MEMQL_LLM_MAX_TOTAL_CALLS=20.

Layer 2 — automation budget

A misfiring automation (one that re-fires on its own failure, or a plan-level loop that re-creates a plan each cycle) is a runaway multiplier: each execution can drive fresh LLM calls. Storm detection used to be log-only; now a process-global, cross-executor budget (total + per-automation executions/window) skips the execution once a storm blows past it, and the cluster guard's fail-open path is bounded (it admits a capped number of unguarded executions per window during a DB outage, then fails closed).
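The two behaviours described above (skip once a budget is exceeded, and a capped fail-open during a guard outage) can be sketched together. This is a simplified single-goroutine model under assumed names (budget, Admit, guardOK); the real budget.go and cluster_guard.go are separate components whose APIs are not shown here. The limits are the defaults from the environment reference.

```go
package main

import (
	"fmt"
	"time"
)

type budget struct {
	maxTotal, maxPerAutomation, maxUnguarded int
	window                                   time.Duration
	windowStart                              time.Time
	total, unguarded                         int
	perAutomation                            map[string]int
}

func newBudget(start time.Time) *budget {
	return &budget{
		maxTotal: 600, maxPerAutomation: 120, maxUnguarded: 50,
		window: 60 * time.Second, windowStart: start,
		perAutomation: map[string]int{},
	}
}

// Admit reports whether an execution may run. guardOK is false when the
// cluster guard cannot be consulted (for example during a DB outage).
func (b *budget) Admit(now time.Time, id string, guardOK bool) bool {
	if now.Sub(b.windowStart) >= b.window {
		// New window: every counter resets, so this layer self-heals.
		b.windowStart, b.total, b.unguarded = now, 0, 0
		b.perAutomation = map[string]int{}
	}
	if b.total >= b.maxTotal || b.perAutomation[id] >= b.maxPerAutomation {
		return false // storm: skip the execution
	}
	if !guardOK {
		if b.unguarded >= b.maxUnguarded {
			return false // fail-open allowance used up: fail closed
		}
		b.unguarded++
	}
	b.total++
	b.perAutomation[id]++
	return true
}

// demo fires 200 executions of one automation, then 80 unguarded
// executions of distinct automations, each inside a single window.
func demo() (storm, outage int) {
	now := time.Unix(0, 0)
	b := newBudget(now)
	for i := 0; i < 200; i++ {
		if b.Admit(now, "misfiring", true) {
			storm++
		}
	}
	b = newBudget(now)
	for i := 0; i < 80; i++ {
		if b.Admit(now, fmt.Sprintf("automation-%d", i), false) {
			outage++
		}
	}
	return storm, outage
}

func main() {
	storm, outage := demo()
	fmt.Println("storm admitted:", storm, "outage admitted:", outage)
}
```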

Layer 3 — loop terminal-condition audit

Every LLM-driving loop already carries a per-turn terminal condition; the residual gap is cross-turn cumulative spend, which Layers 0 and 4 close.

| Loop | File | Per-turn cap | Cross-turn / lineage cap | Convergence guards |
| --- | --- | --- | --- | --- |
| Agent tool loop | integrations/agent/streaming.go, nonstreaming.go | 120 iters (MEMQL_TOOL_LOOP_MAX_ITERATIONS), 180s wallclock (MEMQL_TURN_WALLCLOCK_TIMEOUT_SECONDS) | none → Layer 0 / 4 | 3 repeat-failures (MEMQL_TOOL_LOOP_MAX_REPEAT_FAILURES), 3 all-errored rounds, 2 produceArtifact re-delegations |
| Engine SI tool loop | component/memql/si_tool_loop.go | 120 iters (MEMQL_TOOL_LOOP_MAX_ITERATIONS), 8 tool-calls/iter | none → Layer 0 / 4 | all-errored guard, identical-call breaker |
| Planner decompose loop | integrations/planner/agent_loop.go | 5 iters/cycle (MEMQL_PLANNER_MAX_ITERATIONS_PER_CYCLE) | 8 calls + 2M tokens/plan (MEMQL_PLANNER_MAX_INVOCATIONS_PER_PLAN, MEMQL_PLANNER_DEFAULT_TOKEN_BUDGET) | 2 identical decisions (MEMQL_PLANNER_MAX_IDENTICAL_DECISIONS) |
| Cognition conductor | integrations/cognition/conductor_consult.go | single structured call (≤3 branch re-invokes) | n/a | n/a |
| Suggest | component/grpc/ AiSuggest | single call | n/a | n/a |

Verdict: no loop lacks a per-turn terminal condition. The agent / engine tool loops have no native cross-turn cap and rely on Layer 0 (process) and Layer 4 (per-space / plan-lineage) for cumulative bounding.

Environment reference

Layer 0 — kill-switch

| env | default | meaning |
| --- | --- | --- |
| MEMQL_LLM_KILL_SWITCH_ENABLED | true | master switch |
| MEMQL_LLM_MAX_TOTAL_CALLS | 0 (unlimited) | cumulative admitted-call cap |
| MEMQL_LLM_MAX_TOTAL_COST_USD | 0 (unlimited) | cumulative est-$ cap |
| MEMQL_LLM_COST_INPUT_PER_MILLION | 15.0 | est. input price |
| MEMQL_LLM_COST_OUTPUT_PER_MILLION | 75.0 | est. output price |

Layer 1 — rate ceiling + loop breaker

| env | default |
| --- | --- |
| MEMQL_LLM_LOOP_GUARD_ENABLED | true |
| MEMQL_LLM_LOOP_MAX_REPEAT | 8 |
| MEMQL_LLM_RATE_GUARD_ENABLED | true |
| MEMQL_LLM_MAX_CALLS_PER_WINDOW | 20 |
| MEMQL_LLM_RATE_WINDOW_SECONDS | 10 |
| MEMQL_LLM_BG_MAX_CALLS_PER_WINDOW | 40 |
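To make the loop-breaker default concrete, here is a rough sketch of a per-fingerprint breaker. It assumes the fingerprint is a SHA-256 of the request body, that 8 repeats are admitted before blocking, and a 30-second cooldown; none of these details are confirmed by the source, and the type and function names are hypothetical. It also shows why a loop that varies its body slips past this layer.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

type entry struct {
	repeats      int
	blockedUntil time.Time
}

// loopBreaker blocks a fingerprint for a cooldown once it repeats too often.
// Single-goroutine sketch: no locking, no eviction of old fingerprints.
type loopBreaker struct {
	maxRepeat int
	cooldown  time.Duration
	seen      map[string]*entry
}

func fingerprint(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

func (b *loopBreaker) Allow(now time.Time, body []byte) bool {
	fp := fingerprint(body)
	e := b.seen[fp]
	if e == nil {
		e = &entry{}
		b.seen[fp] = e
	}
	if now.Before(e.blockedUntil) {
		return false // still cooling down: caller would see a 429
	}
	e.repeats++
	if e.repeats > b.maxRepeat {
		e.repeats = 0
		e.blockedUntil = now.Add(b.cooldown)
		return false
	}
	return true
}

// demo sends 10 identical requests, then 10 requests that differ by a
// nonce, then retries the identical request after the cooldown.
func demo() (identical, varied int, healed bool) {
	b := &loopBreaker{maxRepeat: 8, cooldown: 30 * time.Second, seen: map[string]*entry{}}
	now := time.Unix(0, 0)
	same := []byte(`{"messages":"retry"}`)
	for i := 0; i < 10; i++ {
		if b.Allow(now, same) {
			identical++
		}
	}
	for i := 0; i < 10; i++ {
		body := []byte(fmt.Sprintf(`{"messages":"retry","nonce":%d}`, i))
		if b.Allow(now, body) {
			varied++
		}
	}
	healed = b.Allow(now.Add(30*time.Second), same)
	return identical, varied, healed
}

func main() {
	fmt.Println(demo())
}
```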

Layer 2 — automation budget

| env | default |
| --- | --- |
| MEMQL_AUTOMATION_BUDGET_ENABLED | true |
| MEMQL_MAX_AUTOMATION_EXECUTIONS_PER_WINDOW | 600 |
| MEMQL_MAX_AUTOMATION_EXECUTIONS_PER_AUTOMATION | 120 |
| MEMQL_AUTOMATION_BUDGET_WINDOW_SECONDS | 60 |
| MEMQL_MAX_UNGUARDED_AUTOMATION_EXECUTIONS_PER_WINDOW | 50 |

Layer 3 — loop caps

| env | default |
| --- | --- |
| MEMQL_TOOL_LOOP_MAX_ITERATIONS | 120 (clamp 200) |
| MEMQL_TOOL_LOOP_MAX_REPEAT_FAILURES | 3 |
| MEMQL_TURN_WALLCLOCK_TIMEOUT_SECONDS | 180 |
| MEMQL_PLANNER_MAX_ITERATIONS_PER_CYCLE | 5 |
| MEMQL_PLANNER_MAX_INVOCATIONS_PER_PLAN | 8 |
| MEMQL_PLANNER_DEFAULT_TOKEN_BUDGET | 2000000 |
| MEMQL_PLANNER_MAX_IDENTICAL_DECISIONS | 2 |
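The "120 (clamp 200)" entry suggests the iteration cap is read from the environment and clamped to an upper bound. One way that could be done is sketched below; parseClamped and envInt are hypothetical helpers, and the lower bound of 1 and the fall-back-to-default behaviour on unparsable input are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseClamped parses raw as an int and clamps it to [lo, hi].
// Unparsable input falls back to def (an assumed policy).
func parseClamped(raw string, def, lo, hi int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return def
	}
	if n < lo {
		return lo
	}
	if n > hi {
		return hi
	}
	return n
}

// envInt reads name from the environment, using def when it is unset.
func envInt(name string, def, lo, hi int) int {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	return parseClamped(raw, def, lo, hi)
}

func main() {
	// With the variable unset this prints the default, 120.
	fmt.Println(envInt("MEMQL_TOOL_LOOP_MAX_ITERATIONS", 120, 1, 200))
}
```

Clamping means a mistyped value such as 5000 cannot silently remove the per-turn bound.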

Layer 4 — per-scope budget

| env | default | meaning |
| --- | --- | --- |
| MEMQL_LLM_SCOPE_GUARD_ENABLED | true | per-(space, plan-lineage) latch |
| MEMQL_LLM_SCOPE_MAX_CALLS | 600 | cumulative calls per scope |
| MEMQL_LLM_SCOPE_MAX_COST_USD | 20.0 | cumulative est-$ per scope |
| MEMQL_LLM_SCOPE_IDLE_TTL_SECONDS | 3600 | prune scopes idle this long |

The per-scope latch is on by default (unlike the process caps): a scope is one conversation/space or plan-lineage, so a runaway within one scope is unambiguous and latching it kills just that loop — other conversations are unaffected and no restart is needed. Caps sit above the 120-iteration per-turn cap so a single deep turn never trips them. Scopes are stamped via ContextWithBudgetScope at the streaming + non-streaming agent loops and the planner decompose loop.
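A sketch of how a scope carried in the context could drive a per-scope latch. The signature of ContextWithBudgetScope is not shown in this document, so the example uses its own hypothetical withScope helper; the scopeGuard type, the string scope key, and admitting unscoped calls are likewise assumptions. Idle-TTL pruning and locking are left out.

```go
package main

import (
	"context"
	"fmt"
)

type scopeKey struct{}

// withScope is a hypothetical stand-in for stamping a budget scope.
func withScope(ctx context.Context, scope string) context.Context {
	return context.WithValue(ctx, scopeKey{}, scope)
}

type scopeState struct {
	calls   int
	costUSD float64
	latched bool
}

// scopeGuard keeps one cumulative latch per scope.
type scopeGuard struct {
	maxCalls   int
	maxCostUSD float64
	scopes     map[string]*scopeState
}

func newScopeGuard() *scopeGuard {
	// Defaults of MEMQL_LLM_SCOPE_MAX_CALLS and MEMQL_LLM_SCOPE_MAX_COST_USD.
	return &scopeGuard{maxCalls: 600, maxCostUSD: 20.0, scopes: map[string]*scopeState{}}
}

// Admit charges one call with an estimated cost to the scope found in ctx.
func (g *scopeGuard) Admit(ctx context.Context, estUSD float64) bool {
	scope, ok := ctx.Value(scopeKey{}).(string)
	if !ok {
		return true // assumption: unscoped calls are left to the process-level guard
	}
	s := g.scopes[scope]
	if s == nil {
		s = &scopeState{}
		g.scopes[scope] = s
	}
	if s.latched {
		return false
	}
	if s.calls >= g.maxCalls || s.costUSD+estUSD > g.maxCostUSD {
		s.latched = true // only this scope is killed
		return false
	}
	s.calls++
	s.costUSD += estUSD
	return true
}

// demo runs one scope into the call cap, checks that a second scope is
// unaffected, and runs a third scope into the cost cap.
func demo() (byCalls int, otherOK bool, byCost int) {
	g := newScopeGuard()
	a := withScope(context.Background(), "space:a")
	for i := 0; i < 700; i++ {
		if g.Admit(a, 0.01) {
			byCalls++
		}
	}
	otherOK = g.Admit(withScope(context.Background(), "space:b"), 0.01)
	c := withScope(context.Background(), "plan:c")
	for i := 0; i < 100; i++ {
		if g.Admit(c, 0.5) {
			byCost++
		}
	}
	return byCalls, otherOK, byCost
}

func main() {
	fmt.Println(demo())
}
```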

Reproducing a runaway safely (local)

Set tight caps in the cluster env, then trigger any loop and confirm spend is hard-bounded and the loop terminates with a clear signal:

```shell
MEMQL_LLM_MAX_TOTAL_CALLS=3                     # Layer 0 hard stop after 3 calls
MEMQL_TOOL_LOOP_MAX_REPEAT_FAILURES=1           # Layer 3 repeat-failure cap
MEMQL_MAX_AUTOMATION_EXECUTIONS_PER_WINDOW=20   # Layer 2 total executions per window
```

The kill-switch alert LLM KILL-SWITCH LATCHED is the terminal signal; after it, every LLM call returns a 402 and no vendor request is made.