Lab 3 — The Provider-Agnostic Core: Config-Driven Fallback and Graceful Failover (Milestone)
Time: ~5 hrs · Difficulty: Core / Stretch · Builds on: Labs 1–2 and the Month 6 agent
Objective
This is the month’s milestone. You will refactor your Month 6 agent so its entire model layer is pluggable and config-driven — four providers behind one LLMClient interface, selected by config.toml, with zero changes to the agent loop when you swap providers. You will add a fallback chain that tries providers in order and degrades gracefully on rate-limit/timeout/connection failures, terminating at local Ollama (always available, always free). You will add the self-registering tool registry from Lab 2 so new tools drop in without touching the core. Then you will prove graceful failover: run the agent on the primary provider, blackhole the primary by pointing its base URL at an unreachable host, run the same task again, and watch it cascade to Ollama and finish — for $0. Done means a model swap is a ten-minute config edit, and you have the recording to prove the system survives an outage.
Setup
cd ~/agentic/month-07
ls llm/ config.py config.toml strategies.py tools.py prompts/ # all Lab 1-2 artifacts present
ollama serve >/dev/null 2>&1 &
mkdir -p sandbox
printf 'def add(a, b):\n return a + b\n' > sandbox/calc.py
printf 'def greet(name):\n return f"hi {name}"\n' > sandbox/util.py
( cd sandbox && git init -q && git add -A && git commit -qm "seed" )
Checkpoint: git -C sandbox log --oneline shows the seed commit and ls sandbox/*.py lists two files. This folder is the agent’s jail. Confirm Labs 1–2 still pass: uv run pytest -q.
If not: no commit means git init didn’t run or there was nothing staged — re-run the Setup block. If pytest fails, a Lab 1–2 artifact is missing or broken; fix that first, because this lab assembles all of them.
Background
Recall first (from memory): From Lab 1, what type does every provider’s complete() return? From the README, which errors should make a fallback chain cascade to the next provider, and which should it surface without cascading? And in one line: what makes the Month 6 agent loop stop? Answer all three before starting.
flowchart TD
A["FallbackClient.complete()"] --> B["Try primary (e.g. openai)"]
B --> C{"Error?"}
C -->|"None"| H["Return ModelReply, record served_by"]
C -->|"429 / timeout / connection"| D["Try next: ollama (local, free)"]
C -->|"400 / fatal"| G["Re-raise, do not cascade"]
D --> E{"Ollama OK?"}
E -->|Yes| H
E -->|No| F["RuntimeError: chain exhausted"]
Notice: the chain only walks forward on retryable-elsewhere errors; a fatal 400 stops it. Because local Ollama is the last link, the blackholed-primary demo still finishes — for $0.
You have all the parts: providers behind an interface (Lab 1), a registry + config + tool registry + versioned prompts (Lab 2). This lab assembles them into the refactored agent and adds the one genuinely new piece — the fallback chain (README §8). The chain is an ordered list of LLMClients; you try each in turn, cascading only on retryable-elsewhere errors (429, timeout, connection refused) and surfacing everything else. Ollama goes last because it is local, free, and needs no network or quota — it is the safety net that makes the whole thing survive an outage at $0. Re-read README §8 on error classification before you start: cascading on a fatal 400 just burns the chain and hides your bug.
Steps
The new skill of this lab is the fallback chain with correct error classification. Step 1 is the worked example (the FallbackClient); Stage 2 of Step 1 is faded practice (you finish one test case); Step 6 is the independent capstone (drive the live failover demo end-to-end).
1. Stage 1 — Worked example (I do): build the fallback chain
Create llm/fallback.py. It takes the ordered list of clients (built from config) and tries them until one succeeds. Read it closely — the load-bearing detail is the error classification: RETRYABLE + 429 continue down the chain; every other HTTPError re-raises. The last_served_by field is what the agent loop will write into the trace so the failover is provable after the fact.
# llm/fallback.py
from __future__ import annotations
import logging
import requests
from .base import LLMClient, ModelReply

log = logging.getLogger("agent.fallback")

# Errors that mean "this provider can't serve me right now" -> try the next one.
RETRYABLE = (
    requests.exceptions.ConnectionError,  # blackholed / unreachable host
    requests.exceptions.Timeout,
)

def _is_rate_limit(exc: Exception) -> bool:
    r = getattr(exc, "response", None)
    return r is not None and r.status_code == 429

class FallbackClient:
    """An LLMClient that wraps an ordered chain and fails over gracefully."""
    name = "fallback-chain"

    def __init__(self, chain: list[LLMClient]) -> None:
        if not chain:
            raise ValueError("fallback chain is empty")
        self.chain = chain
        self.last_served_by: str | None = None  # recorded into the trace by the loop

    def complete(self, messages: list[dict], tools: list[dict]) -> ModelReply:
        last_exc: Exception | None = None
        for client in self.chain:  # ordered: primary -> ... -> ollama
            try:
                reply = client.complete(messages, tools)
                self.last_served_by = client.name
                if last_exc is not None:
                    log.warning("recovered: %s served after failover", client.name)
                return reply
            except RETRYABLE as e:
                last_exc = e
                log.warning("provider %s unavailable (%s); falling over", client.name, type(e).__name__)
                continue
            except requests.exceptions.HTTPError as e:
                if _is_rate_limit(e):  # 429: retryable-elsewhere
                    last_exc = e
                    log.warning("provider %s rate-limited (429); falling over", client.name)
                    continue
                raise  # 4xx/5xx that isn't 429: SURFACE, don't cascade
        raise RuntimeError(f"all providers exhausted; last error: {last_exc}") from last_exc
Stage 2 — Faded practice (we do): test the classification yourself
The chain logic is testable without any network using fake clients. The cascade test is given in full; the fatal-error-surfaces test is the one that catches the most common fallback bug, so you write its body. Create tests/test_fallback.py with the cascade test and the skeleton, then fill the TODO before checking.
# tests/test_fallback.py: fill the TODO in the second test
import requests, pytest
from llm.base import ModelReply
from llm.fallback import FallbackClient

class Boom:  # a provider that always connection-fails
    name = "boom"
    def complete(self, messages, tools):
        raise requests.exceptions.ConnectionError("blackholed")

class Good:  # a provider that always succeeds
    name = "good"
    def complete(self, messages, tools):
        return ModelReply(text="ok from good")

def test_cascades_past_dead_provider_to_survivor():
    fc = FallbackClient([Boom(), Good()])
    reply = fc.complete([{"role": "user", "content": "hi"}], tools=[])
    assert reply.text == "ok from good"
    assert fc.last_served_by == "good"  # the survivor served it

def test_fatal_error_surfaces_not_cascades():
    class Fatal:
        name = "fatal"
        def complete(self, messages, tools):
            r = requests.Response(); r.status_code = 400
            raise requests.exceptions.HTTPError(response=r)
    # TODO: assert that FallbackClient([Fatal(), Good()]).complete([], []) RAISES HTTPError
    # (a 400 must NOT fall over to Good). Use pytest.raises.
    ...
Check your fill-in
```python
    with pytest.raises(requests.exceptions.HTTPError):
        FallbackClient([Fatal(), Good()]).complete([], [])  # 400 must NOT fall over to Good
```
Checkpoint: the full reference test (both cases, for comparison):
Full tests/test_fallback.py
```python
# tests/test_fallback.py
import requests, pytest
from llm.base import ModelReply
from llm.fallback import FallbackClient

class Boom:  # a provider that always connection-fails
    name = "boom"
    def complete(self, messages, tools):
        raise requests.exceptions.ConnectionError("blackholed")

class Good:  # a provider that always succeeds
    name = "good"
    def complete(self, messages, tools):
        return ModelReply(text="ok from good")

def test_cascades_past_dead_provider_to_survivor():
    fc = FallbackClient([Boom(), Good()])
    reply = fc.complete([{"role": "user", "content": "hi"}], tools=[])
    assert reply.text == "ok from good"
    assert fc.last_served_by == "good"  # the survivor served it

def test_fatal_error_surfaces_not_cascades():
    class Fatal:
        name = "fatal"
        def complete(self, messages, tools):
            r = requests.Response(); r.status_code = 400
            raise requests.exceptions.HTTPError(response=r)
    with pytest.raises(requests.exceptions.HTTPError):
        FallbackClient([Fatal(), Good()]).complete([], [])  # 400 must NOT fall over to Good
```
Run uv run pytest -q tests/test_fallback.py — 2 passed. You’ve proven, deterministically, that the chain cascades on a dead provider but surfaces a fatal 400. That second test is the guardrail against the most common fallback bug.
If not: if test_fatal_error_surfaces_not_cascades fails (no exception raised), your complete is cascading on a non-429 HTTPError — only RETRYABLE and 429 may continue; every other HTTPError must raise. If test_cascades... fails on last_served_by, you didn’t set self.last_served_by = client.name before returning the reply.
2. Build the chain from config
Add a builder that turns the config’s primary + fallback list into a single FallbackClient. Append to build.py (from Lab 2):
# add to build.py
from llm.fallback import FallbackClient

def build_chain(cfg) -> FallbackClient:
    """primary first, then each fallback, in order. The chain IS an LLMClient."""
    clients = [client_from(cfg.primary)] + [client_from(f) for f in cfg.fallback]
    return FallbackClient(clients)
Checkpoint:
uv run python -c "
from config import load_config
from build import build_chain
chain = build_chain(load_config())
print('chain:', [c.name for c in chain.chain])"
You should see chain: ['ollama', 'openrouter', 'ollama'] (primary ollama, then the two fallbacks from Lab 2’s config). The chain is just a list of LLMClients — and FallbackClient is itself an LLMClient, so the agent can hold it exactly as it held a single provider. That uniformity is why the agent loop won’t need to know fallback exists.
If not: a short chain (e.g., only one name) means your config.toml lost its [[fallback]] entries — restore them from Lab 2 step 3. An unknown provider error means import llm.providers isn’t running before build_chain (it’s imported inside build.py; confirm that line survived).
3. Refactor the agent loop to depend only on the interface
Now the payoff. Create agent.py — the Month 6 loop, but every provider concern is gone. It takes any LLMClient (here, the fallback chain), advertises tools.SCHEMAS, dispatches via tools.TOOLS, loads the system prompt by version from config, and records the serving provider and prompt version in the trace.
# agent.py: the Provider-Agnostic Core. No provider names. No if-elif. No type-check ladder.
from __future__ import annotations
import json, logging, sys, time
from pathlib import Path

import tools  # importing registers all @tool functions
from config import load_config, load_prompt
from build import build_chain
from llm.base import LLMClient

logging.basicConfig(level=logging.WARNING, format="%(name)s %(levelname)s %(message)s")
MAX_STEPS = 12
TRACE = Path("trace.jsonl")

def _trace(event: str, **fields):
    with TRACE.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "event": event, **fields}) + "\n")

def run_agent(client: LLMClient, task: str, system: str, prompt_version: str) -> str:
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": task}]
    _trace("start", task=task, prompt_version=prompt_version)
    for step in range(MAX_STEPS):
        reply = client.complete(messages, tools.SCHEMAS)  # interface call. No provider knowledge.
        served_by = getattr(client, "last_served_by", client.name)  # which provider actually served
        print(f" [model:{served_by}] in={reply.tokens_in} out={reply.tokens_out}", file=sys.stderr)
        if not reply.tool_calls:  # STOP: no tool call = done
            _trace("final", step=step, served_by=served_by, text=reply.text[:500])
            return reply.text
        messages.append({"role": "assistant", "content": reply.text,
                         "tool_calls": [{"id": c["id"], "type": "function",
                                         "function": {"name": c["name"], "arguments": c["arguments"]}}
                                        for c in reply.tool_calls]})
        for call in reply.tool_calls:
            name = call["name"]
            args = call["arguments"]  # raw JSON string until parsed below
            try:
                args = json.loads(args)  # NEVER eval; parse as JSON
                print(f" [tool] {name}({args})", file=sys.stderr)
                result = tools.TOOLS[name](**args)  # registry dispatch; no branch per tool
                ok = True
            except Exception as e:  # malformed arguments or a tool failure: report it to the model
                result, ok = f"ERROR: {e}", False
            _trace("tool_call", step=step, served_by=served_by, tool=name,
                   args=args, ok=ok, result_size=len(str(result)))
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    _trace("aborted", reason="hit MAX_STEPS")
    return "ABORTED: hit step limit"

if __name__ == "__main__":
    cfg = load_config()
    client = build_chain(cfg)  # a FallbackClient, but typed as LLMClient
    system = load_prompt("agent_system", cfg.agent_system_prompt)
    TASK = ("List the .py files in the working directory, read each, then write SUMMARY.md "
            "with a one-paragraph summary of every .py file. Then run: git add -A and git commit "
            "-m 'summary'. Only reply DONE: after the commit succeeds.")
    print(run_agent(client, TASK, system, cfg.agent_system_prompt))
You need a run_shell tool for the commit step. Add it to tools.py (allow-listed, argument-list, jailed — straight from Month 6):
# add to tools.py
import subprocess

ALLOWED_SHELL = {"ls", "cat", "git", "python", "wc", "grep"}

@tool({"type": "function", "function": {
    "name": "run_shell", "description": "Run one allow-listed command as a list of args, e.g. ['git','add','-A']. Allowed: ls, cat, git, python, wc, grep.",
    "parameters": {"type": "object",
                   "properties": {"command": {"type": "array", "items": {"type": "string"}}},
                   "required": ["command"]}}})
def run_shell(command: list[str]) -> str:
    if not command or command[0] not in ALLOWED_SHELL:
        raise ValueError(f"command '{command[:1]}' not allowed")
    proc = subprocess.run(command, cwd=ROOT, capture_output=True, text=True, timeout=30)
    return ((proc.stdout + proc.stderr).strip() or "(no output)")[:4000]
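To see why run_shell takes an argument list rather than one shell string, here is a minimal sketch of the allow-list check in isolation. The check_allowed and is_allowed helpers are hypothetical names introduced only for this illustration; the set mirrors ALLOWED_SHELL above, and nothing is executed.

```python
ALLOWED_SHELL = {"ls", "cat", "git", "python", "wc", "grep"}


def check_allowed(command: list[str]) -> None:
    """Raise ValueError unless the first argument is an allow-listed program."""
    if not command or command[0] not in ALLOWED_SHELL:
        program = command[0] if command else "(empty)"
        raise ValueError(f"command {program!r} not allowed")


def is_allowed(command: list[str]) -> bool:
    """Boolean wrapper around check_allowed, convenient for quick experiments."""
    try:
        check_allowed(command)
        return True
    except ValueError:
        return False


print(is_allowed(["git", "add", "-A"]))        # program is on the allow-list
print(is_allowed(["rm", "-rf", "."]))          # program is not on the allow-list
print(is_allowed(["git add -A && rm -rf ."]))  # the whole string is treated as one program name
print(is_allowed([]))                          # an empty command is rejected
```

Only the first element is checked against the list. When an accepted list is later passed to subprocess.run without shell=True, operators such as && arrive as literal arguments instead of being interpreted by a shell.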
Checkpoint: grep proves the loop is provider-agnostic:
! grep -Eq 'Ollama|Anthropic|OpenAI|if .*provider ==|isinstance' agent.py && echo "loop is provider-agnostic OK"
It should print loop is provider-agnostic OK. The agent loop names no provider and contains no selection branch — providers, tools, and prompts all arrive via injection, the registry, and config.
If not: if the grep finds a match (no “OK” printed), a provider name or an if provider ==/isinstance leaked into agent.py — that’s an unfinished refactor. Move that knowledge out: providers come from build_chain(cfg), tools from tools.TOOLS/tools.SCHEMAS, the prompt from load_prompt(...). The loop should mention only LLMClient.
4. Happy-path run on the primary provider
rm -f trace.jsonl
uv run python agent.py
Checkpoint: on stderr you see interleaved [model:ollama] in/out and [tool] name(args) lines, ending in a DONE: message. Then verify:
cat sandbox/SUMMARY.md # a paragraph per .py file
git -C sandbox log --oneline # the agent's commit appears
grep -o '"served_by": "[a-z]*"' trace.jsonl | sort -u # which provider served (should be ollama)
grep '"prompt_version"' trace.jsonl | head -1 # the prompt version is recorded
You should see a real SUMMARY.md, a new commit, "served_by": "ollama", and the prompt version in the trace. The trace now answers “which provider and which prompt produced this run?” — exactly what you need to make the failover visible.
If not: if the agent answers in prose and never calls a tool, your survivor model is too weak — use qwen2.5:7b and confirm tools.SCHEMAS is non-empty and passed to complete. If the commit step fails, sandbox isn’t a git repo (re-run Setup) or git isn’t in ALLOWED_SHELL. If served_by/prompt_version are missing from the trace, you’re running an older agent.py — see Troubleshooting.
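The served_by check in this step can also be done in Python instead of grep. A minimal sketch, assuming the trace has the one-JSON-object-per-line shape that _trace() writes; the served_by_counts helper and the SAMPLE lines (including the "v2" prompt version) are hypothetical and exist only for illustration.

```python
import json
from collections import Counter


def served_by_counts(lines):
    """Count trace events per serving provider, skipping events without served_by."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "served_by" in event:
            counts[event["served_by"]] += 1
    return counts


# Hypothetical trace lines in the shape _trace() writes.
SAMPLE = [
    '{"ts": 1.0, "event": "start", "task": "demo", "prompt_version": "v2"}',
    '{"ts": 2.0, "event": "tool_call", "step": 0, "served_by": "ollama", '
    '"tool": "run_shell", "args": {}, "ok": true, "result_size": 12}',
    '{"ts": 3.0, "event": "final", "step": 1, "served_by": "ollama", "text": "DONE: committed"}',
]

print(dict(served_by_counts(SAMPLE)))
```

Against a real run, the same function accepts an open file, for example served_by_counts(open("trace.jsonl", encoding="utf-8")). The start event carries prompt_version but no served_by, which is why it is skipped in the tally.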
5. The ten-minute model swap (the definition of done)
Time yourself. Edit only config.toml: change [primary] to a different model you have (ollama pull llama3.1:8b first if needed) — e.g. model = "llama3.1:8b". Re-run uv run python agent.py.
Checkpoint: the agent runs against the new model with zero source changes — the diff is one line of TOML. This is the milestone’s behavioral target: swapping a model is a config edit, not a rewrite. (Set it back to qwen2.5:7b for the best tool-calling in the next step.)
If not: a model ... not found error from Ollama means you didn’t pull it — ollama pull llama3.1:8b first. If you found yourself editing any .py file to make the swap work, the model layer isn’t fully config-driven yet — revisit step 3; only config.toml should change.
6. The graceful-failover demonstration (the heart of the milestone)
Now prove the system survives an outage — for $0. Make your primary an unreachable host and your fallback the local Ollama, so the primary connection-fails and the chain cascades to Ollama, which finishes the task. Edit config.toml:
[primary]
provider = "openai" # pretend this is your real paid primary
model = "gpt-4o-mini"
base_url = "http://10.255.255.1:1" # BLACKHOLE: an unreachable host -> ConnectionError
api_key_env = "OPENAI_API_KEY"
[[fallback]]
provider = "ollama" # the free local survivor
model = "qwen2.5:7b"
base_url = "http://localhost:11434"
Then run, capturing the transcript:
rm -f trace.jsonl
uv run python agent.py 2>&1 | tee failover_demo.txt
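Checkpoint: in failover_demo.txt you should see the warning provider openai unavailable (ConnectionError); falling over, then [model:ollama] ... lines as Ollama picks up the same task and drives it to DONE:. The name in parentheses is the exception class, so it may read ConnectTimeout instead if the blackhole address times out while connecting rather than refusing. Verify the work completed on the survivor: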
grep "falling over" failover_demo.txt # the cascade happened
grep -o '"served_by": "[a-z]*"' trace.jsonl | sort -u # should show ollama (the survivor served)
cat sandbox/SUMMARY.md && git -C sandbox log --oneline | head -1 # task completed anyway
The primary was blackholed; the agent did not crash; it failed over to free local Ollama and finished. Cost: $0. The trace records that Ollama served the calls, so the failover is visible after the fact, not just in the live log. This is the milestone: not that the happy path works, but that the unhappy path degrades gracefully.
If not: if the run hangs instead of failing over fast, the blackhole address is timing out slowly — lower the provider timeout or use http://localhost:1 (refuses immediately); Timeout is in RETRYABLE, so it still cascades. If the agent crashes instead of falling over, your error classification is letting the connection error escape — confirm ConnectionError is in RETRYABLE. If served_by shows the blackholed primary, you didn’t build a FallbackClient (you passed a bare provider).
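Whether a blackholed host refuses or times out, the chain treats it the same way because of how the requests exception classes relate to each other. The quick check below uses only classes that exist in requests.exceptions; it assumes your providers raise requests exceptions, as the RETRYABLE tuple in llm/fallback.py implies.

```python
import requests

exc = requests.exceptions

# A connect-phase timeout is both a ConnectionError and a Timeout,
# so either entry in RETRYABLE would catch it.
print(issubclass(exc.ConnectTimeout, exc.ConnectionError))
print(issubclass(exc.ConnectTimeout, exc.Timeout))

# A read timeout is a Timeout but not a ConnectionError.
print(issubclass(exc.ReadTimeout, exc.Timeout))
print(issubclass(exc.ReadTimeout, exc.ConnectionError))

# An HTTP status error is neither, which is why it needs its own except branch.
print(issubclass(exc.HTTPError, (exc.ConnectionError, exc.Timeout)))
```

Since the warning logs type(e).__name__, a slow blackhole is likely to appear as ConnectTimeout in the transcript while an immediately refused port appears as ConnectionError; both cascade.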
7. Capture the demonstration deliverable
Produce the submission artifact — either a screen recording or the captured transcript plus a short written plan. A recording is best; if you record, narrate these beats:
- Show config.toml with a reachable primary; run the agent; show SUMMARY.md + commit + served_by: <primary> in the trace.
- Edit config.toml to blackhole the primary (base_url to the unreachable host). Show the one-line diff.
- Re-run; show the falling over warning, the [model:ollama] lines, the completed DONE:, and served_by: ollama in the trace.
- State the cost: $0, because the survivor is local Ollama.
If you don’t record, submit failover_demo.txt plus a DEMO.md written plan describing those four beats and pointing at the trace lines that prove each.
Checkpoint: you have either failover_demo.mov/.gif or failover_demo.txt + DEMO.md, and a trace.jsonl whose served_by values show the failover from primary to ollama.
If not: if the transcript doesn’t show all four beats, re-run with 2>&1 | tee failover_demo.txt so both the falling over warning (stderr) and the output are captured. If the trace only shows one served_by value, you may have reused an old trace.jsonl — rm -f trace.jsonl before the clean run.
Definition of Done
- agent.py is the refactored loop and names no provider and contains no if provider ==/isinstance ladder (proven by the grep in step 3).
- The agent depends only on LLMClient; providers arrive via the config-built FallbackClient, tools via the @tool registry, the system prompt via versioned-prompt config.
- At least three providers (one Ollama) sit behind the interface; swapping the primary model is a config-only edit (demonstrated in step 5).
- llm/fallback.py implements an ordered chain that cascades on connection/timeout/429 and surfaces other errors; tests/test_fallback.py proves both behaviors.
- A happy-path run completed: sandbox/SUMMARY.md exists with a per-file summary and a new git commit, on Ollama for $0.
- The failover demonstration exists: with the primary blackholed, the agent fell over to local Ollama and finished the same task; the trace’s served_by shows the switch; cost was $0.
- The trace records served_by per call and the prompt_version of the run.
- Submit: the llm/ package, agent.py, tools.py, config.py, config.toml, build.py, prompts/ (≥2 versions), the test suite, and the failover demo (recording or failover_demo.txt + DEMO.md).
Self-verify:
uv run pytest -q && echo "all tests OK"
! grep -Eq 'Ollama|Anthropic|OpenAI|if .*provider ==|isinstance' agent.py && echo "loop provider-agnostic OK"
grep -q "falling over" failover_demo.txt && grep -q '"served_by": "ollama"' trace.jsonl && echo "failover proven OK"
test -f sandbox/SUMMARY.md && git -C sandbox log --oneline | head -1 && echo "task completed OK"
Self-explain: in one sentence, why can the agent finish the task even when its primary provider is blackholed mid-run — and why does it cost $0? (Hint: what is the chain’s last link, and what does the loop actually depend on?)
Stretch Goals
- Fallback policy as a Strategy. Make the chain’s retry behavior a config-selected strategy (e.g., "immediate" vs. "one-retry-then-fall-over"), wiring it through Lab 2’s strategy registry, so the resilience policy itself becomes pluggable.
- Per-call cost on the trace. Compute dollars from tokens_in/out and per-million prices (Month 6 math), record it per call, and print a running total, then show the failover run costing $0 because Ollama served it.
- A real remote secondary. Add OpenRouter’s free endpoint as the middle of the chain (primary blackholed -> openrouter free -> ollama) and show a three-tier cascade.
- Add a tool live. During the demo, drop a new @tool (e.g., count_lines) into tools.py and re-run, proving the tool registry is open to extension while the agent loop stays unchanged.
- Replay the trace. Write replay.py that reads trace.jsonl and prints a human transcript including which provider served each step, making the failover legible from the trace alone.
- Health-check skip. Before a run, ping each provider’s base URL with a short timeout and reorder/skip dead ones up front, so the chain doesn’t pay the connection timeout on every call to a known-down primary.
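For the per-call cost stretch goal, the arithmetic is tokens divided by one million, times the per-million price, summed over input and output. A minimal sketch follows; the cost_usd helper, the "paid-example" provider, and its prices are placeholders invented for illustration, not real rates, so substitute your provider's published prices.

```python
def cost_usd(tokens_in: int, tokens_out: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call given per-million-token prices."""
    return (tokens_in / 1_000_000) * price_in_per_m + (tokens_out / 1_000_000) * price_out_per_m


# Placeholder prices (USD per million tokens: input, output). Not real rates.
PRICES = {
    "paid-example": (1.00, 2.00),
    "ollama": (0.0, 0.0),  # local model: no per-token charge
}

# Hypothetical calls: (served_by, tokens_in, tokens_out)
calls = [("paid-example", 1200, 300), ("ollama", 1500, 400)]

total = 0.0
for provider, t_in, t_out in calls:
    p_in, p_out = PRICES[provider]
    c = cost_usd(t_in, t_out, p_in, p_out)
    total += c
    print(f"{provider}: ${c:.6f}")
print(f"total: ${total:.6f}")
```

Keying the price table by served_by is one way to make a failover run show up as $0 automatically, since every call served by Ollama contributes nothing to the total.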
Troubleshooting
- Blackhole hangs instead of failing fast. http://10.255.255.1:1 should refuse/timeout quickly, but on some networks it stalls. Lower the provider’s timeout (e.g., 5s) for the demo, or use a port that refuses immediately like http://localhost:1. Timeout is in RETRYABLE, so a timeout also cascades.
- Fatal 400 cascades to Ollama (wrong). Your error classification let a non-429 HTTPError through to the next provider. Re-check complete: only RETRYABLE and 429 continue; every other HTTPError re-raises. test_fatal_error_surfaces_not_cascades should catch this.
- served_by is always the chain name. The loop reads client.last_served_by (set by FallbackClient after a success). Confirm you build a FallbackClient, not a bare provider, and that complete sets self.last_served_by before returning.
- Model answers in prose, never calls tools. Use qwen2.5:7b as the survivor; confirm tools.SCHEMAS is non-empty (import tools; print(len(tools.SCHEMAS))) and passed to complete. Small models are the usual culprit.
- json.loads(arguments) fails. The model emitted malformed tool arguments. The loop catches the exception and feeds the error back to the model as a tool result, so it can retry; leave that try/except in place.
- Commit step fails. sandbox must be a git repo (re-run the Setup git init), and git must be in ALLOWED_SHELL. Check git -C sandbox status.
- Trace missing served_by/prompt_version. You ran an older agent.py. Ensure _trace("start", ..., prompt_version=...) and the per-event served_by=served_by are present, and that you rm -f trace.jsonl before a clean run.
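For the served_by item in the list above, it can help to see how the loop's getattr expression resolves for different clients. This is a minimal sketch with stand-in classes: BareProvider and Chain are hypothetical and only mimic the attributes the loop reads, not the real providers or FallbackClient.

```python
class BareProvider:
    """Stand-in for a single provider client: it has a name but no last_served_by."""
    name = "ollama"


class Chain:
    """Stand-in for a fallback chain that remembers which provider answered."""
    name = "fallback-chain"

    def __init__(self):
        self.last_served_by = None

    def complete(self):
        self.last_served_by = "ollama"  # pretend the local survivor answered
        return "ok"


def served_by(client):
    # Same expression the agent loop uses to pick the value written to the trace.
    return getattr(client, "last_served_by", client.name)


chain = Chain()
chain.complete()
print(served_by(chain))            # the provider recorded by the chain
print(chain.name)                  # the chain's own label, not a provider
print(served_by(BareProvider()))   # no last_served_by, so it falls back to name
```

If your trace shows the chain's label, the loop is most likely reading client.name directly rather than last_served_by.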