Lab 2 — Resilience: Timeouts, Hand-Rolled Retry, and Rate Limits
Time: ~3.5 hrs · Difficulty: Core · Builds on: Lab 1
Objective
Turn a request that works when the network is perfect into one that survives the network as it actually is: slow servers, dropped connections, transient 5xx errors, and an API telling you to slow down. You will set timeouts on every call, catch the two distinct families of failure, and — the centerpiece — hand-write a get_with_retry helper that does exponential backoff with jitter, caps its waits and attempts, retries only the things worth retrying, and honors rate-limit headers. Then you will paginate through a real multi-page GitHub response and assemble it into one list. The get_with_retry and get_all_pages functions you write here are the exact resilience layer GitHub Pulse uses in Lab 3 — and the exact pattern you will reimplement inside agent loops later in the course.
Setup
cd ~/agentic/month-04
uv init resilience --package --python 3.12
cd resilience
uv add requests python-dotenv
git init
echo ".env" >> .gitignore
echo ".venv/" >> .gitignore
cp ~/agentic/month-04/api-basics/.env .env # reuse your Lab 1 token
Confirm the token carried over: cat .env should show your GITHUB_TOKEN= line, and git status should not list .env.
Background
Recall first (from memory): From Lab 1, which keyword argument must every requests call carry, and what happens without it? Does requests raise an exception when the server returns a 503? Answer before reading on.
A network call fails in two unrelated ways (Core Concepts §5). A transport failure means you got no response at all — DNS died, the connection was refused, or your timeout fired — and requests raises Timeout or ConnectionError. An HTTP failure means you got a response carrying a 4xx/5xx code, and requests raises nothing; you must inspect .status_code. Resilience is deciding, for each kind, whether to give up, surface the error, or wait and retry.
The retry policy has three parts you will build by hand (Core Concepts §6): wait longer after each failure (exponential backoff, 2 ** attempt), never wait or loop forever (caps), and stagger retries with a little randomness so many clients do not stampede in lockstep (jitter). And only retry things that can actually recover: a timeout or a 429/5xx on a GET, yes; a 401, a 404, or a non-idempotent POST, no. Finally, when an API tells you how long to wait, the polite move is to wait that long rather than guess: Retry-After typically gives a delay in seconds, while X-RateLimit-Reset gives the Unix timestamp at which the budget refills (X-RateLimit-Remaining tells you how much of the budget is left).
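The retry decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the lab's get_with_retry: the names should_retry and IDEMPOTENT_METHODS are hypothetical, and representing a transport failure as a status of None is a convention chosen only for this sketch.

```python
RETRYABLE_STATUS = {429, 500, 502, 503, 504}
# Assumption for this sketch: only these methods are safe to repeat blindly.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def should_retry(method, status):
    """Return True if a failed call is worth retrying.

    status is the HTTP status code, or None for a transport failure
    (timeout or connection error), where no response arrived at all.
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # e.g. POST: a repeat could create a duplicate
    if status is None:
        return True  # transport failure: the next attempt may get through
    return status in RETRYABLE_STATUS


print(should_retry("GET", None))   # timeout on a GET
print(should_retry("GET", 503))    # transient server error
print(should_retry("GET", 404))    # will not recover by waiting
print(should_retry("POST", 503))   # non-idempotent method
```

Keeping the decision separate from the sleeping makes it easy to check each case on its own before any network call is involved.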
Steps
1. See an uncaught timeout fail
Create src/resilience/timeout_demo.py. httpbin has a /delay/N endpoint that waits N seconds before responding:
import requests


def main():
    # The server waits 5s; we only allow 2s. This WILL time out.
    resp = requests.get("https://httpbin.org/delay/5", timeout=2)
    print("got:", resp.status_code)


if __name__ == "__main__":
    main()
uv run python src/resilience/timeout_demo.py
Checkpoint: the program crashes with requests.exceptions.ReadTimeout. This is good — a timeout converted “hang for 5 seconds (or forever, with no timeout)” into a clean, catchable exception. Imagine the server hanging indefinitely with no timeout=: your program would freeze. That is the bug a timeout prevents.
If not: if it printed got: 200, the server responded faster than 5s today — lower timeout to 1 or use /delay/10. An SSLError means httpbin is flaky; substitute https://httpstat.us/200?sleep=5000.
2. Catch both failure families
Update timeout_demo.py to distinguish transport failure from HTTP failure:
import sys

import requests


def fetch(url, timeout=2):
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"transport: timed out after {timeout}s", file=sys.stderr)
        return None
    except requests.exceptions.ConnectionError:
        print("transport: could not connect", file=sys.stderr)
        return None
    if resp.status_code >= 400:
        print(f"http: server returned {resp.status_code}", file=sys.stderr)
    return resp


def main():
    fetch("https://httpbin.org/delay/5")     # times out
    fetch("https://httpbin.org/status/503")  # HTTP failure, no exception
    r = fetch("https://httpbin.org/status/200")
    print("ok:", r.status_code if r else "no response")


if __name__ == "__main__":
    main()
uv run python src/resilience/timeout_demo.py
Checkpoint: you see a transport message for the delay, an http: server returned 503 for the status endpoint (note: no exception — requests returned a Response), and ok: 200 at the end. You can now tell “no response” apart from “a response that is an error.”
If not: if the 503 line raised instead of printing, you accidentally called .raise_for_status() — the point here is that a bad status is a value, not an exception. If everything timed out, httpbin is slow; raise the timeout argument.
Steps 3–5 teach the new skill of the lab — a retry loop with backoff and jitter — as a gradual release: you first study and run the complete loop (worked), then fill in the backoff math yourself (faded), then design a retry decision for a new case (independent).
Stage 1 — Worked example (I do)
3. Hand-roll get_with_retry
This is the heart of the lab. The control flow has a shape worth seeing before you read the code — it is a loop that exits three different ways:
flowchart TD
A["Start attempt"] --> B["requests.get with timeout"]
B --> C{"Transport error?"}
C -->|"Yes"| D{"Last attempt?"}
D -->|"Yes"| E["Raise the error"]
D -->|"No"| F["Sleep: backoff + jitter"]
C -->|"No"| G{"Status retryable?"}
G -->|"No"| H["Return response"]
G -->|"Yes, and tries left"| F
F --> A
Notice: the only paths out are a usable response, a raised transport error, or running out of attempts — it can never loop forever, because the attempt count caps it.
Study and run the complete module below; do not change it yet. Create src/resilience/http.py — a module you will import in Lab 3:
import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def get_with_retry(url, *, headers=None, params=None,
                   max_attempts=5, timeout=10, max_wait=60):
    """GET with exponential backoff + jitter on transient failures.

    Retries on timeouts, connection errors, and retryable status codes.
    Does NOT retry 4xx like 401/404; those won't recover.
    """
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, headers=headers, params=params, timeout=timeout)
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as exc:
            if attempt == max_attempts - 1:
                raise
            wait = _backoff(attempt, max_wait)
            print(f"[retry] transport error ({exc.__class__.__name__}); "
                  f"sleeping {wait:.1f}s (attempt {attempt + 1}/{max_attempts})")
            time.sleep(wait)
            continue

        if resp.status_code not in RETRYABLE_STATUS:
            return resp  # success OR a non-retryable error
        if attempt == max_attempts - 1:
            return resp  # out of tries; let caller inspect it

        wait = _retry_after(resp) or _backoff(attempt, max_wait)
        print(f"[retry] HTTP {resp.status_code}; sleeping {wait:.1f}s "
              f"(attempt {attempt + 1}/{max_attempts})")
        time.sleep(wait)
    return resp


def _backoff(attempt, max_wait):
    """Exponential backoff capped at max_wait, plus up to 1s of additive jitter."""
    return min(2 ** attempt, max_wait) + random.uniform(0, 1)


def _retry_after(resp):
    """Honor an explicit Retry-After header (seconds) if present."""
    value = resp.headers.get("Retry-After")
    if value and value.isdigit():
        return float(value)
    return None
Checkpoint: the module imports without error: uv run python -c "from resilience.http import get_with_retry; print('ok')" prints ok. Read the function and be able to point to each of the three ideas: backoff (2 ** attempt), cap (min(..., max_wait) and max_attempts), jitter (random.uniform).
If not: ModuleNotFoundError: resilience means you ran from the wrong directory — run from the project root with uv run (see Troubleshooting). A SyntaxError means a line was dropped while copying; re-paste the whole module.
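The _retry_after helper above only understands the delay-in-seconds form of Retry-After. The HTTP specification also allows the header to carry an HTTP-date, so a more tolerant parser could look like the sketch below. The name retry_after_seconds and its now parameter are hypothetical additions for illustration; the lab's module does not need them to pass its checkpoints.

```python
import datetime as dt
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to seconds, or None if unusable.

    Handles both forms the HTTP spec allows: delay-seconds ("120") and an
    HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=dt.timezone.utc)  # treat naive dates as UTC
    now = now or dt.datetime.now(dt.timezone.utc)
    return max(0.0, (when - now).total_seconds())  # a past date means no wait


fixed_now = dt.datetime(2015, 10, 21, 7, 27, 0, tzinfo=dt.timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(retry_after_seconds("soon"))
```

Passing now explicitly keeps the function testable without depending on the wall clock.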
Stage 2 — Faded practice (we do)
Before watching the loop run, prove you own the backoff math. The waits grow like this:
flowchart LR
A["Attempt 0: ~1s"] --> B["Attempt 1: ~2s"]
B --> C["Attempt 2: ~4s"]
C --> D["Attempt 3: ~8s"]
D --> E["Capped at max_wait"]
Notice: each wait roughly doubles (2 ** attempt) until the cap flattens it — plus a jitter fraction so clients do not retry in lockstep.
Re-implement _backoff from scratch in a scratch file, given only its contract: return 2 ** attempt seconds, never more than max_wait, plus a random fraction under one second. Fill the TODO:
import random


def _backoff(attempt, max_wait):
    # TODO: exponential growth, capped at max_wait, plus 0..1s of random jitter
    ...


print([round(_backoff(i, 8), 1) for i in range(5)])
Expected: five values that climb ~1, ~2, ~4, ~8, ~8 (the last two flattened by the max_wait=8 cap), each with a different fractional jitter.
Solution
```python
def _backoff(attempt, max_wait):
    return min(2 ** attempt, max_wait) + random.uniform(0, 1)
```
Checkpoint: your printed list climbs toward the cap and the fractions differ between runs.
If not: if values keep growing past max_wait, you forgot min(...); if every fraction is identical, you computed random.uniform once instead of inside the function.
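To see why the jitter term matters, the following sketch simulates a fleet of clients that all failed at the same instant and counts how many retries land in the busiest 100 ms window. It is a toy model under stated assumptions: retry_times and busiest_100ms are hypothetical helper names, the fleet size of 1000 is arbitrary, and the random generator is seeded only so the run is repeatable.

```python
import random


def retry_times(clients, attempt, jitter, rng):
    """Return the moment (seconds after the failure) each client retries."""
    base = 2 ** attempt
    return [base + (rng.uniform(0, 1) if jitter else 0.0) for _ in range(clients)]


def busiest_100ms(times):
    """Largest number of retries landing in any single 100 ms window."""
    buckets = {}
    for t in times:
        key = int(t * 10)  # index of the 100 ms window containing t
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())


rng = random.Random(42)  # seeded so the run is repeatable
lockstep = retry_times(1000, attempt=0, jitter=False, rng=rng)
spread = retry_times(1000, attempt=0, jitter=True, rng=rng)
print("no jitter, busiest 100 ms window:", busiest_100ms(lockstep))
print("with jitter, busiest 100 ms window:", busiest_100ms(spread))
```

Without jitter every client retries at the same moment, so the server sees the whole fleet at once; with jitter the same load is spread over a full second.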
4. Watch the retries happen (the worked loop in motion)
httpbin.org/status/503 always returns a 503, so it exhausts every retry — perfect for watching backoff. Create src/resilience/retry_demo.py:
import time

from resilience.http import get_with_retry


def main():
    start = time.time()
    resp = get_with_retry("https://httpbin.org/status/503", max_attempts=4)
    elapsed = time.time() - start
    print(f"final status: {resp.status_code} after {elapsed:.1f}s")


if __name__ == "__main__":
    main()
uv run python src/resilience/retry_demo.py
Checkpoint: you see three [retry] HTTP 503; sleeping ...s lines with increasing sleeps (~1s, ~2s, ~4s, each plus a fraction of jitter), then final status: 503 after roughly 8 to 10s: 7s of base backoff, up to 3s of jitter, and four round-trips to the server. The fourth attempt is not retried because it is the last; the function returns the 503 for the caller to handle. Run it twice: the jitter fractions differ each time. That randomness is the whole point.
If not: if it returned instantly with no [retry] lines, httpbin did not return 503; try https://httpstat.us/503. If it hung far longer than ~10s, you raised max_attempts or removed the max_wait cap.
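The elapsed time in this step follows directly from the backoff arithmetic, which can be checked without touching the network. The sketch below sums the un-jittered sleeps for a run in which every attempt fails; base_wait_total is a hypothetical helper name, and the upper bound assumes the lab's jitter of at most one second per sleep.

```python
def base_wait_total(max_attempts, max_wait):
    """Sum of the un-jittered sleeps when every attempt fails.

    The last attempt is never followed by a sleep, so there are
    max_attempts - 1 sleeps in total.
    """
    return sum(min(2 ** attempt, max_wait) for attempt in range(max_attempts - 1))


for attempts in (4, 6, 10):
    low = base_wait_total(attempts, max_wait=60)
    high = low + (attempts - 1)  # each sleep adds at most 1 s of jitter
    print(f"max_attempts={attempts}: between {low}s and {high}s of sleeping")
```

For max_attempts=4 the sleeps are 1 + 2 + 4 = 7 seconds before jitter, which is why the demo takes a little over seven seconds plus network time.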
Stage 3 — Independent (you do)
5. Prove non-retryable errors fail fast
First predict, then verify. Without running anything, decide: for a 404, how many times will get_with_retry call the network, and roughly how long will it take? Write your answer down. Then check it:
import time
from resilience.http import get_with_retry
start = time.time()
resp = get_with_retry("https://httpbin.org/status/404", max_attempts=4)
print(f"404 returned after {time.time() - start:.1f}s") # near-instant
Checkpoint: the 404 returns in well under a second, with no retries and no sleeps. A 404 will never become a 200 by waiting, so retrying it would just waste time and annoy the server. Compare that to the 7-plus seconds the 503 took. If your prediction matched, you understand the retry decision; if not, re-read which statuses are in RETRYABLE_STATUS.
If not: if the 404 did sleep and retry, you added 404 to RETRYABLE_STATUS by mistake, or your status check is inverted — only {429, 500, 502, 503, 504} should retry.
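The same prediction can be checked offline by replaying a scripted sequence of status codes through a loop with the same shape as the helper. This is a simplified stand-in, not the lab's module: run_with_retry and scripted are hypothetical names, the fetch and sleep functions are injected so nothing touches the network or the clock, and jitter is omitted for brevity.

```python
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def run_with_retry(fetch, max_attempts=4, sleep=lambda seconds: None):
    """Same retry shape as the lab's helper, with the I/O injected.

    fetch() returns a status code; sleep() stands in for time.sleep so the
    simulation finishes instantly. Returns (final_status, calls_made).
    """
    status = None
    for attempt in range(max_attempts):
        status = fetch()
        if status not in RETRYABLE_STATUS:
            return status, attempt + 1  # success or non-retryable error
        if attempt < max_attempts - 1:
            sleep(2 ** attempt)
    return status, max_attempts


def scripted(statuses):
    """Build a fake fetch that replays the given status codes in order."""
    it = iter(statuses)
    return lambda: next(it)


print(run_with_retry(scripted([404])))                 # one call, no retry
print(run_with_retry(scripted([503, 503, 200])))       # recovers on call 3
print(run_with_retry(scripted([503, 503, 503, 503])))  # gives up after 4
```

Injecting the sleep function is also a convenient way to unit-test a retry helper without waiting through real backoff delays.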
6. Read GitHub’s rate-limit headers
Create src/resilience/ratelimit.py:
import datetime as dt
import os

from dotenv import load_dotenv

from resilience.http import get_with_retry


def main():
    load_dotenv()
    token = os.environ["GITHUB_TOKEN"]
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "User-Agent": "resilience-lab",
    }
    resp = get_with_retry("https://api.github.com/rate_limit", headers=headers)
    core = resp.json()["resources"]["core"]
    reset = dt.datetime.fromtimestamp(core["reset"])
    print(f"limit={core['limit']} remaining={core['remaining']} resets at {reset:%H:%M:%S}")


if __name__ == "__main__":
    main()
uv run python src/resilience/ratelimit.py
Checkpoint: you see limit=5000 remaining=49xx resets at HH:MM:SS. This is the budget your tool must respect. When remaining hits 0, GitHub returns 403/429; the right response is to sleep until the reset time — which is exactly what a production agent does so it never gets throttled.
If not: limit=60 instead of 5000 means the token did not load (you are anonymous) — confirm .env copied over and load_dotenv() runs. A KeyError: 'GITHUB_TOKEN' means the cp from Setup failed.
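Sleeping until the reset time comes down to one subtraction plus a cap. The sketch below computes the wait from the two rate-limit headers; it is a starting point under stated assumptions rather than a finished feature. The name seconds_until_reset and the max_sleep default of 900 seconds are hypothetical, and the function takes a plain dict, so header names must match exactly (the headers object on a requests response is case-insensitive, a plain dict is not).

```python
import time


def seconds_until_reset(headers, now=None, max_sleep=900):
    """Return how long to sleep when the rate-limit budget is exhausted.

    Returns 0.0 if requests remain or the headers are missing. The result is
    capped at max_sleep (a hypothetical default) so a far-future reset
    cannot stall the program.
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")  # Unix epoch seconds, as text
    if remaining is None or reset is None or int(remaining) > 0:
        return 0.0
    now = time.time() if now is None else now
    return min(max(0.0, int(reset) - now), max_sleep)


exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000))
print(seconds_until_reset({"X-RateLimit-Remaining": "42",
                           "X-RateLimit-Reset": "1700000060"}, now=1700000000))
```

Wiring this into the retry loop is left to the first stretch goal.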
7. Paginate a real multi-page response
Pagination is a loop driven by the server’s next link, with a page cap as a safety belt:
flowchart TD
A["Start: page 0, results empty"] --> B["GET this page (with retry)"]
B --> C["Append batch to results"]
C --> D{"Has a next link?"}
D -->|"No"| E["Return results"]
D -->|"Yes"| F{"Hit max_pages cap?"}
F -->|"Yes"| E
F -->|"No"| G["Follow next URL"]
G --> B
Notice: the loop ends on either no-next-link OR the page cap — the cap is what stops a buggy or hostile API from looping forever.
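The pagination checkpoint needs an account whose repository list spans more than one page, that is, more than 100 public repositories at per_page=100. If the account in the URL below has fewer, substitute one that has more. Create src/resilience/paginate.py: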
import os

from dotenv import load_dotenv

from resilience.http import get_with_retry


def get_all_pages(url, *, headers=None, params=None, max_pages=10):
    params = dict(params or {}, per_page=100)
    results = []
    next_url, next_params = url, params
    for page in range(max_pages):
        resp = get_with_retry(next_url, headers=headers, params=next_params)
        resp.raise_for_status()
        batch = resp.json()
        results.extend(batch)
        nxt = resp.links.get("next")  # requests parses the Link header
        if not nxt:
            break
        next_url, next_params = nxt["url"], None  # next URL already has the params
    return results


def main():
    load_dotenv()
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
        "User-Agent": "resilience-lab",
    }
    repos = get_all_pages(
        "https://api.github.com/users/torvalds/repos",
        headers=headers, max_pages=5,
    )
    print(f"assembled {len(repos)} repos across pages")
    langs = {r["language"] for r in repos if r["language"]}
    print("languages seen:", ", ".join(sorted(langs)))


if __name__ == "__main__":
    main()
uv run python src/resilience/paginate.py
Checkpoint: you see assembled N repos across pages where N is larger than 100 (proving more than one page was fetched and stitched together) and a list of languages. Temporarily lower per_page to 5 inside get_all_pages and raise max_pages enough to cover the whole list (with max_pages=5 the cap would stop you at 25 items): N stays the same while the loop makes more round-trips. The next link, not a page guess, is what drives the loop.
If not: if N is 100 or fewer, the account has only one page of results at per_page=100; pick an account with well over 100 public repositories. An empty resp.links is normal for a single page; the .get("next") guard handles it.
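For a sense of what resp.links is doing, here is a simplified parser for a Link header. It is a sketch, not the parser requests actually uses: parse_link_header is a hypothetical name, the example URLs are made up for illustration, and the regular expression assumes rel is the only parameter on each entry. Unlike resp.links, which maps each rel to a dict containing a url key, this version maps each rel straight to its URL.

```python
import re

# Matches one entry of the form: <https://...>; rel="next"
_LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(value):
    """Map each rel name in a Link header to its URL.

    Simplified: assumes rel is the only parameter on each entry.
    """
    return {rel: url for url, rel in _LINK_ENTRY.findall(value or "")}


# Illustrative header value; the URLs are hypothetical.
header = (
    '<https://api.github.com/user/1/repos?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/user/1/repos?per_page=100&page=5>; rel="last"'
)
links = parse_link_header(header)
print(links.get("next"))
print(parse_link_header("").get("next"))  # no header means no next page
```

The next URL already carries per_page and page, which is why get_all_pages passes params=None after the first request.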
8. Confirm the page cap protects you
In paginate.py, set max_pages=1 and re-run.
Checkpoint: N is now ~100 (one page only) and the loop stopped even though a next link existed. The cap is your guarantee that no API — buggy or hostile — can make your tool loop forever. Restore max_pages=5 afterward.
If not: if N still exceeds 100 with max_pages=1, the cap is not wired into the for page in range(max_pages) line; confirm the loop bound is max_pages, not a hardcoded number.
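The value of the cap is easiest to see against a server that never stops handing out next links. The sketch below runs the same follow-the-next-link loop against a fake, misbehaving API; collect_pages, endless_api, and the fake:// URLs are all hypothetical and exist only for this simulation.

```python
def collect_pages(fetch_page, start, max_pages=10):
    """Follow next links until they run out or max_pages is reached.

    fetch_page(url) returns (items, next_url_or_None).
    Returns (all_items, pages_fetched).
    """
    results, url, pages = [], start, 0
    while url is not None and pages < max_pages:
        items, url = fetch_page(url)
        results.extend(items)
        pages += 1
    return results, pages


def endless_api(url):
    """A misbehaving fake server: every page points to another page."""
    page = int(url.rsplit("=", 1)[1])
    return [f"item-{page}"], f"fake://items?page={page + 1}"


items, pages = collect_pages(endless_api, "fake://items?page=1", max_pages=3)
print(pages, items)
```

Without the pages < max_pages condition this loop would never terminate against endless_api; with it, the worst case is bounded no matter what the server returns.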
Definition of Done
- timeout_demo.py catches both Timeout and an HTTP-error status without crashing, and distinguishes them in its output.
- src/resilience/http.py defines get_with_retry with: exponential backoff (2 ** attempt), a max_wait cap, jitter, a max_attempts cap, retries restricted to {429, 500, 502, 503, 504} + transport errors, and Retry-After support.
- retry_demo.py shows increasing, jittered sleeps for a 503 and instant return for a 404.
- ratelimit.py prints GitHub’s limit / remaining / reset time.
- paginate.py assembles >100 results across pages via resp.links["next"] and respects max_pages.
- Self-verify the retry logic in one line:
uv run python -c "from resilience.http import get_with_retry, _backoff; print('backoffs:', [round(_backoff(i, 60),1) for i in range(5)])"
You should see five increasing values, each a power of two (1,2,4,8,16) plus a jitter fraction under 1.
Self-explain: in one sentence, why does adding random jitter to the backoff wait make a fleet of clients more reliable, not less?
Stretch Goals
- Sleep until reset. Extend get_with_retry so that when it sees 403/429 with X-RateLimit-Remaining: 0, it sleeps until X-RateLimit-Reset (computed from the header) instead of using plain backoff. Cap the sleep so a far-future reset does not hang you.
- Full jitter vs. equal jitter. Read the AWS backoff article and implement “equal jitter” (half = 2**attempt/2; wait = half + random.uniform(0, half)). Compare the spread of waits.
- A Session with retries. Wrap the auth headers in a requests.Session() and have get_with_retry accept a session, so connection pooling speeds up many calls.
- Make it generic. Add a method parameter so the helper can also do idempotent PUT/DELETE, while refusing to retry POST by default. Explain in a comment why.
Troubleshooting
- ModuleNotFoundError: resilience. The --package layout makes resilience importable only inside the project env. Run scripts with uv run python src/resilience/<file>.py from the project root, or uv run python -m resilience.<module>.
- Retries never trigger. httpbin.org/status/503 must return 503; if httpbin is down, substitute https://httpstat.us/503 or point at a local python -m http.server and request a missing path (a 404; note that one won’t retry, by design).
- The retry loop hangs much longer than expected. You likely removed the max_wait cap or set max_attempts very high; high attempts with exponential backoff add up fast (1+2+4+8+16+… seconds). Keep attempts small (≤6).
- KeyError: 'GITHUB_TOKEN'. .env did not copy over. Re-run the cp from Setup, or recreate the file; confirm it is git-ignored.
- Pagination returns exactly per_page items and stops. The endpoint had only one page, or you read resp.links after reassigning resp. Pick an active user (e.g., torvalds, octocat) and keep per_page low to force multiple pages while testing.
- AttributeError: 'Response' object has no attribute 'links' is impossible, but resp.links being empty {} is normal when there is no Link header (single-page result). Guard with .get("next") as shown.