Lab 2 — pytest Deep Dive and Strict Types
Time: ~3.5 hrs · Difficulty: Core · Builds on: Lab 1 (Protocols, injection, pure functions)
Objective
Build a small, fully tested, fully typed library and learn the testing tools you will lean on for the rest of the course: pytest discovery and assert, fixtures for shared setup, parametrize for table-driven tests, and mocking for the rare case injection can’t cover — plus pytest-cov to measure how much of your code your tests actually run. Alongside, you’ll add type hints throughout and get a clean bill of health from mypy --strict. The domain is deliberately tiny (a “scoreboard” that ranks players) so all your attention goes to how to test and type, not to the problem itself. You will end the lab having written several tests before the code they test — the habit the month is trying to install.
Setup
cd ~/agentic/month-05
uv init scoreboard --package --python 3.12
cd scoreboard
uv add --dev pytest pytest-cov mypy
mkdir -p tests
Confirm the test runner is wired up:
uv run pytest --version
Checkpoint: You see a pytest 8.x version line. pytest discovers files named test_*.py; we keep them in tests/.
If not: command not found or no version means the dev dependency didn’t install; re-run uv add --dev pytest pytest-cov mypy from inside the scoreboard/ directory, and invoke it as uv run pytest, not bare pytest.
Background
Recall first (from memory): (1) From Lab 1, what is a pure function, and why is it trivial to test? (2) From Lab 1, what did injecting a CountingNotifier let you verify without any I/O? Answer before reading on — this lab turns those one-off REPL checks into an automated suite.
This lab’s new skill is the red → green → refactor cycle: write a failing test that specifies behavior (red), write the simplest code that passes (green), then improve the design while the test holds you safe (refactor). Steps 1–3 walk that loop explicitly:
flowchart LR
Red["Red: write a failing test"] --> Green["Green: simplest code that passes"]
Green --> Refactor["Refactor: improve, stay green"]
Refactor --> Red
Notice: the cycle never lets you write untested code — every change is bracketed by a test that was red before and green after.
A test is a small program that runs your code and asserts what should be true. The reason to write them is not virtue — it is speed and courage: a fast test suite lets you change code aggressively and find out in two seconds whether you broke anything. Read Core Concepts §8–§10 first. The throughline from Lab 1 continues here: because your logic is in pure functions and your collaborators are injected, most testing needs no mocking at all — you just call functions and pass in fakes. We cover mocking too, but treat it as the escape hatch it is.
Steps
1. Write the test first (red), then the code (green) — Stage 1: Worked example (I do)
Follow this loop exactly as written; it is the worked example for the whole skill. We want a pure function rank(players) that sorts players by score, highest first. Write the test before the function exists. Create src/scoreboard/core.py with just the data type:
from dataclasses import dataclass
@dataclass
class Player:
name: str
score: int
Now tests/test_core.py:
from scoreboard.core import Player, rank
def test_rank_orders_by_score_descending() -> None:
players = [Player("a", 10), Player("b", 30), Player("c", 20)]
ranked = rank(players)
assert [p.name for p in ranked] == ["b", "c", "a"]
Run uv run pytest.
Checkpoint: The test fails — ImportError: cannot import name 'rank'. That red is correct and intentional: you have a precise, executable specification of rank before writing a line of it. This is test-first development, and it is the habit the month-end “done” criterion is looking for.
If not: if pytest collects 0 tests, the file or function isn’t named test_* — check both. If you get a different error (e.g., No module named 'scoreboard'), confirm src/scoreboard/__init__.py exists and that you ran via uv run pytest from the project root (Troubleshooting).
2. Make it green with a pure function
Add rank to core.py:
def rank(players: list[Player]) -> list[Player]:
"""Pure: returns a new list sorted by score descending; does not mutate the input."""
return sorted(players, key=lambda p: p.score, reverse=True)
Run uv run pytest again.
Checkpoint: One test passes (1 passed). Confirm purity: add assert [p.name for p in players] == ["a", "b", "c"] to the end of the test to prove the original list is untouched (sorted returns a new list; .sort() would have mutated it). Re-run — still green. You have now lived one full red → green loop.
If not: still red means the import path or function name doesn’t match the test’s from scoreboard.core import Player, rank. If the purity assertion fails, you used .sort() (mutates in place) instead of sorted() (returns a new list).
3. Table-driven tests with parametrize — Stage 2: Faded practice (we do)
Now you drive the loop with less scaffolding. Tie-breaking is a behavior worth pinning down. The test below is written for you (it is the red); your job is the green and the refactor — make rank satisfy it. Add this test to test_core.py, run it, watch it fail, then change rank yourself before peeking at the answer:
import pytest
@pytest.mark.parametrize(
"scores, expected_order",
[
([("a", 10), ("b", 10)], ["a", "b"]), # tie -> alphabetical
([("b", 10), ("a", 10)], ["a", "b"]), # tie -> alphabetical regardless of input order
([("a", 5), ("b", 10)], ["b", "a"]), # score wins over name
([], []), # empty is empty
],
)
def test_rank_tiebreak(scores: list[tuple[str, int]], expected_order: list[str]) -> None:
players = [Player(n, s) for n, s in scores]
assert [p.name for p in rank(players)] == expected_order
This forces a tweak to rank. The fix is a single key change — sort by score descending, then name ascending. Try writing the new key= yourself; the reference answer is:
def rank(players: list[Player]) -> list[Player]:
return sorted(players, key=lambda p: (-p.score, p.name))
Run uv run pytest -v.
Checkpoint: With -v you see four separate test cases listed under test_rank_tiebreak, all passing. One test function, four cases, no copy-paste — that is parametrize (§8). Notice the empty-list case is free insurance against a whole class of crash.
If not: if only the tie cases fail, your sort key isn’t breaking ties by name — use the tuple (-p.score, p.name) so equal scores fall through to alphabetical. If mypy later complains about the lambda, see Troubleshooting (extract a typed _key helper).
4. Fixtures for shared setup
When several tests need the same starting data, a fixture builds it once and injects it by name. Add to test_core.py:
@pytest.fixture
def sample_players() -> list[Player]:
return [Player("ada", 42), Player("bo", 42), Player("cy", 7)]
def test_top_scorer_uses_tiebreak(sample_players: list[Player]) -> None:
assert rank(sample_players)[0].name == "ada"
def test_ranking_preserves_count(sample_players: list[Player]) -> None:
assert len(rank(sample_players)) == len(sample_players)
Checkpoint: Both tests receive a fresh sample_players list (fixtures run per-test, so one test can’t pollute another). uv run pytest shows all tests passing. This is dependency injection applied to tests — the same idea as Lab 1’s injected notifier.
If not: fixture 'sample_players' not found means the parameter name doesn’t match the @pytest.fixture function name exactly, or the fixture isn’t in the same file (or a conftest.py). Make the names line up.
5. Mocking — the escape hatch, and why injection beats it — Stage 3: Independent (you do)
You have now seen the full red → green → refactor loop and parametrize. For this step, work with less hand-holding: read the two testing styles, then make both pass on your own. Sometimes code reaches out to something you can’t inject around — the clock, the filesystem, a hard-wired call. Add a function that depends on the wall clock to core.py:
import datetime
def is_tournament_open(now: datetime.datetime | None = None) -> bool:
current = now or datetime.datetime.now()
return current.weekday() < 5 # open on weekdays
There are two ways to test this. First, the injection way — pass the time in (preferred, no magic):
def test_tournament_open_on_weekday() -> None:
monday = datetime.datetime(2026, 5, 25, 9, 0) # a Monday
assert is_tournament_open(monday) is True
def test_tournament_closed_on_weekend() -> None:
saturday = datetime.datetime(2026, 5, 30, 9, 0)
assert is_tournament_open(saturday) is False
Second, the mocking way — for when you can’t change the signature, use monkeypatch to replace the collaborator:
def test_tournament_open_via_monkeypatch(monkeypatch: pytest.MonkeyPatch) -> None:
real_datetime = datetime.datetime # capture the real class BEFORE patching
class FakeDateTime:
@staticmethod
def now() -> datetime.datetime:
return real_datetime(2026, 5, 25, 9, 0) # Monday
monkeypatch.setattr(datetime, "datetime", FakeDateTime)
assert is_tournament_open() is True
Checkpoint: Both styles pass. Compare them: the injection tests are shorter, clearer, and don’t reach into module internals; the monkeypatch test works but is fiddly and brittle. The lesson of §8 in your hands: design so you can inject, and you’ll rarely reach for mocking. Mocking is for code you don’t control or can’t yet refactor.
If not: if the weekday/weekend tests disagree, double-check the dates — 2026-05-25 is a Monday (weekday() == 0) and 2026-05-30 a Saturday (weekday() == 5). If the monkeypatch test leaks into others, you patched a global by hand instead of via monkeypatch.setattr (Troubleshooting).
6. Measure coverage
Run the suite with coverage:
uv run pytest --cov=scoreboard --cov-report=term-missing
Checkpoint: You get a table with a Cover percentage and a Missing column listing line numbers your tests never executed. Find an uncovered line, decide whether it represents real behavior worth a test (write one) or trivial glue (leave it). Push core.py to at least 80%. The Missing column is the most useful output here — it tells you exactly where your confidence has gaps.
If not: --cov flag unknown means pytest-cov isn’t installed (uv add --dev pytest-cov). If coverage looks oddly low, you scoped it to the whole package including untested modules — narrow with --cov=scoreboard.core or write tests for the gaps.
7. Lock in strict types
Make the whole package type-clean. Run:
uv run mypy --strict src tests
Fix anything it flags — common ones: a function missing a return annotation, a lambda whose types it can’t infer (fine here), or datetime.datetime | None not handled. Optionally pin the config so future runs are consistent; add to pyproject.toml:
[tool.mypy]
strict = true
…then you can just run uv run mypy src tests.
Checkpoint: mypy reports Success: no issues found. Now you have two independent green signals — pytest (behavior is correct) and mypy --strict (types are consistent). Treat both as part of “done” for everything you build from here on.
If not: most strict-mode errors here are a missing return annotation on a test (-> None) or an unhandled datetime.datetime | None. Fix the annotation rather than adding # type: ignore — the complaint is a real gap (README §9).
Definition of Done
src/scoreboard/core.pycontainsPlayer, a purerank, andis_tournament_open.tests/test_core.pycontains: aparametrized test (4+ cases), at least one fixture used by 2+ tests, and both an injection-style and a monkeypatch-style test.- At least one test was written before its implementation existed (you saw it go red, then green).
uv run pytestis fully green.uv run pytest --cov=scoreboardreports 80%+ coverage oncore.py.uv run mypy --strict src testsreports zero errors.
Self-verify in two commands:
uv run pytest --cov=scoreboard --cov-report=term-missing
uv run mypy --strict src tests
Self-explain: in one sentence, why does writing the failing test first produce better-designed code than writing the test after?
Stretch Goals
parametrizewithpytest.paramids. Give each case in Step 3 a readable id (pytest.param(..., id="tie-alphabetical")) so-voutput reads like a spec.- A fixture that yields and cleans up. Write a fixture using
yieldthat creates a temp file (via the built-intmp_pathfixture), and a test that writes a scoreboard to it and reads it back. Observe that teardown afteryieldruns even if the test fails. - Test a raised exception. Add
rankvalidation that raisesValueErroron a negative score, and test it withwith pytest.raises(ValueError):. - Coverage gate. Add
--cov-fail-under=80to apytestconfig so the suite fails if coverage drops below 80%. This is how CI enforces the bar in the capstone. - Try Pyright. Run
uvx pyright srcand compare its output tomypy --strict. Note one difference in what each flags.
Troubleshooting
pytestcollects 0 tests. Files must be namedtest_*.pyand functionstest_*. Check the names and that you’re running from the project root.ModuleNotFoundError: No module named 'scoreboard'in tests. The--packagesrc/layout usually resolves underuv run pytest. If it doesn’t, ensure there’s asrc/scoreboard/__init__.pyand that you run viauv run, not barepytest.--covflag unknown.pytest-covisn’t installed; runuv add --dev pytest-cov.monkeypatchchange leaks into other tests. It shouldn’t —monkeypatchauto-reverts after each test. If you see leakage, you patched something globally without it; prefermonkeypatch.setattrover manual reassignment.mypyflags thelambdakey insorted. It can usually infer it; if strict mode objects, extract a small typed helper functiondef _key(p: Player) -> tuple[int, str]:and pass that instead.- Coverage looks lower than expected.
--cov=scoreboardmeasures the package, including modules with no tests. Either test them or scope coverage to the module you care about.