Field testing

We dogfood pedestrian routing internally. The frozen unit is an origin–destination pair: one facade call across all three product profiles (Fast / Calm / Safe), one shared time bucket, full per-edge path details. Testers compare the three options and rate the choice, then walk one assigned mode inside a daytime walk window, annotating guidance nodes on the spot with mode-aware issue chips. Managers freeze the pairs, plan the cycle, and review the collected feedback next to street-level imagery. Everything lives behind this page.

I'm walking routes (tester)

Field app	Your assigned pair on the phone — open the personal link you received (`/field?t=<you>&day=<n>`), compare the three options, walk your assigned mode, tap nodes, rate the walk.
Tester guide	Two-minute read before your first walk: the compare step, issue chips per mode, photos, dictation, what gets recorded.

I'm running the cycle (manager / reviewer)

Map demo	Build an O/D pair and freeze it with "Save for field" — one facade call routes all three product profiles and records geometry, per-edge path details, and the shared time bucket in a single pair-doc. Doubles as the desk-review tool: any node opens in Mapillary / Street View.
Bench	Desk comparison of our pedestrian route against Google's (ours blue, Google amber) on a Google basemap, with a distance / overlap card and Street View inspection. Eyeball where the two routes diverge.
Field plan	Assign pair × mode × day per tester, each with a walk window; generates the personal links.
Field review	Collected feedback: per-pair overview (walks, ratings, issue counts) and a map view where flagged nodes sit next to street-level imagery, with the triage queue pre-sorted by fix surface.
Manager guide	The full runbook: freezing pairs, assignments, collection, photo EXIF-join, imagery keys.
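
As a rough illustration of what the field plan produces, the sketch below builds personal links in the `/field?t=<you>&day=<n>` shape from a list of assignments. The function name, the tester tokens, and the assignment tuples are hypothetical; only the link shape comes from this page.

```python
from urllib.parse import urlencode


def personal_link(tester_token: str, day: int, base: str = "") -> str:
    """Build a /field?t=<you>&day=<n> link for one tester and one day.

    `base` is an optional host prefix; it is left empty here because the
    real host is not stated on this page.
    """
    query = urlencode({"t": tester_token, "day": day})
    return f"{base}/field?{query}"


# Hypothetical assignment rows: (tester token, day number).
assignments = [("anna", 1), ("anna", 2), ("ben k", 1)]

links = [personal_link(token, day) for token, day in assignments]
for link in links:
    print(link)
```

urlencode escapes anything that is not URL-safe, so a token with a space or other special character still yields a valid link.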

How the pieces fit

Freeze — build O/D pairs in the demo and save them for field testing. Each frozen pair carries all three product profile routes plus per-edge path details, so every complaint joins back to what the engine believed at that exact spot (frozen snapshots keep all feedback comparable).
Plan — assign tester × day × mode with a walk window on the field plan and send out personal links with the tester guide.
Walk — in the field app, testers first compare the three options and rate choice value, then walk the assigned mode, annotating nodes with mode-aware chips. At the end of the walk they rate correctness and mode alignment. Rows queue offline and sync automatically.
Review — triage on field review, queue pre-sorted by fix surface; verify flagged nodes remotely via street-level imagery and the engine's per-edge beliefs; export the raw feedback.jsonl for analysis.
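
The review step relies on each exported row pointing back at the frozen pair. A minimal sketch of that join, assuming a hypothetical schema: the field names (`pair`, `mode`, `node`, `chip`, `edges`, `surface`, `weight`) and the chip values are invented for illustration and are not the real feedback.jsonl or pair-doc layout.

```python
import json

# Hypothetical frozen pair-docs, keyed by pair id.
pair_docs = {
    "pair-001": {
        "time_bucket": "wd_am",
        "routes": {
            "safe": {
                "edges": [
                    {"node": 0, "surface": "paved", "weight": 1.0},
                    {"node": 1, "surface": "gravel", "weight": 1.4},
                ]
            }
        },
    }
}

# Two rows as they might appear in feedback.jsonl (assumed schema).
feedback_jsonl = """\
{"pair": "pair-001", "mode": "safe", "node": 1, "chip": "felt_unsafe"}
{"pair": "pair-001", "mode": "safe", "node": 7, "chip": "wrong_turn"}
"""


def join_row(row: dict, docs: dict) -> dict:
    """Attach the frozen time bucket and per-edge details to one feedback row.

    `edge` is None when the node is not found in the frozen route, which
    is itself worth surfacing during triage.
    """
    doc = docs.get(row["pair"], {})
    route = doc.get("routes", {}).get(row["mode"], {})
    by_node = {edge["node"]: edge for edge in route.get("edges", [])}
    return {
        **row,
        "time_bucket": doc.get("time_bucket"),
        "edge": by_node.get(row["node"]),
    }


joined = [
    join_row(json.loads(line), pair_docs)
    for line in feedback_jsonl.splitlines()
    if line.strip()
]
for row in joined:
    print(row["chip"], row["time_bucket"], row["edge"])
```

Because the join reads only from the frozen pair-doc, rerunning it later gives the same answer even if the live engine has changed since the walk.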

Every feedback row joins to its frozen pair's exact geometry, per-edge path details, weights, and time bucket — rows double as replayable regression cases for routing and guidance work, and the pair-docs feed the bench directly (edge overlap, detour ratios, time deltas across the full pair set).
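
The bench metrics named above can be sketched in a few lines. This is an illustration under assumptions, not the bench's actual definitions: edge overlap is taken here as the Jaccard index of two routes' edge-id sets, and the detour ratio as one route's length divided by a baseline route's length. The edge ids and lengths are made up.

```python
def edge_overlap(edges_a: list[str], edges_b: list[str]) -> float:
    """Jaccard overlap of two routes' edge ids: shared edges / all edges."""
    set_a, set_b = set(edges_a), set(edges_b)
    union = set_a | set_b
    # Two empty routes are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


def detour_ratio(length_m: float, baseline_m: float) -> float:
    """How much longer a route is than the baseline (1.0 means equal)."""
    if baseline_m <= 0:
        raise ValueError("baseline length must be positive")
    return length_m / baseline_m


# Hypothetical routes from one frozen pair.
fast = {"edges": ["e1", "e2", "e3", "e4"], "length_m": 800.0}
safe = {"edges": ["e1", "e5", "e6", "e4"], "length_m": 1000.0}

overlap = edge_overlap(fast["edges"], safe["edges"])
ratio = detour_ratio(safe["length_m"], fast["length_m"])
print(f"overlap={overlap:.2f} detour={ratio:.2f}")
```

In this example the two routes share two of six distinct edges and Safe is a quarter longer than Fast. A length-weighted overlap would be a reasonable alternative if short connector edges dominate the edge count.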

Cycle scope and method

Daytime only. Cycle 1 is pre-registered on daytime buckets (wd_am / wd_pm): it tests Safe's route-shape promise, while the lights weight and darkness perception are explicitly out of scope. Night is a separate opt-in probe — one evening, a few volunteers walking in pairs, frozen wd_nt pairs after actual dark.
Search clarity is not part of walks. Whether you can create a route and understand the options in the live app is a pure frontend signal — it's measured in a separate once-per-tester live-app session, not on the end-of-walk sheet.
Mode alignment is walked, not previewed. Testers rate mode alignment only for the mode they actually walked — armchair ratings of the unwalked options aren't collected.
Inter-rater agreement. A subset of pairs is walked by two or three testers in the same mode and walk window, so we can see how much of a rating is the route and how much is the rater. Total walks stay the same — it's purely how the assignment grid is filled.
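
One simple way to read the repeated walks is to compare the spread of ratings within a pair and mode (rater noise) against the spread of the per-pair means (route signal). The sketch below is a descriptive check under assumed data, not a formal agreement statistic such as an intraclass correlation; the tuples and the rating scale are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical end-of-walk ratings: (pair id, walked mode, rating).
walks = [
    ("pair-001", "safe", 4),
    ("pair-001", "safe", 5),
    ("pair-001", "safe", 3),
    ("pair-002", "safe", 2),
    ("pair-002", "safe", 2),
    ("pair-003", "calm", 5),  # walked once, so it says nothing about raters
]

# Group ratings by the unit that was walked repeatedly.
groups = defaultdict(list)
for pair, mode, rating in walks:
    groups[(pair, mode)].append(float(rating))

# Only groups with two or more walks can show disagreement.
repeated = {key: vals for key, vals in groups.items() if len(vals) >= 2}

# Average variance inside each group: how much raters disagree.
within = mean([pvariance(vals) for vals in repeated.values()])
# Variance of the group means: how much the routes themselves differ.
between = pvariance([mean(vals) for vals in repeated.values()])

print(f"within={within:.3f} between={between:.3f}")
```

When the within value is small relative to the between value, most of a rating reflects the route. With only a handful of repeated pairs the numbers are indicative at best.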

Ratings are descriptive health metrics; no decision hangs on a rating alone. Issue chips and their joins to the frozen pair-docs are what drive fixes. "Ratings tell us how bad it is; chips and joins tell us what to fix; the bench tells us whether we fixed it."