Field testing
We dogfood pedestrian routing internally. The frozen unit is an origin–destination pair: one facade call across all three product profiles (Fast / Calm / Safe), one shared time bucket, full per-edge path details. Testers compare the three options and rate the choice, then walk one assigned mode inside a daytime walk window, annotating guidance nodes on the spot with mode-aware issue chips. Managers freeze the pairs, plan the cycle, and review the collected feedback next to street-level imagery. Everything lives behind this page.
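To make the frozen unit concrete, here is a minimal sketch of what a pair-doc could hold. It assumes the facade call returns an ordered edge list per profile, each edge carrying the engine's per-edge path details. The class names, field names and the `fast`/`calm`/`safe` keys are hypothetical, not the real schema. The time bucket is a single field, so all three profiles share it by construction. `edge_leaving` shows the kind of lookup that lets a complaint at a node join back to what the engine believed there.

```python
from dataclasses import dataclass
from typing import Optional

PROFILES = ("fast", "calm", "safe")  # hypothetical keys for Fast / Calm / Safe


@dataclass(frozen=True)
class EdgeDetail:
    edge_id: str
    from_node: str
    to_node: str
    length_m: float
    details: dict  # engine's per-edge path details, copied at freeze time (shape assumed)


@dataclass(frozen=True)
class PairDoc:
    pair_id: str
    origin: tuple[float, float]          # (lat, lon)
    destination: tuple[float, float]
    time_bucket: str                     # one field, so all three profiles share one bucket
    routes: dict[str, list[EdgeDetail]]  # profile -> ordered edges from the single facade call

    def __post_init__(self):
        # A frozen pair is only useful for comparison if every profile was routed.
        missing = set(PROFILES) - set(self.routes)
        if missing:
            raise ValueError(f"pair {self.pair_id} lacks profiles {sorted(missing)}")

    def edge_leaving(self, profile: str, node_id: str) -> Optional[EdgeDetail]:
        """The frozen edge the engine chose out of node_id for this profile, if any."""
        for edge in self.routes[profile]:
            if edge.from_node == node_id:
                return edge
        return None
```

Keeping the per-edge details inside the pair-doc, rather than re-querying the engine later, is what keeps feedback comparable after the engine changes.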
I'm walking routes (tester)
| Field app | Your assigned pair on the phone: open the personal link you received (`/field?t=<you>&day=<n>`), compare the three options, walk your assigned mode, tap nodes, rate the walk. |
|---|---|
| Tester guide | Two-minute read before your first walk: the compare step, issue chips per mode, photos, dictation, what gets recorded. |
I'm running the cycle (manager / reviewer)
| Map demo | Build an O/D pair and freeze it with "Save for field" — one facade call routes all three product profiles and records geometry, per-edge path details, and the shared time bucket in a single pair-doc. Doubles as the desk-review tool: any node opens in Mapillary / Street View. |
|---|---|
| Bench | Desk comparison of our pedestrian route against Google's (ours blue, Google amber) on a Google basemap, with a distance / overlap card and Street View inspection. Eyeball where the two routes diverge. |
| Field plan | Assign pair × mode × day per tester, each with a walk window; generates the personal links. |
| Field review | Collected feedback: per-pair overview (walks, ratings, issue counts) and a map view where flagged nodes sit next to street-level imagery, with the triage queue pre-sorted by fix surface. |
| Manager guide | The full runbook: freezing pairs, assignments, collection, photo EXIF-join, imagery keys. |
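The field plan's output can be pictured as a list of assignment cells plus the personal links built from them. The sketch below is an assumption-laden illustration: the `Assignment` fields and helper names are hypothetical, and the host part of the URL is left out. Only the `/field?t=<you>&day=<n>` shape comes from this page. Because the link carries only tester and day, the sketch assumes the field app looks up pair, mode and walk window on its side.

```python
from dataclasses import dataclass
from datetime import time
from urllib.parse import urlencode


@dataclass(frozen=True)
class Assignment:
    """One hypothetical field-plan cell: pair x mode x day for one tester."""
    tester: str                  # tester token, the <you> in the link
    day: int                     # cycle day, the <n> in the link
    pair_id: str
    mode: str                    # hypothetical keys: "fast" | "calm" | "safe"
    window: tuple[time, time]    # daytime walk window (start, end)

    def __post_init__(self):
        start, end = self.window
        if not start < end:
            raise ValueError(f"empty walk window for {self.tester} on day {self.day}")


def personal_link(a: Assignment, path: str = "/field") -> str:
    """Relative link in the /field?t=<you>&day=<n> shape; urlencode escapes the token."""
    return f"{path}?{urlencode({'t': a.tester, 'day': a.day})}"


def links_by_tester(plan: list[Assignment]) -> dict[str, list[str]]:
    """Distinct links per tester; several cells on the same day share one link."""
    links: dict[str, set[str]] = {}
    for a in plan:
        links.setdefault(a.tester, set()).add(personal_link(a))
    return {tester: sorted(urls) for tester, urls in links.items()}
```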
How the pieces fit
- Freeze — build O/D pairs in the demo and save them for field testing. Each frozen pair carries all three product profile routes plus per-edge path details, so every complaint joins back to what the engine believed at that exact spot (frozen snapshots keep all feedback comparable).
- Plan — assign tester × day × mode with a walk window on the field plan and send out personal links with the tester guide.
- Walk — testers first compare the three options and rate choice value, then walk the assigned mode in the field app, annotating nodes with mode-aware chips and, at the end, rating correctness and mode alignment; rows queue offline and sync automatically.
- Review — triage on field review, queue pre-sorted by fix surface; verify flagged nodes remotely via street-level imagery and the engine's per-edge beliefs; export the raw `feedback.jsonl` for analysis.
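"Rows queue offline and sync automatically" can be read as an outbox pattern. The sketch below assumes that and nothing more: `OfflineQueue`, the row fields and the `send` callback are hypothetical, not the field app's actual code. Each row gets a client-generated `row_id` at capture time, so if an acknowledgement is lost and the row is resent, the server can drop the duplicate by ID.

```python
import uuid
from collections import deque
from typing import Callable


class OfflineQueue:
    """Hypothetical field-app outbox: rows stay on the device until the server acknowledges them."""

    def __init__(self, send: Callable[[dict], bool]):
        self._send = send        # returns True once the server has stored the row
        self._pending = deque()  # unacknowledged rows, oldest first

    def record(self, pair_id: str, node_id: str, chips: list[str], note: str = "") -> str:
        """Capture a node annotation locally; needs no connectivity."""
        row = {"row_id": str(uuid.uuid4()), "pair_id": pair_id,
               "node_id": node_id, "chips": list(chips), "note": note}
        self._pending.append(row)
        return row["row_id"]

    def sync(self) -> int:
        """Send rows oldest first; stop at the first failure and keep the rest for the next try."""
        sent = 0
        while self._pending and self._send(self._pending[0]):
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        return len(self._pending)
```

Sending in capture order and stopping at the first failure keeps each walk's rows in sequence without any extra bookkeeping.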
Every feedback row joins to its frozen pair's exact geometry, per-edge path details, weights, and time bucket — rows double as replayable regression cases for routing and guidance work, and the pair-docs feed the bench directly (edge overlap, detour ratios, time deltas across the full pair set).
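The bench figures named above can be computed straight from two frozen routes. This is a sketch under stated assumptions: both routes are expressed over the same edge IDs (a Google route would first need map matching), "overlap" is taken as the length-weighted share of our route that lies on reference edges, and "detour" as our length over the reference length. These definitions and the `Leg`/`compare` names are illustrative, not the bench's actual formulas.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Leg:
    edge_id: str
    length_m: float
    seconds: float


def compare(route: list[Leg], reference: list[Leg]) -> dict[str, float]:
    """Edge overlap, detour ratio and time delta of route against reference."""
    length = sum(leg.length_m for leg in route)
    ref_length = sum(leg.length_m for leg in reference)
    if length <= 0 or ref_length <= 0:
        raise ValueError("both routes need a positive length")
    ref_edges = {leg.edge_id for leg in reference}
    shared = sum(leg.length_m for leg in route if leg.edge_id in ref_edges)
    return {
        "overlap": shared / length,        # 1.0 means every metre of ours is on the reference
        "detour": length / ref_length,     # > 1.0 means ours is longer
        "time_delta_s": sum(leg.seconds for leg in route) - sum(leg.seconds for leg in reference),
    }


def median_detour(pairs: list[tuple[list[Leg], list[Leg]]]) -> float:
    """Median detour ratio across the pair set, e.g. Safe vs Fast per pair."""
    return median(compare(route, ref)["detour"] for route, ref in pairs)
```

A median rather than a mean keeps one long outlier pair from dominating the summary across the pair set.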
Cycle scope and method
- Daytime only. Cycle 1 is pre-registered on daytime buckets (`wd_am`/`wd_pm`): it tests Safe's route-shape promise, while the lights weight and darkness perception are explicitly out of scope. Night is a separate opt-in probe: one evening, a few volunteers walking in pairs, frozen `wd_nt` pairs walked after actual dark.
- Search clarity is not part of walks. Whether you can create a route and understand the options in the live app is a pure frontend signal; it's measured in a separate once-per-tester live-app session, not on the end-of-walk sheet.
- Mode alignment is walked, not previewed. Testers rate mode alignment only for the mode they actually walked — armchair ratings of the unwalked options aren't collected.
- Inter-rater agreement. A subset of pairs is walked by two or three testers in the same mode and walk window, so we can see how much of a rating is the route and how much is the rater. Total walks stay the same — it's purely how the assignment grid is filled.
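Filling the grid for inter-rater agreement without changing the total walk count can be sketched as follows. Assumptions: each key in `pair_ids` stands for a pair already bound to one mode and one walk window, and slots are dealt round-robin. `fill_grid` and its parameters are hypothetical. The first `overlap_pairs` keys each get `raters` copies, single-rater pairs fill the remaining slots, and consecutive copies of a shared pair land on different testers.

```python
def fill_grid(pair_ids: list[str], testers: list[str], walks_per_tester: int,
              overlap_pairs: int, raters: int = 2) -> list[tuple[str, str]]:
    """Return (tester, pair_id) walks; the total is always len(testers) * walks_per_tester."""
    if not 2 <= raters <= len(testers):
        raise ValueError("raters must be between 2 and the number of testers")
    total = len(testers) * walks_per_tester
    shared, single = pair_ids[:overlap_pairs], pair_ids[overlap_pairs:]
    slots = [p for p in shared for _ in range(raters)]  # repeated pairs, copies adjacent
    needed = total - len(slots)
    if needed < 0 or needed > len(single):
        raise ValueError("pair set does not fit the fixed number of walks")
    slots += single[:needed]  # overlap is paid for with fewer distinct pairs, not extra walks
    # Round-robin: adjacent copies go to distinct testers (raters <= len(testers)),
    # and every tester ends up with exactly walks_per_tester walks.
    return [(testers[i % len(testers)], p) for i, p in enumerate(slots)]
```

With three testers walking two pairs each, replicating two pairs twice leaves room for only two single-rater pairs: six walks either way, but four distinct pairs instead of six.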
Ratings are descriptive health metrics — no decision hangs on a rating alone. Issue chips and their joins to the frozen pair-docs are what drives fixes. "Ratings tell us how bad it is; chips and joins tell us what to fix; the bench tells us whether we fixed it."