# GTFS Harmonization and QA Concept

Last updated: 2026-07-01

## Decision

Run harmonization inside the existing Mobility Workbench for now:

- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with `/api/qa/*`.
- Separate canonical export concept, but no separate public API backend yet.

Split this into a separate service later when one of these becomes true:

- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.

The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.

The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.

## Target Pipeline

```text
source catalog
  -> raw feed snapshots
  -> validation reports
  -> normalized staging tables
  -> canonical matching and deduplication
  -> conflict review and reusable rules
  -> versioned canonical snapshot
  -> GTFS/API/GeoParquet exports
```

The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.

## Core Concepts

### Source Registry

Track every identified source, including feeds not yet imported:

- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.

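
A registry row covering the fields listed above could look roughly like the sketch below. The class name `SourceRecord`, every field name, the "lower rank is more authoritative" convention, and the staleness rule are illustrative assumptions, not the workbench's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SourceRecord:
    """One registry row. All field names here are illustrative assumptions."""
    source_url: str
    publisher: str
    countries: List[str]
    modes: List[str]
    authority_rank: int                 # assumed convention: lower = more authoritative
    update_cadence_days: Optional[int]  # None when the publisher states no cadence
    last_release: Optional[date]        # None when no release has been seen yet
    importable: bool
    license_url: Optional[str] = None

    def is_stale(self, today: date, grace: float = 2.0) -> bool:
        """Treat unknown cadence or release date as stale so it surfaces in QA."""
        if self.update_cadence_days is None or self.last_release is None:
            return True
        age_days = (today - self.last_release).days
        return age_days > self.update_cadence_days * grace


record = SourceRecord(
    source_url="https://example.org/gtfs.zip",  # placeholder URL
    publisher="Example Transit Authority",      # placeholder publisher
    countries=["DE"],
    modes=["bus", "tram"],
    authority_rank=2,
    update_cadence_days=7,
    last_release=date(2026, 6, 1),
    importable=True,
)
# 30 days old against a 7-day cadence with a 2x grace factor, so this prints True.
print(record.is_stale(today=date(2026, 7, 1)))
```

Treating unknown freshness as stale is a deliberate choice in this sketch: sources that were identified but not yet imported then show up in QA instead of silently passing.
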
Mobility Database can be used as a broad discovery connector. Prefer the full `feeds_v2.csv` catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider license review and authority ranking.

PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, `network:guid`, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.

The generated discovery files live under `docs/generated/`:

- `gtfs_feed_candidates.csv` keeps every discovered feed/evidence row.
- `gtfs_ingestable_sources.csv` keeps rows that can be imported as GTFS sources after review.
- `gtfs_test_run_sources.csv` keeps a smaller multi-source set for deduplication tests.

Required license flags before publication:

- `can_import`
- `can_derive`
- `can_redistribute`
- `requires_attribution`
- `commercial_restrictions`

### Raw Snapshots

Every update should preserve immutable raw input:

- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.

This keeps deduplication and conflict decisions reproducible.

### Canonical Entities

Stable mobility IDs should be the internal truth. Source IDs remain aliases.

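
The relationship between stable internal IDs and source aliases can be sketched as a small in-memory index. The names `AliasIndex`, `mint`, `link`, and `resolve`, the `entity_type:uuid` ID format, and the example source and stop IDs are all hypothetical; a real implementation would presumably live in the PostgreSQL/PostGIS database rather than a dictionary.

```python
import uuid
from typing import Dict, Optional, Tuple

# An alias is (source id, entity type, ID as used inside that source).
AliasKey = Tuple[str, str, str]


class AliasIndex:
    """Maps source-scoped IDs onto stable canonical IDs (illustrative only)."""

    def __init__(self) -> None:
        self._by_alias: Dict[AliasKey, str] = {}

    def mint(self, entity_type: str) -> str:
        """Create a new canonical ID; the format is an assumption of this sketch."""
        return f"{entity_type}:{uuid.uuid4().hex}"

    def link(self, canonical_id: str, alias: AliasKey) -> None:
        """Attach a source alias, refusing to silently re-point an existing one."""
        existing = self._by_alias.get(alias)
        if existing is not None and existing != canonical_id:
            raise ValueError(f"{alias} already linked to {existing}")
        self._by_alias[alias] = canonical_id

    def resolve(self, alias: AliasKey) -> Optional[str]:
        """Return the canonical ID for a source alias, or None if unlinked."""
        return self._by_alias.get(alias)


index = AliasIndex()
station = index.mint("stop_place")
# Two feeds describe the same station under different source IDs.
index.link(station, ("national_feed", "stop_place", "8000105"))
index.link(station, ("operator_feed", "stop_place", "FFM-HBF"))
same = index.resolve(("national_feed", "stop_place", "8000105")) == index.resolve(
    ("operator_feed", "stop_place", "FFM-HBF")
)
print(same)  # both aliases resolve to the same canonical ID
```

Raising on a conflicting link, rather than overwriting, keeps re-pointing an alias an explicit review decision.
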
Initial canonical entity families:

- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references later.

### Authority Ranking

Conflict resolution needs explicit source authority:

- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.

Authority can differ by entity type. A source can be authoritative for timetable but weak for route geometry or operator identity.

### Conflict Review

The QA dashboard should expose review queues for:

- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.

Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.

## Export Strategy

Do not start with one giant Europe GTFS zip as the only product.

Produce:

- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints later.

Each export needs:

- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.

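
The per-export metadata listed above could travel with each artifact as a small JSON manifest. The sketch below is one possible shape; `ExportManifest`, `build_manifest`, the field names, and the example snapshot and feed identifiers are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ExportManifest:
    """Metadata shipped next to one export artifact (hypothetical field names)."""
    snapshot_id: str
    source_feed_versions: Dict[str, str]  # source id -> feed version or hash
    generated_at: str                     # ISO 8601 timestamp, UTC
    validation_summary: Dict[str, int]    # e.g. counts per validator severity
    attributions: List[str]               # license/attribution lines to publish
    open_conflicts: int                   # unresolved review items at export time

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def build_manifest(
    snapshot_id: str,
    feeds: Dict[str, str],
    validation_summary: Dict[str, int],
    attributions: List[str],
    open_conflicts: int,
    now: Optional[datetime] = None,
) -> ExportManifest:
    """Assemble a manifest; `now` is injectable so output can be reproduced."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return ExportManifest(
        snapshot_id=snapshot_id,
        source_feed_versions=dict(feeds),
        generated_at=stamp,
        validation_summary=dict(validation_summary),
        attributions=list(attributions),
        open_conflicts=open_conflicts,
    )


manifest = build_manifest(
    snapshot_id="snapshot-0001",                      # placeholder id
    feeds={"national_feed": "2026-06-28"},            # placeholder version
    validation_summary={"errors": 0, "warnings": 12},
    attributions=["Example Transit Authority, CC BY 4.0"],
    open_conflicts=3,
    now=datetime(2026, 7, 1, tzinfo=timezone.utc),
)
print(manifest.to_json())
```

Writing the same manifest next to every GTFS zip and GeoParquet dump would let consumers check which canonical snapshot and source feed versions an artifact was built from.
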
## Current Implementation Step

The first implementation is a lightweight harmonization boundary:

- `/api/qa/summary`;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.
- GTFS source add/import/review controls live in the `GTFS Harmonization` sidebar module.
- OSM/route-layer source controls live in the `Mapping Data` sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.

This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
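
The step from counters to review queues can be sketched as a simple filter and sort. The `QaCounter` shape, the `ok`/`warning`/`bad` severity vocabulary, and the counter keys below are assumptions; the actual `/api/qa/summary` payload may be structured differently.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed severity vocabulary; only these two severities open a review queue.
SEVERITY_ORDER = {"bad": 0, "warning": 1}


@dataclass(frozen=True)
class QaCounter:
    key: str       # hypothetical counter name
    severity: str  # "ok", "warning", or "bad"
    count: int


def review_queues(counters: Iterable[QaCounter]) -> List[QaCounter]:
    """Keep non-zero warning/bad counters, worst severity and largest count first."""
    flagged = [
        c for c in counters
        if c.count > 0 and c.severity in SEVERITY_ORDER
    ]
    return sorted(flagged, key=lambda c: (SEVERITY_ORDER[c.severity], -c.count, c.key))


summary = [
    QaCounter("gtfs_stops_without_canonical_link", "warning", 140),
    QaCounter("license_blockers", "bad", 2),
    QaCounter("routes_missing_shapes", "warning", 0),
    QaCounter("imported_sources", "ok", 35),
    QaCounter("stale_calendars", "bad", 9),
]
for queue in review_queues(summary):
    print(queue.severity, queue.key, queue.count)
```

Each returned counter would then back one drill-down list; persisting the resolution actions is outside the scope of this sketch.
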