GTFS Harmonization and QA Concept
Last updated: 2026-07-01
Decision
Run harmonization inside the existing Mobility Workbench for now:
- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with /api/qa/*.
- Separate canonical export concept, but no separate public API backend yet.
Split this into a separate service later when one of these becomes true:
- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.
The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.
The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.
Target Pipeline
source catalog
-> raw feed snapshots
-> validation reports
-> normalized staging tables
-> canonical matching and deduplication
-> conflict review and reusable rules
-> versioned canonical snapshot
-> GTFS/API/GeoParquet exports
The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.
Core Concepts
Source Registry
Track every identified source, including feeds not yet imported:
- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.
Mobility Database can be used as a broad discovery connector. Prefer the full feeds_v2.csv catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider license review and authority ranking.
PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, network:guid, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.
The generated discovery files live under docs/generated/:
- gtfs_feed_candidates.csv keeps every discovered feed/evidence row.
- gtfs_ingestable_sources.csv keeps rows that can be imported as GTFS sources after review.
- gtfs_test_run_sources.csv keeps a smaller multi-source set for deduplication tests.
Required license flags before publication:
- can_import
- can_derive
- can_redistribute
- requires_attribution
- commercial_restrictions
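
A minimal sketch of how the five license flags could gate publication. The flag names come from the list above; the class name LicenseFlags, the function publication_blockers, and the policy itself (every flag must be reviewed, and import, derive, and redistribute must all be granted) are assumptions for illustration, not the workbench's actual rules.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class LicenseFlags:
    """The five license flags; None means 'not reviewed yet' (assumed convention)."""
    can_import: Optional[bool] = None
    can_derive: Optional[bool] = None
    can_redistribute: Optional[bool] = None
    requires_attribution: Optional[bool] = None
    commercial_restrictions: Optional[bool] = None


def publication_blockers(flags: LicenseFlags) -> List[str]:
    """Return human-readable reasons why a source cannot be published yet."""
    # Any flag that is still None has not been through license review.
    blockers = [
        f"{f.name} has not been reviewed"
        for f in fields(flags)
        if getattr(flags, f.name) is None
    ]
    # Assumed policy: publishing needs all three permissions to be granted.
    for name in ("can_import", "can_derive", "can_redistribute"):
        if getattr(flags, name) is False:
            blockers.append(f"{name} is denied")
    return blockers
```

An empty list means the source passes this particular gate; requires_attribution and commercial_restrictions do not block here, on the assumption that they feed the export's license/attribution manifest instead.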
Raw Snapshots
Every update should preserve immutable raw input:
- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.
This keeps deduplication and conflict decisions reproducible.
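
The snapshot record could look roughly like the sketch below. The field list follows the bullets above; the names RawSnapshot and make_snapshot, the choice of SHA-256, and storing the validator report as a string reference are assumptions.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class RawSnapshot:
    source_id: str
    fetched_at: datetime
    source_hash: str                      # SHA-256 hex digest of the raw bytes
    upstream_metadata: Mapping[str, str]  # e.g. HTTP headers (assumed shape)
    parser_version: str
    validator_report: Optional[str]       # reference to a stored report, if any
    previous_source_hash: Optional[str]   # hash of the previous active snapshot


def make_snapshot(
    source_id: str,
    payload: bytes,
    parser_version: str,
    upstream_metadata: Mapping[str, str],
    previous: Optional[RawSnapshot] = None,
    validator_report: Optional[str] = None,
    fetched_at: Optional[datetime] = None,
) -> RawSnapshot:
    """Build an immutable record describing one fetched raw feed."""
    return RawSnapshot(
        source_id=source_id,
        fetched_at=fetched_at or datetime.now(timezone.utc),
        source_hash=hashlib.sha256(payload).hexdigest(),
        upstream_metadata=dict(upstream_metadata),  # copy so later caller edits do not leak in
        parser_version=parser_version,
        validator_report=validator_report,
        previous_source_hash=previous.source_hash if previous else None,
    )
```

Chaining each record to the previous hash makes it possible to replay which input a given deduplication decision was based on.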
Canonical Entities
Stable meubility IDs should be the internal truth. Source IDs remain aliases.
Initial canonical entity families:
- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references later.
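
As an illustration of "stable IDs as truth, source IDs as aliases", the sketch below keeps a lookup from (source id, source-local id) to a canonical ID. The class AliasIndex, its methods, and the ID strings used in the test are hypothetical; a real implementation would presumably live in the PostgreSQL database rather than in memory.

```python
from typing import Dict, List, Optional, Tuple

# (source_id, id as published by that source)
SourceKey = Tuple[str, str]


class AliasIndex:
    """Maps source-specific IDs to one stable canonical ID."""

    def __init__(self) -> None:
        self._by_alias: Dict[SourceKey, str] = {}

    def link(self, canonical_id: str, source_id: str, local_id: str) -> None:
        """Register a source ID as an alias of a canonical entity."""
        key = (source_id, local_id)
        existing = self._by_alias.get(key)
        if existing is not None and existing != canonical_id:
            # One source ID must never point at two canonical entities.
            raise ValueError(f"{key} is already linked to {existing}")
        self._by_alias[key] = canonical_id

    def resolve(self, source_id: str, local_id: str) -> Optional[str]:
        """Return the canonical ID for a source ID, or None if unlinked."""
        return self._by_alias.get((source_id, local_id))

    def aliases_of(self, canonical_id: str) -> List[SourceKey]:
        """List every source ID that points at the given canonical entity."""
        return sorted(k for k, v in self._by_alias.items() if v == canonical_id)
```

Unlinked source IDs (resolve returning None) correspond to the "GTFS stops without canonical links" review queue described later.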
Authority Ranking
Conflict resolution needs explicit source authority:
- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.
Authority can differ by entity type. A source can be authoritative for timetable but weak for route geometry or operator identity.
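
One way to express this ordering, including per-entity-type exceptions, is sketched below. The source class labels mirror the list above; the numeric ranks, the Candidate type, the override keying by (source id, entity type), and the entity type names are assumptions.

```python
from typing import Any, Dict, List, NamedTuple, Optional, Tuple

# Lower number = higher authority, following the list order above.
DEFAULT_RANK: Dict[str, int] = {
    "manual_review": 0,
    "national_official": 1,
    "regional_authority": 2,
    "operator": 3,
    "aggregator": 4,
    "osm": 5,
}


class Candidate(NamedTuple):
    source_id: str
    source_class: str
    value: Any


def rank(
    candidate: Candidate,
    entity_type: str,
    overrides: Optional[Dict[Tuple[str, str], int]] = None,
) -> int:
    """Rank a candidate, letting a per-source, per-entity override win."""
    key = (candidate.source_id, entity_type)
    if overrides and key in overrides:
        return overrides[key]
    return DEFAULT_RANK[candidate.source_class]


def pick_winner(
    candidates: List[Candidate],
    entity_type: str,
    overrides: Optional[Dict[Tuple[str, str], int]] = None,
) -> Candidate:
    """Return the most authoritative candidate; ties keep input order."""
    return min(candidates, key=lambda c: rank(c, entity_type, overrides))
```

For example, an override such as {("nat-feed", "shape"): 4} would demote a national feed for geometry only, while it stays the top non-manual source for timetables.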
Conflict Review
The QA dashboard should expose review queues for:
- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.
Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.
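
The reuse of manual resolutions can be pictured as a rule book that is consulted before a conflict reaches a review queue. This is a sketch only: Conflict, RuleBook, the conflict kinds, and the free-text decision strings are hypothetical, and a conflict is identified here simply by its kind plus the set of source IDs involved.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Conflict:
    kind: str                            # e.g. "duplicate_stop_place"
    members: FrozenSet[Tuple[str, str]]  # (source_id, local_id) pairs involved


class RuleBook:
    """Stores manual decisions keyed by the conflict they resolved."""

    def __init__(self) -> None:
        self._rules: Dict[Conflict, str] = {}

    def record(self, conflict: Conflict, decision: str) -> None:
        """Persist a reviewer's decision as a reusable rule."""
        self._rules[conflict] = decision

    def triage(
        self, conflicts: Iterable[Conflict]
    ) -> Tuple[List[Tuple[Conflict, str]], List[Conflict]]:
        """Split conflicts into auto-resolved (by rule) and still pending."""
        resolved: List[Tuple[Conflict, str]] = []
        pending: List[Conflict] = []
        for conflict in conflicts:
            if conflict in self._rules:
                resolved.append((conflict, self._rules[conflict]))
            else:
                pending.append(conflict)
        return resolved, pending
```

Because the key is built from source IDs rather than from a snapshot, the same rule applies again when a later import of the same feeds raises the same conflict.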
Export Strategy
Do not start with one giant Europe GTFS zip as the only product. Produce:
- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints later.
Each export needs:
- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.
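
A possible shape for the per-export metadata is sketched below as a small manifest serialized to JSON. The six fields follow the list above; the class name ExportManifest, the concrete field types, and the example values in the test are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExportManifest:
    snapshot_id: str
    source_feed_versions: Dict[str, str]  # source_id -> feed version or hash
    generated_at: str                     # ISO 8601 timestamp
    validation_summary: Dict[str, int]    # counter name -> count
    license_attributions: List[str]       # attribution lines to ship with the export
    review_status: str                    # e.g. "open_conflicts" or "clean"


def manifest_to_json(manifest: ExportManifest) -> str:
    """Serialize the manifest deterministically so exports can be diffed."""
    return json.dumps(asdict(manifest), sort_keys=True, indent=2)
```

Writing this file next to every GTFS zip or GeoParquet dump would let a consumer trace an export back to the snapshot and source feed versions it was built from.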
Current Implementation Step
The first implementation is a lightweight harmonization boundary:
- /api/qa/summary;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.
- GTFS source add/import/review controls live in the GTFS Harmonization sidebar module.
- OSM/route-layer source controls live in the Mapping Data sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.
This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
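
A rough sketch of that next step: derive review queue stubs from whatever counters the summary reports. The summary shape used here (counter name mapped to a status of "ok", "warning", or "bad" plus a count) is an assumption, not the actual /api/qa/summary response, and review_queues is a hypothetical helper.

```python
from typing import Dict, List

# Assumed severity labels; "ok" counters never produce a queue.
SEVERITY_ORDER = {"bad": 0, "warning": 1}


def review_queues(summary: Dict[str, Dict]) -> List[Dict]:
    """Turn non-zero warning/bad counters into ordered review queue stubs."""
    flagged = [
        (name, entry)
        for name, entry in summary.items()
        if entry["status"] in SEVERITY_ORDER and entry["count"] > 0
    ]
    # Worst severity first, then largest backlog, then name for stable output.
    flagged.sort(
        key=lambda item: (SEVERITY_ORDER[item[1]["status"]], -item[1]["count"], item[0])
    )
    return [
        {"queue": name, "status": entry["status"], "open_items": entry["count"]}
        for name, entry in flagged
    ]
```

Each stub would still need a drill-down list and persistent resolution actions behind it; this only decides which queues exist and in what order they are shown.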