# GTFS Harmonization and QA Concept

Last updated: 2026-07-01

## Decision

Run harmonization inside the existing Mobility Workbench for now:

- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with `/api/qa/*`.
- Separate canonical export concept, but no separate public API backend yet.

Split this into a separate service later when one of these becomes true:

- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.

The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.

The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.

## Target Pipeline

```text
source catalog
-> raw feed snapshots
-> validation reports
-> normalized staging tables
-> canonical matching and deduplication
-> conflict review and reusable rules
-> versioned canonical snapshot
-> GTFS/API/GeoParquet exports
```

The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.
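
The stage order above can be sketched as an ordered enumeration. This is a minimal illustration, not the workbench's actual model: the `Stage` names and the `next_stage` and `can_export` helpers are hypothetical and only encode the rule that exports are produced from a versioned canonical snapshot.

```python
from __future__ import annotations

from enum import IntEnum


class Stage(IntEnum):
    """Pipeline stages in the order listed above (names are illustrative)."""

    SOURCE_CATALOG = 0
    RAW_SNAPSHOT = 1
    VALIDATION_REPORT = 2
    NORMALIZED_STAGING = 3
    CANONICAL_MATCHING = 4
    CONFLICT_REVIEW = 5
    CANONICAL_SNAPSHOT = 6
    EXPORT = 7


def next_stage(current: Stage) -> Stage | None:
    """Return the stage that follows `current`, or None after the last stage."""
    return Stage(current + 1) if current < Stage.EXPORT else None


def can_export(reached: Stage) -> bool:
    """Exports (GTFS, API, GeoParquet) are only cut from a canonical snapshot."""
    return reached >= Stage.CANONICAL_SNAPSHOT
```

Modelling the stages as an ordered type makes it cheap to reject an export request for data that has only reached staging or conflict review.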

## Core Concepts

### Source Registry

Track every identified source, including feeds not yet imported:

- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.

Mobility Database can be used as a broad discovery connector. Prefer the full `feeds_v2.csv` catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider licence review and authority ranking.

PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, `network:guid`, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.

The generated discovery files live under `docs/generated/`:

- `gtfs_feed_candidates.csv` keeps every discovered feed/evidence row.
- `gtfs_ingestable_sources.csv` keeps rows that can be imported as GTFS sources after review.
- `gtfs_test_run_sources.csv` keeps a smaller multi-source set for deduplication tests.

Required license flags before publication:

- `can_import`
- `can_derive`
- `can_redistribute`
- `requires_attribution`
- `commercial_restrictions`
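
A publication gate over these flags could look like the sketch below. The flag names come from the list above; everything else is an assumption made for illustration: the `LicenseFlags` class, the `publication_blockers` helper, the use of `None` to mean "not reviewed yet", and the choice to treat a `False` value on the three permission flags as blocking.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LicenseFlags:
    """License review result for one source. None means 'not reviewed yet'."""

    can_import: Optional[bool] = None
    can_derive: Optional[bool] = None
    can_redistribute: Optional[bool] = None
    requires_attribution: Optional[bool] = None
    commercial_restrictions: Optional[bool] = None


# Assumption: these three must be explicitly True before publication.
PERMISSION_FLAGS = ("can_import", "can_derive", "can_redistribute")


def publication_blockers(flags: LicenseFlags) -> list[str]:
    """List the reasons a source cannot be published; empty means no blocker."""
    blockers = [
        f"{f.name}: not reviewed"
        for f in fields(flags)
        if getattr(flags, f.name) is None
    ]
    blockers += [
        f"{name}: not permitted"
        for name in PERMISSION_FLAGS
        if getattr(flags, name) is False
    ]
    return blockers
```

An unreviewed source reports five blockers, one per flag, which maps directly onto a license/redistribution review queue.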

### Raw Snapshots

Every update should preserve immutable raw input:

- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.

This keeps deduplication and conflict decisions reproducible.
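
One way to capture these fields is an immutable record keyed by a content hash. The sketch below is an assumption-laden illustration: `RawSnapshot`, `record_snapshot`, the field names, the choice of SHA-256, and the policy of reusing the previous record when the bytes are unchanged are all hypothetical.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class RawSnapshot:
    """Metadata stored next to the untouched feed archive."""

    source_id: str
    fetched_at: datetime
    source_hash: str
    importer_version: str
    upstream_metadata: Mapping[str, str] = field(default_factory=dict)
    validator_report_ref: Optional[str] = None
    previous_hash: Optional[str] = None  # link to the previous active snapshot


def record_snapshot(
    source_id: str,
    payload: bytes,
    importer_version: str,
    previous: Optional[RawSnapshot] = None,
    upstream_metadata: Optional[Mapping[str, str]] = None,
    validator_report_ref: Optional[str] = None,
) -> RawSnapshot:
    """Hash the raw bytes and build a snapshot record chained to `previous`."""
    digest = hashlib.sha256(payload).hexdigest()
    if previous is not None and previous.source_hash == digest:
        # Assumed policy: identical bytes do not create a new snapshot.
        return previous
    return RawSnapshot(
        source_id=source_id,
        fetched_at=datetime.now(timezone.utc),
        source_hash=digest,
        importer_version=importer_version,
        upstream_metadata=dict(upstream_metadata or {}),
        validator_report_ref=validator_report_ref,
        previous_hash=previous.source_hash if previous else None,
    )
```

Because each record points at the hash of its predecessor, a later deduplication or conflict decision can be traced back to the exact input bytes and importer version that produced it.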

### Canonical Entities

Stable meubility IDs should be the internal truth. Source IDs remain aliases.

Initial canonical entity families:

- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references later.
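
The split between stable internal IDs and source aliases can be illustrated with a small in-memory registry. This is a sketch only: `AliasRegistry`, its methods, and the `<entity_type>:<uuid>` ID format are hypothetical, and a real implementation would presumably persist the mapping in the existing PostgreSQL database.

```python
from __future__ import annotations

import uuid


class AliasRegistry:
    """Maps (entity_type, source_id, source_entity_id) aliases to stable IDs."""

    def __init__(self) -> None:
        self._canonical_by_alias: dict[tuple[str, str, str], str] = {}

    def resolve(self, entity_type: str, source_id: str, source_entity_id: str) -> str:
        """Return the canonical ID for an alias, minting one on first sight."""
        alias = (entity_type, source_id, source_entity_id)
        if alias not in self._canonical_by_alias:
            # Hypothetical ID format; the document does not define one.
            self._canonical_by_alias[alias] = f"{entity_type}:{uuid.uuid4()}"
        return self._canonical_by_alias[alias]

    def link(self, entity_type: str, source_id: str, source_entity_id: str,
             canonical_id: str) -> None:
        """Attach another source's ID to an existing canonical entity."""
        self._canonical_by_alias[(entity_type, source_id, source_entity_id)] = canonical_id

    def aliases_of(self, canonical_id: str) -> list[tuple[str, str]]:
        """List the (source_id, source_entity_id) pairs behind a canonical ID."""
        return sorted(
            (source_id, entity_id)
            for (_, source_id, entity_id), cid in self._canonical_by_alias.items()
            if cid == canonical_id
        )
```

Resolving the same alias twice returns the same canonical ID, so a feed update that keeps its source IDs does not create new canonical entities.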

### Authority Ranking

Conflict resolution needs explicit source authority:

- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.

Authority can differ by entity type. A source can be authoritative for timetable but weak for route geometry or operator identity.
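
The ranking and its per-entity-type variation can be sketched as ordered tiers with overrides. The default order follows the list above. The tier keys, the `ORDER_BY_ENTITY_TYPE` table, and in particular the `shape` override that ranks operator feeds above national feeds are hypothetical examples, not decisions taken in this document.

```python
from __future__ import annotations

from typing import Any, Optional

# Default order, strongest first, following the list above.
DEFAULT_ORDER = (
    "manual_review",
    "national_official",
    "regional_authority",
    "operator",
    "aggregator",
    "osm",
)

# Per-entity-type overrides. The 'shape' order is purely illustrative.
ORDER_BY_ENTITY_TYPE = {
    # OSM is gap evidence only, so it never wins a timetable conflict.
    "calendar": tuple(t for t in DEFAULT_ORDER if t != "osm"),
    "shape": (
        "manual_review",
        "operator",
        "national_official",
        "regional_authority",
        "aggregator",
        "osm",
    ),
}


def pick_authoritative(
    entity_type: str, candidates: dict[str, Any]
) -> Optional[tuple[str, Any]]:
    """Return (tier, value) from the strongest tier present, or None."""
    order = ORDER_BY_ENTITY_TYPE.get(entity_type, DEFAULT_ORDER)
    for tier in order:
        if tier in candidates:
            return tier, candidates[tier]
    return None
```

Returning `None` when only excluded tiers are present leaves the conflict for manual review instead of silently accepting weak evidence.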

### Conflict Review

The QA dashboard should expose review queues for:

- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.

Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.
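
A reusable rule can be as simple as a stored decision keyed by the pair of source aliases it was made for. In this sketch, `RuleBook`, `triage`, and the `"same"`/`"different"` decision values are hypothetical, and the approach assumes source IDs stay stable across feed updates.

```python
from __future__ import annotations

from typing import Optional

Alias = tuple[str, str]  # (source_id, source_entity_id)


class RuleBook:
    """Stores manual decisions so a later import can replay them."""

    def __init__(self) -> None:
        self._decisions: dict[tuple[str, frozenset], str] = {}

    @staticmethod
    def _key(entity_type: str, a: Alias, b: Alias) -> tuple[str, frozenset]:
        # frozenset makes the rule independent of the order of the pair.
        return entity_type, frozenset((a, b))

    def record(self, entity_type: str, a: Alias, b: Alias, decision: str) -> None:
        self._decisions[self._key(entity_type, a, b)] = decision

    def lookup(self, entity_type: str, a: Alias, b: Alias) -> Optional[str]:
        return self._decisions.get(self._key(entity_type, a, b))


def triage(entity_type: str, pairs: list[tuple[Alias, Alias]], rules: RuleBook):
    """Split candidate pairs into auto-resolved ones and ones needing review."""
    resolved, pending = [], []
    for a, b in pairs:
        decision = rules.lookup(entity_type, a, b)
        if decision is None:
            pending.append((a, b))
        else:
            resolved.append(((a, b), decision))
    return resolved, pending
```

After an update, only the pairs returned in `pending` need to reach the review queue; pairs with a stored decision are resolved without asking a reviewer again.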

## Export Strategy

Do not start with one giant Europe GTFS zip as the only product. Produce:

- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints later.

Each export needs:

- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.
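
The per-export metadata listed above could travel with each artifact as a small JSON manifest. The sketch below is one possible shape under stated assumptions: `ExportManifest`, `build_manifest`, `to_json`, the field names, and the use of an open-conflict count as the review status are all hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExportManifest:
    snapshot_id: str
    source_feed_versions: dict[str, str]  # source id -> raw snapshot hash/version
    generated_at: str                     # ISO 8601, UTC
    validation_summary: dict[str, int]    # e.g. counts per severity
    attributions: list[str]               # license/attribution lines
    open_conflicts: int                   # conflicts still unresolved


def build_manifest(
    snapshot_id: str,
    source_feed_versions: dict[str, str],
    validation_summary: dict[str, int],
    attributions: list[str],
    open_conflicts: int,
    now: Optional[datetime] = None,
) -> ExportManifest:
    """Assemble the manifest; `now` is injectable to keep output reproducible."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return ExportManifest(
        snapshot_id=snapshot_id,
        source_feed_versions=dict(source_feed_versions),
        generated_at=stamp,
        validation_summary=dict(validation_summary),
        attributions=list(attributions),
        open_conflicts=open_conflicts,
    )


def to_json(manifest: ExportManifest) -> str:
    """Serialize with sorted keys so identical manifests diff cleanly."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

Writing the same manifest next to a country GTFS zip and a GeoParquet dump lets a consumer confirm that both were cut from the same canonical snapshot.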

## Current Implementation Step

The first implementation is a lightweight harmonization boundary:

- `/api/qa/summary`;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.
- GTFS source add/import/review controls live in the `GTFS Harmonization` sidebar module.
- OSM/route-layer source controls live in the `Mapping Data` sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.

This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
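
That next step can be sketched as a pure function over the summary counters. The payload shape is an assumption: this document does not specify what `/api/qa/summary` returns, so the metric names, the `warning`/`bad` keys, and the `queues_from_counters` helper are hypothetical.

```python
from __future__ import annotations


def queues_from_counters(counters: dict[str, dict[str, int]]) -> list[dict]:
    """Turn non-zero warning/bad counters into review-queue descriptors.

    `counters` maps a metric name to counts per severity, for example
    {"routes_missing_shapes": {"ok": 10, "warning": 0, "bad": 7}}.
    """
    queues = []
    for metric, by_severity in counters.items():
        for severity in ("bad", "warning"):
            count = by_severity.get(severity, 0)
            if count > 0:
                queues.append({"metric": metric, "severity": severity, "count": count})
    # 'bad' queues first, then larger queues, then by name for stable output.
    queues.sort(key=lambda q: (q["severity"] != "bad", -q["count"], q["metric"]))
    return queues
```

Each descriptor would then back one drill-down list in the QA dashboard, with metrics at zero producing no queue at all.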