# GTFS Harmonization and QA Concept
Last updated: 2026-07-01
## Decision
Run harmonization inside the existing Mobility Workbench for now:
- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with `/api/qa/*`.
- Separate canonical export concept, but no separate public API backend yet.
Split this into a separate service later when any of these becomes true:
- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.
The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.
The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.
## Target Pipeline
```text
source catalog
-> raw feed snapshots
-> validation reports
-> normalized staging tables
-> canonical matching and deduplication
-> conflict review and reusable rules
-> versioned canonical snapshot
-> GTFS/API/GeoParquet exports
```
The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.
## Core Concepts
### Source Registry
Track every identified source, including feeds not yet imported:
- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.
Mobility Database can be used as a broad discovery connector. Prefer the full `feeds_v2.csv` catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider license review and authority ranking.
PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, `network:guid`, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.
The generated discovery files live under `docs/generated/`:
- `gtfs_feed_candidates.csv` keeps every discovered feed/evidence row.
- `gtfs_ingestable_sources.csv` keeps rows that can be imported as GTFS sources after review.
- `gtfs_test_run_sources.csv` keeps a smaller multi-source set for deduplication tests.
Required license flags before publication:
- `can_import`
- `can_derive`
- `can_redistribute`
- `requires_attribution`
- `commercial_restrictions`
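As a minimal sketch of how these flags could gate publication, the snippet below models them as a tri-state record. The `LicenseFlags` class, the `publication_blockers` helper, the use of `None` for "not reviewed yet", and the policy that the first three flags must be true are all assumptions made for illustration, not part of the current workbench.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LicenseFlags:
    # None means "not reviewed yet"; this tri-state is an assumption of the sketch.
    can_import: Optional[bool] = None
    can_derive: Optional[bool] = None
    can_redistribute: Optional[bool] = None
    requires_attribution: Optional[bool] = None
    commercial_restrictions: Optional[bool] = None


# Assumed policy: these three must be explicitly True before publication.
MUST_BE_TRUE = ("can_import", "can_derive", "can_redistribute")


def publication_blockers(flags: LicenseFlags) -> list[str]:
    """Return human-readable blockers; an empty list means publishable."""
    blockers = []
    for f in fields(flags):
        value = getattr(flags, f.name)
        if value is None:
            blockers.append(f"{f.name} not reviewed")
        elif f.name in MUST_BE_TRUE and value is False:
            blockers.append(f"{f.name} is false")
    return blockers
```

Under this assumed policy, `requires_attribution` and `commercial_restrictions` only need to be reviewed; their values would feed the attribution manifest rather than block the export.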
### Raw Snapshots
Every update should preserve immutable raw input:
- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.
This keeps deduplication and conflict decisions reproducible.
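The fields above could be captured in an immutable record like the following sketch. The `RawSnapshot` class and `record_snapshot` helper are hypothetical names, SHA-256 is an assumed choice of hash, and skipping byte-identical refetches is an assumed behaviour rather than something the concept prescribes.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawSnapshot:
    source_id: str
    fetched_at: datetime
    source_hash: str                      # SHA-256 hex digest of the raw bytes
    upstream_metadata: dict               # e.g. HTTP headers (assumed content)
    importer_version: str
    validator_report: Optional[dict]
    previous_snapshot_hash: Optional[str]


def record_snapshot(source_id: str, payload: bytes,
                    previous: Optional[RawSnapshot], *,
                    importer_version: str,
                    upstream_metadata: Optional[dict] = None,
                    validator_report: Optional[dict] = None) -> Optional[RawSnapshot]:
    """Build a snapshot record, or return None if the bytes are unchanged."""
    digest = hashlib.sha256(payload).hexdigest()
    if previous is not None and previous.source_hash == digest:
        return None  # assumed behaviour: identical refetch creates no snapshot
    return RawSnapshot(
        source_id=source_id,
        fetched_at=datetime.now(timezone.utc),
        source_hash=digest,
        upstream_metadata=upstream_metadata or {},
        importer_version=importer_version,
        validator_report=validator_report,
        previous_snapshot_hash=previous.source_hash if previous else None,
    )
```

Chaining each record to the previous hash gives a simple audit trail per source, so a later deduplication decision can be traced back to the exact input bytes.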
### Canonical Entities
Stable meubility IDs should be the internal truth. Source IDs remain aliases.
Initial canonical entity families:
- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references (later).
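To illustrate "stable IDs as truth, source IDs as aliases", here is a small in-memory sketch. `AliasRegistry`, its methods, and the `entity_type:uuid` ID format are hypothetical; the real workbench would presumably persist this mapping in PostgreSQL.

```python
import uuid


class AliasRegistry:
    """Maps (source_id, entity_type, local_id) aliases to canonical IDs."""

    def __init__(self):
        self._by_alias = {}

    def link(self, source_id, entity_type, local_id, canonical_id=None):
        """Attach an alias to a canonical ID, minting a new ID if none is given."""
        key = (source_id, entity_type, local_id)
        if key in self._by_alias:
            return self._by_alias[key]  # existing links stay stable
        if canonical_id is None:
            canonical_id = f"{entity_type}:{uuid.uuid4()}"  # assumed ID format
        self._by_alias[key] = canonical_id
        return canonical_id

    def resolve(self, source_id, entity_type, local_id):
        """Return the canonical ID for an alias, or None if unlinked."""
        return self._by_alias.get((source_id, entity_type, local_id))

    def aliases_of(self, canonical_id):
        """List every source alias that points at one canonical entity."""
        return sorted(k for k, v in self._by_alias.items() if v == canonical_id)
```

For example, two feeds describing the same station would each call `link` with their own stop ID, the second passing the canonical ID returned by the first, so both aliases resolve to one entity.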
### Authority Ranking
Conflict resolution needs explicit source authority:
- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.
Authority can differ by entity type. A source can be authoritative for timetable but weak for route geometry or operator identity.
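A possible way to encode this is a default rank per source class plus per-entity-type overrides. The sketch below reads the list above as ordered from strongest to weakest, which is an interpretation; the class labels, the `pick_winner` function, and the override mechanism are all hypothetical.

```python
# Lower number = higher authority. Order follows the list above (assumed ranking).
DEFAULT_RANK = {
    "manual_review": 0,
    "national_official": 1,
    "regional_authority": 2,
    "operator": 3,
    "aggregator": 4,
    "osm": 5,
}


def pick_winner(candidates, entity_type, overrides=None):
    """Pick the most authoritative candidate for one entity type.

    candidates: list of (source_id, source_class, value) tuples.
    overrides:  {(source_id, entity_type): rank} for sources that are
                stronger or weaker than their class for this entity type.
    """
    overrides = overrides or {}
    # OSM is visual/gap evidence only, never a timetable authority.
    eligible = [c for c in candidates
                if not (c[1] == "osm" and entity_type == "timetable")]
    if not eligible:
        return None

    def rank(candidate):
        source_id, source_class, _ = candidate
        return overrides.get((source_id, entity_type), DEFAULT_RANK[source_class])

    return min(eligible, key=rank)
```

An override such as `{("feed-op", "shape"): 0}` would let an operator feed win on route geometry while the national feed still wins on stop names.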
### Conflict Review
The QA dashboard should expose review queues for:
- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.
Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.
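One way to make a manual resolution reusable is to store it as a rule keyed by the source aliases involved, then filter incoming conflicts against the stored rules on every update. The `MergeRule` shape, the conflict dictionary layout, and both helper functions below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MergeRule:
    """A persisted manual decision about a set of source aliases."""
    entity_type: str
    aliases: frozenset            # of (source_id, local_id) pairs
    decision: str                 # "merge" or "keep_separate" (assumed values)
    canonical_id: Optional[str] = None


def match_rule(conflict, rules):
    """Return the first rule that already covers this conflict, else None."""
    for rule in rules:
        same_type = rule.entity_type == conflict["entity_type"]
        # A rule covers a conflict if every conflicting alias was part of it.
        if same_type and conflict["aliases"] <= rule.aliases:
            return rule
    return None


def open_conflicts(conflicts, rules):
    """Keep only conflicts that no stored rule resolves."""
    return [c for c in conflicts if match_rule(c, rules) is None]
```

Because rules reference source-side IDs rather than snapshot rows, the same rule would still match after a feed is re-imported, as long as the publisher keeps its IDs stable.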
## Export Strategy
Do not start with one giant Europe GTFS zip as the only product. Produce:
- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints (later).
Each export needs:
- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.
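The per-export metadata could travel with each file as a small JSON manifest. The following is a sketch only: `build_manifest`, its parameters, and the key names are hypothetical and do not describe an existing export format.

```python
import json
from datetime import datetime, timezone


def build_manifest(snapshot_id, source_feed_versions, validation_summary,
                   attributions, open_conflict_count, generated_at=None):
    """Assemble the export metadata listed above into one dictionary."""
    generated_at = generated_at or datetime.now(timezone.utc)
    return {
        "snapshot_id": snapshot_id,
        # Sorted so that two manifests for the same inputs compare equal.
        "source_feed_versions": dict(sorted(source_feed_versions.items())),
        "generated_at": generated_at.isoformat(),
        "validation_summary": validation_summary,
        "license_attribution": attributions,
        "review_status": {
            "open_conflicts": open_conflict_count,
            "fully_reviewed": open_conflict_count == 0,
        },
    }


manifest = build_manifest(
    snapshot_id="snap-0001",                      # hypothetical values
    source_feed_versions={"feed-b": "2026-06-28", "feed-a": "2026-06-30"},
    validation_summary={"errors": 0, "warnings": 4},
    attributions=["Example Transit Authority, CC BY 4.0"],
    open_conflict_count=2,
    generated_at=datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc),
)
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` next to each GTFS zip or GeoParquet dump would let a consumer check provenance without access to the workbench database.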
## Current Implementation Step
The first implementation is a lightweight harmonization boundary:
- `/api/qa/summary`;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.
- GTFS source add/import/review controls live in the `GTFS Harmonization` sidebar module.
- OSM/route-layer source controls live in the `Mapping Data` sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.
This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
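As a rough sketch of that next step, non-zero counters could be mapped to prioritized review queues as shown below. The response shape of `/api/qa/summary` is not specified here, so the `{counter: {"warning": n, "bad": n}}` layout, the counter names, and `queues_from_summary` are assumptions.

```python
def queues_from_summary(summary):
    """Turn non-zero warning/bad counters into a prioritized queue list.

    summary: {counter_name: {"warning": int, "bad": int}} (assumed shape).
    """
    queues = []
    for name, counts in summary.items():
        bad = counts.get("bad", 0)
        warning = counts.get("warning", 0)
        if bad or warning:
            queues.append({
                "queue": name,
                "severity": "bad" if bad else "warning",
                "open_items": bad + warning,
            })
    # "bad" queues first, then larger queues, then by name for stable output.
    queues.sort(key=lambda q: (q["severity"] != "bad", -q["open_items"], q["queue"]))
    return queues
```

Each returned entry would correspond to one drill-down list in the QA dashboard; counters that are all zero produce no queue.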