Alpha stage commit
docs/gtfs_harmonization.md
# GTFS Harmonization and QA Concept

Last updated: 2026-07-01

## Decision

Run harmonization inside the existing Mobility Workbench for now:

- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with `/api/qa/*`.
- Separate canonical export concept, but no separate public API backend yet.

Split this into a separate service later when one of these becomes true:

- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.

The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.

The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.
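
As a rough illustration of the last two paragraphs, the sketch below resolves the single snapshot that routing would read from, instead of letting the caller pick a feed. The `Snapshot` type, its fields, and `active_snapshot` are hypothetical names chosen for this example; the workbench schema may look different.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record for one versioned canonical snapshot."""
    snapshot_id: str
    published_at: Optional[datetime]  # None while the snapshot is still a draft


def active_snapshot(snapshots: list[Snapshot]) -> Snapshot:
    """Return the most recently published snapshot; drafts are never served."""
    published = [s for s in snapshots if s.published_at is not None]
    if not published:
        raise LookupError("no published canonical snapshot available")
    return max(published, key=lambda s: s.published_at)


snapshots = [
    Snapshot("snap-001", datetime(2026, 6, 1, tzinfo=timezone.utc)),
    Snapshot("snap-002", datetime(2026, 6, 15, tzinfo=timezone.utc)),
    Snapshot("snap-003", None),  # draft, still under review
]
print(active_snapshot(snapshots).snapshot_id)  # snap-002
```

Routing code written this way has no feed parameter at all, which keeps feed-specific filters confined to the QA and diagnostics views.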

## Target Pipeline

```text
source catalog
  -> raw feed snapshots
  -> validation reports
  -> normalized staging tables
  -> canonical matching and deduplication
  -> conflict review and reusable rules
  -> versioned canonical snapshot
  -> GTFS/API/GeoParquet exports
```

The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.
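
The ordering in the diagram can be made explicit in code. This is a minimal sketch, assuming the stages run strictly in sequence; the `Stage` enum and `missing_before` helper are illustrative names, not existing workbench code.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Pipeline stages in the order shown in the diagram (names are illustrative)."""
    SOURCE_CATALOG = 1
    RAW_SNAPSHOT = 2
    VALIDATION = 3
    STAGING = 4
    MATCHING = 5
    CONFLICT_REVIEW = 6
    CANONICAL_SNAPSHOT = 7
    EXPORT = 8


def missing_before(target: Stage, completed: set[Stage]) -> list[Stage]:
    """List the earlier stages that must still finish before `target` may run."""
    return [s for s in Stage if s < target and s not in completed]


done = {Stage.SOURCE_CATALOG, Stage.RAW_SNAPSHOT, Stage.VALIDATION}
print([s.name for s in missing_before(Stage.EXPORT, done)])
# ['STAGING', 'MATCHING', 'CONFLICT_REVIEW', 'CANONICAL_SNAPSHOT']
```

Because `EXPORT` only depends on `CANONICAL_SNAPSHOT`, a GTFS zip, an API view, and a GeoParquet dump can all be generated from the same snapshot without rerunning matching.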

## Core Concepts

### Source Registry

Track every identified source, including feeds not yet imported:

- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.

Mobility Database can be used as a broad discovery connector. Prefer the full `feeds_v2.csv` catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider license review and authority ranking.

PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, `network:guid`, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.

The generated discovery files live under `docs/generated/`:

- `gtfs_feed_candidates.csv` keeps every discovered feed/evidence row.
- `gtfs_ingestable_sources.csv` keeps rows that can be imported as GTFS sources after review.
- `gtfs_test_run_sources.csv` keeps a smaller multi-source set for deduplication tests.

Required license flags before publication:

- `can_import`
- `can_derive`
- `can_redistribute`
- `requires_attribution`
- `commercial_restrictions`
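
One way to enforce these flags is a gate that lists blockers per source. The sketch below is an assumption about how the flags could be modeled: it treats all five as tri-state booleans (`None` meaning "not reviewed yet") and blocks publication on any unreviewed flag or on a `False` import/derive/redistribute permission. The `LicenseFlags` class and `publication_blockers` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LicenseFlags:
    """The five flags listed above; None means the flag has not been reviewed."""
    can_import: Optional[bool] = None
    can_derive: Optional[bool] = None
    can_redistribute: Optional[bool] = None
    requires_attribution: Optional[bool] = None
    commercial_restrictions: Optional[bool] = None


def publication_blockers(flags: LicenseFlags) -> list[str]:
    """Return the reasons a source may not be published (assumed policy)."""
    blockers = []
    # Every flag must have been reviewed, whatever its value.
    for name, value in vars(flags).items():
        if value is None:
            blockers.append(f"{name} not reviewed")
    # These three permissions must be granted explicitly.
    for name in ("can_import", "can_derive", "can_redistribute"):
        if getattr(flags, name) is False:
            blockers.append(f"{name} is false")
    return blockers


flags = LicenseFlags(can_import=True, can_derive=True,
                     can_redistribute=False, requires_attribution=True)
print(publication_blockers(flags))
# ['commercial_restrictions not reviewed', 'can_redistribute is false']
```

An empty list means the source is clear for publication under this policy; `requires_attribution` and `commercial_restrictions` do not block by themselves but would feed the export's attribution manifest.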

### Raw Snapshots

Every update should preserve immutable raw input:

- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.

This keeps deduplication and conflict decisions reproducible.
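
A minimal sketch of such a record, assuming SHA-256 as the source hash and a frozen dataclass for immutability. `RawSnapshot` and `record_snapshot` are hypothetical names, upstream metadata and the validator report are omitted for brevity, and skipping byte-identical fetches is a design assumption rather than a stated requirement.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: a stored snapshot record is never mutated
class RawSnapshot:
    source_id: str
    fetched_at: datetime
    source_hash: str
    parser_version: str
    previous_hash: Optional[str]  # hash of the previously active snapshot, if any


def record_snapshot(source_id: str, payload: bytes, parser_version: str,
                    previous: Optional[RawSnapshot] = None) -> Optional[RawSnapshot]:
    """Create a snapshot record, or return None if the upstream bytes are unchanged."""
    digest = hashlib.sha256(payload).hexdigest()
    if previous is not None and previous.source_hash == digest:
        return None  # identical to the active snapshot, nothing new to store
    return RawSnapshot(
        source_id=source_id,
        fetched_at=datetime.now(timezone.utc),
        source_hash=digest,
        parser_version=parser_version,
        previous_hash=previous.source_hash if previous else None,
    )


first = record_snapshot("src-a", b"feed-bytes-v1", "0.1.0")
unchanged = record_snapshot("src-a", b"feed-bytes-v1", "0.1.0", previous=first)
second = record_snapshot("src-a", b"feed-bytes-v2", "0.1.0", previous=first)
print(unchanged is None, second.previous_hash == first.source_hash)  # True True
```

Chaining each record to the previous hash gives a simple audit trail: any canonical decision can be traced back to the exact bytes it was made against.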

### Canonical Entities

Stable meubility IDs should be the internal truth. Source IDs remain aliases.

Initial canonical entity families:

- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references later.
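
The alias relationship can be sketched as a lookup from source identifiers to stable IDs. Everything here is illustrative: the `AliasIndex` class, the `entity_type:uuid` ID format, and the example feed and stop identifiers are assumptions, not the project's actual ID scheme.

```python
import uuid


class AliasIndex:
    """Maps (source_id, entity_type, source_entity_id) aliases to stable canonical IDs."""

    def __init__(self) -> None:
        self._by_alias: dict[tuple[str, str, str], str] = {}

    def resolve(self, source_id: str, entity_type: str, local_id: str) -> str:
        """Return the canonical ID for an alias, minting a new one on first sight."""
        key = (source_id, entity_type, local_id)
        if key not in self._by_alias:
            self._by_alias[key] = f"{entity_type}:{uuid.uuid4()}"
        return self._by_alias[key]

    def merge(self, keep: str, drop: str) -> None:
        """Repoint every alias of `drop` at `keep`, e.g. after a dedup decision."""
        for key, canonical_id in list(self._by_alias.items()):
            if canonical_id == drop:
                self._by_alias[key] = keep


index = AliasIndex()
a = index.resolve("feed-a", "stop", "1001")
b = index.resolve("feed-b", "stop", "X-77")
index.merge(keep=a, drop=b)  # reviewer decided both rows are the same stop place
print(index.resolve("feed-b", "stop", "X-77") == a)  # True
```

Because source IDs only ever appear as keys, a provider renumbering its stops changes aliases but never the canonical ID that exports and rules refer to.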

### Authority Ranking

Conflict resolution needs explicit source authority:

- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.

Authority can differ by entity type. A source can be authoritative for timetable but weak for route geometry or operator identity.
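
The per-entity-type idea can be sketched as a default ranking plus overrides. This reads the list above as ordered from highest to lowest authority, which is an assumption; the numeric ranks, the source-kind keys, and the `route_geometry` override are likewise invented for illustration.

```python
# Lower number = higher authority. Tiers mirror the list above (assumed order).
DEFAULT_RANK = {
    "manual_review": 0,
    "national_official": 1,
    "regional_authority": 2,
    "operator": 3,
    "aggregator": 4,
    "osm": 5,
}

# Hypothetical override: for route geometry, trust the operator over the national feed.
OVERRIDES = {
    "route_geometry": {"operator": 1, "national_official": 3},
}


def rank(source_kind: str, entity_type: str) -> int:
    """Look up the rank for a source kind, honoring entity-specific overrides."""
    return OVERRIDES.get(entity_type, {}).get(source_kind, DEFAULT_RANK[source_kind])


def pick_winner(candidates: list[tuple[str, str]], entity_type: str) -> tuple[str, str]:
    """candidates are (source_kind, value) pairs; return the best-ranked pair."""
    return min(candidates, key=lambda c: rank(c[0], entity_type))


candidates = [("national_official", "value-from-national"),
              ("operator", "value-from-operator")]
print(pick_winner(candidates, "timetable"))       # ('national_official', 'value-from-national')
print(pick_winner(candidates, "route_geometry"))  # ('operator', 'value-from-operator')
```

Keeping overrides in data rather than code means a reviewer can adjust authority for one entity type without touching the resolution logic.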

### Conflict Review

The QA dashboard should expose review queues for:

- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.

Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.
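
A sketch of how a stored decision could suppress a re-detected conflict after a source update. The `MergeRule` shape, the order-independent key, and the example aliases are assumptions made for this example; real rules would likely carry more context (reviewer, timestamp, scope).

```python
from dataclasses import dataclass

Alias = tuple[str, str]  # (source_id, source_entity_id)


@dataclass(frozen=True)
class MergeRule:
    """A persisted manual decision about whether two source aliases match."""
    entity_type: str
    alias_a: Alias
    alias_b: Alias
    decision: str  # "same" or "different"


def rule_key(entity_type: str, alias_a: Alias, alias_b: Alias):
    """Order-independent key so (a, b) and (b, a) hit the same rule."""
    return (entity_type, frozenset((alias_a, alias_b)))


def open_conflicts(detected: list[tuple[str, Alias, Alias]],
                   rules: list[MergeRule]) -> list[tuple[str, Alias, Alias]]:
    """Drop detected conflicts that a stored rule already answers."""
    decided = {rule_key(r.entity_type, r.alias_a, r.alias_b) for r in rules}
    return [c for c in detected if rule_key(*c) not in decided]


rules = [MergeRule("stop", ("feed-a", "1001"), ("feed-b", "X-77"), "same")]
detected = [
    ("stop", ("feed-b", "X-77"), ("feed-a", "1001")),  # already decided, reversed order
    ("stop", ("feed-a", "1002"), ("feed-c", "S-9")),   # new, needs review
]
print(open_conflicts(detected, rules))
# [('stop', ('feed-a', '1002'), ('feed-c', 'S-9'))]
```

Only the second pair reaches the review queue, so a routine feed refresh does not force reviewers to repeat earlier decisions.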

## Export Strategy

Do not start with one giant Europe GTFS zip as the only product. Produce:

- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints later.

Each export needs:

- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.
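
The per-export metadata could travel with each artifact as a small JSON manifest. This is a sketch under assumptions: the `build_manifest` function, the key names, and all example values are hypothetical and only mirror the six items listed above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_manifest(
    snapshot_id: str,
    source_feed_versions: dict[str, str],
    validation_summary: dict[str, int],
    attributions: list[str],
    open_conflicts: int,
    generated_at: Optional[datetime] = None,
) -> str:
    """Serialize the per-export metadata as JSON (key names are assumed)."""
    stamp = generated_at or datetime.now(timezone.utc)
    manifest = {
        "snapshot_id": snapshot_id,
        "source_feed_versions": source_feed_versions,
        "generated_at": stamp.isoformat(),
        "validation_summary": validation_summary,
        "license_attribution": attributions,
        "review_status": "clean" if open_conflicts == 0 else "open_conflicts",
        "open_conflicts": open_conflicts,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


manifest_json = build_manifest(
    snapshot_id="snap-002",
    source_feed_versions={"feed-a": "2026-06-10", "feed-b": "2026-06-12"},
    validation_summary={"errors": 0, "warnings": 3},
    attributions=["Example Transit Authority (hypothetical)"],
    open_conflicts=2,
    generated_at=datetime(2026, 7, 1, tzinfo=timezone.utc),
)
print(manifest_json)
```

Writing the same manifest next to a country GTFS zip and a GeoParquet dump lets consumers confirm that both were produced from the same snapshot.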

## Current Implementation Step

The first implementation is a lightweight harmonization boundary:

- `/api/qa/summary`;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.
- GTFS source add/import/review controls live in the `GTFS Harmonization` sidebar module.
- OSM/route-layer source controls live in the `Mapping Data` sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.

This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
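
The counter-to-queue step might start as simply as the sketch below. It assumes the summary can be flattened into a mapping from counter name to count; the actual `/api/qa/summary` response shape is not specified here, and the counter names and `queues_from_summary` function are made up for the example.

```python
def queues_from_summary(summary: dict[str, int]) -> list[dict]:
    """Turn non-zero warning/bad counters into review-queue stubs, largest first."""
    queues = [
        {"queue": name, "open_items": count}
        for name, count in summary.items()
        if count > 0
    ]
    return sorted(queues, key=lambda q: q["open_items"], reverse=True)


# Hypothetical counters, flattened from a QA summary response.
summary = {
    "routes_missing_shapes": 7,
    "stops_without_canonical_link": 42,
    "stale_calendars": 0,
}
for queue in queues_from_summary(summary):
    print(queue)
# {'queue': 'stops_without_canonical_link', 'open_items': 42}
# {'queue': 'routes_missing_shapes', 'open_items': 7}
```

Each stub would then need a drill-down query and a place to persist the resolution, which is where the reusable rules come in.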