meubility-workbench/docs/gtfs_harmonization.md

# GTFS Harmonization and QA Concept

Last updated: 2026-07-01

## Decision

Run harmonization inside the existing Mobility Workbench for now:

- Same FastAPI server.
- Same operator/data-engineering UI.
- Same PostgreSQL/PostGIS database.
- Separate GTFS Harmonization and Mapping Data UI modules backed by the existing source/job tables.
- Separate QA/harmonization API surface starting with `/api/qa/*`.
- Separate canonical export concept, but no separate public API backend yet.

Split this into a separate service later when any of these becomes true:

- third-party API consumers need independent uptime, auth, quotas, billing, or SLA boundaries;
- export jobs need independent workers, storage, or scaling;
- canonical data publication needs immutable release management independent of the editing workbench;
- commercial/public API concerns start slowing down internal QA and import workflows.

The public/API product should not expose raw workbench tables directly. It should consume versioned canonical snapshots.

The journey/routing interface should consume the active harmonized transit snapshot. It should not expose raw GTFS feed selection as a normal traveller-facing routing control. Feed-specific filters remain useful for QA, layer inspection, diagnostics, and source review.

## Target Pipeline

```text
source catalog
  -> raw feed snapshots
  -> validation reports
  -> normalized staging tables
  -> canonical matching and deduplication
  -> conflict review and reusable rules
  -> versioned canonical snapshot
  -> GTFS/API/GeoParquet exports
```

The pan-European output should be a canonical mobility dataset first, not one giant internal GTFS feed. GTFS should be one export format from that canonical snapshot.

## Core Concepts

### Source Registry

Track every identified source, including feeds not yet imported:

- source URL and publisher;
- country/region/mode coverage;
- source authority and priority;
- update cadence and freshness;
- importability;
- license and redistribution status.

Mobility Database can be used as a broad discovery connector. Prefer the full `feeds_v2.csv` catalog/API over validator acceptance-test feed lists because it includes feed status, official/source flags, latest/direct URLs, license URLs, features, and bounding boxes. Treat it as candidate metadata: the catalog metadata is reusable, but each transit feed still needs its own provider license review and authority ranking.

PTNA can be used as a GTFS/OSM QA and crosswalk connector. Its country pages expose feed IDs, provider names, release dates, validity windows, route-analysis links, detail pages, and original release-page links. Detail pages can add license text, OSM permission notes, `network:guid`, and route matching hints. PTNA should not become the canonical publisher for a feed; the harmonizer should follow the original provider URL where possible and keep PTNA as evidence.

The generated discovery files live under `docs/generated/`:

- `gtfs_feed_candidates.csv` keeps every discovered feed/evidence row.
- `gtfs_ingestable_sources.csv` keeps rows that can be imported as GTFS sources after review.
- `gtfs_test_run_sources.csv` keeps a smaller multi-source set for deduplication tests.

Required license flags before publication:

- `can_import`
- `can_derive`
- `can_redistribute`
- `requires_attribution`
- `commercial_restrictions`
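
A minimal sketch of how these flags could gate publication. The `LicenseFlags` class and `publication_blockers` function are hypothetical names, not existing workbench code, and the policy shown (every flag must be reviewed, and the three `can_*` flags must be true) is an assumption rather than a decided rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseFlags:
    # None means "not reviewed yet". Field names mirror the flags listed above.
    can_import: Optional[bool] = None
    can_derive: Optional[bool] = None
    can_redistribute: Optional[bool] = None
    requires_attribution: Optional[bool] = None
    commercial_restrictions: Optional[bool] = None


def publication_blockers(flags: LicenseFlags) -> list[str]:
    """Return the reasons a source should not be published yet (assumed policy)."""
    blockers = []
    # Every flag must have been reviewed, whatever its value.
    for name, value in vars(flags).items():
        if value is None:
            blockers.append(f"{name} not reviewed")
    # Assumption: these three permissions are hard requirements for publication.
    for name in ("can_import", "can_derive", "can_redistribute"):
        if getattr(flags, name) is False:
            blockers.append(f"{name} is false")
    return blockers
```

An empty list means the source passes this gate. `requires_attribution` and `commercial_restrictions` do not block here; they only need a reviewed value so the export manifest can carry them.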

### Raw Snapshots

Every update should preserve immutable raw input:

- source id;
- fetch time;
- source hash;
- upstream metadata;
- parser/import version;
- validator report;
- previous active snapshot.

This keeps deduplication and conflict decisions reproducible.
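
One way to model such a record is a frozen dataclass keyed by a content hash. This is a sketch under assumptions: `RawSnapshot`, `make_snapshot`, and `content_changed` are hypothetical names, SHA-256 is an assumed hash choice, and the previous active snapshot is referenced by its hash purely for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawSnapshot:
    source_id: str
    fetched_at: datetime
    source_hash: str                  # SHA-256 of the raw feed bytes (assumed)
    upstream_metadata: dict           # e.g. HTTP headers or catalog row
    parser_version: str
    validator_report: Optional[dict]
    previous_hash: Optional[str]      # hash of the previous active snapshot


def make_snapshot(source_id, raw_bytes, upstream_metadata, parser_version,
                  validator_report=None, previous=None):
    """Build an immutable snapshot record from freshly fetched bytes."""
    return RawSnapshot(
        source_id=source_id,
        fetched_at=datetime.now(timezone.utc),
        source_hash=hashlib.sha256(raw_bytes).hexdigest(),
        upstream_metadata=dict(upstream_metadata),
        parser_version=parser_version,
        validator_report=validator_report,
        previous_hash=previous.source_hash if previous else None,
    )


def content_changed(new: RawSnapshot, old: RawSnapshot) -> bool:
    """True when the upstream bytes differ, regardless of metadata changes."""
    return new.source_hash != old.source_hash
```

Comparing hashes rather than metadata lets an unchanged re-fetch skip re-import while still recording that the fetch happened.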

### Canonical Entities

Stable meubility IDs should be the internal truth. Source IDs remain aliases.

Initial canonical entity families:

- operators/agencies/authorities/networks;
- stop places and station complexes;
- platforms, tracks, bus bays, entrances;
- routes/lines;
- route patterns and trip patterns;
- calendars/service validity;
- shapes/geometries;
- fares/ticketing references (later).
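
The "stable IDs as truth, source IDs as aliases" idea can be sketched as a small in-memory registry. `AliasRegistry` and its methods are hypothetical, and the `entity_type:uuid` ID format is an assumption; the real store would presumably be a database table.

```python
import uuid


class AliasRegistry:
    """Maps (source_id, entity_type, source_entity_id) aliases to canonical IDs."""

    def __init__(self):
        self._alias_to_canonical = {}

    def mint(self, entity_type: str) -> str:
        # Assumed ID format; any stable, source-independent ID would do.
        return f"{entity_type}:{uuid.uuid4()}"

    def link(self, canonical_id, source_id, entity_type, source_entity_id):
        key = (source_id, entity_type, source_entity_id)
        existing = self._alias_to_canonical.get(key)
        if existing is not None and existing != canonical_id:
            # An alias must never silently move between canonical entities.
            raise ValueError(f"{key} already linked to {existing}")
        self._alias_to_canonical[key] = canonical_id

    def resolve(self, source_id, entity_type, source_entity_id):
        """Return the canonical ID for a source ID, or None if unlinked."""
        return self._alias_to_canonical.get(
            (source_id, entity_type, source_entity_id)
        )

    def aliases_of(self, canonical_id):
        return sorted(
            key for key, cid in self._alias_to_canonical.items()
            if cid == canonical_id
        )
```

For example, linking stop `S1` from one feed and stop `900` from another to the same minted ID makes both resolve to one canonical stop place, while an unlinked source ID resolves to `None` and can feed the "GTFS stops without canonical links" queue.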

### Authority Ranking

Conflict resolution needs explicit source authority:

- manual review decision;
- national official feed or registry;
- regional authority feed;
- operator feed;
- broad aggregator feed;
- OSM as visual/gap evidence, not timetable authority.

Authority can differ by entity type. A source can be authoritative for timetables but weak for route geometry or operator identity.
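
A sketch of ranking with per-entity-type exceptions, assuming the list above is ordered from strongest to weakest. All names here (`BASE_RANK`, `NOT_AUTHORITATIVE`, `OVERRIDES`, `Candidate`, `pick`) and the example source `operator-feed-a` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Lower number wins. Order follows the authority list above (assumed ranking).
BASE_RANK = {
    "manual_review": 0,
    "national_official": 1,
    "regional_authority": 2,
    "operator": 3,
    "aggregator": 4,
    "osm": 5,
}

# OSM is visual/gap evidence only, so it never decides timetable values.
NOT_AUTHORITATIVE = {"timetable": {"osm"}}

# Hypothetical per-source, per-entity-type override: this feed is trusted for
# timetables (base rank) but considered weak for shapes.
OVERRIDES = {("operator-feed-a", "shape"): 9}


@dataclass(frozen=True)
class Candidate:
    source_id: str
    source_kind: str
    value: Any


def rank(candidate: Candidate, entity_type: str) -> int:
    return OVERRIDES.get(
        (candidate.source_id, entity_type), BASE_RANK[candidate.source_kind]
    )


def pick(entity_type: str, candidates: list) -> Optional[Candidate]:
    """Return the most authoritative candidate, or None if none is eligible."""
    excluded = NOT_AUTHORITATIVE.get(entity_type, set())
    eligible = [c for c in candidates if c.source_kind not in excluded]
    if not eligible:
        return None
    return min(eligible, key=lambda c: rank(c, entity_type))
```

With these tables, the operator feed wins a timetable conflict against an aggregator but loses a shape conflict to it, which is the per-entity-type behavior described above.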

### Conflict Review

The QA dashboard should expose review queues for:

- duplicate operators/agencies;
- duplicate stop places/station complexes;
- GTFS stops without canonical links;
- OSM stops without GTFS/canonical links;
- canonical stop groups with large spatial disagreement;
- routes with missing, weak, or conflicting OSM links;
- routes with missing shapes or route-pattern geometry;
- stale calendars and short service horizons;
- license/redistribution blockers.

Manual resolutions must become reusable rules so source updates do not reintroduce the same conflict.
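
A sketch of what "reusable" could mean in practice: the rule is keyed by source-scoped identifiers rather than by a snapshot, so it still matches after the next import. `Rule`, `apply_rules`, and the conflict and decision labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conflict_type: str    # e.g. "duplicate_stop_place"
    subjects: frozenset   # source-scoped keys, e.g. {("feed-a", "S1"), ("feed-b", "900")}
    decision: str         # e.g. "merge" or "keep_separate"


def apply_rules(conflicts, rules):
    """Split detected conflicts into auto-resolved and still-pending.

    `conflicts` is an iterable of (conflict_type, subjects) pairs, where
    subjects is any iterable of (source_id, source_entity_id) tuples.
    """
    index = {(r.conflict_type, r.subjects): r.decision for r in rules}
    resolved, pending = [], []
    for conflict_type, subjects in conflicts:
        key = (conflict_type, frozenset(subjects))
        if key in index:
            resolved.append((key[0], key[1], index[key]))
        else:
            pending.append(key)
    return resolved, pending
```

Only the `pending` list needs to reach the review queue; anything in `resolved` was already decided by a reviewer on an earlier snapshot.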

## Export Strategy

Do not start with one giant Europe GTFS zip as the only product. Produce:

- versioned canonical snapshot tables;
- country/region GTFS exports;
- network/operator GTFS exports;
- full-Europe analytical dumps such as GeoParquet;
- API-ready entity endpoints (later).

Each export needs:

- snapshot id;
- source feed versions;
- generation time;
- validation summary;
- license/attribution manifest;
- conflict/review status.
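
The per-export metadata above could travel with each export as a small JSON manifest. The following is a sketch only: `ExportManifest`, its field names, and the JSON layout are assumptions, not a defined format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ExportManifest:
    snapshot_id: str
    source_feed_versions: dict   # source id -> raw snapshot hash or version label
    generated_at: str            # ISO 8601 timestamp, UTC
    validation_summary: dict     # e.g. {"errors": 0, "warnings": 12}
    license_manifest: list       # one attribution/license entry per source
    review_status: dict          # e.g. {"open_conflicts": 3}


def new_manifest(snapshot_id, source_feed_versions, validation_summary,
                 license_manifest, review_status) -> ExportManifest:
    return ExportManifest(
        snapshot_id=snapshot_id,
        source_feed_versions=source_feed_versions,
        generated_at=datetime.now(timezone.utc).isoformat(),
        validation_summary=validation_summary,
        license_manifest=license_manifest,
        review_status=review_status,
    )


def to_json(manifest: ExportManifest) -> str:
    # sort_keys keeps the output stable, which makes manifests easy to diff.
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

Writing the same manifest next to a GTFS zip, a GeoParquet dump, or an API release would let consumers trace any export back to its snapshot and source feed versions.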

## Current Implementation Step

The first implementation is a lightweight harmonization boundary:

- `/api/qa/summary`;
- source discovery metrics;
- import health metrics;
- GTFS validation counters;
- canonical stop/link coverage;
- route matching and geometry counters;
- publication-readiness warnings.

Related UI changes:

- GTFS source add/import/review controls live in the `GTFS Harmonization` sidebar module.
- OSM/route-layer source controls live in the `Mapping Data` sidebar module.
- The journey panel displays the active harmonized transit snapshot instead of a GTFS source picker.

This is intentionally a skeleton. The next step is to turn non-zero warning/bad counters into review queues with drill-down lists and persistent resolution actions.
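
As a rough illustration of that next step, the function below turns a counter summary into an ordered list of review queues. The summary shape (`{"metric": {"ok": n, "warning": n, "bad": n}}`) is an assumption and may not match what `/api/qa/summary` returns; `review_queues` and the metric names in the usage note are hypothetical.

```python
def review_queues(summary: dict) -> list:
    """Derive review queues from counters, worst first.

    A metric becomes a queue only when it has a non-zero warning or bad count.
    """
    queues = []
    for metric, counts in summary.items():
        warning = counts.get("warning", 0)
        bad = counts.get("bad", 0)
        if warning == 0 and bad == 0:
            continue
        queues.append({"queue": metric, "bad": bad, "warning": warning})
    # Most severe queues first; ties broken by name for a stable order.
    queues.sort(key=lambda q: (-q["bad"], -q["warning"], q["queue"]))
    return queues
```

Given counters for, say, `stops_without_canonical_link` and `stale_calendars`, only the metrics with non-zero warning or bad counts appear, and the drill-down list for each queue would be loaded separately.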