meubility-workbench/docs/source_acquisition.md
2026-07-01 23:29:51 +02:00

Source acquisition and operator inventory

This repository now contains two seed catalogues and three generated GTFS manifests:

  • docs/source_catalog_seed.csv — broad discovery catalogue for official NAPs, feed registries, route-geometry evidence, realtime/disruption sources, rail/air registries and country notes.
  • docs/ingestable_sources_seed.csv — direct static feeds that the current prototype can import immediately.
  • docs/generated/gtfs_feed_candidates.csv — generated GTFS discovery manifest from Mobility Database, PTNA, the validator acceptance list, and curated local seeds.
  • docs/generated/gtfs_ingestable_sources.csv — generated direct GTFS source rows suitable for source-registry import after license/source review.
  • docs/generated/gtfs_test_run_sources.csv — generated focused feed set for the first multi-source harmonization/deduplication run.

Regenerate the GTFS discovery manifests:

python -m app.cli discover-gtfs-sources --max-ptna-details 0 --test-limit 24

Pass --countries ALL for the broad global Mobility Database/acceptance-list pass. Pass a positive --max-ptna-details when you want PTNA license and OSM crosswalk fields; the country-table scrape is fast, but the detail pages can be slow.

Import the discovery catalogue, the direct feed seed list, and the generated test-run sources into the source registry, then check the result with stats:

python -m app.cli import-source-catalog --csv docs/source_catalog_seed.csv
python -m app.cli import-ingestable-sources --csv docs/ingestable_sources_seed.csv
python -m app.cli import-ingestable-sources --csv docs/generated/gtfs_test_run_sources.csv
python -m app.cli stats

Queue the focused multi-source harmonization test run:

python -m app.cli queue-source-imports-from-csv --csv docs/generated/gtfs_test_run_sources.csv

This command queues every listed source import with per-source matching disabled, then queues a single route-matching job and a single route-layer rebuild to run after the imports. Matches and layers are therefore rebuilt once per batch rather than after every individual feed.
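
The batching idea can be sketched as follows. This is an illustration only, not the workbench's actual queue code: the Job class, the job kind names, and plan_batch are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    # Hypothetical job kinds: "import", "match", "rebuild_layers".
    kind: str
    source_id: Optional[int] = None
    run_matching: bool = False


def plan_batch(source_ids: list[int]) -> list[Job]:
    """Queue one import per source with matching off, then one match and one rebuild."""
    jobs = [Job("import", source_id=sid, run_matching=False) for sid in source_ids]
    if jobs:
        # The expensive steps run once for the whole batch, after all imports.
        jobs.append(Job("match"))
        jobs.append(Job("rebuild_layers"))
    return jobs


if __name__ == "__main__":
    for job in plan_batch([1, 2, 3]):
        print(job)
```

For N feeds this plan contains N + 2 jobs, whereas matching and rebuilding after each import would require 3N.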

Then run feeds one by one from the UI or CLI:

python -m app.cli run-source 1
python -m app.cli run-match

Operator list strategy

There is no single complete European transport-operator list. Generate the operator table by unioning and reconciling:

  1. GTFS agency.txt records from every imported static feed.
  2. NeTEx Operator, Authority, Network, and Line records once NeTEx ingestion is added.
  3. National Access Point dataset publishers and data-provider metadata.
  4. National stop registries and access-node systems, such as NaPTAN, NSR, and Swiss SLOID/DiDok/service-point datasets.
  5. Rail undertaking registries such as ERADIS Single Safety Certificates.
  6. Aviation registries such as EASA AOC/TCO lists, plus airport registries such as OurAirports.
  7. OSM operator and network tags as a gap-finding and alias-discovery layer, not as authority.
  8. Manual commercial/onboarding records for booking/API coverage.
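
As a minimal sketch of the first input in the list above, the function below reads agency.txt from a GTFS zip and tags every row with the feed it came from. The function name read_agencies, the output keys, and the "gtfs.agency" label are assumptions made for illustration; only the agency.txt column names come from the GTFS specification.

```python
import csv
import io
import zipfile


def read_agencies(gtfs_zip, source_url):
    """Return one dict per agency.txt row, tagged with its source feed.

    gtfs_zip may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(gtfs_zip) as zf:
        with zf.open("agency.txt") as raw:
            # utf-8-sig drops the byte-order mark some feeds put before the header.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            rows = []
            for row in csv.DictReader(text):
                rows.append({
                    # agency_id may be absent when a feed has a single agency.
                    "agency_id": (row.get("agency_id") or "").strip(),
                    "name": (row.get("agency_name") or "").strip(),
                    "url": (row.get("agency_url") or "").strip(),
                    "source_table": "gtfs.agency",  # hypothetical label
                    "source_url": source_url,
                })
            return rows
```

Rows from the other inputs can be mapped to the same shape before reconciliation, so that every candidate operator carries its origin.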

Persist every operator row with provenance: source table, source URL, first_seen, last_seen, confidence, and whether it is an authority, data publisher, brand, legal operator, infrastructure manager, or booking partner.
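
One possible shape for such a row is sketched below. The class and field names mirror the list above but are otherwise hypothetical, and the 0.0 to 1.0 confidence scale is an assumption; the workbench's real schema may differ.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class OperatorRole(Enum):
    AUTHORITY = "authority"
    DATA_PUBLISHER = "data_publisher"
    BRAND = "brand"
    LEGAL_OPERATOR = "legal_operator"
    INFRASTRUCTURE_MANAGER = "infrastructure_manager"
    BOOKING_PARTNER = "booking_partner"


@dataclass
class OperatorRecord:
    name: str
    role: OperatorRole
    source_table: str
    source_url: str
    first_seen: date
    last_seen: date
    confidence: float  # assumed scale: 0.0 (weak hint) to 1.0 (authoritative)

    def observe(self, seen_on: date) -> None:
        """Widen the first_seen/last_seen window when the row is seen again."""
        self.first_seen = min(self.first_seen, seen_on)
        self.last_seen = max(self.last_seen, seen_on)
```

Keeping the role on the row, rather than one flag per operator, lets the same organisation appear once as a legal operator and again as a data publisher, each with its own provenance.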

Geometry policy

For this workbench, the extracted OSM route layer is the authoritative visual layer for routes, networks and stop display. GTFS, NeTEx, official GIS data, infrastructure registries and historical vehicle traces are matching and QA inputs. They can propose corrections, flag missing or stale OSM route-layer geometry, and explain timetable deviations, but they do not override the canonical visual route layer automatically.

Use non-OSM geometry sources as evidence in this order:

  1. NeTEx journey-pattern/link-sequence geometry and GTFS shapes.txt.
  2. Official stop/station registries such as NaPTAN, NSR, and SLOID/DiDok.
  3. Official infrastructure registries such as ERA RINF and RailNetEurope DII.
  4. Official operator GIS route datasets where available.
  5. Historical realtime vehicle traces after QA.
  6. OSM route-layer gaps and conflicts for manual review.
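
The ordering above can be expressed as a simple rank table. The sketch below is illustrative: the tier labels, the Evidence class, and order_evidence are hypothetical names, and the function only orders evidence for review without touching the canonical OSM route layer.

```python
from dataclasses import dataclass

# Lower rank is consulted first. Keys are hypothetical labels for the six tiers.
EVIDENCE_RANK = {
    "netex_or_gtfs_shapes": 1,
    "stop_registry": 2,
    "infrastructure_registry": 3,
    "operator_gis": 4,
    "vehicle_traces": 5,
    "osm_gap_review": 6,
}


@dataclass(frozen=True)
class Evidence:
    route_ref: str
    source_kind: str
    note: str


def order_evidence(items):
    """Sort evidence by tier; sorted() is stable, so ties keep their input order."""
    return sorted(items, key=lambda e: EVIDENCE_RANK[e.source_kind])


if __name__ == "__main__":
    queue = order_evidence([
        Evidence("bus-12", "vehicle_traces", "traces leave the mapped way"),
        Evidence("bus-12", "netex_or_gtfs_shapes", "shape uses the new bridge"),
    ])
    for item in queue:
        print(item.source_kind, "-", item.note)
```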

OSM PBF extraction

For large countries or Europe-wide experiments, pre-filter raw PBF files before importing them:

scripts/osmium_transport_filter.sh europe-latest.osm.pbf europe-transport.osm.pbf

This requires the external osmium CLI. The result is still an OSM transport extract and remains the input to the canonical visual route-layer extraction; it is not a separate geometry hierarchy.
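
The contents of scripts/osmium_transport_filter.sh are not shown here. As a rough sketch of what such a pre-filter could look like, the wrapper below builds an osmium tags-filter call from Python. The tag expressions are assumptions and may not match what the project's script selects.

```python
import shutil
import subprocess

# Assumed filter expressions; the project's script may use a different set.
TRANSPORT_FILTERS = [
    "r/route=bus,trolleybus,tram,subway,light_rail,train,ferry",
    "r/route_master",
    "nwr/public_transport",
]


def build_filter_command(src, dst, filters=TRANSPORT_FILTERS):
    """Return the argv list for an osmium tags-filter run."""
    return ["osmium", "tags-filter", "-o", dst, "--overwrite", src, *filters]


def run_filter(src, dst):
    """Run the filter, failing early if the osmium CLI is not installed."""
    if shutil.which("osmium") is None:
        raise RuntimeError("osmium CLI not found on PATH")
    subprocess.run(build_filter_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_filter_command("europe-latest.osm.pbf",
                                        "europe-transport.osm.pbf")))
```

By default, osmium tags-filter also keeps the ways and nodes referenced by matching relations, so the route geometry survives the filter.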

Temporary closures and disruption data

Structured temporary closure/disruption data usually comes from:

  • GTFS-Realtime Service Alerts, TripUpdates and VehiclePositions.
  • SIRI-SX, SIRI-ET, SIRI-VM, and related national profiles.
  • DATEX II roadworks, closures, incidents, restrictions and weather for bus detours and access legs.
  • Rail-specific feeds such as National Rail Darwin or operator construction-work feeds.
  • Ferry and air operator/airport APIs where available, often commercial or auth-gated.

Model these as separate validity-windowed event tables rather than modifying the base static timetable.
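
A minimal sketch of a validity-windowed event table, using SQLite for illustration. The table name, column names, and helper functions are hypothetical. Timestamps are assumed to be stored as UTC ISO-8601 strings in one fixed format, so string comparison matches chronological order.

```python
import sqlite3


def create_schema(conn):
    """Create a standalone event table; the static timetable is not modified."""
    conn.execute("""
        CREATE TABLE disruption_event (
            id          INTEGER PRIMARY KEY,
            source      TEXT NOT NULL,   -- e.g. 'gtfs-rt-alerts', 'siri-sx', 'datex2'
            route_ref   TEXT,            -- affected route, NULL if network-wide
            effect      TEXT NOT NULL,   -- e.g. 'closure', 'detour', 'delay'
            valid_from  TEXT NOT NULL,   -- UTC, 'YYYY-MM-DDTHH:MM:SSZ'
            valid_to    TEXT             -- NULL means open-ended
        )
    """)


def active_events(conn, at_utc):
    """Return ids of events whose window [valid_from, valid_to) contains at_utc."""
    rows = conn.execute(
        "SELECT id FROM disruption_event "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (at_utc, at_utc),
    )
    return [row[0] for row in rows]
```

A journey planner or QA view can then join the events that are active at a given time onto the unchanged base timetable, and expired events stay available for explaining historical deviations.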