Source acquisition and operator inventory
This repository now contains two seed catalogues and three generated GTFS manifests:
- docs/source_catalog_seed.csv: broad discovery catalogue for official NAPs, feed registries, route-geometry evidence, realtime/disruption sources, rail/air registries and country notes.
- docs/ingestable_sources_seed.csv: direct static feeds that the current prototype can import immediately.
- docs/generated/gtfs_feed_candidates.csv: generated GTFS discovery manifest from Mobility Database, PTNA, the validator acceptance list, and curated local seeds.
- docs/generated/gtfs_ingestable_sources.csv: generated direct GTFS source rows suitable for source-registry import after license/source review.
- docs/generated/gtfs_test_run_sources.csv: generated focused feed set for the first multi-source harmonization/deduplication run.
Regenerate the GTFS discovery manifests:
python -m app.cli discover-gtfs-sources --max-ptna-details 0 --test-limit 24
Use --countries ALL for the broad global Mobility Database/acceptance-list pass. Use a positive --max-ptna-details when you want PTNA license and OSM crosswalk fields; the country-table scrape is fast, while detail pages can be slow.
Import the discovery catalogue, the direct feed seed list, and the generated test-run sources into the source registry:
python -m app.cli import-source-catalog --csv docs/source_catalog_seed.csv
python -m app.cli import-ingestable-sources --csv docs/ingestable_sources_seed.csv
python -m app.cli import-ingestable-sources --csv docs/generated/gtfs_test_run_sources.csv
python -m app.cli stats
Queue the focused multi-source harmonization test run:
python -m app.cli queue-source-imports-from-csv --csv docs/generated/gtfs_test_run_sources.csv
That queues every listed source import with per-source matching disabled, then queues one route-matching job and one route-layer rebuild after the imports. This avoids rebuilding matches/layers after every individual feed.
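The ordering described above can be sketched as follows. This is an illustration only, not the CLI's actual implementation: the Job type, the job kind names, and the name column in the CSV are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    kind: str                   # "import", "match" or "rebuild_layer" (hypothetical names)
    source: str = ""            # source identifier, only set for import jobs
    run_matching: bool = False  # per-source matching stays off for batch imports


def plan_jobs(csv_text: str, name_column: str = "name") -> list[Job]:
    """Return one import job per CSV row, then a single match and a single rebuild."""
    rows = csv.DictReader(io.StringIO(csv_text))
    jobs = [Job("import", row[name_column], run_matching=False) for row in rows]
    if jobs:
        # Matching and the layer rebuild run once, after all imports.
        jobs.append(Job("match"))
        jobs.append(Job("rebuild_layer"))
    return jobs
```

With N feeds this yields N + 2 jobs, rather than the 3N that result from matching and rebuilding after every individual import.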
Then run feeds one by one from the UI or CLI, followed by route matching:
python -m app.cli run-source 1
python -m app.cli run-match
Operator list strategy
There is no single complete European transport-operator list. Generate the operator table by unioning and reconciling:
- GTFS agency.txt records from every imported static feed.
- NeTEx Operator, Authority, Network, and Line records once NeTEx ingestion is added.
- National Access Point dataset publishers and data-provider metadata.
- National stop registries and access-node systems, such as NaPTAN, NSR, and Swiss SLOID/DiDok/service-point datasets.
- Rail undertaking registries such as ERADIS Single Safety Certificates.
- Aviation registries such as EASA AOC/TCO lists, plus airport registries such as OurAirports.
- OSM operator and network tags as a gap-finding and alias-discovery layer, not as authority.
- Manual commercial/onboarding records for booking/API coverage.
Persist every operator row with provenance: source table, source URL, first_seen, last_seen, confidence, and whether it is an authority, data publisher, brand, legal operator, infrastructure manager, or booking partner.
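A minimal sketch of such a provenance-carrying row, assuming an in-memory table keyed by normalized name and source table. The field names follow the list above; the role identifiers, the key choice, and the upsert helper are hypothetical, and real reconciliation across aliases and sources would need more than this.

```python
from dataclasses import dataclass
from datetime import date

# Role values taken from the provenance list above; the identifiers are assumptions.
ROLES = {
    "authority", "data_publisher", "brand",
    "legal_operator", "infrastructure_manager", "booking_partner",
}


@dataclass
class OperatorRecord:
    name: str
    role: str
    source_table: str
    source_url: str
    first_seen: date
    last_seen: date
    confidence: float

    def __post_init__(self) -> None:
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial spelling variants share a key."""
    return " ".join(name.casefold().split())


def upsert(table: dict, rec: OperatorRecord) -> None:
    """Insert a record, or widen the seen-window of the existing row from the same source."""
    key = (normalize(rec.name), rec.source_table)
    existing = table.get(key)
    if existing is None:
        table[key] = rec
        return
    existing.first_seen = min(existing.first_seen, rec.first_seen)
    existing.last_seen = max(existing.last_seen, rec.last_seen)
    existing.confidence = max(existing.confidence, rec.confidence)
```

Keeping one row per source, rather than merging across sources at insert time, preserves the provenance needed to reconcile conflicting claims later.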
Geometry policy
For this workbench, the extracted OSM route layer is the authoritative visual layer for routes, networks and stop display. GTFS, NeTEx, official GIS data, infrastructure registries and historical vehicle traces are matching and QA inputs. They can propose corrections, flag missing or stale OSM route-layer geometry, and explain timetable deviations, but they do not override the canonical visual route layer automatically.
Use non-OSM geometry sources as evidence in this order:
- NeTEx journey-pattern/link-sequence geometry and GTFS shapes.txt.
- Official stop/station registries such as NaPTAN, NSR, and SLOID/DiDok.
- Official infrastructure registries such as ERA RINF and RailNetEurope DII.
- Official operator GIS route datasets where available.
- Historical realtime vehicle traces after QA.
- OSM route-layer gaps and conflicts for manual review.
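One way to encode this ordering is an integer-valued enum that evidence candidates are sorted by. This is a sketch under assumptions: the Evidence names and the (kind, description) tuple shape are hypothetical, and the ranking only orders proposals for review. It does not overwrite the canonical OSM route layer.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Lower value means the evidence is consulted first."""
    SCHEDULE_GEOMETRY = 1        # NeTEx journey-pattern/link-sequence geometry, GTFS shapes.txt
    STOP_REGISTRY = 2            # NaPTAN, NSR, SLOID/DiDok
    INFRASTRUCTURE_REGISTRY = 3  # ERA RINF, RailNetEurope DII
    OPERATOR_GIS = 4             # official operator GIS route datasets
    VEHICLE_TRACES = 5           # historical realtime traces after QA
    OSM_GAP_REVIEW = 6           # OSM route-layer gaps and conflicts for manual review


def rank_evidence(candidates: list[tuple[Evidence, str]]) -> list[tuple[Evidence, str]]:
    """Sort candidates by evidence priority; ties keep their input order."""
    return sorted(candidates, key=lambda candidate: candidate[0])
```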
OSM PBF extraction
For large countries or Europe-wide experiments, pre-filter raw PBF files before importing them:
scripts/osmium_transport_filter.sh europe-latest.osm.pbf europe-transport.osm.pbf
This requires the external osmium CLI. The result is still an OSM transport extract and remains the input to the canonical visual route-layer extraction; it is not a separate geometry hierarchy.
Temporary closures and disruption data
Structured temporary closure/disruption data usually comes from:
- GTFS-Realtime Service Alerts, TripUpdates and VehiclePositions.
- SIRI-SX, SIRI-ET, SIRI-VM, and related national profiles.
- DATEX II roadworks, closures, incidents, restrictions and weather for bus detours and access legs.
- Rail-specific feeds such as National Rail Darwin or operator construction-work feeds.
- Ferry and air operator/airport APIs where available, often commercial or auth-gated.
Model these as separate validity-windowed event tables rather than modifying the base static timetable.
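A minimal sketch of a validity-windowed event, assuming half-open windows and timezone-aware timestamps. The DisruptionEvent fields and the active_events helper are hypothetical names, not part of the current prototype.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DisruptionEvent:
    source: str                    # e.g. "gtfs-rt-alerts", "siri-sx", "datex2"
    affected_route: str            # identifier of the affected route (hypothetical key)
    valid_from: datetime           # inclusive start of the window
    valid_to: Optional[datetime]   # exclusive end; None means open-ended
    summary: str


def active_events(events: list[DisruptionEvent], at: datetime) -> list[DisruptionEvent]:
    """Return events whose window [valid_from, valid_to) contains the given instant."""
    return [
        event for event in events
        if event.valid_from <= at and (event.valid_to is None or at < event.valid_to)
    ]
```

The base timetable is never edited. A query for a given instant overlays whichever events are active at that moment.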