# Mobility Workbench
Working prototype for a mobility-data management interface and pipeline.
It is intentionally small but executable. The current implementation lets you:
- register data sources;
- download/copy source files into a local cache;
- import GTFS static timetable feeds;
- import raw OSM PBF extracts by deriving transport GeoJSON;
- import OSM-derived transport GeoJSON;
- persist raw datasets and normalized route/stop records;
- run automatic GTFS-route ↔ OSM-route matching;
- persist manual accept/reject rules from the UI;
- expose GeoJSON layers for a zoomable map;
- use a management web UI with separate GTFS Harmonization and Mapping Data modules, plus source runs, stats, matches, and map inspection.
The default database is SQLite so the prototype runs immediately. The schema is kept simple enough to migrate to PostGIS when the pipeline needs European scale, vector tiles, and spatial indexes.
## Quick start
```bash
cd mobility-workbench
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m app.cli load-sample
uvicorn app.main:app --reload
```
Open:
```text
http://127.0.0.1:8000
```
The sample project loads a small Berlin-like GTFS feed plus an OSM-like GeoJSON network. It imports routes/stops, runs the matcher, and shows matched and missing coverage on the map.
## PostgreSQL/PostGIS
SQLite remains the default. For Germany-scale imports, point `DATABASE_URL` at PostgreSQL:
```bash
export DATABASE_URL=postgresql://USER:PASSWORD@localhost:5432/meubility
python -m app.cli init-db
uvicorn app.main:app --reload
```
PostgreSQL mode automatically creates `postgis` and `pg_trgm`, stores GTFS `stop_times` and OSM features in main tables, and uses GiST/trigram indexes for map bbox queries, route-layer stop linking, and search filters. To keep using legacy sidecars with PostgreSQL, set:
```bash
export POSTGRES_USE_SIDECARS=true
```
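As a rough illustration of how these two environment variables interact, here is a minimal sketch. The helper name, the default SQLite URL, and the rule that SQLite always uses sidecars are assumptions for illustration, not the app's actual configuration code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSettings:
    url: str
    is_postgres: bool
    use_sidecars: bool


def load_db_settings(env=None) -> DbSettings:
    """Hypothetical helper: derive database settings from environment variables."""
    env = os.environ if env is None else env
    # Assumed default; the real SQLite location may differ.
    url = env.get("DATABASE_URL", "sqlite:///data/workbench.sqlite")
    is_postgres = url.startswith(("postgresql://", "postgres://"))
    flag = env.get("POSTGRES_USE_SIDECARS", "false").strip().lower()
    # Assumption: sidecars are implied for SQLite and opt-in for PostgreSQL.
    use_sidecars = (not is_postgres) or flag in {"1", "true", "yes"}
    return DbSettings(url, is_postgres, use_sidecars)
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.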
To migrate the existing SQLite project into a fresh PostgreSQL database:
```bash
python scripts/migrate_sqlite_to_postgres.py \
--sqlite-path data/workbench.sqlite \
--postgres-url postgresql://USER:PASSWORD@localhost:5432/meubility \
--reset
```
The migration copies normal tables first, imports legacy GTFS/OSM sidecars into PostgreSQL main tables, rewrites dataset storage metadata to `main`, refreshes PostGIS geometry columns, and rebuilds runtime indexes.
## Docker start
```bash
docker compose up --build
```
Then open:
```text
http://127.0.0.1:8000
```
## CLI commands
```bash
python -m app.cli init-db
python -m app.cli reset-db
python -m app.cli load-sample
python -m app.cli stats
python -m app.cli add-source --name "My GTFS" --kind gtfs --url ./data/feed.zip --country DE
python -m app.cli add-source --name "VBB Online GTFS" --kind gtfs --url https://unternehmen.vbb.de/fileadmin/user_upload/VBB/Dokumente/API-Datensaetze/gtfs-mastscharf/GTFS.zip --country DE --license "CC BY 4.0"
python -m app.cli add-source --name "DB Long-distance Rail GTFS.DE" --kind gtfs --url https://download.gtfs.de/germany/fv_free/latest.zip --country DE --license "Creative Commons 4.0"
python -m app.cli add-source --name "Germany Regional Rail GTFS.DE" --kind gtfs --url https://download.gtfs.de/germany/rv_free/latest.zip --country DE --license "Creative Commons 4.0"
python -m app.cli add-source --name "Berlin OSM" --kind osm_pbf --url https://download.geofabrik.de/europe/germany/berlin-latest.osm.pbf --country DE --license ODbL
python -m app.cli run-source 1
python -m app.cli run-match
python -m app.cli prune-cache --dry-run
python -m app.cli prune-cache
```
## HTTP API
Core endpoints:
```text
GET /api/sources
POST /api/sources
POST /api/sources/{source_id}/run
POST /api/sample/reset
POST /api/match/run
GET /api/stats
GET /api/matches
POST /api/matches/{match_id}/accept
POST /api/matches/{match_id}/reject
GET /api/rules
POST /api/rules
```
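The endpoints above can be called from any HTTP client. The sketch below uses only the Python standard library; it assumes the server runs at the quick-start address, that the accept/reject endpoints need no request body, and that `/api/stats` returns JSON. None of this is confirmed by the API itself.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local address from the quick start


def build_review_request(match_id: int, action: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that accepts or rejects a match."""
    if action not in {"accept", "reject"}:
        raise ValueError("action must be 'accept' or 'reject'")
    url = f"{BASE_URL}/api/matches/{match_id}/{action}"
    # Assumption: the endpoint takes no payload, so an empty body is sent.
    return urllib.request.Request(url, data=b"", method="POST")


def fetch_stats() -> dict:
    """Call GET /api/stats; requires a running server that returns JSON."""
    with urllib.request.urlopen(f"{BASE_URL}/api/stats") as response:
        return json.load(response)
```

To send a prepared request, pass it to `urllib.request.urlopen(...)` while the app is running.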
Map layers:
```text
GET /api/map/osm_routes.geojson
GET /api/map/osm_stops.geojson
GET /api/map/gtfs_routes.geojson
GET /api/map/gtfs_stops.geojson
GET /api/map/matched_gtfs_routes.geojson
GET /api/map/matched_gtfs_routes.geojson?status=missing
```
Map endpoints accept viewport and layer filters:
```text
bbox=min_lon,min_lat,max_lon,max_lat
zoom=13
kind=route,infra,stop,station,terminal
mode=bus,tram,train,subway,light_rail,ferry
geometry=point,line,polygon,nonpoint
source_id=4
dataset_id=5
limit=5000
```
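A small sketch of how a client might assemble a map-layer URL from these filters. The function name is hypothetical; the parameter names come from the list above, and multi-value filters are joined with commas as shown there.

```python
from urllib.parse import urlencode


def map_layer_url(layer, bbox, base_url="http://127.0.0.1:8000", **filters):
    """Build a GeoJSON layer URL.

    layer: file stem such as "osm_routes" or "gtfs_stops".
    bbox: (min_lon, min_lat, max_lon, max_lat).
    filters: any of zoom, kind, mode, geometry, source_id, dataset_id, limit.
    """
    params = {"bbox": ",".join(str(value) for value in bbox)}
    for key, value in filters.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)  # e.g. mode=bus,tram
        params[key] = value
    # Keep commas readable instead of percent-encoding them.
    query = urlencode(params, safe=",")
    return f"{base_url}/api/map/{layer}.geojson?{query}"


url = map_layer_url("osm_routes", (13.3, 52.4, 13.5, 52.6), zoom=13, mode=["bus", "tram"])
```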
## Source types implemented
### `gtfs`
Expected input: GTFS static zip.
Imported files:
```text
agency.txt
stops.txt
routes.txt
trips.txt
stop_times.txt
shapes.txt, if available
```
The importer stores agencies, stops, routes, trips, limited stop-times, and representative route geometries. Route geometry comes from `shapes.txt` where available; otherwise it falls back to stop sequences from a representative trip.
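The geometry fallback described above can be sketched roughly as follows. The function name and the tuple layout are assumptions for illustration; the real importer may select and order points differently.

```python
def representative_geometry(shape_points, trip_stops):
    """Pick route geometry: shapes.txt points if present, else stop order.

    Both inputs are lists of (sequence, lon, lat) tuples; either may be empty.
    Returns (geometry, origin) where origin is "shape" or "stops",
    or (None, None) when fewer than two points are available.
    """
    for origin, points in (("shape", shape_points), ("stops", trip_stops)):
        ordered = sorted(points, key=lambda point: point[0])
        coords = [[lon, lat] for _, lon, lat in ordered]
        if len(coords) >= 2:
            return {"type": "LineString", "coordinates": coords}, origin
    return None, None
```

Returning the origin alongside the geometry lets later stages tell a surveyed shape from a straight-line stop sequence.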
Multiple GTFS sources can be active at once. Map endpoints and layer controls keep sources separate with `source_id` filters, so VBB, DB long-distance rail, DB/regional rail, and local sample feeds can be rendered independently.
The journey UI routes against the active harmonized transit snapshot instead of exposing a raw GTFS source selector. Feed-level filters remain available for map layers, QA, and source diagnostics.
### `osm_pbf`
Expected input: an OSM `.osm.pbf` extract, for example a Geofabrik regional extract.
The importer records the downloaded/copied file once as an immutable raw dataset with kind `osm_pbf_raw`. For `.osm.pbf` inputs it then runs `scripts/osmium_transport_filter.sh` and stores one transport-only extract as `osm_pbf_transport`. The Python extractor reads that filtered extract, writes `transport.geojson`, and imports it through the `osm_geojson` importer.
The raw and filtered datasets are inactive storage stages; the derived `osm_geojson` dataset is the active visual layer. Re-running an unchanged source reuses the existing raw, filtered, and derived datasets instead of duplicating the extract.
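One plausible way to detect an unchanged source is to key datasets by content hash, as in the sketch below. The in-memory registry and the id allocation are hypothetical stand-ins for the real dataset table.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large extracts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_raw_dataset(path: Path, registry: dict) -> int:
    """Return the dataset id for this file, creating one only for unseen content."""
    content_hash = file_sha256(path)
    if content_hash not in registry:
        registry[content_hash] = len(registry) + 1  # hypothetical id allocation
    return registry[content_hash]


# Demonstration with a temporary file standing in for a downloaded extract.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "extract.osm.pbf"
    sample.write_bytes(b"placeholder bytes, not a real PBF")
    registry = {}
    first_id = register_raw_dataset(sample, registry)
    second_id = register_raw_dataset(sample, registry)  # unchanged file, same id
```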
The extractor emits:
```text
route relations as LineString/MultiLineString features built from member ways
rail/tram/subway/ferry/aerialway infrastructure ways
stations, stops, platforms, bus stations, and ferry terminals
```
Route display uses OSM route relation member ways, not stop-to-stop straight-line interpolation.
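To illustrate what building geometry from member ways involves, here is a deliberately simple sketch. It is not the extractor's actual algorithm: it assumes ways arrive in relation order as coordinate lists and only joins a way to the end of the previous line.

```python
def assemble_route(member_ways):
    """Join relation member ways into a LineString or MultiLineString.

    member_ways: list of coordinate lists [(lon, lat), ...] in relation order.
    A way is appended to the current line when it shares an endpoint with
    the line's last point; otherwise a new line is started.
    """
    lines = []
    for way in member_ways:
        way = list(way)
        if len(way) < 2:
            continue  # skip degenerate ways
        if lines:
            current = lines[-1]
            if current[-1] == way[0]:
                current.extend(way[1:])
                continue
            if current[-1] == way[-1]:
                # Way is stored in the opposite direction: walk it backwards.
                current.extend(reversed(way[:-1]))
                continue
        lines.append(way)
    if not lines:
        return None
    if len(lines) == 1:
        return {"type": "LineString", "coordinates": [list(p) for p in lines[0]]}
    return {
        "type": "MultiLineString",
        "coordinates": [[list(p) for p in line] for line in lines],
    }
```

A production stitcher would also need to reorient the first way and handle branches and roundabouts; this sketch leaves gaps as separate lines instead.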
### `osm_geojson`
Expected input: GeoJSON `FeatureCollection` containing OSM-derived route/station/stop/terminal features.
Minimum useful properties for route features:
```json
{
"osm_type": "relation",
"osm_id": "12345",
"type": "route",
"route": "train",
"ref": "RE1",
"name": "RE1 Example Line",
"operator": "Example Operator",
"network": "Example Network"
}
```
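Before importing a hand-built file, it can help to check features against the property list above. The following sketch uses hypothetical helper names and treats every listed property as desirable rather than strictly required, since the importer's exact validation rules are not documented here.

```python
ROUTE_PROPERTY_KEYS = (
    "osm_type", "osm_id", "type", "route",
    "ref", "name", "operator", "network",
)


def route_feature(properties, coordinates):
    """Wrap properties and a coordinate list into a GeoJSON LineString feature."""
    return {
        "type": "Feature",
        "properties": dict(properties),
        "geometry": {"type": "LineString", "coordinates": coordinates},
    }


def missing_route_properties(feature):
    """List the recommended route properties that are absent or empty."""
    props = feature.get("properties") or {}
    return [key for key in ROUTE_PROPERTY_KEYS if not props.get(key)]


example = route_feature(
    {"osm_type": "relation", "osm_id": "12345", "type": "route",
     "route": "train", "ref": "RE1", "name": "RE1 Example Line"},
    [[13.37, 52.53], [13.41, 52.52]],
)
collection = {"type": "FeatureCollection", "features": [example]}
gaps = missing_route_properties(example)
```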
Supported route modes include:
```text
train, light_rail, subway, tram, bus, trolleybus, coach,
ferry, monorail, funicular, aerialway
```
## Matching logic
The current automatic matcher scores each GTFS route against OSM route features using:
```text
mode compatibility
route ref similarity
route name similarity
operator/network similarity
bbox overlap or proximity, used as a major disambiguator for common refs
GTFS/OSM geometry proximity, where both geometries are available
same normalized route key
```
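A weighted-sum scorer is one simple way to combine such signals. The sketch below is illustrative only: the weights, field names, and use of `difflib` are assumptions, mode compatibility is reduced to equality, and geometry proximity and the normalized route key are left out.

```python
from difflib import SequenceMatcher

# Hypothetical weights that sum to 100; the prototype's real weights may differ.
WEIGHTS = {"mode": 20, "ref": 30, "name": 20, "operator": 10, "bbox": 20}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; empty values score 0."""
    a, b = (a or "").strip().lower(), (b or "").strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def bbox_overlaps(a, b):
    """Boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def score_pair(gtfs, osm):
    """Score one GTFS route against one OSM route on a 0-100 scale."""
    signals = {
        "mode": 1.0 if gtfs["mode"] == osm["mode"] else 0.0,
        "ref": similarity(gtfs.get("ref"), osm.get("ref")),
        "name": similarity(gtfs.get("name"), osm.get("name")),
        "operator": similarity(gtfs.get("operator"), osm.get("operator")),
        "bbox": 1.0 if bbox_overlaps(gtfs["bbox"], osm["bbox"]) else 0.0,
    }
    return round(sum(WEIGHTS[key] * value for key, value in signals.items()), 1)
```

Keeping each signal as a separate number in `[0, 1]` makes it straightforward to store the breakdown next to the total for review.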
Each match also stores a scope classification:
```text
in_osm_scope
near_osm_scope
outside_osm_scope
unknown_scope
```
Overall coverage and in-scope coverage are intentionally separate. A GTFS route outside the loaded OSM extract should not be interpreted as a failed route match.
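The distinction between the two coverage figures can be sketched as follows. The bounding-box test, the `margin_deg` value, and the choice to count only `matched` routes as covered are assumptions made for illustration.

```python
def overlaps(a, b):
    """Boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def classify_scope(route_bbox, osm_bbox, margin_deg=0.05):
    """Classify a GTFS route relative to the loaded OSM extent."""
    if route_bbox is None or osm_bbox is None:
        return "unknown_scope"
    if overlaps(route_bbox, osm_bbox):
        return "in_osm_scope"
    grown = (
        osm_bbox[0] - margin_deg, osm_bbox[1] - margin_deg,
        osm_bbox[2] + margin_deg, osm_bbox[3] + margin_deg,
    )
    if overlaps(route_bbox, grown):
        return "near_osm_scope"
    return "outside_osm_scope"


def coverage(matches):
    """Compute overall and in-scope share of matched routes.

    matches: list of dicts with 'status' and 'scope' keys.
    A share is None when there are no rows to measure.
    """
    def share(rows):
        if not rows:
            return None
        return sum(row["status"] == "matched" for row in rows) / len(rows)

    in_scope = [m for m in matches if m["scope"] == "in_osm_scope"]
    return {"overall": share(matches), "in_scope": share(in_scope)}
```

With one matched in-scope route and one missing out-of-scope route, overall coverage is 0.5 while in-scope coverage is 1.0, which is why the two numbers are reported separately.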
Status thresholds:
```text
>= 85 matched
65-84 probable
40-64 weak
< 40 missing
```
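Expressed as code, the thresholds amount to a simple cascade. The function name is hypothetical, and treating fractional scores such as 84.5 as `probable` is an assumption about how the boundaries are handled.

```python
def status_for_score(score: float) -> str:
    """Map a 0-100 match score to a status label."""
    if score >= 85:
        return "matched"
    if score >= 65:
        return "probable"
    if score >= 40:
        return "weak"
    return "missing"
```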
Manual accept/reject actions are stored as `match_rules`. The current prototype records the rule; the next implementation step is applying those rules automatically before/after every matching run.
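Since rule application is not implemented yet, the following is only a sketch of what that step could look like. The field names (`gtfs_route_id`, `osm_route_id`, `action`, `manual_decision`) are hypothetical and do not reflect the actual `match_rules` schema.

```python
def apply_match_rules(matches, rules):
    """Return copies of matches with manual decisions layered on top.

    Assumed shapes: each match and rule carries 'gtfs_route_id' and
    'osm_route_id'; a rule also has 'action' set to 'accept' or 'reject'.
    """
    decisions = {
        (rule["gtfs_route_id"], rule["osm_route_id"]): rule["action"]
        for rule in rules
    }
    result = []
    for match in matches:
        key = (match["gtfs_route_id"], match["osm_route_id"])
        updated = dict(match)  # leave the automatic result untouched
        updated["manual_decision"] = decisions.get(key)  # None when no rule applies
        result.append(updated)
    return result
```

Storing the manual decision next to the automatic status, rather than overwriting it, keeps the matcher's own score visible during review.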
The route layer treats OSM route geometry as the visual authority when a suitable match exists. Multiple GTFS timetable shapes or trips, including opposite directions, can link to the same OSM-backed `RoutePattern`; each GTFS shape link keeps its own match and direction evidence. When no OSM route matches, the builder creates a `gtfs_proposed` visual pattern from GTFS geometry for review.
## Data flow
```text
source registration
→ local source cache
→ dataset record with hash
→ raw OSM commit, if source is osm_pbf
→ filtered transport extract, if source is osm_pbf and prefiltering is enabled
→ derived transport GeoJSON extraction, if source is osm_pbf
→ normalized GTFS / OSM tables
→ route matching
→ canonical stops and OSM-authoritative route layer
→ manual review rules
→ GeoJSON map layers
→ downstream routing/coverage/tile generation
```
## Current limitations
- PostgreSQL/PostGIS is supported for large local imports; vector tiles are still the next step for country/Europe-scale browsing.
- OSM PBF snapshot extraction is implemented; applying replication `.osc.gz` diffs onto prior raw snapshots is still a next step.
- GTFS-RT, SIRI, NeTEx, TransXChange, OSDM, fares, and booking APIs are not yet implemented.
- The matcher is deliberately transparent rather than sophisticated.
- The frontend requests viewport-bounded GeoJSON by layer; vector tiles are still the next step for country/Europe scale.
## OSM extraction helper
A starter Osmium shell filter script is included:
```bash
scripts/osmium_transport_filter.sh europe-latest.osm.pbf transport.osm.pbf
```
The script calls Osmium through `scripts/host_tool.sh`, which also works from a Flatpak/containerized terminal when `flatpak-spawn --host` is available. The app has a Python Osmium-based `osm_pbf` importer for repeatable prototype runs. For the next stage, add OSM replication diff application, move large-region imports to PostGIS, and serve generalized vector tiles where network editing requires broad viewport rendering.
## Tests
```bash
pytest -q
```