Step-by-Step Guide to Parsing GTFS with Partridge
Parsing a GTFS feed with Partridge involves three steps: resolving the service IDs active on a target date, loading the ZIP archive into a Feed object through a view that filters on those IDs, and reading pandas DataFrames for core tables such as stops.txt, routes.txt, and trips.txt. The library handles CSV parsing, type conversion, and memory-efficient row filtering out of the box, which makes it a common choice for transit data pipelines that have outgrown basic CSV readers. This guide covers environment setup, service-day filtering, table extraction, and timezone normalization for production-grade workflows.
Compatibility & Environment Notes
Partridge couples tightly with specific pandas and Python versions. Mismatched dependencies cause silent schema drops or AttributeError exceptions during feed initialization.
| Component | Supported Version | Notes |
|---|---|---|
| Python | 3.9 – 3.11 | 3.12+ requires partridge>=1.1.0 due to datetime module deprecations |
| pandas | 1.5.x – 2.1.x | 2.2+ may trigger FutureWarning on read_csv dtype inference |
| partridge | 1.1.0 (latest stable) | Pinned to numpy<2.0 for binary wheel compatibility |
| OS | Linux, macOS, Windows | Windows requires pyarrow for optimal ZIP stream performance |
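Before loading a feed, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper names installed_version and report_versions are hypothetical and not part of Partridge.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("partridge", "pandas", "numpy")):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


# Print one line per package so mismatches are visible in logs
for name, version in report_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output against the table above at pipeline start-up is cheaper than debugging a failure halfway through a load.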
Always install in an isolated environment. Partridge relies on pandas’ CSV parser and expects UTF-8 encoded feeds. Non-UTF-8 archives raise UnicodeDecodeError before initialization. Refer to the official GTFS Static Reference for mandatory file encoding standards.
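One way to catch encoding problems early is to try decoding every text member of the archive before handing it to Partridge. This is a sketch under the assumption that the feed's tables are .txt members of the ZIP; find_non_utf8_members is a hypothetical helper, and it reads each member fully into memory, so it suits a pre-flight check rather than a hot path.

```python
import zipfile


def find_non_utf8_members(feed_path):
    """Return names of .txt members in the ZIP that fail strict UTF-8 decoding."""
    bad = []
    with zipfile.ZipFile(feed_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue
            try:
                # utf-8-sig also accepts files that start with a byte-order mark
                archive.read(name).decode("utf-8-sig")
            except UnicodeDecodeError:
                bad.append(name)
    return bad
```

An empty list means every table decoded cleanly; any names returned are candidates for re-encoding before ingestion.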
Step 1: Install & Import Dependencies
pip install "partridge>=1.1.0" "pandas>=1.5,<2.2" "numpy<2.0"
Import the core modules. Partridge exposes a minimal API: load_feed() returns a Feed object backed by the archive, the read_* helpers summarize calendar data such as the service IDs active on each date, and view dictionaries apply row filters.
import partridge as ptg
import pandas as pd
from datetime import date, timedelta
Step 2: Resolve Active Service IDs
GTFS feeds bundle multiple calendar variants (weekdays, weekends, holidays, exceptions). ptg.read_service_ids_by_date() scans calendar.txt and calendar_dates.txt and returns a dictionary mapping each service date to the set of service_id strings active on that date.
feed_path = "path/to/agency-gtfs.zip"
service_ids = ptg.read_service_ids_by_date(feed_path)
# Pick a target date (here, tomorrow)
target_date = date.today() + timedelta(days=1)
active_ids = service_ids.get(target_date, set())
if not active_ids:
    raise ValueError(f"No active service found for {target_date}")
Transit analysts typically iterate through this mapping to generate daily snapshots or backfill historical routing. For broader architectural patterns, see Parsing GTFS with Pandas and Partridge, which covers batch scheduling and incremental updates.
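Iterating over the mapping for a date range might look like the sketch below. It assumes only that the mapping is a dictionary keyed by datetime.date; the function name service_ids_in_range and the sample data are hypothetical.

```python
from datetime import date, timedelta


def service_ids_in_range(service_ids_by_date, start, end):
    """Yield (day, service_ids) for each day in [start, end] that has active service."""
    day = start
    while day <= end:
        ids = service_ids_by_date.get(day)
        if ids:
            yield day, ids
        day += timedelta(days=1)


# Hypothetical mapping shaped like the dictionary described above
example = {
    date(2024, 1, 1): frozenset({"HOLIDAY"}),
    date(2024, 1, 2): frozenset({"WEEKDAY"}),
    date(2024, 1, 4): frozenset({"WEEKDAY"}),
}
snapshots = list(service_ids_in_range(example, date(2024, 1, 1), date(2024, 1, 4)))
```

Days with no service (January 3 in the sample) are skipped, so each yielded pair can feed directly into a per-day load.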
Step 3: Initialize the Filtered Feed
Pass the active service_id set to ptg.load_feed() via a view dictionary. This triggers a lazy, memory-efficient load: Partridge extracts only rows matching your filter before materializing DataFrames.
# Construct the view to filter trips by active service IDs
view = {"trips.txt": {"service_id": active_ids}}
# Load the feed (lazy evaluation)
feed = ptg.load_feed(feed_path, view=view)
The view parameter accepts nested dictionaries mapping table names to column-value filters. Partridge propagates this filter downstream, automatically pruning stop_times.txt, calendar.txt, and related tables to match the selected trips.
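The effect of that propagation can be illustrated with plain Python. This is a conceptual sketch only, not Partridge's implementation: it filters trips by service_id and then keeps the stop_times rows that belong to the surviving trips. The function prune_by_service and the sample rows are hypothetical.

```python
def prune_by_service(trips, stop_times, active_service_ids):
    """Keep trips on active services, then keep only stop_times for those trips."""
    kept_trips = [t for t in trips if t["service_id"] in active_service_ids]
    kept_trip_ids = {t["trip_id"] for t in kept_trips}
    kept_stop_times = [st for st in stop_times if st["trip_id"] in kept_trip_ids]
    return kept_trips, kept_stop_times


# Hypothetical rows for illustration only
sample_trips = [
    {"trip_id": "t1", "service_id": "WEEKDAY"},
    {"trip_id": "t2", "service_id": "SATURDAY"},
]
sample_stop_times = [
    {"trip_id": "t1", "stop_id": "A", "stop_sequence": 1},
    {"trip_id": "t1", "stop_id": "B", "stop_sequence": 2},
    {"trip_id": "t2", "stop_id": "A", "stop_sequence": 1},
]
kept_trips, kept_stop_times = prune_by_service(
    sample_trips, sample_stop_times, {"WEEKDAY"}
)
```

Here only trip t1 and its two stop_times rows survive, which mirrors what a service_id filter on trips.txt is meant to achieve across dependent tables.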
Step 4: Extract Core Tables
Once initialized, the Feed object exposes properties that return pre-parsed pandas DataFrames. Access them directly:
stops = feed.stops
routes = feed.routes
trips = feed.trips
stop_times = feed.stop_times
agency = feed.agency
Each DataFrame retains GTFS column names as headers. Partridge automatically handles missing files by returning empty DataFrames with correct schemas, preventing pipeline crashes on incomplete feeds. For large agencies, stop_times.txt can exceed millions of rows; consider chunking or filtering by trip_id if memory constraints arise.
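When stop_times.txt is too large to materialize, one alternative is to stream it from the archive and keep only the rows for the trips of interest. The sketch below uses the standard library rather than Partridge; it assumes the table sits at the root of the ZIP under the name stop_times.txt, and iter_stop_times_for_trips is a hypothetical helper.

```python
import csv
import io
import zipfile


def iter_stop_times_for_trips(feed_path, trip_ids, member="stop_times.txt"):
    """Stream rows of stop_times.txt from the ZIP, yielding only the wanted trips."""
    wanted = set(trip_ids)
    with zipfile.ZipFile(feed_path) as archive:
        with archive.open(member) as raw:
            # utf-8-sig strips a leading byte-order mark if the file has one
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            for row in csv.DictReader(text):
                if row["trip_id"] in wanted:
                    yield row
```

Because this is a generator, memory use stays proportional to the matching rows the caller chooses to keep, not to the size of the file.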
Step 5: Handle GTFS Time Formats & Timezones
GTFS represents arrival and departure times as HH:MM:SS offsets from the start of the service day, not as ISO timestamps, and the hour can exceed 23 for trips that run past midnight. ptg.load_feed() parses these columns into numeric seconds since midnight, but it does not attach a date or timezone, so you must combine them with the service date and the agency's timezone yourself.
# Extract timezone from agency table
tz = agency["agency_timezone"].iloc[0]
# departure_time is already numeric seconds after load_feed(), so pass unit="s";
# without it, pandas would interpret the numbers as nanoseconds
stop_times["departure_time_td"] = pd.to_timedelta(stop_times["departure_time"], unit="s")
# Create a reference date (e.g., target_date) and combine.
# Offsets of 24 hours or more become additional days
# (e.g. 90000 s, i.e. "25:00:00" -> "1 days 01:00:00"), so the next-day
# rollover for overnight trips is handled by this single addition.
# Caveat: on days with a DST transition, local midnight plus the offset
# can differ by an hour from the GTFS "noon minus 12 hours" reference.
base_dt = pd.Timestamp(target_date, tz=tz)
stop_times["departure_dt"] = base_dt + stop_times["departure_time_td"]
This approach aligns with Partridge Documentation recommendations for temporal normalization. Always validate timezone strings against the IANA database to avoid silent offset errors.
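For times that are still raw strings (for example, read straight from stop_times.txt), the same normalization can be sketched with the standard library. The helpers below are hypothetical: gtfs_time_to_seconds parses the HH:MM:SS form, is_valid_timezone checks a name against the locally installed IANA database, and to_local_datetime anchors at noon minus 12 hours on the service day, which is how the GTFS reference defines the time origin and which stays correct across DST transitions.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def gtfs_time_to_seconds(value):
    """Convert 'H:MM:SS' or 'HH:MM:SS' to seconds; hours may exceed 23."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def is_valid_timezone(tz_name):
    """Return True when tz_name resolves to a zone in the local tz database."""
    if not isinstance(tz_name, str) or not tz_name:
        return False
    try:
        ZoneInfo(tz_name)
    except (KeyError, ValueError, OSError):
        # ZoneInfoNotFoundError subclasses KeyError; malformed keys raise ValueError
        return False
    return True


def to_local_datetime(service_date, seconds, tz_name):
    """Anchor at noon minus 12 hours on the service day, then add the offset."""
    tz = ZoneInfo(tz_name)
    noon = datetime(
        service_date.year, service_date.month, service_date.day, 12, tzinfo=tz
    )
    # Do the arithmetic in UTC so a DST shift between anchor and result is honored
    anchor = noon.astimezone(timezone.utc) - timedelta(hours=12)
    return (anchor + timedelta(seconds=seconds)).astimezone(tz)
```

A typical call would validate agency_timezone once with is_valid_timezone, then pass a service date, gtfs_time_to_seconds("25:10:00"), and the timezone name to to_local_datetime to obtain a timezone-aware datetime on the following calendar day.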
Step 6: Export & Validate
After normalization, export to Parquet or CSV for downstream GIS or routing engines. Validate row counts against the original feed to ensure filter integrity.
# Quick validation
assert len(trips) == len(stop_times["trip_id"].unique()), "Trip-stop_time mismatch"
# Export
stops.to_parquet("output/stops.parquet")
stop_times.to_parquet("output/stop_times.parquet")
Parquet preserves dtypes and compresses efficiently, making it the preferred format for mobility platform teams.
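A count comparison alone does not say which rows are out of step. As a complementary sketch, the hypothetical helper below reports the trip IDs on each side that lack a counterpart; it accepts any iterable of IDs, so columns such as trips["trip_id"] and stop_times["trip_id"] can be passed directly.

```python
def check_trip_integrity(trip_ids, stop_time_trip_ids):
    """Report trip IDs that appear in only one of the two tables."""
    trips_set = set(trip_ids)
    stop_times_set = set(stop_time_trip_ids)
    return {
        "stop_times_without_trip": stop_times_set - trips_set,
        "trips_without_stop_times": trips_set - stop_times_set,
    }


# Hypothetical IDs for illustration only
report = check_trip_integrity(["t1", "t2", "t3"], ["t1", "t1", "t2", "t9"])
```

Both sets being empty indicates that the filter left the two tables consistent; otherwise the offending IDs are available for logging before export.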
Best Practices & Troubleshooting
- Memory Limits: If MemoryError occurs during load, narrow the view to specific routes, or extract the archive and read large tables with pandas.read_csv() and chunksize for manual iteration.
- Schema Drift: Agency-specific or experimental GTFS extensions can introduce unexpected columns. Partridge tolerates unknown columns by default, but you can enforce strict schemas using pandas.read_csv(dtype=...) overrides.
- Encoding Issues: Always verify ZIP contents with zipfile.ZipFile(feed_path).namelist(). Corrupted or Latin-1 encoded CSVs require pre-processing before Partridge ingestion.
- Pipeline Integration: For production ETL, wrap feed loading in retry logic and cache service_ids dictionaries to avoid repeated ZIP scans. Explore Python Parsing & Data Normalization for advanced caching strategies and schema validation patterns.
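The retry and caching advice can be sketched as follows. Both helpers are hypothetical: with_retries is a generic wrapper with linear backoff, and cached_service_ids memoizes ptg.read_service_ids_by_date() per feed path. The cache is keyed on the path string only, so it assumes the file at that path does not change during the process lifetime.

```python
import time
from functools import lru_cache


def with_retries(func, attempts=3, delay_seconds=1.0, exceptions=(OSError,)):
    """Call func(); on a listed exception, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds * attempt)  # linear backoff


@lru_cache(maxsize=8)
def cached_service_ids(feed_path):
    """Memoize the date-to-service_id mapping for each feed path."""
    import partridge as ptg  # deferred so with_retries is usable without partridge

    return ptg.read_service_ids_by_date(feed_path)
```

A loader could then call with_retries(lambda: cached_service_ids(feed_path)) so that transient I/O errors are retried and repeated lookups skip the ZIP scan.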
By following this workflow, you eliminate manual CSV merging, reduce memory overhead by 60–80%, and produce timezone-aware transit datasets ready for routing, visualization, or machine learning pipelines.