Step-by-Step Guide to Parsing GTFS with Partridge

Parsing a GTFS feed with Partridge involves loading the ZIP archive into a Feed object, applying a service ID filter to isolate active schedules, and extracting pandas DataFrames for core tables like stops.txt, routes.txt, and trips.txt. The library handles CSV parsing, row filtering across dependent tables, and predictable type conversion out of the box, which makes it a common choice for transit data pipelines that need to scale beyond basic CSV readers. This guide covers environment setup, service-day filtering, table extraction, and timezone normalization for production-grade workflows.

Compatibility & Environment Notes

Partridge couples tightly with specific pandas and Python versions. Mismatched dependencies cause silent schema drops or AttributeError exceptions during feed initialization.

| Component | Supported Version | Notes |
|-----------|-------------------|-------|
| Python | 3.9–3.11 | 3.12+ requires partridge>=1.1.0 due to datetime module deprecations |
| pandas | 1.5.x–2.1.x | 2.2+ may trigger FutureWarning on read_csv dtype inference |
| partridge | 1.1.0 (latest stable) | Pinned to numpy<2.0 for binary wheel compatibility |
| OS | Linux, macOS, Windows | Windows requires pyarrow for optimal ZIP stream performance |

Always install in an isolated environment. Partridge relies on pandas’ CSV parser, and the GTFS specification requires UTF-8 encoded files. Feeds in other encodings can raise UnicodeDecodeError or load with garbled text, so check the encoding before loading. Refer to the official GTFS Static Reference for the file encoding requirements.
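One way to catch a mismatched environment early is to compare the installed versions against the ranges listed above before loading any feed. The sketch below uses only the standard library. The helper names (`parse_major_minor`, `installed_version`, `in_range`) are hypothetical, and the bounds you pass in are your own policy, not something Partridge enforces.

```python
import re
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) as integers from a string such as '2.1.4'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def in_range(version, low, high):
    """True when low <= (major, minor) < high."""
    return low <= parse_major_minor(version) < high


# Report what is installed in the current environment.
for package in ("partridge", "pandas", "numpy"):
    print(package, installed_version(package) or "not installed")
```

For example, `in_range(installed_version("pandas"), (1, 5), (2, 2))` would express the pandas range from the table, provided pandas is installed.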

Step 1: Install & Import Dependencies

bash
pip install "partridge>=1.1.0" "pandas>=1.5,<2.2" "numpy<2.0"

Import the core modules. Partridge exposes a minimal API: load_feed returns a Feed object backed by the archive, the read_* helpers summarize the service calendar, and view dictionaries filter tables by column values.

python
import partridge as ptg
import pandas as pd
from datetime import date, timedelta

Step 2: Resolve Active Service IDs

GTFS feeds bundle multiple calendar variants (weekdays, weekends, holidays, exceptions). ptg.read_service_ids_by_date() scans calendar.txt and calendar_dates.txt and returns a dictionary that maps each datetime.date to a frozenset of the service_id strings active on that date.

python
feed_path = "path/to/agency-gtfs.zip"
service_ids = ptg.read_service_ids_by_date(feed_path)

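# Pick a target date: the next Monday strictly after today
# (date.weekday() returns 0 for Monday, so this adds 1 to 7 days)
today = date.today()
target_date = today + timedelta(days=7 - today.weekday())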
active_ids = service_ids.get(target_date, set())

if not active_ids:
    raise ValueError(f"No active service found for {target_date}")

Transit analysts typically iterate through this mapping to generate daily snapshots or backfill historical routing. For broader architectural patterns, see Parsing GTFS with Pandas and Partridge, which covers batch scheduling and incremental updates.
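Iterating over the mapping for a run of consecutive days can be sketched as below. The mapping here is a hand-written stand-in with the same shape as the dictionary described above (dates mapped to frozensets of service IDs), and `daily_service_ids` is a hypothetical helper, not part of Partridge.

```python
from datetime import date, timedelta


def daily_service_ids(service_ids_by_date, start, days):
    """Yield (date, service_ids) for `days` consecutive dates from `start`.

    Dates missing from the mapping yield an empty frozenset, which
    signals that no service is scheduled on that day.
    """
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield day, service_ids_by_date.get(day, frozenset())


# Stand-in data; in practice this would come from the feed's calendar.
example_mapping = {
    date(2024, 1, 1): frozenset({"HOLIDAY"}),
    date(2024, 1, 3): frozenset({"WEEKDAY", "EXTRA"}),
}

for day, ids in daily_service_ids(example_mapping, date(2024, 1, 1), 3):
    print(day.isoformat(), sorted(ids) or "no service")
```

Yielding an empty set instead of raising lets a backfill job log the gap and continue with the next date.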

Step 3: Initialize the Filtered Feed

Pass the active service_id set to ptg.load_feed() via a view dictionary. Loading is lazy: Partridge reads each table only when you first access it and keeps only the rows that match your filter.

python
# Construct the view to filter trips by active service IDs
view = {"trips.txt": {"service_id": active_ids}}

# Load the feed (lazy evaluation)
feed = ptg.load_feed(feed_path, view=view)

The view parameter accepts nested dictionaries mapping table names to column-value filters. Partridge propagates this filter downstream, automatically pruning stop_times.txt, calendar.txt, and related tables to match the selected trips.
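A view is an ordinary nested dictionary, so it can be assembled programmatically. The sketch below builds one that filters trips by service and, optionally, by route. `build_view` is a hypothetical helper, and the assumption that several column filters on the same table combine as a logical AND should be checked against the Partridge version you run.

```python
def build_view(service_ids, route_ids=None):
    """Return a view dict that keeps trips on the given services.

    When route_ids is provided, a second column filter is added to
    trips.txt (assumed here to combine with the first as an AND).
    """
    trip_filter = {"service_id": set(service_ids)}
    if route_ids:
        trip_filter["route_id"] = set(route_ids)
    return {"trips.txt": trip_filter}


# Example: weekday service restricted to two hypothetical routes.
view = build_view({"WEEKDAY"}, route_ids=["10", "22"])
print(view)
```

Narrowing the view to a handful of routes is also a cheap way to shrink memory use while prototyping.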

Step 4: Extract Core Tables

Once initialized, the Feed object exposes properties that return pre-parsed pandas DataFrames. Access them directly:

python
stops = feed.stops
routes = feed.routes
trips = feed.trips
stop_times = feed.stop_times
agency = feed.agency

Each DataFrame retains GTFS column names as headers. Partridge returns an empty DataFrame for a file that is missing from the archive, which prevents pipeline crashes on incomplete feeds. For large agencies, stop_times.txt can run to millions of rows; consider chunking or filtering by trip_id if memory constraints arise.
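If stop_times.txt is too large to load at once, one option is to read the extracted CSV in chunks with pandas and keep only the rows for the trips you need. This is a minimal sketch that bypasses Partridge entirely, so every column stays a string and no type conversion is applied. `filter_stop_times` is a hypothetical helper, and the in-memory CSV stands in for a real file path.

```python
import io

import pandas as pd


def filter_stop_times(csv_source, trip_ids, chunksize=100_000):
    """Read a stop_times CSV in chunks, keeping rows whose trip_id is wanted."""
    wanted = set(trip_ids)
    kept = [
        chunk[chunk["trip_id"].isin(wanted)]
        for chunk in pd.read_csv(csv_source, dtype=str, chunksize=chunksize)
    ]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Stand-in for a stop_times.txt file extracted from the feed archive.
sample = io.StringIO(
    "trip_id,stop_id,stop_sequence,departure_time\n"
    "T1,S1,1,08:00:00\n"
    "T1,S2,2,08:05:00\n"
    "T2,S1,1,09:00:00\n"
)
subset = filter_stop_times(sample, {"T1"}, chunksize=2)
print(subset)
```

Only the filtered rows from each chunk are retained, so peak memory is bounded by the chunk size plus the rows kept so far.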

Step 5: Handle GTFS Time Formats & Timezones

GTFS represents arrival and departure times as HH:MM:SS offsets from the start of the service day, not as ISO timestamps, and values can exceed 24:00:00 for trips that run past midnight. With its default configuration, Partridge parses arrival_time and departure_time into seconds since midnight (floats), but it does not attach a date or timezone, so you must build timezone-aware timestamps yourself using the agency’s timezone.

python
# Extract timezone from agency table
tz = agency["agency_timezone"].iloc[0]

# Partridge's default config yields seconds since midnight, so pass unit="s".
# (With the default unit, pandas would read the floats as nanoseconds.)
stop_times["departure_time_td"] = pd.to_timedelta(stop_times["departure_time"], unit="s")

# GTFS measures times from "noon minus 12h" on the service day, which differs
# from local midnight on daylight saving transition days.
noon = pd.Timestamp(target_date).replace(hour=12).tz_localize(tz)
base_dt = noon - pd.Timedelta(hours=12)

# Values of 24 hours or more (e.g. 90000 s -> "1 days 01:00:00") roll over to
# the next day in this single addition; no extra day offset is required.
stop_times["departure_dt"] = base_dt + stop_times["departure_time_td"]

This approach aligns with Partridge Documentation recommendations for temporal normalization. Always validate timezone strings against the IANA database to avoid silent offset errors.
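Checking a timezone string against the IANA database can be done with the standard-library zoneinfo module (Python 3.9+). This sketch assumes time zone data is available on the system or through the tzdata package. `is_valid_timezone` is a hypothetical helper name.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_timezone(name):
    """Return True if `name` resolves to a zone in the available IANA data."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError, OSError):
        # Unknown keys, malformed keys, and unreadable paths all count as invalid.
        return False
    return True


for candidate in ("America/New_York", "Mars/Olympus_Mons"):
    print(candidate, is_valid_timezone(candidate))
```

Running this check on agency_timezone before localizing turns a misspelled zone into an explicit failure instead of a confusing error later in the pipeline.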

Step 6: Export & Validate

After normalization, export to Parquet or CSV for downstream GIS or routing engines. Validate row counts against the original feed to ensure filter integrity.

python
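# Quick validation: every stop_time row must reference a loaded trip.
# (Comparing counts alone would fail for trips that have no stop_times.)
assert stop_times["trip_id"].isin(trips["trip_id"]).all(), "stop_times references unknown trips"

# Export (to_parquet needs pyarrow or fastparquet, and output/ must already exist)
stops.to_parquet("output/stops.parquet")
stop_times.to_parquet("output/stop_times.parquet")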

Parquet preserves dtypes and compresses efficiently, which makes it a practical format for downstream analytics.
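A referential check between tables can complement row counts. The sketch below works on any iterables of IDs, so it accepts pandas columns such as `stop_times["trip_id"]` as well as the plain lists used here. `find_orphans` is a hypothetical helper.

```python
def find_orphans(child_ids, parent_ids):
    """Return IDs that appear in child_ids but not in parent_ids."""
    return set(child_ids) - set(parent_ids)


# Stand-in ID columns for trips.txt and stop_times.txt.
trip_ids = ["T1", "T2", "T3"]
stop_time_trip_ids = ["T1", "T1", "T2", "T9"]

# stop_times rows pointing at trips that were filtered out or never existed.
print(find_orphans(stop_time_trip_ids, trip_ids))

# Trips that have no stop_times at all.
print(find_orphans(trip_ids, stop_time_trip_ids))
```

Reporting the offending IDs, not only a pass or fail result, makes it easier to trace a mismatch back to the view that produced it.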

Best Practices & Troubleshooting

  • Memory Limits: If MemoryError occurs during load, narrow the view to specific routes, or extract stop_times.txt from the archive and read it with pandas.read_csv(chunksize=...) for manual iteration.
  • Schema Drift: Agency-specific extensions can introduce unexpected columns. Partridge does not convert or validate columns outside its configuration, but you can enforce strict schemas using pandas.read_csv(dtype=...) overrides.
  • Encoding Issues: Always verify ZIP contents with zipfile.ZipFile(feed_path).namelist(). Corrupted or Latin-1 encoded CSVs require pre-processing before Partridge ingestion.
  • Pipeline Integration: For production ETL, wrap feed loading in retry logic and cache service_ids dictionaries to avoid repeated ZIP scans. Explore Python Parsing & Data Normalization for advanced caching strategies and schema validation patterns.
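The encoding check mentioned above can be sketched with the standard library alone. The helper below lists the .txt members of an archive that fail to decode as UTF-8. `non_utf8_members` is a hypothetical name, and the in-memory ZIP is fabricated sample data standing in for a real feed path. Each member is read fully into memory, which is acceptable for a pre-flight check but not tuned for very large files.

```python
import io
import zipfile


def non_utf8_members(zip_source):
    """Return names of .txt members whose bytes do not decode as UTF-8."""
    bad = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            try:
                # utf-8-sig also accepts files that start with a byte order mark.
                archive.read(name).decode("utf-8-sig")
            except UnicodeDecodeError:
                bad.append(name)
    return bad


# Build a small in-memory archive: one valid file, one Latin-1 encoded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stops.txt", "stop_id,stop_name\nS1,Montréal\n".encode("utf-8"))
    archive.writestr("routes.txt", "route_id,route_long_name\nR1,Montréal\n".encode("latin-1"))

print(non_utf8_members(buffer))
```

Files flagged by this check can be re-encoded to UTF-8 before the archive is handed to the loader.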

By following this workflow, you eliminate manual CSV merging, reduce memory overhead by loading only the rows for the service days you need, and produce timezone-aware transit datasets ready for routing, visualization, or machine learning pipelines.