Fixing Missing stop_times.txt Records in Python
Fixing missing stop_times.txt records in Python requires a deterministic validation-and-reconciliation pipeline that cross-references trips.txt and stops.txt, flags referential gaps, and applies spec-compliant patching or reporting rules. The most reliable approach uses vectorized joins to detect orphaned trip_id/stop_sequence combinations, enforces strict GTFS column typing, and outputs either a patched feed or a structured anomaly report. Always validate against the GTFS Static Reference before deploying automated fixes, as silent interpolation can break downstream routing engines.
Why Records Go Missing
Missing records rarely occur randomly. Transit data pipelines typically drop rows due to three structural failures:
- Schedule truncation: Agencies export peak-only trips but leave
trips.txtentries for off-peak or seasonal services. - ETL misalignment: Column shifts, delimiter mismatches, or encoding errors truncate CSV exports during ingestion.
- Referential drift:
stop_idortrip_idvalues instop_times.txtreference deprecated or renamed entities in parent tables.
Before patching, establish a baseline integrity check. Understanding the Mastering stops.txt and stop_times.txt Relationships ensures you don’t accidentally create phantom stops or violate sequence continuity. A robust pipeline must verify required columns exist, detect duplicate (trip_id, stop_sequence) pairs, identify trips present in trips.txt but absent from stop_times.txt, and flag stop_id references that lack parent records.
Validation Pipeline Architecture
A production-grade reconciliation workflow follows a strict order of operations:
- Schema enforcement: Cast IDs to strings and sequences to nullable integers to prevent silent type coercion.
- Referential anti-join: Isolate
trip_idvalues that exist intrips.txtbut have zero matching rows instop_times.txt. - Sequence gap detection: Group by
trip_id, sort bystop_sequence, and identify jumps greater than1. - Foreign key validation: Cross-check every
stop_idagainststops.txtto catch orphaned references. - Diagnostic aggregation: Compile counts, affected IDs, and severity levels into a machine-readable report.
Applying consistent Python Parsing & Data Normalization practices at ingestion prevents downstream join failures and reduces memory overhead when processing multi-gigabyte feeds.
Production-Ready Python Implementation
The following script uses pandas for fast, memory-aware joins and explicit dtype enforcement. It outputs a diagnostic CSV and optionally generates placeholder records for missing sequences.
import pandas as pd
import numpy as np
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
def audit_and_patch_stop_times(feed_dir: str, patch: bool = False) -> pd.DataFrame:
feed_path = Path(feed_dir)
# Load core GTFS tables with explicit dtypes
trips = pd.read_csv(feed_path / "trips.txt", dtype={"trip_id": str, "route_id": str})
stops = pd.read_csv(feed_path / "stops.txt", dtype={"stop_id": str})
stop_times = pd.read_csv(
feed_path / "stop_times.txt",
dtype={"trip_id": str, "stop_id": str, "stop_sequence": "Int64"}
)
# Validate required columns
required_st = {"trip_id", "stop_id", "arrival_time", "departure_time", "stop_sequence"}
missing_cols = required_st - set(stop_times.columns)
if missing_cols:
raise ValueError(f"stop_times.txt missing required columns: {missing_cols}")
# 1. Detect orphaned trips (in trips.txt but missing from stop_times.txt)
trip_coverage = stop_times[["trip_id"]].drop_duplicates()
orphaned = trips.merge(trip_coverage, on="trip_id", how="left", indicator=True)
missing_trip_ids = orphaned.loc[orphaned["_merge"] == "left_only", "trip_id"].tolist()
# 2. Detect invalid stop_id references
valid_stops = set(stops["stop_id"])
invalid_refs = stop_times[~stop_times["stop_id"].isin(valid_stops)]
# 3. Detect sequence gaps per trip
stop_times_sorted = stop_times.sort_values(["trip_id", "stop_sequence"])
gaps = stop_times_sorted.groupby("trip_id")["stop_sequence"].diff().gt(1)
gap_rows = stop_times_sorted[gaps]
# Compile diagnostics
diagnostics = pd.DataFrame({
"issue_type": ["missing_trip_ids", "invalid_stop_refs", "sequence_gaps"],
"count": [len(missing_trip_ids), len(invalid_refs), len(gap_rows)],
"sample_ids": [
missing_trip_ids[:5],
invalid_refs[["trip_id", "stop_id"]].head(5).to_dict("records"),
gap_rows[["trip_id", "stop_sequence"]].head(5).to_dict("records")
]
})
if patch:
logging.info("Generating patched stop_times.txt with placeholder sequences...")
# Safe patching: only fill missing sequences for trips that already have partial data
patched_rows = []
for trip_id, group in stop_times_sorted.groupby("trip_id"):
seqs = group["stop_sequence"].tolist()
if not seqs:
continue
expected = list(range(min(seqs), max(seqs) + 1))
missing = [s for s in expected if s not in seqs]
for m in missing:
patched_rows.append({
"trip_id": trip_id,
"stop_id": "PLACEHOLDER",
"stop_sequence": m,
"arrival_time": "00:00:00",
"departure_time": "00:00:00"
})
if patched_rows:
patch_df = pd.DataFrame(patched_rows)
patch_df.to_csv(feed_path / "stop_times_patched.txt", index=False)
logging.info(f"Appended {len(patch_df)} placeholder rows to stop_times_patched.txt")
diagnostics.to_csv(feed_path / "stop_times_diagnostics.csv", index=False)
logging.info("Diagnostics exported to stop_times_diagnostics.csv")
return diagnostics
# Usage:
# audit_and_patch_stop_times("/path/to/gtfs_feed", patch=True)
Patching Strategy & Spec Compliance
Automated patching should never be the default. GTFS routing engines rely on strict temporal and spatial continuity. When generating placeholder records:
- Never invent coordinates: Use
PLACEHOLDERforstop_idand00:00:00for times to force manual review. - Preserve sequence order: Missing rows must fill exact integer gaps. Never shift existing
stop_sequencevalues. - Flag for QA: Append a
_patchedsuffix to the output filename and require agency sign-off before merging.
Refer to the official GTFS Static Reference for mandatory field constraints and time format rules (HH:MM:SS). If your pipeline requires advanced merge logic, consult the pandas merge documentation to optimize join strategies for large transit datasets.
Deployment Checklist
Missing records degrade schedule accuracy and break isochrone generation. A deterministic Python pipeline that prioritizes detection over silent correction keeps routing engines stable and maintains data lineage for transit operators.