Best Practices for GTFS Agency Metadata

Best practices for GTFS agency metadata start with treating agency.txt as the stable anchor of your transit feed. While the specification marks several fields as optional, production pipelines should enforce strict schema compliance, stable identifiers, and standardized locale codes. Doing so prevents routing engine failures, keeps fare attribution accurate, and maintains compatibility across feed versioning cycles. For teams building scalable mobility platforms, understanding how agency data propagates through the rest of the feed (see GTFS Feed Architecture & Fundamentals) is critical before deploying validation logic.

Core agency.txt Requirements & Validation Rules

The agency.txt file defines the operating entity behind every route, trip, and stop. To avoid downstream parsing errors, enforce these production standards:

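  • Stable agency_id: Use a persistent, human-readable string (e.g., MTA-NYCT, BART-SF). The specification only requires agency_id when a feed contains multiple agencies and does not forbid numeric values, but reusing IDs, rotating auto-generated UUIDs, or relying on auto-incremented integers breaks historical analytics and real-time subscriptions, and corrupts downstream joins keyed on the agency.
  • Mandatory Core Fields: Treat agency_name, agency_url, agency_timezone, and agency_lang as required. The specification already requires the first three; agency_lang is optional there, but omitting it forces consumer SDKs and accessibility tools to guess the feed language.
  • IANA Timezones Only: Use exact identifiers from the IANA Time Zone Database (e.g., America/New_York). Abbreviations like EST or PST are ambiguous, region-dependent, and denote a fixed offset, so they cannot represent daylight-saving transitions.
  • ISO 639-1 Language Codes: Restrict agency_lang to two-letter lowercase codes (en, es, fr). The specification defines the field as an IETF BCP 47 language code, of which these two-letter codes are a valid subset; avoid longer tags with region or script subtags unless your consumer stack explicitly supports them.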
  • HTTPS Enforcement: agency_url and optional agency_fare_url must resolve to secure endpoints. Mixed-content warnings break mobile apps and violate modern transit API security baselines.
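
To illustrate why a fixed abbreviation cannot stand in for an IANA identifier, here is a minimal sketch using only Python's standard zoneinfo module. The helper name utc_offset_hours and the two dates are arbitrary illustrations, and the sketch assumes the time zone database is available on the system (or through the tzdata package).

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def utc_offset_hours(tz_name: str, when: datetime) -> float:
    """Return the UTC offset, in hours, that tz_name has at a naive local time."""
    aware = when.replace(tzinfo=ZoneInfo(tz_name))
    return aware.utcoffset() / timedelta(hours=1)


# Same IANA zone, two seasons: the offset follows daylight saving time.
winter = utc_offset_hours("America/New_York", datetime(2024, 1, 15, 12, 0))
summer = utc_offset_hours("America/New_York", datetime(2024, 7, 15, 12, 0))
print(winter, summer)  # -5.0 -4.0
```

A single abbreviation such as EST can only describe one of those two offsets, which is why the identifier rather than the abbreviation belongs in agency_timezone.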

Production-Ready Python Validation

Transit automation pipelines should validate metadata before ingestion. The following routine uses pandas for CSV parsing and Pydantic v2 for schema enforcement. It reads every column as a string, rejects malformed records, and logs actionable errors.

python
import pandas as pd
from pydantic import BaseModel, Field, field_validator, ValidationError
from typing import Optional
import zoneinfo
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("gtfs_agency_validator")

class AgencyRecord(BaseModel):
    agency_id: str = Field(..., min_length=2, max_length=50)
    agency_name: str = Field(..., min_length=2)
    agency_url: str = Field(..., pattern=r'^https://')  # HTTPS only; plain http:// is rejected
    agency_timezone: str
    agency_lang: str = Field(..., min_length=2, max_length=2)
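    agency_phone: Optional[str] = None
    agency_fare_url: Optional[str] = Field(None, pattern=r'^https://')
    agency_email: Optional[str] = None

    @field_validator('agency_phone', 'agency_fare_url', 'agency_email', mode='before')
    @classmethod
    def blank_to_none(cls, v):
        # keep_default_na=False yields '' for empty cells; treat those as missing.
        return None if v == '' else v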

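    @field_validator('agency_timezone')
    @classmethod
    def validate_timezone(cls, v: str) -> str:
        # ZoneInfo also accepts legacy keys such as "EST", so require a
        # Region/City identifier (a house rule; "UTC" is the one exception).
        if '/' not in v and v != 'UTC':
            raise ValueError(f"Invalid IANA timezone: {v}")
        try:
            zoneinfo.ZoneInfo(v)
        except (zoneinfo.ZoneInfoNotFoundError, ValueError, OSError) as exc:
            raise ValueError(f"Invalid IANA timezone: {v}") from exc
        return v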

    @field_validator('agency_lang')
    @classmethod
    def validate_language(cls, v: str) -> str:
        if not v.isalpha() or not v.islower() or len(v) != 2:
            raise ValueError("agency_lang must be a lowercase ISO 639-1 code (e.g., 'en')")
        return v

def validate_agency_csv(filepath: str) -> list[dict]:
    df = pd.read_csv(filepath, dtype=str, keep_default_na=False)
    valid_records = []
    for idx, row in df.iterrows():
        try:
            record = AgencyRecord(**row.to_dict())
            valid_records.append(record.model_dump())
        except ValidationError as e:
            # idx is the 0-based data row, not the line number in the file.
            logger.warning("Row %s validation failed: %s", idx, e)
    
    if not valid_records:
        raise ValueError("No valid agency records found. Feed rejected.")
    return valid_records

Key implementation notes:

  • Uses Pydantic v2 syntax (@field_validator, model_dump()) for current compatibility and faster serialization.
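  • dtype=str during CSV read prevents pandas from coercing numeric-looking IDs into integers or floats and stripping leading zeros; keep_default_na=False keeps empty cells as empty strings instead of NaN.
  • Logs every rejected row as an explicit warning and raises only when no valid record remains, so bad rows are surfaced instead of silently corrupting downstream routing engines. Pipelines that need all-or-nothing behavior should raise on the first rejected row instead.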

Handling Multi-Agency Feeds & Mergers

Regional transit hubs often consolidate multiple operators into a single GTFS package. In these cases, agency_id collisions become a critical failure point. When merging feeds:

  1. Prefix IDs with a regional namespace (e.g., BAYAREA_MUNI, BAYAREA_BART) before concatenation, and apply the same rewrite to every file that references agency_id, such as routes.txt.
  2. Maintain a crosswalk table mapping legacy IDs to canonical identifiers.
  3. Validate that agency_url and agency_fare_url point to operator-specific endpoints, not generic portal pages.

Without namespace isolation, routing engines will misattribute trips, fare calculators will apply incorrect rules, and real-time vehicle positions will detach from scheduled routes.
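
The namespacing and crosswalk steps above can be sketched as follows. This is a minimal illustration, not a full feed merger: the function name namespace_agency_ids, the operator keys, and the REGION_OPERATOR_ID naming scheme are all assumptions chosen for the example, and rows are plain dicts as produced by csv.DictReader.

```python
def namespace_agency_ids(feeds, region):
    """Rewrite agency_id values so they stay unique across merged feeds.

    feeds:  mapping of a (hypothetical) operator key -> list of agency.txt rows
    region: (hypothetical) regional prefix applied to every ID
    Returns (merged_rows, crosswalk), where crosswalk maps
    (operator_key, legacy_id) -> canonical_id.
    """
    merged, crosswalk, seen = [], {}, set()
    for operator, rows in feeds.items():
        for row in rows:
            legacy_id = row["agency_id"]
            canonical_id = f"{region}_{operator}_{legacy_id}"
            if canonical_id in seen:
                raise ValueError(f"Duplicate agency_id after namespacing: {canonical_id}")
            seen.add(canonical_id)
            crosswalk[(operator, legacy_id)] = canonical_id
            # Copy the row so the source feed's data is left untouched.
            merged.append({**row, "agency_id": canonical_id})
    return merged, crosswalk


# Illustrative input: both source feeds use the same legacy ID "1".
feeds = {
    "BUS": [{"agency_id": "1", "agency_name": "Operator A"}],
    "RAIL": [{"agency_id": "1", "agency_name": "Operator B"}],
}
merged, crosswalk = namespace_agency_ids(feeds, region="METRO")
print([row["agency_id"] for row in merged])  # ['METRO_BUS_1', 'METRO_RAIL_1']
```

The crosswalk is keyed by operator and legacy ID so the same mapping can be reused when rewriting agency_id references in other files of each source feed.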

CI/CD Integration & Automated Guardrails

Manual validation is insufficient for high-frequency feed updates. Embed agency checks directly into your CI/CD pipeline:

  • Pre-commit hooks: Run lightweight schema checks before agency.txt enters version control.
  • Export-triggered validation: Run the full validation pipeline on every feed export using GitHub Actions, GitLab CI, or Airflow DAGs.
  • Threshold enforcement: Block feed publication if agency_id count changes unexpectedly or if timezone/language codes deviate from the approved allowlist.
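
A threshold check of this kind might look like the sketch below. The function name check_agency_guardrails, its parameters, and the sample allowlists are hypothetical; the sketch also compares the full set of IDs rather than just the count, which is a stricter assumption than the bullet above.

```python
def check_agency_guardrails(previous_ids, current_rows, allowed_timezones, allowed_langs):
    """Return a list of problems; an empty list means the feed may be published.

    previous_ids: IDs from the last published feed (hypothetical input)
    current_rows: parsed rows of the new agency.txt, as dicts
    """
    problems = []
    previous_ids = set(previous_ids)
    current_ids = {row["agency_id"] for row in current_rows}

    # Any change in the set of IDs is treated as unexpected in this sketch.
    if current_ids != previous_ids:
        added = sorted(current_ids - previous_ids)
        removed = sorted(previous_ids - current_ids)
        problems.append(f"agency_id set changed (added={added}, removed={removed})")

    for row in current_rows:
        timezone, lang = row.get("agency_timezone"), row.get("agency_lang")
        if timezone not in allowed_timezones:
            problems.append(f"{row['agency_id']}: timezone {timezone!r} not in allowlist")
        if lang not in allowed_langs:
            problems.append(f"{row['agency_id']}: language {lang!r} not in allowlist")
    return problems


rows = [
    {"agency_id": "METRO_BUS", "agency_timezone": "America/New_York", "agency_lang": "en"},
    {"agency_id": "METRO_RAIL", "agency_timezone": "EST", "agency_lang": "en"},
]
problems = check_agency_guardrails(
    previous_ids={"METRO_BUS"},
    current_rows=rows,
    allowed_timezones={"America/New_York"},
    allowed_langs={"en", "es"},
)
for problem in problems:
    print(problem)
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so that the publication step is skipped.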

Automated guardrails catch drift before consumers ingest broken data. For teams managing frequent schedule updates, aligning validation with Agency Metadata and Feed Versioning Practices ensures backward compatibility and clean changelog generation.

Downstream Impact & Versioning Strategy

Validated agency metadata must propagate cleanly to GTFS-Realtime consumers, routing engines, and fare calculators. When agency_id drifts between static and realtime feeds, vehicle positions detach from scheduled trips, causing blank maps and ETA failures. Similarly, mismatched timezones break schedule interpolation during daylight-saving shifts.

To maintain consistency across updates, implement automated diffing and semantic versioning. Track metadata changes alongside route and stop updates, and publish changelogs that explicitly flag agency_id rotations or timezone corrections. Additionally, align your validation thresholds with the official GTFS Specification. The spec evolves, and consumer platforms increasingly reject feeds that omit formerly optional fields. Treat the specification as a living contract, not a minimum viable baseline.
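
As a rough sketch of automated diffing, the following compares two versions of agency.txt and suggests a version bump. The function names and the bump policy (ID additions, removals, and timezone changes count as breaking) are assumptions for illustration, not rules taken from the specification.

```python
def diff_agencies(old_rows, new_rows):
    """Compare two versions of agency.txt (lists of dicts keyed by column name).

    Returns a list of (kind, agency_id, field, before, after) tuples.
    """
    old = {row["agency_id"]: row for row in old_rows}
    new = {row["agency_id"]: row for row in new_rows}
    changes = []
    # An ID rotation shows up as one removal plus one addition.
    for agency_id in sorted(old.keys() - new.keys()):
        changes.append(("removed", agency_id, None, None, None))
    for agency_id in sorted(new.keys() - old.keys()):
        changes.append(("added", agency_id, None, None, None))
    for agency_id in sorted(old.keys() & new.keys()):
        fields = sorted(set(old[agency_id]) | set(new[agency_id]))
        for field in fields:
            before, after = old[agency_id].get(field), new[agency_id].get(field)
            if before != after:
                changes.append(("changed", agency_id, field, before, after))
    return changes


def suggest_bump(changes):
    """Hypothetical policy: ID or timezone changes are breaking ("major")."""
    if not changes:
        return None
    for kind, _agency_id, field, _before, _after in changes:
        if kind in ("added", "removed") or field == "agency_timezone":
            return "major"
    return "minor"


old = [{"agency_id": "METRO_BUS", "agency_name": "Metro Bus", "agency_timezone": "America/Chicago"}]
new = [{"agency_id": "METRO_BUS", "agency_name": "Metro Bus", "agency_timezone": "America/New_York"}]
changes = diff_agencies(old, new)
print(changes)
print(suggest_bump(changes))  # major
```

Each tuple maps directly onto a changelog line, so timezone corrections and ID rotations can be flagged explicitly in release notes.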

Quick Implementation Checklist