FGDC to ISO 19115 Conversion Pipelines

Migrating legacy geospatial metadata from the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) to the international ISO 19115/19139 standard is a foundational requirement for modern data portals, cross-agency interoperability, and automated cataloging. For GIS data managers, open-source maintainers, and Python automation builders, manual translation is unsustainable at scale. Implementing robust FGDC to ISO 19115 Conversion Pipelines ensures consistent schema alignment, preserves data provenance, and enables seamless publishing to enterprise catalogs and open data platforms.

This workflow sits at the core of Automated Metadata Generation & Schema Mapping, where programmatic transformation replaces error-prone manual editing. The following guide details a production-tested pipeline architecture, deterministic mapping logic, and validation routines tailored for government and open-source environments.

Prerequisites & Environment Setup

Before implementing the conversion pipeline, ensure your runtime environment meets baseline technical requirements. A reliable transformation stack depends on strict XML parsing, namespace resolution, and coordinate reference system (CRS) normalization.

  • Python 3.9+ with lxml, pyproj, pyyaml, and xmlschema installed via pip
  • FGDC XML inputs conforming to the FGDC CSDGM v2.0 specification
  • ISO 19115-2/19139 XSD files for structural validation (available from OGC/ISO repositories)
  • Namespace awareness: FGDC uses no default namespace, while ISO 19115 relies on http://www.isotc211.org/2005/gmd and http://www.isotc211.org/2005/gco
  • CRS lookup table mapping legacy FGDC spref (horizsys) definitions to EPSG codes or WKT strings

Configure your environment with strict XML parsing settings to prevent entity expansion vulnerabilities and ensure deterministic output. Use lxml.etree.XMLParser(resolve_entities=False, recover=False) to maintain schema fidelity during ingestion.
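The parser settings above can be wrapped in a small factory so that every stage shares one configuration. This is a minimal sketch: the function names make_strict_parser and parse_fgdc are illustrative, and no_network=True is an extra precaution added here, not something the original setup lists.

```python
from lxml import etree


def make_strict_parser() -> etree.XMLParser:
    """Return a parser that refuses to expand entities or repair broken XML."""
    return etree.XMLParser(
        resolve_entities=False,  # do not expand entity references
        recover=False,           # raise on malformed input instead of guessing
        no_network=True,         # assumption: never fetch external DTDs or entities
    )


def parse_fgdc(xml_bytes: bytes) -> etree._Element:
    """Parse FGDC bytes and return the root element, raising on malformed XML."""
    return etree.fromstring(xml_bytes, parser=make_strict_parser())
```

With recover=False, a malformed record raises etree.XMLSyntaxError, so the caller can reject and log it before any mapping logic runs.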

Pipeline Architecture & Step-by-Step Workflow

A reliable conversion pipeline follows a deterministic, stateless ETL pattern. Each stage isolates transformation logic, enabling unit testing, incremental debugging, and horizontal scaling for batch processing.

flowchart TD
    classDef ok fill:#d7efef,stroke:#0e7c86,color:#0a5d65;
    classDef bad fill:#fde0dd,stroke:#c0392b,color:#922b21;
    A(["FGDC CSDGM XML"]) --> B["Ingest & normalize (UTF-8, BOM)"]
    B --> WF{"Well-formed?"}
    WF -->|no| REJ["Reject & log"]
    WF -->|yes| C["Extract & flatten via XPath"]
    C --> D["Map & transform (crosswalk registry)"]
    D --> E["Enrich & default ISO fields"]
    E --> CRS{"CRS present?"}
    CRS -->|no| FB["Fallback to EPSG:4326 + warn"]
    CRS -->|yes| F["Serialize ISO 19139"]
    FB --> F
    F --> V{"Valid against XSD?"}
    V -->|no| LOG["Log errors for review"]
    V -->|yes| OUT["Export & route to staging"]
    class REJ,LOG bad
    class OUT ok

1. Ingest & Normalize

Load raw FGDC XML files, strip processing instructions, and normalize character encoding to UTF-8. Handle legacy BOM markers and non-standard line endings that frequently appear in archived metadata exports. Validate well-formedness before proceeding to prevent downstream XPath failures.
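One possible shape for the normalization step is sketched below, using only the standard library. It assumes that non-UTF-8 inputs are Windows-1252 (a guess that should be checked against your own archive), and the name normalize_to_utf8 is hypothetical. UTF-16 inputs are out of scope for this sketch.

```python
import codecs
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="..."?>
_XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def normalize_to_utf8(raw: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Strip a UTF-8 BOM, decode, unify line endings, and re-encode as UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: legacy exports that are not UTF-8 use the fallback encoding.
        text = raw.decode(fallback_encoding, errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace any existing declaration so the stated encoding matches the bytes.
    text = _XML_DECL.sub("", text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + text).encode("utf-8")
```

Rewriting the declaration matters: leaving encoding="ISO-8859-1" on bytes that are now UTF-8 would cause a strict parser to misread accented characters.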

2. Extract & Flatten

Traverse the FGDC tree using targeted XPath expressions. Extract core elements (idinfo, dataqual, spdoinfo, metainfo, distinfo) and flatten them into an intermediate dictionary or JSON structure. This decoupling step is critical: it allows you to apply business logic without wrestling with nested XML namespaces during the mapping phase.
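As a rough illustration of the flattening step, the sketch below maps dotted keys to element paths and collects the text of every match into a list. The FIELD_PATHS table is a small, hypothetical subset chosen for the example, not a complete FGDC inventory.

```python
from lxml import etree

# Hypothetical subset: dotted key -> path relative to the <metadata> root.
FIELD_PATHS = {
    "idinfo.citation.citeinfo.origin": "idinfo/citation/citeinfo/origin",
    "idinfo.citation.citeinfo.title": "idinfo/citation/citeinfo/title",
    "idinfo.citation.citeinfo.pubdate": "idinfo/citation/citeinfo/pubdate",
    "idinfo.keywords.theme.themekey": "idinfo/keywords/theme/themekey",
    "idinfo.spdom.bounding.westbc": "idinfo/spdom/bounding/westbc",
}


def flatten_fgdc(root: etree._Element) -> dict:
    """Collect non-empty text values for each configured path into lists."""
    flat = {}
    for key, path in FIELD_PATHS.items():
        values = [(el.text or "").strip() for el in root.findall(path)]
        values = [value for value in values if value]
        if values:  # absent or empty elements simply produce no key
            flat[key] = values
    return flat
```

Storing every value as a list keeps repeatable elements such as themekey and single-valued elements such as title in the same shape, which simplifies the mapping stage.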

3. Map & Transform

Apply a deterministic mapping registry to convert FGDC nodes to ISO 19115 equivalents. The mapping must explicitly handle cardinality mismatches:

  • One-to-Many: A single FGDC origin string often expands into several ISO CI_ResponsibleParty fields or entries (e.g., separate individualName and organisationName values, each carrying an explicit role).
  • Many-to-One: FGDC themekey and placekey arrays consolidate into ISO MD_Keywords with distinct thesaurusName URIs.
  • Type Coercion: Convert FGDC pubdate (YYYYMMDD or YYYY) to ISO 8601 CI_Date formats.

Use a declarative mapping dictionary rather than imperative if/else chains. This improves maintainability and enables schema versioning.

# Example mapping registry pattern: dotted FGDC source path -> dotted ISO 19115 target path
FGDC_TO_ISO_MAP = {
    "idinfo.citation.citeinfo.title": "identificationInfo.MD_DataIdentification.citation.CI_Citation.title",
    "idinfo.citation.citeinfo.pubdate": "identificationInfo.MD_DataIdentification.citation.CI_Citation.date.CI_Date.date",
    # The process description text lives in procdesc, a child of procstep
    "dataqual.lineage.procstep.procdesc": "dataQualityInfo.DQ_DataQuality.lineage.LI_Lineage.processStep.LI_ProcessStep.description",
}
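A registry like this is only useful with a small interpreter that applies it. The sketch below shows one way to do that against a flattened record; REGISTRY, set_path, and apply_registry are illustrative names, and the nested-dictionary output is an assumed intermediate form, not ISO XML.

```python
# Hypothetical two-entry registry so the example is self-contained.
REGISTRY = {
    "idinfo.citation.citeinfo.title":
        "identificationInfo.MD_DataIdentification.citation.CI_Citation.title",
    "idinfo.citation.citeinfo.pubdate":
        "identificationInfo.MD_DataIdentification.citation.CI_Citation.date.CI_Date.date",
}


def set_path(target: dict, dotted: str, value) -> None:
    """Create nested dictionaries along a dotted path and set the leaf value."""
    parts = dotted.split(".")
    node = target
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


def apply_registry(flat: dict, registry: dict):
    """Return (iso_tree, unmapped_keys) for a flattened FGDC record."""
    iso_tree = {}
    unmapped = []
    for key, value in flat.items():
        iso_path = registry.get(key)
        if iso_path is None:
            unmapped.append(key)  # surfaced for review, never silently dropped
            continue
        set_path(iso_tree, iso_path, value)
    return iso_tree, unmapped
```

Returning the unmapped keys alongside the result makes gaps in the crosswalk visible in logs, which helps when the registry is versioned and extended over time.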

4. Enrich & Default Population

ISO 19115 mandates several fields that FGDC either omits or treats as optional. Populate mandatory ISO elements with deterministic defaults when source data is absent:

  • MD_Metadata/fileIdentifier: Generate a UUID v4 or derive from dataset hash
  • language & characterSet: Default to eng and utf8
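  • hierarchyLevel: Set to dataset unless the source record explicitly describes a series or feature (FGDC has no direct equivalent element)
  • contact: Map FGDC metainfo/metc to MD_Metadata/contact/CI_ResponsibleParty with role pointOfContact; idinfo/ptcontac maps to the identification-level pointOfContact instead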

Missing CRS definitions should trigger a fallback to WGS 84 (EPSG:4326) with an explicit warning logged for manual review.
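The defaulting rules above might look like the following sketch. The record is assumed to be a flat dictionary with hypothetical keys (fileIdentifier, language, characterSet, hierarchyLevel, crs); the crs_assumed flag is an addition for downstream review, not part of either standard.

```python
import logging
import uuid

logger = logging.getLogger("fgdc2iso.enrich")


def enrich(record: dict) -> dict:
    """Return a copy of the record with ISO defaults filled in where absent."""
    out = dict(record)
    out.setdefault("fileIdentifier", str(uuid.uuid4()))
    out.setdefault("language", "eng")
    out.setdefault("characterSet", "utf8")
    out.setdefault("hierarchyLevel", "dataset")
    if out.get("crs"):
        out["crs_assumed"] = False
    else:
        # Fallback described above: assume WGS 84 and leave a trail for review.
        logger.warning(
            "No CRS in source record %s; falling back to EPSG:4326",
            out["fileIdentifier"],
        )
        out["crs"] = "EPSG:4326"
        out["crs_assumed"] = True
    return out
```

Because setdefault only fills missing keys, values that were present in the source are never overwritten by defaults.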

5. Serialize & Validate

Generate ISO-compliant XML using lxml.builder or xmlschema serialization. Validate the output against the official 19139 XSD before committing to storage. Capture structural warnings, missing mandatory nodes, and type mismatches in a structured log (JSON or CSV). For detailed schema conformance checks, see Validating FGDC metadata against XML schemas.
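To make the serialization step concrete, here is a minimal sketch that builds a namespaced gmd:MD_Metadata skeleton with lxml. It emits only a file identifier and a title, so the result is a fragment for illustration and would not pass full XSD validation; build_iso_skeleton and _char_string are hypothetical helper names.

```python
from lxml import etree

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
NSMAP = {"gmd": GMD, "gco": GCO}


def _char_string(parent, tag: str, text: str):
    """Append <gmd:tag><gco:CharacterString>text</gco:CharacterString></gmd:tag>."""
    element = etree.SubElement(parent, f"{{{GMD}}}{tag}")
    value = etree.SubElement(element, f"{{{GCO}}}CharacterString")
    value.text = text
    return element


def build_iso_skeleton(file_identifier: str, title: str) -> bytes:
    """Serialize a partial MD_Metadata document as UTF-8 bytes."""
    root = etree.Element(f"{{{GMD}}}MD_Metadata", nsmap=NSMAP)
    _char_string(root, "fileIdentifier", file_identifier)
    ident_info = etree.SubElement(root, f"{{{GMD}}}identificationInfo")
    data_ident = etree.SubElement(ident_info, f"{{{GMD}}}MD_DataIdentification")
    citation = etree.SubElement(data_ident, f"{{{GMD}}}citation")
    ci_citation = etree.SubElement(citation, f"{{{GMD}}}CI_Citation")
    _char_string(ci_citation, "title", title)
    return etree.tostring(
        root, xml_declaration=True, encoding="UTF-8", pretty_print=True
    )
```

Declaring the prefixes once through nsmap on the root keeps the output readable, with gmd: and gco: prefixes instead of auto-generated ns0: names.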

6. Export & Route

Write validated XML to a staging directory with atomic file operations (write to .tmp, rename on success). Trigger downstream routing via webhooks or message queues. If your architecture requires spatial profile alignment for open data portals, integrate with DCAT-AP Spatial Profile Mapping to ensure cross-walk compatibility with European and federal open data frameworks.
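The write-then-rename pattern can be sketched with the standard library alone. The helper name atomic_write is illustrative; the temporary file is created in the target directory so that the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, data: bytes) -> None:
    """Write data to a .tmp file beside target, then rename it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)   # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the partial file so the staging directory stays consistent.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Downstream consumers that watch the staging directory then see either the previous complete file or the new complete file, never a half-written one.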

Handling Schema Divergence & Edge Cases

Legacy FGDC metadata frequently contains structural anomalies that break naive converters. Address these proactively:

  • Bounding Box vs. Polygon: FGDC westbc, eastbc, northbc, southbc must be converted to ISO EX_GeographicBoundingBox. If FGDC provides spdom/dsgpoly with complex polygons, extract the ring coordinates and serialize them as EX_BoundingPolygon (a concrete EX_GeographicExtent) containing a gml:Polygon.
  • Temporal Ambiguity: FGDC dates often lack precision (YYYY or YYYYMM). ISO 19115 requires full CI_Date objects. Implement a parser that defaults to YYYY-01-01 for year-only inputs and flags low-confidence dates.
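  • Contact Role Mapping: FGDC cntinfo lacks explicit ISO CI_RoleCode values. Map ptcontac → pointOfContact, origin → originator, and distrib → distributor. CI_RoleCode has no generic catch-all value, so unmapped roles should fall back to a documented default such as pointOfContact, with the original FGDC context kept in a free-text note and logged for review.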
  • Cross-Reference Links: FGDC crossref elements map to ISO MD_AggregateInformation in ISO 19115:2003/19139 (MD_AssociatedResource in ISO 19115-1:2014) with associationType set to crossReference. Preserve original titles and URLs to maintain citation chains.
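For the temporal case in particular, a small parser can expand compact FGDC dates and record when precision was invented. The sketch below assumes inputs in the compact forms YYYY, YYYYMM, or YYYYMMDD; the function name and the (date, low_confidence) return shape are illustrative choices.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Compact FGDC-style date: year, then optional month, then optional day.
_FGDC_DATE = re.compile(r"^(\d{4})(\d{2})?(\d{2})?$")


def normalize_fgdc_date(value: str) -> Optional[Tuple[str, bool]]:
    """Return (ISO 8601 date, low_confidence) or None if the value is not a date."""
    match = _FGDC_DATE.match(value.strip())
    if not match:
        return None  # free text such as "Unknown" is left for manual review
    year, month, day = match.groups()
    low_confidence = month is None or day is None
    try:
        parsed = date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return None  # impossible dates such as month 13
    return parsed.isoformat(), low_confidence


print(normalize_fgdc_date("1998"))      # ('1998-01-01', True)
print(normalize_fgdc_date("19980531"))  # ('1998-05-31', False)
```

Keeping the low-confidence flag next to the padded value lets a reviewer or a later pipeline stage distinguish a real 1 January date from a year-only source.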

Validation, Linting & Continuous Integration

Production pipelines require automated quality gates. Integrate xmlschema validation into your CI/CD workflow to block non-compliant outputs. Implement a lightweight linter that checks:

  • Mandatory ISO 19115-1/2 elements
  • Controlled vocabulary compliance (e.g., MD_CharacterSetCode, MD_ScopeCode)
  • Coordinate system validity, by constructing the CRS with pyproj.CRS.from_user_input() and catching pyproj.exceptions.CRSError
  • Date format conformance to ISO 8601
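A linter along these lines could start as small as the sketch below. It works on a flat dictionary with hypothetical keys, the code lists are deliberately short illustrative subsets of MD_ScopeCode and MD_CharacterSetCode, and the CRS check is left out to keep the example free of third-party dependencies.

```python
import re
from datetime import date

_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Illustrative subsets only; load the full code lists in a real linter.
SCOPE_CODES = {"dataset", "series", "feature", "service", "tile"}
CHARSET_CODES = {"utf8", "utf16", "8859part1"}


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.match(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def lint_record(record: dict, required=("fileIdentifier", "contact", "dateStamp")) -> list:
    """Return a list of human-readable problems; an empty list means clean."""
    problems = [f"missing mandatory element: {key}" for key in required if not record.get(key)]
    scope = record.get("hierarchyLevel")
    if scope is not None and scope not in SCOPE_CODES:
        problems.append(f"hierarchyLevel not in code list: {scope}")
    charset = record.get("characterSet")
    if charset is not None and charset not in CHARSET_CODES:
        problems.append(f"characterSet not in code list: {charset}")
    stamp = record.get("dateStamp")
    if stamp and not is_iso_date(stamp):
        problems.append(f"dateStamp is not a valid ISO 8601 date: {stamp}")
    return problems
```

In a CI job, a non-empty result from lint_record can fail the build, which gives the same blocking behaviour as schema validation for checks that an XSD cannot express.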

Pair validation with a templating engine to standardize boilerplate across datasets. When building reusable scaffolds for new projects, reference ISO 19115 Metadata Template Generation to enforce consistent header structures, contact blocks, and licensing statements.

Performance & Scaling Considerations

For agency-scale deployments processing thousands of FGDC records, optimize the pipeline for throughput and memory efficiency:

  • Streaming Parsing: Use lxml.etree.iterparse() to avoid loading multi-megabyte XML files entirely into memory.
  • Connection Pooling: If routing to remote catalogs (GeoNetwork, CKAN, ArcGIS Enterprise), reuse pooled HTTP connections, batch API calls, and implement exponential backoff on transient failures.
  • Idempotency: Design the pipeline to be re-runnable without duplicating records. Hash input FGDC files and skip processing if the target ISO output already matches the expected checksum.
  • Parallel Execution: Use concurrent.futures.ProcessPoolExecutor for CPU-bound mapping stages, keeping I/O operations in a separate thread pool.
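The idempotency point can be illustrated with a checksum manifest. This sketch takes a slightly simpler route than comparing output checksums: it records the SHA-256 of each input after a successful conversion and skips inputs whose hash is unchanged. The manifest layout and function names are assumptions for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large records do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Read the manifest, or start empty on the first run."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def needs_conversion(source: Path, manifest: dict) -> bool:
    """True when the input is new or has changed since the last successful run."""
    return manifest.get(str(source)) != sha256_of(source)


def record_conversion(source: Path, manifest: dict, manifest_path: Path) -> None:
    """Store the input hash once the converted output has been written."""
    manifest[str(source)] = sha256_of(source)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Calling record_conversion only after the export stage succeeds means a failed or interrupted run is retried automatically on the next pass.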

Conclusion & Next Steps

Implementing a structured FGDC to ISO 19115 conversion pipeline eliminates manual translation bottlenecks, enforces schema compliance, and future-proofs geospatial assets for modern cataloging ecosystems. By isolating ingestion, mapping, enrichment, and validation into discrete, testable stages, teams can scale metadata operations across legacy archives and real-time data feeds.

Next steps include integrating automated linting into your CI/CD pipeline, establishing cross-walks for downstream spatial profiles, and deploying monitoring dashboards to track conversion success rates and schema drift over time.