Metadata Schema Validation and Linting for Geospatial Workflows

In modern spatial data infrastructure, metadata integrity is the foundation of licensing compliance, catalog interoperability, and automated discovery. Metadata Schema Validation and Linting operates as the critical quality gate between dataset generation and publication. For GIS data managers, open-source maintainers, and government technology teams, enforcing structural and semantic rules prevents downstream failures in Spatial Data Infrastructure (SDI) deployments, data marketplace ingestion, and automated licensing attribution. This guide details a production-ready validation workflow, integrating rule-based linting, formal schema conformance checks, and CI/CD reporting patterns that scale across multi-format geospatial catalogs.

Prerequisites

Before implementing a validation pipeline, ensure your environment meets the following baseline requirements:

  • Python 3.9+ with pip or uv for deterministic dependency resolution
  • Target schema definitions: ISO 19115-1/2 XML schemas, FGDC CSDGM DTDs, or DCAT-AP JSON-LD contexts
  • Validation libraries: xmlschema, jsonschema, lxml, pydantic (for typed linting), and pytest for regression suites
  • Base metadata artifacts: Reference templates or previously validated records to establish organizational profiles
  • CI/CD runner access: GitHub Actions, GitLab CI, or Jenkins for automated gatekeeping

Organizations typically begin this process after establishing foundational Automated Metadata Generation & Schema Mapping practices, ensuring that generated records align with institutional data governance policies before validation rules are applied. Without a clear mapping strategy, validation becomes a reactive cleanup exercise rather than a proactive quality control mechanism.

Step 1: Define Validation Profiles and Rule Sets

Geospatial metadata rarely operates under a single monolithic standard. Agencies and open-source projects typically enforce a custom profile that extends or restricts base specifications. Begin by documenting mandatory, conditional, and optional elements specific to your licensing and publication requirements.

Map your profile against authoritative references such as ISO 19115-1:2014 Geographic information — Metadata — Part 1: Fundamentals for international compliance, or the FGDC Content Standard for Digital Geospatial Metadata (CSDGM) for U.S. federal workflows. Define rule categories:

  • Structural: Required elements, cardinality constraints, namespace declarations, and encoding validation
  • Semantic: Controlled vocabulary enforcement, date format validation (ISO 8601), CRS identifier verification (EPSG/OGC URNs)
  • Licensing: SPDX identifier presence, attribution URI format, usage restriction tags, and embargo period validation

Store these rules in a version-controlled YAML or JSON configuration file. This enables profile switching across environments (e.g., dev, staging, prod) and allows non-technical stewards to update business rules without modifying validation code.
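As a minimal sketch of such a configuration, the snippet below loads a JSON profile and resolves one level of inheritance between environments. The key names (`profiles`, `extends`, `lint_rules`, `required_elements`) are assumptions made for illustration and are not taken from any standard; only `lint_rules` and `require_license` match the linter shown later in this guide.

```python
import json
from typing import Any, Dict

# Hypothetical profile document; every key name here is an assumption.
PROFILE_JSON = """
{
  "version": "1.2.0",
  "profiles": {
    "base": {
      "lint_rules": {"require_license": true, "date_format": "YYYY-MM-DD"},
      "required_elements": ["title", "abstract", "license"]
    },
    "dev": {
      "extends": "base",
      "lint_rules": {"require_license": false}
    }
  }
}
"""


def load_profile(raw: str, name: str) -> Dict[str, Any]:
    """Return the named profile with its parent's settings merged in."""
    profiles = json.loads(raw)["profiles"]
    profile = profiles[name]
    parent_name = profile.get("extends")
    if parent_name is None:
        return dict(profile)
    parent = load_profile(raw, parent_name)
    own = {k: v for k, v in profile.items() if k != "extends"}
    merged = {**parent, **own}
    # Merge lint_rules key by key so a child overrides only what it names.
    merged["lint_rules"] = {
        **parent.get("lint_rules", {}),
        **profile.get("lint_rules", {}),
    }
    return merged


dev = load_profile(PROFILE_JSON, "dev")
```

Here the `dev` profile relaxes the license requirement but inherits everything else from `base`, so stewards edit one file and each environment picks up only the overrides it declares.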

Step 2: Implement Structural Linting

Linting precedes formal schema validation. It catches syntactic irregularities, encoding mismatches, and organizational conventions before expensive schema parsing begins. A fast linter should run in milliseconds and provide actionable feedback for developers and data curators.

flowchart TD
    classDef ok fill:#d7efef,stroke:#0e7c86,color:#0a5d65;
    classDef bad fill:#fde0dd,stroke:#c0392b,color:#922b21;
    A(["Generated metadata record"]) --> P["Load validation profile"]
    P --> L["Structural linting (fast)"]
    L --> LQ{"Lint passes?"}
    LQ -->|no| E1["Exit 1: lint failure"]
    LQ -->|yes| S["Schema conformance (cached XSD / JSON Schema)"]
    S --> SQ{"Schema valid?"}
    SQ -->|no| E2["Exit 2: schema violations"]
    SQ -->|yes| OK["Exit 0: success"]
    E1 --> R["Structured report + PR annotations"]
    E2 --> R
    class OK ok
    class E1,E2 bad

Common linting targets include:

  • XML well-formedness and namespace prefix consistency
  • JSON syntax validity and trailing comma detection
  • UTF-8 encoding verification and BOM stripping
  • Controlled vocabulary checks against cached authority lists
  • Date and coordinate format regex validation
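The last two targets can be sketched with plain regular expressions and a set lookup. The record keys (`crs`, `bbox`, `topic`), the accepted CRS spellings, and the three-entry vocabulary are assumptions for illustration; a real profile would load its authority lists from the cached files described above.

```python
import re
from typing import Dict, List, Set

# Assumed CRS spellings: "EPSG:4326" or an OGC URN such as
# "urn:ogc:def:crs:EPSG::4326" (the version segment may be empty).
CRS_PATTERN = re.compile(r"^(EPSG:\d{4,6}|urn:ogc:def:crs:EPSG:[\d.]*:\d{4,6})$")
# Four comma-separated decimal numbers: west,south,east,north.
BBOX_PATTERN = re.compile(r"^-?\d+(\.\d+)?(,-?\d+(\.\d+)?){3}$")
# A tiny illustrative subset of topic categories, not a complete list.
TOPIC_VOCAB = {"boundaries", "elevation", "transportation"}


def lint_formats(record: Dict[str, str], topic_vocab: Set[str]) -> List[str]:
    """Collect format and vocabulary problems without parsing any schema."""
    errors = []
    crs = record.get("crs", "")
    if not CRS_PATTERN.match(crs):
        errors.append(f"Unrecognized CRS identifier: {crs!r}")
    bbox = record.get("bbox", "")
    if not BBOX_PATTERN.match(bbox):
        errors.append(f"Malformed bounding box: {bbox!r}")
    topic = record.get("topic", "")
    if topic not in topic_vocab:
        errors.append(f"Topic {topic!r} is not in the controlled vocabulary")
    return errors
```

These checks only confirm that values look right; whether an EPSG code actually exists in the registry is a separate lookup against a cached copy.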

When pulling metadata directly from spatial databases, such as in Automating metadata extraction from PostGIS tables, structural linting catches malformed outputs before they hit the validation engine. Database exports often contain null bytes, truncated strings, or improperly escaped characters that break XML parsers.
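A small pre-parse check for the null bytes and stray control characters mentioned above might look like the following. It is a sketch that covers only the C0 control range forbidden by XML 1.0; the function name is hypothetical.

```python
import re
from typing import List

# XML 1.0 forbids C0 control characters other than tab, newline,
# and carriage return.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_chars(text: str) -> List[str]:
    """Report each forbidden control character with its line number."""
    errors = []
    for match in ILLEGAL_XML_CHARS.finditer(text):
        line = text.count("\n", 0, match.start()) + 1
        code = ord(match.group())
        errors.append(f"Illegal control character U+{code:04X} on line {line}")
    return errors
```

Running this before the XML parser turns an opaque parse failure into a message that points at the offending line of the export.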

Below is a compact Python linting module that handles both JSON and XML inputs and returns every problem it finds as a list of error strings:

import json
import re
from pathlib import Path
from lxml import etree
from typing import List, Dict, Any

class MetadataLinter:
    def __init__(self, config: Dict[str, Any]):
        self.rules = config.get("lint_rules", {})

    def validate_encoding(self, file_path: Path) -> List[str]:
        errors = []
        try:
            with open(file_path, "rb") as f:
                content = f.read()
            content.decode("utf-8")
        except UnicodeDecodeError as e:
            errors.append(f"Encoding violation: {e}")
        return errors

    def lint_json(self, data: str) -> List[str]:
        errors = []
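        try:
            parsed = json.loads(data)
        except json.JSONDecodeError as e:
            return [f"JSON syntax error: {e}"]

        # The field checks below call .get(), so they need a JSON object
        if not isinstance(parsed, dict):
            return ["Top-level JSON value must be an object"]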

        if self.rules.get("require_license"):
            if not parsed.get("license"):
                errors.append("Missing required 'license' field")

        date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
        if parsed.get("date") and not date_pattern.match(str(parsed["date"])):
            errors.append("Invalid date format. Expected YYYY-MM-DD")

        return errors

    def lint_xml(self, file_path: Path) -> List[str]:
        errors = []
        try:
            parser = etree.XMLParser(recover=False, resolve_entities=False)
            etree.parse(str(file_path), parser)
        except etree.XMLSyntaxError as e:
            errors.append(f"XML well-formedness failure: {e}")
        return errors

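The linting stage should fail fast: if it reports any errors, the pipeline stops before schema validation runs. Within the stage, collect every error into a structured array so curators can fix them in one pass. Keep lint checks cheap and local (syntax, encoding, formats, cached vocabulary lookups), and reserve structure, data type, and cardinality checks for schema conformance.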
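One way to combine fail-fast behavior with structured output is to run lint stages in order and stop at the first stage that reports anything. This is a sketch under assumptions: the `LintIssue` fields and the `run_stages` helper are hypothetical names, and each stage is any zero-argument callable that returns a list of messages.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[str, Callable[[], List[str]]]


@dataclass
class LintIssue:
    stage: str
    message: str
    severity: str = "error"


def run_stages(stages: Sequence[Stage]) -> List[LintIssue]:
    """Run stages in order and return the issues of the first failing stage."""
    for name, check in stages:
        messages = check()
        if messages:
            # Later stages are skipped: their results would be unreliable
            # once an earlier stage has failed.
            return [LintIssue(stage=name, message=m) for m in messages]
    return []


issues = run_stages([
    ("encoding", lambda: []),
    ("syntax", lambda: ["Unexpected token at line 3", "Unclosed element"]),
    ("fields", lambda: ["never evaluated"]),
])
```

All messages from the failing stage are kept, so a curator sees every syntax problem at once, while the field checks wait until the document parses.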

Step 3: Execute Formal Schema Conformance Checks

Once structural linting passes, the pipeline proceeds to strict schema validation. This step verifies that the metadata document conforms to the exact structure, data types, and cardinality rules defined by the governing standard.

For XML-based workflows (ISO 19115, FGDC), use xmlschema or lxml with XSD/DTD validation. For JSON/JSON-LD workflows (DCAT-AP, custom profiles), jsonschema validates documents against JSON Schema, including Draft 7 and Draft 2020-12, while pydantic validates records against typed Python models rather than against a schema file. Refer to the official jsonschema documentation for advanced features like $ref resolution and custom format validators.

Teams generating baseline records via ISO 19115 Metadata Template Generation or migrating legacy records through FGDC to ISO 19115 Conversion Pipelines must verify conformance against authoritative XSD/DTD definitions before publication. Conversion processes frequently introduce namespace collisions, drop conditional elements, or misalign hierarchical relationships.

A reliable validation function should:

  1. Load the target schema from a cached local directory (never fetch over the network in production)
  2. Validate the document and collect all errors, not just the first failure
  3. Map schema errors to human-readable guidance
  4. Return a structured report for CI/CD consumption

import json
from typing import Any, Dict

import xmlschema
from jsonschema import Draft202012Validator

def validate_against_schema(doc: str, schema_path: str, doc_type: str = "json") -> Dict[str, Any]:
    errors = []

    if doc_type == "xml":
        schema = xmlschema.XMLSchema(schema_path)
        # iter_errors yields every validation failure, not just the first
        for e in schema.iter_errors(doc):
            errors.append({
                "path": e.path,
                "message": str(e),
                "severity": "error"
            })
    elif doc_type == "json":
        with open(schema_path, "r", encoding="utf-8") as f:
            schema = json.load(f)
        validator = Draft202012Validator(schema)
        # iter_errors collects all schema violations across the document
        for e in validator.iter_errors(json.loads(doc)):
            errors.append({
                "path": list(e.absolute_path),
                "message": e.message,
                "severity": "error"
            })
    else:
        # An unknown type must not fall through and report "valid"
        raise ValueError(f"Unsupported doc_type: {doc_type!r}")

    return {"valid": len(errors) == 0, "errors": errors}
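The function above returns raw validator messages. The third item in the list, human-readable guidance, could be layered on top of its report as sketched below. The message patterns are assumptions modeled on typical jsonschema wording and the advice strings are placeholders; tune both to your validator and profile.

```python
import re
from typing import Any, Dict

DEFAULT_HINT = "See the schema documentation for this element."

# Assumed message patterns paired with placeholder advice.
GUIDANCE = [
    (re.compile(r"is a required property"),
     "Add the missing element to the record or to its source template."),
    (re.compile(r"is not of type"),
     "Check the value's data type against the profile."),
    (re.compile(r"is not one of"),
     "Use a value from the controlled vocabulary."),
]


def add_guidance(report: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the report with a 'guidance' field on each error."""
    enriched = []
    for err in report["errors"]:
        hint = next(
            (advice for pattern, advice in GUIDANCE
             if pattern.search(err["message"])),
            DEFAULT_HINT,
        )
        enriched.append({**err, "guidance": hint})
    return {**report, "errors": enriched}
```

Keeping the mapping in a separate step leaves the validator output untouched, which makes it easier to update the advice without re-running validation.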

Always cache external schema references locally. Network timeouts during CI runs cause spurious failures on valid records and block deployments. Maintain a schemas/ directory in your repository and update it via a scheduled maintenance job.
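A sketch of the local-only lookup, assuming schemas are stored flat in one directory under their original file names. The helper name `resolve_local_schema` is hypothetical; it maps a reference to a cached file and raises if the file is missing instead of falling back to the network.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_local_schema(reference: str, schema_dir: Path) -> Path:
    """Map a schema reference to a file under schema_dir, never to the network."""
    parsed = urlparse(reference)
    if parsed.scheme in ("http", "https"):
        # For URLs, keep only the final path segment and look it up locally.
        filename = Path(parsed.path).name
    else:
        filename = Path(reference).name
    candidate = schema_dir / filename
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Schema {filename!r} is not cached in {schema_dir}"
        )
    return candidate
```

A missing file then surfaces as a clear configuration error for the maintenance job to fix, not as an intermittent timeout.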

Step 4: Integrate CI/CD Gatekeeping and Reporting

Validation must run automatically on every pull request, merge, and scheduled release. CI/CD integration transforms metadata quality from a manual review bottleneck into an automated, auditable process.

A typical GitHub Actions workflow should:

  • Trigger on push and pull_request targeting metadata directories
  • Run the linter first; exit immediately on failure
  • Run schema validation only if linting passes
  • Generate a SARIF or JSON report for PR annotations
  • Block merges if critical errors remain
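
name: Metadata Validation Pipeline
on:
  push:
    branches: [main]  # assumes the default branch is named main
    paths: ['metadata/**', 'schemas/**', '.github/workflows/metadata-validation.yml']
  pull_request:
    paths: ['metadata/**', 'schemas/**', '.github/workflows/metadata-validation.yml']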

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install xmlschema jsonschema lxml pydantic
      - name: Run Linter
        run: python -m tools.lint --config profiles/default.yaml --input metadata/
      - name: Run Schema Validation
        run: python -m tools.validate --schema schemas/iso19115-3.xsd --input metadata/
      - name: Upload Validation Report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metadata-validation-report
          path: reports/validation.json

Configure exit codes strategically: 0 for success, 1 for lint failures, 2 for schema violations. Use PR comment bots to post structured error summaries directly on changed files. This reduces reviewer context-switching and accelerates merge cycles.
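The exit-code convention and per-file annotations can be sketched as two small helpers. The function names are hypothetical; the `::error file=...::message` line follows the GitHub Actions workflow command format, and a CLI would print it to stdout and pass the code to `sys.exit`.

```python
from typing import Dict, List

EXIT_OK, EXIT_LINT, EXIT_SCHEMA = 0, 1, 2


def exit_code(lint_errors: List[str], schema_errors: List[Dict]) -> int:
    """Lint failures take precedence because schema checks are skipped then."""
    if lint_errors:
        return EXIT_LINT
    if schema_errors:
        return EXIT_SCHEMA
    return EXIT_OK


def annotation(file_path: str, message: str) -> str:
    """Format a workflow command that attaches an error to a file."""
    # A raw newline would end the command early, so percent-encode it.
    safe = (
        message.replace("%", "%25")
        .replace("\r", "%0D")
        .replace("\n", "%0A")
    )
    return f"::error file={file_path}::{safe}"
```

Distinct codes let downstream steps tell a malformed file from a schema violation without parsing log output.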

Step 5: Maintain and Evolve Validation Rules

Metadata standards evolve. ISO 19115-3 introduces modular profiles, DCAT-AP v3 expands spatial predicates, and licensing frameworks update SPDX license lists. Your validation pipeline must adapt without breaking existing catalogs.

Implement a rule versioning strategy:

  • Tag configuration files with semantic versions (v1.2.0-profile.yaml)
  • Maintain a golden dataset of known-valid records for regression testing
  • Run validation suites against historical metadata during standard upgrades
  • Deprecate rules gradually with warning phases before enforcing strict failures
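The warning phase in the last bullet could be modeled by giving each rule a date from which it blocks merges. The rule identifiers and dates below are hypothetical placeholders.

```python
from datetime import date
from typing import List

# Hypothetical rule registry: each rule only warns until enforce_from.
RULES = {
    "require_spdx_license": {"enforce_from": date(2025, 1, 1)},
    "require_crs_urn": {"enforce_from": date(2026, 1, 1)},
}


def severity_for(rule_id: str, today: date) -> str:
    """Return 'warning' during the grace period and 'error' afterwards."""
    return "error" if today >= RULES[rule_id]["enforce_from"] else "warning"


def blocking(violated_rules: List[str], today: date) -> List[str]:
    """Keep only the violations that should fail the build today."""
    return [r for r in violated_rules if severity_for(r, today) == "error"]
```

Passing the date in as an argument keeps the behavior testable and lets a release branch pin the enforcement date it was validated against.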

Automated regression testing ensures that new rules don’t invalidate previously compliant records. Use pytest with parameterized fixtures:

import pytest
from pathlib import Path
from tools.validate import validate_against_schema

# sorted() keeps the test order stable across platforms
@pytest.mark.parametrize("metadata_file", sorted(Path("tests/fixtures/valid/").glob("*.xml")))
def test_known_valid_records(metadata_file):
    result = validate_against_schema(metadata_file.read_text(encoding="utf-8"), "schemas/iso19115.xsd", "xml")
    assert result["valid"], f"Regression failure in {metadata_file.name}: {result['errors']}"

Schedule quarterly reviews of controlled vocabularies, CRS registries, and license identifiers. Automate vocabulary updates via scripts that fetch the latest EPSG registry or SPDX license list, run diff checks, and open pull requests for curator approval.
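The diff check at the heart of such a script is a set comparison. The sketch below leaves out fetching entirely and assumes both lists are already loaded as identifier strings; the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def diff_vocabulary(cached: Iterable[str], latest: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the cached list with a freshly fetched one."""
    cached_set, latest_set = set(cached), set(latest)
    return {
        "added": sorted(latest_set - cached_set),
        "removed": sorted(cached_set - latest_set),
    }


changes = diff_vocabulary(
    cached=["CC-BY-4.0", "MIT", "ODbL-1.0"],
    latest=["CC-BY-4.0", "CC0-1.0", "ODbL-1.0"],
)
```

Removed identifiers deserve the curator's attention first, since existing records may still reference them.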

Best Practices and Common Pitfalls

  • Separate linting from validation: Linting is heuristic and fast; validation is strict and schema-bound. Never merge them into a single monolithic check.
  • Cache all external references: Network-dependent validation causes flaky CI and deployment blocks. Mirror XSD, JSON Schema, and vocabulary files locally.
  • Validate encoding early: Geospatial metadata often contains non-ASCII characters in titles, abstracts, or contact information. UTF-8 validation must precede XML/JSON parsing.
  • Use strict JSON Schema modes: Enable additionalProperties: false in DCAT-AP profiles to catch undocumented fields that indicate mapping drift.
  • Log validation context: Include file paths, schema versions, and profile names in error reports. Debugging metadata failures without context wastes engineering hours.
  • Handle conditional cardinality gracefully: Standards like ISO 19115 use complex if-then rules (e.g., if distributionFormat exists, then transferOptions is mandatory). Implement custom validators for these logic branches rather than relying solely on base XSD.
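A custom validator for the conditional rule in the last bullet might look like this sketch. It uses simplified, namespace-free element names (`distribution`, `distributionFormat`, `transferOptions`) as an assumption; real ISO records are namespaced and would need qualified names.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_conditional(xml_text: str) -> List[str]:
    """Flag distribution blocks that declare a format but no transfer options."""
    root = ET.fromstring(xml_text)
    errors = []
    # The rule applies per distribution block, so check each one separately.
    for index, dist in enumerate(root.iter("distribution"), start=1):
        has_format = dist.find("distributionFormat") is not None
        has_transfer = dist.find("transferOptions") is not None
        if has_format and not has_transfer:
            errors.append(
                f"distribution #{index}: distributionFormat is present "
                "but transferOptions is missing"
            )
    return errors
```

Because the check returns the same list-of-messages shape as the linter, its output can be merged into the same report.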

Conclusion

Metadata Schema Validation and Linting transforms geospatial data governance from an afterthought into an automated, auditable pipeline. By separating fast structural linting from strict schema conformance, caching authoritative references, and embedding validation directly into CI/CD workflows, organizations prevent catalog pollution, ensure licensing compliance, and maintain interoperability across SDI ecosystems. Start with a minimal linting profile, integrate schema validation incrementally, and establish regression testing to future-proof your metadata infrastructure. As data volumes grow and standards evolve, automated validation remains the only scalable path to trustworthy spatial data publishing.