Metadata Schema Validation and Linting for Geospatial Workflows

Metadata integrity is the foundation of licensing compliance, catalog interoperability, and automated discovery in spatial data infrastructure. Without a systematic quality gate between dataset generation and publication, malformed records propagate through SDI catalogs, break federated harvesting, and introduce attribution gaps that create downstream licensing exposure. This page details a production-ready validation workflow — covering rule-based linting, formal schema conformance checks, and CI/CD reporting patterns — that scales across multi-format geospatial catalogs. It sits within the broader Automated Metadata Generation & Schema Mapping framework, where generation and validation are two halves of the same compliance loop.

Prerequisites

Before implementing a validation pipeline, ensure your environment meets the following baseline requirements:

Python 3.11+ with pip or uv for deterministic dependency resolution
Target schema definitions: ISO 19115-1/2 XML schemas (XSD), FGDC CSDGM DTDs, or DCAT-AP JSON-LD contexts — stored locally in a schemas/ directory, not fetched at runtime
Validation libraries: xmlschema>=3.3, jsonschema>=4.21, lxml>=5.2, pydantic>=2.7 for typed linting, and pytest>=8.2 for regression suites
Base metadata artifacts: Reference templates or previously validated records to establish an organizational profile baseline
CI/CD runner access: GitHub Actions, GitLab CI, or Jenkins for automated gatekeeping on every pull request
Schema familiarity: Working knowledge of ISO 19115 metadata template generation output structure and DCAT-AP spatial profile mapping field conventions

Validation Pipeline Overview

The pipeline separates fast heuristic linting from strict schema conformance. Each stage has its own exit code so CI can surface precise failure reasons: 1 for a lint failure, 2 for a schema violation, and 0 when both stages pass.
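The staging can be sketched as a small driver that stops at the first failing stage. This is a minimal illustration, not the pipeline's actual entry point: `run_pipeline`, `toy_lint`, and `toy_validate` are hypothetical names, and the toy checks stand in for a real linter and schema validator.

```python
from typing import Callable, List

# Exit codes follow the convention used in this guide.
EXIT_OK, EXIT_LINT, EXIT_SCHEMA = 0, 1, 2

# A check takes a document and returns a list of error messages.
Check = Callable[[str], List[str]]


def run_pipeline(doc: str, lint: Check, validate: Check) -> int:
    """Run linting first; run schema validation only if linting is clean."""
    if lint(doc):
        return EXIT_LINT
    if validate(doc):
        return EXIT_SCHEMA
    return EXIT_OK


# Hypothetical stand-in checks, for illustration only.
def toy_lint(doc: str) -> List[str]:
    return [] if doc.strip() else ["empty document"]


def toy_validate(doc: str) -> List[str]:
    return [] if "<title>" in doc else ["missing <title>"]
```

A command-line wrapper would typically pass the returned value to `sys.exit()` so the CI runner sees the stage-specific code.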

Concept and Spec Reference

Validation targets vary by the metadata standard governing your catalog. The four standards below each impose distinct structural and semantic constraints:

Standard	Encoding	Validator	Key Constraints
ISO 19115-1:2014	XML (ISO 19115-3)	XSD via `xmlschema`	Root `MD_Metadata` with mandatory `contact`, `dateInfo`, `identificationInfo`; `defaultLocale` replaces the 2003 edition's `language` and `characterSet`, and `dateInfo` replaces `dateStamp`
FGDC CSDGM	XML (DTD)	DTD via `lxml`	Mandatory: `idinfo`, `metainfo`, `distinfo`; date format YYYYMMDD
DCAT-AP v3	JSON-LD / RDF	JSON Schema via `jsonschema`	Mandatory: `dct:title`, `dct:description`, `dct:publisher`, `dcat:distribution`; spatial extent via `locn:geometry`
STAC 1.0	GeoJSON	JSON Schema	Required: `stac_version`, `id`, `geometry`, `bbox`, `links`, `assets`, `properties.datetime`

Licensing metadata fields deserve explicit attention. For ISO 19115, MD_Constraints and its subtypes MD_LegalConstraints and MD_SecurityConstraints carry SPDX identifiers and usage restrictions. For DCAT-AP, dct:license must reference a recognised license URI. In FGDC to ISO 19115 conversion pipelines, these fields are the most frequent source of post-conversion schema failures because FGDC’s accconst and useconst free-text fields do not map cleanly to structured ISO constraint elements.
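As one way to check the DCAT-AP side of this, the sketch below tests `dct:license` against a cached allow-list of license URIs. The function name, the two-entry allow-list, and the assumption that the record is a parsed JSON-LD dict are all illustrative; a real profile would load the list from a version-controlled authority file.

```python
from typing import Any, Dict, List, Set

# Hypothetical allow-list; in practice, load this from a cached authority file.
RECOGNISED_LICENSE_URIS: Set[str] = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://opendatacommons.org/licenses/odbl/1-0/",
}


def check_dcat_license(record: Dict[str, Any]) -> List[str]:
    """Return error messages for a missing or unrecognised dct:license."""
    value = record.get("dct:license")
    if not value:
        return ["dct:license is missing"]
    # JSON-LD may express the license as {"@id": "<uri>"} rather than a string.
    if isinstance(value, dict):
        value = value.get("@id")
    if not isinstance(value, str) or value not in RECOGNISED_LICENSE_URIS:
        return [f"dct:license is not a recognised URI: {value!r}"]
    return []
```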

Implementation Walkthrough

Step 1: Define Validation Profiles and Rule Sets

Geospatial metadata rarely operates under a single monolithic standard. Agencies and open-source projects often enforce a custom profile that extends or restricts a base specification. Begin by documenting the mandatory, conditional, and optional elements specific to your licensing and publication requirements.

Define rule categories:

Structural: Required elements, cardinality constraints, namespace declarations, encoding validation
Semantic: Controlled vocabulary enforcement, date format validation (ISO 8601), CRS identifier verification (EPSG/OGC URNs)
Licensing: SPDX identifier presence, attribution URI format, usage restriction tags, embargo period validation
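Two of the semantic rules above lend themselves to short helper functions. The sketch below is illustrative: the helper names and the regular expression are assumptions, and the CRS check only confirms the shape of an identifier, not that the code exists in the EPSG registry.

```python
import re
from datetime import date

# Accepts "EPSG:4326" or OGC URNs such as "urn:ogc:def:crs:EPSG::4326"
# and "urn:ogc:def:crs:OGC:1.3:CRS84". Shape check only.
CRS_PATTERN = re.compile(r"EPSG:\d+|urn:ogc:def:crs:(?:EPSG|OGC):[\w.]*:\w+")


def is_valid_crs_id(value: str) -> bool:
    """Return True if value looks like an EPSG code or an OGC CRS URN."""
    return CRS_PATTERN.fullmatch(value) is not None


def is_iso_date(value: str) -> bool:
    """Return True if value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

Unlike a pure regex, `date.fromisoformat` also rejects impossible dates such as 30 February.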

Store these rules in a version-controlled YAML or JSON configuration file. This enables profile switching across environments (dev, staging, prod) and allows non-technical data stewards to update business rules without modifying validation code.

# profiles/default.yaml (loaded as a dict)
lint_rules:
  require_license: true
  require_spdx: true
  date_format: "YYYY-MM-DD"
  max_abstract_length: 4000
  crs_registry: ["EPSG", "OGC"]

schema_rules:
  standard: "iso19115-3"
  xsd_path: "schemas/iso19115-3-2016/mdb/1.3/mdb.xsd"
  fail_on_warnings: false
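Once the profile has been parsed into a dict, a typed wrapper can catch misspelled rule names early. The following is a stdlib-only sketch under that assumption; `LintRules` and `load_lint_rules` are hypothetical names, and the defaults simply mirror the example profile. It rejects unknown keys but does not check value types, which a library such as pydantic would add.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass(frozen=True)
class LintRules:
    require_license: bool = True
    require_spdx: bool = True
    date_format: str = "YYYY-MM-DD"
    max_abstract_length: int = 4000
    crs_registry: List[str] = field(default_factory=lambda: ["EPSG", "OGC"])


def load_lint_rules(profile: Dict[str, Any]) -> LintRules:
    """Build typed rules from a parsed profile, rejecting unknown keys."""
    raw = profile.get("lint_rules", {})
    known = {f.name for f in fields(LintRules)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"Unknown lint rule(s): {unknown}")
    return LintRules(**raw)
```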

Step 2: Implement Structural Linting

Linting precedes formal schema validation. It catches syntactic irregularities, encoding mismatches, and organisational convention violations before expensive schema parsing begins. A fast linter runs in milliseconds and provides actionable feedback for developers and data curators.

Common linting targets:

XML well-formedness and namespace prefix consistency
JSON syntax validity and trailing comma detection
UTF-8 encoding verification and BOM stripping
Controlled vocabulary checks against cached authority lists
Date and coordinate format regex validation

When pulling metadata from spatial databases — as in automating metadata extraction from PostGIS tables — structural linting catches malformed outputs before they reach the validation engine. Database exports often contain null bytes, truncated strings, or improperly escaped characters that break XML parsers silently.

import json
import re
from pathlib import Path
from lxml import etree
from typing import List, Dict, Any


class MetadataLinter:
    def __init__(self, config: Dict[str, Any]) -> None:
        self.rules = config.get("lint_rules", {})

    def validate_encoding(self, file_path: Path) -> List[str]:
        errors: List[str] = []
        try:
            file_path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            errors.append(f"Encoding violation: {exc}")
        return errors

    def lint_json(self, data: str) -> List[str]:
        errors: List[str] = []
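        try:
            parsed = json.loads(data)
        except json.JSONDecodeError as exc:
            return [f"JSON syntax error: {exc}"]

        # A top-level array or scalar has no fields to inspect.
        if not isinstance(parsed, dict):
            return ["Top-level JSON value must be an object"]

        if self.rules.get("require_license") and not parsed.get("license"):
            errors.append("Missing required 'license' field")

        if self.rules.get("require_spdx"):
            # Shape check only: confirms the value looks like an SPDX short
            # identifier, not that it appears in the SPDX license list.
            license_val = parsed.get("license", "")
            if license_val and not re.fullmatch(r"[A-Za-z0-9.+-]+", str(license_val)):
                errors.append(f"Non-SPDX license identifier: {license_val!r}")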

        date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
        for date_field in ("date", "dateModified", "datePublished"):
            val = parsed.get(date_field)
            if val and not date_pattern.match(str(val)):
                errors.append(f"Invalid date format in '{date_field}'. Expected YYYY-MM-DD")

        return errors

    def lint_xml(self, file_path: Path) -> List[str]:
        errors: List[str] = []
        try:
            parser = etree.XMLParser(recover=False, resolve_entities=False)
            etree.parse(str(file_path), parser)
        except etree.XMLSyntaxError as exc:
            errors.append(f"XML well-formedness failure: {exc}")
        return errors

Linters should fail fast and return structured error arrays. Keep this stage to cheap, rule-based checks, and leave data type, cardinality, and element-relationship verification to schema conformance.
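Byte-level problems such as a UTF-8 BOM or embedded null bytes can be handled before any parser sees the file. The helper below is a minimal sketch, not part of `MetadataLinter`; the name `sanitize_bytes` and its return shape are assumptions.

```python
import codecs
from typing import List, Tuple


def sanitize_bytes(content: bytes) -> Tuple[bytes, List[str]]:
    """Strip a UTF-8 BOM and report null bytes or invalid UTF-8."""
    errors: List[str] = []
    if content.startswith(codecs.BOM_UTF8):
        content = content[len(codecs.BOM_UTF8):]
    if b"\x00" in content:
        errors.append("Null byte found; illegal in XML 1.0")
    try:
        content.decode("utf-8")
    except UnicodeDecodeError as exc:
        errors.append(f"Encoding violation: {exc}")
    return content, errors
```

The cleaned bytes can then be passed to the XML or JSON linter, so the BOM is removed once rather than handled separately in each parser.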

Step 3: Execute Formal Schema Conformance Checks

Once structural linting passes, the pipeline proceeds to strict schema validation. This step verifies that the metadata document conforms to the exact structure, data types, and cardinality rules defined by the governing standard.

For XML-based workflows (ISO 19115, FGDC), use xmlschema or lxml with XSD/DTD validation. For JSON/JSON-LD workflows (DCAT-AP, STAC), jsonschema with Draft 2020-12 support provides robust validation including $ref resolution and custom format validators.

Teams generating baseline records via ISO 19115 metadata template generation must verify conformance against authoritative XSD definitions before publication. Template generation frequently introduces namespace collisions or drops conditional elements when fields are left unpopulated.

import json
from pathlib import Path
from typing import Any, Dict, List

import xmlschema
from jsonschema import Draft202012Validator


def validate_against_schema(
    doc: str,
    schema_path: str,
    doc_type: str = "json",
) -> Dict[str, Any]:
    """Collect ALL schema violations — not just the first failure."""
    errors: List[Dict[str, Any]] = []

    if doc_type == "xml":
        schema = xmlschema.XMLSchema(schema_path)
        for err in schema.iter_errors(doc):
            errors.append({
                "path": err.path,
                "message": str(err),
                "severity": "error",
            })

    elif doc_type == "json":
        schema_def = json.loads(Path(schema_path).read_text())
        validator = Draft202012Validator(schema_def)
        for err in validator.iter_errors(json.loads(doc)):
            errors.append({
                "path": list(err.absolute_path),
                "message": err.message,
                "severity": "error",
            })

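    else:
        # Without this branch an unknown doc_type would be reported as valid.
        raise ValueError(f"Unsupported doc_type: {doc_type!r}")

    return {"valid": not errors, "errors": errors}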

Always cache external schema references locally. Network timeouts during CI runs cause spurious failures and block deployments. Maintain a schemas/ directory under version control and update it via a scheduled maintenance job that opens a pull request for curator approval.
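A quick audit can confirm that the cached schemas do not still point at the network. The sketch below scans a directory of XSD files for `import` and `include` elements whose `schemaLocation` is an http(s) URL; the function name is hypothetical, and it reads the files with the standard library rather than a schema-aware parser.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List, Tuple

XS = "{http://www.w3.org/2001/XMLSchema}"


def find_remote_references(schema_dir: Path) -> List[Tuple[str, str]]:
    """Return (file name, schemaLocation) pairs that point at http(s) URLs."""
    remote: List[Tuple[str, str]] = []
    for xsd in sorted(schema_dir.rglob("*.xsd")):
        root = ET.parse(xsd).getroot()
        for tag in ("import", "include"):
            for node in root.iter(f"{XS}{tag}"):
                location = node.get("schemaLocation", "")
                if location.startswith(("http://", "https://")):
                    remote.append((xsd.name, location))
    return remote
```

Running this in CI and failing on a non-empty result is one way to keep remote references from creeping back in after a schema update.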

Validation and CI Integration

GitHub Actions Workflow

A complete validation workflow should:

Trigger on pull_request events (and on push, if direct commits are allowed) that touch metadata directories
Run the linter first; exit immediately on failure (exit code 1)
Run schema validation only if linting passes (exit code 2 on violation)
Generate a JSON report for PR annotations
Block merges if critical errors remain

name: Metadata Validation Pipeline
on:
  pull_request:
    paths:
      - "metadata/**"
      - "schemas/**"
      - ".github/workflows/metadata-validation.yml"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install "xmlschema>=3.3" "jsonschema>=4.21" "lxml>=5.2" "pydantic>=2.7"

      - name: Structural linting
        run: python -m tools.lint --config profiles/default.yaml --input metadata/

      - name: Schema conformance
        run: python -m tools.validate --schema schemas/iso19115-3.xsd --input metadata/

      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metadata-validation-report
          path: reports/validation.json

For deeper integration with spatial data schema linting in CI, add a pre-commit hook that runs the linter locally before any commit reaches the remote, catching issues before CI queues fill.

Pre-commit Hook Configuration

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: metadata-lint
        name: Geospatial metadata linter
        entry: python -m tools.lint --config profiles/default.yaml --input metadata/
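        language: system  # assumes lxml and the tools package are installed in the active environment
        types_or: [xml, json]  # `types` would require a file to be both XML and JSON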
        pass_filenames: false

Configure exit codes consistently: 0 for success, 1 for lint failures, 2 for schema violations. Use PR comment bots to post structured error summaries directly on changed files, reducing reviewer context-switching and accelerating merge cycles. This is the same CI/CD policy enforcement gate pattern used across the broader compliance pipeline.
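A PR comment bot needs a compact text summary rather than the raw report. The sketch below assumes a hypothetical report shape, a JSON list of `{"file": ..., "errors": [{"path": ..., "message": ...}]}` entries; the real layout of reports/validation.json depends on how the validation tooling writes it.

```python
import json
from typing import Any, Dict, List


def summarise_report(report_json: str) -> str:
    """Render a validation report as a short plain-text summary.

    Assumes a hypothetical report shape: a list of
    {"file": str, "errors": [{"path": ..., "message": str}]} entries.
    """
    entries: List[Dict[str, Any]] = json.loads(report_json)
    failing = [entry for entry in entries if entry["errors"]]
    lines = [f"{len(failing)} of {len(entries)} records failed validation"]
    for entry in failing:
        for err in entry["errors"]:
            lines.append(f"- {entry['file']}: {err['message']}")
    return "\n".join(lines)
```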

Derivative and Lineage Management

Spatial transformations introduce silent compliance risks that validation must track. When a dataset is reprojected, clipped, joined, or rasterized, the derivative record inherits licensing obligations from its source but its metadata is frequently not updated to reflect the transformation lineage.

Key validation rules for derivative management:

Reprojection: The referenceSystemInfo element (ISO 19115) or dct:conformsTo (DCAT-AP) must be updated when the CRS changes. Validate that EPSG codes in the output record match the actual CRS of the file using pyproj.CRS.from_user_input() against the GDAL-reported CRS.
Clipping and subsetting: Bounding box coordinates must be recalculated. Validate that EX_GeographicBoundingBox values in XML or bbox in GeoJSON/STAC keep longitudes within ±180 and latitudes within ±90, and that the derived extent falls inside the source extent.
License inheritance: Derivative works under ODbL or CC BY-SA carry share-alike obligations. Validate that the useConstraints or dct:license field in the derivative record matches the required inherited license, not the source record’s license verbatim.
Processing step audit: The LI_ProcessStep sequence in ISO 19115 dataQualityInfo must grow by one entry for each transformation. A regression test that counts processStep elements against a known lineage depth catches silent omissions.

For field-level lineage tracking across format migrations — including records processed through FGDC to ISO 19115 conversion pipelines — validate that no mandatory FGDC elements were silently dropped during crosswalk. Map FGDC lineage/srcinfo to ISO 19115 LI_Source and assert its presence when source citations existed in the original record.
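The source-citation assertion can be approximated by counting elements by local name, which sidesteps differences in namespace prefixes between records. The helper names below are hypothetical, and the check only tests for presence; it does not compare the content of the citations.

```python
import xml.etree.ElementTree as ET
from typing import List


def count_local(xml_text: str, local_name: str) -> int:
    """Count elements whose tag, ignoring any namespace, equals local_name."""
    root = ET.fromstring(xml_text)
    return sum(
        1 for el in root.iter()
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == local_name
    )


def check_source_carried_over(fgdc_xml: str, iso_xml: str) -> List[str]:
    """Flag an ISO record that lost source citations present in FGDC."""
    if count_local(fgdc_xml, "srcinfo") and not count_local(iso_xml, "LI_Source"):
        return ["FGDC srcinfo present but no LI_Source in ISO record"]
    return []
```

The same `count_local` helper could back the processing-step audit, for example by comparing `count_local(iso_xml, "LI_ProcessStep")` with the expected lineage depth.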

Pitfalls and Resolution Table

Pitfall	Root Cause	Resolution Strategy
Schema validation passes but catalog rejects record	Schema checks structure but not vocabulary; catalog enforces controlled terms	Add semantic linting layer that validates `topicCategory`, `language`, and `characterSet` against ISO 19139 code lists
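CI fails intermittently on schema fetch	XSD `import` or `include` directives resolve over the network at validation time	Mirror all referenced XSD files into `schemas/` and point import paths at the local copies; `xmlschema.XMLSchema` also accepts a `locations` argument for mapping namespaces to local files and an `allow` argument (for example `allow="local"`) that blocks remote access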
UTF-8 validation passes but XML parser fails	BOM present in file; linter detects BOM but does not strip it before parsing	Strip BOM in the encoding validation step; open files with `open(f, encoding="utf-8-sig")` before lxml parsing
Valid DCAT-AP records rejected after vocabulary update	SPDX or EU Vocabulary authority lists updated; cached copies outdated	Automate scheduled pulls of SPDX `licenses.json` and EU NAL CSV; run diff check and raise PR for curator approval before merging
`if-then` cardinality rules not enforced	Base XSD does not encode ISO 19115 conditional cardinality; `xmlschema` reports valid	Implement custom Python validators for conditional rules (e.g. `distributionFormat` present → `transferOptions` mandatory) on top of XSD pass
Regression suite breaks after standard upgrade	New schema version flags previously valid constructs as errors	Tag all fixtures with the schema version that produced them; run fixtures only against their tagged version; add new fixtures for current version
Empty mandatory elements pass validation	XSD `minOccurs="1"` is satisfied by an empty element tag when the element's type permits empty content	Add a linting rule that checks `(element.text or "").strip()` is non-empty for all mandatory text fields before schema validation
PostGIS export contains null bytes	PostgreSQL text fields with embedded `\x00` are valid in the database but illegal in XML	Run null-byte check in `validate_encoding()` using `b"\x00" in content` before XML parsing
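The empty-element pitfall in the table can be caught with a short pre-check. The sketch below uses assumed names and matches elements by local name; the list of mandatory element names would come from your profile. Elements that carry their value only in attributes would also be flagged, so a real rule may need an exclusion list.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_empty_mandatory(xml_text: str, mandatory: Iterable[str]) -> List[str]:
    """Return local names of mandatory elements with no text and no children."""
    wanted = set(mandatory)
    root = ET.fromstring(xml_text)
    return [
        local(el.tag)
        for el in root.iter()
        if local(el.tag) in wanted
        and not (el.text or "").strip()
        and len(el) == 0
    ]
```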

Evolving Validation Rules

Metadata standards evolve. ISO 19115-3 introduces modular profiles, DCAT-AP v3 expands spatial predicates, and licensing frameworks update SPDX license lists. Implement a rule versioning strategy to adapt without breaking existing catalogs:

Tag configuration files with semantic versions (v1.2.0-profile.yaml)
Maintain a golden dataset of known-valid records for regression testing
Run validation suites against historical metadata during standard upgrades
Deprecate rules gradually with warning phases before enforcing strict failures
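A warning phase can be modelled by giving each rule the profile versions at which it starts warning and at which it starts failing. This is a minimal sketch with hypothetical names; it assumes plain `MAJOR.MINOR.PATCH` version strings with an optional leading `v` and no pre-release suffixes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'v1.2.0' or '1.2.0' into a tuple that compares numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


@dataclass(frozen=True)
class Rule:
    name: str
    warn_from: str               # profile version where the rule starts warning
    enforce_from: Optional[str]  # version where violations become errors


def severity(rule: Rule, profile_version: str) -> str:
    """Return 'off', 'warning', or 'error' for a rule at a profile version."""
    current = parse_version(profile_version)
    if rule.enforce_from and current >= parse_version(rule.enforce_from):
        return "error"
    if current >= parse_version(rule.warn_from):
        return "warning"
    return "off"
```

Comparing tuples of integers rather than raw strings keeps `1.10.0` ordered after `1.4.0`.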

Automated regression testing uses pytest with parameterised fixtures:

import pytest
from pathlib import Path
from tools.validate import validate_against_schema


@pytest.mark.parametrize(
    "metadata_file",
    sorted(Path("tests/fixtures/valid/").glob("*.xml")),
)
def test_known_valid_records(metadata_file: Path) -> None:
    result = validate_against_schema(
        metadata_file.read_text(),
        "schemas/iso19115-3.xsd",
        "xml",
    )
    assert result["valid"], (
        f"Regression failure in {metadata_file.name}: {result['errors']}"
    )

Schedule quarterly reviews of controlled vocabularies, CRS registries, and license identifiers. Automate vocabulary updates via scripts that fetch the latest EPSG registry or SPDX license list, run diff checks, and open pull requests for curator approval before any changes reach the validation profile.
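The diff step of such an update script can be as simple as a set comparison. The sketch below assumes the SPDX `licenses.json` layout, a top-level `licenses` array whose entries carry a `licenseId`; the function names are hypothetical, and fetching the file and opening the pull request are left out.

```python
import json
from typing import Dict, Iterable, List


def spdx_ids(licenses_json: str) -> List[str]:
    """Extract license identifiers from SPDX licenses.json content."""
    data = json.loads(licenses_json)
    return [entry["licenseId"] for entry in data["licenses"]]


def diff_vocabulary(
    cached: Iterable[str], latest: Iterable[str]
) -> Dict[str, List[str]]:
    """Report identifiers added to or removed from a vocabulary."""
    old, new = set(cached), set(latest)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

An empty diff means the cached copy is current; a non-empty one is the payload for the curator's pull request.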

Frequently Asked Questions

What is the difference between metadata linting and schema validation?

Linting is a fast, heuristic check for structural and syntactic issues — encoding errors, malformed XML, missing required fields — that runs in milliseconds. Schema validation is a strict conformance check against an authoritative XSD or JSON Schema definition, verifying data types, cardinality, and element relationships. The two must run in sequence: linting first, then validation only if linting passes.

Which Python libraries support ISO 19115 XML schema validation?

The xmlschema library provides XSD-based validation with full iter_errors support for collecting all violations rather than just the first. lxml with its etree.XMLSchema class is a lower-level alternative. For JSON and JSON-LD profiles, jsonschema with Draft202012Validator is the standard choice.

How do I prevent CI failures caused by network-fetched schemas?

Cache all external schema references — XSD files, JSON Schema definitions, DCAT-AP contexts — in a version-controlled schemas/ directory. Never allow CI jobs to fetch schemas over the network at runtime. Update the cache via a scheduled maintenance job that opens a pull request for curator review.

Can the same pipeline validate both ISO 19115 XML and DCAT-AP JSON records?

Yes. Dispatch on file extension or a doc_type field in the record manifest. The validate_against_schema() function shown above handles both paths. The linting layer also operates on both formats through separate lint_xml() and lint_json() methods on MetadataLinter.
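Dispatching on file extension might look like the sketch below. The mapping and the function name are illustrative assumptions; the returned value is meant to be passed as the `doc_type` argument of `validate_against_schema()`.

```python
from pathlib import Path

# Hypothetical extension-to-doc_type mapping; extend it to suit your catalog.
DOC_TYPES = {".xml": "xml", ".json": "json", ".jsonld": "json"}


def detect_doc_type(path: Path) -> str:
    """Map a metadata file to the doc_type expected by the validator."""
    try:
        return DOC_TYPES[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported metadata file: {path.name}") from None
```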

Related

Automated Metadata Generation & Schema Mapping — parent section covering the full generation-to-validation pipeline
ISO 19115 Metadata Template Generation — generating the records that this validation pipeline checks
FGDC to ISO 19115 Conversion Pipelines — conversion workflows where schema violations most commonly appear
DCAT-AP Spatial Profile Mapping — DCAT-AP field conventions and JSON-LD context requirements
Spatial Data Schema Linting in CI — integrating linting into CI/CD pipelines across all spatial data types
Automating Metadata Extraction from PostGIS Tables — upstream extraction step that feeds records into this validation pipeline
