Why do WMS endpoints return 400 when validated with a bare URL?

OGC services expect mandatory query parameters such as SERVICE=WMS&REQUEST=GetCapabilities. A bare base URL is not a valid OGC request, so the server returns 400. Append the GetCapabilities parameters before validating, or validate against the capabilities document URL instead.
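As a rough illustration of appending the parameters, the sketch below rewrites an endpoint URL with the standard library only. The helper name `with_get_capabilities` and the example hosts are hypothetical, and it assumes the service type is known in advance.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_get_capabilities(url: str, service: str = "WMS") -> str:
    """Return url with SERVICE and REQUEST=GetCapabilities set, keeping other params."""
    parsed = urlparse(url)
    # OGC parameter names are case-insensitive, so compare on the upper-cased key
    params = [
        (k, v)
        for k, v in parse_qsl(parsed.query, keep_blank_values=True)
        if k.upper() not in {"SERVICE", "REQUEST"}
    ]
    params += [("SERVICE", service), ("REQUEST", "GetCapabilities")]
    return urlunparse(parsed._replace(query=urlencode(params)))


print(with_get_capabilities("https://example.org/geoserver/wms"))
# prints https://example.org/geoserver/wms?SERVICE=WMS&REQUEST=GetCapabilities
```

Any existing `SERVICE` or `REQUEST` value is replaced rather than duplicated, so a stored `GetMap` URL is also checked via its capabilities document.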

How should tokenised or authenticated metadata URLs be handled?

Short-lived tokens embedded in metadata should be excluded from automated validation via an allowlist pattern. Hardcoded API keys in metadata files should be flagged as security violations during schema linting before link checking runs.
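One possible shape for both checks is sketched below. The hostname patterns, the parameter names, and the helpers `should_skip` and `credential_params` are illustrative assumptions, not part of any standard; adapt them to the services your catalog references.

```python
from fnmatch import fnmatch
from urllib.parse import parse_qsl, urlparse

# Hypothetical hostname patterns for services that issue short-lived tokens
SKIP_HOST_PATTERNS = ["*.tokens.example.org", "signed.example.com"]
# Hypothetical query parameter names that usually carry credentials
CREDENTIAL_PARAMS = {"token", "apikey", "api_key", "access_token", "key"}


def should_skip(url: str, patterns=SKIP_HOST_PATTERNS) -> bool:
    """True if the URL's host matches a pattern excluded from link checking."""
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in patterns)


def credential_params(url: str) -> list[str]:
    """Return the names of query parameters that look like embedded credentials."""
    query = urlparse(url).query
    return [
        k for k, _ in parse_qsl(query, keep_blank_values=True)
        if k.lower() in CREDENTIAL_PARAMS
    ]
```

A linting step could fail the record when `credential_params` returns a non-empty list, while the link checker calls `should_skip` before issuing any request.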

What status codes indicate a broken reference versus a transient failure?

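Status codes 404 and 410 are treated as broken references. A 400 is also counted as broken by default, but on OGC endpoints it often means the request lacks mandatory parameters, so inspect it before remediating. Codes 401 and 403 indicate access control rather than a dead link and should produce a warning. Codes 429, 500, 502, 503, and 504 are transient and should trigger retries with exponential backoff before being marked broken.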

Automated Broken Link and Reference Detection for Geospatial Metadata

Geospatial data catalogs and spatial repositories embed dozens of external references per record: WMS/WFS service endpoints, licensing pages, coordinate reference system registry URIs, data dictionary anchors, and provenance URLs. When any of these silently go stale — a 404 from a licensing portal, a WMS endpoint that migrated to a new host, a CRS registry URL that changed structure — downstream consumers lose the ability to verify provenance, resolve attribution obligations, or re-project data correctly. Detecting these failures after publication is expensive; catching them before merge is cheap. This page documents a production-ready, CI-integrated workflow for extracting, normalising, and validating every reference embedded in spatial metadata files before they reach a production branch. It sits within the broader CI/CD Validation & Policy Enforcement for Spatial Data framework, where automated reference validation is one enforcement gate among several.

Prerequisites

Python 3.9+ — requests>=2.31.0, urllib3>=2.0.0, lxml>=4.9.0, pyyaml>=6.0.1; install with pip install "requests>=2.31" "urllib3>=2.0" "lxml>=4.9" "pyyaml>=6.0"
CI/CD runner with network egress — GitHub Actions ubuntu-latest, GitLab CI, or Azure Pipelines; the runner must be able to reach external hosts (WMS servers, license registries, CRS registries)
Repository access to directories containing metadata files: ISO 19115/19139 XML, GeoJSON properties, STAC links[] arrays, FGDC CSDGM XML, YAML/JSON catalog manifests
Schema familiarity — basic knowledge of ISO 19115 MD_Distribution.transferOptions and STAC links[].href fields where URLs are typically embedded
Optional — a local HTTP caching proxy (e.g. hoverfly, mitmproxy) to reduce external request volume during iterative development and to avoid triggering rate limits on public spatial portals

Pair reference checks with spatial data schema linting in CI, and run the linting first. Structurally invalid XML or malformed JSON is then caught before URL extraction, instead of surfacing as a parser failure (or a silently empty URL list) during link checking.

Concept & Spec Reference

Metadata standards embed references in different locations and with different semantics. Understanding where URLs live in each format is essential for writing format-aware extractors.

Format	Reference field(s)	Typical reference type
ISO 19115/19139 XML	`MD_Distribution/transferOptions/MD_DigitalTransferOptions/onLine/CI_OnlineResource/linkage/URL`	Service endpoints, download URLs
ISO 19115/19139 XML	`MD_DataIdentification/resourceConstraints/MD_LegalConstraints/otherConstraints/CharacterString`	Licensing page URLs
GeoJSON (FeatureCollection)	`properties.*` (any string value matching `https?://`)	Provenance, attribution, license
STAC Item / Collection	`links[].href` where `rel` is `license`, `derived_from`, `canonical`, `alternate`	License, source, canonical
FGDC CSDGM	`idinfo/citation/citeinfo/onlink`, `distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/networkr`	Download, citation URLs
YAML/JSON catalog	Any string value matching `https?://`	Service, catalog, schema URLs
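To make the FGDC row concrete, here is a minimal sketch of a structured extractor for the `onlink` and `networkr` elements. It uses the standard library's `xml.etree.ElementTree` rather than lxml to stay dependency-free, and the function name `extract_fgdc` is a hypothetical addition; it assumes FGDC CSDGM XML without namespaces.

```python
import re
import xml.etree.ElementTree as ET

URL_RE = re.compile(r'https?://[^\s\'"<>]+')


def extract_fgdc(xml_text: str) -> list[str]:
    """Collect absolute URLs from <onlink> and <networkr> elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # malformed XML: leave it to schema linting to report
    urls: list[str] = []
    for tag in ("onlink", "networkr"):
        for elem in root.iter(tag):
            # elem.text is None for empty elements
            urls.extend(URL_RE.findall(elem.text or ""))
    return urls
```

Because FGDC and ISO records often share the `.xml` extension, a dispatcher would need to inspect the root element (for example `metadata` for FGDC) rather than the file suffix alone.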

Status code semantics for enforcement:

HTTP status	Classification	Default action
2xx	Valid	Pass
301 / 302	Redirect (resolved)	Pass if terminal target is 2xx
400	Invalid request (often OGC base URL)	Inspect — may need OGC params
401 / 403	Access-controlled	Warn; not necessarily broken
404	Not found	Broken
410	Gone (explicit removal)	Broken — flag for immediate remediation
429	Rate limited	Retry with backoff; do not count as broken on first attempt
5xx	Server error	Retry; mark broken after max retries exhausted
Connection error	Unreachable host	Broken
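The table can be read as a small decision function. The sketch below is one way to encode it; the function name `classify` and the action labels are hypothetical. It groups 401 with 403, as the validator in Step 4 does, and assumes that a 3xx status seen after redirect handling means the redirect was never followed.

```python
from typing import Optional


def classify(status: Optional[int], retries_exhausted: bool = False) -> str:
    """Map a final HTTP status (None for a connection error) to an action label."""
    if status is None:
        return "broken"   # unreachable host
    if 200 <= status < 300:
        return "pass"
    if 300 <= status < 400:
        return "inspect"  # assumption: redirects were not followed to a terminal target
    if status == 400:
        return "inspect"  # often an OGC base URL missing required parameters
    if status in (401, 403):
        return "warn"     # access-controlled, not necessarily broken
    if status == 429 or 500 <= status < 600:
        return "broken" if retries_exhausted else "retry"
    return "broken"       # 404, 410, and any other 4xx
```

Keeping the mapping in one function makes it easier to adjust policy, for example promoting `warn` to `broken` for records that must be publicly accessible.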

Implementation Walkthrough

Step 1 — Build a resilient HTTP session

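A single shared session reuses connections, which reduces overhead, and its retry logic absorbs transient failures. A descriptive User-Agent header lets service operators identify the validator, which makes it less likely to be blocked by rate-limiting or bot-filtering middleware.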

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(max_retries: int = 2, backoff_factor: float = 0.5) -> requests.Session:
    """Return a requests.Session configured with retry-on-transient-error logic."""
    session = requests.Session()
    retry = Retry(
        total=max_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Identify the validator so service operators can contact you if needed
    session.headers.update({"User-Agent": "GeoRefValidator/1.0 (+https://geospatialcompliance.org)"})
    return session

Step 2 — Format-aware URL extraction

Generic regex scanning misses structured fields and produces noise from embedded JSON strings or free-text descriptions. Use format-aware parsers for each metadata type, falling back to regex for unstructured text fields.

import re
import json
from pathlib import Path
from lxml import etree
import yaml

ISO19139_NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Covers http and https absolute URLs; does not match relative paths
_URL_RE = re.compile(r'https?://[^\s\'"<>]+')


def extract_iso19139(path: Path) -> list[str]:
    """Extract linkage URLs from an ISO 19115/19139 XML file."""
    try:
        tree = etree.parse(str(path))
    except etree.XMLSyntaxError:
        return []
    urls = tree.xpath(
        "//gmd:CI_OnlineResource/gmd:linkage/gmd:URL/text() | "
        "//gco:CharacterString[contains(text(),'http')]/text()",
        namespaces=ISO19139_NS,
    )
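    # Text nodes may hold a URL mid-sentence (e.g. in otherConstraints), so search
    # each node instead of requiring the URL at the start of the string
    return [m.rstrip(".,;") for u in urls for m in _URL_RE.findall(str(u))]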


def extract_stac(path: Path) -> list[str]:
    """Extract href values from STAC links[] array."""
    try:
        doc = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return []
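    if not isinstance(doc, dict):
        return []  # e.g. a JSON array or scalar; not a STAC Item/Collection
    links = doc.get("links")
    if not isinstance(links, list):
        return []
    return [
        link["href"]
        for link in links
        if isinstance(link, dict)
        and isinstance(link.get("href"), str)
        and _URL_RE.match(link["href"])
    ]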


def extract_generic(path: Path) -> list[str]:
    """Regex scan for any file type not covered by a structured extractor."""
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return []
    return list(set(_URL_RE.findall(text)))


EXTRACTORS = {
    ".xml": extract_iso19139,
    ".geojson": extract_generic,
    ".json": extract_stac,
    ".yml": extract_generic,
    ".yaml": extract_generic,
}


def extract_urls(path: Path) -> list[str]:
    extractor = EXTRACTORS.get(path.suffix.lower(), extract_generic)
    return extractor(path)
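The STAC extractor above returns every absolute `href`, including `rel=self` links that may point at staging hosts. A variant that keeps only selected `rel` values could look like the following sketch; the name `stac_external_hrefs` and the exact `rel` set are assumptions drawn from the relations mentioned in this article.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# rel values this article names as worth validating; adjust to your catalog
CHECKED_RELS = {"license", "derived_from", "canonical", "alternate", "enclosure"}


def stac_external_hrefs(doc, rels=CHECKED_RELS) -> list[str]:
    """Return absolute hrefs from links[] whose rel is in the checked set."""
    links = doc.get("links") if isinstance(doc, dict) else None
    if not isinstance(links, list):
        return []
    return [
        link["href"]
        for link in links
        if isinstance(link, dict)
        and link.get("rel") in rels
        and isinstance(link.get("href"), str)
        and URL_RE.match(link["href"])  # skips relative hrefs
    ]
```

Relative hrefs are skipped here, as in the original extractor; resolving them would require the document's own base URL.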

Step 3 — Normalise and deduplicate

Before any network calls, strip session-specific query parameters and fragments, lowercase the scheme and host, and maintain a process-level set of already-validated URLs. The extractors above return only absolute URLs, so no relative-path resolution is needed at this stage. This step is critical for large catalogs where the same WMS endpoint may appear in hundreds of records.

import urllib.parse
from typing import Set

# Parameters that vary between users/sessions but do not affect resource identity
_NOISE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session_id", "token", "apikey"}


def normalise_url(url: str) -> str:
    """Strip noise query params and lowercase the scheme+host."""
    try:
        parsed = urllib.parse.urlparse(url)
        # Rebuild query string without noise parameters
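        kept = [
            (k, v)
            # keep_blank_values preserves parameters such as "VERSION=" that
            # parse_qsl would otherwise drop silently
            for k, v in urllib.parse.parse_qsl(parsed.query, keep_blank_values=True)
            if k.lower() not in _NOISE_PARAMS
        ]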
        clean_query = urllib.parse.urlencode(kept)
        normalised = parsed._replace(
            scheme=parsed.scheme.lower(),
            netloc=parsed.netloc.lower(),
            query=clean_query,
            fragment="",  # fragments are client-side only; irrelevant for HTTP validation
        )
        return urllib.parse.urlunparse(normalised)
    except ValueError:
        # urlparse/parse_qsl raise ValueError on malformed input (e.g. a bad IPv6 literal)
        return url

Step 4 — Validate with HEAD → GET fallback

Some servers, including many legacy OGC services, reject HEAD with 405 Method Not Allowed or 501 Not Implemented. The validator falls back to GET in both cases rather than treating the rejection itself as a broken reference.

from dataclasses import dataclass
from typing import Optional


@dataclass
class RefResult:
    url: str
    source_file: str
    status_code: Optional[int] = None
    is_broken: bool = False
    error: Optional[str] = None


def validate_url(url: str, source: str, session: requests.Session, timeout: int = 10) -> RefResult:
    """Validate a single URL; HEAD first, GET as fallback for HEAD-hostile servers."""
    try:
        resp = session.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code in (405, 501):
            # Server rejected HEAD — fall back to GET with stream=True to avoid
            # downloading large bodies (e.g. a 500 MB GeoTIFF served over HTTP)
            resp = session.get(url, timeout=timeout, allow_redirects=True, stream=True)
            resp.close()

        broken = resp.status_code >= 400 and resp.status_code not in (401, 403)
        return RefResult(url=url, source_file=source, status_code=resp.status_code, is_broken=broken)
    except requests.exceptions.ConnectionError as exc:
        return RefResult(url=url, source_file=source, is_broken=True, error=f"ConnectionError: {exc}")
    except requests.exceptions.Timeout:
        return RefResult(url=url, source_file=source, is_broken=True, error="Timeout")
    except requests.exceptions.RequestException as exc:
        return RefResult(url=url, source_file=source, is_broken=True, error=str(exc))

Step 5 — Scan a directory and produce a report

import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def scan_and_report(
    target_dir: str,
    report_path: str = "link_report.json",
    timeout: int = 10,
    max_retries: int = 2,
) -> dict:
    session = build_session(max_retries=max_retries)
    seen: Set[str] = set()
    results: list[RefResult] = []

    for file_path in Path(target_dir).rglob("*"):
        if file_path.suffix.lower() not in EXTRACTORS:
            continue
        for raw_url in extract_urls(file_path):
            url = normalise_url(raw_url)
            if url in seen:
                continue
            seen.add(url)
            result = validate_url(url, str(file_path), session, timeout)
            results.append(result)
            if result.is_broken:
                logging.warning("BROKEN [%s] %s in %s", result.status_code or result.error, url, file_path.name)

    broken = [r for r in results if r.is_broken]
    report = {
        "total_checked": len(seen),
        "broken_count": len(broken),
        "broken": [r.__dict__ for r in broken],
    }
    Path(report_path).write_text(json.dumps(report, indent=2), encoding="utf-8")
    logging.info("Report written to %s", report_path)
    return report


if __name__ == "__main__":
    import argparse
    import sys

    parser = argparse.ArgumentParser(description="Validate URLs in geospatial metadata files")
    parser.add_argument("--dir", required=True, help="Root directory to scan")
    parser.add_argument("--report", default="link_report.json", help="Output JSON report path")
    parser.add_argument("--threshold", type=int, default=0, help="Max broken refs before exit(1)")
    args = parser.parse_args()

    report = scan_and_report(args.dir, args.report)
    if report["broken_count"] > args.threshold:
        print(f"FAIL: {report['broken_count']} broken reference(s) exceed threshold {args.threshold}")
        sys.exit(1)
    print(f"PASS: {report['total_checked']} references checked, {report['broken_count']} broken")
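Because the scan loop skips URLs it has already seen, a broken URL is reported against only the first file that references it. If record owners need the full list of affected files, one option is to group sources by URL before validating. The sketch below assumes a hypothetical `index_sources` helper fed with `(normalised_url, source_file)` pairs.

```python
from collections import defaultdict
from typing import Iterable


def index_sources(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group source files by URL so each URL is validated once but reported everywhere."""
    index: dict[str, list[str]] = defaultdict(list)
    for url, source in pairs:
        if source not in index[url]:  # avoid listing a file twice for the same URL
            index[url].append(source)
    return dict(index)


pairs = [
    ("https://example.org/wms", "metadata/a.xml"),
    ("https://example.org/license", "metadata/a.xml"),
    ("https://example.org/wms", "metadata/b.xml"),
]
for url, sources in index_sources(pairs).items():
    print(url, sources)
```

The validator would then iterate over the keys of the index and attach the whole source list to each broken entry in the report.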

Validation & CI Integration

GitHub Actions workflow

The following configuration runs the validator only when metadata or spatial data files change, uploads the report as a build artifact, and fails the build when any reference is broken. Adjust --threshold to a non-zero value during a grace period when remediating a backlog of legacy broken references.

name: Geospatial Reference Validation

on:
  pull_request:
    paths:
      - "metadata/**"
      - "catalogs/**"
      - "data/**/*.geojson"
      - "data/**/*.json"

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "requests>=2.31" "urllib3>=2.0" "lxml>=4.9" "pyyaml>=6.0"

      - name: Run reference validator
        run: |
          python link_validator.py \
            --dir ./metadata \
            --threshold 0 \
            --report link_report.json

      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: broken-links-report
          path: link_report.json

Pre-commit hook

For teams who want reference validation at commit time rather than PR time, add this entry to .pre-commit-config.yaml:

- repo: local
  hooks:
    - id: geo-ref-check
      name: Geospatial reference validation
      language: python
      entry: python link_validator.py --dir ./metadata --threshold 0
      additional_dependencies:
        - "requests>=2.31"
        - "lxml>=4.9"
        - "pyyaml>=6.0"
      pass_filenames: false
      always_run: true

Merge criteria and failure thresholds are governed by your policy enforcement gates for data PRs, which define the organisational rules for blocking versus warning on different categories of broken reference.

Derivative & Lineage Management

When spatial datasets are transformed — reprojected, clipped, aggregated, or converted between formats — the metadata records describing them should be updated to reflect the new distribution endpoint or service URL. Broken references in derivative records are a common but under-diagnosed lineage problem.

Reprojection: If a WFS endpoint serves data in EPSG:4326 and a derivative raster is published at a new endpoint in EPSG:3857, the MD_Distribution.transferOptions block in the ISO 19115 record must point to the new endpoint. Stale endpoint URLs from the source record that are copied forward into derivative records will fail validation. Automate the endpoint update as part of the reprojection pipeline rather than relying on manual record updates.

Clipped or subsetted datasets: Clip operations typically produce derivative records that inherit the parent’s license URL. Validate that the license URL still resolves and still applies to the spatial extent of the derivative. A license that restricts redistribution to a specific country boundary may point to a terms page that has since moved.

Format conversion: Converting from Shapefile to GeoPackage changes the canonical download URL. Broken download links in STAC links[rel=enclosure] fields are a direct consequence of format conversion without record synchronisation. Add a post-conversion step that updates all href values and immediately validates the new endpoints.
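A post-conversion update step might look like the sketch below. It assumes the pipeline can supply a mapping from old to new URLs, and the helper name `rewrite_hrefs` is hypothetical. It rewrites `href` values in both `links[]` and `assets`, then returns the new URLs so they can be passed straight to the validator.

```python
import copy


def rewrite_hrefs(stac_doc: dict, url_map: dict[str, str]) -> tuple[dict, list[str]]:
    """Return a copy of stac_doc with mapped hrefs replaced, plus the new URLs to validate."""
    updated = copy.deepcopy(stac_doc)  # leave the caller's document untouched
    links = updated.get("links")
    assets = updated.get("assets")
    targets = list(links) if isinstance(links, list) else []
    if isinstance(assets, dict):
        targets += list(assets.values())
    changed: list[str] = []
    for obj in targets:
        if not isinstance(obj, dict):
            continue
        href = obj.get("href")
        if isinstance(href, str) and href in url_map:
            obj["href"] = url_map[href]
            changed.append(url_map[href])
    return updated, changed
```

Returning the changed URLs separately keeps the validation call explicit: the pipeline can fail the conversion job if any of them does not resolve.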

Track all reference updates in your metadata artifact retention strategies alongside the lineage events that triggered them, so that auditors can trace exactly when an endpoint changed and which pipeline version introduced the update.

Pitfalls & Resolution Table

Pitfall	Root Cause	Resolution Strategy
WMS base URL returns 400	OGC services require `SERVICE=WMS&REQUEST=GetCapabilities` query params; a bare URL is not a valid OGC request	Append `?SERVICE=WMS&REQUEST=GetCapabilities` to endpoint URLs before validation, or validate the full capabilities document URL
`HEAD` returns 405 but resource is valid	Some legacy OGC server deployments do not support `HEAD` requests	Implement `HEAD → GET` fallback in validator; do not count 405 (or 501) as broken
False-positive broken on short-lived token URLs	Embedded API tokens expire before CI runs; the URL resolves for an authenticated user but not the CI runner	Allowlist token-bearing URLs via hostname pattern; flag the practice of embedding tokens in metadata as a schema linting violation
Public spatial portal returns 429	Large catalogs send too many requests in parallel to the same host	Rate-limit requests per host using a semaphore or `asyncio` throttle; add `time.sleep(0.2)` between sequential requests to the same netloc
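Redirect chain leads to new permanent location	Data portal migrated content; old URL issues 301 but validation counts the original as broken	Follow redirects by passing `allow_redirects=True` explicitly (`requests` does not follow redirects on `HEAD` by default); log the final URL (`resp.url`) alongside the original so record owners can update the source
ISO 19115 records with percent-encoded URLs are deduplicated inconsistently	The same resource may appear with and without percent-encoding (for example `%7E` and `~`), and plain string comparison treats the two spellings as different URLs	Build the seen-set key from a decoded form using `urllib.parse.unquote`, but send the original URL in the request; avoid `unquote_plus` for paths because it also turns `+` into a space, and note that decoding reserved characters such as `%2F` can change the path on some servers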
GeoJSON properties contain non-URL strings matching the regex	Attribute values like `http://` inside free-text descriptions match the URL pattern	Add a minimum-length filter (e.g. `len(url) > 15`) and exclude patterns that lack a valid TLD or path segment after the host
STAC `links[rel=self]` points to internal staging hostname	Self-links in STAC records sometimes reference a staging or localhost URL used during generation	Exclude `rel=self` from external validation; validate `rel=license`, `rel=derived_from`, `rel=enclosure`, and `rel=canonical` only
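For the 429 row, a per-host throttle can be sketched without any third-party dependency. The class below is a hypothetical single-threaded helper: it enforces a minimum interval between requests to the same netloc, and it accepts injectable `clock` and `sleep` callables so that it can be tested without real delays.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: dict[str, float] = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if the host was contacted too recently; return the seconds slept."""
        host = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` immediately before each request spaces out calls to a single portal while leaving requests to other hosts undelayed. A concurrent validator would need a lock or a per-host semaphore around the same bookkeeping.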

Related

CI/CD Validation & Policy Enforcement for Spatial Data — parent section; architectural context for all validation gates in a spatial data pipeline
Spatial Data Schema Linting in CI — run schema linting before reference validation to prevent parser failures on malformed records
Policy Enforcement Gates for Data PRs — define merge criteria and blocking thresholds that govern how broken-reference failures affect PR outcomes
Metadata Artifact Retention Strategies — track reference update history alongside lineage events for auditable provenance chains
