Automated Broken Link and Reference Detection for Geospatial Metadata
Geospatial data catalogs and spatial repositories rely heavily on external references to maintain provenance, licensing compliance, and service interoperability. When a Web Map Service (WMS) endpoint shifts, a data dictionary URL changes, or a licensing page returns a 404, the downstream impact on data consumers and automated pipelines can be severe. Implementing Automated Broken Link and Reference Detection transforms this maintenance burden from a reactive audit into a proactive, version-controlled process. Within the broader framework of CI/CD Validation & Policy Enforcement for Spatial Data, link validation serves as a critical quality gate that ensures metadata artifacts remain actionable before they reach production environments.
This guide provides a production-tested workflow for scanning, validating, and reporting broken references across common geospatial formats. The approach is engineered for GIS data managers, open-source maintainers, and government technical teams who require reproducible, policy-driven validation without manual intervention.
Prerequisites & Environment Baseline
Before deploying the validation pipeline, ensure your environment meets the following baseline requirements:
- Python 3.9+ with pip or uv for deterministic dependency resolution
- Core libraries: requests, urllib3, lxml, pyyaml (json and re ship with the Python standard library and need no installation)
- CI/CD runner with network egress capabilities (GitHub Actions, GitLab CI, or Azure Pipelines)
- Repository access to target directories containing metadata files (GeoJSON, ISO 19115 XML, YAML/JSON catalogs, FGDC records)
- Optional but recommended: A local HTTP proxy or caching layer to reduce external request volume during development and testing
Link validation should not operate in isolation. Pairing reference checks with structural validation ensures that malformed XML or invalid JSON schemas don’t cause parser failures before URLs are even extracted. For teams building comprehensive metadata pipelines, integrating Spatial Data Schema Linting in CI alongside reference scanning creates a robust, multi-layered quality assurance strategy.
Core Workflow Architecture
The detection pipeline follows a deterministic, four-stage execution model designed for reproducibility and scalability:
- Discovery & Parsing: Traverse the repository tree, identify metadata and spatial data files, and extract all embedded URLs, URNs, and cross-references using format-aware parsers.
- Normalization & Deduplication: Resolve relative paths against a base directory, strip non-routing query parameters (e.g., utm_source, session_id), and cache previously validated endpoints to minimize redundant network calls across large catalogs.
- Validation Execution: Issue lightweight HEAD requests first, falling back to GET for services that reject HEAD. Apply configurable timeouts, exponential backoff retry logic, and strict status code evaluation.
- Reporting & Gate Enforcement: Generate structured output (JSON/Markdown), annotate pull requests with actionable findings, and trigger pass/fail conditions aligned with organizational policy thresholds.
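The normalization and deduplication stage can be sketched with the standard library alone. This is a minimal illustration, not the guide's own implementation: the NON_ROUTING_PARAMS set and the function names are assumptions, and the sketch resolves relative references against a base URL rather than a filesystem directory.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not affect routing
NON_ROUTING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session_id"}


def normalize_url(raw, base=""):
    """Resolve a reference against a base URL and strip non-routing noise."""
    absolute = urljoin(base, raw.strip()) if base else raw.strip()
    parts = urlsplit(absolute)
    # Keep query parameters in their original order, minus the tracking ones
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_ROUTING_PARAMS
    ]
    # Scheme and host are case-insensitive; the path is not, so it is kept as-is.
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def dedupe(urls, base=""):
    """Return normalized URLs in first-seen order with duplicates removed."""
    return list(dict.fromkeys(normalize_url(u, base) for u in urls))


print(normalize_url("HTTPS://Example.org/wms?SERVICE=WMS&utm_source=x#top"))
print(dedupe(["https://example.org/a?session_id=1", "https://EXAMPLE.org/a"]))
```

Note that urlencode re-encodes query values, so the normalized string can differ in percent-encoding from the original while still addressing the same resource.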
sequenceDiagram
    participant V as Link validator
    participant C as Dedup cache
    participant S as Remote server
    V->>C: URL already seen?
    alt URL already seen
        C-->>V: yes, skip
    else new URL
        C-->>V: no, validate
        V->>S: HEAD request
        S-->>V: status code
        alt HEAD rejected (405 / 501)
            V->>S: GET request (fallback)
            S-->>V: status code
        end
        alt status >= 400 or error
            V->>V: mark broken
        else status < 400
            V->>V: mark valid
        end
    end
    V->>V: aggregate report, enforce gate
Implementation: Python Validation Engine
The following script demonstrates an extensible pattern for extracting and validating references. It scans each file as plain text with a regular expression, so the same code path covers structured metadata (ISO XML, FGDC XML, YAML) and spatial formats such as GeoJSON. Timeouts, retries, and a HEAD-to-GET fallback help it cope with the quirks of geospatial services.
import json
import logging
import re
from dataclasses import asdict, dataclass, replace
from pathlib import Path
from typing import Dict, List, Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

METADATA_SUFFIXES = {".xml", ".yml", ".yaml", ".json", ".geojson"}


@dataclass
class ValidationResult:
    url: str
    # Optional[...] keeps these annotations valid on Python 3.9
    status_code: Optional[int] = None
    is_broken: bool = False
    error_message: Optional[str] = None
    source_file: str = ""


# Matches absolute http(s) URLs only. Characters outside the path class
# (for example ':' or '#') end the match.
URL_PATTERN = re.compile(
    r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+(?::\d+)?(?:/[-\w./?%&=]*)?'
)


class LinkValidator:
    def __init__(self, timeout: int = 10, max_retries: int = 2):
        self.timeout = timeout
        self.session = self._build_session(max_retries)
        self.results: List[ValidationResult] = []
        # Maps each URL to its first validation result (the dedup cache)
        self.checked: Dict[str, ValidationResult] = {}

    def _build_session(self, retries: int) -> requests.Session:
        session = requests.Session()
        retry_strategy = Retry(
            total=retries,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET"],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        session.headers.update({"User-Agent": "GeoLinkValidator/1.0"})
        return session

    def extract_urls(self, file_path: Path) -> List[str]:
        text = file_path.read_text(encoding="utf-8", errors="ignore")
        # De-duplicate within the file; sort so runs are reproducible
        return sorted(set(URL_PATTERN.findall(text)))

    def validate_url(self, url: str, source: str) -> ValidationResult:
        # Reuse the cached outcome so a URL that is broken in one file is
        # reported as broken in every file that references it
        cached = self.checked.get(url)
        if cached is not None:
            return replace(cached, source_file=source)
        try:
            # Prefer HEAD for efficiency, fall back to GET
            resp = self.session.head(url, timeout=self.timeout, allow_redirects=True)
            if resp.status_code in (405, 501):
                # stream=True avoids downloading the response body
                resp = self.session.get(
                    url, timeout=self.timeout, allow_redirects=True, stream=True
                )
                resp.close()
            result = ValidationResult(
                url=url,
                status_code=resp.status_code,
                is_broken=resp.status_code >= 400,
                source_file=source,
            )
        except requests.exceptions.RequestException as e:
            # Timeouts, DNS failures, and exhausted retries all land here
            result = ValidationResult(
                url=url,
                is_broken=True,
                error_message=str(e),
                source_file=source,
            )
        self.checked[url] = result
        return result

    def scan_directory(self, target_dir: str) -> List[ValidationResult]:
        target = Path(target_dir)
        for file_path in sorted(target.rglob("*")):
            if file_path.is_file() and file_path.suffix.lower() in METADATA_SUFFIXES:
                for url in self.extract_urls(file_path):
                    self.results.append(self.validate_url(url, str(file_path)))
        return self.results

    def generate_report(self, output_path: str = "link_report.json"):
        # One entry per (file, broken URL) pair
        broken = [r for r in self.results if r.is_broken]
        report = {
            "total_checked": len(self.checked),
            "broken_count": len(broken),
            "results": [asdict(r) for r in broken],
        }
        Path(output_path).write_text(json.dumps(report, indent=2), encoding="utf-8")
        logging.info(f"Report saved to {output_path}")
        return report
Key Reliability Features
- Retry with Exponential Backoff: Uses urllib3.Retry to gracefully handle transient network failures and 429 Too Many Requests responses without hammering endpoints.
- HEAD-to-GET Fallback: Many geospatial servers (especially legacy OGC implementations) block HEAD requests. The engine automatically falls back to GET while preserving timeout constraints.
- Deduplication Cache: Prevents redundant validation of identical URLs across thousands of metadata records, drastically reducing CI/CD runtime.
CI/CD Integration & Pipeline Wiring
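Embedding the validator into your continuous integration workflow ensures that broken references are caught before merging. Below is a minimal GitHub Actions configuration that runs the script, fails the build if broken links exceed a defined threshold, and uploads the report as an artifact. It assumes the engine is saved as link_validator.py and wrapped in a command-line entry point that accepts --dir, --threshold, and --report and exits with a non-zero status when the threshold is exceeded.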
name: Geospatial Link Validation
on:
  pull_request:
    paths:
      - 'metadata/**'
      - 'catalogs/**'
      - 'data/**/*.geojson'
jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install requests pyyaml lxml
      - name: Run link validator
        run: python link_validator.py --dir ./metadata --threshold 0 --report link_report.json
      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: broken-links-report
          path: link_report.json
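The workflow calls the script with --dir, --threshold, and --report, which the validation engine does not define on its own. A minimal command-line wrapper might look like the sketch below. The helper names build_parser and exit_code are hypothetical, and the commented entry point assumes the LinkValidator class from the engine lives in the same file.

```python
import argparse


def build_parser():
    """Define the three flags used by the workflow step."""
    parser = argparse.ArgumentParser(description="Validate links in metadata files")
    parser.add_argument("--dir", default=".", help="directory to scan")
    parser.add_argument("--threshold", type=int, default=0,
                        help="maximum number of broken links tolerated")
    parser.add_argument("--report", default="link_report.json",
                        help="path of the JSON report")
    return parser


def exit_code(broken_count, threshold):
    """Return 1 (fail the job) when broken links exceed the threshold."""
    return 1 if broken_count > threshold else 0


# Possible entry point at the bottom of link_validator.py:
#
#     if __name__ == "__main__":
#         args = build_parser().parse_args()
#         validator = LinkValidator()
#         validator.scan_directory(args.dir)
#         report = validator.generate_report(args.report)
#         raise SystemExit(exit_code(report["broken_count"], args.threshold))
```

A non-zero exit status is what makes the GitHub Actions step fail, while the upload step still runs because of its if: always() condition.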
For teams managing high-volume spatial data contributions, automating this step is only half the battle. You must also define clear merge criteria. Integrating Policy Enforcement Gates for Data PRs allows you to block merges when critical service endpoints fail, while permitting non-breaking warnings for deprecated but still-resolvable references.
Handling Geospatial Service Quirks & Rate Limits
Geospatial infrastructure introduces unique validation challenges that generic link checkers often miss. Understanding these nuances prevents false positives and ensures your pipeline respects upstream service agreements.
- OGC Service Endpoints: Direct WMS/WFS URLs often require SERVICE=WMS&REQUEST=GetCapabilities to return a valid 200. A bare endpoint may return 400 or 404. Consider appending standard OGC query parameters during validation or validating against the capabilities document instead of the base URL. Refer to the official OGC Web Map Service (WMS) Implementation Specification for standard request patterns.
- Rate Limiting & Caching: Public spatial data portals frequently enforce strict rate limits. Implement a request cache or use a local proxy during development. For production CI, stagger requests or use a distributed queue if validating across multiple runners.
- URL Normalization Pitfalls: Geospatial metadata often embeds URLs with trailing slashes, mixed casing, or encoded characters. Use Python’s urllib.parse module to normalize paths before validation. The official urllib.parse documentation provides robust utilities for stripping fragments and standardizing query strings.
- Authentication & Tokenized URLs: Some catalogs use short-lived tokens or API keys. Exclude these from automated validation or implement a token-refresh step in your CI pipeline. Hardcoded tokens in metadata should be treated as security violations and flagged during schema linting.
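Two of the points above lend themselves to a short sketch: turning a bare OGC endpoint into a GetCapabilities request, and spotting URLs that carry a credential in the query string. The function names and the TOKEN_PARAMS set are assumptions for illustration; adjust the parameter names to whatever your catalogs actually use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that suggest an embedded credential
TOKEN_PARAMS = {"token", "api_key", "apikey", "access_token"}


def capabilities_url(endpoint, service="WMS"):
    """Return the endpoint with SERVICE and REQUEST=GetCapabilities set."""
    parts = urlsplit(endpoint)
    # OGC parameter names are case-insensitive, so drop any existing variants
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.upper() not in {"SERVICE", "REQUEST"}
    ]
    query = urlencode(kept + [("SERVICE", service), ("REQUEST", "GetCapabilities")])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def has_embedded_token(url):
    """True when the query string carries a credential-like parameter."""
    keys = {key.lower() for key, _ in parse_qsl(urlsplit(url).query)}
    return bool(keys & TOKEN_PARAMS)


print(capabilities_url("https://example.org/geoserver/wms"))
print(has_embedded_token("https://example.org/wms?token=abc"))
```

URLs flagged by has_embedded_token can be skipped by the validator and reported separately, in line with treating hardcoded tokens as a security finding.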
Conclusion
Automated Broken Link and Reference Detection is no longer optional for modern geospatial data teams. By embedding deterministic URL extraction, resilient HTTP validation, and strict CI/CD gating into your metadata workflows, you eliminate silent data degradation and maintain trust across downstream consumers. The Python engine provided here scales from small open-source catalogs to enterprise spatial repositories, while the CI/CD wiring ensures that broken references never reach production. Pair this validation layer with comprehensive schema checks and policy-driven merge gates to build a spatial data pipeline that is both resilient and compliant.