Python Scripts for DCAT-AP Spatial Dataset Mapping

Python scripts for DCAT-AP spatial dataset mapping automate the transformation of raw geospatial metadata into standardized RDF graphs that comply with European open data mandates. By programmatically parsing bounding boxes, coordinate reference systems (CRS), geometry formats, and licensing constraints, these scripts generate valid dcat:Dataset and dcat:Distribution records ready for EU validation suites and direct ingestion into CKAN, GeoNetwork, or custom data catalogs.

Core Mapping Requirements

The spatial profile of DCAT-AP (published as GeoDCAT-AP) extends the base W3C DCAT vocabulary with mandatory and recommended properties for geospatial resources. Production-grade automation must reliably map:

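  • Spatial Coverage: dct:spatial pointing to a dct:Location that carries dcat:bbox, dcat:centroid, or locn:geometry, each as a typed geometry literal (WKT or GeoJSON) in WGS84 decimal degrees
  • Coordinate Reference System: dct:conformsTo pointing to an authoritative OGC URI such as http://www.opengis.net/def/crs/EPSG/0/4326 (DCAT itself defines no dcat:crs property)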
  • Licensing & Rights: dct:license (URI), dct:rights (literal statement), and dct:accessRights
  • Temporal & Format Metadata: dct:temporal, dcat:mediaType, and dct:format

Implementing DCAT-AP Spatial Profile Mapping through Python eliminates manual RDF authoring and ensures consistent compliance across thousands of datasets. This workflow sits at the core of modern Automated Metadata Generation & Schema Mapping architectures used by national spatial data infrastructures.

Production-Ready Python Implementation

The following script uses rdflib to construct a compliant RDF graph from a structured Python dictionary. It includes namespace binding, spatial/CRS mapping, distribution generation, and serialization. Install dependencies first: pip install rdflib.

# dcat_ap_spatial_mapper.py
from typing import Any, Dict

from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, GEO, RDF, XSD

# Standard namespace bindings
DCATAP = Namespace("http://data.europa.eu/r5r/")
LOCN = Namespace("http://www.w3.org/ns/locn#")
DCT = DCTERMS

# Base URIs for OGC CRS definitions and IANA media types
OGC_EPSG_BASE = "http://www.opengis.net/def/crs/EPSG/0/"
IANA_MEDIA_TYPE_BASE = "https://www.iana.org/assignments/media-types/"


def bbox_to_wkt(west: float, south: float, east: float, north: float) -> str:
    """Return a closed WKT polygon (lon lat order) for a WGS84 bounding box."""
    return (
        f"POLYGON(({west} {south}, {east} {south}, {east} {north}, "
        f"{west} {north}, {west} {south}))"
    )


def build_spatial_dataset(metadata: Dict[str, Any]) -> Graph:
    """Construct a DCAT-AP compliant RDF graph from structured metadata."""
    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)
    g.bind("locn", LOCN)
    g.bind("xsd", XSD)
    g.bind("geo", GEO)
    g.bind("dcatap", DCATAP)

    ds_uri = URIRef(metadata["dataset_uri"])
    g.add((ds_uri, RDF.type, DCAT.Dataset))
    g.add((ds_uri, DCT.title, Literal(metadata["title"], lang="en")))
    # dct:description is mandatory for dcat:Dataset in DCAT-AP
    if "description" in metadata:
        g.add((ds_uri, DCT.description, Literal(metadata["description"], lang="en")))
    g.add((ds_uri, DCT.identifier, Literal(metadata["identifier"])))

    # Spatial coverage: dcat:bbox belongs on a dct:Location linked via dct:spatial,
    # and its value is a typed geometry literal rather than a plain string
    if "bbox" in metadata:
        w, s, e, n = metadata["bbox"]
        location = BNode()
        g.add((ds_uri, DCT.spatial, location))
        g.add((location, RDF.type, DCT.Location))
        g.add((location, DCAT.bbox,
               Literal(bbox_to_wkt(w, s, e, n), datatype=GEO.wktLiteral)))

    # Coordinate Reference System (OGC EPSG registry).
    # DCAT defines no dcat:crs; GeoDCAT-AP expresses the CRS with dct:conformsTo.
    if "crs" in metadata:
        crs_uri = URIRef(f"{OGC_EPSG_BASE}{metadata['crs']}")
        g.add((ds_uri, DCT.conformsTo, crs_uri))

    # Licensing & Rights
    if "license_uri" in metadata:
        g.add((ds_uri, DCT.license, URIRef(metadata["license_uri"])))
    if "rights_statement" in metadata:
        g.add((ds_uri, DCT.rights, Literal(metadata["rights_statement"])))

    # Distributions
    for dist in metadata.get("distributions", []):
        dist_uri = URIRef(dist["distribution_uri"])
        g.add((ds_uri, DCAT.distribution, dist_uri))
        g.add((dist_uri, RDF.type, DCAT.Distribution))
        g.add((dist_uri, DCT.title, Literal(dist["title"], lang="en")))
        if "media_type" in dist:
            # DCAT-AP expects an IANA media type URI, not a string literal
            media_type_uri = URIRef(f"{IANA_MEDIA_TYPE_BASE}{dist['media_type']}")
            g.add((dist_uri, DCAT.mediaType, media_type_uri))
        if "access_url" in dist:
            # dcat:accessURL is mandatory for dcat:Distribution in DCAT-AP
            g.add((dist_uri, DCAT.accessURL, URIRef(dist["access_url"])))
        if "download_url" in dist:
            g.add((dist_uri, DCAT.downloadURL, URIRef(dist["download_url"])))

    return g


if __name__ == "__main__":
    sample_meta = {
        "dataset_uri": "https://data.example.eu/datasets/land-cover-2024",
        "title": "European Land Cover 2024",
        "description": "Land cover classification for Europe, reference year 2024.",
        "identifier": "LC-2024-EU",
        "bbox": [-10.5, 35.0, 35.0, 72.0],  # west, south, east, north
        "crs": "4326",
        "license_uri": "https://creativecommons.org/licenses/by/4.0/",
        "distributions": [
            {
                "distribution_uri": "https://data.example.eu/dist/lc-2024-geojson",
                "title": "GeoJSON Export",
                "media_type": "application/geo+json",
                "access_url": "https://data.example.eu/datasets/land-cover-2024",
                "download_url": "https://data.example.eu/files/lc-2024.geojson"
            }
        ]
    }

    graph = build_spatial_dataset(sample_meta)
    print(graph.serialize(format="ttl"))
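
The mapper above unpacks the bounding box as west, south, east, north without checking it. A small guard, sketched below with the standard library only, could run before the graph is built. The helper name validate_bbox is hypothetical and not part of rdflib or any DCAT tooling.

```python
from typing import Sequence, Tuple


def validate_bbox(bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Check a (west, south, east, north) box given in WGS84 decimal degrees.

    Hypothetical helper: raises ValueError instead of guessing the intended order.
    """
    if len(bbox) != 4:
        raise ValueError(f"expected 4 values, got {len(bbox)}")
    west, south, east, north = (float(v) for v in bbox)
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        raise ValueError(f"longitude out of range: west={west}, east={east}")
    if not (-90.0 <= south <= 90.0 and -90.0 <= north <= 90.0):
        raise ValueError(f"latitude out of range: south={south}, north={north}")
    if south > north:
        raise ValueError(f"south ({south}) is greater than north ({north})")
    # west > east is legal for boxes crossing the antimeridian, so it is not rejected.
    return west, south, east, north


print(validate_bbox([-10.5, 35.0, 35.0, 72.0]))
```

Range checks only catch a lat/lon swap when a swapped value falls outside ±90 degrees. A box such as [35.0, -10.5, 72.0, 35.0] still passes, so comparing against an expected region remains a manual or dataset-specific step.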

Validation & EU Compliance

Generating valid triples is only half the workflow. Before publishing, validate the output against the official EU DCAT-AP specification by running the SEMIC DCAT-AP SHACL shapes through a SHACL engine such as pySHACL, which is maintained under the RDFLib project. Key validation checks include:

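  1. Mandatory Properties: dct:title and dct:description must exist on every dcat:Dataset, and dcat:accessURL on every dcat:Distribution. dct:identifier and dcat:distribution are not mandatory but are worth supplying whenever they are available.
  2. Spatial Format: dcat:bbox must be a geometry literal (for example a WKT polygon typed geo:wktLiteral) in WGS84 decimal degrees, attached to a dct:Location. If using locn:geometry, ensure it parses as valid WKT or GeoJSON.
  3. CRS Resolution: The CRS URI (given via dct:conformsTo) must be an OGC definition URI. Bare EPSG codes without the http://www.opengis.net/def/crs/EPSG/0/ prefix will fail automated checks.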
  4. License URIs: Use persistent, resolvable URIs (e.g., SPDX or Creative Commons). Avoid string literals for dct:license.
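
Some of these checks can be run on the input dictionary before any RDF is generated, which keeps SHACL reports short. The sketch below is one possible pre-flight pass, not a substitute for SHACL validation. It assumes the DCAT-AP rule that dct:title and dct:description are mandatory on a dataset and dcat:accessURL on a distribution. The function names and the "description" key are hypothetical; the other keys follow the sample dictionary used by the mapper.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def is_http_uri(value: Any) -> bool:
    """True if value is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def preflight(metadata: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems: List[str] = []
    for key in ("dataset_uri", "title", "description"):
        if not metadata.get(key):
            problems.append(f"missing {key}")
    if "license_uri" in metadata and not is_http_uri(metadata["license_uri"]):
        problems.append("license_uri is not an http(s) URI")
    if "crs" in metadata and not str(metadata["crs"]).isdigit():
        problems.append("crs should be a bare EPSG code such as 4326")
    for i, dist in enumerate(metadata.get("distributions", [])):
        if not is_http_uri(dist.get("access_url")):
            problems.append(f"distribution {i}: missing or invalid access_url")
    return problems


# Deliberately flawed input to show the kind of findings reported
flawed = {
    "dataset_uri": "https://data.example.eu/datasets/land-cover-2024",
    "title": "European Land Cover 2024",
    "license_uri": "CC-BY-4.0",
    "crs": "EPSG:4326",
    "distributions": [{"download_url": "https://data.example.eu/files/lc-2024.geojson"}],
}
for problem in preflight(flawed):
    print(problem)
```

Returning a list rather than raising on the first problem lets a bulk job log every finding for a record in a single pass.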

Integration & Deployment Patterns

Once validated, the serialized RDF (Turtle or JSON-LD) can be pushed directly into catalog APIs:

  • CKAN: Use the ckanext-dcat extension, whose RDF harvester pulls RDF dumps (Turtle, RDF/XML, or JSON-LD) from a published URL.
  • GeoNetwork: Enable the DCAT-AP metadata profile and ingest via the OAI-PMH or REST API.
  • CI/CD Pipelines: Embed the mapper in GitHub Actions or GitLab CI. Run pytest with fixture-driven metadata dictionaries, serialize outputs to a staging directory, and trigger a validation job before merging.

Best Practices for Scale

  • Namespace Management: Always bind prefixes explicitly. Unbound namespaces are serialized with auto-generated prefixes (ns1, ns2, and so on), which makes Turtle output harder to read and to diff between runs.
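  • Error Handling: Validate coordinates explicitly rather than relying on try/except alone. A swapped coordinate order (lat, lon instead of lon, lat) raises no exception and is a common source of wrong spatial extents, so check value ranges and confirm that south does not exceed north.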
  • Caching: OGC CRS URIs and license endpoints should be cached or pre-resolved to avoid HTTP timeouts during bulk generation.
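  • Idempotency: Design the mapper to overwrite or merge graphs cleanly. An RDF graph is a set, so re-adding an identical triple is harmless, but adding a changed value leaves the stale one behind. On the existing graph, call g.remove((s, p, None)) before adding the updated triple, or use g.set((s, p, o)), so that incremental updates do not accumulate outdated statements.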

By standardizing spatial metadata extraction and RDF construction, Python automation reduces manual curation overhead and helps keep catalogs interoperable across European open data ecosystems.