Python Scripts for DCAT-AP Spatial Dataset Mapping
Python scripts for DCAT-AP spatial dataset mapping automate the transformation of raw geospatial metadata into standardized RDF graphs that comply with European open data mandates. By programmatically parsing bounding boxes, coordinate reference systems (CRS), geometry formats, and licensing constraints, these scripts generate valid dcat:Dataset and dcat:Distribution records ready for EU validation suites and direct ingestion into CKAN, GeoNetwork, or custom data catalogs.
Core Mapping Requirements
The DCAT-AP Spatial Profile extends the base W3C DCAT vocabulary with mandatory and recommended properties for geospatial resources. Production-grade automation must reliably map:
- Spatial Coverage: dct:spatial pointing to a dct:Location that carries dcat:bbox, dcat:centroid, or locn:geometry, each as a geometry literal (typically WKT; GeoJSON is also used) in WGS84 decimal degrees
- Coordinate Reference System: dct:conformsTo pointing to an authoritative OGC CRS URI (the GeoDCAT-AP convention; DCAT itself defines no dedicated CRS property)
- Licensing & Rights: dct:license (URI), dct:rights (literal statement), and dct:accessRights
- Temporal & Format Metadata: dct:temporal, dcat:mediaType, and dct:format
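As a minimal sketch of the spatial-coverage step, the helper below checks a west/south/east/north bounding box and renders it as a WKT polygon, one common literal form for dcat:bbox and locn:geometry. The function name bbox_to_wkt and its validation rules are illustrative assumptions, not part of any DCAT-AP tooling.

```python
from typing import Sequence


def bbox_to_wkt(bbox: Sequence[float]) -> str:
    """Render a (west, south, east, north) WGS84 box as a WKT polygon.

    Hypothetical helper: the name and the checks are assumptions made
    for illustration.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values: W, S, E, N")
    w, s, e, n = (float(v) for v in bbox)
    # Longitudes must lie in [-180, 180] and latitudes in [-90, 90];
    # a latitude outside that range usually means lat/lon were swapped.
    if not (-180 <= w <= 180 and -180 <= e <= 180):
        raise ValueError(f"longitude out of range: {w}, {e}")
    if not (-90 <= s <= 90 and -90 <= n <= 90):
        raise ValueError(f"latitude out of range: {s}, {n}")
    if s > n:
        raise ValueError("south must not exceed north")
    # WKT uses "x y" (lon lat) order; the ring is closed by repeating
    # the first corner.
    return f"POLYGON(({w} {s}, {e} {s}, {e} {n}, {w} {n}, {w} {s}))"


print(bbox_to_wkt([-10.5, 35.0, 35.0, 72.0]))
```

West is deliberately allowed to exceed east so that boxes crossing the antimeridian are not rejected; a stricter pipeline might treat that case separately.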
Implementing DCAT-AP Spatial Profile Mapping through Python eliminates manual RDF authoring and ensures consistent compliance across thousands of datasets. This workflow sits at the core of modern Automated Metadata Generation & Schema Mapping architectures used by national spatial data infrastructures.
Production-Ready Python Implementation
The following script uses rdflib to construct a compliant RDF graph from a structured Python dictionary. It includes namespace binding, spatial/CRS mapping, distribution generation, and serialization. Install dependencies first: pip install rdflib.
# dcat_ap_spatial_mapper.py
from typing import Any, Dict

from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, GEO, RDF, XSD

# Standard namespace bindings
DCATAP = Namespace("http://data.europa.eu/r5r/")
LOCN = Namespace("http://www.w3.org/ns/locn#")
DCT = DCTERMS

# Base URIs for OGC CRS definitions and IANA media types
OGC_EPSG = "http://www.opengis.net/def/crs/EPSG/0/"
IANA_MEDIA_TYPES = "https://www.iana.org/assignments/media-types/"


def build_spatial_dataset(metadata: Dict[str, Any]) -> Graph:
    """Construct a DCAT-AP compliant RDF graph from structured metadata."""
    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)
    g.bind("locn", LOCN)
    g.bind("xsd", XSD)
    g.bind("geo", GEO)
    g.bind("dcatap", DCATAP)

    ds_uri = URIRef(metadata["dataset_uri"])
    g.add((ds_uri, RDF.type, DCAT.Dataset))
    g.add((ds_uri, DCT.title, Literal(metadata["title"], lang="en")))
    g.add((ds_uri, DCT.identifier, Literal(metadata["identifier"])))
    if "description" in metadata:
        g.add((ds_uri, DCT.description, Literal(metadata["description"], lang="en")))

    # Spatial bounding box: dcat:bbox belongs on a dct:Location node
    # (linked via dct:spatial) and holds a WKT literal in lon/lat order
    if "bbox" in metadata:
        w, s, e, n = metadata["bbox"]
        wkt = f"POLYGON(({w} {s}, {e} {s}, {e} {n}, {w} {n}, {w} {s}))"
        location = BNode()
        g.add((ds_uri, DCT.spatial, location))
        g.add((location, RDF.type, DCT.Location))
        g.add((location, DCAT.bbox, Literal(wkt, datatype=GEO.wktLiteral)))

    # Coordinate Reference System: DCAT has no dcat:crs term, so use the
    # GeoDCAT-AP convention of dct:conformsTo with an OGC EPSG registry URI
    if "crs" in metadata:
        g.add((ds_uri, DCT.conformsTo, URIRef(f"{OGC_EPSG}{metadata['crs']}")))

    # Licensing & Rights
    if "license_uri" in metadata:
        g.add((ds_uri, DCT.license, URIRef(metadata["license_uri"])))
    if "rights_statement" in metadata:
        g.add((ds_uri, DCT.rights, Literal(metadata["rights_statement"])))

    # Distributions
    for dist in metadata.get("distributions", []):
        dist_uri = URIRef(dist["distribution_uri"])
        g.add((ds_uri, DCAT.distribution, dist_uri))
        g.add((dist_uri, RDF.type, DCAT.Distribution))
        g.add((dist_uri, DCT.title, Literal(dist["title"], lang="en")))
        if "media_type" in dist:
            # dcat:mediaType expects an IANA media type URI, not a string
            media_uri = URIRef(IANA_MEDIA_TYPES + dist["media_type"])
            g.add((dist_uri, DCAT.mediaType, media_uri))
        if "access_url" in dist:
            g.add((dist_uri, DCAT.accessURL, URIRef(dist["access_url"])))
        if "download_url" in dist:
            g.add((dist_uri, DCAT.downloadURL, URIRef(dist["download_url"])))

    return g


if __name__ == "__main__":
    sample_meta = {
        "dataset_uri": "https://data.example.eu/datasets/land-cover-2024",
        "title": "European Land Cover 2024",
        "description": "Sample land cover dataset used to exercise the mapper.",
        "identifier": "LC-2024-EU",
        "bbox": [-10.5, 35.0, 35.0, 72.0],
        "crs": "4326",
        "license_uri": "https://creativecommons.org/licenses/by/4.0/",
        "distributions": [
            {
                "distribution_uri": "https://data.example.eu/dist/lc-2024-geojson",
                "title": "GeoJSON Export",
                "media_type": "application/geo+json",
                "access_url": "https://data.example.eu/datasets/land-cover-2024",
                "download_url": "https://data.example.eu/files/lc-2024.geojson",
            }
        ],
    }
    graph = build_spatial_dataset(sample_meta)
    print(graph.serialize(format="turtle"))
Validation & EU Compliance
Generating valid triples is only half the workflow. Before publishing, validate the output against the official SEMIC DCAT-AP SHACL shapes, for example with pySHACL, the SHACL validator maintained by the RDFLib project. Key validation checks include:
- Mandatory Properties: dct:title and dct:description must exist on every dcat:Dataset, and dcat:accessURL on every dcat:Distribution. dct:identifier and dcat:distribution are not mandatory in the base profile, but most catalogs expect them.
- Spatial Format: dcat:bbox must be a geometry literal (typically WKT) in WGS84 longitude/latitude order, attached to a dct:Location. If using locn:geometry, ensure it parses as valid WKT or GeoJSON.
- CRS Resolution: The CRS URI must resolve to a valid OGC definition. Bare EPSG codes without the http://www.opengis.net/def/crs/EPSG/0/ prefix will fail automated checks.
- License URIs: Use persistent, resolvable URIs (e.g., SPDX or Creative Commons). Avoid string literals for dct:license.
Integration & Deployment Patterns
Once validated, the serialized RDF (Turtle or JSON-LD) can be pushed directly into catalog APIs:
- CKAN: Use the ckanext-dcat extension to harvest RDF dumps (Turtle, RDF/XML, or JSON-LD) published at a stable URL.
- GeoNetwork: Enable the DCAT-AP metadata profile and ingest via the OAI-PMH or REST API.
- CI/CD Pipelines: Embed the mapper in GitHub Actions or GitLab CI. Run pytest with fixture-driven metadata dictionaries, serialize outputs to a staging directory, and trigger a validation job before merging.
Best Practices for Scale
- Namespace Management: Always bind prefixes explicitly. Unbound namespaces are serialized with auto-generated prefixes (ns1, ns2, ...), which bloats output and makes it harder to compare records across catalogs.
- Error Handling: Wrap spatial parsing in try/except blocks. Swapped coordinate order (e.g., lat, lon instead of lon, lat) is a frequent cause of validator failures.
- Caching: OGC CRS URIs and license endpoints should be cached or pre-resolved to avoid HTTP timeouts during bulk generation.
- Idempotency: Design the mapper to overwrite or merge graphs cleanly. Call g.remove((s, p, None)) on the existing graph, or g.set((s, p, o)) for single-valued properties, before adding updated triples so that stale values do not linger after incremental updates.
By standardizing spatial metadata extraction and RDF construction, Python automation reduces manual curation overhead and improves interoperability across European open data ecosystems.