Automating license checks with Python and OGR
Automating license checks with Python and OGR eliminates manual EULA verification bottlenecks by programmatically reading embedded geospatial metadata, normalizing license strings, and validating them against a compliance matrix before ingestion. Using the osgeo Python bindings, you can open mixed-format datasets, extract key-value pairs, and enforce policy rules at scale. This pipeline approach reduces human error, generates auditable compliance logs, and aligns ingestion workflows with established Geospatial Data Licensing & Compliance Fundamentals.
For GIS data managers and government tech teams, embedding validation directly into ingestion scripts ensures rigorous Commercial EULA Compliance Tracking across vendor drops, open-data portals, and batch deliveries. Below is a production-ready implementation, followed by driver-specific handling notes and scaling strategies.
Core Implementation
The script below leverages GDAL’s unified architecture to handle both raster and vector formats. It attempts to open the dataset, scans common metadata keys for license information, normalizes the string, and validates it against an SPDX-compatible allowlist.
import os
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from osgeo import gdal, ogr
# Enable exception-based error handling
gdal.UseExceptions()
ogr.UseExceptions()
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
# SPDX-compatible allowlist
ALLOWED_LICENSES = {
"CC-BY-4.0", "CC-BY-SA-4.0", "ODBL-1.0", "MIT", "APACHE-2.0",
"US-GOV-PD", "EUPL-1.2"
}
# Common metadata keys across GDAL/OGR drivers
LICENSE_KEYS = [
"LICENSE", "LICENCE", "COPYRIGHT", "DATA_LICENSE",
"GDAL_LICENSE", "OGC_LICENSE", "AUTHOR"
]
def open_dataset(filepath: str):
"""Attempt to open a dataset as raster, then vector."""
try:
ds = gdal.Open(filepath)
if ds:
return ds, "raster"
except RuntimeError:
pass
try:
ds = ogr.Open(filepath)
if ds:
return ds, "vector"
except RuntimeError:
pass
return None, None
def normalize_license_string(raw: str) -> Optional[str]:
"""Clean and standardize license strings for deterministic matching."""
if not raw:
return None
cleaned = raw.strip().upper()
# Strip common prefixes
for prefix in ("LICENSE:", "COPYRIGHT:", "LICENCE:", "DATA LICENSE:"):
if cleaned.startswith(prefix):
cleaned = cleaned[len(prefix):].strip()
# Normalize spacing and underscores for SPDX alignment
cleaned = cleaned.replace(" ", "-").replace("_", "-")
return cleaned if cleaned else None
def extract_and_validate(filepath: str) -> Tuple[str, str, Optional[str], bool]:
"""Extract license metadata and validate against the compliance allowlist."""
ds, ds_type = open_dataset(filepath)
if not ds:
return filepath, "unknown", None, False
metadata = ds.GetMetadata()
raw_license = None
for key in LICENSE_KEYS:
if key in metadata:
raw_license = metadata[key]
break
normalized = normalize_license_string(raw_license)
is_compliant = normalized in ALLOWED_LICENSES if normalized else False
# Explicit cleanup to release file locks (critical for Windows/Shapefiles)
if ds_type == "raster":
ds = None
else:
ds.Destroy()
return filepath, ds_type, normalized, is_compliant
How the Validation Pipeline Works
- Unified Opening:
gdal.Open()attempts raster parsing first. If it fails,ogr.Open()handles vector formats (GeoPackage, Shapefile, GeoJSON). This covers 95% of standard geospatial deliveries. - Metadata Extraction: GDAL stores metadata as flat dictionaries. The script iterates through
LICENSE_KEYSto catch driver-specific variations (e.g.,LICENCEvsLICENSE). - String Normalization: Raw metadata often contains prefixes, inconsistent casing, or vendor-specific phrasing. The
normalize_license_string()function strips boilerplate and converts to uppercase, replacing spaces/underscores with hyphens to align with SPDX License List conventions. - Compliance Evaluation: The normalized string is checked against
ALLOWED_LICENSES. Returns a boolean flag for immediate pipeline gating.
Handling Edge Cases & Driver Quirks
Geospatial metadata is notoriously inconsistent. Implement robust fallbacks to prevent pipeline failures:
- Missing Metadata Keys: Many legacy formats (e.g., older Shapefiles) lack embedded license fields. Return
Noneand route to a manual review queue rather than failing the batch. - Domain-Specific Metadata: Some drivers store licenses in custom domains (e.g.,
IMAGE_STRUCTURE,DERIVED_SUBDATASETS). Useds.GetMetadataDomainList()to inspect available domains, then callds.GetMetadata(domain="YOUR_DOMAIN")if needed. - Multi-File Archives: ZIP-delivered datasets require extraction before validation. Wrap the script in a
tempfile.TemporaryDirectory()context, extract, validate, then archive or reject based on compliance status. - Proprietary Formats: Closed-source drivers (e.g., certain CAD or LiDAR formats) may not expose metadata. Log these as
unsupported_formatand escalate to procurement teams for vendor EULA verification.
Refer to the official GDAL Python API documentation for driver-specific metadata behavior and advanced domain querying.
Scaling to Batch Workflows
For enterprise ingestion, wrap the validation function in a parallel executor and export structured reports:
import concurrent.futures
import csv
from typing import Iterable
def process_batch(filepaths: Iterable[str], output_csv: str) -> None:
"""Run license checks in parallel and write compliance report."""
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
futures = {executor.submit(extract_and_validate, fp): fp for fp in filepaths}
for future in concurrent.futures.as_completed(futures):
results.append(future.result())
# Sort by compliance status for quick triage
results.sort(key=lambda x: (x[3], x[0]))
with open(output_csv, "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["Filepath", "Type", "Normalized_License", "Compliant"])
writer.writerows(results)
logging.info(f"Processed {len(results)} files. Report saved to {output_csv}")
Performance Notes:
- Use
ThreadPoolExecutor(notProcessPoolExecutor) since GDAL/OGR operations are I/O-bound and thread-safe underUseExceptions(). - Limit
max_workersto 4–8 to avoid exhausting file descriptors or hitting database connection limits on spatial databases. - For directories exceeding 10,000 files, implement chunked processing and append-mode CSV writing to prevent memory spikes.
Integration Best Practices
- Fail-Fast Gating: Reject non-compliant files before they enter your spatial database or cloud storage. This prevents downstream licensing violations and simplifies audit trails.
- Version Control the Allowlist: Store
ALLOWED_LICENSESin a centralized configuration file (YAML/JSON) or environment variable. Update it quarterly as organizational policy or open-data mandates change. - Log Everything: Capture raw metadata strings alongside normalized results. When a vendor disputes a compliance flag, you can prove exactly what the dataset reported versus what your policy requires.
Automating license checks with Python and OGR transforms a manual, error-prone compliance step into a repeatable, auditable pipeline component. By standardizing metadata extraction, aligning with SPDX identifiers, and embedding validation directly into ingestion scripts, teams can safely scale geospatial data operations while maintaining strict legal and policy adherence.