Automating license checks with Python and OGR

Automating license checks with Python and OGR eliminates manual EULA verification bottlenecks by programmatically reading embedded geospatial metadata, normalizing license strings, and validating them against a compliance matrix before ingestion. Using the osgeo Python bindings, you can open mixed-format datasets, extract key-value pairs, and enforce policy rules at scale. This pipeline approach reduces human error, generates auditable compliance logs, and aligns ingestion workflows with established Geospatial Data Licensing & Compliance Fundamentals.

For GIS data managers and government tech teams, embedding validation directly into ingestion scripts ensures rigorous Commercial EULA Compliance Tracking across vendor drops, open-data portals, and batch deliveries. Below is a production-ready implementation, followed by driver-specific handling notes and scaling strategies.

Core Implementation

The script below leverages GDAL’s unified architecture to handle both raster and vector formats. It attempts to open the dataset, scans common metadata keys for license information, normalizes the string, and validates it against an SPDX-compatible allowlist.

import os
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from osgeo import gdal, ogr

# Enable exception-based error handling
gdal.UseExceptions()
ogr.UseExceptions()

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# SPDX-compatible allowlist
ALLOWED_LICENSES = {
    "CC-BY-4.0", "CC-BY-SA-4.0", "ODBL-1.0", "MIT", "APACHE-2.0",
    "US-GOV-PD", "EUPL-1.2"
}

# Common metadata keys across GDAL/OGR drivers
LICENSE_KEYS = [
    "LICENSE", "LICENCE", "COPYRIGHT", "DATA_LICENSE",
    "GDAL_LICENSE", "OGC_LICENSE", "AUTHOR"
]

def open_dataset(filepath: str):
    """Attempt to open a dataset as raster, then vector."""
    try:
        ds = gdal.Open(filepath)
        if ds:
            return ds, "raster"
    except RuntimeError:
        pass
    try:
        ds = ogr.Open(filepath)
        if ds:
            return ds, "vector"
    except RuntimeError:
        pass
    return None, None

def normalize_license_string(raw: str) -> Optional[str]:
    """Clean and standardize license strings for deterministic matching."""
    if not raw:
        return None
    cleaned = raw.strip().upper()
    # Strip common prefixes
    for prefix in ("LICENSE:", "COPYRIGHT:", "LICENCE:", "DATA LICENSE:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
    # Normalize spacing and underscores for SPDX alignment
    cleaned = cleaned.replace(" ", "-").replace("_", "-")
    return cleaned if cleaned else None

def extract_and_validate(filepath: str) -> Tuple[str, str, Optional[str], bool]:
    """Extract license metadata and validate against the compliance allowlist."""
    ds, ds_type = open_dataset(filepath)
    if not ds:
        return filepath, "unknown", None, False

    metadata = ds.GetMetadata()
    raw_license = None
    for key in LICENSE_KEYS:
        if key in metadata:
            raw_license = metadata[key]
            break

    normalized = normalize_license_string(raw_license)
    is_compliant = normalized in ALLOWED_LICENSES if normalized else False

    # Explicit cleanup to release file locks (critical for Windows/Shapefiles)
    if ds_type == "raster":
        ds = None
    else:
        ds.Destroy()

    return filepath, ds_type, normalized, is_compliant

How the Validation Pipeline Works

  1. Unified Opening: gdal.Open() attempts raster parsing first. If it fails, ogr.Open() handles vector formats (GeoPackage, Shapefile, GeoJSON). This covers 95% of standard geospatial deliveries.
  2. Metadata Extraction: GDAL stores metadata as flat dictionaries. The script iterates through LICENSE_KEYS to catch driver-specific variations (e.g., LICENCE vs LICENSE).
  3. String Normalization: Raw metadata often contains prefixes, inconsistent casing, or vendor-specific phrasing. The normalize_license_string() function strips boilerplate and converts to uppercase, replacing spaces/underscores with hyphens to align with SPDX License List conventions.
  4. Compliance Evaluation: The normalized string is checked against ALLOWED_LICENSES. Returns a boolean flag for immediate pipeline gating.

Handling Edge Cases & Driver Quirks

Geospatial metadata is notoriously inconsistent. Implement robust fallbacks to prevent pipeline failures:

  • Missing Metadata Keys: Many legacy formats (e.g., older Shapefiles) lack embedded license fields. Return None and route to a manual review queue rather than failing the batch.
  • Domain-Specific Metadata: Some drivers store licenses in custom domains (e.g., IMAGE_STRUCTURE, DERIVED_SUBDATASETS). Use ds.GetMetadataDomainList() to inspect available domains, then call ds.GetMetadata(domain="YOUR_DOMAIN") if needed.
  • Multi-File Archives: ZIP-delivered datasets require extraction before validation. Wrap the script in a tempfile.TemporaryDirectory() context, extract, validate, then archive or reject based on compliance status.
  • Proprietary Formats: Closed-source drivers (e.g., certain CAD or LiDAR formats) may not expose metadata. Log these as unsupported_format and escalate to procurement teams for vendor EULA verification.

Refer to the official GDAL Python API documentation for driver-specific metadata behavior and advanced domain querying.

Scaling to Batch Workflows

For enterprise ingestion, wrap the validation function in a parallel executor and export structured reports:

import concurrent.futures
import csv
from typing import Iterable

def process_batch(filepaths: Iterable[str], output_csv: str) -> None:
    """Run license checks in parallel and write compliance report."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = {executor.submit(extract_and_validate, fp): fp for fp in filepaths}
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())

    # Sort by compliance status for quick triage
    results.sort(key=lambda x: (x[3], x[0]))

    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Filepath", "Type", "Normalized_License", "Compliant"])
        writer.writerows(results)

    logging.info(f"Processed {len(results)} files. Report saved to {output_csv}")

Performance Notes:

  • Use ThreadPoolExecutor (not ProcessPoolExecutor) since GDAL/OGR operations are I/O-bound and thread-safe under UseExceptions().
  • Limit max_workers to 4–8 to avoid exhausting file descriptors or hitting database connection limits on spatial databases.
  • For directories exceeding 10,000 files, implement chunked processing and append-mode CSV writing to prevent memory spikes.

Integration Best Practices

  • Fail-Fast Gating: Reject non-compliant files before they enter your spatial database or cloud storage. This prevents downstream licensing violations and simplifies audit trails.
  • Version Control the Allowlist: Store ALLOWED_LICENSES in a centralized configuration file (YAML/JSON) or environment variable. Update it quarterly as organizational policy or open-data mandates change.
  • Log Everything: Capture raw metadata strings alongside normalized results. When a vendor disputes a compliance flag, you can prove exactly what the dataset reported versus what your policy requires.

Automating license checks with Python and OGR transforms a manual, error-prone compliance step into a repeatable, auditable pipeline component. By standardizing metadata extraction, aligning with SPDX identifiers, and embedding validation directly into ingestion scripts, teams can safely scale geospatial data operations while maintaining strict legal and policy adherence.