What are the four dimensions of a geospatial risk scoring framework?

License restrictiveness, attribution complexity, provenance and chain of custody, and metadata completeness. Each is normalized to a 0–10 scale and combined with weighted aggregation to produce a composite risk score.

How do you calibrate risk scoring thresholds for spatial data?

Define three bands aligned to your organization's risk appetite: Green (0 to 3.5) for automated ingestion, Amber (above 3.5, up to 7.0) for manual legal review, and Red (above 7.0) for quarantine. The bands are contiguous, so a composite such as 3.55 lands in Amber rather than in a gap between bands. Adjust per-dimension weights to reflect domain priorities: public agencies typically weight provenance higher, while commercial platforms weight license restrictiveness higher.

Can risk scores be integrated into a spatial ETL pipeline?

Yes. The scoring engine is implemented as a Python function that accepts a Pydantic metadata model and returns a RiskScore. It can be called from pandas apply(), Apache Beam transforms, or Airflow task operators. Scores are written back to a catalog JSONB column alongside pipeline run IDs for continuous compliance telemetry.

Geospatial Risk Scoring Frameworks

Managing heterogeneous spatial datasets across enterprise, open-source, and government environments requires more than manual license reviews. As data pipelines scale, organizations need deterministic methods to evaluate compliance exposure before ingestion, transformation, or redistribution. Geospatial risk scoring frameworks translate legal and technical constraints into actionable numeric metrics so GIS data managers, Python automation builders, and agency technical teams can automate ingestion gates, prioritize curation efforts, and maintain audit-ready data catalogs — without blocking every asset behind slow human review cycles.

This methodology extends the foundational principles established in Geospatial Data Licensing & Compliance Fundamentals into programmatic evaluation pipelines that integrate directly with metadata automation stacks. Rather than relying on ad-hoc legal consultations, teams deploy repeatable scoring engines that flag high-risk assets, enforce organizational thresholds, and generate compliance telemetry alongside spatial ETL jobs.

Prerequisites & Environment Configuration

Before deploying a scoring framework, establish baseline technical and policy capabilities. The system must operate deterministically across diverse ingestion sources, so environment configuration and schema validation are non-negotiable first steps.

Python 3.9+ with pip or conda environment management
pydantic>=2.0 — strict schema validation; v2 architecture provides fast type-safe parsing that prevents malformed metadata from corrupting downstream risk calculations
pandas>=1.5 and geopandas>=0.13 — tabular scoring aggregation and spatial asset inspection
lxml>=4.9 or defusedxml>=0.7 — ISO 19115 and FGDC XML parsing; defusedxml prevents XML External Entity (XXE) vulnerabilities when processing untrusted government or third-party catalogs
jsonschema>=4.17 — validation of custom JSON/YAML metadata manifests
pyproj>=3.4 — CRS validation to detect projection mismatches that compound licensing risk during downstream joins
Catalog backend — PostgreSQL/PostGIS with a JSONB compliance column, GeoNetwork, or CKAN; the scoring engine reads metadata from and writes scores back to this store as the single source of truth
Compliance policy matrix — a version-controlled YAML or JSON file defining risk thresholds per use case (internal analysis, public release, commercial redistribution); this file is the calibration backbone for all weight adjustments
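
As a rough illustration of the compliance policy matrix, the sketch below parses a small JSON document and maps a composite score to a band per use case. The field names (`green_max`, `amber_max`), the version string, the helper names, and every threshold other than 3.5 and 7.0 are assumptions made for the example, not a published schema.

```python
import json

# Hypothetical policy matrix; in practice this would be read from a
# version-controlled file rather than an inline string.
POLICY_JSON = """
{
  "version": "example-1",
  "use_cases": {
    "internal_analysis":         {"green_max": 3.5, "amber_max": 7.0},
    "public_release":            {"green_max": 3.0, "amber_max": 6.0},
    "commercial_redistribution": {"green_max": 2.5, "amber_max": 5.5}
  }
}
"""


def load_policy(raw: str) -> dict:
    """Parse the matrix and reject thresholds that are out of order."""
    policy = json.loads(raw)
    for name, bands in policy["use_cases"].items():
        if not 0 <= bands["green_max"] < bands["amber_max"] <= 10:
            raise ValueError(f"Invalid thresholds for use case {name!r}")
    return policy


def band_for(policy: dict, use_case: str, composite: float) -> str:
    """Map a composite score to a band under one use case's thresholds."""
    bands = policy["use_cases"][use_case]
    if composite <= bands["green_max"]:
        return "GREEN"
    if composite <= bands["amber_max"]:
        return "AMBER"
    return "RED"


policy = load_policy(POLICY_JSON)
print(band_for(policy, "internal_analysis", 3.2))          # GREEN
print(band_for(policy, "commercial_redistribution", 3.2))  # AMBER
```

The same composite score can route differently depending on the intended use, which is why the thresholds live in the policy file rather than in the scoring code.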

Establish a reference mapping table that normalizes raw license strings, EULA clauses, and attribution requirements to standardized SPDX identifiers. This mapping must be version-controlled alongside pipeline code so score changes are attributable to deliberate policy edits, not environment drift.
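
One possible shape for that reference mapping is a lookup keyed by a normalized form of the raw string. The entries and the `normalize_license` helper below are illustrative assumptions; a real table would be far larger and maintained with legal input.

```python
from typing import Optional

# Hypothetical mapping from raw catalog strings to SPDX identifiers.
# Keys are stored in normalized form: lowercase, hyphens replaced by
# spaces, and runs of whitespace collapsed.
_RAW_TO_SPDX = {
    "cc0": "CC0-1.0",
    "cc0 1.0 universal": "CC0-1.0",
    "creative commons attribution 4.0": "CC-BY-4.0",
    "cc by 4.0": "CC-BY-4.0",
    "open database license": "ODbL-1.0",
    "odbl": "ODbL-1.0",
}


def normalize_license(raw: Optional[str]) -> Optional[str]:
    """Return the SPDX identifier for a raw license string, or None if unmapped."""
    if raw is None:
        return None
    key = " ".join(raw.replace("-", " ").lower().split())
    return _RAW_TO_SPDX.get(key)


print(normalize_license("CC-BY 4.0"))       # CC-BY-4.0
print(normalize_license("  ODbL "))         # ODbL-1.0
print(normalize_license("Custom EULA v3"))  # None
```

Returning None for unmapped strings is a deliberate choice in this sketch: the asset is then scored as having no recognized license instead of being matched to a guessed identifier.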

Risk Score Data-Flow

The diagram below shows how a raw spatial asset moves from ingestion through dimension scoring to catalog routing. Each diamond represents a decision gate; each rectangle represents a deterministic computation step.

Concept & Spec Reference

A risk scoring framework decomposes compliance exposure into four orthogonal dimensions. Each dimension is normalized to a 0–10 scale, where 0 represents minimal compliance burden and 10 indicates severe restrictions or critically missing metadata. The composite score is a weighted sum of the four dimensions.

Dimension	Weight	Scoring logic (0 = low risk, 10 = high risk)	Key signals
License restrictiveness	0.35	CC0/MIT/ODbL → 0–2; restrictive commercial/custom → 7–10; absent → 9	Redistribution prohibition, field-of-use restrictions, prior-consent clauses
Attribution complexity	0.25	None required → 0; single URL citation → 1–3; multi-party + watermarking → 6–10	Number of attribution parties, dynamic notice requirements, derivative propagation
Provenance & chain of custody	0.25	Complete lineage + cryptographic checksums → 0–2; undocumented transformations or orphaned datasets → 8–10	Creator records, processing history, version anchors, custodian contacts
Metadata completeness	0.15	All ISO 19115-1 mandatory fields present → 0–1; missing CRS, empty bounding box, absent temporal coverage → 6–10	Spatial extent, CRS definition, data quality statement, update frequency

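ISO 19115-1:2014 supplies the metadata elements checked by the completeness dimension. This framework treats MD_Identification.language, MD_DataIdentification.extent, MD_ReferenceSystem.referenceSystemIdentifier, and DQ_DataQuality.report as required; a missing or empty value for any of them counts as a completeness failure. Element names and obligations differ between editions and community profiles of the standard, so confirm the checklist against the profile your catalog implements.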
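
The completeness inputs can be derived by counting which required elements carry a usable value. In the sketch below, the flat key names in `REQUIRED_FIELDS` are hypothetical stand-ins for whatever a metadata parser emits, and empty strings or empty collections are treated as missing.

```python
from typing import Any, Mapping, Sequence, Tuple

# Hypothetical flat keys standing in for the required metadata elements.
REQUIRED_FIELDS = (
    "language",
    "extent",
    "reference_system_identifier",
    "data_quality_report",
)


def has_value(value: Any) -> bool:
    """Treat None, blank strings, and empty collections as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True


def count_present(
    record: Mapping[str, Any],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> Tuple[int, int]:
    """Return (fields present, fields required) for one metadata record."""
    present = sum(1 for name in required if has_value(record.get(name)))
    return present, len(required)


# An empty bounding box and an absent quality report both count as missing.
record = {"language": "eng", "extent": [], "reference_system_identifier": "EPSG:4326"}
present, total = count_present(record)
print(present, total)  # 2 4
```

The two returned numbers correspond to the `metadata_fields_present` and `total_required_fields` inputs used later in the walkthrough.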

Weighting coefficients should reflect domain priorities. Public-facing agencies often increase the provenance weight to ensure transparency and reproducibility. Commercial data platforms typically increase the license weight to avoid redistribution liability. Document all weight changes and maintain versioned calibration files alongside pipeline code.
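
When one weight is raised, the others have to shrink so the total stays at 1.0. The sketch below shows one way to do that, by rescaling all weights proportionally after applying an override. The `reweight` helper and the 0.40 provenance value are assumptions for illustration.

```python
from typing import Dict

DEFAULT_WEIGHTS: Dict[str, float] = {
    "license": 0.35,
    "attribution": 0.25,
    "provenance": 0.25,
    "metadata": 0.15,
}


def reweight(base: Dict[str, float], overrides: Dict[str, float]) -> Dict[str, float]:
    """Apply overrides, then rescale so the weights sum to 1.0 again."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise ValueError(f"Unknown dimensions: {sorted(unknown)}")
    merged = {**base, **overrides}
    if any(value < 0 for value in merged.values()):
        raise ValueError("Weights must be non-negative")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("At least one weight must be positive")
    return {name: value / total for name, value in merged.items()}


# Hypothetical public-agency profile that emphasizes provenance.
agency_weights = reweight(DEFAULT_WEIGHTS, {"provenance": 0.40})
print({name: round(value, 3) for name, value in agency_weights.items()})
```

Proportional rescaling preserves the relative ordering of the untouched dimensions. Teams that want exact control can instead specify all four weights explicitly in the calibration file.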

Implementation Walkthrough

Step 1 — Define metadata and score models

Use pydantic v2 to enforce strict boundaries at the model layer. This prevents malformed ingestion records from producing misleading scores downstream.

from __future__ import annotations
from typing import Literal, Optional

from pydantic import BaseModel, Field


class SpatialAssetMetadata(BaseModel):
    """Normalized metadata record for a single spatial asset."""
    asset_id: str
    license_id: Optional[str] = None          # SPDX identifier or None if absent
    attribution_required: bool = False
    # Counts cannot be negative; reject such records at the model boundary.
    attribution_parties: int = Field(default=0, ge=0)
    provenance_complete: bool = False
    metadata_fields_present: int = Field(default=0, ge=0)
    total_required_fields: int = Field(default=10, ge=0)  # size of the required-field checklist in use


class RiskScore(BaseModel):
    """Per-dimension and composite risk scores for one spatial asset."""
    asset_id: str
    license_score: float = Field(ge=0, le=10)
    attribution_score: float = Field(ge=0, le=10)
    provenance_score: float = Field(ge=0, le=10)
    metadata_score: float = Field(ge=0, le=10)
    composite_risk: float = Field(ge=0, le=10)
    band: Literal["GREEN", "AMBER", "RED"]

Step 2 — Implement dimension scoring functions

Each function is pure and deterministic; it accepts only the fields relevant to its dimension so unit tests remain isolated.

# Open SPDX identifiers that carry minimal compliance burden
_OPEN_LICENSES: frozenset[str] = frozenset({
    "CC0-1.0", "MIT", "Apache-2.0", "ODbL-1.0",
    "CC-BY-4.0", "CC-BY-SA-4.0", "PDDL-1.0",
})


def score_license(license_id: Optional[str]) -> float:
    """Return 0–10 license restrictiveness score.

    An absent license scores 9.0 because the compliance obligation is
    unknown, not necessarily absent.
    """
    if not license_id:
        return 9.0
    if license_id in _OPEN_LICENSES:
        return 1.0
    # Anything else (restrictive, custom commercial, or unrecognized) starts
    # at the floor of the 7–10 range given in the calibration table.
    return 7.0


def score_attribution(required: bool, parties: int) -> float:
    """Return 0–10 attribution complexity score."""
    if not required:
        return 0.0
    if parties <= 1:
        return 2.0
    # Each additional party adds pipeline overhead; cap at 10
    return min(10.0, 3.0 + parties * 1.5)


def score_provenance(complete: bool) -> float:
    """Return 0–10 provenance score; incomplete lineage is high-risk."""
    return 0.0 if complete else 8.5


def score_metadata(present: int, total: int) -> float:
    """Return 0–10 metadata completeness score."""
    if total == 0:
        return 10.0
    return max(0.0, round(10.0 - (present / total) * 10.0, 2))

Step 3 — Aggregate into a composite risk score

The weights match the calibration table above. Pass an explicit weights dict to support per-organization overrides without changing function signatures.

_DEFAULT_WEIGHTS = {
    "license": 0.35,
    "attribution": 0.25,
    "provenance": 0.25,
    "metadata": 0.15,
}

_BANDS = [(3.5, "GREEN"), (7.0, "AMBER"), (10.0, "RED")]


def classify_band(composite: float) -> str:
    for threshold, label in _BANDS:
        if composite <= threshold:
            return label
    return "RED"


def evaluate_asset(
    metadata: SpatialAssetMetadata,
    weights: dict[str, float] | None = None,
) -> RiskScore:
    """Compute per-dimension and composite risk for one spatial asset."""
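    w = _DEFAULT_WEIGHTS if weights is None else weights
    # Validate explicitly: assert statements are stripped under `python -O`.
    if set(w) != set(_DEFAULT_WEIGHTS):
        raise ValueError(f"Weights must define exactly: {sorted(_DEFAULT_WEIGHTS)}")
    if abs(sum(w.values()) - 1.0) > 1e-6:
        raise ValueError("Weights must sum to 1.0")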

    ls = score_license(metadata.license_id)
    at = score_attribution(metadata.attribution_required, metadata.attribution_parties)
    pv = score_provenance(metadata.provenance_complete)
    md = score_metadata(metadata.metadata_fields_present, metadata.total_required_fields)

    composite = round(
        ls * w["license"]
        + at * w["attribution"]
        + pv * w["provenance"]
        + md * w["metadata"],
        3,
    )
    composite = min(10.0, composite)

    return RiskScore(
        asset_id=metadata.asset_id,
        license_score=round(ls, 2),
        attribution_score=round(at, 2),
        provenance_score=round(pv, 2),
        metadata_score=round(md, 2),
        composite_risk=composite,
        band=classify_band(composite),
    )

Step 4 — Batch-score a catalog with pandas

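For catalog-scale batches, apply evaluate_asset row by row. This is a plain Python loop rather than true vectorization, so throughput scales linearly with the number of rows. Malformed records are quarantined rather than silently dropped.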

import pandas as pd
from pydantic import ValidationError

_SCORE_COLUMNS = [
    "asset_id", "license_score", "attribution_score",
    "provenance_score", "metadata_score", "composite_risk", "band",
]


def score_catalog_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Score every row in a catalog DataFrame.

    Expected columns: asset_id, license_id, attribution_required,
    attribution_parties, provenance_complete, metadata_fields_present,
    total_required_fields.

    Returns the original DataFrame with appended score columns;
    quarantined rows keep NaN in those columns.
    """
    results: list[dict] = []
    quarantine: list[str] = []

    for _, row in df.iterrows():
        # pandas stores missing values as NaN/NA. Convert them to None so an
        # absent license is scored as absent instead of failing validation.
        record = {k: (None if pd.isna(v) else v) for k, v in row.to_dict().items()}
        try:
            meta = SpatialAssetMetadata(**record)
            results.append(evaluate_asset(meta).model_dump())
        except ValidationError as exc:
            asset_id = str(record.get("asset_id") or "unknown")
            quarantine.append(asset_id)
            print(f"[QUARANTINE] {asset_id}: {exc}")

    if quarantine:
        print(f"Quarantined {len(quarantine)} malformed records: {quarantine}")

    # Explicit columns keep the merge valid even if every row was quarantined.
    scores_df = pd.DataFrame(results, columns=_SCORE_COLUMNS)
    return df.merge(scores_df, on="asset_id", how="left")

Always route pydantic.ValidationError instances to a quarantine queue rather than raising and halting the entire pipeline. Missing data is a compliance signal, not a processing error.
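
A quarantine queue can be as simple as a list of structured records serialized to JSON Lines. The sketch below is one hypothetical shape: the `QuarantineRecord` fields, the `QuarantineQueue` class, and the example identifiers are assumptions, and a production system would more likely write to a table or message queue.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class QuarantineRecord:
    asset_id: str
    reason: str
    run_id: str
    quarantined_at: str = field(default_factory=_utc_now)


class QuarantineQueue:
    """In-memory stand-in for a durable quarantine store."""

    def __init__(self) -> None:
        self._records: List[QuarantineRecord] = []

    def add(self, asset_id: str, reason: str, run_id: str) -> None:
        self._records.append(QuarantineRecord(asset_id, reason, run_id))

    def __len__(self) -> int:
        return len(self._records)

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line for later review."""
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self._records)


queue = QuarantineQueue()
queue.add("parcels-2021", "attribution_parties: input should be a valid integer", "run-0001")
print(len(queue))
print(queue.to_jsonl())
```

Keeping the validation message and the run identifier with each record lets reviewers see why an asset was held back without rerunning the pipeline.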

Validation & CI Integration

Embed score assertions as a CI gate so datasets cannot advance through the pipeline without passing the configured risk threshold. The following patterns work with pytest, pre-commit, and GitHub Actions.

# tests/test_risk_scoring.py
import pytest
from your_pipeline.scoring import (
    SpatialAssetMetadata, evaluate_asset, score_license, score_provenance
)


def test_open_license_low_risk():
    assert score_license("CC0-1.0") == 1.0


def test_absent_license_high_risk():
    assert score_license(None) == 9.0


def test_missing_provenance_high_risk():
    assert score_provenance(False) == 8.5


def test_composite_green_band():
    meta = SpatialAssetMetadata(
        asset_id="test-001",
        license_id="CC0-1.0",
        attribution_required=False,
        attribution_parties=0,
        provenance_complete=True,
        metadata_fields_present=10,
        total_required_fields=10,
    )
    result = evaluate_asset(meta)
    assert result.band == "GREEN"
    assert result.composite_risk <= 3.5


def test_red_band_blocked():
    """Simulate a pipeline gate that blocks RED-band assets."""
    meta = SpatialAssetMetadata(
        asset_id="test-002",
        license_id=None,
        attribution_required=True,
        attribution_parties=4,
        provenance_complete=False,
        metadata_fields_present=2,
        total_required_fields=10,
    )
    result = evaluate_asset(meta)
    assert result.band == "RED", "Asset must be blocked from ingestion"
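
Beyond unit tests, a CI job can turn scored results into a pass or fail signal. The following sketch assumes scores are available as plain dicts with `asset_id` and `band` keys (for example from `model_dump()`); the function names and the default of tolerating AMBER are illustrative choices.

```python
import sys
from typing import Dict, List

_BAND_ORDER = {"GREEN": 0, "AMBER": 1, "RED": 2}


def blocked_assets(scores: List[Dict], max_band: str = "AMBER") -> List[str]:
    """Return the IDs of assets whose band is worse than max_band."""
    limit = _BAND_ORDER[max_band]
    return [s["asset_id"] for s in scores if _BAND_ORDER[s["band"]] > limit]


def gate_exit_code(scores: List[Dict], max_band: str = "AMBER") -> int:
    """Return 1 if any asset must be blocked, else 0, logging each blocked ID."""
    blocked = blocked_assets(scores, max_band)
    for asset_id in blocked:
        print(f"BLOCKED: {asset_id}", file=sys.stderr)
    return 1 if blocked else 0


scores = [
    {"asset_id": "test-001", "band": "GREEN"},
    {"asset_id": "test-002", "band": "RED"},
]
exit_code = gate_exit_code(scores)
print(exit_code)  # 1
```

A CI entry point would pass this value to `sys.exit()` so the job fails whenever a RED asset is present.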

For catalog-level CI gates, use ogrinfo to verify CRS presence before feeding an asset to the scoring pipeline:

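# Fail when any layer lacks a CRS. ogrinfo prints the "Layer SRS WKT:" header even
# for layers with no CRS, followed by "(unknown)", so test for that marker instead.
if ogrinfo -al -so municipal_parcels.gpkg | grep -A1 "Layer SRS WKT" | grep -q "(unknown)"; then
  echo "ERROR: missing CRS" >&2
  exit 1
fi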

Wire the test suite into a pre-commit hook via .pre-commit-config.yaml to catch schema regressions before catalog writes:

repos:
  - repo: local
    hooks:
      - id: risk-scoring-tests
        name: Geospatial risk scoring unit tests
        language: system  # use the active environment, where pytest and the pipeline are installed
        entry: python -m pytest tests/test_risk_scoring.py -q
        pass_filenames: false

Derivative & Lineage Management

Every spatial transformation — reprojection, clip, spatial join, rasterization, or dissolve — can alter the compliance obligations attached to a dataset. The scoring framework must re-evaluate after each pipeline stage, not only at initial ingestion.

Reprojection does not change license terms but breaks the provenance chain if the source CRS and target CRS are not logged. Record pyproj.CRS.to_wkt() for both source and output, and store the transform as a lineage event in the catalog. A missing reprojection record scores the provenance dimension as incomplete.

Clip & subset operations inherit the upstream license. If the source is ODbL-1.0, the clipped output remains ODbL-1.0 with the same share-alike obligation. Automated attribution mapping, as described in Automated Attribution Mapping Workflows, ensures the attribution string propagates correctly to every derived layer.
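
License inheritance for a clip can be expressed as copying the parent record and changing only what the operation affects. In this sketch the `AssetRecord` fields, the `derive_clip` helper, and the example identifiers are hypothetical. Provenance is reset to incomplete until a lineage event is logged, in line with the pitfalls table.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    license_id: Optional[str]
    attribution_text: str
    provenance_complete: bool
    parent_asset_id: Optional[str] = None


def derive_clip(parent: AssetRecord, child_id: str) -> AssetRecord:
    """Create the record for a clipped output that inherits the parent's terms."""
    return replace(
        parent,
        asset_id=child_id,
        parent_asset_id=parent.asset_id,
        # Stays False until the clip is recorded as a lineage event.
        provenance_complete=False,
    )


source = AssetRecord(
    asset_id="roads-full",
    license_id="ODbL-1.0",
    attribution_text="© OpenStreetMap contributors",
    provenance_complete=True,
)
clipped = derive_clip(source, "roads-county-clip")
print(clipped.license_id, clipped.parent_asset_id, clipped.provenance_complete)
```

Because the record is frozen and copied with `replace`, the license and attribution string cannot be dropped by accident when the child record is built.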

Spatial joins that merge two datasets with different licenses create a new composite asset. The scoring framework must inspect both upstream licenses, apply the more restrictive classification, and flag commercial EULA constraints using the tracking patterns from Commercial EULA Compliance Tracking. Log the join operation with both input asset_id values so the composite lineage is fully traceable.
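
The "more restrictive wins" rule for joins can be sketched with a coarse three-level ranking. The ranking below, the `join_license` helper, and the `LicenseRef-Vendor-EULA` identifier are assumptions for illustration; the open set mirrors the one used in the scoring functions.

```python
from typing import Dict, Optional, Tuple

_OPEN_LICENSES = frozenset({
    "CC0-1.0", "MIT", "Apache-2.0", "ODbL-1.0",
    "CC-BY-4.0", "CC-BY-SA-4.0", "PDDL-1.0",
})


def restrictiveness_rank(license_id: Optional[str]) -> int:
    """0 = open, 1 = restrictive or custom, 2 = absent (unknown obligations)."""
    if not license_id:
        return 2
    return 0 if license_id in _OPEN_LICENSES else 1


def join_license(
    left: Tuple[str, Optional[str]],
    right: Tuple[str, Optional[str]],
) -> Dict:
    """Pick the governing license for a join of two (asset_id, license_id) inputs."""
    (left_id, left_lic), (right_id, right_lic) = left, right
    if restrictiveness_rank(left_lic) >= restrictiveness_rank(right_lic):
        governing = left_lic
    else:
        governing = right_lic
    return {
        "license_id": governing,
        "source_asset_ids": [left_id, right_id],
        # Differing upstream licenses warrant a human look at compatibility.
        "needs_review": left_lic != right_lic,
    }


joined = join_license(("osm-roads", "ODbL-1.0"), ("vendor-poi", "LicenseRef-Vendor-EULA"))
print(joined)
```

This ranking does not distinguish share-alike from permissive terms within the open set, so two different open licenses tie and the left input wins. A finer-grained ranking would be needed where that distinction matters.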

Rasterization converts vector data to raster. The resulting GeoTIFF inherits the vector source license, but the metadata completeness score resets because ISO 19115-1 raster-specific fields — MD_GridSpatialRepresentation.numberOfDimensions, cell size, and band statistics — must be populated anew.

Maintain an immutable lineage journal (append-only table or event log) that captures: source_asset_id, operation, output_asset_id, timestamp, and operator_id. This journal is the primary evidence artifact for audits and proves compliance posture at every pipeline stage.
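
A minimal in-memory sketch of such a journal follows, using the five fields listed above. The `LineageJournal` class, its `ancestors` helper, and the example identifiers are assumptions; a real deployment would back this with an append-only table or event log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class LineageEvent:
    source_asset_id: str
    operation: str
    output_asset_id: str
    timestamp: str
    operator_id: str


class LineageJournal:
    """Append-only journal: events can be added and read, never edited."""

    def __init__(self) -> None:
        self._events: List[LineageEvent] = []

    def append(self, source_asset_id: str, operation: str,
               output_asset_id: str, operator_id: str) -> LineageEvent:
        event = LineageEvent(
            source_asset_id=source_asset_id,
            operation=operation,
            output_asset_id=output_asset_id,
            timestamp=datetime.now(timezone.utc).isoformat(),
            operator_id=operator_id,
        )
        self._events.append(event)
        return event

    def events(self) -> Tuple[LineageEvent, ...]:
        """Return an immutable snapshot of the journal."""
        return tuple(self._events)

    def ancestors(self, asset_id: str) -> List[str]:
        """Walk the journal backwards to find every upstream asset."""
        found: List[str] = []
        frontier = [asset_id]
        while frontier:
            current = frontier.pop()
            for event in self._events:
                if event.output_asset_id == current and event.source_asset_id not in found:
                    found.append(event.source_asset_id)
                    frontier.append(event.source_asset_id)
        return found


journal = LineageJournal()
journal.append("parcels-raw", "reproject", "parcels-3857", "etl-bot")
journal.append("parcels-3857", "clip", "parcels-clip", "etl-bot")
print(journal.ancestors("parcels-clip"))  # ['parcels-3857', 'parcels-raw']
```

A spatial join would be logged as two events sharing one `output_asset_id`, one per input, so `ancestors` returns both upstream assets.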

Pitfalls & Resolution Table

Pitfall	Root Cause	Resolution Strategy
Absent license scored as low-risk zero	Scoring logic treats `None` as 0 rather than unknown risk	Return 9.0 for absent license; unknown compliance state is always high-risk
Composite score unchanged after metadata remediation	Cached normalized license map not invalidated on update	Version and hash the calibration mapping file; bust cache on file change
CRS field present but malformed WKT causes false green	Metadata parser accepts the field as non-null without validating WKT content	Use `pyproj.CRS.from_wkt()` inside the scoring function; treat `CRSError` as a completeness failure
ODbL share-alike not propagated to derived vector layer	Downstream scoring treats clipped output as a new asset with no upstream license	Attach parent `asset_id` and `license_id` in lineage journal; re-score inherits upstream license
Score drift between runs on identical assets	Stochastic elements (e.g. API calls to resolve license text) inside scoring functions	Scoring functions must be pure and accept pre-resolved inputs; resolve external data in a separate ingestion stage
Provenance score incorrectly green after undocumented reproject	Provenance flag set to `True` at ingestion but not re-evaluated after transformation	Re-set `provenance_complete = False` on any pipeline transformation; require explicit lineage event before re-enabling
Multi-party attribution score underestimates burden for regulatory datasets	`attribution_parties` counts organizations but not per-map rendering requirements	Extend model with `rendering_notices: int` field; add it to the attribution score formula with a separate coefficient
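
The cache-invalidation row above can be implemented by keying cached scores on a hash of the calibration content. The sketch below assumes the calibration has already been loaded into a dict; `calibration_fingerprint` and `ScoreCache` are hypothetical names.

```python
import hashlib
import json
from typing import Dict, Optional, Tuple


def calibration_fingerprint(calibration: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(calibration, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ScoreCache:
    """Cache keyed by (asset_id, calibration fingerprint)."""

    def __init__(self) -> None:
        self._scores: Dict[Tuple[str, str], float] = {}

    def get(self, asset_id: str, fingerprint: str) -> Optional[float]:
        return self._scores.get((asset_id, fingerprint))

    def put(self, asset_id: str, fingerprint: str, score: float) -> None:
        self._scores[(asset_id, fingerprint)] = score


weights_v1 = {"license": 0.35, "attribution": 0.25, "provenance": 0.25, "metadata": 0.15}
weights_v2 = {**weights_v1, "license": 0.40, "metadata": 0.10}

cache = ScoreCache()
cache.put("test-001", calibration_fingerprint(weights_v1), 0.35)

print(cache.get("test-001", calibration_fingerprint(weights_v1)))  # 0.35
print(cache.get("test-001", calibration_fingerprint(weights_v2)))  # None
```

Because the fingerprint is part of the key, any edit to the calibration produces cache misses automatically and no explicit invalidation step is required.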

Related

Geospatial Data Licensing & Compliance Fundamentals: parent overview covering licensing models, compliance obligations, and pipeline integration patterns
Creative Commons Licensing for GIS Datasets: CC0, CC-BY, CC-BY-SA, and ODbL specifics that feed directly into the license restrictiveness dimension
Commercial EULA Compliance Tracking: tracking proprietary seat limits, geographic restrictions, and term expiry that override open metadata signals
Automated Attribution Mapping Workflows: automating citation string generation from the attribution dimension outputs
