What SPDX confidence threshold should trigger human review?

Route any match below 85% similarity to a human review queue. At this threshold most genuine open licenses resolve automatically while ambiguous commercial clauses, custom municipal terms, and truncated license stubs are correctly escalated.

How do I handle datasets that carry multiple licenses across layers?

Record all applicable SPDX IDs per layer in a priority matrix. Run the conflict detector across the combined set before output generation: incompatible pairings (e.g. GPL-3.0-only alongside CC-BY-NC-4.0) must halt the pipeline, not be silently merged.

Which metadata formats does the pipeline need to support?

At minimum: ISO 19115 XML, FGDC CSDGM, GeoJSON properties, and plain-text README/LICENSE sidecars. STAC Item assets and DCAT-AP JSON-LD are increasingly common and should be handled by adapter layers.

Automated Attribution Mapping Workflows

Modern spatial products routinely combine municipal parcel boundaries, satellite-derived land cover classifications, open street networks, and proprietary elevation models. Each source carries distinct licensing terms, attribution mandates, and redistribution constraints — and manual tracking of those obligations quickly becomes unsustainable as dataset velocity increases. Automated attribution mapping pipelines solve this by programmatically ingesting metadata, resolving license identifiers against authoritative registries, and generating compliant citation strings before any publication or distribution step. This workflow sits within Geospatial Data Licensing & Compliance Fundamentals, where systematic metadata hygiene transitions from a compliance checkbox to an operational engineering concern.

Prerequisites

Python 3.9+ — managed with pip or conda; isolate every project in a virtual environment
geopandas>=0.14 — vector I/O and coordinate reference system handling
pydantic>=2.0 — strict schema validation for manifest models and attribution objects
lxml>=4.9 — XPath-based parsing of ISO 19115 and FGDC CSDGM XML sidecars
requests>=2.31 and requests-cache>=1.1 — SPDX registry lookups with TTL-backed local caching
spdx-tools>=0.8 — canonical SPDX document parsing and identifier normalisation
Jinja2>=3.1 — attribution template rendering with variable substitution
rapidfuzz>=3.0 — fuzzy similarity scoring for license text fingerprinting
PyYAML and chardet (imported by the code below for attribution override files and sidecar encoding detection)
Structured metadata sources: ISO 19115 XML, FGDC CSDGM, or embedded GeoJSON/Shapefile .xml sidecars
A version-controlled repository for pipeline scripts, test fixtures, and attribution templates — the pipeline is code and must be treated as such

Concept & Spec Reference

Attribution mapping pipelines depend on three normalisation layers: metadata format parsing, license identifier resolution, and template rendering. Understanding the spec surface for each layer prevents the most common failure modes.

Metadata formats

Format	Primary use	Key license fields	Parser
ISO 19115 XML	Government and SDI datasets	`MD_Constraints`, `MD_LegalConstraints/useConstraints`, `MD_LegalConstraints/otherConstraints`	`lxml` XPath
FGDC CSDGM	US federal and legacy agency layers	`<distliab>`, `<useconst>`, `<accconst>`	`lxml` XPath
GeoJSON `properties`	Web APIs and open data portals	`license`, `attribution`, `rights` (no standard field names)	`json` / `geopandas`
STAC Item / Collection	Satellite imagery and analysis-ready data	`license` field; `links` entry with `rel: "license"` and an `href` pointing to the license document	`requests` + `json`
Plain sidecar	Legacy Shapefiles, GeoTIFFs	`LICENSE.txt`, `README.md`, `*.xml` adjacent file	`pathlib` glob

No universal license field name exists across these formats, so build an adapter layer for each format rather than a single generic extractor. Adapter isolation prevents format-specific edge cases from polluting the canonical manifest model.
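
One way to keep adapters isolated is a small registry keyed by format label. The sketch below is illustrative only: `ADAPTERS`, `register_adapter`, and `extract_license` are hypothetical names, and only a plain-sidecar adapter is shown.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional

# Hypothetical registry: maps a metadata format label to its extractor.
Extractor = Callable[[Path], Optional[str]]
ADAPTERS: dict[str, Extractor] = {}


def register_adapter(fmt: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor under a format label."""
    def decorator(func: Extractor) -> Extractor:
        ADAPTERS[fmt] = func
        return func
    return decorator


@register_adapter("readme")
def extract_readme(path: Path) -> Optional[str]:
    # Plain sidecars: the whole file is treated as raw license text.
    text = path.read_text(encoding="utf-8", errors="replace").strip()
    return text or None


def extract_license(fmt: str, path: Path) -> Optional[str]:
    """Dispatch to the adapter for `fmt`; unknown formats fail loudly."""
    try:
        adapter = ADAPTERS[fmt]
    except KeyError:
        raise ValueError(f"no adapter registered for format {fmt!r}") from None
    return adapter(path)
```

Adding a new format then means registering one more function, without touching the dispatch code or the other adapters.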

SPDX identifier resolution

SPDX identifiers (CC-BY-4.0, ODbL-1.0, GPL-3.0-only, etc.) provide a machine-readable vocabulary for license obligations. The resolution pipeline maps free-text license prose — which may be truncated, paraphrased, or wrapped in agency boilerplate — to canonical SPDX IDs using fuzzy similarity scoring via rapidfuzz. For open license clauses, cross-reference Creative Commons licensing for GIS datasets to confirm correct BY, SA, or NC clause handling: the distinction between CC-BY-4.0 and CC-BY-SA-4.0 materially changes derivative obligations.
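
To make the BY / SA / NC distinction concrete, the following sketch pulls the clause codes out of a Creative Commons SPDX identifier. The helper name `cc_clauses` and the one-line obligation summaries are assumptions for illustration, not legal definitions.

```python
from __future__ import annotations

# Clause codes that can appear in a Creative Commons SPDX identifier.
# The descriptions are informal summaries, not legal text.
CC_CLAUSES = {
    "BY": "attribution required",
    "SA": "adaptations must carry the same license",
    "NC": "non-commercial use only",
    "ND": "adapted versions may not be shared",
}


def cc_clauses(spdx_id: str) -> list[str]:
    """Return the clause codes found in a CC SPDX ID, in order of appearance."""
    parts = spdx_id.upper().split("-")
    if parts[0] != "CC":
        return []  # not a clause-bearing CC identifier (e.g. ODbL-1.0, CC0-1.0)
    return [p for p in parts[1:] if p in CC_CLAUSES]


def requires_sharealike(spdx_id: str) -> bool:
    """True when the identifier carries the SA clause."""
    return "SA" in cc_clauses(spdx_id)
```

Under this sketch `cc_clauses("CC-BY-SA-4.0")` yields both `BY` and `SA`, while `CC-BY-4.0` yields only `BY`, which is the difference that changes derivative obligations.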

Attribution template fields

Each resolved SPDX ID maps to an attribution template with the following required substitution variables:

Variable	Source	Fallback
`dataset_name`	Metadata `title` field	filename stem
`publisher`	`MD_Constraints/responsibleParty` or `publisher` field	`"Unknown Publisher"` + validation warning
`publication_year`	`dateStamp` or `date` field	current year + validation warning
`license_short`	SPDX short identifier	raw resolved text
`source_url`	`linkage` or `distributionURL`	empty string (non-fatal)
`modification_note`	Injected by pipeline if transformations applied	omitted if no transforms

Never silently drop required fields. Emit a ValidationWarning and inject a clearly marked fallback string so that reviewers can identify incomplete records in the rendered output.
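
A minimal sketch of that rule, assuming `ValidationWarning` is a custom warning class defined by the pipeline (it is not a standard-library or Pydantic type) and that fallbacks are marked with a hypothetical `[MISSING: ...]` wrapper:

```python
from __future__ import annotations

import warnings


class ValidationWarning(UserWarning):
    """Issued when a required attribution field is missing from metadata."""


def field_or_fallback(metadata: dict, key: str, fallback: str, dataset_id: str) -> str:
    """Return metadata[key], or a clearly marked fallback plus a warning."""
    value = metadata.get(key)
    if value:
        return str(value)
    warnings.warn(
        f"{dataset_id}: '{key}' missing; using fallback",
        ValidationWarning,
        stacklevel=2,
    )
    # The marker format is an assumption; pick one reviewers can grep for.
    return f"[MISSING: {fallback}]"
```

Because the fallback is visibly marked, an incomplete record stands out in the rendered attribution instead of passing as a real publisher or year.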

Implementation Walkthrough

Step 1: Metadata inventory and ingestion

Build a recursive directory scanner that normalises file paths, detects encoding, and extracts raw license text alongside dataset identifiers. The Pydantic model enforces strict typing before any downstream step touches the data.

# metadata_inventory.py
from __future__ import annotations
import json
import csv
from pathlib import Path
from typing import Optional
from pydantic import BaseModel, field_validator
from lxml import etree
import geopandas as gpd
import chardet


class DatasetRecord(BaseModel):
    dataset_id: str
    source_path: str
    metadata_format: str  # iso19115 | fgdc | geojson | stac | readme
    raw_license_text: Optional[str]
    last_modified: float

    @field_validator("raw_license_text")
    @classmethod
    def strip_whitespace(cls, v: Optional[str]) -> Optional[str]:
        return v.strip() if v else None


SIDECAR_SUFFIXES = (".xml", ".txt", ".md")
ISO_LICENSE_XPATH = (
    ".//gmd:MD_LegalConstraints/gmd:otherConstraints/gco:CharacterString"
)
FGDC_LICENSE_XPATH = ".//useconst"
ISO_NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def _read_text(path: Path) -> str:
    raw = path.read_bytes()
    enc = chardet.detect(raw).get("encoding") or "utf-8"
    return raw.decode(enc, errors="replace")


def _extract_iso(path: Path) -> Optional[str]:
    try:
        tree = etree.parse(str(path))
        nodes = tree.xpath(ISO_LICENSE_XPATH, namespaces=ISO_NS)
        return nodes[0].text if nodes else None
    except etree.XMLSyntaxError:
        return None


def _extract_fgdc(path: Path) -> Optional[str]:
    try:
        tree = etree.parse(str(path))
        nodes = tree.xpath(FGDC_LICENSE_XPATH)
        return nodes[0].text if nodes else None
    except etree.XMLSyntaxError:
        return None


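def _extract_geojson(path: Path) -> Optional[str]:
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None
    if not isinstance(data, dict):
        return None  # valid JSON, but not a GeoJSON object
    # Feature-level properties take precedence; otherwise check top-level keys.
    props = data.get("properties") or data
    for key in ("license", "rights", "attribution"):
        if props.get(key):  # skip empty or null values and try the next key
            return str(props[key])
    return None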


def scan_repository(root: Path) -> list[DatasetRecord]:
    records: list[DatasetRecord] = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        record: Optional[DatasetRecord] = None

        if suffix == ".xml":
            raw = _read_text(path)
            if "MD_Metadata" in raw or "gmd:" in raw:
                record = DatasetRecord(
                    dataset_id=path.stem,
                    source_path=str(path),
                    metadata_format="iso19115",
                    raw_license_text=_extract_iso(path),
                    last_modified=path.stat().st_mtime,
                )
            elif "<metadata>" in raw.lower():
                record = DatasetRecord(
                    dataset_id=path.stem,
                    source_path=str(path),
                    metadata_format="fgdc",
                    raw_license_text=_extract_fgdc(path),
                    last_modified=path.stat().st_mtime,
                )
        elif suffix == ".geojson":
            record = DatasetRecord(
                dataset_id=path.stem,
                source_path=str(path),
                metadata_format="geojson",
                raw_license_text=_extract_geojson(path),
                last_modified=path.stat().st_mtime,
            )
        elif suffix in (".txt", ".md") and path.stem.upper() in (
            "LICENSE",
            "README",
            "COPYING",
        ):
            record = DatasetRecord(
                dataset_id=path.parent.stem,
                source_path=str(path),
                metadata_format="readme",
                raw_license_text=_read_text(path)[:4096],
                last_modified=path.stat().st_mtime,
            )

        if record is not None:
            records.append(record)

    return records


def write_manifest(records: list[DatasetRecord], output: Path) -> None:
    fieldnames = list(DatasetRecord.model_fields.keys())
    with output.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for r in records:
            writer.writerow(r.model_dump())

Step 2: License fingerprinting and SPDX resolution

Raw license text contains custom phrasing, legacy references, and truncated clauses. Direct string matching fails. The fingerprinting layer normalises whitespace, strips boilerplate headers, and computes similarity against the SPDX registry.

# spdx_resolver.py
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
import requests
import requests_cache
from rapidfuzz import fuzz

requests_cache.install_cache(
    str(Path.home() / ".cache" / "spdx_lookup"),
    expire_after=86400 * 7,  # 7-day TTL
)

SPDX_LICENSE_LIST_URL = (
    "https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json"
)
_LICENSE_REGISTRY: Optional[list[dict]] = None

PROPRIETARY_SIGNALS = re.compile(
    r"\b(proprietary|all rights reserved|commercial license|"
    r"not for redistribution|confidential)\b",
    re.IGNORECASE,
)
HEADER_STRIP = re.compile(
    r"(terms (of|and) (use|service)|end user license agreement|"
    r"software license agreement)[^\n]*\n",
    re.IGNORECASE,
)


def _load_registry() -> list[dict]:
    global _LICENSE_REGISTRY
    if _LICENSE_REGISTRY is None:
        resp = requests.get(SPDX_LICENSE_LIST_URL, timeout=30)
        resp.raise_for_status()
        _LICENSE_REGISTRY = resp.json()["licenses"]
    return _LICENSE_REGISTRY


def _normalise(text: str) -> str:
    text = HEADER_STRIP.sub("", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text[:2000]  # cap for performance


@dataclass
class ResolutionResult:
    spdx_id: Optional[str]
    confidence: float
    is_proprietary: bool
    raw_license_text: str


def resolve_license(raw_text: str) -> ResolutionResult:
    if not raw_text or not raw_text.strip():
        return ResolutionResult(None, 0.0, False, raw_text or "")

    if PROPRIETARY_SIGNALS.search(raw_text):
        return ResolutionResult(None, 100.0, True, raw_text)

    normalised = _normalise(raw_text)
    registry = _load_registry()
    best_id: Optional[str] = None
    best_score = 0.0

    for entry in registry:
        candidate = _normalise(entry.get("name", "") + " " + entry.get("licenseId", ""))
        score = fuzz.token_set_ratio(normalised, candidate)
        if score > best_score:
            best_score = score
            best_id = entry["licenseId"]

    return ResolutionResult(
        spdx_id=best_id if best_score >= 85 else None,
        confidence=best_score,
        is_proprietary=False,
        raw_license_text=raw_text,
    )

When is_proprietary is True, route the record to the commercial EULA compliance tracking module rather than attempting SPDX mapping. Forcing SPDX identifiers onto vendor-specific redistribution terms produces misleading results and masks actual contractual obligations.
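
The routing rules can be sketched as a single function. This is an illustration under assumptions: the queue names are hypothetical, and the function takes the three relevant `ResolutionResult` fields as plain arguments so that it stands alone.

```python
from __future__ import annotations

from typing import Optional

REVIEW_THRESHOLD = 85.0  # same cut-off that resolve_license applies


def route(spdx_id: Optional[str], confidence: float, is_proprietary: bool) -> str:
    """Return the name of the queue a resolution result belongs in."""
    if is_proprietary:
        # Vendor terms go to EULA tracking; no SPDX mapping is attempted.
        return "eula_tracking"
    if spdx_id is None or confidence < REVIEW_THRESHOLD:
        # Ambiguous, truncated, or custom terms are escalated to a person.
        return "human_review"
    return "auto_resolved"
```

Checking `is_proprietary` first matters: proprietary results carry a confidence of 100.0 in `resolve_license`, so a confidence-only check would misroute them.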

Step 3: Attribution template assembly

Map resolved SPDX IDs to Jinja2 templates. Municipal and regional datasets frequently impose jurisdiction-specific phrasing — store overrides in a YAML configuration that the template engine consults before falling back to generic SPDX templates. See building a license compliance matrix for municipal data for patterns that handle city, county, and state-level variation without hardcoding exceptions.

# attribution_assembler.py
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional
from jinja2 import Environment, BaseLoader, StrictUndefined
import yaml
from pathlib import Path

GENERIC_TEMPLATE = (
    "{{ dataset_name }} ({{ publisher }}, {{ publication_year }}) "
    "licensed under {{ license_short }}. "
    "{% if source_url %}Source: {{ source_url }}{% endif %}"
    "{% if modification_note %} [{{ modification_note }}]{% endif %}"
)

jinja_env = Environment(loader=BaseLoader(), undefined=StrictUndefined)


@dataclass
class AttributionRecord:
    dataset_id: str
    attribution_string: str
    spdx_id: Optional[str]
    warnings: list[str]


def _load_overrides(config_path: Path) -> dict[str, str]:
    if not config_path.exists():
        return {}
    with config_path.open(encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def assemble_attribution(
    dataset_id: str,
    spdx_id: str,
    metadata: dict,
    overrides_path: Path = Path("attribution_overrides.yaml"),
) -> AttributionRecord:
    warnings: list[str] = []
    overrides = _load_overrides(overrides_path)
    template_str = overrides.get(spdx_id) or GENERIC_TEMPLATE
    template = jinja_env.from_string(template_str)

    publisher = metadata.get("publisher")
    if not publisher:
        publisher = "Unknown Publisher"
        warnings.append(f"{dataset_id}: publisher field missing; using fallback")

    pub_year = metadata.get("publication_year")
    if not pub_year:
        from datetime import date
        pub_year = str(date.today().year)
        warnings.append(f"{dataset_id}: publication_year missing; using current year")

    attribution = template.render(
        dataset_name=metadata.get("dataset_name", dataset_id),
        publisher=publisher,
        publication_year=pub_year,
        license_short=spdx_id,
        source_url=metadata.get("source_url", ""),
        modification_note=metadata.get("modification_note", ""),
    )
    return AttributionRecord(
        dataset_id=dataset_id,
        attribution_string=attribution,
        spdx_id=spdx_id,
        warnings=warnings,
    )
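
If overrides need to vary by jurisdiction as well as by license, the lookup order can be made explicit. The sketch below assumes a nested override layout (`jurisdictions` and `licenses` keys) that differs from the flat SPDX-keyed mapping used by `assemble_attribution` above; the layout and the `pick_template` name are hypothetical.

```python
from __future__ import annotations

from typing import Optional

GENERIC_TEMPLATE = (
    "{{ dataset_name }} ({{ publisher }}, {{ publication_year }}) "
    "licensed under {{ license_short }}."
)


def pick_template(overrides: dict, spdx_id: str, jurisdiction: Optional[str] = None) -> str:
    """Choose a template string: jurisdiction override, then license override, then generic."""
    by_jurisdiction = overrides.get("jurisdictions", {}).get(jurisdiction or "", {})
    by_license = overrides.get("licenses", {})
    return by_jurisdiction.get(spdx_id) or by_license.get(spdx_id) or GENERIC_TEMPLATE


# Example override data, as it might look after yaml.safe_load (values are made up).
example_overrides = {
    "jurisdictions": {
        "example_city": {"CC-BY-4.0": "Contains data from {{ publisher }} ({{ publication_year }})."},
    },
    "licenses": {
        "ODbL-1.0": "{{ dataset_name }} © {{ publisher }}, available under ODbL-1.0.",
    },
}
```

Keeping the precedence in one function means a new city or county only needs a new entry in the override file, not a new code path.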

Step 4: Conflict detection and output generation

The final stage aggregates resolved attributions, detects incompatible license combinations, and emits publication-ready output in multiple formats. Composite geospatial products inherit the obligations of every constituent layer.

# output_generator.py
from __future__ import annotations
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Pairs of SPDX IDs that are legally incompatible for combination/redistribution
INCOMPATIBLE_PAIRS: set[frozenset] = {
    frozenset({"GPL-3.0-only", "CC-BY-NC-4.0"}),
    frozenset({"GPL-2.0-only", "Apache-2.0"}),
    frozenset({"AGPL-3.0-only", "CC-BY-ND-4.0"}),
    frozenset({"ODbL-1.0", "CC-BY-ND-4.0"}),
}


@dataclass
class ConflictReport:
    has_conflicts: bool
    conflicting_pairs: list[tuple[str, str]]
    message: str


def detect_conflicts(spdx_ids: list[str]) -> ConflictReport:
    seen = set(spdx_ids)
    conflicts: list[tuple[str, str]] = []
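    # Sort both the pairs and their members so the report is deterministic
    # across runs (set iteration order for strings is not stable).
    for pair in sorted(INCOMPATIBLE_PAIRS, key=sorted):
        a, b = sorted(pair)
        if a in seen and b in seen:
            conflicts.append((a, b))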
    return ConflictReport(
        has_conflicts=bool(conflicts),
        conflicting_pairs=conflicts,
        message=(
            f"Incompatible licenses detected: {conflicts}" if conflicts else "No conflicts"
        ),
    )


def emit_web_attribution(records: list[dict]) -> str:
    """Return a JSON string suitable for a web map attribution control."""
    return json.dumps([r["attribution_string"] for r in records], ensure_ascii=False)


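def emit_citation_cff(records: list[dict], output_path: Path) -> None:
    """Write a CITATION.cff manifest for data packages."""
    # json.dumps yields double-quoted scalars, which are also valid YAML strings;
    # repr() can emit escapes that YAML single-quoted strings do not understand.
    # Note: a complete CFF 1.2.0 file also needs top-level `title` and `authors`.
    lines = ["cff-version: 1.2.0", 'message: "Please cite all data sources."', "references:"]
    for r in records:
        lines.append("  - type: dataset")
        lines.append(f"    title: {json.dumps(r.get('dataset_name') or r['dataset_id'])}")
        lines.append(f"    license: {json.dumps(r.get('spdx_id') or 'LicenseRef-unknown')}")
        if r.get("source_url"):
            lines.append(f"    url: {json.dumps(r['source_url'])}")
    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")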


def emit_markdown_block(records: list[dict]) -> str:
    """Return a markdown attribution block for static exports."""
    lines = ["## Data Sources\n"]
    for r in records:
        lines.append(f"- {r['attribution_string']}")
    return "\n".join(lines)
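
Because incompatible pairings must halt the pipeline rather than be merged, a gate that raises is a natural wrapper around conflict detection. The sketch below is self-contained and uses only a subset of the pairs listed above; `LicenseConflictError` and `require_compatible` are hypothetical names, and the per-layer input shape is an assumption.

```python
from __future__ import annotations

from itertools import combinations

# Illustrative subset of the incompatible pairs defined in output_generator.py.
INCOMPATIBLE = {
    frozenset({"GPL-3.0-only", "CC-BY-NC-4.0"}),
    frozenset({"ODbL-1.0", "CC-BY-ND-4.0"}),
}


class LicenseConflictError(RuntimeError):
    """Raised to stop output generation when layers cannot be combined."""


def require_compatible(layer_licenses: dict[str, list[str]]) -> None:
    """Raise if any two SPDX IDs across all layers form a known bad pair."""
    # A layer may carry several licenses, so flatten before comparing.
    all_ids = sorted({spdx for ids in layer_licenses.values() for spdx in ids})
    bad = [pair for pair in combinations(all_ids, 2) if frozenset(pair) in INCOMPATIBLE]
    if bad:
        raise LicenseConflictError(f"incompatible license pairs: {bad}")
```

Calling `require_compatible` before any of the `emit_*` functions ensures no attribution output is written for a composite that cannot legally be distributed.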

Validation & CI Integration

Run the attribution pipeline as a pre-publish gate. Failing the build on unresolved licenses shifts compliance left and prevents non-compliant datasets from reaching staging or production.

# Verify manifest schema
python -m pytest tests/test_manifest_schema.py -v

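# Check all SPDX IDs resolved (no empty values in manifest).
# Assumes manifest.csv has been enriched with an spdx_id column after Step 2;
# write_manifest() on its own does not emit one.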
python -c "
import csv
with open('manifest.csv') as f:
    rows = list(csv.DictReader(f))
unresolved = [r['dataset_id'] for r in rows if not r.get('spdx_id')]
assert not unresolved, f'Unresolved licenses: {unresolved}'
print('All SPDX IDs resolved')
"

# Run conflict detector across full dataset set
python -c "
import json
from output_generator import detect_conflicts
ids = json.load(open('resolved_ids.json'))
report = detect_conflicts(ids)
assert not report.has_conflicts, report.message
print('No license conflicts')
"

For GitHub Actions integration, embed the validation step in your data PR workflow. The policy enforcement gates for data PRs pattern shows how to wire attribution checks into merge gating so that PRs adding new datasets cannot land without passing the full pipeline.

# .github/workflows/attribution-check.yml
name: Attribution compliance check
on:
  pull_request:
    paths:
      - "data/**"
      - "datasets/**"
jobs:
  attribution:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install geopandas pydantic lxml requests requests-cache spdx-tools Jinja2 rapidfuzz pyyaml chardet
      - run: python run_attribution_pipeline.py --root ./data --output ./attribution_report
      - run: python validate_attribution_report.py ./attribution_report

Spatial data schema linting in CI covers complementary schema-level gates that catch malformed metadata before the attribution pipeline ever runs — combining both layers provides defence-in-depth for data quality.
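
The workflow above calls `validate_attribution_report.py`, which is not shown in this section. The following is a minimal sketch of what such a script might check, assuming the report directory contains an `attributions.json` file holding a list of objects with `dataset_id`, `spdx_id`, and `attribution_string` keys. That file name and layout are assumptions, not something the pipeline above defines.

```python
from __future__ import annotations

import json
import sys
from pathlib import Path


def find_problems(report_dir: Path) -> list[str]:
    """Return human-readable problems found in the (assumed) report layout."""
    records = json.loads((report_dir / "attributions.json").read_text(encoding="utf-8"))
    problems: list[str] = []
    for rec in records:
        dataset_id = rec.get("dataset_id", "<unknown>")
        if not rec.get("spdx_id"):
            problems.append(f"{dataset_id}: unresolved license")
        if not (rec.get("attribution_string") or "").strip():
            problems.append(f"{dataset_id}: empty attribution string")
    return problems


def main(argv: list[str]) -> int:
    """Print problems to stderr and return a process exit code."""
    problems = find_problems(Path(argv[1]))
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

To use it as the CI step, the script would end with `sys.exit(main(sys.argv))` so that any problem produces a non-zero exit code and fails the job.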

Derivative & Lineage Management

Transformations materially affect attribution obligations. Clipping, reprojecting, joining, or rasterising a licensed dataset typically creates a derivative work under the original license terms. The pipeline must record the transformation type and propagate attribution through the lineage chain.

Maintain a per-layer lineage dictionary structured as follows:

lineage_entry = {
    "dataset_id": "city_parcels_2024",
    "parent_ids": ["source_parcels_raw"],
    "transformations": ["reproject:EPSG:4326->EPSG:3857", "clip:study_area_bbox"],
    "attribution_string": "City Parcel Data (City GIS Division, 2024) CC-BY-4.0 [reprojected, clipped]",
    "spdx_id": "CC-BY-4.0",
    "derivative": True,
}

When a dataset carries a ShareAlike obligation (CC-BY-SA-4.0 or ODbL-1.0), the derivative flag must trigger a downstream license propagation check: the composite product’s license must be compatible with the ShareAlike requirement. Flag these cases for human review rather than automatically propagating — ShareAlike scope in multi-layer composites is a legal determination, not a purely mechanical one.
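
Finding ShareAlike obligations anywhere in a layer's ancestry amounts to a graph walk over `parent_ids`. The sketch below assumes lineage entries shaped like the dictionary above, collected in a mapping keyed by `dataset_id`; the function name and that mapping are hypothetical.

```python
from __future__ import annotations

SHAREALIKE_IDS = {"CC-BY-SA-4.0", "ODbL-1.0"}


def sharealike_ancestors(dataset_id: str, lineage: dict[str, dict]) -> set[str]:
    """Return IDs of the dataset and its ancestors that carry ShareAlike terms."""
    found: set[str] = set()
    seen: set[str] = set()
    stack = [dataset_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and shared parents
        seen.add(current)
        entry = lineage.get(current, {})
        if entry.get("spdx_id") in SHAREALIKE_IDS:
            found.add(current)
        stack.extend(entry.get("parent_ids", []))
    return found
```

A non-empty result is a signal to queue the composite for human review, not to pick a license automatically.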

For STAC-based workflows, embed lineage as derived_from links in the STAC Item links array. This preserves machine-readable provenance that downstream consumers can traverse programmatically. Metadata extraction from PostGIS tables (covered in automating metadata extraction from PostGIS tables) can populate lineage fields directly from the database layer without manual annotation.
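
As a sketch of the STAC case, the function below appends `derived_from` links to an Item represented as a plain dictionary. The hrefs are placeholders and the function name is hypothetical; a real pipeline would use whatever STAC tooling it already depends on.

```python
from __future__ import annotations


def add_derived_from(item: dict, parent_hrefs: list[str]) -> dict:
    """Return a copy of a STAC Item dict with derived_from links appended."""
    links = list(item.get("links", []))
    existing = {(link.get("rel"), link.get("href")) for link in links}
    for href in parent_hrefs:
        if ("derived_from", href) in existing:
            continue  # avoid duplicate provenance links
        links.append({"rel": "derived_from", "href": href, "type": "application/geo+json"})
        existing.add(("derived_from", href))
    # The input item is left untouched; only the returned copy gains links.
    return {**item, "links": links}
```

Returning a copy keeps the original Item available for comparison, which is convenient when the lineage step runs inside a larger transformation.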

Pitfalls & Resolution Table

Pitfall	Root Cause	Resolution Strategy
SPDX resolution returns `None` for a known open license	License text is truncated, version-appended, or wrapped in agency boilerplate	Expand the normalisation strip list; add fuzzy pre-match against SPDX name aliases; log full raw text for manual inspection
Attribution string renders with `"Unknown Publisher"` across many datasets	ISO 19115 `responsibleParty` path differs between ISO 19115-1 and ISO 19115-3	Add both XPath variants in the ISO adapter; check `CI_Organisation` and `CI_Individual` nodes
ShareAlike propagation silently dropped in composite products	Conflict detector only checks pairwise combinations, not transitive obligations	Walk the full lineage DAG for each composite layer before emitting output; flag any ShareAlike ancestor
`requests_cache` returns stale SPDX data after a registry update	Default TTL too long or cache not invalidated on pipeline version bump	Use a 7-day TTL and include the pipeline version as a cache key component; provide a `--refresh-cache` CLI flag
CI gate passes but rendered attribution is malformed in PDFs	Jinja2 `StrictUndefined` only raises at render time; PDF template uses different field names	Run a smoke test rendering attributions against all output templates in CI, not just the web format
FGDC `<useconst>` field contains free-form legal prose, not a license name	FGDC CSDGM has no controlled vocabulary for `useconst`	Treat all FGDC `useconst` values as raw text; run through the full fingerprinting pipeline; set confidence floor to 70% for FGDC sources
Dual-licensed raster+vector datasets only resolve to one SPDX ID	Ingestion adapter stops at first match	Record all detected license statements as an array; run conflict detection across the full array before selecting the most restrictive
Attribution not updated after dataset re-publication	Pipeline only runs on new ingestion, not on modification timestamps	Include a `last_modified` delta check in CI; re-run attribution resolution whenever a source file’s modification time is later than the `last_modified` value recorded in the manifest
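
The `last_modified` delta check from the final row can be sketched as follows. It assumes manifest rows shaped like the output of `csv.DictReader` over the Step 1 manifest (string values, with `dataset_id`, `source_path`, and `last_modified` columns); the function name and the one-second tolerance are assumptions.

```python
from __future__ import annotations

from pathlib import Path


def stale_datasets(manifest_rows: list[dict], tolerance: float = 1.0) -> list[str]:
    """Return dataset IDs whose source changed after the manifest was written."""
    stale: list[str] = []
    for row in manifest_rows:
        path = Path(row["source_path"])
        if not path.exists():
            stale.append(row["dataset_id"])  # a missing source also needs attention
            continue
        recorded = float(row["last_modified"])
        # The tolerance absorbs filesystem timestamp rounding.
        if path.stat().st_mtime - recorded > tolerance:
            stale.append(row["dataset_id"])
    return stale
```

Any ID returned here would be re-run through resolution and attribution assembly before the CI gate is allowed to pass.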

Related

Geospatial Data Licensing & Compliance Fundamentals — parent section covering the full compliance domain
Commercial EULA Compliance Tracking — managing vendor-specific redistribution limits and audit trails for proprietary layers
Creative Commons Licensing for GIS Datasets — BY, SA, NC, and ND clause obligations in spatial data contexts
Building a License Compliance Matrix for Municipal Data — city, county, and state-level attribution format overrides
Policy Enforcement Gates for Data PRs — wiring attribution checks into merge gating workflows
Geospatial Risk Scoring Frameworks — quantifying compliance exposure across multi-source spatial products
