How to track CC-BY-SA attribution in shapefiles

To track CC-BY-SA attribution in shapefiles, embed license metadata in three layers: short attribute columns in the .dbf, an ISO 19139-style .shp.xml sidecar, and a version-controlled manifest.json that records a SHA-256 hash for each file alongside the attribution string. Because the legacy ESRI format lacks native licensing fields, this redundancy improves the odds that attribution survives format conversion, GDAL/OGR transformations, and proprietary GIS parsers.

Why Shapefiles Complicate License Tracking

The ESRI shapefile specification, finalized in 1998, only supports geometry (.shp), a record offset index (.shx), attributes (.dbf), and projection (.prj). It contains no native fields for provenance, licensing, or usage restrictions. When publishing under Creative Commons Licensing for GIS Datasets, the license requires you to:

  • Provide a direct link to the CC BY-SA 4.0 license
  • Credit the original creator(s)
  • Note any modifications made to the dataset
  • License derivatives under the same or a compatible license
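
The first three obligations above can be folded into one attribution string that is reused across every metadata layer. The sketch below is a minimal illustration, not a legal template: the Attribution class, its field names, and the output format are assumptions introduced here. The share-alike obligation applies to how derivatives are licensed, so it is not represented in the string.

```python
from dataclasses import dataclass

CC_BY_SA_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


@dataclass(frozen=True)
class Attribution:
    """Hypothetical container for the pieces a CC BY-SA credit line needs."""
    title: str
    creator: str
    source_url: str
    modifications: str = ""  # empty string means the data is unmodified

    def to_text(self) -> str:
        # Credit the creator, link the license, and note changes if any.
        parts = [
            f'"{self.title}" by {self.creator} ({self.source_url})',
            f"licensed under CC BY-SA 4.0 ({CC_BY_SA_URL})",
        ]
        if self.modifications:
            parts.append(f"Modified: {self.modifications}")
        return "; ".join(parts)


credit = Attribution(
    title="Example Roads",                 # placeholder values
    creator="Example City GIS",
    source_url="https://example.org/roads",
    modifications="reprojected to EPSG:4326",
)
print(credit.to_text())
```

Building the string once and passing it to every layer keeps the .dbf column, the sidecar, and the manifest from drifting apart.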

Without automated tracking, downstream workflows routinely strip attribution during format conversion, database ingestion, or web publishing. Embedding licensing metadata directly into the shapefile ecosystem requires a deterministic approach aligned with broader Geospatial Data Licensing & Compliance Fundamentals.

Automated Python Implementation

The script below reads a shapefile, injects CC BY-SA 4.0 tracking fields into the .dbf (respecting the 10-character column name limit and the 254-byte text field limit), generates a minimal ISO 19139-style .shp.xml sidecar, and outputs a machine-readable attribution manifest. The sidecar is a fragment that carries the license constraints rather than a complete, schema-valid metadata record. The script rewrites the shapefile in place, so run it on a copy. It relies on geopandas for vector I/O and lxml for XML generation.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import geopandas as gpd
from lxml import etree

def track_cc_by_sa_attribution(shp_path: str, attribution: str, creator: str, version: str = "1.0.0"):
    """
    Injects CC-BY-SA 4.0 tracking into a shapefile's .dbf, generates .shp.xml metadata,
    and creates a manifest.json for compliance verification.
    """
    base = Path(shp_path).stem
    dir_path = Path(shp_path).parent

    # 1. Load & update .dbf (respecting DBF limits)
    gdf = gpd.read_file(shp_path)
    gdf["lic_type"] = "CC BY-SA 4.0"
    gdf["lic_url"] = "https://creativecommons.org/licenses/by-sa/4.0/"
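    # .dbf text fields hold at most 254 bytes; cut on a UTF-8 character boundary
    gdf["attr_txt"] = attribution.encode("utf-8")[:254].decode("utf-8", errors="ignore")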
    gdf.to_file(shp_path, driver="ESRI Shapefile")

    # 2. Generate ISO 19139 .shp.xml sidecar
    ns = {
        "gmd": "http://www.isotc211.org/2005/gmd",
        "gco": "http://www.isotc211.org/2005/gco"
    }
    # Passing nsmap keeps the gmd:/gco: prefixes instead of auto-generated ns0:/ns1:
    root = etree.Element(f"{{{ns['gmd']}}}MD_Metadata", nsmap=ns)

    # File identifier
    file_id = etree.SubElement(root, f"{{{ns['gmd']}}}fileIdentifier")
    etree.SubElement(file_id, f"{{{ns['gco']}}}CharacterString").text = f"{base}.shp"

    # Metadata language
    lang = etree.SubElement(root, f"{{{ns['gmd']}}}language")
    etree.SubElement(lang, f"{{{ns['gco']}}}CharacterString").text = "eng"

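    # Licensing / Legal Constraints
    # ISO 19139 places resourceConstraints inside identificationInfo/MD_DataIdentification
    ident = etree.SubElement(root, f"{{{ns['gmd']}}}identificationInfo")
    data_id = etree.SubElement(ident, f"{{{ns['gmd']}}}MD_DataIdentification")
    constraints = etree.SubElement(data_id, f"{{{ns['gmd']}}}resourceConstraints")
    legal = etree.SubElement(constraints, f"{{{ns['gmd']}}}MD_LegalConstraints")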

    use_lim = etree.SubElement(legal, f"{{{ns['gmd']}}}useLimitation")
    etree.SubElement(use_lim, f"{{{ns['gco']}}}CharacterString").text = f"Licensed under CC BY-SA 4.0. Attribution: {attribution}"

    other_con = etree.SubElement(legal, f"{{{ns['gmd']}}}otherConstraints")
    etree.SubElement(other_con, f"{{{ns['gco']}}}CharacterString").text = "https://creativecommons.org/licenses/by-sa/4.0/"

    # Write XML
    xml_path = dir_path / f"{base}.shp.xml"
    etree.ElementTree(root).write(str(xml_path), pretty_print=True, xml_declaration=True, encoding="UTF-8")

    # 3. Generate SHA-256 manifest
    manifest_files = {}
    for ext in [".shp", ".shx", ".dbf", ".prj", ".shp.xml"]:
        file_path = dir_path / f"{base}{ext}"
        if file_path.exists():
            sha256 = hashlib.sha256(file_path.read_bytes()).hexdigest()
            manifest_files[file_path.name] = sha256

    manifest_path = dir_path / "manifest.json"
    manifest_data = {
        "version": version,
        "license": "CC BY-SA 4.0",
        "creator": creator,
        "attribution": attribution,
        "generated_utc": datetime.now(timezone.utc).isoformat(),
        "files": manifest_files
    }
    manifest_path.write_text(json.dumps(manifest_data, indent=2))

    print(f"✅ Attribution tracking complete. Files updated: {', '.join(manifest_files.keys())}")
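
The .dbf limits the script works within are byte limits, so multi-byte characters in a creator name can use up the 254-byte budget sooner than a character count suggests. The standalone sketch below shows one way to check values and column names before writing. The helper names dbf_safe_text and oversized_field_names are hypothetical, and UTF-8 encoding of the .dbf is an assumption.

```python
def dbf_safe_text(value: str, max_bytes: int = 254, encoding: str = "utf-8") -> str:
    """Trim a string so its encoded form fits a .dbf text field (assumes UTF-8)."""
    raw = value.encode(encoding)
    if len(raw) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial character left by the byte cut
    return raw[:max_bytes].decode(encoding, errors="ignore")


def oversized_field_names(names, max_len: int = 10):
    """Return the column names that exceed the .dbf field-name limit."""
    return [name for name in names if len(name.encode("utf-8")) > max_len]


print(oversized_field_names(["lic_type", "lic_url", "attribution_text"]))
print(len(dbf_safe_text("é" * 200).encode("utf-8")))
```

Running the name check before writing makes a too-long column name an explicit error instead of a silent rename by the writer.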

Validation & Pipeline Integration

After generation, verify that your metadata survives standard GIS transformations. GDAL/OGR carries .dbf attribute columns through to its output, but it generally does not copy the .shp.xml sidecar, so copy that file alongside the converted data yourself. Use ogrinfo -so -al dataset.shp to confirm that lic_type, lic_url, and attr_txt persist after reprojection or clipping. Any transformation rewrites the files and changes their hashes, so regenerate the manifest afterwards. For automated validation, integrate the manifest.json into your CI/CD pipeline:

  1. Compute SHA-256 hashes of distributed shapefile components
  2. Compare against the manifest’s files dictionary
  3. Fail the build if hashes mismatch or if lic_url is absent from the .dbf schema
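
The hash comparison in the steps above can be sketched with the standard library alone. This is a minimal illustration that assumes the manifest layout produced by the script earlier (a "files" dictionary mapping file names to SHA-256 hex digests, stored next to the data). The function name verify_manifest is hypothetical, and the .dbf schema check from step 3 is left out here.

```python
import hashlib
import json
from pathlib import Path


def verify_manifest(manifest_path: str) -> list[str]:
    """Return a list of problems; an empty list means every hash matched."""
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
    problems = []
    for name, expected in manifest.get("files", {}).items():
        # Files are resolved relative to the manifest's own directory
        target = manifest_file.parent / name
        if not target.exists():
            problems.append(f"missing: {name}")
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

In a CI job, call verify_manifest("manifest.json") and exit with a non-zero status when the returned list is not empty, which fails the build.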

Refer to the GDAL/OGR Shapefile Driver Documentation for driver-specific behavior notes regarding attribute truncation and sidecar file handling. When distributing to non-technical stakeholders, pair the shapefile archive with a plaintext LICENSE.txt containing the full Creative Commons Attribution-ShareAlike 4.0 International legal code (the deed is only a summary of it) together with the attribution statement, to satisfy human-readable compliance checks.

Long-Term Compliance Best Practices

Shapefiles remain ubiquitous but are structurally unsuited for modern metadata requirements. To maintain CC BY-SA compliance across multi-year projects:

  • Version control the manifest: Treat manifest.json as the single source of truth for attribution lineage. Commit it alongside your data releases.
  • Enforce field naming conventions: Use consistent prefixes (lic_, src_, prov_) to prevent accidental overwrites during spatial joins.
  • Migrate to GeoPackage for new projects: GeoPackage's metadata extension (the gpkg_metadata and gpkg_metadata_reference tables) can store ISO 19115-family metadata documents, so license URIs and attribution strings can live inside a single SQLite container without sidecar files.
  • Automate pre-publish checks: Run a lightweight Python validator that scans .dbf schemas for lic_url before pushing to data portals or S3 buckets.
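
The pre-publish check in the last bullet does not need a GIS stack, because field names sit in the fixed-layout .dbf header. The sketch below reads them with the standard library. It assumes a standard dBASE header (32-byte file header, then 32-byte field descriptors ending with a 0x0D byte), and the function names are hypothetical.

```python
import struct
from pathlib import Path


def dbf_field_names(dbf_path: str) -> list[str]:
    """Read column names straight from a .dbf header."""
    data = Path(dbf_path).read_bytes()
    # Bytes 8-9 hold the total header length as a little-endian uint16
    header_len = struct.unpack_from("<H", data, 8)[0]
    names = []
    offset = 32  # field descriptors start right after the 32-byte file header
    while offset + 32 <= min(header_len, len(data)) and data[offset] != 0x0D:
        # The first 11 bytes of a descriptor are the null-padded field name
        raw_name = data[offset:offset + 11].split(b"\x00", 1)[0]
        names.append(raw_name.decode("ascii", errors="replace"))
        offset += 32
    return names


def missing_license_fields(dbf_path: str,
                           required=("lic_type", "lic_url", "attr_txt")) -> list[str]:
    """Return the required license columns that the .dbf does not contain."""
    present = {name.lower() for name in dbf_field_names(dbf_path)}
    return [field for field in required if field not in present]
```

A publish script can refuse to upload when missing_license_fields returns a non-empty list.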

By combining attribute injection, ISO 19139-style sidecars, and hash manifests, you create a resilient attribution chain that survives the fragmented shapefile ecosystem while meeting open-data licensing obligations.