How to track CC BY-SA attribution in shapefiles
To track CC BY-SA attribution in shapefiles, embed license metadata in three layers: short attribute columns in the .dbf, an ISO 19139-style .shp.xml sidecar, and a version-controlled manifest.json that records the attribution string alongside a SHA-256 hash of each component file. Because the legacy ESRI format has no native licensing fields, this redundancy gives the attribution a much better chance of surviving format conversion, GDAL/OGR transformations, and proprietary GIS parsers.
Why Shapefiles Complicate License Tracking
The ESRI shapefile specification, published in 1998, defines only geometry (.shp), a record offset index (.shx), and attributes (.dbf); the widely used projection file (.prj) is an add-on convention outside that core specification. None of these files has native fields for provenance, licensing, or usage restrictions. When publishing under Creative Commons Licensing for GIS Datasets, CC BY-SA 4.0 requires you to:
- Provide a direct link to the CC BY-SA 4.0 license
- Credit the original creator(s)
- Note any modifications made to the dataset
- License derivatives under identical terms
Without automated tracking, downstream workflows routinely strip attribution during format conversion, database ingestion, or web publishing. Embedding licensing metadata directly into the shapefile ecosystem requires a deterministic approach aligned with broader Geospatial Data Licensing & Compliance Fundamentals.
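As a minimal sketch of the obligations listed above, the snippet below assembles one credit line that names the creator, links the license, and notes modifications. The `Attribution` class, its field names, and the example values are hypothetical and chosen only for illustration; the share-alike obligation is carried by the license statement itself rather than by a separate field.

```python
from dataclasses import dataclass

CC_BY_SA_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


@dataclass
class Attribution:
    """Hypothetical container for the pieces of a CC BY-SA credit line."""
    title: str
    creator: str
    source_url: str
    modifications: str = ""  # empty string means the data is unmodified

    def as_text(self) -> str:
        """Render a one-line credit: creator, source, license link, changes."""
        text = (
            f'"{self.title}" by {self.creator} ({self.source_url}), '
            f"licensed under CC BY-SA 4.0 ({CC_BY_SA_URL})."
        )
        if self.modifications:
            text += f" Modified: {self.modifications}."
        return text


# Example values are placeholders, not a real dataset.
credit = Attribution(
    title="Example Parcels",
    creator="Example County GIS",
    source_url="https://example.org/parcels",
    modifications="reprojected to EPSG:4326",
)
print(credit.as_text())
```

The resulting string can be longer than a .dbf text field allows, so the full version belongs in the sidecar and manifest, with a shortened form in the attribute table.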
Automated Python Implementation
The script below reads a shapefile, injects CC BY-SA 4.0 tracking fields into the .dbf (respecting the 10-character column name limit and 254-character string limit), generates a minimal ISO 19139-style .shp.xml sidecar (a skeleton carrying the license constraints, not a schema-complete metadata record), and outputs a machine-readable attribution manifest. It relies on geopandas for vector I/O and lxml for namespace-aware XML generation.
import geopandas as gpd
import hashlib
import json
from pathlib import Path
from lxml import etree
from datetime import datetime, timezone


def track_cc_by_sa_attribution(shp_path: str, attribution: str, creator: str, version: str = "1.0.0"):
    """
    Injects CC BY-SA 4.0 tracking into a shapefile's .dbf, generates .shp.xml metadata,
    and creates a manifest.json for compliance verification.
    """
    base = Path(shp_path).stem
    dir_path = Path(shp_path).parent

    # 1. Load & update .dbf (respecting DBF limits); this overwrites the shapefile in place
    gdf = gpd.read_file(shp_path)
    gdf["lic_type"] = "CC BY-SA 4.0"
    gdf["lic_url"] = "https://creativecommons.org/licenses/by-sa/4.0/"
    gdf["attr_txt"] = attribution[:254]  # .dbf string limit
    gdf.to_file(shp_path, driver="ESRI Shapefile")

    # 2. Generate ISO 19139-style .shp.xml sidecar
    ns = {
        "gmd": "http://www.isotc211.org/2005/gmd",
        "gco": "http://www.isotc211.org/2005/gco"
    }
    # Passing nsmap keeps the conventional gmd:/gco: prefixes in the output
    root = etree.Element(f"{{{ns['gmd']}}}MD_Metadata", nsmap=ns)

    # File identifier
    file_id = etree.SubElement(root, f"{{{ns['gmd']}}}fileIdentifier")
    etree.SubElement(file_id, f"{{{ns['gco']}}}CharacterString").text = f"{base}.shp"

    # Metadata language
    lang = etree.SubElement(root, f"{{{ns['gmd']}}}language")
    etree.SubElement(lang, f"{{{ns['gco']}}}CharacterString").text = "eng"

    # Licensing / legal constraints, nested under identificationInfo where ISO 19139 places them
    ident = etree.SubElement(root, f"{{{ns['gmd']}}}identificationInfo")
    data_ident = etree.SubElement(ident, f"{{{ns['gmd']}}}MD_DataIdentification")
    constraints = etree.SubElement(data_ident, f"{{{ns['gmd']}}}resourceConstraints")
    legal = etree.SubElement(constraints, f"{{{ns['gmd']}}}MD_LegalConstraints")
    use_lim = etree.SubElement(legal, f"{{{ns['gmd']}}}useLimitation")
    etree.SubElement(use_lim, f"{{{ns['gco']}}}CharacterString").text = f"Licensed under CC BY-SA 4.0. Attribution: {attribution}"
    other_con = etree.SubElement(legal, f"{{{ns['gmd']}}}otherConstraints")
    etree.SubElement(other_con, f"{{{ns['gco']}}}CharacterString").text = "https://creativecommons.org/licenses/by-sa/4.0/"

    # Write XML
    xml_path = dir_path / f"{base}.shp.xml"
    etree.ElementTree(root).write(str(xml_path), pretty_print=True, xml_declaration=True, encoding="UTF-8")

    # 3. Generate SHA-256 manifest (hashes are taken after the .dbf and sidecar are written)
    manifest_files = {}
    for ext in [".shp", ".shx", ".dbf", ".prj", ".shp.xml"]:
        file_path = dir_path / f"{base}{ext}"
        if file_path.exists():
            sha256 = hashlib.sha256(file_path.read_bytes()).hexdigest()
            manifest_files[file_path.name] = sha256

    manifest_path = dir_path / "manifest.json"
    manifest_data = {
        "version": version,
        "license": "CC BY-SA 4.0",
        "creator": creator,
        "attribution": attribution,
        "generated_utc": datetime.now(timezone.utc).isoformat(),
        "files": manifest_files
    }
    manifest_path.write_text(json.dumps(manifest_data, indent=2), encoding="utf-8")
    print(f"✅ Attribution tracking complete. Files updated: {', '.join(manifest_files.keys())}")
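One caveat on the limits mentioned earlier: the script slices the attribution string by character, while the .dbf limits are counted in bytes, so the two only coincide for single-byte text. The helpers below are a hedged sketch of a byte-aware alternative. The function names are hypothetical, and the code assumes the attribute table is written as UTF-8, which may not match your encoding settings.

```python
def truncate_utf8(text: str, max_bytes: int = 254) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes.

    Assumes the .dbf is written as UTF-8. A multi-byte character that
    would be cut in half is dropped rather than left as a broken byte.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards a trailing partial multi-byte sequence
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def overlong_field_names(names, max_len: int = 10):
    """Return the column names that exceed the .dbf name limit."""
    return [name for name in names if len(name.encode("utf-8")) > max_len]


# The three tracking columns used in this article all fit the limit.
print(overlong_field_names(["lic_type", "lic_url", "attr_txt"]))
```

Checking names before writing avoids relying on the driver's own shortening of long column names, which can produce names your validators do not expect.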
Validation & Pipeline Integration
After generation, verify that your metadata survives standard GIS transformations. GDAL/OGR preserves .dbf attribute columns but may ignore .shp.xml unless explicitly passed to downstream tools. Use ogrinfo -al dataset.shp to confirm lic_type, lic_url, and attr_txt persist after reprojection or clipping. For automated validation, integrate the manifest.json into your CI/CD pipeline:
- Compute SHA-256 hashes of distributed shapefile components
- Compare against the manifest's files dictionary
- Fail the build if hashes mismatch or if lic_url is absent from the .dbf schema
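The hash comparison in these steps can be sketched with the standard library alone. This assumes the manifest layout produced by the script earlier in this article (a top-level "files" object mapping file names to SHA-256 hex digests, stored next to the shapefile components); the function name `verify_manifest` is hypothetical. The .dbf schema check is a separate concern and is not covered here.

```python
import hashlib
import json
from pathlib import Path


def verify_manifest(manifest_path):
    """Return a list of problems; an empty list means every file matches."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    for name, expected in manifest.get("files", {}).items():
        # Component files are assumed to sit in the manifest's directory
        file_path = manifest_path.parent / name
        if not file_path.exists():
            problems.append(f"missing: {name}")
            continue
        actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

A CI step could call `verify_manifest("release/manifest.json")` (the path is a placeholder) and exit with a non-zero status whenever the returned list is not empty.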
Refer to the GDAL/OGR Shapefile Driver Documentation for driver-specific behavior notes regarding attribute truncation and sidecar file handling. When distributing to non-technical stakeholders, pair the shapefile archive with a plaintext LICENSE.txt containing the attribution statement and the Creative Commons Attribution-ShareAlike 4.0 International license text (or a link to it), so the terms stay readable without GIS software.
Long-Term Compliance Best Practices
Shapefiles remain ubiquitous but are structurally unsuited for modern metadata requirements. To maintain CC BY-SA compliance across multi-year projects:
- Version control the manifest: Treat manifest.json as the single source of truth for attribution lineage. Commit it alongside your data releases.
- Enforce field naming conventions: Use consistent prefixes (lic_, src_, prov_) to prevent accidental overwrites during spatial joins.
- Migrate to GeoPackage for new projects: Metadata conforming to the ISO 19115-1 Geographic Information Metadata standard can be stored through GeoPackage's metadata extension (the gpkg_metadata and gpkg_metadata_reference tables), allowing license URIs and attribution strings to live inside a single SQLite container without sidecar files.
- Automate pre-publish checks: Run a lightweight Python validator that scans .dbf schemas for lic_url before pushing to data portals or S3 buckets.
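The pre-publish check in the last item can be sketched without any GIS dependency by reading the .dbf header directly. This is an illustrative sketch assuming the standard dBASE layout (a 32-byte file header followed by 32-byte field descriptors ending at a 0x0D terminator); the function names are hypothetical, and the required columns are the three used in this article.

```python
import struct
from pathlib import Path

REQUIRED_FIELDS = ("lic_type", "lic_url", "attr_txt")


def dbf_field_names(dbf_path):
    """Read column names from a .dbf header without loading any records."""
    data = Path(dbf_path).read_bytes()
    # Bytes 8-9 of the file header hold the total header length (little-endian)
    header_len = struct.unpack_from("<H", data, 8)[0]
    limit = min(header_len, len(data))
    names = []
    offset = 32  # field descriptors start right after the 32-byte file header
    while offset + 32 <= limit and data[offset] != 0x0D:
        raw_name = data[offset:offset + 11]  # null-padded field name
        names.append(raw_name.split(b"\x00", 1)[0].decode("ascii", errors="replace"))
        offset += 32
    return names


def missing_license_fields(dbf_path, required=REQUIRED_FIELDS):
    """Return the required tracking columns that the .dbf schema lacks."""
    present = {name.lower() for name in dbf_field_names(dbf_path)}
    return [field for field in required if field not in present]
```

Comparing names case-insensitively is a deliberate choice here, since some tools write .dbf column names in upper case.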
By combining attribute injection, ISO 19139-style sidecars, and hash-based manifests, you create a resilient attribution chain that survives the fragmented shapefile ecosystem while meeting open-data licensing obligations.