How do I prevent deleting ISO 19115 metadata that downstream derivatives still reference?

Before executing any delete action, query the manifest for artifacts whose parentIdentifier or source_id field references the target record. If dependents exist, block the deletion and escalate the parent's retention tier until all dependencies are resolved or updated to point at a successor record.

What happens when a GDPR erasure request conflicts with a Tier 1 immutable record?

Add a gdpr_erasure_pending flag to the manifest record. Configure the retention engine to treat this flag as an override of the immutable: true rule and route the artifact to a legal review queue rather than direct deletion. Legal counsel must confirm the erasure before the engine removes it.
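
The routing described above could look roughly like the sketch below. It is illustrative only: the `resolve_action` helper and the `"legal_review"` action name are assumptions introduced here, and the engine shown later in this guide does not implement them.

```python
from typing import Any


def resolve_action(record: dict[str, Any], rule: dict[str, Any], proposed: str) -> str:
    """Pick the final action for one manifest record.

    `record` is a manifest item, `rule` is the tier's policy block, and
    `proposed` is the action the age check would otherwise apply.
    """
    if record.get("gdpr_erasure_pending", False):
        # The flag takes precedence over immutable: true, but it never
        # deletes directly; counsel confirms the erasure first.
        return "legal_review"  # hypothetical action name
    if rule.get("immutable", False) and proposed == "delete":
        # Immutable records are never deleted without the erasure flag.
        return "retain"
    return proposed
```

Keeping the override in one small function makes it easy to unit-test the precedence order separately from any storage calls.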

Should created_at or S3 LastModified be used to evaluate artifact age?

Always use created_at stored in the artifact's custom metadata. S3 assigns a fresh LastModified whenever an object is rewritten, for example when it is copied or re-uploaded after compression. That makes old artifacts look newer than they are and can prevent legitimate Tier 3 deletions from ever triggering.

Metadata Artifact Retention Strategies for Geospatial Pipelines

Geospatial pipelines generate substantial metadata overhead: ISO 19115/19139 XML records, FGDC CSDGM files, STAC JSON catalogs, and custom YAML manifests. Without disciplined retention controls these artifacts accumulate, inflate storage costs, obscure audit trails, and introduce compliance risks when stale records contradict live datasets. Aligning retention policy with CI/CD Validation & Policy Enforcement for Spatial Data transforms housekeeping from a manual burden into an enforceable pipeline gate—one that runs on a schedule, posts audit logs, and blocks merges when incoming metadata violates lifecycle rules.

Prerequisites

Python 3.10+ with pyyaml>=6.0, boto3>=1.34, lxml>=5.2, pydantic>=2.6, jsonschema (imported by the manifest validation command), and hashlib (stdlib).
Familiarity with ISO 19115-1:2014 mandatory elements or your agency’s FGDC/DCAT-AP profile. Review ISO 19115 metadata template generation before applying retention rules to ensure incoming records are structurally sound.
Object storage backend (AWS S3, GCP Cloud Storage, or Azure Blob) with lifecycle management APIs enabled, or a version-controlled artifact registry.
CI/CD runner (GitHub Actions, GitLab CI, or Jenkins) with read/write access to metadata directories and permission to post status checks to pull requests.
Environment variables: METADATA_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (or equivalent cloud credentials).
A baseline Spatial Data Schema Linting in CI step already running in the pipeline. Retention rules should never archive malformed records.

Concept & Spec Reference

Metadata retention in geospatial pipelines is not governed by a single standard. It sits at the intersection of several regulatory and technical layers:

Layer	Governing framework	Retention implication
Spatial data quality & lineage	ISO 19115-1:2014 §6.5 (lineage), §6.4 (quality)	Lineage records must survive dataset versions; deletion requires explicit deprecation
Federal data management	FGDC CSDGM §2.2 (data quality), §7 (metadata reference)	Agencies may not purge metadata if the underlying dataset is still cited in federal catalogs
Spatial data catalog	OGC API Records, STAC spec §8.1	`links` arrays referencing archived artifacts must be updated or tombstoned
NSDI alignment	OMB Circular A-16, Executive Order 12906	Authoritative spatial data published through NSDI nodes carries indefinite retention obligations
GDPR / location data	GDPR Art. 17 (right to erasure)	Metadata about datasets containing personal location data inherits the dataset’s erasure schedule

Tiered classification maps these obligations to an operational model:

Tier 1 — Permanent/Archival: Officially published spatial datasets, regulatory submissions, baseline reference catalogs. Retain indefinitely with immutable versioning and cryptographic checksums. All lineage, quality, and contact elements from ISO 19115 §6 are required.
Tier 2 — Active/Working: Draft metadata, pull-request staging files, experimental schema extensions. Retain 90–180 days or until merged to a production branch.
Tier 3 — Ephemeral/Transient: CI-generated validation reports, temporary transformation outputs, failed build artifacts. Retain 7–30 days before automated deletion, routed first through a quarantine prefix.

Tag every artifact at ingestion or commit time with retention_tier, created_at (ISO 8601 UTC), spatial_reference_id, and optionally legal_hold: true. A lightweight JSON manifest that tracks artifact state across distributed storage prefixes makes downstream automation deterministic without per-artifact database queries.

Implementation Walkthrough

Step 1 — Define retention policy as code

Encode rules in a YAML configuration file committed alongside pipeline scripts. The dry_run flag enables safe testing before any destructive operation runs in production.

# metadata_retention_policy.yaml
version: "1.0"
defaults:
  checksum_algorithm: sha256
  dry_run: true
  log_level: INFO
  quarantine_days: 7

tiers:
  tier_1_permanent:
    max_age_days: null
    action: archive
    immutable: true
    requires_checksum: true
    legal_hold_override: true
  tier_2_active:
    max_age_days: 180
    action: compress
    immutable: false
    requires_checksum: true
    legal_hold_override: false
  tier_3_ephemeral:
    max_age_days: 30
    action: delete
    immutable: false
    requires_checksum: false
    legal_hold_override: false

quarantine_days introduces a safety buffer for Tier 3 deletions, giving operators a recovery window if an artifact is accidentally down-classified.

Step 2 — Build a deterministic retention engine

The module below reads the policy, evaluates artifact age, and executes tier-specific actions. It also provides a compute_checksum helper for callers that need to verify integrity. It defaults to dry-run, uses UTC timestamps throughout, routes deletions through a quarantine prefix, and logs every decision for the audit ledger.

import hashlib
import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import boto3
import yaml
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger(__name__)


class MetadataRetentionEngine:
    def __init__(self, policy_path: str, bucket_name: str, region: str = "us-east-1") -> None:
        with open(policy_path) as f:
            self.policy: dict[str, Any] = yaml.safe_load(f)
        self.bucket = bucket_name
        self.s3 = boto3.client("s3", region_name=region)
        self.dry_run: bool = self.policy["defaults"]["dry_run"]
        self.quarantine_prefix = "quarantine/"

    def compute_checksum(self, file_path: Path) -> str:
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()

    def evaluate_artifact(self, key: str, meta: dict[str, Any]) -> str:
        """Return the action string for this artifact given current policy."""
        tier = meta.get("retention_tier", "tier_3_ephemeral")
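        # Python 3.10's fromisoformat() rejects a trailing "Z", so map it to +00:00.
        created = datetime.fromisoformat(meta["created_at"].replace("Z", "+00:00"))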
        if created.tzinfo is None:
            created = created.replace(tzinfo=timezone.utc)
        age_days = (datetime.now(timezone.utc) - created).days
        rule: dict[str, Any] = self.policy["tiers"].get(
            tier, self.policy["tiers"]["tier_3_ephemeral"]
        )

        if rule["max_age_days"] is None or age_days <= rule["max_age_days"]:
            return "retain"
        if meta.get("legal_hold", False) and rule["legal_hold_override"]:
            logger.info("Legal hold active. Retaining: %s", key)
            return "retain"
        return rule["action"]

    def execute_action(self, key: str, action: str) -> None:
        if self.dry_run:
            logger.info("[DRY RUN] Would %s artifact: %s", action, key)
            return

        try:
            if action == "delete":
                quarantine_key = f"{self.quarantine_prefix}{key}"
                self.s3.copy_object(
                    CopySource={"Bucket": self.bucket, "Key": key},
                    Bucket=self.bucket,
                    Key=quarantine_key,
                )
                self.s3.delete_object(Bucket=self.bucket, Key=key)
                logger.info("Quarantined (pending deletion): %s", key)
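            elif action == "archive":
                archive_key = f"archive/{key}"
                # MetadataDirective="REPLACE" discards the source object's custom
                # metadata, so carry it over (including created_at) explicitly.
                source_meta = self.s3.head_object(Bucket=self.bucket, Key=key).get("Metadata", {})
                self.s3.copy_object(
                    CopySource={"Bucket": self.bucket, "Key": key},
                    Bucket=self.bucket,
                    Key=archive_key,
                    MetadataDirective="REPLACE",
                    Metadata={
                        **source_meta,
                        "archived_at": datetime.now(timezone.utc).isoformat(),
                        "status": "immutable",
                    },
                )
                self.s3.delete_object(Bucket=self.bucket, Key=key)
                logger.info("Archived to immutable storage: %s", key)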
            elif action == "compress":
                # Flag for compression; actual gzip/zstd step handled separately
                logger.info("Flagged for compression: %s", key)
        except ClientError as exc:
            logger.error("Failed to %s %s: %s", action, key, exc)

    def run(self, manifest_path: Path) -> None:
        with open(manifest_path) as f:
            catalog: dict[str, Any] = json.load(f)

        audit: list[dict[str, Any]] = []
        for item in catalog.get("items", []):
            key: str = item["id"]
            action = self.evaluate_artifact(key, item)
            self.execute_action(key, action)
            audit.append(
                {
                    "key": key,
                    "tier": item.get("retention_tier"),
                    "action": action,
                    "evaluated_at": datetime.now(timezone.utc).isoformat(),
                }
            )

        audit_path = manifest_path.parent / "retention_audit.jsonl"
        with open(audit_path, "a") as f:
            for entry in audit:
                f.write(json.dumps(entry) + "\n")
        logger.info("Audit log appended: %d entries → %s", len(audit), audit_path)

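For a fixed catalog state and evaluation time, the engine's decisions are deterministic: re-running evaluate_artifact over the same manifest yields the same actions. The side effects are not fully idempotent as written, because every run appends to the audit log and a repeated live run retries storage operations for keys still listed in the manifest. Note also that compute_checksum is only a helper and nothing in run calls it. Confirming that an artifact was not silently modified between classification and the archive write requires an explicit comparison against a checksum recorded in the manifest.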
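
One way to make use of checksums before an archive write is sketched below. It assumes each manifest item records the digest computed at classification time in a field such as `sha256`. That field and the `checksum_matches` helper are hypothetical and not part of the engine above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: Path, expected_sha256: str) -> bool:
    """Compare a local copy against the digest recorded in the manifest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A caller would download or stage the artifact locally, call `checksum_matches`, and skip the archive action (logging the mismatch) when it returns False.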

Step 3 — Schedule retention as a CI/CD workflow

The following GitHub Actions workflow runs every Monday at 02:00 UTC and can also be triggered manually. Whether a run performs live actions is controlled by the dry_run flag in the policy file, so leave it set to true until the audit output has been reviewed.

# .github/workflows/metadata-retention.yml
name: Metadata Retention & Cleanup
on:
  schedule:
    - cron: "0 2 * * 1"
  workflow_dispatch:

jobs:
  enforce-retention:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install pyyaml boto3

      - name: Run retention engine
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          BUCKET_NAME: ${{ vars.METADATA_BUCKET }}
        run: |
          python retention_engine.py \
            --policy metadata_retention_policy.yaml \
            --catalog catalog.json \
            --bucket "$BUCKET_NAME"

      - name: Upload audit log
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: retention-audit-${{ github.run_id }}
          path: retention_audit.jsonl
          retention-days: 90

Integrating automated broken link and reference detection in the same workflow prevents purged artifacts from leaving dangling pointers in downstream catalogs, spatial indexes, or web mapping service configurations.

Validation & CI Integration

Before enabling live deletions, verify the policy engine’s output against a controlled test catalog:

# 1. Run schema validation on the policy file itself
python -c "
import yaml, pathlib
policy = yaml.safe_load(pathlib.Path('metadata_retention_policy.yaml').read_text())
assert 'tiers' in policy and 'defaults' in policy, 'Missing top-level keys'
for name, rule in policy['tiers'].items():
    assert 'action' in rule, f'Missing action for tier {name}'
print('Policy schema: OK')
"

# 2. Validate the artifact manifest against a minimal JSON Schema
python -c "
import json, jsonschema
schema = {
    'type': 'object',
    'required': ['items'],
    'properties': {
        'items': {
            'type': 'array',
            'items': {
                'type': 'object',
                'required': ['id', 'retention_tier', 'created_at'],
            },
        }
    },
}
with open('catalog.json') as f:
    catalog = json.load(f)
jsonschema.validate(catalog, schema)
print('Catalog schema: OK')
"

# 3. Dry-run the engine against the catalog and inspect the audit log
python retention_engine.py \
    --policy metadata_retention_policy.yaml \
    --catalog catalog.json \
    --bucket "$METADATA_BUCKET"

# 4. Inspect structured audit output
python -c "
import json
with open('retention_audit.jsonl') as f:
    for line in f:
        entry = json.loads(line)
        print(entry['action'].ljust(12), entry['key'])
"

Add the dry-run step as a required status check on pull requests that modify catalog.json or metadata_retention_policy.yaml. Pair it with policy enforcement gates for data PRs to block merges if the simulated run would delete Tier 1 records that still have active lineage references.
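
A status check along those lines could scan the dry-run audit log and fail when a Tier 1 record is slated for deletion. The sketch below is deliberately stricter than the rule described above, since it flags every Tier 1 delete without consulting lineage references. The `tier1_violations` and `gate_exit_code` names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def tier1_violations(audit_lines: list[str]) -> list[dict[str, Any]]:
    """Return audit entries in which a Tier 1 record would be deleted."""
    entries = [json.loads(line) for line in audit_lines if line.strip()]
    return [
        e
        for e in entries
        if e.get("tier") == "tier_1_permanent" and e.get("action") == "delete"
    ]


def gate_exit_code(audit_path: Path) -> int:
    """Return 1 if the audit log contains violations, otherwise 0."""
    lines = audit_path.read_text().splitlines()
    violations = tier1_violations(lines)
    for entry in violations:
        print(f"Blocked: Tier 1 record would be deleted: {entry['key']}")
    return 1 if violations else 0
```

A CI step would call `gate_exit_code(Path("retention_audit.jsonl"))` and pass the result to `sys.exit`. Because the engine appends to the audit file, start the check from a fresh log so that entries from earlier runs are not re-evaluated.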

Derivative & Lineage Management

Transforming a dataset—reprojecting to a new CRS, clipping to a study area, joining attribute tables, or rasterizing vector boundaries—produces derivative metadata that inherits the source record’s retention obligations with modifications:

Coordinate Reference System changes. When a dataset is reprojected (e.g., EPSG:4326 → EPSG:3857), ISO 19115 requires a new referenceSystemInfo block and updated transformationDimensionResolution if present. The source CRS metadata record remains Tier 1 for as long as any derivative references it. Deleting the source CRS record before all downstream derivatives are updated orphans the transformation chain.

Clip and subset operations. Spatial subsetting must propagate the source lineage LI_Source element to the derived record. If the FGDC-to-ISO 19115 conversion pipeline generated the source record, verify that the subset inherits process_step metadata referencing the clipping operation.

Rasterization. Converting vector geometry to raster representation introduces resolution and resampling metadata under ISO 19115 §6.5.4 (DQ_GriddedDataPositionalAccuracy). Derived raster records should carry a parentIdentifier pointing to the vector source so the retention engine can evaluate the dependency graph before purging the source.

Practical rule: before downgrading or deleting a Tier 1 record, query the manifest for any parentIdentifier or source_id references to that record. If dependents exist, either update their lineage to a successor record or escalate the retention tier of the parent until the dependency is resolved.
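
The manifest query in that rule can be sketched as a simple scan. The example assumes derivative items carry a `parentIdentifier` or `source_id` field holding the `id` of the record they derive from. The helper names are hypothetical.

```python
from typing import Any


def find_dependents(items: list[dict[str, Any]], target_id: str) -> list[str]:
    """Return the ids of manifest items that reference `target_id`."""
    return [
        item["id"]
        for item in items
        if target_id in (item.get("parentIdentifier"), item.get("source_id"))
    ]


def safe_to_delete(items: list[dict[str, Any]], target_id: str) -> bool:
    """A record is safe to delete only when nothing references it."""
    return not find_dependents(items, target_id)
```

This only finds direct references. A chain of derivatives (raster derived from a clip derived from the source) would need the scan repeated for each dependent, or a proper graph traversal.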

Pitfalls & Resolution Table

Pitfall	Root Cause	Resolution Strategy
Archiving malformed ISO 19115 XML	Retention runs before schema linting; structural errors propagate into immutable storage	Add a `lxml`-based XSD validation step ahead of the `archive` action; quarantine rather than archive invalid records
Orphaned `LI_Lineage` references after deletion	Tier 3 purge removes a record that Tier 1 derivatives still cite in `source` elements	Run a reference graph traversal before any `delete` action; block deletion if the artifact is cited as a lineage source
UTC/local timezone mismatch in `created_at`	Metadata generated by tools that stamp local time without a `+HH:MM` suffix	Normalise all `created_at` fields to UTC at ingestion using `datetime.fromisoformat(...).astimezone(timezone.utc)`
Duplicate `archive/` copies on re-run	`copy_object` succeeds even when the destination key already exists	Check for the archive key’s existence with `head_object` before copying; skip and log if already archived
Legal hold field silently ignored	`legal_hold` missing from manifest; engine defaults to `False`	Enforce `legal_hold` as a required field in the manifest JSON Schema; fail validation if absent
Quarantine prefix not excluded from lifecycle scans	Quarantine prefix is re-evaluated on the next run, triggering double-deletion errors	Filter out the `quarantine/` and `archive/` prefixes at the start of the evaluation loop
Tier 2 compression widens the retention window unintentionally	Compressed files reset `LastModified` in S3, tricking age calculations	Preserve original `created_at` in custom S3 object metadata; always evaluate age from `created_at`, never from `LastModified`
GDPR erasure request conflicts with Tier 1 immutability	Dataset contains personal location data classified as permanent	Add a `gdpr_erasure_pending` flag that overrides `immutable: true`; route to a legal review queue rather than direct deletion
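
Two of the resolutions in the table, excluding managed prefixes and normalising timestamps, are small enough to sketch directly. The function names are hypothetical, and the `assumed_offset_hours` parameter is an assumption: it stands for whatever UTC offset the producing tool is known to use when it writes timestamps without a suffix.

```python
from datetime import datetime, timedelta, timezone

# Prefixes the engine itself writes to; never re-evaluate these.
EXCLUDED_PREFIXES = ("quarantine/", "archive/")


def is_scannable(key: str) -> bool:
    """Return False for keys under the quarantine/ or archive/ prefixes."""
    return not key.startswith(EXCLUDED_PREFIXES)


def normalise_created_at(raw: str, assumed_offset_hours: float = 0.0) -> str:
    """Convert an ISO 8601 timestamp to UTC, returning an ISO 8601 string.

    Naive values get the assumed offset attached before conversion, which
    avoids silently relying on the CI runner's local timezone.
    """
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone(timedelta(hours=assumed_offset_hours)))
    return parsed.astimezone(timezone.utc).isoformat()
```

Calling `astimezone` on a naive datetime makes Python assume the machine's local zone, so attaching an explicit offset first keeps the result independent of where the pipeline runs.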

Related

CI/CD Validation & Policy Enforcement for Spatial Data: parent section covering the full automation framework
Spatial Data Schema Linting in CI: prerequisite gate that validates metadata structure before retention rules apply
Automated Broken Link and Reference Detection: companion workflow to purge dangling catalog pointers after retention runs
Policy Enforcement Gates for Data PRs: PR-level checks that can block merges when retention policy changes are detected
ISO 19115 Metadata Template Generation: generates the structured records that the retention engine classifies and archives
FGDC-to-ISO 19115 Conversion Pipelines: lineage context for federal datasets whose retention obligations span both standards
