Generating ISO 19115 Metadata from GeoTIFF Headers: A Production-Ready Python Workflow

Generating ISO 19115 metadata from GeoTIFF headers requires parsing the embedded georeferencing tags, mapping deterministic spatial attributes to ISO 19115 elements, and serializing the result into validated XML. The most reliable production workflow uses rasterio to extract the projection, bounding box, and resolution, then maps them to ISO 19115 elements via a Python XML builder. The XML produced here uses the gmd and gco namespaces, which belong to the ISO 19139 encoding of ISO 19115. Because GeoTIFF headers lack administrative, licensing, and provenance fields, automation pipelines must supplement extracted values with agency defaults and validate the output against the ISO 19139 XML schema before publication.

Header-to-Schema Mapping Logic

GeoTIFF files store geospatial context in standard TIFF tags (ModelPixelScaleTag, ModelTiepointTag, GeoKeyDirectoryTag) and in GDAL-specific metadata (the IMAGE_STRUCTURE domain, TIFFTAG_* items, and the AREA_OR_POINT item). The ISO 19115 Metadata Template Generation framework expects a strict hierarchy: MD_Metadata > identificationInfo > extent > geographicElement > EX_GeographicBoundingBox, alongside referenceSystemInfo, dataQualityInfo, and distributionInfo. A direct 1:1 mapping is impossible because raster headers do not encode contact information, licensing terms, or lineage documentation.

Automation bridges this gap by extracting deterministic spatial attributes, injecting organizational defaults, and generating a compliant XML skeleton. The mapping strategy prioritizes fields guaranteed to exist in valid GeoTIFFs:

GeoTIFF Attribute | ISO 19115 Element | Notes
src.bounds | EX_GeographicBoundingBox | Must be transformed to EPSG:4326 for strict compliance
src.crs.to_epsg() | MD_ReferenceSystem / RS_Identifier | Fall back to the WKT string if no EPSG code is defined
src.res (pixel size) | abstract or dataQualityInfo | Used to document spatial resolution
src.count + src.driver | MD_SpatialRepresentationTypeCode | Maps to the grid representation for raster data
src.tags()['AREA_OR_POINT'] | cellGeometry (MD_CellGeometryCode) | Indicates whether a pixel value represents an area or a point sample
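
As a rough illustration of the attribute-to-element mapping above, the logic can be written as a pure function over values that have already been read from the header. The function name `map_header_to_iso` and the dictionary keys are hypothetical, chosen only for this sketch; they are not part of rasterio or of any ISO tooling.

```python
from typing import Optional, Tuple


def map_header_to_iso(epsg: Optional[int], wkt: str,
                      res: Tuple[float, float],
                      area_or_point: str = "Area") -> dict:
    """Map extracted GeoTIFF header values to ISO 19115 element values.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    # Prefer the EPSG code; fall back to the WKT string when it is undefined.
    if epsg:
        code, code_space = str(epsg), "EPSG"
    else:
        code, code_space = wkt, "OGC"

    # AREA_OR_POINT says whether a pixel value covers an area or is a point sample.
    cell_geometry = "point" if area_or_point.strip().lower() == "point" else "area"

    return {
        "reference_system_code": code,
        "code_space": code_space,
        "cell_geometry": cell_geometry,
        # Pixel size in CRS units (x, y), reported as absolute values.
        "resolution": (abs(res[0]), abs(res[1])),
    }
```

Keeping the mapping free of file I/O makes it easy to unit-test separately from the raster reader.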

For enterprise deployments, pairing this extraction logic with Automated Metadata Generation & Schema Mapping pipelines ensures consistent vocabulary control, mandatory field population, and repeatable output across heterogeneous raster datasets.

Complete Python Implementation

The following script extracts header values, constructs ISO 19115-compliant XML, and writes it to disk. It uses lxml for namespace-aware serialization and includes explicit error handling for malformed or header-stripped files.

import os
from datetime import datetime

import rasterio
from rasterio.errors import RasterioError
from rasterio.warp import transform_bounds
from lxml import etree
from lxml.builder import ElementMaker

# ISO 19139 namespaces (the XML encoding of ISO 19115:2003)
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

NSMAP = {"gmd": GMD, "gco": GCO, "xsi": XSI}

# Commonly used location of the ISO code lists; adjust if your profile uses another
CODELISTS = "http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml"

# Element factories for cleaner XML construction
G = ElementMaker(namespace=GMD, nsmap=NSMAP)
C = ElementMaker(namespace=GCO, nsmap=NSMAP)

def code_element(tag: str, value: str):
    """Build a code list element with both attributes the ISO 19139 schema requires."""
    return G(tag, value, codeList=f"{CODELISTS}#{tag}", codeListValue=value)

def generate_iso19115_from_geotiff(tiff_path: str, output_xml: str, org_name: str = "Default Agency") -> None:
    if not os.path.exists(tiff_path):
        raise FileNotFoundError(f"GeoTIFF not found: {tiff_path}")

    try:
        with rasterio.open(tiff_path) as src:
            crs = src.crs
            if crs is None:
                # Header-stripped file: there is nothing to map to referenceSystemInfo
                raise ValueError(f"GeoTIFF has no CRS defined: {tiff_path}")
            epsg = crs.to_epsg()  # None when no EPSG code matches
            # EX_GeographicBoundingBox expects longitude/latitude, so reproject the bounds
            west, south, east, north = transform_bounds(crs, "EPSG:4326", *src.bounds)
            res = src.res
            band_count = src.count
            area_or_point = src.tags().get("AREA_OR_POINT", "Area")
    except RasterioError as e:
        raise RuntimeError(f"Failed to parse GeoTIFF headers: {e}") from e

    file_name = os.path.basename(tiff_path)
    today = datetime.now().strftime("%Y-%m-%d")

    # Build the XML tree; child order follows the sequence defined in the ISO 19139 schema
    metadata = G.MD_Metadata(
        G.fileIdentifier(C.CharacterString(file_name)),
        G.language(C.CharacterString("eng")),
        G.contact(G.CI_ResponsibleParty(
            G.organisationName(C.CharacterString(org_name)),
            G.role(code_element("CI_RoleCode", "pointOfContact"))
        )),
        G.dateStamp(C.Date(today)),
        G.referenceSystemInfo(G.MD_ReferenceSystem(
            G.referenceSystemIdentifier(G.RS_Identifier(
                G.code(C.CharacterString(str(epsg) if epsg else crs.to_wkt())),
                G.codeSpace(C.CharacterString("EPSG" if epsg else "OGC"))
            ))
        )),
        G.identificationInfo(G.MD_DataIdentification(
            G.citation(G.CI_Citation(
                G.title(C.CharacterString(os.path.splitext(file_name)[0])),
                G.date(G.CI_Date(
                    G.date(C.Date(today)),
                    G.dateType(code_element("CI_DateTypeCode", "publication"))
                ))
            )),
            G.abstract(C.CharacterString(
                f"Raster dataset extracted from {file_name}. "
                f"Resolution: {res[0]} x {res[1]} units. Bands: {band_count}. "
                f"Area/Point representation: {area_or_point}."
            )),
            # The schema requires a resource language inside MD_DataIdentification
            G.language(C.CharacterString("eng")),
            G.extent(G.EX_Extent(
                G.geographicElement(G.EX_GeographicBoundingBox(
                    G.westBoundLongitude(C.Decimal(str(west))),
                    G.eastBoundLongitude(C.Decimal(str(east))),
                    G.southBoundLatitude(C.Decimal(str(south))),
                    G.northBoundLatitude(C.Decimal(str(north)))
                ))
            ))
        )),
        G.distributionInfo(G.MD_Distribution(
            G.distributionFormat(G.MD_Format(
                G.name(C.CharacterString("GeoTIFF")),
                G.version(C.CharacterString("1.0"))
            ))
        ))
    )

    # Serialize with proper indentation and XML declaration
    tree = etree.ElementTree(metadata)
    etree.indent(tree, space="  ")
    tree.write(output_xml, xml_declaration=True, encoding="UTF-8", pretty_print=True)
    print(f"ISO 19115 metadata written to: {output_xml}")

# Example usage
if __name__ == "__main__":
    generate_iso19115_from_geotiff("sample.tif", "sample_metadata.xml", "USGS Earth Explorer")

Handling Missing Administrative & Provenance Data

GeoTIFF headers are optimized for spatial referencing, not catalog compliance. They omit fields that ISO 19115 catalogs typically require, such as contact, resourceConstraints (including useLimitation), and lineage. Production systems must inject these programmatically using agency-controlled templates or external configuration files (YAML/JSON). Common supplementation patterns include:

  1. Static defaults: Hardcode organization name, contact email, and default license (e.g., CC-BY 4.0) in the pipeline configuration.
  2. Sidecar files: Parse adjacent .json or .xml files containing human-curated metadata and merge them into the generated tree before serialization.
  3. Database lookup: Query a spatial catalog or asset management system using the filename or checksum to retrieve authoritative lineage and processing history.
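
The three patterns above can be combined in layers. The following is a minimal sketch using only the standard library; the names `STATIC_DEFAULTS`, `load_sidecar`, and `resolve_admin_metadata` are hypothetical, and the precedence order (defaults, then sidecar, then lookup) is an assumption rather than something ISO 19115 prescribes.

```python
import json
from pathlib import Path
from typing import Callable, Optional

# Hypothetical agency defaults; replace with values from your own configuration.
STATIC_DEFAULTS = {
    "organisation": "Default Agency",
    "license": "CC-BY 4.0",
}


def load_sidecar(tiff_path: str) -> dict:
    """Return metadata from an adjacent <name>.json file, or {} if none exists."""
    sidecar = Path(tiff_path).with_suffix(".json")
    if not sidecar.is_file():
        return {}
    with sidecar.open(encoding="utf-8") as fh:
        return json.load(fh)


def resolve_admin_metadata(tiff_path: str,
                           lookup: Optional[Callable[[str], dict]] = None) -> dict:
    """Merge the three sources; later sources override earlier ones."""
    merged = dict(STATIC_DEFAULTS)                   # 1. static defaults
    merged.update(load_sidecar(tiff_path))           # 2. human-curated sidecar
    if lookup is not None:
        merged.update(lookup(Path(tiff_path).name))  # 3. catalog or database lookup
    return merged
```

The `lookup` callable stands in for whatever catalog or asset-management query your environment provides; passing it in as a parameter keeps the merge logic testable without a live database.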

When a required value is genuinely unavailable, leave the element empty and set gco:nilReason="unknown" or gco:nilReason="missing" on it rather than inventing a placeholder; this keeps gaps visible during audits. The ISO 19139 XML encoding used here supports nil values when data is unavailable, provided the reason is documented.
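
A minimal sketch of emitting a nil value is shown here. It uses the standard library's xml.etree.ElementTree to stay dependency-free (the same pattern applies with lxml), and the helper name `text_or_nil` is hypothetical.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def text_or_nil(tag: str, value, nil_reason: str = "missing") -> ET.Element:
    """Build a gmd property holding a CharacterString, or an empty nil element."""
    prop = ET.Element(f"{{{GMD}}}{tag}")
    if value:
        child = ET.SubElement(prop, f"{{{GCO}}}CharacterString")
        child.text = str(value)
    else:
        # No value available: leave the property empty and record why.
        prop.set(f"{{{GCO}}}nilReason", nil_reason)
    return prop


# An organisation name that could not be determined from any source
print(ET.tostring(text_or_nil("organisationName", None, "unknown"), encoding="unicode"))
```

The attribute is placed on the property element itself (here gmd:organisationName), which is then left without a child value.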

Production Validation & Pipeline Integration

Raw XML generation is only half the workflow. Before publishing to a catalog or data portal, validate the output against the ISO 19139 XSD, which is published by ISO/TC 211 and mirrored in the OGC schema repository. You can use lxml.etree.XMLSchema or the xmlschema Python package to enforce strict compliance:

from lxml import etree

xsd_path = "iso19139.xsd"  # Local copy of the ISO 19139 entry-point XSD (typically gmd/gmd.xsd)
schema = etree.XMLSchema(etree.parse(xsd_path))
doc = etree.parse("sample_metadata.xml")

if schema.validate(doc):
    print("Validation passed.")
else:
    print("Validation failed:", schema.error_log)

Key production considerations:

  • CRS Transformation: ISO 19115 geographic bounding boxes must be in WGS84 (EPSG:4326). If your source raster uses a projected CRS, transform the bounds with pyproj or rasterio.warp.transform_bounds before writing them to EX_GeographicBoundingBox.
  • Namespace Prefix Consistency: Catalog harvesters (GeoNetwork, CKAN, ArcGIS) expect strict gmd:, gco:, and gml: prefixes. Always declare nsmap at the root element.
  • Batch Processing: Wrap the generator in a multiprocessing pool or async queue. Header extraction is dominated by file I/O rather than computation, so the achievable speedup depends on storage throughput and file count; benchmark on representative data before sizing the pool.
  • Header Stripping: Tools that are not GeoTIFF-aware (e.g., general-purpose image editors and converters) can drop the geospatial tags when they rewrite a file. Always verify that src.crs is not None and that src.bounds returns plausible values before serialization.
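
The batch-processing consideration can be sketched as follows. This version uses a thread pool from concurrent.futures rather than the multiprocessing pool mentioned above, on the assumption that header reads are I/O-bound; ProcessPoolExecutor offers the same interface if that assumption does not hold. The function name `run_batch` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_batch(paths: Iterable[str], worker: Callable[[str], None],
              max_workers: int = 4) -> Dict[str, str]:
    """Run worker(path) for every path and return {path: error} for the failures."""
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
            except Exception as exc:
                # Keep going: one header-stripped raster should not stop the batch.
                failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

A worker for this article's generator could be `lambda p: generate_iso19115_from_geotiff(p, p + ".xml")`. Collecting failures instead of raising lets the pipeline report files with a missing CRS or unreadable headers in one pass.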

For deeper integration patterns, consult the rasterio documentation on metadata domains and coordinate reference system handling.