Automating metadata extraction from PostGIS tables
Automating metadata extraction from PostGIS tables requires querying PostgreSQL’s information_schema alongside PostGIS-specific catalog views like geometry_columns and spatial_ref_sys. The objective is to map raw database attributes—geometry types, SRIDs, constraints, and table comments—into standardized formats such as ISO 19115 or DCAT. Production-grade implementations rely on Python with psycopg or psycopg2 to pull spatial metadata, serialize it to JSON/XML, and pipe it directly into validation pipelines. This eliminates manual cataloging, enforces cross-agency consistency, and supports automated geospatial data licensing workflows.
Core Architecture & Extraction Workflow
PostGIS distributes spatial metadata across multiple system catalogs. Unlike standard relational databases, spatial tables require explicit extraction of geometry type, coordinate reference system (CRS), bounding box, and spatial index status. A robust automation pipeline follows three deterministic phases:
- Catalog Query: Extract table-level and column-level attributes from
pg_catalog,information_schema, and PostGIS views. This phase avoids scanning actual table rows, keeping execution time predictable regardless of dataset size. - Schema Mapping: Translate PostgreSQL types and EPSG codes into standardized metadata fields. For example,
geometrybecomesdcat:spatial, and SRID4326maps tohttp://www.opengis.net/def/crs/EPSG/0/4326. This step aligns with established Automated Metadata Generation & Schema Mapping practices, ensuring extracted attributes map cleanly to discovery portals and compliance requirements. - Serialization & Validation: Output machine-readable JSON or ISO 19115 XML and run it through a schema linter before publishing. Validation catches missing mandatory fields, malformed URIs, and non-compliant CRS definitions early in the pipeline.
Prerequisites & Environment Constraints
| Component | Minimum Version | Notes |
|---|---|---|
| PostgreSQL | 12+ (14–16 recommended) | Older versions lack robust jsonb aggregation and improved pg_catalog performance |
| PostGIS | 3.0+ (3.3+ recommended) | 3.0 stabilizes the geometry_columns view; 3.3+ improves ST_Transform caching and extent calculations |
| Python | 3.9+ | Use psycopg2-binary (2.9.9+) or modern psycopg (3.x). Do not mix major driver versions |
| Database Roles | SELECT on target schemas + pg_read_all_stats |
Required for reading system catalogs, table comments, and index metadata |
| External Reference | PostGIS Documentation | Review catalog view changes and deprecation warnings before upgrading minor releases |
Production-Ready Python Pipeline
The following script queries a target schema, extracts spatial and relational metadata, and outputs a structured JSON document ready for downstream validation. It uses connection pooling, parameterized queries, and safe JSON serialization.
import psycopg2
import json
from psycopg2.extras import RealDictCursor
from datetime import datetime, timezone
def extract_postgis_metadata(db_host, db_name, db_user, db_pass, target_schema="public"):
"""Extracts table, column, and spatial metadata from a PostGIS database."""
conn = psycopg2.connect(
host=db_host, dbname=db_name, user=db_user, password=db_pass
)
metadata = {"extracted_at": datetime.now(timezone.utc).isoformat(), "schema": target_schema, "tables": []}
try:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
# 1. Fetch base tables with descriptions
cur.execute("""
SELECT t.table_name,
pg_catalog.obj_description(c.oid, 'pg_class') AS table_comment
FROM information_schema.tables t
JOIN pg_catalog.pg_class c ON t.table_name = c.relname
WHERE t.table_schema = %s AND t.table_type = 'BASE TABLE'
ORDER BY t.table_name;
""", (target_schema,))
tables = cur.fetchall()
# 2. Fetch spatial metadata from PostGIS catalog
cur.execute("""
SELECT f_table_name, f_geometry_column, srid, type
FROM geometry_columns
WHERE f_table_schema = %s;
""", (target_schema,))
spatial_map = {row["f_table_name"]: row for row in cur.fetchall()}
# 3. Fetch column-level metadata
cur.execute("""
SELECT table_name, column_name, data_type, is_nullable,
pg_catalog.col_description(c.oid, ordinal_position) AS col_comment
FROM information_schema.columns cols
JOIN pg_catalog.pg_class c ON cols.table_name = c.relname
WHERE cols.table_schema = %s
ORDER BY cols.table_name, cols.ordinal_position;
""", (target_schema,))
columns = cur.fetchall()
# 4. Assemble structured output
for table in tables:
table_meta = {
"name": table["table_name"],
"description": table["table_comment"],
"columns": [],
"spatial": None
}
# Attach columns
for col in columns:
if col["table_name"] == table["table_name"]:
table_meta["columns"].append({
"name": col["column_name"],
"type": col["data_type"],
"nullable": col["is_nullable"] == "YES",
"description": col["col_comment"]
})
# Attach spatial metadata if present
if table["table_name"] in spatial_map:
sp = spatial_map[table["table_name"]]
table_meta["spatial"] = {
"geometry_column": sp["f_geometry_column"],
"srid": sp["srid"],
"geometry_type": sp["type"]
}
metadata["tables"].append(table_meta)
finally:
conn.close()
return json.dumps(metadata, indent=2, ensure_ascii=False)
# Example usage
# print(extract_postgis_metadata("localhost", "gis_db", "admin", "secure_pass", "public"))
Key Implementation Notes
- Performance: Catalog queries execute in milliseconds regardless of table size. Avoid
ST_Extent()in automated pipelines unless explicitly required; it forces full table scans on unindexed geometries. - Type Mapping: PostgreSQL
jsonb,uuid, andtimestamptypes should be mapped to DCAT/ISO equivalents during post-processing. - Error Resilience: Wrap the function in a retry decorator for transient connection drops in cloud-hosted environments (e.g., AWS RDS, Azure Database for PostgreSQL).
Validation, CI/CD Integration & Best Practices
Once extracted, metadata must pass structural and semantic checks before entering a data catalog. Integrate the JSON output with a schema validator like jsonschema or pydantic to enforce required fields (title, license, spatial:crs). This step directly supports Metadata Schema Validation and Linting workflows by catching missing descriptions, invalid EPSG codes, or non-standard date formats before publication.
For CI/CD integration, embed the extraction script in a GitHub Actions or GitLab CI pipeline triggered on schema migrations. Use a lightweight linter to diff generated metadata against the previous commit, flagging drift in column types or SRID changes. Pair this with automated license tagging (e.g., CC-BY-4.0, OGL) to maintain compliance across agency datasets.
Consult the PostgreSQL System Catalogs documentation for advanced catalog joins, and reference the OGC GeoSPARQL and ISO 19115 standards when mapping spatial attributes to open data portals. By standardizing extraction, validation, and publishing, teams can scale geospatial metadata management without manual intervention.