Which PostGIS views hold spatial metadata?

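geometry_columns and spatial_ref_sys are the primary PostGIS metadata relations. geometry_columns is a view that lists every geometry column with its SRID, geometry type, and coordinate dimension; spatial_ref_sys is a table that maps SRID integers to PROJ and WKT definitions. Geography columns are listed separately in geography_columns.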

Does catalog extraction require scanning table rows?

No. Everything returned by pg_catalog, information_schema, and geometry_columns is read from the system catalogs, so query time does not depend on how many rows the user tables contain.

Which database role is required for catalog queries?

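The connecting role needs USAGE on the target schemas and SELECT on their tables, because information_schema lists only objects the role has some privilege on. System catalogs, comments returned by obj_description() and col_description(), and index metadata are readable by any role by default, so superuser privileges are not needed. The built-in pg_read_all_stats role is required only if the pipeline also reads restricted statistics views.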

Automating Metadata Extraction from PostGIS Tables

Automate PostGIS metadata extraction by querying geometry_columns, information_schema, and pg_catalog with Python, then mapping SRID integers and geometry types to DCAT or ISO 19115 fields before passing the result to a schema linter.

This operation is more involved than querying a standard relational database because spatial tables distribute their structural information across at least three separate system catalogs. Understanding where each attribute lives, and how to join those catalogs efficiently, is the difference between a fragile one-off script and a pipeline that survives schema migrations. This page sits within the Metadata Schema Validation and Linting workflow, which is part of the broader Automated Metadata Generation & Schema Mapping pillar covering the full lifecycle from template generation to compliance gating.

Prerequisites

Before running the extraction script, confirm the following environment conditions are met:

PostgreSQL 14+ with PostGIS 3.3+ installed. geometry_columns has been a view over the system catalogs since PostGIS 2.0, so no manual registration is needed; 3.3 improves ST_Transform caching and bounding-box extent accuracy.
Python 3.9+ with psycopg2-binary >= 2.9.9 or psycopg >= 3.1 (do not mix major driver versions in the same virtualenv).
Database role with USAGE on the target schema and SELECT on its tables. information_schema.tables and information_schema.columns are readable by every role but list only objects the role has privileges on. obj_description() and col_description() work without superuser access; grant the built-in pg_read_all_stats role only if you also read restricted statistics views.
Familiarity with the target standard: ISO 19115 metadata template generation for international catalog publishing, or DCAT-AP spatial profile mapping for European open data portals.
A downstream linter ready to receive JSON output — see the metadata schema validation and linting guide for jsonschema and pydantic setup.

How PostGIS Distributes Spatial Metadata

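PostGIS stores spatial table attributes across three separate locations: information_schema (table names, column names, data types, nullability), pg_catalog (object OIDs and the comments reached through obj_description() and col_description()), and geometry_columns (geometry column name, SRID, geometry type, coordinate dimension). Understanding this split prevents incomplete extractions and silent data loss.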

The key implication: you must join information_schema.tables with pg_catalog.pg_class (to reach obj_description()) and left-join with geometry_columns (which has no row for non-spatial tables, so its columns come back NULL). Missing any of these layers produces metadata that fails downstream spatial data schema linting in CI.
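The join described above can be sketched as a single query. This is a minimal illustration under stated assumptions rather than the extraction script itself: the names `TABLE_OVERVIEW_SQL` and `fetch_table_overview` are hypothetical, and the function accepts any open DB-API cursor (for example one created by psycopg2).

```python
from typing import Any

# Hypothetical query constant; only the catalog and column names are real.
TABLE_OVERVIEW_SQL = """
SELECT
    t.table_name,
    pg_catalog.obj_description(c.oid, 'pg_class') AS table_comment,
    g.f_geometry_column,
    g.srid,
    g.type AS geometry_type
FROM information_schema.tables t
JOIN pg_catalog.pg_namespace n
    ON n.nspname = t.table_schema
JOIN pg_catalog.pg_class c
    ON c.relnamespace = n.oid AND c.relname = t.table_name
LEFT JOIN geometry_columns g
    ON g.f_table_schema = t.table_schema
   AND g.f_table_name = t.table_name
WHERE t.table_schema = %s
  AND t.table_type = 'BASE TABLE'
ORDER BY t.table_name, g.f_geometry_column;
"""


def fetch_table_overview(cur: Any, schema: str) -> list:
    """Run the three-layer join on an open DB-API cursor and return all rows."""
    cur.execute(TABLE_OVERVIEW_SQL, (schema,))
    return cur.fetchall()
```

Non-spatial tables come back with NULL in the three `g.*` columns, and a table with several geometry columns produces one row per geometry column.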

Automated Python Implementation

The script below is self-contained and runnable. It queries all three catalog layers for a given schema, assembles a structured dictionary, and serializes it to JSON. Inline comments explain each decision point.

import psycopg2
import json
from psycopg2.extras import RealDictCursor
from datetime import datetime, timezone


def extract_postgis_metadata(
    db_host: str,
    db_name: str,
    db_user: str,
    db_pass: str,
    target_schema: str = "public",
    db_port: int = 5432,
) -> str:
    """
    Extract table, column, and spatial metadata from a PostGIS database schema.

    Returns a JSON string. Raises psycopg2.OperationalError on connection failure
    and psycopg2.ProgrammingError if geometry_columns is missing (PostGIS not
    installed). A schema that does not exist yields an empty "tables" list.
    """
    conn = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass,
    )
    output = {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "schema": target_schema,
        "tables": [],
    }

    try:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:

            # --- Phase 1: base tables + table-level comments ---
            # pg_class.oid is the bridge between information_schema and pg_catalog.
            # relnamespace restricts the join to the target schema, avoiding
            # name collisions when identically named tables exist in multiple schemas.
            cur.execute(
                """
                SELECT
                    t.table_name,
                    pg_catalog.obj_description(c.oid, 'pg_class') AS table_comment
                FROM information_schema.tables t
                JOIN pg_catalog.pg_class c
                    ON t.table_name = c.relname
                JOIN pg_catalog.pg_namespace n
                    ON c.relnamespace = n.oid AND n.nspname = t.table_schema
                WHERE t.table_schema = %s
                  AND t.table_type = 'BASE TABLE'
                ORDER BY t.table_name;
                """,
                (target_schema,),
            )
            tables = cur.fetchall()

            # --- Phase 2: spatial metadata from PostGIS ---
            # geometry_columns is a view in PostGIS 3.x; it lists geometry
            # columns only. Geography columns appear in geography_columns.
            cur.execute(
                """
                SELECT
                    f_table_name,
                    f_geometry_column,
                    srid,
                    type,
                    coord_dimension
                FROM geometry_columns
                WHERE f_table_schema = %s
                ORDER BY f_table_name, f_geometry_column;
                """,
                (target_schema,),
            )
            # Build a lookup keyed by table name. If a table has multiple
            # geometry columns, only the first (by column name) is stored here;
            # extend this into a list if multi-geometry tables are in scope.
            spatial_map = {}
            for row in cur.fetchall():
                # setdefault keeps the first row seen for each table.
                spatial_map.setdefault(row["f_table_name"], row)

            # --- Phase 3: column-level metadata + column comments ---
            # col_description() requires the pg_class OID and the ordinal position
            # (1-indexed) from information_schema.columns.
            cur.execute(
                """
                SELECT
                    cols.table_name,
                    cols.column_name,
                    cols.data_type,
                    cols.is_nullable,
                    cols.ordinal_position,
                    pg_catalog.col_description(c.oid, cols.ordinal_position::int)
                        AS col_comment
                FROM information_schema.columns cols
                JOIN pg_catalog.pg_class c
                    ON cols.table_name = c.relname
                JOIN pg_catalog.pg_namespace n
                    ON c.relnamespace = n.oid AND n.nspname = cols.table_schema
                WHERE cols.table_schema = %s
                ORDER BY cols.table_name, cols.ordinal_position;
                """,
                (target_schema,),
            )
            all_columns = cur.fetchall()

            # Group columns by table name for O(n) assembly below.
            cols_by_table: dict[str, list] = {}
            for col in all_columns:
                cols_by_table.setdefault(col["table_name"], []).append(col)

            # --- Phase 4: assemble output ---
            for table in tables:
                name = table["table_name"]
                table_meta: dict = {
                    "name": name,
                    "description": table["table_comment"],
                    "columns": [],
                    "spatial": None,
                }

                for col in cols_by_table.get(name, []):
                    table_meta["columns"].append(
                        {
                            "name": col["column_name"],
                            "type": col["data_type"],
                            "nullable": col["is_nullable"] == "YES",
                            "description": col["col_comment"],
                        }
                    )

                if name in spatial_map:
                    sp = spatial_map[name]
                    # Map SRID integer to the canonical OGC CRS URI.
                    # SRID 0 means "unknown"; do not emit a CRS URI for it.
                    # This assumes the SRID is an EPSG code; confirm against
                    # spatial_ref_sys.auth_name before publishing.
                    crs_uri = (
                        f"http://www.opengis.net/def/crs/EPSG/0/{sp['srid']}"
                        if sp["srid"]
                        else None
                    )
                    table_meta["spatial"] = {
                        "geometry_column": sp["f_geometry_column"],
                        "srid": sp["srid"],
                        "crs_uri": crs_uri,
                        "geometry_type": sp["type"],
                        "coord_dimension": sp["coord_dimension"],
                    }

                output["tables"].append(table_meta)

    finally:
        conn.close()

    return json.dumps(output, indent=2, ensure_ascii=False)


# --- Example usage ---
# result = extract_postgis_metadata(
#     db_host="localhost",
#     db_name="gis_db",
#     db_user="readonly_svc",
#     db_pass="s3cur3",
#     target_schema="public",
# )
# print(result)
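Phase 2 of the script keeps a single geometry column per table. Where tables carry several geometry columns, the lookup could be extended into a list along these lines. This is a sketch only: `group_geometry_columns` is a hypothetical helper, and it assumes rows shaped like the dictionaries that RealDictCursor returns for the geometry_columns query.

```python
from collections import defaultdict


def group_geometry_columns(rows: list) -> dict:
    """Group geometry_columns rows by table, keeping every geometry column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["f_table_name"]].append(row)
    # Sort each table's columns by name so the output is deterministic.
    return {
        table: sorted(cols, key=lambda r: r["f_geometry_column"])
        for table, cols in grouped.items()
    }
```

With this shape, the "spatial" entry in the output would become a list of objects rather than a single object, so any downstream schema would need the matching change.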

Key design decisions

Namespace-scoped joins. Joining through pg_namespace prevents false matches when tables with identical names exist in different schemas — a common pattern in multi-tenant PostGIS deployments.
No row-level scans. All three phases query system catalog tables only. Execution time is O(1) relative to table row count, making the script safe to run on multi-hundred-million-row production databases.
Canonical CRS URIs. Emitting http://www.opengis.net/def/crs/EPSG/0/4326 rather than the bare integer 4326 lets downstream validators enforce the URI format required by DCAT-AP spatial profile mapping without an extra transformation step. The URI is valid only when the SRID is an EPSG code, which the script assumes.
SRID 0 handling. PostGIS permits geometry columns with SRID 0 (“unknown”). The script emits null for crs_uri in this case rather than a malformed URI, which a downstream linter can flag as a required-field failure.
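The script builds the CRS URI directly from the SRID, which assumes every SRID is an EPSG code. spatial_ref_sys records the issuing authority in its auth_name and auth_srid columns, so a stricter variant could consult those first. The helper below is a hypothetical sketch of that check, not part of the script above; returning None for non-EPSG authorities is an assumption about how such entries should be treated.

```python
from typing import Optional


def build_crs_uri(auth_name: Optional[str], auth_srid: Optional[int]) -> Optional[str]:
    """Return an OGC CRS URI for EPSG-registered codes, otherwise None."""
    if not auth_name or not auth_srid:
        # Covers SRID 0 and SRIDs with no spatial_ref_sys row.
        return None
    if auth_name.upper() != "EPSG":
        # Non-EPSG authorities (ESRI, custom entries) get no EPSG URI here.
        return None
    return f"http://www.opengis.net/def/crs/EPSG/0/{auth_srid}"
```

The two arguments would come from a lookup such as `SELECT auth_name, auth_srid FROM spatial_ref_sys WHERE srid = %s`.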

Validation and Pipeline Integration

Once the JSON is produced, three verification layers confirm it is ready for catalog ingestion.

1. Structural JSON Schema check

Install jsonschema and point it at a schema document that enforces the presence of name, spatial.srid, spatial.crs_uri, and at least one column with a non-null description:

python -m jsonschema \
    --instance extracted_metadata.json \
    schemas/postgis_metadata_v1.schema.json

Exit code 0 means the document conforms. A non-zero exit code means validation failed, and the errors are printed to stderr. The schema path is a positional argument; only the instance takes a flag. Wire this into your CI step immediately after the extraction run; see setting up GitHub Actions for ISO 19115 validation for a reusable workflow YAML template.
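The contents of postgis_metadata_v1.schema.json are not shown in this guide. As a rough sketch of what such a schema might contain, the following builds one as a Python dictionary and serializes it. Every constraint here is an assumption derived from the requirements listed above (name, spatial.srid, spatial.crs_uri, and at least one described column), and `schema_as_json` is a hypothetical helper.

```python
import json

# Illustrative schema only; adjust to the fields your catalog requires.
POSTGIS_METADATA_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["extracted_at", "schema", "tables"],
    "properties": {
        "tables": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "columns", "spatial"],
                "properties": {
                    "name": {"type": "string", "minLength": 1},
                    "spatial": {
                        "type": "object",
                        "required": ["srid", "crs_uri"],
                        "properties": {
                            "srid": {"type": "integer", "minimum": 1},
                            "crs_uri": {
                                "type": "string",
                                "pattern": "^http://www\\.opengis\\.net/def/crs/EPSG/0/[0-9]+$",
                            },
                        },
                    },
                    # "contains" requires at least one column with a description.
                    "columns": {
                        "type": "array",
                        "contains": {
                            "type": "object",
                            "required": ["description"],
                            "properties": {
                                "description": {"type": "string", "minLength": 1}
                            },
                        },
                    },
                },
            },
        }
    },
}


def schema_as_json() -> str:
    """Serialize the sketch schema so it can be written to a .schema.json file."""
    return json.dumps(POSTGIS_METADATA_SCHEMA, indent=2)
```

As written, a non-spatial table (where "spatial" is null) fails validation. If such tables are in scope, widening the type to `["object", "null"]` would let them pass while still checking spatial tables.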

2. SRID cross-reference check

Not every SRID integer registered in geometry_columns exists in spatial_ref_sys. An unmapped SRID will cause ST_Transform to fail at runtime and will break CRS URI resolution in catalog APIs. Run:

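import psycopg2

def check_srid_registered(conn_params: dict, srid: int) -> bool:
    """Return True if srid has a row in spatial_ref_sys."""
    # psycopg2's connection context manager ends the transaction but does
    # not close the connection, so close it explicitly.
    conn = psycopg2.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT 1 FROM spatial_ref_sys WHERE srid = %s;",
                (srid,),
            )
            return cur.fetchone() is not None
    finally:
        conn.close()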

Flag any table whose SRID returns False as a blocking validation error before metadata reaches the catalog.
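To apply that check across a whole extraction without opening one connection per SRID, the distinct SRIDs can be collected from the JSON first and compared against a set fetched in a single query. The sketch below covers only the pure-Python part. `collect_srids` and `find_unregistered` are hypothetical helpers, and `registered_srids` is assumed to come from something like `SELECT srid FROM spatial_ref_sys WHERE srid = ANY(%s)`.

```python
import json


def collect_srids(metadata_json: str) -> dict:
    """Map each non-zero SRID in the extracted metadata to the tables using it."""
    data = json.loads(metadata_json)
    usage: dict = {}
    for table in data["tables"]:
        spatial = table.get("spatial")
        # Skip non-spatial tables and SRID 0 ("unknown").
        if spatial and spatial.get("srid"):
            usage.setdefault(spatial["srid"], []).append(table["name"])
    return usage


def find_unregistered(usage: dict, registered_srids: set) -> list:
    """Return one message per table whose SRID is not in registered_srids."""
    return [
        f"Table '{name}' uses unregistered SRID {srid}"
        for srid, names in sorted(usage.items())
        for name in names
        if srid not in registered_srids
    ]
```

An empty list from `find_unregistered` means every SRID in the extraction has a matching spatial_ref_sys row.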

3. Missing-comment audit

Table and column comments are the primary source of human-readable description fields in ISO 19115 and DCAT records. Run a quick audit to surface tables or columns with null descriptions before publishing:

import json

def audit_missing_descriptions(metadata_json: str) -> list[str]:
    data = json.loads(metadata_json)
    issues = []
    for table in data["tables"]:
        if not table.get("description"):
            issues.append(f"Table '{table['name']}' has no comment")
        for col in table["columns"]:
            if not col.get("description"):
                issues.append(
                    f"Column '{table['name']}.{col['name']}' has no comment"
                )
    return issues

Feed the resulting list into a pytest assertion or a CI gate that blocks publication when the proportion of undocumented columns exceeds a configured threshold (e.g., 20%).
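The audit function returns messages rather than a proportion, so a threshold gate needs a ratio computed separately. One possible shape is sketched here; the function names are hypothetical and the 0.20 default simply mirrors the example threshold mentioned above.

```python
import json


def undocumented_column_ratio(metadata_json: str) -> float:
    """Return the share of columns with a missing or empty description."""
    data = json.loads(metadata_json)
    total = 0
    missing = 0
    for table in data["tables"]:
        for col in table["columns"]:
            total += 1
            if not col.get("description"):
                missing += 1
    # A schema with no columns at all is treated as fully documented.
    return missing / total if total else 0.0


def passes_gate(metadata_json: str, max_ratio: float = 0.20) -> bool:
    """True when the undocumented share does not exceed max_ratio."""
    return undocumented_column_ratio(metadata_json) <= max_ratio
```

In a pytest suite this could be a single `assert passes_gate(result)` against the extraction output.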

CI/CD integration pattern

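Embed the extraction and the checks above into a single pipeline step triggered on schema migrations or scheduled daily for drift detection. The partial workflow that follows assumes the functions on this page have been wrapped as command-line scripts; it leaves out the SRID cross-reference check, which needs a database connection: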

# .github/workflows/metadata-extract.yml  (partial)
- name: Extract and validate PostGIS metadata
  run: |
    python scripts/extract_postgis_metadata.py \
        --schema public \
        --output extracted_metadata.json
    python -m jsonschema \
        --instance extracted_metadata.json \
        schemas/postgis_metadata_v1.schema.json
    python scripts/audit_missing_descriptions.py extracted_metadata.json

Pair a diff of extracted_metadata.json against the last committed version with policy enforcement gates for data PRs to block merges that introduce SRID changes or drop table comments without a corresponding metadata update.
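The workflow step calls the extraction script with --schema and --output flags, which the function shown earlier does not parse on its own. A thin command-line wrapper could look roughly like this. It is a sketch under assumptions: connection settings are read from the standard libpq environment variables (PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD), and the extraction function is passed in as a parameter so the block stays self-contained.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two flags used by the workflow step."""
    parser = argparse.ArgumentParser(
        description="Extract PostGIS catalog metadata to JSON."
    )
    parser.add_argument("--schema", default="public", help="Schema to inspect")
    parser.add_argument(
        "--output",
        default="extracted_metadata.json",
        help="Path of the JSON file to write",
    )
    return parser.parse_args(argv)


def connection_settings(env=os.environ) -> dict:
    """Read connection settings from libpq-style environment variables."""
    return {
        "db_host": env.get("PGHOST", "localhost"),
        "db_port": int(env.get("PGPORT", "5432")),
        "db_name": env["PGDATABASE"],  # required; raises KeyError if unset
        "db_user": env["PGUSER"],      # required; raises KeyError if unset
        "db_pass": env.get("PGPASSWORD", ""),
    }


def main(extract, argv=None, env=os.environ) -> None:
    """Run extract() with parsed settings and write its JSON string to disk."""
    args = parse_args(argv)
    result = extract(target_schema=args.schema, **connection_settings(env))
    with open(args.output, "w", encoding="utf-8") as fh:
        fh.write(result)
```

In the script file, `main(extract_postgis_metadata)` under an `if __name__ == "__main__":` guard would connect the wrapper to the function defined earlier. Keeping credentials in environment variables also lets the CI runner inject them from its secret store.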

Long-term Compliance Best Practices

Set COMMENT ON TABLE and COMMENT ON COLUMN as a migration discipline, not an afterthought. Add a linting step to your Alembic or Flyway migration runner that rejects any CREATE TABLE migration lacking a COMMENT ON TABLE statement. This keeps descriptions populated at creation time rather than requiring a backfill campaign.
Version the extracted JSON alongside the schema migrations. Store extracted_metadata.json in version control and diff it in CI. A SRID change or a geometry type change appearing in the diff is an explicit signal that downstream catalog records and license attributions need review — directly supporting automated metadata generation & schema mapping governance workflows.
Use a read-only service role for extraction. Never run the extraction pipeline with a role that has INSERT or DDL privileges. A dedicated metadata_reader role with GRANT USAGE ON SCHEMA public TO metadata_reader and GRANT SELECT ON ALL TABLES IN SCHEMA public TO metadata_reader is sufficient for the catalog queries and limits blast radius from credential compromise. Add GRANT pg_read_all_stats TO metadata_reader only if the pipeline also reads restricted statistics views.
Cross-check spatial_ref_sys during ingestion, not just extraction. If your PostGIS instance was restored from a dump that predates a custom SRID registration, geometry_columns may reference SRIDs absent from spatial_ref_sys. Running the SRID cross-reference check at ingestion time catches these gaps before they silently corrupt CRS URI fields in published catalog records.
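Extend the script to capture bounding-box extents without scanning rows. ST_Extent() is an aggregate that reads every row, and a GiST or SP-GiST index on the geometry column does not change that. ST_EstimatedExtent() instead reads the planner statistics gathered by ANALYZE, so it stays cheap on large tables, but the result is approximate and nothing is returned until the table has been analyzed. Record which function produced the value, for example "extent_source": "estimated" versus "extent_source": "exact", so catalog consumers can tell the two apart.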
Schedule extraction runs after major data loads, not just after schema changes. Attribute changes (new columns, SRID re-registrations, comment updates) are the obvious triggers, but row-count estimates from pg_stat_user_tables also drift after large ingestions. A daily scheduled run keeps catalog statistics fresh without requiring a pipeline trigger on every data load.
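The practice of versioning the extracted JSON and diffing it in CI can be sketched as a small comparison function. This is an illustration only: `spatial_changes` is a hypothetical helper, it compares just the "srid" and "geometry_type" fields produced by the extraction script, and it ignores tables that exist in only one of the two documents.

```python
import json


def spatial_changes(old_json: str, new_json: str) -> list:
    """List SRID and geometry-type changes between two extraction outputs."""

    def index(doc: str) -> dict:
        # Map table name -> spatial dict ({} for non-spatial tables).
        return {t["name"]: t.get("spatial") or {} for t in json.loads(doc)["tables"]}

    old, new = index(old_json), index(new_json)
    changes = []
    # Only tables present in both versions are compared.
    for name in sorted(old.keys() & new.keys()):
        for field in ("srid", "geometry_type"):
            before, after = old[name].get(field), new[name].get(field)
            if before != after:
                changes.append(f"{name}: {field} changed from {before} to {after}")
    return changes
```

A non-empty result could fail the CI job or add a review label to the pull request, depending on how strict the governance policy is.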

Related

Metadata Schema Validation and Linting: parent page covering full validation pipeline architecture, rule profiles, and CI/CD integration patterns
Automated Metadata Generation & Schema Mapping: pillar covering the lifecycle from template generation to schema compliance
DCAT-AP Spatial Profile Mapping: mapping extracted PostGIS attributes to DCAT-AP distribution and spatial dataset fields
Spatial Data Schema Linting in CI: integrating schema linters into GitHub Actions and pre-commit hooks
Validating FGDC Metadata Against XML Schemas: parallel validation workflow for FGDC CSDGM output
