Spatial Data Schema Linting in CI

Automated validation of geospatial assets has shifted from post-ingestion quality control to pre-merge pipeline enforcement. Spatial Data Schema Linting in CI establishes deterministic checks against coordinate reference systems, attribute dictionaries, geometry validity, and metadata compliance before datasets reach production repositories. For GIS data managers, open-source maintainers, and government agency tech teams, embedding these checks directly into pull request pipelines eliminates costly downstream reconciliation and enforces consistent licensing and provenance tracking.

When paired with broader CI/CD Validation & Policy Enforcement for Spatial Data frameworks, schema linting becomes a foundational gate that guarantees structural integrity while preserving contributor velocity. By catching malformed geometries, mismatched CRS URNs, and missing mandatory attributes at commit time, teams avoid propagating topological errors into analytical workflows or public-facing map services.

Prerequisites and Environment Baseline

Before implementing automated linting, ensure your development and CI environments meet the following baseline requirements:

  • Python 3.9+ with venv or conda environment isolation
  • GDAL/OGR 3.4+ compiled with PROJ 9+ for robust CRS transformations and format translation
  • Core Python Packages: pydantic, jsonschema, geopandas, fiona, shapely, lxml (for XML/ISO metadata), pylint
  • CI Runner Access: GitHub Actions, GitLab CI, or Jenkins with sufficient memory for vector file parsing (≥4 GB recommended for large GeoPackages)
  • Baseline Schema Definition: A machine-readable schema representing your organization’s mandatory fields, geometry types, and CRS constraints

If your workflow requires strict compliance with international metadata standards, consult Setting up GitHub Actions for ISO 19115 validation to align schema definitions with ISO 19115-1:2014 requirements. The official specification provides the authoritative baseline for geographic information metadata, which should directly inform your linting rule sets and attribute dictionaries.

Authoring Machine-Readable Geospatial Schemas

A robust spatial linter relies on a strict, version-controlled schema that defines both structural and semantic expectations. JSON Schema remains the industry standard for declarative validation, offering native support for pattern matching, enum constraints, and nested object definitions. For Python-heavy pipelines, Pydantic provides runtime validation with automatic type coercion and detailed error reporting.

When designing your schema, enforce the following geospatial constraints:

  1. Geometry Type Restrictions: Explicitly whitelist allowed types (Point, Polygon, MultiLineString, etc.) to prevent mixed-geometry files that break downstream rendering engines.
  2. CRS Enforcement: Require a standardized crs field using EPSG codes or OGC URNs (e.g., EPSG:4326 or urn:ogc:def:crs:EPSG::3857). Reject implicit or undefined projections.
  3. Attribute Dictionaries: Define mandatory fields, data types, and value ranges. For categorical fields, use enum arrays to restrict inputs to approved taxonomies.
  4. Licensing and Provenance: Mandate SPDX license identifiers and source attribution fields to satisfy open-data compliance requirements.

Refer to the official JSON Schema documentation for syntax specifications and draft version compatibility. Version your schema files alongside your data repositories, and implement semantic versioning to track breaking changes in attribute requirements.

Step-by-Step CI Pipeline Configuration

Implementing spatial schema linting follows a deterministic sequence that integrates cleanly with existing version control practices. The goal is to fail fast, provide actionable error messages, and avoid blocking legitimate contributions with false positives.

flowchart LR
    classDef ok fill:#d7efef,stroke:#0e7c86,color:#0a5d65;
    classDef bad fill:#fde0dd,stroke:#c0392b,color:#922b21;
    A(["Commit"]) --> P{"Pre-commit hook"}
    P -->|fail| FIX["Fix locally"]
    FIX --> A
    P -->|pass| PUSH["Push / open PR"]
    PUSH --> CI["CI: schema validation"]
    CI --> GEO["Geometry validity"]
    GEO --> CRS["CRS verification"]
    CRS --> REF["Reference integrity"]
    REF --> FMT["OGC format compliance"]
    FMT --> Z{"All checks pass?"}
    Z -->|no| BLOCK["Exit non-zero, block merge"]
    Z -->|yes| OK["Allow merge"]
    class OK ok
    class BLOCK bad

1. Local Pre-Commit Hook

Run the linter locally using pre-commit to catch structural violations before pushing. This reduces CI queue congestion and accelerates developer feedback loops. Configure .pre-commit-config.yaml to trigger only on staged geospatial files:

- repo: local
  hooks:
    - id: spatial-schema-lint
      name: Validate Geospatial Assets
      entry: python -m scripts.spatial_lint
      language: python
      types_or: [file]
      files: \.(geojson|gpkg|shp|csv|parquet)$
      pass_filenames: true

2. CI Trigger Configuration

Attach the linting job to pull_request and push events targeting protected branches. Filter execution to modified files using path globbing (**/*.geojson, **/*.gpkg, **/*.shp). This ensures that documentation-only commits or unrelated code changes do not trigger unnecessary spatial validation.

3. Policy Enforcement Integration

Configure the CI job to exit with a non-zero status code when validation fails, automatically blocking merges until issues are resolved. Align this behavior with Policy Enforcement Gates for Data PRs to standardize approval workflows across data and code repositories. Require at least one reviewer with spatial data expertise to override lint failures when exceptions are justified.

4. GitHub Actions Implementation

Below is a production-ready workflow that installs dependencies, caches GDAL binaries, and executes the linter:

name: Spatial Schema Lint
on:
  pull_request:
    paths:
      - 'data/**/*.geojson'
      - 'data/**/*.gpkg'
      - 'data/**/*.shp'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install GDAL & Dependencies
        run: |
          sudo apt-get update && sudo apt-get install -y gdal-bin libgdal-dev
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run Schema Validation
        run: python scripts/validate_spatial.py --strict --report-format json
      - name: Upload Validation Report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: lint-report
          path: reports/validation_errors.json

Geometry, CRS, and Reference Validation Logic

Schema validation alone cannot guarantee topological integrity or coordinate accuracy. A comprehensive linter must parse the actual geometries and verify spatial relationships against declared constraints.

Geometry Validity Checks

Use shapely or geopandas to evaluate is_valid flags on every feature. Common violations include self-intersecting polygons, unclosed rings, and duplicate vertices. Implement automated repair routines for minor issues, but flag severe topological errors for manual review.

CRS Transformation Verification

When datasets declare a CRS, verify that coordinates fall within the expected bounding box for that projection. Reject files where coordinates exceed ±180° longitude in WGS84 or contain obvious meter/degree mismatches. The GDAL/OGR documentation provides authoritative guidance on projection handling and coordinate transformation pipelines that can be integrated into your validation scripts.

Geospatial assets frequently contain external references to style files, tile servers, or linked attribute tables. Broken references degrade map rendering and break downstream ETL processes. Implement Automated Broken Link and Reference Detection alongside schema validation to verify that all href, url, and path fields resolve correctly before merging.

OGC Format Compliance

For GeoPackage and Shapefile workflows, enforce OGC specification compliance. Validate that .gpkg files contain valid SQLite headers, proper spatial index tables, and consistent geometry type declarations across layers. Reject Shapefiles with missing .prj, .shx, or .dbf components, as incomplete archives cause silent failures in production environments.

Integrating Static Analysis with Spatial Validators

Spatial linting should not operate in isolation from code quality standards. Many geospatial pipelines rely on Python scripts for transformation, aggregation, and export. Combining spatial validation with static code analysis ensures that both data and processing logic meet organizational standards.

Use pylint or ruff to enforce type hints, import ordering, and security best practices in your validation scripts. When spatial metadata validators are executed as Python modules, static analysis catches undefined variables, deprecated API calls, and inefficient memory allocations before they impact CI performance. For implementation details, review Integrating PyLint with spatial metadata validators to configure rule sets that align with geospatial development patterns.

Unified Reporting Pipeline

Consolidate outputs from spatial linters, static analyzers, and metadata checkers into a single JSON or SARIF report. Map each finding to a severity level (error, warning, info) and attach line numbers or feature IDs for rapid triage. Publish reports as PR comments or CI artifacts to maintain an auditable trail of validation history.

Performance Optimization and CI Runner Tuning

Spatial validation can become a bottleneck when processing multi-gigabyte vector datasets or running on under-provisioned runners. Optimize your pipeline with the following strategies:

  • Incremental Validation: Only lint files that changed in the current commit. Use git diff --name-only to filter inputs and skip unchanged datasets.
  • Parallel Execution: Partition large GeoPackages or Shapefile directories and run validation across multiple CI jobs. Aggregate results using a final reporting step.
  • Memory Management: Stream features using fiona or pyogrio instead of loading entire datasets into memory. Implement chunked validation for files exceeding 500 MB.
  • Dependency Caching: Cache pip wheels, GDAL binaries, and PROJ data directories to reduce installation time from minutes to seconds.
  • Timeout Configuration: Set explicit job timeouts (e.g., timeout-minutes: 15) to prevent runaway processes from consuming CI quotas.

When tuning runner specifications, prioritize CPU cores over raw RAM for vector validation tasks. Geometry parsing and CRS transformations are compute-bound, and parallelizing validation across multiple cores yields better throughput than increasing memory alone.

Troubleshooting Common Validation Failures

Even well-configured pipelines encounter edge cases. Address these frequent failure modes proactively:

Failure Type Root Cause Resolution
InvalidGeometryError Self-intersecting polygons or unclosed rings Run shapely.make_valid() in a preprocessing step, or reject and request corrected source data
CRSMismatchException Coordinates outside declared projection bounds Verify source data export settings; enforce explicit CRS assignment during ingestion
SchemaValidationError Missing mandatory fields or incorrect data types Update schema to reflect actual data structure, or enforce strict typing in contributor guidelines
MemoryLimitExceeded Large file loaded entirely into RAM Switch to streaming parsers, implement chunked validation, or increase runner resources
DependencyConflict Incompatible GDAL/PROJ versions Pin exact package versions in requirements.txt and use containerized CI runners

Maintain a runbook that documents these scenarios and their resolutions. Share it with contributors to reduce support overhead and accelerate onboarding for new data engineers.

Conclusion

Spatial Data Schema Linting in CI transforms geospatial quality assurance from a reactive cleanup process into a proactive engineering discipline. By enforcing strict attribute dictionaries, validating geometry integrity, and verifying CRS declarations at merge time, teams eliminate downstream reconciliation costs and maintain consistent data standards across distributed repositories. When integrated with policy gates, static analysis, and reference validation, schema linting becomes a reliable foundation for scalable, production-ready geospatial pipelines.

Start with a minimal schema, automate local pre-commit checks, and gradually expand validation rules as your data ecosystem matures. Continuous refinement of linting thresholds, combined with clear contributor documentation, ensures that spatial assets remain accurate, compliant, and ready for analytical consumption.