From ArcGIS REST to BigQuery: Automating Geneva SITG Layers for Building Analytics

Luca Zosso | Mar 20, 2025 min read

Municipal and cantonal open geodata is powerful and noisy. REST feature services expose dozens of layers, attribute names drift, and row counts sit in the hundreds of thousands. Analysts who need one clean building-level table for dashboards do not want to click through ArcGIS every week. They want a repeatable pipeline: extract, land, load, model, expose.

This repository is the automation I built around SITG (Geneva’s spatial data infrastructure), pulling from vector.sitg.ge.ch ArcGIS endpoints into Google Cloud Storage and BigQuery, then materializing a wide analytical table used downstream for reporting. Today, a client still runs this pipeline to refresh their reporting base: less manual export work, fewer copy-paste errors, and more energy left for interpretation instead of file wrangling.

This article is written for other data engineers and hiring managers who care about idempotent-ish loads, schema discipline, incremental operational maturity (scheduling, secrets, observability hooks), and honest tradeoffs.


Prerequisites

  • Python 3.8+ (the bundled GitHub Action uses 3.8; align locally or bump the workflow in lockstep).
  • Google Cloud project with BigQuery and Cloud Storage, plus a service account secret wired in CI.
  • Comfort with ArcGIS REST query parameters (where, outFields, resultOffset, resultRecordCount).
  • Basic SQL (CTEs, LEFT JOIN, window functions).

The problem: many layers, one reporting grain

Building-centric reporting in the Geneva context often needs EGID (building identifier), EGRID (parcel), addresses, boiler attributes, energy indices, green roof flags, and more. Those attributes live in different hosted layers with different cardinalities: one row per boiler, one row per index history, one row per geometry variant.

The engineering task is twofold:

  1. Ingest each source layer reliably (pagination, retries at the call site, field projection).
  2. Resolve many-to-one relationships into a single grain (one row per building joined to layer extracts with DISTINCT, GROUP BY, and deduplication windows where history exists).

Ingestion layer: controlled fields and paged REST reads

etl_processing_v2.py maps each layer id to an ArcGIS FeatureServer URL and to an explicit FIELDS_PER_LAYER allow list. That is a deliberate data contract: you only pay to move and store columns your warehouse actually uses, and you reduce surprise type explosions in BigQuery autodetect.

Fetching uses a count-first query, then a while loop with resultOffset and a fixed resultRecordCount, updating a tqdm bar. That pattern is the standard answer when ArcGIS caps page size.

# etl_processing_v2.py
with tqdm(total=total_count, desc=f"Fetching records for layer {layer_id}", unit="record") as pbar:
    while result_offset < total_count:
        params['resultOffset'] = result_offset
        response = requests.get(query_url, params=params)
        response.raise_for_status()
        features = response.json().get("features", [])
        if features:
            all_features.extend(features)
            pbar.update(len(features))
            result_offset += record_count_limit
            if len(features) < record_count_limit:
                break

Each layer is written as newline-delimited JSON to a deterministic object name ({layer_id}.jsonl), uploaded to GCS, then loaded into BigQuery with NEWLINE_DELIMITED_JSON, autodetect=True, and WRITE_TRUNCATE so each run replaces the staging table for that layer. process_all_layers wraps each layer in try / except so one bad endpoint does not kill the whole batch.


When one layer needs special treatment

2177_layer_process.py targets the SCANE energy index style service with a larger page size, optional discovery of layer fields via the .../0?f=json metadata call, and a pandas path to to_gbq for that table. Splitting “default bulk ETL” from “heavy or wide layer” scripts is a normal operational choice: you can schedule them on different cadences or give the slow layer more memory without over-provisioning every job.


Modeling layer: CTEs, dedup windows, and a versioned handoff

hestera_db_v2.py is where the work looks like analytics engineering. Before overwriting the main curated table, it exports the current BigQuery table to GCS as a dated snapshot (manage_previous_version_in_gcs), so you keep a rollback story if a join goes wrong.

# hestera_db_v2.py
extract_job = client.extract_table(
    bq_table_ref,
    destination_uri,
    location="US"
)
extract_job.result()

The core is a long SQL string: a base CTE from a manual upload table, then one CTE per SITG-derived layer (often SELECT DISTINCT or GROUP BY to collapse duplicates), then a stack of LEFT JOINs onto EGID / EGRID keys. Where a layer carries history (for example energy indices with DATE_SAISIE), the SQL uses ROW_NUMBER() OVER (PARTITION BY EGID ORDER BY DATE_SAISIE DESC) and keeps rn = 1. That is textbook slowly-changing-ish source handling without introducing a full SCD2 framework on day one.

-- excerpt from hestera_db_v2.py (string literal, shortened)
, layer_2177 AS (
    SELECT DISTINCT
        EGID,
        SRE,
        INDICE_MOY3,
        DATE_SAISIE
    FROM (
        SELECT DISTINCT
            EGID,
            SRE,
            INDICE_MOY3,
            DATE_SAISIE,
            ROW_NUMBER() OVER (PARTITION BY EGID ORDER BY DATE_SAISIE DESC) AS rn
        FROM `hestera-database.sitg_gcs_bucket.2177`
        WHERE EGID IS NOT NULL
    ) t
    WHERE rn = 1
)

The result is loaded back with load_table_from_dataframe and WRITE_TRUNCATE, which keeps the contract simple for BI tools: one table name, always the latest build.


Automation and observability hooks

GitHub Actions installs requirements.txt and runs python etl_processing_v2.py on workflow_dispatch, injecting H_SERVICEACCOUNT_CREDENTIALS from a secret. That matches how many small teams run GCP batch jobs: no long-lived key on laptops, one runner, explicit human trigger until cron is trusted.

monitoring.py exposes a small psutil decorator that prints CPU time, memory delta, and elapsed wall time around a function. It is not Prometheus, but it is the right seed for later hardening when a layer doubles in size.


Conclusion

This pipeline is data engineering in the small that pays off at scale for a real client: paged extraction, explicit schemas, GCS as a landing zone, BigQuery as the analytical warehouse, SQL that respects keys and history, and GitHub Actions for repeatable execution. My near-term improvements would be README completion, dataset location alignment for extract jobs, and replacing long-lived base64 keys with Workload Identity Federation where the runtime allows it.