Methodology

v1.0 · last updated 2026-05-23

Substation Spain exists for one reason: to give institutional buyers — infra funds, hyperscaler corp-dev, regulators, analysts — a feed of Spanish data center regulatory activity that survives a one-hour call with a Stonepeak MD asking to trace 5 records to their primary source.

Every design decision below stems from that single requirement.

1. Source whitelist: 15 official sources in 4 layers

We pull only from primary government sources. Coverage is built layer by layer to match how the Spanish regulatory pipeline actually works (national → regional → provincial → municipal), plus the procurement channel that surfaces data center (DC) construction works before the environmental impact assessment (EIA) reaches the official gazette.

a. National layer (3 sources)

BOE: Boletín Oficial del Estado (JSON summary API)
BORME: Registro Mercantil (SL incorporations, capital increases, hyperscaler changes of corporate purpose; early-warning corporate activity 3-12 months before the EIA)
PCSP: Plataforma de Contratación del Sector Público (Atom syndication; filtered by 24 CPV codes for substation/HV-MV/HVAC, with a DC keyword fallback). Anticipates DC construction works before the EIA.

b. Regional layer (6 sources across 5 autonomous communities)

BOA: Aragón (JSONAPP endpoint; AWS Region Aragón, hyperscaler hub)
BOJA: Andalucía (REST/JSON Elasticsearch)
BOCM: Madrid (HTML summary + items, PDF fallback)
DOGC: Cataluña legislation (Socrata open-data dataset)
DOGC HTML: Cataluña announcements + agreements (eadop REST POST; catches Acord GOV/64/2025 and OGAU environmental entries)
DOCM: Castilla-La Mancha (HTML summary + body fetch; Meta Talavera campus)

c. Provincial layer (1 source: Barcelona aggregator)

BOPB: Boletín Oficial Provincia Barcelona. A single source covers the 311 municipalities of Barcelona province (including Cerdanyola, Sant Adrià, Hospitalet, Mataró, Cornellà, Móra la Nova, etc.). Typical lag is 2-4 weeks versus direct municipal publication, but coverage is broad and keyword-searchable.

d. Municipal layer (17 active town halls on 5 platforms)

Gestiona / esPublico: Villanueva de Gállego (AWS), Algete, Móstoles, Alcalá de Henares, Las Rozas, Getafe · template extensible to ~3,000 Spanish municipalities
SEDIPUALBA: Sagunto (Stargate Valencia), Almansa, Hellín, Petrer, Paterna
Drupal (PDF feed): Alcobendas (Equinix MD3x cluster)
Liferay snowflake: San Sebastián de los Reyes (Equinix expansion)
SEDIC + STA snowflakes: Coslada (Iron Mountain / NTT), Sant Cugat del Vallès (BCN tech corridor), Tarragona (Zona Franca corridor)

Technical backlog (no active coverage today): e-TAULER, 10 Catalan municipalities (React SPA; superseded by the provincial BOPB); ABSIS, 4 municipalities (Talavera/Meta, Toledo, Tres Cantos, Huesca; pending AJAX POST emulation); DOGV, Comunidad Valenciana (iframe Angular SPA; pending a headless browser or a pivot to bop.dival.es). The scraper code is in the repo; the daily cron is disabled until the technical blocker is resolved.

Each source is a Source plugin under scripts/. Adding a new municipality on a templated platform takes 1 line of code.
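
As a rough illustration of the plugin idea, the sketch below shows how a templated platform could reduce a new municipality to a single registry line. The class name GestionaSource, the slug field, and the URL pattern are all hypothetical; the real plugin interface under scripts/ is not described in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GestionaSource:
    """Hypothetical template plugin for town halls sharing one platform."""
    municipality: str
    slug: str

    @property
    def name(self) -> str:
        return f"gestiona:{self.slug}"

    def listing_url(self) -> str:
        # Placeholder host: the real platform URLs are not given here.
        return f"https://{self.slug}.example.invalid/board"


# Registry of active sources. Covering one more town hall on the same
# platform means appending one line to this list.
SOURCES = [
    GestionaSource("Villanueva de Gállego", "villanuevadegallego"),
    GestionaSource("Algete", "algete"),
    GestionaSource("Móstoles", "mostoles"),
]

for source in SOURCES:
    print(source.name, source.listing_url())
```

Platforms described as snowflakes would each need their own class rather than a registry entry, which is presumably why they are listed separately.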

We never use the following as sources:

Trade press (DCD, Capacity Media, Cinco Días, Expansión, El Economista)
Operator press releases or investor materials
LinkedIn, Twitter/X, or any social media
Web search results in general

Trade press is useful for cross-checking our work, never for sourcing it. The complete whitelist with audit trail lives in docs/KICKOFF.md §5.1.

2. Provenance metadata per field

Every published value carries four pieces of metadata:

source_url — the primary source URL where we read the value
extracted_at — ISO 8601 timestamp of extraction
raw_text — 50–200 char snippet of the source containing the value
confidence — float 0.0–1.0 score

Missing any one of the four → the field is invalid → the record is not published. Enforced at the database layer.
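
One way to enforce this at the database layer is with NOT NULL and CHECK constraints. The following is a minimal sketch using SQLite; the table name field_value, the column layout, and the helper try_insert are assumptions made for illustration, not the production schema.

```python
import sqlite3

COLUMNS = ("record_id", "field_name", "value",
           "source_url", "extracted_at", "raw_text", "confidence")

# Hypothetical schema: every provenance column is mandatory, and the
# snippet length and confidence range are checked by the database itself.
SCHEMA = """
CREATE TABLE field_value (
    record_id    TEXT NOT NULL,
    field_name   TEXT NOT NULL,
    value        TEXT NOT NULL,
    source_url   TEXT NOT NULL,
    extracted_at TEXT NOT NULL,
    raw_text     TEXT NOT NULL CHECK (length(raw_text) BETWEEN 50 AND 200),
    confidence   REAL NOT NULL CHECK (confidence BETWEEN 0.0 AND 1.0),
    PRIMARY KEY (record_id, field_name)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, row: dict) -> bool:
    """Insert one field; return False if any provenance rule is violated."""
    # Missing keys become NULL so the NOT NULL constraints reject them.
    params = {col: row.get(col) for col in COLUMNS}
    try:
        with conn:
            conn.execute(
                "INSERT INTO field_value VALUES (:record_id, :field_name, :value, "
                ":source_url, :extracted_at, :raw_text, :confidence)",
                params,
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With constraints like these, a row lacking raw_text or carrying a confidence of 1.5 is rejected before it can reach any public view.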

3. NOT_FOUND is the default

If a DIA (Declaración de Impacto Ambiental) does not literally state a MW figure within a DC-context window (a regex match within ±100 characters of a DC keyword), we publish NOT_FOUND. We do not:

Estimate MW from square metres
Infer PUE from operator brand
Cross-reference trade press to fill gaps
Use "industry typicals"

NOT_FOUND is a feature, not a bug. Transparency is the moat.
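
The context-window rule can be sketched as follows. The keyword list and the MW pattern are illustrative assumptions, not the production regexes; only the ±100 character window and the NOT_FOUND default come from the rule above.

```python
import re

# Assumed DC keywords (Spanish and English); the real list may differ.
DC_KEYWORDS = re.compile(
    r"\b(?:centros? de (?:proceso de )?datos|data ?cent(?:er|re)s?|CPD)\b",
    re.IGNORECASE,
)
# A literal figure such as "300 MW" or "1.500,5 MW"; "MWh" does not match.
MW_FIGURE = re.compile(r"\d+(?:[.,]\d+)*\s*MW\b")
WINDOW = 100
NOT_FOUND = "NOT_FOUND"


def extract_mw(text: str) -> str:
    """Return the literal MW figure near a DC keyword, else NOT_FOUND."""
    keyword_spans = [m.span() for m in DC_KEYWORDS.finditer(text)]
    for figure in MW_FIGURE.finditer(text):
        for kw_start, kw_end in keyword_spans:
            # Characters separating the figure from the keyword (0 if adjacent).
            gap = max(kw_start - figure.end(), figure.start() - kw_end, 0)
            if gap <= WINDOW:
                return figure.group(0)
    return NOT_FOUND


print(extract_mw("Proyecto de centro de datos con una potencia de 300 MW."))
print(extract_mw("Planta fotovoltaica de 50 MW."))
```

The function returns the figure exactly as written in the source rather than parsing it into a number, which keeps the published value traceable to the raw text.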

4. Three public confidence tiers

Records are tagged with exactly one of three tiers:

VERIFIED (confidence ≥ 0.95): direct field read from primary source, two-pass verification passed
CROSS-REFERENCED (0.70 ≤ confidence < 0.95): joined from two whitelisted sources
NOT_FOUND: no whitelisted source contains the value

There is no ESTIMATED tier. The invariant is tested programmatically — see tests/test_smoke_sprint1.py check "No ESTIMATED tier exists".
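
A closed enumeration is one simple way to make the "no ESTIMATED tier" invariant checkable. The sketch below is not the code in tests/test_smoke_sprint1.py. It assumes the cross-referenced band is half-open (0.70 up to, but not including, 0.95) and that anything falling outside both bands is published as NOT_FOUND; neither assumption is stated in this document.

```python
from enum import Enum
from typing import Optional


class Tier(str, Enum):
    VERIFIED = "VERIFIED"
    CROSS_REFERENCED = "CROSS-REFERENCED"
    NOT_FOUND = "NOT_FOUND"


def assign_tier(confidence: Optional[float], two_pass_ok: bool = False) -> Tier:
    """Map a confidence score to exactly one public tier."""
    if confidence is None:
        return Tier.NOT_FOUND
    if confidence >= 0.95 and two_pass_ok:
        return Tier.VERIFIED
    if 0.70 <= confidence < 0.95:
        return Tier.CROSS_REFERENCED
    # Assumption: everything else is treated as NOT_FOUND.
    return Tier.NOT_FOUND


# The invariant itself: exactly three tiers, none of them ESTIMATED.
assert {t.value for t in Tier} == {"VERIFIED", "CROSS-REFERENCED", "NOT_FOUND"}
```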

5. LLM use restricted to three operations

We use Claude only for:

MW extraction from DIA description text (only if regex matches in DC context)
Fuzzy matching of company names between BOE/Catastro/BORME (Jaro-Winkler ≥ 0.90)
Translating natural-language alert rules into deterministic SQL queries

The LLM never:

Generates MW values when source is silent
Generates PUE / cooling / water values
Fills gaps with "industry typicals"
Writes analyst commentary
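
The Jaro-Winkler threshold used for company-name matching can be illustrated without any LLM. The sketch below is a plain implementation of the standard metric with the usual prefix scale of 0.1; the normalize helper and the same_company wrapper are hypothetical, and how the production pipeline combines this score with Claude is not described here.

```python
import re


def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, capped at 4 characters
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def normalize(name: str) -> str:
    """Hypothetical cleanup: uppercase, drop punctuation, collapse spaces."""
    name = re.sub(r"[^\w\s]", "", name.upper())
    return re.sub(r"\s+", " ", name).strip()


def same_company(a: str, b: str, threshold: float = 0.90) -> bool:
    return jaro_winkler(normalize(a), normalize(b)) >= threshold
```

Normalising before scoring means that punctuation-only differences in legal suffixes (for example "S.L." versus "SL") do not count against the score.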

6. Two-pass verification + spot check

Before any record is marked VERIFIED in the database:

The source URL is re-fetched (cache bypassed)
The same extractor pipeline runs again
Field values are compared bit-by-bit against the first pass
Any mismatch → VERIFICATION_FAILED, record excluded from public read

Before any release, 10 random records are manually opened and checked against the live source. Failure → halt + investigate root cause.
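
The two-pass check and the pre-release sample can be sketched as below. The fetch and extract callables stand in for the real pipeline and are assumptions, as is the choice to compare a canonical JSON serialisation of the extracted fields byte for byte.

```python
import json
import random
from typing import Callable, Iterable, List


def canonical_bytes(fields: dict) -> bytes:
    """Serialise fields deterministically so two passes can be compared exactly."""
    return json.dumps(fields, sort_keys=True, ensure_ascii=False).encode("utf-8")


def two_pass_verify(
    url: str,
    first_pass: dict,
    fetch: Callable[[str], bytes],     # must bypass any cache
    extract: Callable[[bytes], dict],  # same extractor as the first pass
) -> str:
    second_pass = extract(fetch(url))
    if canonical_bytes(second_pass) == canonical_bytes(first_pass):
        return "VERIFIED"
    return "VERIFICATION_FAILED"


def spot_check_sample(record_ids: Iterable[str], k: int = 10, seed=None) -> List[str]:
    """Pick up to k distinct records for manual comparison with the live source."""
    ids = list(record_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))
```

Passing a seed makes the sample reproducible, which would allow the same ten records to be re-drawn later when reviewing a release.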

7. Public audit log

Every scrape run, verification pass, spot check and correction is logged to a public table. Daily snapshots are committed to github.com/beltransimo/substation-spain/audit. Every change is reproducible.

8. Corrections workflow

Found an error? Open a correction issue on GitHub. Every correction is cited in the next version's audit log. The friction of corrections is intentional — it demonstrates discipline.

Questions? contact@substation.es