Methodology
v1.0 · last updated 2026-05-23
Substation Spain exists for one reason: to give institutional buyers (infra funds, hyperscaler corp-dev, regulators, analysts) a feed of Spanish data center regulatory activity that survives a one-hour call in which a Stonepeak MD asks to trace five records back to their primary sources.
Every design decision below stems from that single requirement.
1. Source whitelist
We pull only from these primary sources:
- BOE — Boletín Oficial del Estado (national, JSON sumario API)
- BOJA — Boletín Oficial de la Junta de Andalucía (REST/JSON Elasticsearch)
- BOA — Boletín Oficial de Aragón (JSONAPP endpoint)
- BOCM — Boletín Oficial de la Comunidad de Madrid (PDF sumario + items)
- MITECO participación pública (cross-reference only — does not contain project-level EIAs)
We never use as source:
- Trade press (DCD, Capacity Media, Cinco Días, Expansión, El Economista)
- Operator press releases or investor materials
- LinkedIn, Twitter/X, or any social media
- Web search results in general
Trade press is useful for cross-checking our work, never for sourcing it. The complete whitelist with audit trail lives in docs/KICKOFF.md §5.1.
2. Provenance metadata per field
Every published value carries four pieces of metadata:
- source_url: the primary source URL where we read the value
- extracted_at: ISO 8601 timestamp of extraction
- raw_text: 50–200 char snippet of the source containing the value
- confidence: float score from 0.0 to 1.0
Missing any one of the four → the field is invalid → the record is not published. Enforced at the database layer.
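As an illustration of the four-field rule above, here is a minimal Python sketch of the check. It is not the database constraint itself, which this page does not show; the names FieldProvenance and provenance_errors are hypothetical, and the URL check is an assumed minimum rather than a documented rule.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class FieldProvenance:
    """Hypothetical container for the four metadata items of one published value."""
    source_url: Optional[str]
    extracted_at: Optional[str]
    raw_text: Optional[str]
    confidence: Optional[float]


def provenance_errors(p: FieldProvenance) -> List[str]:
    """Return one message per missing or malformed item; an empty list means valid."""
    errors = []
    # Assumption: a usable source_url is at least an HTTP(S) URL.
    if not p.source_url or not p.source_url.startswith(("http://", "https://")):
        errors.append("source_url missing or not an HTTP(S) URL")
    try:
        datetime.fromisoformat(p.extracted_at or "")
    except ValueError:
        errors.append("extracted_at missing or not ISO 8601")
    if p.raw_text is None or not 50 <= len(p.raw_text) <= 200:
        errors.append("raw_text missing or outside 50-200 chars")
    if p.confidence is None or not 0.0 <= p.confidence <= 1.0:
        errors.append("confidence missing or outside 0.0-1.0")
    return errors
```

A record would be publishable only if provenance_errors returns an empty list for every one of its fields.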
3. NOT_FOUND is the default
If a DIA (Declaración de Impacto Ambiental) does not literally state a MW figure within a DC-context window (a regex match within ±100 characters of a DC keyword), we publish NOT_FOUND.
We do not:
- Estimate MW from square metres
- Infer PUE from operator brand
- Cross-reference trade press to fill gaps
- Use "industry typicals"
NOT_FOUND is a feature, not a bug. Transparency is the moat.
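The DC-context window rule could be sketched as follows. The keyword list and the MW pattern are assumptions made for illustration, since the production patterns are not listed on this page. The function returns the figure exactly as written in the source, without parsing or rounding it.

```python
import re

# Assumed keyword list; the real list is not given in this document.
DC_KEYWORD = re.compile(
    r"centro de datos|centro de proceso de datos|data center|datacenter",
    re.IGNORECASE,
)
# A number (with optional . or , separators) followed by "MW" as a whole unit,
# so "MWh" does not match. The lookbehind stops a match starting mid-number.
MW_FIGURE = re.compile(r"(?<![\d.,])(\d+(?:[.,]\d+)*)\s*MW\b")
WINDOW = 100  # characters on each side of the keyword
NOT_FOUND = "NOT_FOUND"


def extract_mw(text: str) -> str:
    """Return the literal MW figure if it sits inside a DC-context window."""
    keyword_spans = [m.span() for m in DC_KEYWORD.finditer(text)]
    for figure in MW_FIGURE.finditer(text):
        for kw_start, kw_end in keyword_spans:
            # The whole "<number> MW" match must lie within the window.
            if figure.start() >= kw_start - WINDOW and figure.end() <= kw_end + WINDOW:
                return figure.group(1)
    return NOT_FOUND
```

In this sketch, a MW figure that appears without a nearby DC keyword, or more than 100 characters away from one, yields NOT_FOUND rather than a guess.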
4. Three public confidence tiers
Records are tagged with exactly one of three tiers:
- VERIFIED (confidence ≥ 0.95) — direct field read from primary source, two-pass verification passed
- CROSS-REFERENCED (0.70–0.94) — joined from two whitelisted sources
- NOT_FOUND — no whitelisted source contains the value
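A minimal sketch of how a confidence score might map onto the three tiers above. Two things here are assumptions, not documented behaviour: scores between 0.94 and 0.95 are treated as CROSS-REFERENCED, and scores below 0.70 are treated as NOT_FOUND. The real tiering also depends on how the value was obtained (direct read versus join), which this sketch leaves out.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    """The three public tiers; the enum deliberately has no other members."""
    VERIFIED = "VERIFIED"
    CROSS_REFERENCED = "CROSS-REFERENCED"
    NOT_FOUND = "NOT_FOUND"


def tier_for(confidence: Optional[float]) -> Tier:
    """Map a confidence score (or None when no value was found) to a tier."""
    if confidence is None:
        return Tier.NOT_FOUND
    if confidence >= 0.95:
        return Tier.VERIFIED
    if confidence >= 0.70:
        # Assumption: the 0.94-0.95 gap belongs to this tier.
        return Tier.CROSS_REFERENCED
    # Assumption: anything below 0.70 is not published as a value.
    return Tier.NOT_FOUND
```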
There is no ESTIMATED tier. The invariant is tested programmatically: see the check "No ESTIMATED tier exists" in tests/test_smoke_sprint1.py.
5. LLM use restricted to three operations
We use Claude only for:
- MW extraction from DIA description text (only if regex matches in DC context)
- Fuzzy matching of company names between BOE/Catastro/BORME (Jaro-Winkler ≥ 0.90)
- Translating natural-language alert rules into deterministic SQL queries
The LLM never:
- Generates MW values when source is silent
- Generates PUE / cooling / water values
- Fills gaps with "industry typicals"
- Writes analyst commentary
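For readers unfamiliar with the Jaro-Winkler threshold mentioned in the company-name matching step, here is a self-contained sketch of the standard similarity measure and the 0.90 cut-off. How the score and the LLM step are combined is not specified on this page; the function is_candidate_match and the casefold normalisation are assumptions for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    match_dist = max(max(len1, len2) // 2 - 1, 0)
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - match_dist)
        hi = min(len2, i + match_dist + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def is_candidate_match(name_a: str, name_b: str, threshold: float = 0.90) -> bool:
    """Hypothetical gate: compare case-folded names against the 0.90 threshold."""
    return jaro_winkler(name_a.casefold(), name_b.casefold()) >= threshold
```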
6. Two-pass verification + spot check
Before any record is marked VERIFIED in the database:
- The source URL is re-fetched (cache bypassed)
- The same extractor pipeline runs again
- Field values are compared bit-by-bit against the first pass
- Any mismatch → VERIFICATION_FAILED, record excluded from public read
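The second pass can be sketched as below. The fetch_fresh and extract callables are hypothetical stand-ins for the real cache-bypassing fetcher and extractor pipeline, which are not shown on this page, and field values are assumed to be strings.

```python
from typing import Callable, Dict


def second_pass_matches(
    source_url: str,
    first_pass: Dict[str, str],
    fetch_fresh: Callable[[str], str],
    extract: Callable[[str], Dict[str, str]],
) -> bool:
    """Re-fetch, re-extract, and compare every field with the first pass."""
    fresh_document = fetch_fresh(source_url)  # hypothetical: must bypass any cache
    second_pass = extract(fresh_document)     # same extractor as the first pass
    if second_pass.keys() != first_pass.keys():
        return False
    # Compare encoded bytes so that no normalisation can hide a difference.
    return all(
        second_pass[name].encode("utf-8") == first_pass[name].encode("utf-8")
        for name in first_pass
    )
```

A False result would correspond to marking the record VERIFICATION_FAILED and keeping it out of the public read path.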
Before any release, 10 random records are manually opened and checked against the live source. Any failure halts the release until the root cause has been investigated.
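Drawing the spot-check sample might look like the sketch below. Using an explicit seed so the draw can be reproduced later is an assumption, not a documented practice, and pick_spot_check is a hypothetical name.

```python
import random
from typing import List, Sequence

SPOT_CHECK_SIZE = 10  # records opened by hand before each release


def pick_spot_check(record_ids: Sequence[str], seed: int) -> List[str]:
    """Draw up to SPOT_CHECK_SIZE distinct record ids, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(SPOT_CHECK_SIZE, len(record_ids))
    return rng.sample(list(record_ids), k)
```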
7. Public audit log
Every scrape run, verification pass, spot check and correction is logged to a public table. Daily snapshots are committed to github.com/beltransimo/substation-spain/audit. Every change is reproducible.
8. Corrections workflow
Found an error? Open a correction issue on GitHub. Every correction is cited in the next version's audit log. The friction of corrections is intentional — it demonstrates discipline.
Questions? contact@substation.es