Reality note (2026-06-12, F-1353 / D2-11). The `docs/operations/runbooks/cascade-window-drain.md` runbook referenced in the consequences/links below was deleted; the cascade-drain orchestrator was retired and per-source catch-up is now `stellarindex-ops projector-replay` (see `docs/operations/runbooks/projector-replay.md`). The coverage invariant itself stands.
- Status: Accepted (2026-05-28)
- Context wave: rc.87
- Supersedes: none
- Superseded by: none
Context
The F-0020 cascade-window incident (Postgres back-pressure halted the soroban_events writer across 103,396 contiguous ledgers, ~14 h of network time) went undetected for the entire outage because every existing coverage signal measured process state, not data state:
- The cursor-derived density projection in `/v1/diagnostics/ingestion` read 100% because the live cursor's `last_ledger` advanced past the gap window once back-pressure cleared.
- The per-source `*_total{outcome="ok"}` counters kept incrementing once the writer recovered; no metric captured the "what was missed during the halt" semantic.
- The smoke-script + per-binary heartbeat signals are alive/dead bits, not coverage measurements.
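The difference between a process-state signal and a data-state signal can be illustrated with a minimal sketch. It assumes, purely for illustration, that every ledger in the window should carry at least one row; the function names `cursorDensity` and `dataDensity` and the ten-ledger window are hypothetical and do not come from the codebase.

```go
package main

import "fmt"

// cursorDensity mimics a cursor-derived reading: it only reports how far the
// cursor has advanced through [first, last], not which ledgers hold data.
func cursorDensity(first, last, cursor uint32) float64 {
	if cursor >= last {
		return 100
	}
	if cursor < first {
		return 0
	}
	return 100 * float64(cursor-first+1) / float64(last-first+1)
}

// dataDensity counts the ledgers in [first, last] that are actually present.
func dataDensity(first, last uint32, present map[uint32]bool) float64 {
	var seen uint32
	for l := first; l <= last; l++ {
		if present[l] {
			seen++
		}
	}
	return 100 * float64(seen) / float64(last-first+1)
}

func main() {
	// Hypothetical window of 10 ledgers; ledgers 4..7 were never written
	// because the writer was halted, then the cursor moved on to ledger 10.
	present := map[uint32]bool{1: true, 2: true, 3: true, 8: true, 9: true, 10: true}

	fmt.Printf("cursor-derived: %.0f%%\n", cursorDensity(1, 10, 10)) // 100%
	fmt.Printf("data-derived:   %.0f%%\n", dataDensity(1, 10, present)) // 60%
}
```

Once the cursor has passed the window, the cursor-derived figure cannot drop below 100% regardless of what was written, which is the blind spot described above.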
The rc.85 / rc.86 series shipped the response: an honest cursor-derived density, a periodic data-derived gap detector against soroban_events, and a find-data-gaps operator CLI. Then rc.87 generalised the gap detector to every per-source hypertable (PR #2), added the orchestrator for cascade remediation (PR #1), and registered SDEX as the 14th target so the classic-DEX path has symmetric coverage (PR #3).
But the response itself created a new failure mode: a future engineer adding a per-source hypertable could forget to register the target. The system would silently regress to the pre-F-0020 state for that new source — no signal, no alert, no honest density.
Decision
Every per-source hypertable MUST be registered as a `GapDetectorTarget` in `internal/storage/timescale/per_source_gaps.go` within the same PR that creates it. The Go test `TestGapDetectorTargetsCoverAllPerSourceHypertables` makes this binding by introspecting `migrations/*.up.sql` for `CREATE TABLE <name>` statements matching the per-source naming pattern (`*_events|*_liquidity|*_positions|*_emissions|*_admin|*_transfers|*_swaps|*_stake_events|*_supply_events|*_auctions`) and failing CI if any unregistered table is found.
A table can be exempted only by adding it to `excludedFromGapDetector` with a documented prose reason. "Leftover from refactor" is not a valid reason; delete the entry or the table instead. Current exemptions: `freeze_events`, `mev_events`, `api_usage_events` (each a system-state table, not per-source ingest).
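The shape of such a lint guard can be sketched as follows. This is not the actual test: the helper `unregistered`, both regular expressions, the sample SQL, and the table `newdex_swaps` are assumptions made for illustration. Only the naming suffixes and the `freeze_events` exemption are taken from the text above.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// createTableRE captures the table name from a CREATE TABLE statement.
var createTableRE = regexp.MustCompile(
	`(?i)CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([a-z_][a-z0-9_]*)`)

// perSourceRE mirrors the per-source naming suffixes listed in the Decision.
var perSourceRE = regexp.MustCompile(
	`_(events|liquidity|positions|emissions|admin|transfers|swaps|stake_events|supply_events|auctions)$`)

// unregistered returns per-source tables found in the migration SQL that are
// neither registered as targets nor explicitly excluded.
func unregistered(sql string, registered, excluded map[string]bool) []string {
	seen := map[string]bool{}
	var missing []string
	for _, m := range createTableRE.FindAllStringSubmatch(sql, -1) {
		name := m[1]
		if seen[name] || !perSourceRE.MatchString(name) {
			continue
		}
		seen[name] = true
		if !registered[name] && !excluded[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical migration contents; a real guard would read migrations/*.up.sql.
	sql := `
CREATE TABLE blend_positions (ledger BIGINT NOT NULL);
CREATE TABLE IF NOT EXISTS freeze_events (ledger BIGINT NOT NULL);
CREATE TABLE newdex_swaps (ledger BIGINT NOT NULL);
CREATE TABLE api_keys (id BIGINT);
`
	registered := map[string]bool{"blend_positions": true}
	excluded := map[string]bool{"freeze_events": true}

	// A CI test would fail when this slice is non-empty.
	fmt.Println(unregistered(sql, registered, excluded)) // [newdex_swaps]
}
```

In a real test the returned slice would feed a `t.Errorf` call, so the failure message names each table that needs either a target registration or a documented exclusion.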
Three additional binding sub-decisions:
- Headline density MUST be data-derived, not cursor-derived. The `/v1/diagnostics/ingestion` handler's per-source `density_pct` already came from cursor coverage; rc.85 cleaned up the NULL→genesis fallback but the underlying source is still cursors. The next step (separate PR, scoped for rc.88+) is to surface the data-derived gap gauge values in the same response, so consumers (status page, dashboards) read the honest signal directly.
- Coverage labels are `{source, table}`, not just `{source}`. A single source may span multiple tables (Blend has three: `blend_positions` + `blend_emissions` + `blend_admin`). Per-table granularity is necessary for diagnostic depth; alerts aggregate via `max by (source)` to preserve paging-dedup behaviour.
- No table-name interpolation from user input. Identifier values fed to `FindPerSourceLedgerGaps` (`Table`, `LedgerColumn`, `WhereFilter`) MUST come from `DefaultGapDetectorTargets` (a compile-time const). The query string-interpolates them because Postgres doesn't support `$N` binding for identifiers; the safety comes from upstream provenance, not from the SQL builder. This is the same pattern as the existing schema migration runner.
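The provenance rule in the last sub-decision can be sketched like this. Everything named here is an assumption for illustration: the `target` struct stands in for `GapDetectorTarget`, `registry` stands in for `DefaultGapDetectorTargets`, the `trades` table and `ledger` column are invented, and the LAG-based query is one plausible gap scan rather than the query `FindPerSourceLedgerGaps` actually runs.

```go
package main

import (
	"fmt"
	"regexp"
)

// target is a hypothetical stand-in for GapDetectorTarget.
type target struct {
	Source, Table, LedgerColumn, WhereFilter string
}

// registry stands in for DefaultGapDetectorTargets; the entries are illustrative.
var registry = []target{
	{Source: "blend", Table: "blend_positions", LedgerColumn: "ledger"},
	{Source: "sdex", Table: "trades", LedgerColumn: "ledger", WhereFilter: "source = 'sdex'"},
}

// identRE is a defence-in-depth check; the primary guarantee is provenance.
var identRE = regexp.MustCompile(`^[a-z_][a-z0-9_]*$`)

// gapQuery builds the scan only from a registry entry. The caller's strings
// select an entry but are never interpolated into the SQL themselves.
func gapQuery(source, table string) (string, error) {
	for _, t := range registry {
		if t.Source != source || t.Table != table {
			continue
		}
		if !identRE.MatchString(t.Table) || !identRE.MatchString(t.LedgerColumn) {
			return "", fmt.Errorf("invalid identifier in target %q", t.Table)
		}
		where := ""
		if t.WhereFilter != "" {
			where = " WHERE " + t.WhereFilter
		}
		return fmt.Sprintf(
			"SELECT prev + 1 AS gap_start, cur - 1 AS gap_end FROM ("+
				"SELECT %[1]s AS cur, LAG(%[1]s) OVER (ORDER BY %[1]s) AS prev "+
				"FROM (SELECT DISTINCT %[1]s FROM %[2]s%[3]s) d) g "+
				"WHERE cur - prev > 1",
			t.LedgerColumn, t.Table, where), nil
	}
	return "", fmt.Errorf("%s/%s is not a registered gap-detector target", source, table)
}

func main() {
	q, err := gapQuery("blend", "blend_positions")
	fmt.Println(q, err)

	// A string that is not in the registry is rejected before any SQL is built.
	_, err = gapQuery("blend", "blend_positions; DROP TABLE cursors")
	fmt.Println(err)
}
```

The design choice worth noting is that the lookup key and the interpolated value are different strings: even a caller who controls the lookup key can only ever select text that was written into the registry at compile time.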
Consequences
Positive:
- A future cascade in any per-source table fires the same paging alert within ~45-60 min of formation. No new alerts to add per source.
- The lint guard makes adding a new Soroban DeFi source a one-PR operation: migration + decoder + target registration + projector source registration, all in one mental unit. CI enforces the discipline.
- The exclusion list documents *why* a table is exempt; future engineers don't have to guess whether `freeze_events` was forgotten or deliberate.
Negative:
- Adding a per-source hypertable now requires touching `per_source_gaps.go` even if the engineer doesn't otherwise care about gap detection. This is the intended friction.
- The per-cycle scan time grows linearly with target count; 13 targets × 30 s ≈ 6.5 min today, which fits inside the 30-min cadence with headroom. Beyond ~30 targets (or if the per-target scans grow), revisit the `soroban_event_ledgers` materialised-view optimisation noted in `gap_detector.go`.
- The `WhereFilter` field on `GapDetectorTarget` is an attack surface if a future engineer puts user-controlled data into it. The godoc + ADR call this out, but the linter doesn't enforce it. Mitigation: keep the field's use to the existing single instance (`sdex` filtering by `source = 'sdex'`) unless a new use case justifies the audit.
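The scan-budget arithmetic in the second point can be worked through with a small sketch. It assumes targets are scanned one after another at a fixed 30 s each, which the text implies but does not state; `cycleSeconds` is a hypothetical helper.

```go
package main

import "fmt"

// cycleSeconds assumes sequential scans at a fixed per-target cost.
func cycleSeconds(targets, perTargetSeconds int) int {
	return targets * perTargetSeconds
}

func main() {
	const cadenceSeconds = 30 * 60 // the 30-min worker cadence

	for _, n := range []int{13, 30, 60} {
		s := cycleSeconds(n, 30)
		fmt.Printf("%2d targets: %.1f min per cycle, %.0f%% of the cadence\n",
			n, float64(s)/60, 100*float64(s)/cadenceSeconds)
	}
}
```

Under these assumptions the cycle consumes half of the cadence at 30 targets and all of it at 60, which is consistent with treating ~30 targets as the point to revisit the optimisation.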
Alternatives considered
- Per-source data-derived density in the diagnostic handler instead of a separate gap-detector worker. Would consolidate the signal but couples the API handler to the (slow) LAG scan. Rejected: the worker pattern lets us bound the scan to a 30-min cadence, while a handler call would either block requests or require its own cache layer.
- Single `{source}` label, sum across multiple tables in the worker. Simpler metric set, but loses the per-table diagnostic detail. Rejected: when the F-0020-class incident hits *one of Blend's three tables* but not the others, the source-aggregate gauge would still light up — but the operator would have no signal for *which* table to drain. Per-table labels remove a layer of guessing.
- Manual target list, no lint guard. Easier to ship, but exactly the failure mode F-0020 inflicted: a new table can drift into existence without coverage. Rejected: the whole point of this ADR is to make the discipline binding, not optional.
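The label trade-off weighed in the second alternative can be sketched as follows: per-table values are kept for diagnosis, and a per-source maximum is derived for paging. The `key` type, `maxBySource` helper, and the gap counts are hypothetical; the real aggregation is expressed as `max by (source)` in the alert rules rather than in Go.

```go
package main

import "fmt"

// key mirrors the {source, table} label pair.
type key struct{ Source, Table string }

// maxBySource collapses per-table gap counts to one value per source, the
// same shape a `max by (source)` alert expression would page on.
func maxBySource(gaps map[key]int) map[string]int {
	out := make(map[string]int)
	for k, v := range gaps {
		if cur, ok := out[k.Source]; !ok || v > cur {
			out[k.Source] = v
		}
	}
	return out
}

func main() {
	// Hypothetical gauge values: only one of Blend's three tables has gaps.
	gaps := map[key]int{
		{"blend", "blend_positions"}: 0,
		{"blend", "blend_emissions"}: 3,
		{"blend", "blend_admin"}:     0,
	}

	// One page per source...
	fmt.Println("blend max:", maxBySource(gaps)["blend"]) // blend max: 3

	// ...while the per-table values still say which table to look at.
	fmt.Println("blend_emissions:", gaps[key{"blend", "blend_emissions"}]) // blend_emissions: 3
}
```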
Reference
- `internal/storage/timescale/per_source_gaps.go`: target registry + scan function
- `internal/storage/timescale/gap_detector.go`: periodic worker
- `internal/storage/timescale/gap_targets_test.go`: lint guard
- `docs/operations/runbooks/ingest-gap-detected.md`: operator response
- `docs/operations/runbooks/cascade-window-drain.md`: orchestrator runbook
- `docs/operations/runbooks/sdex-gap-detected.md`: SDEX-specific surface
- F-0020 (audit-2026-05-26): motivating incident