Amendment (2026-06-12, F-1353 / D2-07 + D2-08). Two enumerations below have since been superseded:
- The Redis "cluster (3 masters + 3 replicas + Sentinel)" hot tier was decided as Sentinel, not Cluster; see ADR-0024.
- The stellar-core / stellar-rpc watcher services listed in the storage/replication tiers were removed from production on 2026-04-23 (invariant 6); production ingest is Galexie → dispatcher → decoders, and stellar-core survives only as Galexie's captive-core subprocess. See docs/architecture/ingest-pipeline.md.

The decision below is preserved as the original record.
Context
The availability SLA requires ≥ 99.99 % uptime (coverage-matrix S9.1). That allows roughly 52 min/year of downtime, which is less than a single cold start of stellar-core's catchup-recent would cost. The HA target therefore forces per-component redundancy *and* a graceful-degradation contract that defines what the API serves when individual planes fail.
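The downtime figure follows directly from the SLA percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative and not taken from the codebase):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(sla_percent: float) -> float:
    """Return the yearly downtime allowance, in minutes, for an uptime SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


budget = downtime_budget_minutes(99.99)
print(f"{budget:.1f} min/year")  # 52.6 min/year
```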
`docs/architecture/ha-plan.md` captures the full design (558 lines — physical topology, per-component HA, capacity math, failure matrix, degradation modes, backup/restore). That doc is comprehensive but was tagged status: draft pending Week-2 design review. This ADR ratifies the load-bearing decisions from it as the binding commitment for Phase 6 (Weeks 8–9) infrastructure work.
Cross-references:
- ADR-0001 — Horizon out of architecture (constrains ingest path).
- ADR-0002 — S3-compatible storage (constrains MinIO + DR target).
- ADR-0004 — Tier-1 three-validator aspiration (constrains
stellar-core redundancy to N+2).
- ADR-0007 — Redis as hot-path cache + rate-limit (constrains the
hot tier).
- ADR-0015 — Closed-bucket-only API serving (constrains the
cross-region invariant).
- ADR-0016 — Per-region storage strategies (R1/R2/R3 storage
shapes; this ADR is the *per-region* HA topology, ADR-0016 is *across regions*).
Decision
Adopt the per-region HA topology specified in `ha-plan.md`, binding the following decisions:
1. Single-region HA first; multi-region DR second
At launch we run exactly one region (R1 / Frankfurt) at full HA. R2 and R3 join over Weeks 6–8 with the same per-region shape. Multi-region active/active is explicitly out of scope for v1. What v1 provides is per-region full HA, cold DR in cloud, and cross-region async replicas (per ADR-0016) for read-only failover.
2. Decouple ingest from serving — three failure domains
- Hot tier (≤30s window): Redis cluster (3 masters + 3
replicas + Sentinel). Per ADR-0007.
- Warm tier (≤90 days raw, indefinite for 1h+ aggregates):
TimescaleDB Patroni-managed HA — 1 primary + 2 sync replicas.
- Cold tier (raw history, indefinite): MinIO erasure-coded
EC(6+3) across 9 hosts; bucket versioning on.
Three tiers, three failure domains. Ingest must never block serving. If the ingestion plane slows, the serving plane returns stale-marked responses (per the ADR-0015 closed-bucket contract + the envelope flags.stale=true). Never errors.
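For the cold tier, the EC(6+3) layout determines both the storage overhead and the failure tolerance. A small sketch of that arithmetic, assuming EC(6+3) means 6 data shards plus 3 parity shards per stripe with one shard per host (a generic erasure-coding reading, not taken from ha-plan.md; the class name is hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureLayout:
    data_shards: int
    parity_shards: int

    @property
    def total_shards(self) -> int:
        return self.data_shards + self.parity_shards

    @property
    def usable_fraction(self) -> float:
        """Share of raw capacity available for object data."""
        return self.data_shards / self.total_shards

    def readable_after(self, failed_hosts: int) -> bool:
        """An object is reconstructible while at least data_shards shards survive."""
        return self.total_shards - failed_hosts >= self.data_shards


cold_tier = ErasureLayout(data_shards=6, parity_shards=3)
print(cold_tier.total_shards)               # 9
print(round(cold_tier.usable_fraction, 3))  # 0.667
print(cold_tier.readable_after(3))          # True
print(cold_tier.readable_after(4))          # False
```

Under this reading, nine hosts of raw capacity yield six hosts of usable space, and objects stay readable through up to three simultaneous host losses.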
3. Redundancy is N+1 minimum; N+2 for stellar-core
Every stateful component runs ≥ N+1. stellar-core / galexie / stellar-rpc run N+2 because the Tier-1 aspiration (ADR-0004) requires three independent archives post-launch and we want the Tier-1 fleet to survive a single-host failure plus a single-host maintenance window concurrently.
4. Stateless services scale horizontally; one leader-elected aggregator
stellarindex-api runs as N=3 stateless instances behind HAProxy (keepalived VIP). stellarindex-indexer runs one process per configured source (per-source orchestration). stellarindex-aggregator runs one active + one standby, leader-elected via a Redis lease: only one instance writes to the trades hypertable at a time, which avoids duplicate emissions.
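The lease-based election can be illustrated without a Redis server. The sketch below is an assumption-laden stand-in: `InMemoryLeaseStore` mimics the semantics of a Redis `SET key owner NX PX ttl` plus an owner-checked renewal, and the class names, lease key, and 10-second TTL are all hypothetical rather than taken from the aggregator's code.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLeaseStore:
    """Stand-in for Redis: mimics SET NX with a TTL, plus owner-checked renewal."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expiry)

    def _current_owner(self, key: str) -> Optional[str]:
        entry = self._leases.get(key)
        if entry is None or entry[1] <= self._clock():
            return None  # missing or expired
        return entry[0]

    def acquire_or_renew(self, key: str, owner: str, ttl_s: float) -> bool:
        """Take the lease if it is free, or extend it if we already hold it."""
        holder = self._current_owner(key)
        if holder is None or holder == owner:
            self._leases[key] = (owner, self._clock() + ttl_s)
            return True
        return False


class Aggregator:
    def __init__(self, name: str, store: InMemoryLeaseStore, ttl_s: float = 10.0) -> None:
        self.name, self.store, self.ttl_s = name, store, ttl_s
        self.writes = 0

    def tick(self) -> bool:
        """Write only while holding the lease; otherwise stay on standby."""
        if self.store.acquire_or_renew("aggregator-leader", self.name, self.ttl_s):
            self.writes += 1  # placeholder for the real hypertable write
            return True
        return False


# Fake clock so the failover is deterministic.
now = [0.0]
store = InMemoryLeaseStore(clock=lambda: now[0])
a, b = Aggregator("agg-a", store), Aggregator("agg-b", store)

print(a.tick(), b.tick())  # True False: agg-a leads, agg-b stands by
now[0] = 11.0              # agg-a stops renewing; its lease expires
print(b.tick(), a.tick())  # True False: agg-b has taken over
```

Against a real Redis, the check-and-extend step would need to be atomic (for example via a server-side script). A lease on its own also does not stop a paused leader from writing after its lease has lapsed, so the write path may need an additional guard.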
5. Colocated bare metal primary; cloud DR
Captive-core + galexie + Postgres + MinIO live on dedicated R640-class colocated hardware (per ADR-0002 alternatives — the 3× cost differential vs cloud IOPS-matched instances at our scale ratifies this). Cloud (AWS) is the DR target:
- Stateless services: warm-standby in AWS, scale-to-zero,
scale-out on DNS flip.
- TimescaleDB: async logical replica via pg_logical at AWS
RDS, 5-minute RPO budget.
- Redis: NOT replicated cross-region. Warm-standby is cold;
re-hydrates from Timescale within minutes after failover.
- MinIO: `mc mirror` to AWS S3, 1-hour RPO for the
archive bucket; galexie-live/ replicated at 5 min.
- stellar-core / stellar-rpc: NOT replicated. Rebuilt from
our own MinIO archive on DR activation (~4 h to CATCHUP_RECENT). Running captive-cores in AWS would violate the cost envelope.
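The DR list above amounts to a per-component RPO budget. A hedged sketch of how those budgets could be checked against observed replication lag; the component keys, the function name, and the sample lag values are hypothetical, and only the budget durations come from the list:

```python
from datetime import timedelta
from typing import Dict, Optional

# RPO budgets from the list above; None marks components that are not
# replicated and are rebuilt or re-hydrated on DR activation instead.
RPO_BUDGETS: Dict[str, Optional[timedelta]] = {
    "timescaledb": timedelta(minutes=5),
    "minio-archive": timedelta(hours=1),
    "minio-galexie-live": timedelta(minutes=5),
    "redis": None,
    "stellar-core": None,
}


def rpo_breaches(observed_lag: Dict[str, timedelta]) -> Dict[str, timedelta]:
    """Return, per component, how far replication lag exceeds its RPO budget."""
    breaches = {}
    for component, lag in observed_lag.items():
        budget = RPO_BUDGETS.get(component)
        if budget is not None and lag > budget:
            breaches[component] = lag - budget
    return breaches


lag = {
    "timescaledb": timedelta(minutes=7),
    "minio-archive": timedelta(minutes=20),
    "minio-galexie-live": timedelta(minutes=5),
}
print(rpo_breaches(lag))  # only timescaledb is reported, 2 minutes over budget
```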
6. Every component has a defined degraded mode up front
The API never silently fails. Per ha-plan.md §9, each component failure has a documented degradation:
- Indexer source down → response includes the `sources` list with the
outage marked + envelope flags.reduced_redundancy=true.
- Aggregator down → API serves the last successfully published
aggregate row + flags.stale=true with as_of timestamp.
- Redis down → API queries Timescale directly +
flags.stale=true.
- Timescale primary down → automatic failover to sync replica via
Patroni (RPO 0; RTO ~30s).
- All three regions disagree on a closed bucket →
cross-region-monitor alerts; serving continues from each region's local view.
Anything that changes a flags.* value gets an explicit ADR or test pinning the contract.
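The degradation contract above can be pinned as a small, testable mapping from component state to envelope flags. This is a sketch only: the `Envelope` shape, the `build_envelope` helper, the component names, and the per-source entry format are assumptions for illustration, while `flags.stale`, `flags.reduced_redundancy`, and the sources list come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Envelope:
    data: Dict[str, object]
    flags: Dict[str, bool] = field(default_factory=dict)
    sources: List[Dict[str, object]] = field(default_factory=list)


def build_envelope(
    data: Dict[str, object], source_status: Dict[str, bool], down: Set[str]
) -> Envelope:
    """Attach degradation flags for whichever components are currently down."""
    env = Envelope(data=data)
    # Indexer source outage: list every source and mark the ones that are down.
    env.sources = [{"name": n, "up": up} for n, up in sorted(source_status.items())]
    if not all(source_status.values()):
        env.flags["reduced_redundancy"] = True
    # Aggregator or Redis outage: the answer may lag, so mark it stale.
    if down & {"aggregator", "redis"}:
        env.flags["stale"] = True
    return env


env = build_envelope(
    {"price": "0.1234"},
    {"source-a": True, "source-b": False},
    {"redis"},
)
print(env.flags)  # {'reduced_redundancy': True, 'stale': True}
```

A healthy request produces an empty `flags` dict, which gives a contract test a simple baseline to assert against.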
7. Aspirational cost envelope — colo + cloud DR ≤ $80k/year
Per ha-plan.md §12 the per-region cost target is ~$30k/year hardware amortisation + ~$20k/year cloud DR + ~$30k/year colo power/bandwidth for R1. R2 and R3 land under ADR-0016's hybrid shape so they're cheaper than a third full-stack copy. This is a target, not a binding constraint — the architectural shape is load-bearing; the budget is informative.
Consequences
- Positive — covers the 99.99 % SLA without vendor lock-in.
Self-hosted bare metal on rented colo space; cloud as DR fallback; every component has a defined failure mode. The numbers align per the napkin math in ha-plan.md §4.
- **Positive — three-tier separation makes the degradation
contract enforceable.** Each tier's failure has a clear answer for the API. Reviewers can call out a PR that introduces a path bypassing the contract (e.g. a handler that reads MinIO synchronously on a /v1/price hit) without arguing about whether it's "really" a violation.
- **Positive — N+2 for stellar-core defends the ADR-0004
aspiration.** Three independent archives are a hard requirement for Tier-1 quality; N+2 deployment ensures we don't fall below three even during single-host maintenance.
- Negative — operational complexity. Patroni, keepalived,
Sentinel, leader-election, mc mirror schedules — every redundancy layer is a moving part. Mitigated by the runbook catalog (docs/operations/runbooks/) requiring one runbook per alert + the SEV playbook tying everything together.
- Negative — colo + cloud is a hybrid posture. Egress charges
on cloud, manual hardware refresh on colo, two failure-mode sets to operationalize. Justified by the cost envelope (cloud-only at our IOPS profile is 3× more expensive) and the existing R1 hardware. Re-evaluate if either factor changes.
- **Operational impact — every PR that adds a service needs to
declare its tier, redundancy, and degradation mode.** Captured as a checklist line in the PR template (added in a follow-up to this ADR).
- **Downstream design impact — this ADR fixes the shape; specific
decisions about backup retention, alert thresholds, and failover procedures are runbooks in docs/operations/runbooks/.** Their content evolves; this ADR doesn't.
Alternatives considered
- Full cloud (no colo) — rejected per ADR-0002 alternatives.
The 3× cost differential at our IOPS profile + the captive-core fleet's existing R640 provisioning make hybrid the right call.
- Multi-region active/active at v1 — rejected. The initial
build window doesn't permit multi-master Postgres / Redis at launch. ADR-0016 picks up cross-region read replicas with ADR-0015's closed-bucket invariance providing the "byte-equivalent across regions" property; that's enough for v1.
- Single-replica Postgres (warm standby only) — rejected.
Patroni with two sync replicas costs negligibly more in storage and earns RPO=0 + automatic failover. The ops cost of manual standby promotion under stress would be far higher than the marginal infra cost.
- Run stellar-core in AWS for DR — rejected. Captive-core
IOPS in cloud at 8 vCPU / 32 GB scale violates the cost envelope. Re-bootstrapping from our own MinIO archive in ~4h is acceptable for a DR scenario where the entire colo is offline (which is itself a multi-failure scenario beyond what 99.99 % uptime targets).
- Stateless API + serverless aggregator (Lambda / Cloud Run)
— rejected. Cold-start latency would breach the p95 ≤ 200ms / p99 ≤ 500ms targets (ADR-0009 — to land — for the latency contract). Redis-leader-elected daemon in colo is the right profile for steady-state aggregation.
References
- `docs/architecture/ha-plan.md`: full design; this ADR ratifies its binding decisions.
- coverage-matrix §S9.1: the 99.99 % uptime requirement this ADR closes.
- ADR-0016: cross-region (R1/R2/R3) topology layered on top of this per-region one.
— per-host hardware spec for the colo fleet.
- ADR-0001, 0002, 0004, 0007, 0015, 0016 (cross-references above).