ADR-0024 · Accepted · 2026-04-30

Redis HA via Sentinel (not Cluster)


Context

`docs/architecture/ha-plan.md §3.4` describes the Redis topology as:

"3 masters + 3 replicas, Redis-Cluster mode (hash slots). Replicas on separate hosts from their masters. 3 sentinels on independent hosts for failover vote."

This is internally inconsistent. Redis Cluster and Redis Sentinel are two different HA modes:

| Mode | Sharding | Failover | Sentinel processes? |
| --- | --- | --- | --- |
| Redis Cluster | yes (hash slots; data spread across N masters) | internal — cluster nodes vote among themselves | no — Cluster has no Sentinel |
| Redis Sentinel | no — single primary + replicas | external — N (≥3) Sentinel processes monitor the primary, elect a new one on failure | yes |

The ha-plan's "Cluster mode... 3 sentinels for failover vote" combines the two — likely the original author wrote "Cluster mode" colloquially meaning "clustered deployment" and the ADR process never tightened the term.

This ADR ratifies the Sentinel choice and asks that ha-plan §3.4 be amended for terminological consistency the next time it ships.

Decision

Redis HA via Sentinel. Concretely:

  • 1 primary + 2 replicas across 3 hosts (cache-01 / 02 / 03).
  • 3 Sentinels co-located on the same 3 cache hosts (one per host), quorum=2 for promotion votes.
  • No sharding. The keyspace lives entirely on the primary; each replica holds a full async copy.
  • Persistence: AOF everysec + RDB nightly (per ha-plan §3.4 unchanged).
  • Failover RTO: 15-30 s (per ha-plan §3.4 unchanged).
  • Client connect via go-redis/v9 FailoverClient, which consults Sentinel for the current primary; no HAProxy or keepalived VIP needed in front of Redis.

Why Sentinel rather than Cluster

  1. Hot-data set is small. ha-plan §3.4 enumerates the hot-data categories (price cache + VWAP precompute + rate-limit + SEP-1 + asset-metadata + SSE registry). The total fits comfortably in single-primary RAM at expected launch scale. Sharding adds complexity that solves no current capacity problem.
  2. Operational simplicity. Sentinel has fewer moving parts to debug under SEV-1 stress (3 Sentinels + 3 redis-server processes vs Cluster's per-node gossip + slot-migration bookkeeping).
  3. Migration path stays open. If we outgrow Sentinel's capacity ceiling later, migrating to Cluster is a one-time cost, not an ongoing tax. Premature sharding would tax every feature that touches Redis (keyspace partitioning, slot-aware pipelining, etc.).
  4. Client-library support is uniform. go-redis/v9 exposes FailoverClient for Sentinel and ClusterClient for Cluster. We already use the simpler Client shape; the Client → FailoverClient change is one constructor call. ClusterClient requires more invasive changes (slot-aware hashing of key prefixes for multi-key ops, etc.).
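
To make the "slot-aware hashing" cost concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16/XMODEM of the key, or of the `{hash tag}` if one is present). The key names are hypothetical and are not the actual `internal/cachekeys` layout; the point is only that multi-key operations in Cluster need keys that land in the same slot, which Sentinel mode never requires.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterSlots is the fixed number of hash slots in Redis Cluster.
const clusterSlots = 16384

// crc16 implements CRC16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum Redis Cluster uses for key-to-slot mapping.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot returns the Cluster slot for key. If the key contains a
// non-empty "{...}" hash tag, only the tag content is hashed.
func hashSlot(key string) uint16 {
	if start := strings.IndexByte(key, '{'); start >= 0 {
		if end := strings.IndexByte(key[start+1:], '}'); end > 0 {
			key = key[start+1 : start+1+end]
		}
	}
	return crc16([]byte(key)) % clusterSlots
}

func main() {
	// Hypothetical key names, for illustration only.
	keys := []string{"price:XLM", "vwap:XLM", "{XLM}:price", "{XLM}:vwap"}
	for _, k := range keys {
		fmt.Printf("%-12s -> slot %d\n", k, hashSlot(k))
	}
}
```

The two untagged keys hash independently and will usually land in different slots, while the two `{XLM}` keys always share one. Retrofitting tags like these onto an existing keyspace is the kind of invasive change the fourth point refers to.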

Consequences

  • Positive — ops surface stays small. Three Sentinels + three redis-servers, no extra processes, no HAProxy in front.
  • Positive — `internal/cachekeys` change is minimal. Switch the constructor from redis.NewClient to redis.NewFailoverClient(&redis.FailoverOptions{ MasterName, SentinelAddrs, ... }). Cache-key surface unchanged; existing call sites untouched.

  • Negative — capacity ceiling. All hot data goes through one primary's RAM. Mitigation: monitoring at 75 % maxmemory (warn) + 90 % (page); migration plan to Cluster documented if/when we approach the ceiling.
  • Negative — single-primary write path. Reads can be served from replicas via ReadOnlyReplicas mode if needed; writes go only to the primary. Acceptable since writes are dominated by the aggregator's bulk refresh, which is throughput-tolerant.
  • Negative — Sentinel itself can split-brain. With quorum=2 on 3 Sentinels, a 2-1 partition lets the larger side promote; the partitioned-1 side won't. Correct behaviour, documented in the runbook.
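
The 2-1 partition reasoning can be checked with a small sketch. It models two Sentinel rules: the configured quorum must agree that the primary is down, and a failover must additionally be authorized by a majority of all Sentinels. The function name `canPromote` is hypothetical, and the model deliberately ignores whether the side can also reach a promotable replica.

```go
package main

import "fmt"

// canPromote reports whether a partition side that contains `reachable`
// Sentinels could start and authorize a failover. It is a simplified model:
// it checks the quorum (failure detection) and the majority (failover
// authorization), but not replica reachability.
func canPromote(totalSentinels, quorum, reachable int) bool {
	majority := totalSentinels/2 + 1
	return reachable >= quorum && reachable >= majority
}

func main() {
	const total, quorum = 3, 2 // values from this ADR
	for _, side := range []int{2, 1} {
		fmt.Printf("side with %d Sentinel(s): can promote = %v\n",
			side, canPromote(total, quorum, side))
	}
}
```

With the ADR's values, the side holding two Sentinels reports `true` and the isolated side reports `false`, matching the behaviour described in the last bullet.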

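The capacity-ceiling mitigation above amounts to a two-threshold check on memory use. A minimal sketch, assuming the inputs are the `used_memory` and `maxmemory` values reported by `INFO memory`; the `Level` type and `classify` function are hypothetical names, not part of the existing monitoring stack.

```go
package main

import "fmt"

// Level is the alert level derived from memory utilisation.
type Level string

const (
	OK   Level = "ok"
	Warn Level = "warn" // at or above 75 % of maxmemory
	Page Level = "page" // at or above 90 % of maxmemory
)

// classify compares used memory to maxmemory (both in bytes) using the
// thresholds from this ADR. A maxmemory of 0 means "no limit" in Redis,
// so no ratio can be computed and the sketch returns OK.
func classify(usedBytes, maxBytes uint64) Level {
	if maxBytes == 0 {
		return OK
	}
	// Integer arithmetic avoids floating-point edge cases at the thresholds.
	switch {
	case usedBytes*100 >= maxBytes*90:
		return Page
	case usedBytes*100 >= maxBytes*75:
		return Warn
	default:
		return OK
	}
}

func main() {
	const gib = 1 << 30
	// Hypothetical 8 GiB maxmemory, for illustration only.
	for _, used := range []uint64{5 * gib, 6 * gib, 7 * gib, 8 * gib} {
		fmt.Printf("used=%d GiB -> %s\n", used/gib, classify(used, 8*gib))
	}
}
```
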
Alternatives considered

  • Redis Cluster (the as-written ha-plan §3.4 wording) — rejected per "Why Sentinel" above. Reconsider only if hot data approaches the single-primary RAM ceiling.
  • Single-host Redis with replication-only (no Sentinel) — rejected because failover would be manual; the SEV-1 drill-scenario timeline (docs/operations/drills/scenarios/) assumes automatic failover.
  • Managed Redis SaaS (ElastiCache / Upstash / Redis Labs) — rejected because we self-host everything else by ADR-0008; introducing a SaaS dependency for HA-cache only is asymmetric.
  • KeyDB (multi-master Redis fork) — rejected because the fork's bus factor is uncertain and we'd be the first Stellar-side team running it; not the right risk profile for the cache layer.

Implementation notes

The Patroni Ansible role design note (Task #72 sub-role) and the Redis Sentinel Ansible role design note (companion) cover the implementation shape. ha-plan §3.4 should be amended in the same PR that ships the Redis Sentinel Ansible role, to remove the Cluster/Sentinel terminology contradiction.

References