Beyond the Redundancy Checklist: A Dappled Comparison of High Availability Workflow Philosophies

When a system goes down, the post-mortem rarely points to a missing redundant power supply. More often, the root cause is a workflow that assumed failover would be instant, or a team that never tested the recovery sequence under load. Redundancy checklists are comforting — count the NICs, check the RAID, verify the dual PSUs — but they miss the messy human layer: how does the team actually respond when the primary node stops talking?

This guide compares three high availability workflow philosophies — active-passive, active-active, and N+1 clustering — through the lens of process and operations, not just hardware specs. We'll walk through who needs each approach, what prerequisites matter, and the real-world trade-offs that determine whether a HA plan survives first contact with a production incident.

Who Needs a HA Workflow Philosophy — and What Goes Wrong Without One

A manufacturing plant I once read about suffered a 14-hour outage because their active-passive failover script had a hardcoded IP address that the replacement server didn't use. The checklist had all the right boxes checked — dual network cards, redundant storage, generator backup — but the workflow assumed that the standby node would be configured identically, down to the last alias. It wasn't.

Teams that skip the philosophy step often end up with a hybrid mess: active-passive for the database, active-active for the web tier, and a vague hope that the monitoring system will figure out the rest. When the database fails, the web tier keeps sending writes to the now-dead primary because nothing tells it the secondary has been promoted; the load balancer's health check still classifies the secondary as a read-only replica. The result is a partial outage that no single checklist could have predicted.

Who really needs a deliberate workflow philosophy? Any team that operates a service with a recovery time objective (RTO) under 30 minutes, or a recovery point objective (RPO) under 5 minutes. If you can afford an hour of downtime and losing 15 minutes of data, a simple active-passive script with manual verification may suffice. But if you're serving real-time payments, live video, or any system where minutes of lost data mean lost revenue or compliance violations, you need a coherent philosophy — not a checklist.

The most common failure without a workflow philosophy is what we call the "split-brain trap." When network partitions occur, two nodes both think the other is dead. Without a clear consensus protocol, both may try to serve writes, or both may step down, leaving the system completely unavailable. A checklist might say "configure quorum," but the philosophy dictates how you detect partitions, what weight each node has, and whether you prefer availability over consistency.

Signs Your Team Needs to Rethink HA Workflow

If any of these sound familiar, it's time to move beyond the checklist: your failover documentation is longer than the actual recovery procedure; the last three incidents involved manual SSH sessions to promote a standby; your monitoring alerts fire but the runbook says "call the senior engineer" with no further detail; or you've never tested a failover during peak traffic hours.

Prerequisites: What to Settle Before Choosing a Philosophy

Before comparing active-passive, active-active, and N+1, you need to answer three foundational questions. First, what is your acceptable data loss window? If you cannot lose a single transaction, most active-passive setups with asynchronous replication will fail you — you need synchronous replication, which constrains latency and geography. Second, what is your team's operational capacity? Active-active sounds attractive, but it doubles the surface area for configuration drift, monitoring, and debugging. A two-person SRE team may be better served by a well-documented active-passive system than by a half-baked active-active one that nobody fully understands.

Third, what is your tolerance for complexity in the failover decision itself? In active-passive, the decision is binary: is the primary dead? In active-active, the decision is continuous: is this node healthy enough to keep serving? The latter requires sophisticated health checks, traffic draining, and session persistence logic. Many teams underestimate the engineering effort required to make active-active actually work without dropping sessions or corrupting state.

Network and Storage Prerequisites

All HA philosophies assume a reliable, low-latency network between nodes. If your datacenter links have 50 ms latency or frequent packet loss, synchronous replication becomes impractical. You may need to accept asynchronous replication and thus a longer RPO, or you may need to colocate nodes. Similarly, shared storage (like a SAN) simplifies active-passive failover but turns the storage layer into a single point of failure. Most modern HA workflows prefer replicated storage coordinated by consensus or quorum (e.g., etcd, Galera, or Ceph) to avoid that single point of failure.

Organizational Prerequisites

A HA workflow philosophy is only as good as the team that maintains it. You need a documented runbook that includes not just the happy path but the failure modes of the failover itself — what happens if the script hangs, if the network partition heals mid-failover, or if the standby node is already degraded. Regularly scheduled game days (at least quarterly) are non-negotiable. Without practice, the workflow becomes fiction.

Core Workflow: Three Philosophies in Practice

The core workflow for any HA setup follows a loop: detect a failure, decide which node(s) should serve traffic, execute the transition, and verify that the system is healthy. Each philosophy implements this loop differently.
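The loop above can be sketched as a single function that takes the four steps as callables. This is a minimal illustration, not any orchestrator's actual API; the function name and the three return labels are assumptions made for the example.

```python
from typing import Callable


def failover_cycle(
    detect: Callable[[], bool],
    decide: Callable[[], str],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
) -> str:
    """Run one pass of detect -> decide -> execute -> verify.

    Returns "healthy", "failed_over", or "needs_human" (illustrative labels).
    """
    if not detect():
        return "healthy"      # no failure detected, nothing to do
    target = decide()         # pick the node that should serve traffic
    execute(target)           # promote it, move the VIP, update the pool, ...
    if verify(target):
        return "failed_over"
    return "needs_human"      # the transition ran but the service is not healthy
```

The explicit "needs_human" outcome is the part checklists tend to omit: a failover that executes but does not verify should page someone rather than retry silently.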

Active-Passive Workflow

In the simplest form, one node (primary) handles all traffic; a second node (standby) sits idle or runs a warm copy of the data. Detection usually relies on a heartbeat mechanism — if the standby does not hear from the primary for N seconds, it assumes failure. The decision is made by a quorum agent or a third-party orchestrator (like Pacemaker or keepalived). Execution involves promoting the standby, reassigning the virtual IP, and verifying that the service responds correctly.

The catch is that heartbeats can be misleading. A network glitch may cause a false positive, triggering a failover that actually makes things worse. A common mitigation is to require multiple failure signals (e.g., loss of heartbeat plus a failed application-level health check) before acting. Another pitfall is the "fencing" problem: you must ensure the old primary is truly dead before the new primary takes over, or you risk split-brain and data corruption. Fencing often involves cutting power to the old node via IPMI or a network switch, which adds its own failure mode.
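The "multiple failure signals" mitigation can be expressed as a small decision function. This is a sketch under assumed inputs: the parameter names and the default of three failed application checks are hypothetical, and fencing is deliberately left out.

```python
def should_fail_over(
    seconds_since_heartbeat: float,
    heartbeat_timeout: float,
    app_check_failures: int,
    required_app_failures: int = 3,
) -> bool:
    """Fail over only when both independent signals agree the primary is down."""
    heartbeat_lost = seconds_since_heartbeat > heartbeat_timeout
    app_unhealthy = app_check_failures >= required_app_failures
    # A lost heartbeat alone may be a network glitch; a failed app check alone
    # may be a slow request. Requiring both reduces false positives.
    return heartbeat_lost and app_unhealthy
```

For example, `should_fail_over(12.0, 10.0, 0)` is false: the heartbeat is late, but the application still answers its health check, so the standby holds off.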

Active-Active Workflow

Here, all nodes serve traffic simultaneously, typically behind a load balancer. Detection is continuous: the load balancer runs health checks on each node and removes unhealthy ones from the pool. The decision is distributed — each node must handle session state carefully, either by storing sessions in a shared cache (like Redis) or by using sticky sessions with a timeout.

The execution step is graceful: the load balancer stops sending new traffic to the degraded node, waits for existing sessions to drain, and then takes the node out of rotation. This avoids a sudden cutoff but requires the application to handle in-flight transactions properly. A common failure is that draining takes too long, or that the node becomes unresponsive before draining completes, leading to dropped sessions.
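A drain step with a deadline might look like the sketch below. The `active_connections` callable is a stand-in for whatever your load balancer or application exposes; the function name and defaults are assumptions for illustration.

```python
import time
from typing import Callable


def drain(
    active_connections: Callable[[], int],
    timeout_s: float,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for connections to reach zero; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while active_connections() > 0:
        if time.monotonic() >= deadline:
            return False      # drain timed out: remaining sessions will be cut
        sleep(poll_s)
    return True               # node is idle and safe to remove
```

Returning a boolean instead of raising lets the caller decide what a timed-out drain means: force removal, extend the deadline, or alert.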

N+1 Clustering Workflow

N+1 means you run N active nodes plus one extra that can take over if any of the N fails. A closely related pattern is common in replicated databases (e.g., MongoDB replica sets, PostgreSQL with Patroni), where one leader takes writes and the other members stand ready to replace it. Detection relies on heartbeats or leader leases, and a consensus protocol (like Raft or Paxos) elects the replacement. The decision is made by a majority of nodes: if the leader fails, the surviving majority elects a new one and the workflow proceeds automatically.

The execution involves the new leader assuming the role, and the remaining nodes reconfiguring replication. The pitfall here is the election itself: if network partitions cause a split-brain, the old leader may continue accepting writes while a new leader is elected on the other side. A strict majority rule (e.g., 3 out of 5 nodes) prevents this, because the minority side can neither elect nor keep a leader, but it means the cluster stops accepting writes once fewer than a majority of nodes can reach each other. A 5-node cluster tolerates 2 failures, not 3. The trade-off is availability vs. consistency.
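The majority arithmetic is worth writing down, because the number of failures a cluster tolerates is easy to misjudge. A minimal sketch, with function names chosen for this example:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if this partition holds enough nodes to elect or keep a leader."""
    return reachable >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while the rest still form a majority."""
    return cluster_size - majority(cluster_size)
```

Note that `tolerated_failures(4)` and `tolerated_failures(3)` are both 1: adding a fourth node raises cost without raising fault tolerance, which is why odd cluster sizes are the usual recommendation.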

Tools, Setup, and Environment Realities

Choosing a philosophy is only half the battle; the tools you use will shape your workflow in ways you may not anticipate. For active-passive, keepalived and Pacemaker are mature but require careful configuration of priority, preemption, and fencing. A common mistake is to set the failover delay too short, causing flapping during network blips. A delay of 10–15 seconds is usually safe for internal networks; longer for WAN links.

For active-active, HAProxy and NGINX are the workhorses. HAProxy supports both active health checks (synthetic probes sent on a schedule) and passive checks (watching real traffic for errors); open-source NGINX provides passive checks, with active probes available in NGINX Plus. The key settings in HAProxy are the "rise" and "fall" counts: how many successful probes before a node is considered healthy again, and how many failures before it's removed. NGINX's passive equivalents are "max_fails" and "fail_timeout". Setting these too aggressively (e.g., 1 failure = removal) causes flapping; too conservatively (e.g., 10 failures) means degraded service persists for longer.
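To see how rise and fall counts damp flapping, here is a small state tracker that mimics the idea. It is an illustration of the concept only, not how HAProxy or NGINX implement it; the class name and defaults are assumptions.

```python
class HealthTracker:
    """Flip state only after `fall` straight failures or `rise` straight successes."""

    def __init__(self, rise: int = 2, fall: int = 3, healthy: bool = True):
        self.rise = rise
        self.fall = fall
        self.healthy = healthy
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0          # probe agrees with current state
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

With `rise=2, fall=3`, two failed probes followed by a success leave the node in the pool; only three failures in a row remove it, and two successes in a row bring it back.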

For N+1 clustering, Patroni (for PostgreSQL) and MongoDB's built-in replica set management are popular. Patroni relies on an external distributed configuration store such as etcd or Consul for leader election, while MongoDB replica sets run their own election protocol among the members. A common Patroni setup pitfall is running the consensus store on the same nodes as the database: if a database node fails, you lose both a data replica and a consensus vote. Where possible, run the consensus store on separate nodes, and in either case keep an odd number of voting members.

Environment Realities: Cloud vs. On-Prem

In cloud environments, virtual IPs are tricky because IP addresses are ephemeral. Most cloud-native HA workflows use DNS-based failover (e.g., Amazon Route 53 health checks) or managed load balancers (AWS NLB, GCP TCP LB). The catch with DNS is caching: resolvers keep serving the old record until its TTL expires, so TTLs must be set low (30–60 seconds) to achieve fast failover, but low TTLs increase DNS query volume and cost. For on-prem, virtual IPs with gratuitous ARP are still the fastest method, but they require layer-2 adjacency.
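A back-of-the-envelope estimate helps when choosing a TTL. The sketch below assumes that resolvers honor the TTL and that the health checker needs a fixed number of consecutive failed probes before it updates the record; the function name and the example figures are illustrative, not taken from any provider's documentation.

```python
def worst_case_dns_failover_s(
    check_interval_s: float,
    failure_threshold: int,
    ttl_s: float,
) -> float:
    """Rough upper bound on how long clients may keep hitting the dead node."""
    detection_s = check_interval_s * failure_threshold  # time to notice the failure
    return detection_s + ttl_s                          # plus time for caches to expire


# Example: probes every 10 s, 3 failures required, 60 s TTL.
print(worst_case_dns_failover_s(10, 3, 60))
```

Clients or resolvers that cache beyond the TTL will stretch this bound, so treat it as a planning figure and measure the real number in a drill.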

Latency is another reality check. If your nodes are spread across regions, synchronous replication becomes impractical: every commit has to wait for at least one cross-region round trip, and that delay is bounded below by the speed of light. You must accept asynchronous replication and thus a longer RPO, or you must use a multi-region active-active design with conflict resolution, which is far more complex.
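The speed-of-light floor is easy to estimate. The sketch below assumes light in optical fiber travels at roughly 200,000 km/s (about two-thirds of its speed in a vacuum) and ignores routing detours, switching, and queuing, so real round trips will be longer. The function names are chosen for this example.

```python
FIBER_KM_PER_MS = 200.0  # approximate: ~200,000 km/s in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a straight fiber path."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_serial_sync_commits_per_s(distance_km: float) -> float:
    """Ceiling for one writer that waits for a remote ack on every commit."""
    return 1000.0 / min_round_trip_ms(distance_km)
```

For two sites 4,000 km apart, the floor is a 40 ms round trip, which caps a single serial writer at 25 synchronous commits per second before any other overhead.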

Variations for Different Constraints

Not every team can afford three-node clusters or low-latency links. Here are variations for common constraints.

Two-Node Active-Passive with a Witness

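If you have only two nodes and cannot add a third, you risk split-brain. The solution is a lightweight witness (a small VM or a cloud function) that casts a tie-breaking vote. The witness must be on a separate failure domain (different rack, different availability zone). Keepalived has no built-in witness vote; its "vrrp_script" option can approximate one by running a check against the witness and lowering a node's priority when the check fails, and "unicast_peer" lets VRRP run between the nodes without multicast. Pacemaker with Corosync offers a purpose-built quorum device (corosync-qdevice) for this role. Either way, the witness must never be able to become a primary itself.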
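The tie-breaking logic reduces to "hold two of three votes." A minimal sketch, assuming each node can independently test whether it reaches its peer and the witness; the function names are hypothetical, and fencing of the old primary is out of scope here.

```python
def standby_should_promote(primary_reachable: bool, witness_reachable: bool) -> bool:
    """Standby promotes only if the primary is gone AND the witness agrees.

    Standby plus witness is 2 of 3 votes. Without the witness, the standby
    cannot tell a dead primary from its own network isolation.
    """
    return (not primary_reachable) and witness_reachable


def primary_should_step_down(standby_reachable: bool, witness_reachable: bool) -> bool:
    """Primary steps down when it is alone: 1 of 3 votes is not a majority."""
    return (not standby_reachable) and (not witness_reachable)
```

Taken together, the two rules mean that in any partition at most one side keeps or takes the primary role.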

Active-Active with Read Replicas

If your workload is read-heavy, you can use an active-active philosophy for reads and active-passive for writes. This is common in web applications: one primary database handles writes; multiple read replicas serve reads. The load balancer directs write requests to the primary and read requests to any replica. The variation requires the application to distinguish between read and write queries, and to tolerate slightly stale reads from replicas.
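A read/write split can be sketched as a small router. This example classifies statements by their first keyword, which is a deliberate simplification: statements such as SELECT ... FOR UPDATE, or reads that must see the latest write, belong on the primary. The class name, the prefix list, and the host names are assumptions for illustration.

```python
import itertools
from typing import Sequence

READ_PREFIXES = ("select", "show")  # naive classification, see caveat above


class QueryRouter:
    """Send writes to the primary and spread reads across replicas round-robin."""

    def __init__(self, primary: str, replicas: Sequence[str]):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith(READ_PREFIXES)
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary  # writes, or reads when no replica is configured
```

Falling back to the primary when no replicas are configured keeps the application working during a replica outage, at the cost of extra load on the primary.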

N+1 with Geo-Distributed Nodes

For disaster recovery across regions, you can run an N+1 cluster with nodes in two regions and a third region for the tie-breaking consensus vote. This is expensive but gives you automatic failover. If you use synchronous replication within a region and asynchronous between regions, failures inside a region lose no data, while a full regional failover can still lose whatever the asynchronous link had not yet shipped. The variation requires careful tuning of network latency and a clear policy on which region is preferred.

Budget Constraint: Manual Failover with Good Runbooks

If you have no budget for clustering software or additional nodes, a manual failover workflow with detailed runbooks can still achieve a 15-minute RTO. The key is to practice the runbook monthly, automate the verification steps, and use a simple DNS TTL of 60 seconds. This is not a philosophy but a pragmatic fallback — acknowledge the limitation and monitor the failover time in every drill.

Pitfalls, Debugging, and What to Check When It Fails

Even with a well-chosen philosophy, failures happen. Here are the most common pitfalls and how to diagnose them.

Split-Brain Scenarios

When two nodes both think they are primary, data corruption is almost certain. The first symptom is usually conflicting writes that cause replication errors. To debug, check the consensus log (e.g., etcd logs) for multiple leaders. The fix is to implement strict fencing: make sure the old primary cannot serve traffic after a failover. In active-passive, this means using STONITH (Shoot The Other Node In The Head) — a power switch or IPMI command that forcibly reboots the old primary. In active-active, split-brain is harder to detect because both nodes may serve traffic for a while; use application-level conflict detection and a manual reconciliation process.

Failover Flapping

If a node fails and recovers quickly, the system may flip back and forth, causing repeated short outages. This often happens when health checks are too sensitive — a brief CPU spike triggers a failover, but the node recovers seconds later. The fix is to add a "cooldown" period: once a failover occurs, the system must wait a minimum time (e.g., 60 seconds) before failing back. Also, use a "hold-down" timer on health checks: require multiple consecutive failures before removing a node.
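The cooldown rule can be captured in a few lines. This is a sketch, not a feature of any particular orchestrator; the class name is hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
class FailbackGuard:
    """Block failback until a minimum time has passed since the last failover."""

    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self._last_failover_at = None  # no failover recorded yet

    def record_failover(self, now: float) -> None:
        self._last_failover_at = now

    def may_fail_back(self, now: float) -> bool:
        if self._last_failover_at is None:
            return True
        return now - self._last_failover_at >= self.cooldown_s
```

In production you would pass `time.monotonic()` as `now`, so that wall-clock adjustments cannot shorten or lengthen the cooldown.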

Incomplete Drain During Active-Active Failover

When a node is removed from the load balancer pool, existing connections may linger if the application doesn't support graceful shutdown. The symptom is increased error rates on the remaining nodes as they inherit half-open connections. To debug, check the load balancer logs for connections that were not drained. The fix is to implement a connection drain timeout (e.g., 30 seconds) and ensure the application listens for SIGTERM and finishes in-flight requests.
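On the application side, graceful shutdown usually means "stop taking new work on SIGTERM, finish what is in flight." The sketch below shows the pattern with a flag set by the signal handler; the `serve` loop and its uppercase "work" are placeholders for a real request handler.

```python
import signal
import threading

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only set a flag, do no real work here."""
    shutting_down.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(requests):
    """Process requests until shutdown is requested.

    The flag is checked between requests, so a request that has already
    started always runs to completion.
    """
    handled = []
    for request in requests:
        if shutting_down.is_set():
            break                      # stop accepting new work
        handled.append(request.upper())  # placeholder for real handling
    return handled
```

Pair this with a drain timeout on the load balancer that is at least as long as your slowest legitimate request, otherwise the balancer cuts connections the application was still finishing.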

Configuration Drift

Over time, standby nodes may diverge from the primary in subtle ways — different kernel parameters, missing packages, or outdated certificates. The symptom is a failover that works on paper but fails because the standby's firewall blocks a port that the primary had open. The fix is to use infrastructure-as-code (Terraform, Ansible) to provision all nodes identically, and to run regular "configuration compliance" checks. Game days should include a test where the standby is promoted and the old primary is rebuilt from scratch.
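A configuration compliance check can start as a simple diff between two fact dictionaries, for example ones gathered by your provisioning tool. The sketch below assumes the settings have already been collected into flat dictionaries; the function name and the sample keys are illustrative.

```python
MISSING = "<missing>"


def config_drift(primary: dict, standby: dict) -> dict:
    """Return {key: (primary_value, standby_value)} for every setting that differs."""
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        p = primary.get(key, MISSING)
        s = standby.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


primary = {"kernel.somaxconn": "4096", "open_ports": "443,5432", "tls_cert_expiry": "2026-01-01"}
standby = {"kernel.somaxconn": "128", "open_ports": "443", "tls_cert_expiry": "2026-01-01"}
print(config_drift(primary, standby))
```

Running a check like this on a schedule, and failing the game day if it reports anything, turns drift from a surprise during failover into a routine ticket.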

What to Check First When a Failover Fails

Start with the logs: the orchestrator logs (e.g., Pacemaker, Patroni), the load balancer logs, and the application logs on both nodes. Look for timeouts, authentication errors, or resource contention. Next, verify network connectivity between nodes — a firewall change or routing issue is often the culprit. Then check the health check endpoint: does the standby actually respond correctly when probed? Finally, verify that the virtual IP or DNS record has been updated — DNS caching can cause lingering traffic to the dead node.

After each incident, update the runbook with the actual steps taken and any unexpected behaviors. The goal is not to eliminate all failures — that's impossible — but to make each failure a learning opportunity that strengthens the workflow.
