HOWTO: Run a DR Failback After PostgreSQL Recovery

Purpose: Restore the original PostgreSQL primary as a replica after a failover, re-synchronise it, and optionally promote it back to primary with full run-record capture.

Difficulty: Advanced

Track: Disaster Recovery Automation


Overview

Failback is often neglected in DR rehearsals, but it is the step that proves you can return to a known-good topology without data loss or extended downtime. In HybridOps, failback is a structured module operation with checkpoints at each stage and run records produced throughout. This HOWTO covers the full path.


1. Pre-Failback Assessment

  • Current cluster topology after failover.
  • Original primary node state: is it recoverable?
  • Decision: rebuild from backup or re-sync from replica.
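
The decision in the last bullet can be written down as an explicit rule so that the drill record shows why a path was chosen. The sketch below is illustrative only: the NodeAssessment fields and the policy itself are assumptions made for this example, not HybridOps behaviour.

```python
from dataclasses import dataclass


@dataclass
class NodeAssessment:
    """Hypothetical summary of the original primary after failover."""
    host_reachable: bool    # the node boots and can be reached
    data_dir_intact: bool   # PGDATA is present and looks usable
    backup_available: bool  # a usable pgBackRest backup exists


def choose_recovery_path(node: NodeAssessment) -> str:
    """Return 'unrecoverable', 'rebuild-from-backup' or 'resync-from-replica'.

    Assumed policy: an unreachable host cannot be recovered in place;
    a damaged data directory is rebuilt from backup when one exists;
    every other case is re-synchronised from the running cluster.
    """
    if not node.host_reachable:
        return "unrecoverable"
    if not node.data_dir_intact and node.backup_available:
        return "rebuild-from-backup"
    return "resync-from-replica"


if __name__ == "__main__":
    damaged = NodeAssessment(host_reachable=True,
                             data_dir_intact=False,
                             backup_available=True)
    print(choose_recovery_path(damaged))
```

Keeping the rule in one small function makes it easy to review and to attach its inputs and result to the pre-failback checkpoint.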

2. Rebuilding the Original Node

  • Stopping PostgreSQL on the original primary.
  • pgbackrest restore from the latest backup.
  • Patroni configuration for standby mode.
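
As a rough sketch of the first two bullets, the helper below only builds the command lines so they can be reviewed or logged before anything is executed. The systemd unit name "patroni", the stanza name and the data directory path are assumptions. It also assumes that stopping Patroni stops the PostgreSQL instance it manages and that a delta restore in standby mode suits your setup. The Patroni configuration bullet is not covered here.

```python
import shlex
from typing import List


def rebuild_commands(stanza: str, pgdata: str,
                     patroni_unit: str = "patroni") -> List[List[str]]:
    """Build, but do not run, the commands for a rebuild from backup.

    patroni_unit is an assumed systemd unit name. pgbackrest is normally
    run as the postgres OS user; this helper does not switch user.
    """
    return [
        # Stop Patroni first so it cannot restart PostgreSQL mid-restore.
        ["systemctl", "stop", patroni_unit],
        # --delta reuses files already on disk that match the backup;
        # --type=standby restores the node so it starts as a standby.
        ["pgbackrest", f"--stanza={stanza}", f"--pg1-path={pgdata}",
         "--delta", "--type=standby", "restore"],
    ]


if __name__ == "__main__":
    # Stanza name and path are placeholders for illustration.
    for cmd in rebuild_commands("main", "/var/lib/postgresql/16/main"):
        print(shlex.join(cmd))
```

To execute the commands, pass each list to subprocess.run with check=True, which raises an error when a command exits with a non-zero status.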

3. Re-attaching as a Replica

  • Starting Patroni on the rebuilt node.
  • Confirming Patroni member registration in DCS.
  • Monitoring WAL catch-up and lag.
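
WAL catch-up can be tracked by comparing the write position on the current primary with the replay position on the rebuilt node. The sketch below assumes you have already fetched both values as text, for example from pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica. The 16 MiB default threshold is an arbitrary placeholder.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


def caught_up(primary_lsn: str, replica_lsn: str,
              max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """True when the lag is at or below the (assumed) threshold."""
    return lag_bytes(primary_lsn, replica_lsn) <= max_lag_bytes


if __name__ == "__main__":
    print(lag_bytes("0/3000060", "0/3000000"))  # 0x60 bytes -> 96
    print(caught_up("0/3000060", "0/3000000"))  # True
```

Sampling the lag at a fixed interval and storing each sample gives a catch-up curve for the run record, not just a final pass or fail.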

4. Validation Before Switchback

  • Replication lag below threshold.
  • Original node listed as healthy replica in Patroni.
  • pgBackRest stanza check on the rebuilt node.
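
The three checks above can be combined into a single gate that returns every failed check, which is more useful in a run record than a bare yes or no. This is a sketch under stated assumptions: the input is the output of "patronictl list --format json", with the keys Member, Role, State and "Lag in MB" as printed by recent Patroni releases, and stanza_check_rc is the exit code of a pgbackrest check run on the rebuilt node. Verify the key names and state values against your Patroni version.

```python
import json
from typing import List


def switchback_blockers(members_json: str, member: str,
                        max_lag_mb: float, stanza_check_rc: int) -> List[str]:
    """Return failed pre-switchback checks; an empty list means go."""
    members = json.loads(members_json)
    node = next((m for m in members if m.get("Member") == member), None)
    if node is None:
        return [f"{member} is not registered in the cluster"]

    failures = []
    if node.get("Role") not in ("Replica", "Sync Standby"):
        failures.append(f"unexpected role: {node.get('Role')}")
    # Older releases report 'running'; newer ones report 'streaming'.
    if node.get("State") not in ("running", "streaming"):
        failures.append(f"unhealthy state: {node.get('State')}")
    lag = node.get("Lag in MB")
    # The lag can be missing or non-numeric; treat that as a failure.
    if not isinstance(lag, (int, float)) or lag > max_lag_mb:
        failures.append(f"replication lag not below threshold: {lag}")
    if stanza_check_rc != 0:
        failures.append(f"pgbackrest check exited with {stanza_check_rc}")
    return failures


if __name__ == "__main__":
    # Hand-written sample for illustration, not captured output.
    sample = json.dumps([
        {"Member": "pg2", "Role": "Leader", "State": "running", "TL": 5},
        {"Member": "pg1", "Role": "Replica", "State": "streaming",
         "TL": 5, "Lag in MB": 0},
    ])
    print(switchback_blockers(sample, "pg1", 1, 0))
```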

5. Optional Controlled Switchback

  • patronictl switchover to promote original primary.
  • Application connection pool drain and reconnect.
  • Post-switchback cluster state snapshot.
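
A minimal sketch of assembling the switchover command follows. The configuration path and member names are placeholders. The --leader option is the spelling used by recent Patroni releases; older releases used --master, so check "patronictl switchover --help" on your installation first.

```python
import shlex
from typing import List, Optional


def switchover_command(config_path: str, current_leader: str,
                       candidate: str, cluster_name: Optional[str] = None,
                       force: bool = False) -> List[str]:
    """Build the patronictl switchover command without running it."""
    if current_leader == candidate:
        raise ValueError("candidate must differ from the current leader")
    cmd = ["patronictl", "-c", config_path, "switchover",
           "--leader", current_leader, "--candidate", candidate]
    if force:
        # --force skips the interactive confirmation prompt.
        cmd.append("--force")
    if cluster_name:
        cmd.append(cluster_name)
    return cmd


if __name__ == "__main__":
    # pg2 is the current leader; pg1 is the original primary.
    print(shlex.join(switchover_command("/etc/patroni/patroni.yml",
                                        "pg2", "pg1", force=True)))
```

Building the command as a list keeps it loggable for the run record and avoids shell quoting problems when it is later passed to subprocess.run.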

6. Closing the DR Drill Record

  • Final cluster topology run record.
  • Total elapsed time from failover to clean failback.
  • DR drill run record closure with all linked records.
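
The closing record can be sketched as a small function that computes the elapsed time and gathers the linked records. The field names and record identifiers below are hypothetical and do not reflect the actual HybridOps run-record schema.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def close_drill_record(failover_started: datetime,
                       failback_completed: datetime,
                       final_topology: Dict[str, str],
                       linked_records: List[str]) -> dict:
    """Build a closing record for a DR drill (hypothetical schema)."""
    elapsed = failback_completed - failover_started
    if elapsed.total_seconds() < 0:
        raise ValueError("failback cannot complete before failover starts")
    return {
        "status": "closed",
        "failover_started": failover_started.isoformat(),
        "failback_completed": failback_completed.isoformat(),
        "elapsed_seconds": int(elapsed.total_seconds()),
        "final_topology": dict(final_topology),
        "linked_records": list(linked_records),
    }


if __name__ == "__main__":
    # Timestamps, member names and record IDs are placeholders.
    record = close_drill_record(
        datetime(2024, 1, 1, 10, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, 30, 15, tzinfo=timezone.utc),
        {"pg1": "Leader", "pg2": "Replica"},
        ["failover-record", "restore-record", "switchback-record"],
    )
    print(json.dumps(record, indent=2))
```

Using timezone-aware timestamps avoids ambiguous elapsed times when the nodes involved in the drill sit in different time zones.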


License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.