
PostgreSQL HA DR Cycle

Executive summary

  • Runs a three-node PostgreSQL HA cluster on-prem using Patroni and etcd, then rehearses failover and controlled return through separate recovery lanes.
  • Uses a pgBackRest repository in object storage as the recovery backbone, so the platform does not depend on a permanently active duplicate database estate.
  • Proves both infrastructure recovery and application-data fidelity with seeded row counts and checksum validation.
  • Keeps the live primary lane isolated while recovery logic is exercised elsewhere.

Case study – how this was used in practice

  • Context: The platform needed a DR path that could be rehearsed without paying for a full duplicate cloud estate all year.
  • Challenge: Failover alone was not enough; the recovery path had to preserve application data and support a controlled return on-prem.
  • Approach: The HA cluster, backup flow, DNS cutover, failover, and failback were exercised as separate but connected blueprints.
  • Outcome: The platform now has a repeatable DR cycle that can be run in isolated lanes without compromising the live database path.

Demo

Video walkthrough

  • Video placeholder: replace VIDEO_URL_HERE with the published walkthrough URL.
  • Suggested embed target: VIDEO_URL_HERE

Show seeded application data on the source lane, the failover run into GCP, checksum verification on the recovery leader, and the controlled return into the on-prem drill lane.

Screenshots

Add these files when they are ready:

![Source PostgreSQL HA leader with seeded dataset counts](./images/postgresql-ha-dr-cycle-01-source-counts.png)
![GCP recovery leader showing matching checksums](./images/postgresql-ha-dr-cycle-02-gcp-checksums.png)
![On-prem return lane showing the same dataset after failback](./images/postgresql-ha-dr-cycle-03-failback-validation.png)

Architecture

  • Primary lane: three-node PostgreSQL HA on-prem.
  • Recovery lane: GCP restore cluster built from pgBackRest backups.
  • Control plane: DNS cutover and decision logic run separately from the database lane itself.
  • Return path: an isolated on-prem lane restores from the GCP-backed repository without disturbing the live primary lane.
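
Keeping decision logic outside the database lane means the cutover check can be a small function over cluster state. The sketch below is one possible shape, not the platform's actual control-plane code. It assumes each lane exposes cluster state as a dictionary with a `members` list, modelled on what Patroni's REST API returns from `/cluster`. The function names and the `LEADER_ROLES` set are hypothetical, and the role and state values should be checked against the Patroni version in use.

```python
from typing import Any, Dict, List

# Assumption: role/state strings as reported by Patroni's /cluster endpoint.
# Verify these against your Patroni version before relying on them.
LEADER_ROLES = {"leader"}
RUNNING_STATES = {"running"}


def running_leaders(cluster: Dict[str, Any]) -> List[str]:
    """Return the names of members that are both a leader and running."""
    return [
        member.get("name", "")
        for member in cluster.get("members", [])
        if member.get("role") in LEADER_ROLES
        and member.get("state") in RUNNING_STATES
    ]


def should_cut_over(primary: Dict[str, Any], recovery: Dict[str, Any]) -> bool:
    """Cut DNS over only if the primary lane has no running leader
    and the recovery lane has exactly one."""
    return not running_leaders(primary) and len(running_leaders(recovery)) == 1


# Usage with hand-written cluster states (hypothetical member names).
primary_down = {"members": [{"name": "pg2", "role": "replica", "state": "running"}]}
recovery_ok = {
    "members": [
        {"name": "gcp1", "role": "leader", "state": "running"},
        {"name": "gcp2", "role": "replica", "state": "streaming"},
    ]
}
print(should_cut_over(primary_down, recovery_ok))
```

Requiring exactly one running leader on the recovery side is a deliberately conservative choice: the check refuses to move traffic to a lane that has no leader or that reports more than one.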

Implementation highlights

Evidence and run records

Relevant run records typically live under:

  • envs/dev/logs/module/platform__postgresql-ha/...
  • envs/dev/logs/module/platform__postgresql-ha-backup/...
  • envs/dev/logs/module/platform__network__dns-routing/...
  • envs/drill/logs/module/platform__postgresql-ha/...

Learning outcomes

  • Understand how backup-driven DR differs from keeping a second always-on primary.
  • See how DNS cutover, database recovery, and validation checks fit into one rehearsed cycle.
  • Recognize the difference between proving infrastructure recovery and proving data fidelity.
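
The gap between infrastructure recovery and data fidelity can be illustrated with a small comparison helper. This is a minimal sketch, assuming the seeded rows have already been fetched from each lane as tuples; it does not reproduce the checksum method the drill actually uses. The function names and sample rows are hypothetical.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def dataset_fingerprint(rows: Iterable[Sequence[object]]) -> Tuple[int, str]:
    """Return (row_count, SHA-256 hex digest) for a set of rows.

    Rows are sorted before hashing, so the result does not depend on
    the order in which the database returned them.
    """
    # repr() keeps None, '' and 1 vs '1' distinct; this assumes both lanes
    # are read with the same driver so values have identical Python types.
    encoded = sorted("\x1f".join(repr(value) for value in row) for row in rows)
    digest = hashlib.sha256()
    for line in encoded:
        digest.update(line.encode("utf-8"))
        digest.update(b"\x1e")  # record separator between rows
    return len(encoded), digest.hexdigest()


def lanes_match(source_rows, recovery_rows) -> bool:
    """True only when both row count and checksum agree."""
    return dataset_fingerprint(source_rows) == dataset_fingerprint(recovery_rows)


# Usage with hypothetical seeded rows.
source = [(1, "alice", 10), (2, "bob", 20)]
recovered = [(2, "bob", 20), (1, "alice", 10)]   # same data, different order
drifted = [(1, "alice", 10), (2, "bob", 21)]     # same count, one value changed

print(lanes_match(source, recovered))
print(lanes_match(source, drifted))
```

The `drifted` case is the one a row count alone would miss: the recovery lane is up and holds the right number of rows, but the checksum shows the data is not the same.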

Reuse and extensions

  • Extend the same pattern to a managed PostgreSQL recovery lane later if the operating model requires it.
  • Reuse the same drill approach for additional application datasets or service-specific recovery paths.

Status and versioning

Validated against the current self-managed PostgreSQL HA DR path, including isolated app-data seeding and checksum verification. Video and screenshots are still to be added.

Maintainer

  • Owner: HybridOps
  • Primary contact: platform-docs
  • Last reviewed: 2026-03-09