PostgreSQL HA DR Cycle¶
Executive summary¶
- Runs a three-node PostgreSQL HA cluster on-prem using Patroni and etcd, then rehearses failover and controlled return through separate recovery lanes.
- Uses pgBackRest object storage as the recovery backbone so the platform does not depend on a permanently active duplicate database estate.
- Proves both infrastructure recovery and application-data fidelity with seeded row counts and checksum validation.
- Keeps the live primary lane isolated while recovery logic is exercised elsewhere.
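The row-count-plus-checksum validation mentioned above can be sketched as a small shell helper. The table name `drill_rows`, the key column `id`, and the psql invocation in the usage comment are illustrative assumptions, not the project's actual drill code:

```shell
# Build the SQL used on both leaders: row count plus an order-stable
# checksum over every row's text representation. Table and key column
# are hypothetical names.
build_fidelity_query() {
  table=$1; keycol=$2
  printf "SELECT count(*) || ':' || md5(string_agg(md5(t::text), '' ORDER BY %s)) FROM %s t;" \
    "$keycol" "$table"
}

# Fidelity holds only when both results are non-empty and identical.
fidelity_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage against real lanes (connection strings assumed):
#   q=$(build_fidelity_query drill_rows id)
#   src=$(psql "$SRC_DSN" -Atc "$q")
#   dst=$(psql "$DST_DSN" -Atc "$q")
#   fidelity_matches "$src" "$dst" || exit 1
```

Ordering the aggregate by the key column keeps the checksum stable across lanes regardless of physical row order.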
Case study – how this was used in practice¶
- Context: The platform needed a DR path that could be rehearsed without paying for a full duplicate cloud estate all year.
- Challenge: Failover alone was not enough; the recovery path had to preserve application data and support a controlled return on-prem.
- Approach: The HA cluster, backup flow, DNS cutover, failover, and failback were exercised as separate but connected blueprints.
- Outcome: The platform now has a repeatable DR cycle that can be run in isolated lanes without compromising the live database path.
Demo¶
Video walkthrough¶
- Video placeholder: replace `VIDEO_URL_HERE` with the published walkthrough URL.
- Suggested embed target: `VIDEO_URL_HERE`
- The walkthrough should show seeded application data on the source lane, the failover run into GCP, checksum verification on the recovery leader, and the controlled return into the on-prem drill lane.
Screenshots¶
Add these files when they are ready:
Architecture¶
- Primary lane: three-node PostgreSQL HA on-prem.
- Recovery lane: GCP restore cluster built from pgBackRest-backed backups.
- Control plane: DNS cutover and decision logic run separately from the database lane itself.
- Return path: an isolated on-prem lane restores from the GCP-backed repository without disturbing the live primary lane.
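Because both the GCP recovery lane and the on-prem return path restore from the same object-storage repository, the backup configuration is the hinge of the whole cycle. A hypothetical pgBackRest configuration for a GCS-backed repository might look roughly like this; the stanza name, bucket, key path, and data directory are all illustrative assumptions:

```ini
# Hypothetical pgbackrest.conf; all names are illustrative.
[global]
repo1-type=gcs
repo1-gcs-bucket=example-dr-backups
repo1-gcs-key=/etc/pgbackrest/gcs-key.json
repo1-path=/pgbackrest
repo1-retention-full=2

[demo-ha]
pg1-path=/var/lib/postgresql/16/main
```

With the repository in object storage, neither lane needs a permanently running peer database to restore from.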
Implementation highlights¶
- Primary blueprint: `onprem/postgresql-ha@v1`
- Failover blueprint: `dr/postgresql-ha-failover-gcp@v1`
- Failback blueprint: `dr/postgresql-ha-failback-onprem@v1`
- Supporting runbooks:
    - Failover to GCP
    - Failback on-prem
    - Repeatable app-data drill
    - Cleanup and destroy flow
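The restore step behind the failover and failback runbooks can be sketched as a command builder; the stanza name `demo-ha` and the data directory are assumptions, not the project's real identifiers:

```shell
# Compose the pgBackRest delta restore command used on a recovery node.
# Stanza and data-directory values are hypothetical.
build_restore_cmd() {
  stanza=$1; pgdata=$2
  printf 'pgbackrest --stanza=%s --delta restore --pg1-path=%s' "$stanza" "$pgdata"
}

# On the recovery node the command would then be executed, e.g.:
#   eval "$(build_restore_cmd demo-ha /var/lib/postgresql/16/main)"
```

The `--delta` flag reuses files already present in the target directory, which keeps repeated drill runs fast.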
Evidence and run records¶
Relevant run records typically live under:
- `envs/dev/logs/module/platform__postgresql-ha/...`
- `envs/dev/logs/module/platform__postgresql-ha-backup/...`
- `envs/dev/logs/module/platform__network__dns-routing/...`
- `envs/drill/logs/module/platform__postgresql-ha/...`
Learning outcomes¶
- Understand how backup-driven DR differs from keeping a second always-on primary.
- See how DNS cutover, database recovery, and validation checks fit into one rehearsed cycle.
- Recognize the difference between proving infrastructure recovery and proving data fidelity.
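The way cutover, recovery, and validation gate each other in the rehearsed cycle can be illustrated with a minimal decision helper; the hostnames and the yes/no health inputs are hypothetical, standing in for whatever health and fidelity signals the control plane actually consumes:

```shell
# Flip the application's database endpoint to the recovery lane only
# when both recovery health and data fidelity have been proven.
# Hostnames are illustrative placeholders.
choose_db_endpoint() {
  recovery_healthy=$1; fidelity_ok=$2
  if [ "$recovery_healthy" = "yes" ] && [ "$fidelity_ok" = "yes" ]; then
    echo "pg-recovery.gcp.example.internal"
  else
    echo "pg-primary.onprem.example.internal"
  fi
}
```

Keeping this decision in the control plane, separate from the database lanes, is what lets the drill run without touching the live primary path.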
Reuse and extensions¶
- Extend the same pattern to a managed PostgreSQL recovery lane later if the operating model requires it.
- Reuse the same drill approach for additional application datasets or service-specific recovery paths.
Related¶
Related reading¶
- Failover PostgreSQL HA to GCP (HyOps Blueprint)
- Failback PostgreSQL HA to On-Prem (HyOps Blueprint)
- Repeatable PostgreSQL App-Data DR Drill
- Cleanup the PostgreSQL App-Data DR Proof Lanes
Status and versioning¶
Validated against the current self-managed PostgreSQL HA DR path, including isolated app-data seeding and checksum verification. Video and screenshots are still to be added.
Maintainer¶
- Owner: HybridOps
- Primary contact: platform-docs
- Last reviewed: 2026-03-09