PostgreSQL HA DR Cycle¶
Executive summary¶
- Runs a three-node PostgreSQL HA cluster on-prem using Patroni and etcd, then rehearses failover and controlled return through separate recovery lanes.
- Uses pgBackRest object storage as the recovery backbone so the platform does not depend on a permanently active duplicate database estate.
- Proves both infrastructure recovery and application-data fidelity with seeded row counts and checksum validation.
- Keeps the live primary lane isolated while recovery logic is exercised elsewhere.
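The row-count-plus-checksum validation mentioned above can be sketched as a small shell helper. The table name `drill_rows`, the key column `id`, and the psql invocation in the usage comment are illustrative assumptions, not the project's actual drill code:

```shell
# Build the SQL used on both leaders: row count plus an order-stable
# checksum over every row's text representation. Table and key column
# are hypothetical names.
build_fidelity_query() {
  table=$1; keycol=$2
  printf "SELECT count(*) || ':' || md5(string_agg(md5(t::text), '' ORDER BY %s)) FROM %s t;" \
    "$keycol" "$table"
}

# Fidelity holds only when both results are non-empty and identical.
fidelity_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage against real lanes (connection strings assumed):
#   q=$(build_fidelity_query drill_rows id)
#   src=$(psql "$SRC_DSN" -Atc "$q")
#   dst=$(psql "$DST_DSN" -Atc "$q")
#   fidelity_matches "$src" "$dst" || exit 1
```

Ordering the aggregate by the key column keeps the checksum stable across lanes regardless of physical row order.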
Case study – how this was used in practice¶
- Context: The platform needed a DR path that could be rehearsed without paying for a full duplicate cloud estate all year.
- Challenge: Failover alone was not enough; the recovery path had to preserve application data and support a controlled return on-prem.
- Approach: The HA cluster, backup flow, DNS cutover, failover, and failback were exercised as separate but connected blueprints.
- Outcome: The platform now has a repeatable DR cycle that can be run in isolated lanes without compromising the live database path.
Demo¶
Video walkthrough¶
- Video placeholder: replace `VIDEO_URL_HERE` with the published walkthrough URL.
- Suggested embed target: `VIDEO_URL_HERE`
- The walkthrough should show seeded application data on the source lane, the failover run into GCP, checksum verification on the recovery leader, and the controlled return into the on-prem drill lane.
Screenshots¶
Add these files when they are ready:
Architecture¶
- Primary lane: three-node PostgreSQL HA on-prem.
- Recovery lane: GCP restore cluster built from pgBackRest-backed backups.
- Control plane: DNS cutover and decision logic run separately from the database lane itself.
- Return path: an isolated on-prem lane restores from the GCP-backed repository without disturbing the live primary lane.
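Because both the GCP recovery lane and the on-prem return path restore from the same object-storage repository, the backup configuration is the hinge of the whole cycle. A hypothetical pgBackRest configuration for a GCS-backed repository might look roughly like this; the stanza name, bucket, key path, and data directory are all illustrative assumptions:

```ini
# Hypothetical pgbackrest.conf; all names are illustrative.
[global]
repo1-type=gcs
repo1-gcs-bucket=example-dr-backups
repo1-gcs-key=/etc/pgbackrest/gcs-key.json
repo1-path=/pgbackrest
repo1-retention-full=2

[demo-ha]
pg1-path=/var/lib/postgresql/16/main
```

With the repository in object storage, neither lane needs a permanently running peer database to restore from.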
Implementation highlights¶
- Primary blueprint: `onprem/postgresql-ha@v1`
- Failover blueprint: `dr/postgresql-ha-failover-gcp@v1`
- Failback blueprint: `dr/postgresql-ha-failback-onprem@v1`
- Supporting runbooks:
    - Failover to GCP
    - Failback on-prem
    - Repeatable app-data drill
    - Cleanup and destroy flow
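The restore step behind the failover and failback runbooks can be sketched as a command builder; the stanza name `demo-ha` and the data directory are assumptions, not the project's real identifiers:

```shell
# Compose the pgBackRest delta restore command used on a recovery node.
# Stanza and data-directory values are hypothetical.
build_restore_cmd() {
  stanza=$1; pgdata=$2
  printf 'pgbackrest --stanza=%s --delta restore --pg1-path=%s' "$stanza" "$pgdata"
}

# On the recovery node the command would then be executed, e.g.:
#   eval "$(build_restore_cmd demo-ha /var/lib/postgresql/16/main)"
```

The `--delta` flag reuses files already present in the target directory, which keeps repeated drill runs fast.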
Evidence and run records¶
Relevant run records typically live under:
- `envs/dev/logs/module/platform__postgresql-ha/...`
- `envs/dev/logs/module/platform__postgresql-ha-backup/...`
- `envs/dev/logs/module/platform__network__dns-routing/...`
- `envs/drill/logs/module/platform__postgresql-ha/...`
Learning outcomes¶
- Understand how backup-driven DR differs from keeping a second always-on primary.
- See how DNS cutover, database recovery, and validation checks fit into one rehearsed cycle.
- Recognize the difference between proving infrastructure recovery and proving data fidelity.
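The way cutover, recovery, and validation gate each other in the rehearsed cycle can be illustrated with a minimal decision helper; the hostnames and the yes/no health inputs are hypothetical, standing in for whatever health and fidelity signals the control plane actually consumes:

```shell
# Flip the application's database endpoint to the recovery lane only
# when both recovery health and data fidelity have been proven.
# Hostnames are illustrative placeholders.
choose_db_endpoint() {
  recovery_healthy=$1; fidelity_ok=$2
  if [ "$recovery_healthy" = "yes" ] && [ "$fidelity_ok" = "yes" ]; then
    echo "pg-recovery.gcp.example.internal"
  else
    echo "pg-primary.onprem.example.internal"
  fi
}
```

Keeping this decision in the control plane, separate from the database lanes, is what lets the drill run without touching the live primary path.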
Reuse and extensions¶
- Extend the same pattern to a managed PostgreSQL recovery lane later if the operating model requires it.
- Reuse the same drill approach for additional application datasets or service-specific recovery paths.
Related¶
Related reading¶
- Failover PostgreSQL HA to GCP (HyOps Blueprint)
- Failback PostgreSQL HA to On-Prem (HyOps Blueprint)
- Repeatable PostgreSQL App-Data DR Drill
- Cleanup the PostgreSQL App-Data DR Proof Lanes
Status and versioning¶
Validated against the current self-managed PostgreSQL HA DR path, including isolated app-data seeding and checksum verification. Video and screenshots are still to be added.
Maintainer¶
- Owner: HybridOps
- Primary contact: platform-docs
- Last reviewed: 2026-03-09