# Repeatable PostgreSQL App-Data DR Drill
Purpose: Prove the self-managed PostgreSQL DR path end to end by seeding deterministic application data, failing over to an isolated GCP lane, taking a fresh proof backup, and restoring back into a separate on-prem proof lane.
Owner: Platform engineering / SRE
Trigger: Scheduled resilience drill or release-readiness verification for the PostgreSQL HA DR lane
Impact: Creates and validates temporary proof clusters in GCP and on-prem drill lanes. The live primary database lane and live DNS remain untouched.
Severity: P2
Pre-reqs: GCP ops runner is healthy, on-prem drill source cluster is healthy, runtime vault secrets exist in dev and drill, and operators have workstation access to the Proxmox bastion and GCP IAP.
Rollback strategy: Use the dedicated cleanup runbook to destroy only the proof lanes and preserve the live primary, source drill cluster, shared DNS, and shared networking state.
## Context
This drill proves two separate claims:
- infrastructure recovery works across the shipped self-managed PostgreSQL HA failover and failback blueprints
- application data survives the full cycle, not just the cluster rebuild
The run uses three distinct lanes:
- source drill lane: existing on-prem drill PostgreSQL HA cluster at `10.12.0.41`
- GCP proof lane: isolated failover target restored into `platform/gcp/platform-vm#gcp_pg_vms_app_proof`
- on-prem failback proof lane: isolated return target restored into `platform/onprem/platform-vm#postgres_ha_vms_app_proof_drill`
The deterministic proof dataset lives in:
- `$HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql`
- `$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql`
Expected validation output for a successful run:
```
tenants=20
services=100
events=1000
tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
event_checksum=3895de221035a795e853fad560a31078
```
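If the verify output is normalized to those key=value lines, the comparison can be scripted instead of eyeballed. A minimal sketch; `check_drproof` is a hypothetical helper, not a hyops command, and it assumes the operator has already reduced the `drproof_verify.sql` output to exactly those six lines:

```shell
# Sketch: compare normalized drproof verify output against the expected values.
# check_drproof is hypothetical; it reads key=value lines on stdin.
check_drproof() {
  expected='tenants=20
services=100
events=1000
tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
event_checksum=3895de221035a795e853fad560a31078'
  actual="$(cat)"
  if [ "$actual" = "$expected" ]; then
    echo "drproof: PASS"
  else
    echo "drproof: FAIL"
    return 1
  fi
}
```

Pipe a lane's verify output through it, e.g. `ssh … 'sudo -u postgres psql' < drproof_verify.sql | check_drproof`; how to strip psql framing from the output is left to the operator.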
## Preconditions and safety checks
- Confirm the isolated source drill lane is healthy.
```shell
jq '.outputs.db_host, .outputs.cap_db_postgresql_ha' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/latest.json"
```
Expected result:
- `10.12.0.41`
- `ready`
- Confirm the GCP ops runner exists.
```shell
jq '.status' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__linux__ops-runner/instances/gcp_ops_runner_bootstrap.json"
```
Expected result:
- `"ok"`
- Confirm the required secrets exist before starting.
```shell
cd /home/user/hybridops-studio/hybridops-core
./.venv/bin/hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD \
  PG_BACKUP_GCS_SA_JSON
./.venv/bin/hyops secrets ensure --env drill \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD \
  PG_BACKUP_GCS_SA_JSON
```
- Confirm the proof overlays exist and are not pointed at the live primary lane.
```shell
sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml"
sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml"
sed -n '1,260p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml"
```
Check specifically for these isolated state instances:
- `gcp_pg_vms_app_proof`
- `postgresql_restore_gcp_app_proof`
- `postgresql_backup_run_gcp_app_proof`
- `postgres_ha_vms_app_proof_drill`
- `postgresql_restore_onprem_app_proof_drill`
- `postgresql_backup_config_onprem_app_proof_drill`
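Scanning the overlays for those names can be automated as a guard before any deploy. A sketch under the assumption that a plain-text match is sufficient; `check_instances` is an illustrative helper, not part of the toolkit:

```shell
# Sketch: assert an overlay file mentions every isolated proof instance it should.
# check_instances is hypothetical: $1 is the overlay path, remaining args are names.
check_instances() {
  file="$1"
  shift
  for name in "$@"; do
    # Fail fast on the first instance name the overlay does not mention.
    grep -q "$name" "$file" || { echo "missing isolated instance: $name"; return 1; }
  done
  echo "all instances present in $file"
}
```

For example: `check_instances "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml" gcp_pg_vms_app_proof postgresql_restore_gcp_app_proof`.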
## Steps
- Seed the deterministic application dataset on the source drill leader
Action: load the seeded proof schema and data into the isolated on-prem source drill lane.
Command or procedure:
```shell
ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql"
```
Expected result:
- `drproof_app` is recreated cleanly on `10.12.0.41`
- no changes are made to the live primary `dev` lane
Evidence:
- operator shell transcript
- `$HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql`
- Verify the source drill dataset before failover
Action: confirm row counts and checksums on the source drill leader.
Command or procedure:
```shell
ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"
```
Expected result:
- counts match `20 / 100 / 1000`
- checksums match the values listed in this runbook
Evidence:
- operator shell transcript
- `$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql`
- Restore the isolated GCP proof lane
Action: run the failover proof blueprint from the GCP runner.
Command or procedure:
```shell
cd /home/user/hybridops-studio/hybridops-core
HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env NETBOX_DB_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml" \
  --execute
```
Expected result:
- `platform/gcp/platform-vm#gcp_pg_vms_app_proof` reaches `status=ok`
- `platform/postgresql-ha#postgresql_restore_gcp_app_proof` reaches `status=ok`
- restored leader endpoint is `10.72.16.27`
Evidence:
- `$HOME/.hybridops/envs/dev/logs/runner/`
- `$HOME/.hybridops/envs/dev/state/modules/platform__gcp__platform-vm/instances/gcp_pg_vms_app_proof.json`
- `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json`
- Verify the restored GCP proof lane
Action: confirm the restored proof data on the GCP leader through IAP.
Command or procedure:
```shell
gcloud compute ssh platform-proof-pgapp-01 \
  --project hybridops-platform-prod \
  --zone europe-west2-a \
  --tunnel-through-iap \
  --command 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"
```
Expected result:
- counts match `20 / 100 / 1000`
- checksums match the source drill lane
Evidence:
- operator shell transcript
- `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json`
- Take a fresh pinned backup from the GCP proof leader
Action: run an on-demand proof backup after the GCP restore is validated.
Command or procedure:
```shell
cd /home/user/hybridops-studio/hybridops-core
HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml" \
  --execute
```
Expected result:
- `platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof` reaches `status=ok`
- a new pgBackRest backup label exists in the repository
- the backup label and timeline are published in module state for the failback step
Evidence:
- `$HOME/.hybridops/envs/dev/logs/runner/`
- `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha-backup/instances/postgresql_backup_run_gcp_app_proof.json`
- Point the failback overlay at the fresh backup-run state
Action: update the failback overlay to consume the latest backup metadata from the backup-run state.
Command or procedure:
```shell
sed -n '1,220p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml"
```
Set these fields in the failback overlay:
- `backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof`
- `backup_state_env: dev`
Leave these fields blank unless there is no backup-run state to consume:
- `restore_set`
- `restore_target_timeline`
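For orientation, the pinning fields might look like this in the overlay. Only the field names and values stated in this runbook are confirmed; their placement in the file and the empty-string form of "blank" are assumptions to verify against the real overlay:

```yaml
# Hypothetical excerpt of dr-postgresql-ha-failback-onprem-app-proof.yml
backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof
backup_state_env: dev
# Leave blank so the restore follows the backup-run state;
# set only when there is no backup-run state to consume.
restore_set: ""
restore_target_timeline: ""
```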
Expected result:
- failback overlay points at the backup-run state created in the previous step
- no stale backup label or timeline value is left behind in `restore_set` or `restore_target_timeline`
Evidence:
- operator shell transcript
- `$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml`
- Restore the isolated on-prem failback proof lane
Action: rebuild and restore the separate on-prem proof lane from the fresh GCP proof backup.
Command or procedure:
```shell
cd /home/user/hybridops-studio/hybridops-core
HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops blueprint deploy --env drill \
  --file "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml" \
  --execute
```
Expected result:
- `platform/onprem/platform-vm#postgres_ha_vms_app_proof_drill` reaches `status=ok`
- `platform/postgresql-ha#postgresql_restore_onprem_app_proof_drill` reaches `status=ok`
- restored leader endpoint is `10.12.0.51`
Evidence:
- `$HOME/.hybridops/envs/drill/logs/module/platform__onprem__platform-vm/`
- `$HOME/.hybridops/envs/drill/logs/module/platform__postgresql-ha/`
- `$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json`
- Verify the on-prem failback proof lane
Action: confirm the same dataset now exists on the isolated failback proof leader.
Command or procedure:
```shell
ssh -J root@192.168.0.27 opsadmin@10.12.0.51 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"
```
Expected result:
- counts match `20 / 100 / 1000`
- checksums match the source drill lane and GCP proof lane
Evidence:
- operator shell transcript
- `$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json`
## Verification
Confirm all three checkpoints line up:
- source drill leader: `10.12.0.41`
- GCP proof leader: `10.72.16.27`
- on-prem failback proof leader: `10.12.0.51`
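The three-way comparison can be made mechanical rather than visual. A small sketch, assuming the `drproof_verify.sql` output from each leader has been captured into shell variables; `lanes_match` is an illustrative helper:

```shell
# Sketch: succeed only when all three captured verify outputs are byte-identical.
lanes_match() {
  # $1 = source drill output, $2 = GCP proof output, $3 = on-prem failback output
  if [ "$1" = "$2" ] && [ "$2" = "$3" ]; then
    echo "all lanes match"
  else
    echo "lane mismatch"
    return 1
  fi
}
# Hypothetical usage, capturing each lane first:
#   src=$(ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' < "$VERIFY_SQL")
#   gcp=$(gcloud compute ssh platform-proof-pgapp-01 ... --command 'sudo -u postgres psql' < "$VERIFY_SQL")
#   ha=$(ssh -J root@192.168.0.27 opsadmin@10.12.0.51 'sudo -u postgres psql' < "$VERIFY_SQL")
#   lanes_match "$src" "$gcp" "$ha"
```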
Final success criteria:
- all three locations return identical counts and checksums
- `platform/postgresql-ha#postgresql_restore_gcp_app_proof` is `ok`
- `platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof` is `ok`
- `platform/postgresql-ha#postgresql_restore_onprem_app_proof_drill` is `ok`
- `platform/postgresql-ha-backup#postgresql_backup_config_onprem_app_proof_drill` is `ok`
Useful state checks:
```shell
jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json"
jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json"
```
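Those status reads can be wrapped so the drill transcript records an explicit pass or fail per instance. A sketch where `assert_ok` is a hypothetical helper fed by the jq reads above:

```shell
# Sketch: record an explicit pass/fail for one state instance.
# assert_ok is hypothetical: $1 = instance name, $2 = status value read via jq.
assert_ok() {
  if [ "$2" = "ok" ]; then
    echo "ok: $1"
  else
    echo "NOT ok: $1 (status=$2)"
    return 1
  fi
}
# Hypothetical usage:
#   assert_ok postgresql_restore_gcp_app_proof \
#     "$(jq -r '.status' "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json")"
```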
## Post-actions and clean-up
- Do not repoint live DNS from this drill.
- Do not destroy the source drill lane at `10.12.0.41` from this procedure.
- Preserve the proof logs and state until the drill report is accepted.
- When the drill is complete, use the companion cleanup flow:
- Cleanup the PostgreSQL App-Data DR Proof Lanes
## Related reading
- Cleanup the PostgreSQL App-Data DR Proof Lanes
- PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)
- Failover PostgreSQL HA to GCP (HyOps Blueprint)
- Failback PostgreSQL HA to On-Prem (HyOps Blueprint)
- Showcase – PostgreSQL HA DR Cycle
## References
- `~/.hybridops/envs/drill/config/drproof/drproof_seed.sql`
- `~/.hybridops/envs/drill/config/drproof/drproof_verify.sql`
- `~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml`
- `~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml`
- `~/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml`
Maintainer: HybridOps
License: MIT-0 for code, CC-BY-4.0 for documentation