Repeatable PostgreSQL App-Data DR Drill

Purpose: Prove the self-managed PostgreSQL DR path end to end by seeding deterministic application data, failing over to an isolated GCP lane, taking a fresh proof backup, and restoring back into a separate on-prem proof lane.
Owner: Platform engineering / SRE
Trigger: Scheduled resilience drill or release-readiness verification for the PostgreSQL HA DR lane
Impact: Creates and validates temporary proof clusters in GCP and on-prem drill lanes. The live primary database lane and live DNS remain untouched.
Severity: P2
Pre-reqs: GCP ops runner is healthy, on-prem drill source cluster is healthy, runtime vault secrets exist in dev and drill, and operators have workstation access to the Proxmox bastion and GCP IAP.
Rollback strategy: Use the dedicated cleanup runbook to destroy only the proof lanes and preserve the live primary, source drill cluster, shared DNS, and shared networking state.

Context

This drill proves two separate claims:

  • infrastructure recovery works across the shipped self-managed PostgreSQL HA failover and failback blueprints
  • application data survives the full cycle, not just the cluster rebuild

The run uses three distinct lanes:

  • source drill lane: existing on-prem drill PostgreSQL HA cluster at 10.12.0.41
  • GCP proof lane: isolated failover target restored into platform/gcp/platform-vm#gcp_pg_vms_app_proof
  • on-prem failback proof lane: isolated return target restored into platform/onprem/platform-vm#postgres_ha_vms_app_proof_drill

The deterministic proof dataset lives in:

  • $HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql
  • $HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql

Expected validation output for a successful run:

  • tenants=20
  • services=100
  • events=1000
  • tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
  • service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
  • event_checksum=3895de221035a795e853fad560a31078
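The expected values above can also be checked mechanically against a captured verify transcript instead of eyeballed. This is a minimal sketch, assuming the transcript contains plain key=value lines with the names above; adjust the grep if the real psql output is formatted differently.

```shell
#!/bin/sh
# Sketch only: compare a captured drproof_verify.sql transcript against the
# expected values in this runbook. Assumes plain key=value output lines.
check_drproof() {
  expected='tenants=20
services=100
events=1000
tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
event_checksum=3895de221035a795e853fad560a31078'
  # Keep only the six proof lines, in a stable order, then compare as one string.
  actual=$(grep -E '^(tenants|services|events|tenant_checksum|service_checksum|event_checksum)=' "$1" | sort)
  [ "$(printf '%s\n' "$expected" | sort)" = "$actual" ]
}

# usage: check_drproof /path/to/captured_verify.txt && echo "dataset ok"
```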

Preconditions and safety checks

  1. Confirm the isolated source drill lane is healthy.

jq '.outputs.db_host, .outputs.cap_db_postgresql_ha' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/latest.json"

Expected result:

  • 10.12.0.41
  • ready
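Both values can be asserted in a single guard rather than read by hand; a sketch, assuming jq is available on the workstation (the state path and expected values are the ones above):

```shell
#!/bin/sh
# Sketch: fail fast unless a postgresql-ha state file reports the expected
# drill leader address and capability flag. Takes the state file path as an
# argument so it can be pointed at any lane's latest.json.
check_source_lane() {
  host=$(jq -r '.outputs.db_host' "$1")
  cap=$(jq -r '.outputs.cap_db_postgresql_ha' "$1")
  [ "$host" = "10.12.0.41" ] && [ "$cap" = "ready" ]
}

# check_source_lane "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/latest.json" \
#   || echo "source drill lane not ready" >&2
```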

  2. Confirm the GCP ops runner exists.

jq '.status' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__linux__ops-runner/instances/gcp_ops_runner_bootstrap.json"

Expected result:

  • "ok"

  3. Confirm the required secrets exist before starting.

cd /home/user/hybridops-studio/hybridops-core

./.venv/bin/hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD \
  PG_BACKUP_GCS_SA_JSON

./.venv/bin/hyops secrets ensure --env drill \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD \
  PG_BACKUP_GCS_SA_JSON

  4. Confirm the proof overlays exist and are not pointed at the live primary lane.

sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml"
sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml"
sed -n '1,260p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml"

Check specifically for these isolated state instances:

  • gcp_pg_vms_app_proof
  • postgresql_restore_gcp_app_proof
  • postgresql_backup_run_gcp_app_proof
  • postgres_ha_vms_app_proof_drill
  • postgresql_restore_onprem_app_proof_drill
  • postgresql_backup_config_onprem_app_proof_drill
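The list above can be checked mechanically as well; a sketch that only greps each overlay for the instance names, so it catches a missing instance but not a mispointed one — still read the overlays themselves.

```shell
#!/bin/sh
# Sketch: verify each required isolated instance name appears somewhere in
# the given overlay file. Usage: check_instances <overlay.yml> <instance>...
check_instances() {
  file="$1"; shift
  rc=0
  for inst in "$@"; do
    grep -q "$inst" "$file" || { echo "missing: $inst in $file" >&2; rc=1; }
  done
  return "$rc"
}

# check_instances "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml" \
#   gcp_pg_vms_app_proof postgresql_restore_gcp_app_proof
```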

Steps

  1. Seed the deterministic application dataset on the source drill leader

Action: load the seeded proof schema and data into the isolated on-prem source drill lane.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql"

Expected result:

  • drproof_app is recreated cleanly on 10.12.0.41
  • no changes are made to the live primary dev lane

Evidence:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/drproof/drproof_seed.sql

  2. Verify the source drill dataset before failover

Action: confirm row counts and checksums on the source drill leader.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the values listed in this runbook

Evidence:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql

  3. Restore the isolated GCP proof lane

Action: run the failover proof blueprint from the GCP runner.

Command or procedure:

cd /home/user/hybridops-studio/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env NETBOX_DB_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml" \
  --execute

Expected result:

  • platform/gcp/platform-vm#gcp_pg_vms_app_proof reaches status=ok
  • platform/postgresql-ha#postgresql_restore_gcp_app_proof reaches status=ok
  • restored leader endpoint is 10.72.16.27

Evidence:

  • $HOME/.hybridops/envs/dev/logs/runner/
  • $HOME/.hybridops/envs/dev/state/modules/platform__gcp__platform-vm/instances/gcp_pg_vms_app_proof.json
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json

  4. Verify the restored GCP proof lane

Action: confirm the restored proof data on the GCP leader through IAP.

Command or procedure:

gcloud compute ssh platform-proof-pgapp-01 \
  --project hybridops-platform-prod \
  --zone europe-west2-a \
  --tunnel-through-iap \
  --command 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the source drill lane

Evidence:

  • operator shell transcript
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json

  5. Take a fresh pinned backup from the GCP proof leader

Action: run an on-demand proof backup after the GCP restore is validated.

Command or procedure:

cd /home/user/hybridops-studio/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml" \
  --execute

Expected result:

  • platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof reaches status=ok
  • a new pgBackRest backup label exists in the repository
  • the backup label and timeline are published in module state for the failback step

Evidence:

  • $HOME/.hybridops/envs/dev/logs/runner/
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha-backup/instances/postgresql_backup_run_gcp_app_proof.json
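To confirm the label and timeline actually landed in module state before moving on, a sketch; note that the `.outputs.backup_label` and `.outputs.backup_timeline` key names below are assumptions, not confirmed field names — check the real state file for the exact keys, and jq must be available.

```shell
#!/bin/sh
# Sketch: assert that a backup-run state file publishes a non-empty backup
# label and timeline. NOTE: the output key names used here are assumptions.
check_backup_published() {
  label=$(jq -r '.outputs.backup_label // empty' "$1")
  timeline=$(jq -r '.outputs.backup_timeline // empty' "$1")
  [ -n "$label" ] && [ -n "$timeline" ]
}

# check_backup_published "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha-backup/instances/postgresql_backup_run_gcp_app_proof.json"
```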

  6. Point the failback overlay at the fresh backup-run state

Action: update the failback overlay to consume the latest backup metadata from the backup-run state.

Command or procedure: review the overlay with the command below, then edit it in place so the fields that follow are set.

sed -n '1,220p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml"

Set these fields in the failback overlay:

  • backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof
  • backup_state_env: dev

Leave these fields blank unless there is no backup-run state to consume:

  • restore_set
  • restore_target_timeline

Expected result:

  • failback overlay points at the backup-run state created in the previous step
  • no old backup label or timeline is left behind by mistake

Evidence:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml
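The field settings above can be smoke-tested before deploying; a sketch that assumes the overlay uses flat "key: value" YAML lines, so treat it as a quick check rather than a parser.

```shell
#!/bin/sh
# Sketch: confirm the failback overlay points at the backup-run state and
# that the manual pin fields are blank. Assumes flat "key: value" YAML lines.
check_failback_overlay() {
  grep -q 'backup_state_ref:.*postgresql_backup_run_gcp_app_proof' "$1" || return 1
  grep -q 'backup_state_env:.*dev' "$1" || return 1
  # Fail if either manual pin field carries a value after the colon.
  grep -Eq '^[[:space:]]*restore_set:[[:space:]]*[^[:space:]]' "$1" && return 1
  grep -Eq '^[[:space:]]*restore_target_timeline:[[:space:]]*[^[:space:]]' "$1" && return 1
  return 0
}

# check_failback_overlay "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml" \
#   && echo "failback overlay pinned correctly"
```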

  7. Restore the isolated on-prem failback proof lane

Action: rebuild and restore the separate on-prem proof lane from the fresh GCP proof backup.

Command or procedure:

cd /home/user/hybridops-studio/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-studio/hybridops-core \
./.venv/bin/hyops blueprint deploy --env drill \
  --file "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml" \
  --execute

Expected result:

  • platform/onprem/platform-vm#postgres_ha_vms_app_proof_drill reaches status=ok
  • platform/postgresql-ha#postgresql_restore_onprem_app_proof_drill reaches status=ok
  • restored leader endpoint is 10.12.0.51

Evidence:

  • $HOME/.hybridops/envs/drill/logs/module/platform__onprem__platform-vm/
  • $HOME/.hybridops/envs/drill/logs/module/platform__postgresql-ha/
  • $HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json

  8. Verify the on-prem failback proof lane

Action: confirm the same dataset now exists on the isolated failback proof leader.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.51 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drproof/drproof_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the source drill lane and GCP proof lane

Evidence:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json

Verification

Confirm all three checkpoints line up:

  • source drill leader: 10.12.0.41
  • GCP proof leader: 10.72.16.27
  • on-prem failback proof leader: 10.12.0.51

Final success criteria:

  • all three locations return identical counts and checksums
  • platform/postgresql-ha#postgresql_restore_gcp_app_proof is ok
  • platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_proof is ok
  • platform/postgresql-ha#postgresql_restore_onprem_app_proof_drill is ok
  • platform/postgresql-ha-backup#postgresql_backup_config_onprem_app_proof_drill is ok

Useful state checks:

jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_proof.json"

jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_proof_drill.json"
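Beyond the per-lane state checks, the three verify transcripts can be compared directly; a sketch, assuming each lane's verify output was captured to a file and is line-oriented key=value text.

```shell
#!/bin/sh
# Sketch: given three captured drproof_verify.sql transcripts, confirm all
# lanes report identical counts and checksums (only key=value lines compared).
lanes_match() {
  a=$(grep '=' "$1" | sort)
  b=$(grep '=' "$2" | sort)
  c=$(grep '=' "$3" | sort)
  [ "$a" = "$b" ] && [ "$b" = "$c" ]
}

# lanes_match source_verify.txt gcp_verify.txt onprem_verify.txt \
#   && echo "all three lanes match"
```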

Post-actions and clean-up

  • Do not repoint live DNS from this drill.
  • Do not destroy the source drill lane at 10.12.0.41 from this procedure.
  • Preserve the proof logs and state until the drill report is accepted.
  • When the drill is complete, use the companion cleanup flow: Cleanup the PostgreSQL App-Data DR Proof Lanes.

References

  • ~/.hybridops/envs/drill/config/drproof/drproof_seed.sql
  • ~/.hybridops/envs/drill/config/drproof/drproof_verify.sql
  • ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-proof.yml
  • ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-proof.yml
  • ~/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-proof.yml

Maintainer: HybridOps
License: MIT-0 for code, CC-BY-4.0 for documentation