Introduction
Traditional disaster recovery (DR) procedures often rely on manual intervention:
administrators receive alerts, log into systems, locate backups, and manually trigger restores.
While this approach works in theory, during real-world incidents (such as data center outages or ransomware attacks)
manual recovery is slow, inconsistent, and error-prone.
Modern enterprises are shifting to Automated Disaster Recovery (ADR),
where systems detect, respond, and recover automatically based on defined events and policies.
This article covers:
1. The architecture and workflow of ADR
2. Integrating Proxmox VE, PBS, and APIs with Ansible and Terraform
3. Real-world automation examples for recovery orchestration
1. Automated Disaster Recovery (ADR) Architecture Overview
Architecture Diagram
+--------------------------------------------------------+
|                    Monitoring Layer                    |
| Prometheus -> Grafana -> Alertmanager -> Webhook/Slack |
+---------------------------+----------------------------+
                            | Event Trigger
           +----------------+-----------------+
           |       Orchestration Layer        |
           | Ansible / Terraform / N8N / AWX  |
           +----------------+-----------------+
                            | Automated Execution
           +----------------+-----------------+
           |         Execution Layer          |
           |  Proxmox VE + PBS + API + Ceph   |
           +----------------------------------+
2. ADR Workflow Concept
| Stage | Trigger Source | Action | Tool |
|---|---|---|---|
| 1. Event Detection | Prometheus / Grafana | Detect node or PBS outage | Alertmanager |
| 2. Notification | Webhook / Slack / Email | Notify administrators and the automation system | Alertmanager / Webhook |
| 3. Orchestration | Trigger Playbook / Script | Execute recovery workflow | Ansible / Terraform |
| 4. Data Recovery | Restore from remote PBS / Cloud backup | Recreate VMs, networks, and services | Proxmox API + PBS |
| 5. Verification | Validate recovery status | Confirm and report completion | API + Grafana |
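The five stages in the table run strictly in order, and a later stage only makes sense if the earlier ones succeeded. A minimal Python sketch of that sequencing follows. The `run_workflow` function and the handler callables are hypothetical names introduced here for illustration; in a real deployment each handler would wrap an Alertmanager, Ansible, or Proxmox API call.

```python
from typing import Callable, Dict, List, Tuple

# Stage names taken from the workflow table; everything else is illustrative.
STAGES = [
    "Event Detection",
    "Notification",
    "Orchestration",
    "Data Recovery",
    "Verification",
]


def run_workflow(
    handlers: Dict[str, Callable[[dict], bool]],
    context: dict,
) -> List[Tuple[str, str]]:
    """Run each stage in order and stop at the first stage that does not succeed."""
    results: List[Tuple[str, str]] = []
    for stage in STAGES:
        handler = handlers.get(stage)
        if handler is None:
            # A stage without a handler is treated as a failure, not skipped.
            results.append((stage, "missing"))
            break
        succeeded = handler(context)
        results.append((stage, "ok" if succeeded else "failed"))
        if not succeeded:
            break
    return results
```

Stopping at the first failed stage keeps the verification step from reporting on a restore that never ran.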
3. Proxmox API + Ansible Integration Example
Proxmox's RESTful API exposes nearly all operations,
making it ideal for integration with Ansible to automate DR processes.
1. Ansible Inventory and Variables
/etc/ansible/hosts
[pve_cluster]
pve-node01 ansible_host=10.0.0.11
pve-node02 ansible_host=10.0.0.12
Variable definitions:
proxmox_api_url: "https://10.0.0.11:8006/api2/json"
proxmox_user: "root@pam"
proxmox_token_id: "dr-automation"
proxmox_token_secret: "xxxxxxx"
2. Automated Recovery Playbook
---
- name: Proxmox Automated VM Recovery
  hosts: localhost
  gather_facts: no
  tasks:
    - name: Restore VM from remote PBS
      uri:
        url: "{{ proxmox_api_url }}/nodes/pve-node02/qemu"
        method: POST
        headers:
          # API token format: PVEAPIToken=USER@REALM!TOKENID=SECRET
          Authorization: "PVEAPIToken={{ proxmox_user }}!{{ proxmox_token_id }}={{ proxmox_token_secret }}"
        body_format: json
        body:
          vmid: 301
          # The API takes the backup volume in "archive". The value below is a
          # placeholder; use a real volume ID (<storage>:backup/vm/<vmid>/<timestamp>).
          archive: "pbs:remote-pbs/vm-301"
          unique: 1
          pool: "production"
      register: restore_result
    - name: Print restore job status
      debug:
        var: restore_result
This playbook can be triggered by Alertmanager, AWX, or a webhook,
automatically restoring a VM at a remote site with no manual action.
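To make the request structure explicit, here is a small Python sketch of how the URL and the API token header from the playbook fit together. It only builds the request and sends nothing. `build_restore_request` is a hypothetical helper, and the restore parameters are passed through unchanged, so it makes no claim about which body fields a given Proxmox version accepts.

```python
from typing import Any, Dict


def build_restore_request(
    api_url: str,
    node: str,
    user: str,
    token_id: str,
    token_secret: str,
    params: Dict[str, Any],
) -> Dict[str, Any]:
    """Assemble the URL, headers, and body for a POST to /nodes/<node>/qemu."""
    # Token header layout: PVEAPIToken=USER@REALM!TOKENID=SECRET
    auth = f"PVEAPIToken={user}!{token_id}={token_secret}"
    return {
        "method": "POST",
        "url": f"{api_url.rstrip('/')}/nodes/{node}/qemu",
        "headers": {"Authorization": auth},
        "body": dict(params),  # copy so the caller's dict is not mutated
    }


if __name__ == "__main__":
    request = build_restore_request(
        api_url="https://10.0.0.11:8006/api2/json",
        node="pve-node02",
        user="root@pam",
        token_id="dr-automation",
        token_secret="xxxxxxx",
        params={"vmid": 301, "unique": 1, "pool": "production"},
    )
    print(request["url"])
```

Keeping request construction separate from the HTTP call makes the DR logic easy to unit test without a live cluster.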
4. Terraform Integration for Automated Rebuild
Terraform can automate infrastructure provisioning at remote or cloud DR sites.
Terraform Example
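provider "proxmox" {
  pm_api_url          = "https://10.0.0.11:8006/api2/json"
  # With token authentication, the token ID carries the user: USER@REALM!TOKENID
  pm_api_token_id     = "root@pam!dr-automation"
  pm_api_token_secret = "xxxxxxx"
}

resource "proxmox_vm_qemu" "dr_vm" {
  name        = "dr-webserver"
  target_node = "pve-node02"
  clone       = "ubuntu-template"
  cores       = 4
  memory      = 8192

  # Disk and network block schemas differ between provider releases;
  # check the documentation for the version you pin.
  disk {
    size    = "40G"
    storage = "local-lvm"
  }

  network {
    bridge = "vmbr0"
  }
}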
Execute the automated deployment:
terraform init
terraform apply -auto-approve
This process can automatically provision standby infrastructure
after PBS synchronization completes.
5. Alertmanager + N8N Workflow Example
Workflow Diagram
[Prometheus]
      v
[Grafana Alert]
      v
[Alertmanager]
      v (Webhook Trigger)
[N8N / Ansible Playbook]
      v
[Proxmox API -> PBS -> Restore VM]
      v
[Slack Notification / Report Delivery]
Example Webhook Payload
When Prometheus detects a node failure, Alertmanager sends:
{
  "receiver": "proxmox-dr",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "PVE_Node_Down",
        "instance": "pve-node01",
        "severity": "critical"
      },
      "annotations": {
        "description": "Proxmox node pve-node01 is unreachable"
      }
    }
  ]
}
N8N or Ansible parses this payload and automatically executes the DR workflow.
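As an illustration of that parsing step, the following Python sketch extracts the affected nodes from a payload shaped like the example above. `nodes_to_recover` is a hypothetical helper. It assumes that only firing alerts named `PVE_Node_Down` with `severity` set to `critical` should start a recovery, a policy choice made here for the example.

```python
import json
from typing import List


def nodes_to_recover(payload: dict, alertname: str = "PVE_Node_Down") -> List[str]:
    """Return the instances that should trigger the DR workflow, without duplicates."""
    if payload.get("status") != "firing":
        return []
    nodes: List[str] = []
    for alert in payload.get("alerts", []):
        # A per-alert status, when present, must also be "firing".
        if alert.get("status", "firing") != "firing":
            continue
        labels = alert.get("labels", {})
        if labels.get("alertname") != alertname:
            continue
        if labels.get("severity") != "critical":
            continue
        instance = labels.get("instance")
        if instance and instance not in nodes:
            nodes.append(instance)
    return nodes


if __name__ == "__main__":
    sample = json.loads("""
    {
      "receiver": "proxmox-dr",
      "status": "firing",
      "alerts": [
        {"labels": {"alertname": "PVE_Node_Down",
                    "instance": "pve-node01",
                    "severity": "critical"}}
      ]
    }
    """)
    print(nodes_to_recover(sample))  # -> ['pve-node01']
```

Filtering on both the alert name and the severity guards against an unrelated or low-severity alert starting a restore.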
6. Automated Validation and Reporting
After a recovery job completes, automatically verify VM status:
pvesh get /nodes/pve-node02/qemu/301/status/current --output-format json
If the JSON output includes:
"status":"running"
the restore succeeded.
Then report back via API or notification:
VM 301 successfully restored from remote PBS and started.
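A restored VM may take a moment to start, so a single status check can report a false failure. The Python sketch below polls until the status is `running` or the attempts run out, then builds the report message. `wait_until_running`, `fetch_status`, and `build_report` are hypothetical names. `fetch_status` stands in for whatever returns the parsed status, for example a wrapper around the `pvesh` command or an API call.

```python
import time
from typing import Callable


def wait_until_running(
    fetch_status: Callable[[], dict],
    attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status until it reports "running" or attempts are exhausted."""
    for attempt in range(attempts):
        if fetch_status().get("status") == "running":
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next check
    return False


def build_report(vmid: int, running: bool) -> str:
    """Build the notification text for the recovery result."""
    if running:
        return f"VM {vmid} successfully restored from remote PBS and started."
    return f"VM {vmid} restore could not be verified: VM is not running."
```

The `sleep` parameter is injectable so the polling loop can be tested without real delays.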
7. Multi-Region Cloud Orchestration
Extend your DR automation beyond on-premises infrastructure:
deploy across multiple regions and clouds for full hybrid orchestration.
| Cloud / Region | Automated Task | Tool |
|---|---|---|
| AWS | Launch temporary EC2 nodes and attach PBS S3 backups | Terraform / AWS CLI |
| Azure | Activate Blob snapshot as backup source | Azure Functions |
| GCP | Use Cloud Run or Cloud Scheduler to trigger DR workflows | N8N / API |
| On-Prem | Automatically restart or rebuild Proxmox nodes | Ansible / API |
Conclusion
Through the combination of Proxmox VE + PBS + API + Automation Tools,
enterprises can establish a fully autonomous disaster recovery system capable of:
- Real-time failure detection
- Automated VM and data recovery
- Cross-site replication
- Cloud-based orchestration
This framework not only reduces human error and response time
but also elevates the resilience and continuity of enterprise IT operations.
In the next and final article of the Proxmox Enterprise Series,
"Proxmox Enterprise Governance Framework and Best Practices,"
we'll consolidate virtualization, backup, security, and cloud strategies
into a complete enterprise-grade open virtualization governance blueprint.