Proxmox Automated Disaster Recovery and Cloud Orchestration Implementation

🔰 Introduction

Traditional disaster recovery (DR) procedures often rely on manual intervention:
administrators receive alerts, log into systems, locate backups, and manually trigger restores.

While this approach works in theory, during real-world incidents — such as data center outages or ransomware attacks —
manual recovery is slow, inconsistent, and error-prone.

Modern enterprises are shifting to Automated Disaster Recovery (ADR),
where systems detect, respond, and recover automatically based on defined events and policies.

This article covers:
1️⃣ The architecture and workflow of ADR
2️⃣ Integrating Proxmox VE, PBS, and APIs with Ansible and Terraform
3️⃣ Real-world automation examples for recovery orchestration

🧩 1. Automated Disaster Recovery (ADR) Architecture Overview

Architecture Diagram

 ┌──────────────────────────────────────────────────────┐
 │                 Monitoring Layer                     │
 │ Prometheus → Grafana → Alertmanager → Webhook/Slack   │
 └───────────────┬──────────────────────────────────────┘
                 │  Event Trigger
 ┌────────────────┴────────────────┐
 │            Orchestration Layer  │
 │ Ansible / Terraform / N8N / AWX │
 └──────────────┬──────────────────┘
                 │  Automated Execution
 ┌────────────────┴────────────────┐
 │           Execution Layer       │
 │  Proxmox VE + PBS + API + Ceph  │
 └────────────────────────────────┘

🧠 2. ADR Workflow Concept

Stage              | Trigger Source                         | Action                                    | Tool
1️⃣ Event Detection | Prometheus / Grafana                   | Detect node or PBS outage                 | Alertmanager
2️⃣ Notification    | Webhook / Slack / Email                | Notify administrators & automation system | Alertmanager / Webhook
3️⃣ Orchestration   | Trigger Playbook / Script              | Execute recovery workflow                 | Ansible / Terraform
4️⃣ Data Recovery   | Restore from remote PBS / Cloud backup | Recreate VMs, networks, and services      | Proxmox API + PBS
5️⃣ Verification    | Validate recovery status               | Confirm and report completion             | API + Grafana

⚙️ 3. Proxmox API + Ansible Integration Example

Proxmox’s RESTful API exposes nearly all operations,
making it ideal for integration with Ansible to automate DR processes.

1️⃣ Ansible Inventory and Variables

/etc/ansible/hosts

[pve_cluster]
pve-node01 ansible_host=10.0.0.11
pve-node02 ansible_host=10.0.0.12

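Variable definitions (keep the token secret in Ansible Vault or another secret store, not in plain text):

# Point the API URL at a node expected to survive the failure.
# Here pve-node02 (10.0.0.12) is the recovery target, so it also serves the API.
proxmox_api_url: "https://10.0.0.12:8006/api2/json"
proxmox_user: "root@pam"
proxmox_token_id: "dr-automation"
proxmox_token_secret: "xxxxxxx"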
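
The token variables above are combined into a single Authorization header on every API call. As a minimal sketch of that format (the function name `build_auth_header` is hypothetical; the values are the placeholders from the variable definitions):

```python
def build_auth_header(user: str, token_id: str, token_secret: str) -> dict:
    """Return the HTTP header used for Proxmox VE API-token authentication.

    Header value format: PVEAPIToken=USER@REALM!TOKENID=SECRET
    """
    if "@" not in user:
        # Proxmox user names always carry a realm, e.g. root@pam.
        raise ValueError("user must include a realm, e.g. 'root@pam'")
    return {"Authorization": f"PVEAPIToken={user}!{token_id}={token_secret}"}


if __name__ == "__main__":
    # Placeholder values taken from the variable definitions.
    print(build_auth_header("root@pam", "dr-automation", "xxxxxxx"))
```

Rejecting a user without a realm catches a common misconfiguration before the first request is sent.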

2️⃣ Automated Recovery Playbook

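---
- name: Proxmox Automated VM Recovery
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Restore VM from remote PBS
      ansible.builtin.uri:
        url: "{{ proxmox_api_url }}/nodes/pve-node02/qemu"
        method: POST
        headers:
          Authorization: "PVEAPIToken={{ proxmox_user }}!{{ proxmox_token_id }}={{ proxmox_token_secret }}"
        body_format: json
        body:
          vmid: 301
          # "archive" takes the backup volume ID on the PBS-backed storage.
          # The storage name and snapshot timestamp here are placeholders.
          archive: "remote-pbs:backup/vm/301/<snapshot-timestamp>"
          unique: 1
          # Start the VM once the restore has finished.
          start: 1
          pool: "production"
      register: restore_result

    # The API answers with a task ID (UPID), not with the final result.
    - name: Print restore task ID (UPID)
      ansible.builtin.debug:
        var: restore_result.json.data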

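This playbook can be triggered by an Alertmanager webhook, AWX, or any other webhook receiver,
automatically restoring a VM on a remote site without manual action.
Note that the API call returns a task ID (UPID) as soon as the restore is queued, not when it finishes.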
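
Because the restore request only returns a task ID, automation usually has to poll the task until it stops. The following is a rough sketch of that polling loop, assuming the task status endpoint `/nodes/{node}/tasks/{upid}/status`; the helper names `make_fetcher` and `wait_for_task` and the timeout values are illustrative, not part of any Proxmox tooling:

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable


def make_fetcher(api_url: str, node: str, upid: str, auth_header: str) -> Callable[[], dict]:
    """Build a function that fetches the current status of one task."""
    # The UPID contains colons, so it is URL-encoded before use in the path.
    url = f"{api_url}/nodes/{node}/tasks/{urllib.parse.quote(upid, safe='')}/status"

    def fetch() -> dict:
        req = urllib.request.Request(url, headers={"Authorization": auth_header})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["data"]

    return fetch


def wait_for_task(fetch_status: Callable[[], dict],
                  timeout_s: float = 3600.0,
                  interval_s: float = 5.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the task stops; True only if it ended with exitstatus OK."""
    waited = 0.0
    while True:
        task = fetch_status()
        if task.get("status") == "stopped":
            return task.get("exitstatus") == "OK"
        if waited >= timeout_s:
            raise TimeoutError("restore task did not finish in time")
        sleep(interval_s)
        waited += interval_s
```

Passing the fetch function in as a parameter keeps the loop testable without a live cluster; in production it would be called as `wait_for_task(make_fetcher(api_url, "pve-node02", upid, header))`.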

☁️ 4. Terraform Integration for Automated Rebuild

Terraform can automate infrastructure provisioning at remote or cloud DR sites.

Terraform Example

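# Assumes the community Telmate/proxmox provider, which supplies proxmox_vm_qemu.
terraform {
  required_providers {
    proxmox = {
      source = "Telmate/proxmox"
    }
  }
}

provider "proxmox" {
  # Point at a node that is expected to survive the failure.
  pm_api_url          = "https://10.0.0.12:8006/api2/json"
  # With token authentication, the token ID includes the user: USER@REALM!TOKENNAME.
  pm_api_token_id     = "root@pam!dr-automation"
  pm_api_token_secret = "xxxxxxx"
}

resource "proxmox_vm_qemu" "dr_vm" {
  name        = "dr-webserver"
  target_node = "pve-node02"
  clone       = "ubuntu-template"
  cores       = 4
  memory      = 8192

  # Block syntax follows the 2.x provider releases; 3.x uses a different disk layout.
  disk {
    type    = "scsi"
    size    = "40G"
    storage = "local-lvm"
  }
  network {
    model  = "virtio"
    bridge = "vmbr0"
  }
}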

Execute the automated deployment:

terraform init
terraform apply -auto-approve

This process can automatically provision standby infrastructure
after PBS synchronization completes.
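
Running `terraform apply -auto-approve` unattended skips any review of what will change. One possible safeguard, sketched here under the assumption that the wrapper runs in the Terraform working directory, is to run `terraform plan -detailed-exitcode` first and apply only the saved plan. The function names and the plan file name `dr.tfplan` are hypothetical:

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes-pending"}


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to an outcome; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(exit_code, "error")


def plan_then_apply(workdir: str = ".") -> str:
    """Plan first, then apply the saved plan only if changes are pending."""
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-out=dr.tfplan"],
        cwd=workdir,
    )
    outcome = classify_plan(plan.returncode)
    if outcome == "changes-pending":
        # Applying a saved plan file does not prompt for confirmation.
        subprocess.run(
            ["terraform", "apply", "-input=false", "dr.tfplan"],
            cwd=workdir,
            check=True,
        )
    return outcome
```

Applying the saved plan file ensures that exactly the changes that were planned are the ones executed.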

🧮 5. Alertmanager + N8N Workflow Example

Workflow Diagram

[Prometheus] 
   ↓
[Grafana Alert] 
   ↓
[Alertmanager]
   ↓  (Webhook Trigger)
[N8N / Ansible Playbook]
   ↓
[Proxmox API → PBS → Restore VM]
   ↓
[Slack Notification / Report Delivery]

Example Webhook Payload

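When Prometheus detects a node failure, Alertmanager sends a webhook payload like the following (abbreviated; real payloads carry additional fields such as version, groupKey, and per-alert status and timestamps):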

{
  "receiver": "proxmox-dr",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "PVE_Node_Down",
        "instance": "pve-node01",
        "severity": "critical"
      },
      "annotations": {
        "description": "Proxmox node pve-node01 is unreachable"
      }
    }
  ]
}

N8N, or a webhook receiver placed in front of Ansible such as AWX, parses this payload and automatically executes the DR workflow.
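
As an illustration of the parsing step, the sketch below extracts the instances that need recovery from such a payload. It assumes the alert name `PVE_Node_Down` and the `critical` severity label from the example; the function name `nodes_to_recover` is hypothetical:

```python
import json


def nodes_to_recover(payload: dict, alertname: str = "PVE_Node_Down") -> list:
    """Return the instances of firing, critical alerts with the given name."""
    instances = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Real payloads carry a per-alert status; fall back to the top-level one.
        status = alert.get("status", payload.get("status"))
        if (status == "firing"
                and labels.get("alertname") == alertname
                and labels.get("severity") == "critical"):
            instance = labels.get("instance")
            if instance and instance not in instances:
                instances.append(instance)
    return instances


if __name__ == "__main__":
    example = json.loads("""
    {"receiver": "proxmox-dr", "status": "firing",
     "alerts": [{"labels": {"alertname": "PVE_Node_Down",
                            "instance": "pve-node01",
                            "severity": "critical"}}]}
    """)
    print(nodes_to_recover(example))
```

Filtering on status matters because Alertmanager can also deliver resolved notifications to the same webhook, and those should not trigger a restore.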

🧰 6. Automated Validation and Reporting

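After the restore task completes, automatically verify VM status
(the VM must also have been started, either as part of the restore request or by a separate start call):

pvesh get /nodes/pve-node02/qemu/301/status/current --output-format json

If the returned JSON object contains:

"status":"running"

The restore is successful ✅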

Then report back via API or notification:

✅ VM 301 successfully restored from remote PBS and started.
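
The check and the report can be combined in a small script. This is only a sketch: it assumes it runs on a Proxmox node where `pvesh` is available, and the function names and the failure message are illustrative:

```python
import json
import subprocess


def is_vm_running(status_json: str) -> bool:
    """True if the JSON status document reports the VM as running."""
    return json.loads(status_json).get("status") == "running"


def build_report(vmid: int, running: bool) -> str:
    """Compose the notification text for the recovery result."""
    if running:
        return f"✅ VM {vmid} successfully restored from remote PBS and started."
    return f"❌ VM {vmid} is not running after the restore."


def check_vm(node: str, vmid: int) -> str:
    """Query the VM status via pvesh and return the report line."""
    result = subprocess.run(
        ["pvesh", "get", f"/nodes/{node}/qemu/{vmid}/status/current",
         "--output-format", "json"],
        check=True, capture_output=True, text=True,
    )
    return build_report(vmid, is_vm_running(result.stdout))
```

The returned line can then be posted to Slack or any other notification channel by the orchestration tool.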

🧠 7. Multi-Region Cloud Orchestration

Extend your DR automation beyond on-premises —
deploy across multiple regions and clouds for full hybrid orchestration.

Cloud / Region | Automated Task                                           | Tool
AWS            | Launch temporary EC2 nodes and attach PBS S3 backups     | Terraform / AWS CLI
Azure          | Activate Blob snapshot as backup source                  | Azure Functions
GCP            | Use Cloud Run or Cloud Scheduler to trigger DR workflows | N8N / API
On-Prem        | Automatically restart or rebuild Proxmox nodes           | Ansible / API

✅ Conclusion

Through the combination of Proxmox VE + PBS + API + Automation Tools,
enterprises can establish a fully autonomous disaster recovery system capable of:

Real-time failure detection
Automated VM and data recovery
Cross-site replication
Cloud-based orchestration

This framework not only reduces human error and response time
but also elevates the resilience and continuity of enterprise IT operations.

💬 In the next and final article of the Proxmox Enterprise Series:
“Proxmox Enterprise Governance Framework and Best Practices,”
we’ll consolidate virtualization, backup, security, and cloud strategies
into a complete enterprise-grade open virtualization governance blueprint.