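Integrating Ceph Dashboard with Automated Monitoring (Prometheus + Alertmanager)

🔰 Introduction

In enterprise Ceph storage environments, node counts and data volumes grow quickly,
and the CLI and manual checks alone are no longer enough to keep the system stable.

Ceph Dashboard, combined with Prometheus and Alertmanager,
provides real-time visualization, metrics collection, and event alerting.
Together they form a complete observability platform
that helps system administrators achieve:

Early failure warning
Performance bottleneck analysis
Resource utilization trend tracking
Automated remediation and notification

This article covers the Ceph Dashboard architecture, how it integrates with Prometheus,
and how to use Alertmanager for alert routing and automated operations.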

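🧩 1. Ceph Dashboard Architecture Overview

1️⃣ Architecture at a Glance

Since the Mimic (13.x) release, Ceph has shipped the current Dashboard as a built-in MGR module
that provides a web-based management and visualization interface.

The overall architecture looks like this: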

           ┌───────────────────────────┐
           │      Ceph Dashboard       │
           │   (built into ceph-mgr)   │
           └─────────────┬─────────────┘
                         │ REST API / Metrics Export
                         ▼
           ┌───────────────────────────┐
           │        Prometheus         │
           │    (Metrics Collector)    │
           └─────────────┬─────────────┘
                         │ Alerts / Rules
                         ▼
           ┌───────────────────────────┐
           │       Alertmanager        │
           │ (notification/automation) │
           └───────────────────────────┘

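The Dashboard itself provides:

Cluster health monitoring
Visualization of OSD / MON / MGR status
Pool capacity statistics
Integration with the Prometheus monitoring module
User and role-based access control (RBAC)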

⚙️ 2. Enabling Ceph Dashboard

Enable the module from any Ceph node that has admin credentials:

ceph mgr module enable dashboard

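Create an administrator account. Recent Ceph releases read the password from a file
rather than from the command line, so that it does not end up in the shell history:

ceph dashboard ac-user-create admin -i <password-file> administrator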

Enable HTTPS. A self-signed certificate is enough for testing; use a CA-signed certificate in production:

ceph dashboard create-self-signed-cert
ceph config set mgr mgr/dashboard/ssl true
ceph config set mgr mgr/dashboard/ssl_server_port 8443
systemctl restart ceph-mgr@<node>

Once the MGR has restarted, open the following URL and log in to the management interface:

https://<mgr-node-ip>:8443

📈 3. Collecting Metrics with Prometheus

1️⃣ Enable the Prometheus Module

ceph mgr module enable prometheus

Check the exporter status:

ceph mgr services

Sample output:

{
    "dashboard": "https://10.0.0.11:8443/",
    "prometheus": "http://10.0.0.11:9283/"
}

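Prometheus can now scrape http://<mgr-node>:9283/metrics
at regular intervals to collect the cluster's operational metrics, including:

OSD health, latency, and I/O volume
MON quorum status
Pool utilization
Performance statistics for RBD, CephFS, and RGW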
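
To make the endpoint's output more concrete, here is a minimal Python sketch that parses the Prometheus text exposition format and lists OSDs reported as down. The SAMPLE text is a hypothetical, heavily shortened excerpt, and the helper names (parse_metrics, down_osds, fetch_metrics) are illustrative only. The metric names ceph_health_status and ceph_osd_up are assumed to match what your Ceph release exports, so check them against your own /metrics output.

```python
import re
from urllib.request import urlopen

# Hypothetical, shortened sample of the /metrics output (real output is far longer).
SAMPLE = """\
# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 0.0
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 0.0
"""

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def down_osds(samples):
    """Names of OSDs whose ceph_osd_up sample is 0 (assumed metric name)."""
    return sorted(
        labels.get("ceph_daemon", "unknown")
        for name, labels, value in samples
        if name == "ceph_osd_up" and value == 0.0
    )


def fetch_metrics(url, timeout=5):
    """Fetch the raw metrics text from an MGR endpoint such as http://<mgr-node>:9283/metrics."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Against a live cluster, down_osds(parse_metrics(fetch_metrics(url))) would give a quick sanity check that the exporter is serving the data you expect before wiring it into Prometheus.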

2️⃣ Example Prometheus Configuration

Edit prometheus.yml:

scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets: ['10.0.0.11:9283']

Restart Prometheus:

systemctl restart prometheus
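
In clusters with more than one MGR, the active role can move to another host after a failover, so it is common to list every MGR host as a scrape target. The sketch below renders such a scrape_configs fragment from a host list. The function name render_scrape_config and the second address 10.0.0.12 are hypothetical, and honor_labels: true is included on the assumption that you want to keep the labels Ceph sets on its own metrics; drop it if that does not fit your setup.

```python
def render_scrape_config(mgr_hosts, port=9283, job_name="ceph"):
    """Render a prometheus.yml scrape_configs fragment for the given MGR hosts."""
    if not mgr_hosts:
        raise ValueError("at least one MGR host is required")
    targets = ", ".join(f"'{host}:{port}'" for host in mgr_hosts)
    return (
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        "    honor_labels: true\n"  # assumption: keep labels exported by Ceph
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


# Example: the MGR from this article plus a hypothetical standby MGR.
EXAMPLE_CONFIG = render_scrape_config(["10.0.0.11", "10.0.0.12"])
```

Printing EXAMPLE_CONFIG yields a fragment that can be pasted into prometheus.yml in place of the single-target example.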

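📊 4. Grafana Visualization (Optional)

After installing Grafana, you can import a commonly used Ceph cluster dashboard template from grafana.com (ID: 2842):
1️⃣ Log in to Grafana → Import Dashboard
2️⃣ Select Prometheus as the Data Source
3️⃣ The dashboard then shows, for example:

Pool capacity utilization
OSD IOPS / latency graphs
Cluster health overview

📊 This lets operators follow cluster state and trends from a single view.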

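🔔 5. Alerting with Alertmanager

1️⃣ Connect Ceph Dashboard to Alertmanager

Alert rules are evaluated by Prometheus, which forwards firing alerts to Alertmanager
(configured in the alerting section of prometheus.yml).
To let Ceph Dashboard display and manage those alerts, point it at the Alertmanager API:

ceph dashboard set-alertmanager-api-host 'http://10.0.0.20:9093'

Note that the MGR alerts module (ceph mgr module enable alerts) is a separate, simpler mechanism
that emails cluster health alerts over SMTP; it does not send anything to Alertmanager.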

2️⃣ Example Alertmanager Notification Configuration

alertmanager.yml:

route:
  receiver: 'email-alert'

receivers:
  - name: 'email-alert'
    email_configs:
      - to: 'itops@nuface.tw'
        from: 'ceph-monitor@nuface.tw'
        smarthost: 'smtp.nuface.tw:587'
        auth_username: 'ceph-monitor@nuface.tw'
        auth_password: 'yourpassword'

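Alertmanager also has built-in integrations for Slack and generic webhooks, among others;
services without a native integration, such as LINE, can be reached through a webhook bridge.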
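
As an illustration of the webhook path, the following sketch shows a small receiver built on Python's standard library. It assumes Alertmanager's JSON webhook payload, where each entry in alerts carries status, labels, and annotations. The handler name, the listening address 127.0.0.1:5001, and the alert name used in testing are hypothetical, and a real receiver would also need authentication and error reporting.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Turn an Alertmanager webhook payload into one readable line per alert."""
    if not isinstance(payload, dict):
        return []
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=str(alert.get("status", "unknown")).upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POSTed webhook notifications and prints a summary of each alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize_alerts(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Run the receiver until interrupted (address and port are placeholders)."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

To try it, call serve() and add a receiver with a webhook_configs entry whose url points at the chosen address. The print call is where a chat message or a ticket would be created instead.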

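3️⃣ Common Ceph Alerts

Alert type	Condition	Recommended action
OSD Down	OSD down for more than 300 seconds	Check disk health and node connectivity
Pool Near Full	Pool usage above 85%	Expand capacity or clean up snapshots
MON Quorum Lost	Fewer than 2 MON nodes alive (in a 3-MON cluster)	Check the network and nodes immediately
RBD Image Error	Volume cannot be mapped or mounted	Verify the RADOS layer and network connectivity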
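
The threshold logic behind the first three rows can be sketched in a few lines of Python. This is not how Prometheus evaluates rules; it only spells out the conditions from the table. The ClusterSample structure and its field names are hypothetical, and the default thresholds (300 seconds, 85%, 2 MONs) are simply the values listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterSample:
    """Hypothetical snapshot of the values the alert conditions depend on."""
    osd_down_seconds: dict = field(default_factory=dict)  # OSD name -> seconds down (0 if up)
    pool_used_ratio: dict = field(default_factory=dict)   # pool name -> used fraction, 0.0 to 1.0
    mons_in_quorum: int = 0


def evaluate(sample, osd_down_after=300, pool_near_full=0.85, min_mons=2):
    """Return the alerts from the table that would fire for this sample."""
    alerts = []
    for osd, seconds in sorted(sample.osd_down_seconds.items()):
        if seconds > osd_down_after:
            alerts.append(f"OSD Down: {osd}")
    for pool, ratio in sorted(sample.pool_used_ratio.items()):
        if ratio > pool_near_full:
            alerts.append(f"Pool Near Full: {pool}")
    if sample.mons_in_quorum < min_mons:
        alerts.append("MON Quorum Lost")
    return alerts
```

Keeping the thresholds as parameters makes it easy to see how a stricter policy, for example warning at 75% pool usage, would change which alerts fire.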

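🧠 6. Recommendations for Unified Proxmox + Ceph Monitoring

Component	Integration method	Function
Proxmox VE	Prometheus exporter (for example, the community prometheus-pve-exporter)	VM/CT resource monitoring
Ceph MGR	Prometheus module	Storage-layer health status
Grafana	Aggregates multiple Data Sources	Unified visualization dashboards
Alertmanager	Single channel for event notifications	Automated alerting and notification
n8n / Webhook	Automated response (for example, restarting a service)	Automated remediation and incident response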

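🔒 7. Best Practices and Governance Recommendations

Run at least one MGR and one Prometheus node per Ceph cluster, and add a standby MGR where possible so monitoring survives a failover
Give every alert a severity tier (Critical / Warning / Info) with matching notification routing
Feed Dashboard and alert records into a central SIEM / log server
Review Ceph cluster health reports and historical trends on a regular schedule
Combine Ansible with webhooks to implement automated remediation workflows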
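
The tiered notification recommendation can be illustrated with a small routing table keyed on the alert's severity label. The receiver names (pager, email, log) and the fallback choice are hypothetical. In a real deployment this mapping would normally live in the Alertmanager route tree rather than in application code.

```python
# Hypothetical mapping from severity tier to notification channels.
ROUTES = {
    "critical": ["pager", "email"],
    "warning": ["email"],
    "info": ["log"],
}


def receivers_for(alert, routes=ROUTES, default=("email",)):
    """Pick notification channels from the alert's severity label.

    Alerts with a missing or unknown severity fall back to `default`
    so that nothing is silently dropped.
    """
    severity = str(alert.get("labels", {}).get("severity", "")).lower()
    return list(routes.get(severity, default))
```

Routing unknown severities to a default channel is a deliberate choice here: a mislabeled alert is then noisy rather than invisible.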

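✅ Conclusion

Ceph Dashboard + Prometheus + Alertmanager
together form a complete, enterprise-grade platform for storage observability and automated operations.

It lets you:

Track system state in real time
Get early warning of potential problems
Trigger remediation or notification workflows automatically

It also helps organizations running large-scale distributed storage
keep operations visible, controllable, and predictable.

💬 The next article will cover
"Ceph in AI Training and Data Lake Architectures: Practical Examples",
showing how to combine high-performance storage with big data computing platforms
to build an elastically scalable data foundation for AI.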