Nuface Blog

Full-Stack Monitoring & Alerting for an Enterprise-Grade Mail Platform

Posted on 2025-11-21 by Rico

Mail Server Series — Part 16

After completing the architecture, deployment, filtering pipeline, archiving system, full-text search, high availability, and operational procedures of the entire mail platform, this chapter introduces the final—but critical—piece:

How to build an enterprise-grade monitoring & alerting system for your self-hosted mail infrastructure.

The reliability of a mail system depends not only on its architecture, but also on:

  • Whether issues can be detected immediately
  • Whether the environment’s health can be quantified
  • Whether risks can be predicted (disk full, queue buildup, CPU overuse)
  • Whether you can avoid the classic problem:
    “Users complain they can’t receive emails… and only then do you realize something is wrong.”

This article provides a complete DevOps-focused guide to building:

✔ Full-stack monitoring
✔ Real-time alerting
✔ Log aggregation & tracing
✔ Operational dashboards
✔ Deep observability for Docker-based mail systems


1. What Should You Monitor in a Mail Platform? (Complete Checklist)

A modern mail stack includes:

  • Postfix (SMTP)
  • Dovecot (IMAP/POP3/LMTP)
  • Amavis / ClamAV / SpamAssassin
  • MariaDB / Galera
  • Roundcube Webmail
  • Piler (archive system)
  • ManticoreSearch (full-text search)
  • Apache Reverse Proxy
  • Docker host, containers, network, storage

Monitoring should be divided into six major categories:


① Postfix (SMTP) Monitoring

| Metric | Description |
|---|---|
| mail queue size | Queue spikes indicate blockage |
| defer / bounce rate | DNS issues, blacklists, or remote failures |
| SMTP delivery latency | Delays in outbound flow |
| inbound/outbound TPS | Load forecasting |
| reject rate | Spam attack or config error |
| TLS usage rate | Security posture |
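
As a rough illustration of how the queue size metric could be collected, the sketch below counts messages per queue from the output of `postqueue -j`, which prints one JSON object per queued message on Postfix 3.1 and later. The function name `count_queue` and the sample lines are made up for this example; a real collector would read the command's output instead.

```python
import json
from collections import Counter


def count_queue(lines):
    """Count queued messages per Postfix queue from `postqueue -j` lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # "queue_name" is e.g. "active", "deferred", "incoming" or "hold"
        counts[entry.get("queue_name", "unknown")] += 1
    return counts


# Invented sample data in the shape of `postqueue -j` output.
sample = [
    '{"queue_name": "deferred", "queue_id": "A1", "recipients": []}',
    '{"queue_name": "active", "queue_id": "B2", "recipients": []}',
    '{"queue_name": "deferred", "queue_id": "C3", "recipients": []}',
]
counts = count_queue(sample)
print(dict(counts), "total:", sum(counts.values()))
```

Splitting the count by queue name helps tell a growing deferred queue (remote or DNS trouble) apart from a growing active queue (local bottleneck).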

② Dovecot (IMAP/POP3/LMTP) Monitoring

| Metric | Description |
|---|---|
| login success/fail count | Detect brute-force attacks |
| IMAP/LMTP connections | Detect exhaustion |
| I/O latency | Indicates disk bottlenecks |
| mailbox locking issues | Storage or FS issues |
| auth response time | LDAP / MariaDB problems |

③ Amavis / ClamAV / SpamAssassin Monitoring

| Metric | Description |
|---|---|
| ClamAV signature update status | Must stay fresh |
| spam hit rate | Sudden drop = SA malfunction |
| Amavis queue | Amavis blocking causes total mail freeze |
| CPU/RAM | SA may consume high CPU at peak |

④ MariaDB / Galera Monitoring

| Metric | Description |
|---|---|
| replication delay | Affects Roundcube & Dovecot auth |
| node health / flow-control | Stability of cluster |
| slow queries | Impacts all components |
| connection count | Detect leaks |
| DB size | Archive DB grows continuously |
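
For the node health and flow-control row, one possible check is sketched below. It assumes the status values have already been fetched with `SHOW GLOBAL STATUS LIKE 'wsrep_%'` and passed in as a dictionary. The function name `galera_problems` and the 0.1 flow-control threshold are assumptions for illustration, not recommended values.

```python
def galera_problems(status, max_flow_control_paused=0.1):
    """Return a list of human-readable problems for one Galera node.

    `status` maps wsrep status variable names to their values (as strings).
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready for queries")
    # wsrep_flow_control_paused is the fraction of time replication was paused
    if float(status.get("wsrep_flow_control_paused", 0)) > max_flow_control_paused:
        problems.append("flow control is pausing replication")
    return problems


healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_flow_control_paused": "0.01",
}
print(galera_problems(healthy))
```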

⑤ Piler + Manticore Monitoring

| Metric | Description |
|---|---|
| search latency | User search experience |
| RT index delay | Whether indexes are up-to-date |
| piler queue backlog | Write operations stuck |
| archive store size | Long-term data accumulation |
| indexing errors | Schema/config inconsistencies |

⑥ Host & Docker Monitoring

| Metric | Description |
|---|---|
| CPU / RAM / Load | Prevent OOM kill |
| Disk I/O | Affects IMAP & indexing |
| Network latency | SMTP/IMAP/TLS issues |
| container health | Restart loops, unhealthy state |
| filesystem capacity | Disk full → mail system collapse |

2. Recommended Full Monitoring Architecture

A robust monitoring stack should look like this:

┌───────────────────────────────┐
│        Grafana Dashboard       │  ← Visualization Layer
└───────────────┬───────────────┘
                │
        Prometheus Server
                │
┌───────────────┼────────────────────────────────────────┐
│               │                                        │
Exporter:   Postfix Exporter                 Node Exporter
            Dovecot Exporter                 Blackbox Exporter
            MariaDB Exporter                 Docker Exporter
            ClamAV Exporter                  Custom Piler/Manticore Exporter
└───────────────┴────────────────────────────────────────┘

3. Required Exporters (Recommended List)


3.1 Postfix Exporter

Monitors:

  • Queue size
  • Rejects/bounces
  • Delivery latency
  • TLS negotiation stats

Recommended:
knyar-style postfix exporter


3.2 Dovecot Exporter

Monitors:

  • Login fail rate
  • IMAP/LMTP connection count
  • Auth latency
  • Mailbox access patterns

3.3 ClamAV Exporter

Tracks:

  • signature update time
  • scan results
  • daemon uptime

3.4 MariaDB Exporter

Official exporter:

prometheus/mysqld_exporter (published on Docker Hub as prom/mysqld-exporter)

3.5 Node Exporter

Must-have for hardware monitoring.


3.6 Blackbox Exporter

Probe:

  • SMTP STARTTLS
  • SMTP AUTH
  • IMAP STARTTLS
  • HTTPS (webmail/piler)
  • Certificate expiration

3.7 Docker Exporter

Monitors:

  • restarted containers
  • unhealthy state
  • CPU/memory of containers

3.8 Custom Exporter for Piler & Manticore

Recommended metrics:

  • search latency
  • RT index lag
  • archive write delay
  • store usage growth
  • manticore query errors

(If you need one, I can write a custom exporter for your environment.)
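
To give an idea of what such a custom exporter might look like, here is a minimal sketch using only the Python standard library. It renders gauges in the Prometheus text exposition format and serves them on `/metrics`. The metric names, the `collect` callback and port 9877 are all hypothetical. How the values are actually gathered from Piler and Manticore is left out, and HELP text escaping is not handled.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus text-format gauges."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(collect):
    """Build a request handler that calls `collect()` on every scrape."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(collect()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(collect, port=9877):
    """Serve metrics until interrupted (port number is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), make_handler(collect)).serve_forever()


# Hypothetical metric names and values, for illustration only.
example = {
    "piler_queue_backlog": ("Messages waiting to be archived", 3),
    "manticore_search_latency_seconds": ("Latency of a test search query", 0.12),
}
print(render_metrics(example))
```

Calling `serve(lambda: example)` would expose these values for Prometheus to scrape. In practice the callback would query Piler and Manticore on each request.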


4. Grafana Dashboards (Suggested Layout)


Dashboard A — Mail System Overview

  • inbound/outbound TPS
  • queue depth
  • SMTP TLS usage
  • login fail trends
  • DB latency
  • piler indexing delay
  • manticore query time

Perfect for management and daily monitoring.


Dashboard B — Postfix Deep Monitoring

  • per-minute SMTP throughput
  • reject count by rule
  • per-domain statistics
  • spam attack visualization
  • TLS handshake errors

Dashboard C — Dovecot Overview

  • login fail/success ratio
  • authentication latency
  • LMTP failures
  • I/O bottleneck
  • IMAP folder access heatmap

Dashboard D — Archive (Piler + Manticore)

  • indexing rate
  • search latency distribution
  • store size trends
  • RT index memory usage
  • fragmentation warning

Dashboard E — Host & Docker Monitoring

  • CPU / load
  • memory pressure
  • disk I/O
  • container health
  • network usage

5. Alerting Rules (Enterprise-Grade)

To limit false alarms without missing real incidents, the following rules are recommended:


Postfix Alerts

Queue > 500 for over 10 minutes

Possible causes:

  • DNS outage
  • Amavis bottleneck
  • remote delivery failures
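
The "for over 10 minutes" part is what keeps short spikes from paging anyone. Prometheus alerting rules express this with a `for:` clause. The sketch below shows the same idea in plain Python. The class name `SustainedThreshold` and the sample readings are invented for this illustration.

```python
class SustainedThreshold:
    """Fire only when a value stays above a threshold for a minimum duration."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time of the first reading above threshold

    def observe(self, value, now):
        """Record one reading taken at time `now` (seconds); return True to alert."""
        if value <= self.threshold:
            self._breach_started = None  # back to normal: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


# Queue > 500 for over 10 minutes (600 seconds), with invented readings.
rule = SustainedThreshold(threshold=500, hold_seconds=600)
for seconds, queue_size in [(0, 620), (300, 640), (600, 700), (660, 100)]:
    print(seconds, queue_size, rule.observe(queue_size, now=seconds))
```

A single reading below the threshold resets the timer, so a queue that briefly recovers has to stay high for another full 10 minutes before the alert fires.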

Dovecot Alerts

Login failure rate > 30%

A sustained rate this high typically indicates a brute-force attack.
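
One way to compute this rate is from two samples of cumulative login counters, as in the sketch below. The counter keys `failed` and `succeeded`, the function names, and the `min_attempts` guard of 20 are assumptions for illustration. The guard is there so that two failures out of three logins at 3 a.m. do not trigger an alert.

```python
def login_failure_rate(prev, curr):
    """Failure ratio between two samples of cumulative login counters."""
    failed = curr["failed"] - prev["failed"]
    succeeded = curr["succeeded"] - prev["succeeded"]
    if failed < 0 or succeeded < 0:
        return 0.0  # counters were reset (e.g. service restart); skip this window
    total = failed + succeeded
    return failed / total if total else 0.0


def should_alert(prev, curr, threshold=0.30, min_attempts=20):
    """Alert when the failure rate exceeds the threshold on enough attempts."""
    attempts = (curr["failed"] - prev["failed"]) + (curr["succeeded"] - prev["succeeded"])
    return attempts >= min_attempts and login_failure_rate(prev, curr) > threshold


# Invented counter samples taken at the start and end of a window.
before = {"failed": 100, "succeeded": 900}
after = {"failed": 160, "succeeded": 1000}
print(login_failure_rate(before, after), should_alert(before, after))
```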


ClamAV Alerts

Signature older than 24 hours
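
A simple proxy for signature freshness is the modification time of the newest database file, as sketched below. The function name is invented, and the database directory is an assumption (commonly `/var/lib/clamav`, but it depends on the image). File age only approximates freshness, since the files do not change when no new signatures have been published.

```python
import os
import time


def newest_signature_age_hours(db_dir, now=None):
    """Age in hours of the most recently modified .cvd/.cld file, or None."""
    now = time.time() if now is None else now
    mtimes = [
        os.path.getmtime(os.path.join(db_dir, name))
        for name in os.listdir(db_dir)
        if name.endswith((".cvd", ".cld"))
    ]
    if not mtimes:
        return None  # no signature databases found at all
    return (now - max(mtimes)) / 3600.0


def signatures_stale(db_dir, max_age_hours=24):
    """True when signatures are missing or older than the allowed age."""
    age = newest_signature_age_hours(db_dir)
    return age is None or age > max_age_hours
```

A call such as `signatures_stale("/var/lib/clamav")` could then feed a gauge or an alert, with the path adjusted to wherever the ClamAV container keeps its databases.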


MariaDB Alerts

Query latency > 200 ms

Affects:

  • SMTP authentication
  • Dovecot auth
  • Roundcube
  • Piler

Storage Alerts

Disk usage > 85%

Especially:

/var/vmail
/var/piler/store
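
As a minimal sketch of this check, the standard library's `shutil.disk_usage` is enough. The function names are invented for this example, and the percentage may differ slightly from what `df` reports, because `df` excludes blocks reserved for root.

```python
import shutil


def disk_usage_percent(path):
    """Used space of the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def over_threshold(paths, threshold=85.0):
    """Return {path: percent_used} for every path above the threshold."""
    result = {}
    for path in paths:
        percent = disk_usage_percent(path)
        if percent > threshold:
            result[path] = percent
    return result


# "." is used here so the example runs anywhere; on the mail host the
# list would contain the mail and archive store paths instead.
print(round(disk_usage_percent("."), 1))
```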

Docker Alerts

  • container restart loops
  • “unhealthy” state
  • memory OOM kills
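
These three conditions can all be read from `docker inspect` output, as the sketch below illustrates. The function name, the `max_restarts` limit of 3 and the sample container are assumptions. Note that `RestartCount` is cumulative, so it is only a crude signal for a restart loop. A rate over time would be more accurate.

```python
import json


def container_problems(inspect_json, max_restarts=3):
    """Map container name -> list of issues, from `docker inspect` JSON."""
    problems = {}
    for container in json.loads(inspect_json):
        name = container.get("Name", "").lstrip("/")
        state = container.get("State") or {}
        health = state.get("Health") or {}  # absent when no healthcheck is defined
        issues = []
        if container.get("RestartCount", 0) > max_restarts:
            issues.append("restart loop")
        if health.get("Status") == "unhealthy":
            issues.append("unhealthy")
        if state.get("OOMKilled"):
            issues.append("OOM killed")
        if issues:
            problems[name] = issues
    return problems


# Invented sample in the shape of `docker inspect <container>` output.
sample_inspect = json.dumps([
    {
        "Name": "/mail-amavis",
        "RestartCount": 7,
        "State": {"Status": "running", "OOMKilled": True,
                  "Health": {"Status": "unhealthy"}},
    },
    {"Name": "/mail-postfix", "RestartCount": 0,
     "State": {"Status": "running", "OOMKilled": False}},
])
print(container_problems(sample_inspect))
```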

Manticore Alerts

  • search latency > 500 ms
  • index not updating
  • RT index overflow

6. External Probing (Blackbox Monitoring)

External probing is very important for real production systems, because it tests the services the way clients actually reach them.

Probe the following:

smtp_starttls://mail.it.demo.tw:25
smtp_auth://mail.it.demo.tw:587
imap_starttls://mail.it.demo.tw:143
https://webmail.it.demo.tw
https://archive.it.demo.tw

You will immediately know if:

  • TLS handshake fails
  • cert is expired
  • mail service unreachable
  • reverse proxy broken
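
To illustrate what a certificate expiry probe does under the hood, here is a small sketch for the HTTPS endpoints using Python's `ssl` module. The function names are invented. The SMTP and IMAP STARTTLS probes are not covered here, since they need a protocol exchange before the TLS handshake.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def https_cert_days_left(host, port=443, timeout=10):
    """Connect with certificate verification and return days until expiry.

    An expired or otherwise invalid certificate makes the handshake raise
    an ssl.SSLError, which is itself a failed probe.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

A scheduled job could call `https_cert_days_left` for the webmail and archive hostnames and warn when the result drops below a chosen number of days.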

7. Centralized Alert Delivery

Recommended channels:

  • Microsoft Teams
  • Slack
  • Telegram Bot
  • Email (secondary only)

Alertmanager can integrate all of these easily.
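
For channels that need custom formatting, Alertmanager can also send alerts to a generic webhook receiver as a JSON payload. The sketch below turns such a payload into one line of chat text per alert. The function name, the alert name `PostfixQueueHigh` and the sample payload are invented, and only the `alerts`, `status`, `labels` and `annotations` fields of the payload are used.

```python
def format_alerts(payload):
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{}] {} ({}): {}".format(
                alert.get("status", "unknown").upper(),
                labels.get("alertname", "unnamed"),
                labels.get("severity", "none"),
                annotations.get("summary", "no summary"),
            )
        )
    return "\n".join(lines)


# Invented payload, trimmed to the fields used above.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "PostfixQueueHigh", "severity": "critical"},
            "annotations": {"summary": "Queue above 500 for 10 minutes"},
        }
    ],
}
print(format_alerts(sample_payload))
```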


8. Deployment Recommendations for Your Environment

Considering your environment:

  • Docker-based multi-container stack
  • postfix + dovecot + amavis
  • piler + manticore
  • MariaDB
  • Apache reverse proxy
  • strict firewall rules
  • DOCKER-USER custom chains

I recommend adding:

On Docker host

  • node_exporter
  • docker_exporter

Within the mail stack

  • postfix_exporter
  • dovecot_exporter
  • clamav_exporter
  • mysqld_exporter
  • blackbox_exporter

Central

  • prometheus
  • grafana
  • alertmanager

Conclusion — A Mail System Without Monitoring Is Not Production-Ready

Building the system is only the beginning.
True operational excellence comes from:

  • detecting issues early
  • getting instant alerts
  • seeing trends
  • identifying attacks
  • preventing downtime

With this chapter, your mail platform now has full production-grade observability.
