I use Monit to do lightweight monitoring and alerting for my services. Nothing advanced, just one host running monit which probes other servers via TCP, UDP, ICMP, etc to see if services are accessible. Monit can do much more if you install it on each machine itself, but I’m not interested in that right now (though it’s less annoying doing that with Monit than Nagios or similar).
tl;dr: to reduce false positives (aka “flapping”) make sure you are checking for several failures within several cycles.
False positives
One problem with monitoring stuff via probing remotely is false positives caused by network blips. When first I started using Monit my host blocks looked like this:
check host web3 with address 10.1.5.17
if failed icmp type echo count 3 with timeout 10 seconds then alert
if failed port 22 protocol SSH with timeout 10 seconds then alert
if failed port 80 protocol HTTP with timeout 10 seconds then alert
A 10-second timeout seemed reasonable enough, but I was surprised to find that even hosts inside my LAN “failed” several times per day with this configuration. Unsurprisingly, hosts outside the LAN, like in EC2 or whatever, were even worse.
I figured it was just a matter of fiddling until I got the “magic” number, so I kept tuning the timeout (10 -> 30, 30 -> 60, 60 -> 90, etc), but my inbox was still filling up with false failures.
Cycles
Monit polls services in cycles (see the set daemon
parameter in the config), and it has a mechanism for tracking how many times a service has failed within a cycle (or several cycles). This allows you to have a bit more flexibility than the binary state of “is port 80 reachable right now or not?”.
In my configuration the cycle is set to 60 seconds, and I ended up going with something like this for my LAN hosts:
check host web3 with address 10.1.5.17
if failed icmp type echo count 3 with timeout 10 seconds 2 times within 3 cycles then alert
if failed port 22 protocol SSH with timeout 10 seconds 2 times within 3 cycles then alert
if failed port 80 protocol HTTP with timeout 10 seconds 2 times within 3 cycles then alert
I’ve found that this has drastically improved the reliability of my alerts; no more “boy crying wolf.” When an alert comes in, I actually believe it now. 😉
Great success!
Sysadmins all around the world understand how this feels…
Monit is sweet
Side note: Monit’s pretty sweet, because it knows about protocols. That means, instead of just asking “Is port TCP 80 reachable on the web server?“, we can ask the web server returning an HTTP 200
(OK) code? And instead of asking “Is port TCP 3306 reachable on the MySQL server?“, we can ask if whatever is listening on port 3306 is actually mysql. Etc etc…