Silence Is Golden
With any monitoring system, a balance must be struck between too much granularity and too little. Technical folks, such as sysadmins, usually err on the side of too much. Given 20 services on 5 boxes, many sysadmins monitor everything and get notified about everything, whether or not the notifications represent an actual problem.
For sysadmins, this is not a big deal; they generally develop an organic understanding of their environments, and the notifications serve as an additional point of visibility or as an event correlation aid. For example, a notification from workstation1 that its network traffic is high, combined with a CPU spike on router 12, and abnormal disk usage on Server3, may indicate to a sysadmin that Ted from accounting has come back early from vacation. A diligent sysadmin might follow up on that hunch to verify that it really is Ted and not a teenager at the University of Hackgrandistan owning Ted's workstation. It happens more often than you'd think. For the non-sysadmin, however, the most accurate phrase to describe these notifications is false alarm.
Typically, monitoring systems use static thresholds to determine the state of a service. The CPU on Server1, for example, may have a threshold of 95 percent. When CPU utilization goes above that, the monitoring system sends notifications or performs an automated break/fix. One of the biggest mistakes an implementer can make when introducing a monitoring system into an environment is simply not taking the time to find out what the normal operating parameters of the servers are. If Server1 typically runs at 98 percent CPU utilization from 12 a.m. to 2 a.m. because it does batch processing during those hours, a 95 percent threshold guarantees a false alarm every night.
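As a rough sketch of how that batch-window problem might be handled in Nagios object configuration (the timeperiod name, the generic-service template, and the check_nrpe!check_load command are assumptions about your setup, not requirements), the service's check_period can simply be pointed at a timeperiod that excludes the nightly batch run:

    # Hypothetical timeperiod that skips Server1's 12 a.m.-2 a.m. batch window
    define timeperiod{
            timeperiod_name nonbatch
            alias           Every day except the nightly batch window
            sunday          02:00-24:00
            monday          02:00-24:00
            tuesday         02:00-24:00
            wednesday       02:00-24:00
            thursday        02:00-24:00
            friday          02:00-24:00
            saturday        02:00-24:00
            }

    # CPU check on Server1; checks (and therefore alarms) are scheduled
    # only outside the batch window
    define service{
            use                     generic-service     ; assumed local template
            host_name               Server1
            service_description     CPU Load
            check_command           check_nrpe!check_load ; assumed NRPE command
            check_period            nonbatch
            }

Whether you exclude the window from checking entirely, as here, or simply raise the thresholds during it, is a judgment call; the point is that the configuration should reflect what you already know about the host's normal behavior.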
False alarms should be methodically hunted down and eradicated. Nothing undermines the credibility of, and erodes the support for, a fledgling monitoring system faster than people getting notifications they consider silly or useless. Before the monitoring system is configured to send notifications, it should be run for a few weeks to collect data on at least the critical hosts and establish their normal operational parameters. This data, collectively referred to as a baseline, provides the only reasonably responsible basis for setting static thresholds on your servers.
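One minimal way to run in this data-gathering mode, assuming you want Nagios itself to collect the baseline, is to leave notifications off globally while performance data processing is on. Both directives below are standard nagios.cfg settings; what you graph the resulting performance data with is up to you:

    # nagios.cfg excerpt: gather a baseline without paging anyone
    # Suppress all notifications while baseline data is collected
    enable_notifications=0
    # Hand performance data off to a trending/graphing tool
    process_performance_data=1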
That's not to say our sysadmin should be prevented from getting the most out of his cell phone's unlimited data plan. I'm merely suggesting that some filtering be put in place to ensure no one else need share his unfortunate fascination. One great thing about following the procedural approach outlined earlier in this chapter is that it makes it possible to think about the organization's requirements for a particular service on a specific host before the thresholds and contacts are configured. If Alice, the DBA, doesn't need to react to high CPU on Server1, then she should not get paged about it.
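To make the Alice example concrete, here is a minimal sketch of targeted notification routing. The contact, group, and command names are the sort of thing that ships with or is commonly added to a stock Nagios install, but treat them as placeholders for your own:

    # Alice and her fellow DBAs get their own contactgroup...
    define contact{
            contact_name                    alice
            alias                           Alice the DBA
            service_notification_period     24x7
            host_notification_period        24x7
            service_notification_options    c,r
            host_notification_options       d,r
            service_notification_commands   notify-service-by-email
            host_notification_commands      notify-host-by-email
            email                           alice@example.com
            }

    define contactgroup{
            contactgroup_name       dbas
            alias                   Database Administrators
            members                 alice
            }

    # ...but the CPU check on Server1 notifies only the sysadmins group
    define service{
            use                     generic-service
            host_name               Server1
            service_description     CPU Load
            check_command           check_nrpe!check_load
            contact_groups          sysadmins
            }

Alice still benefits from the data being collected; she just never hears about it unless she goes looking.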
Nagios provides plenty of functionality to enable sysadmins to be notified of "interesting events" without alerting management or other uninterested parties. With two threshold levels (warning and critical) and a myriad of escalation and polling options, it is relatively simple to provide early-and-often notifications for the control freaks while keeping everyone else informed of only genuine problems. It is highly recommended that a layered approach to notification be a design goal of the system from the beginning.
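As an illustration of how that layering tends to look in practice (the command name, threshold values, and contact groups below are assumptions, not recommendations), the warning and critical levels come from the plugin's -w and -c arguments, and a service escalation can pull management in only after a problem has lingered:

    # check_load's -w/-c arguments set the warning and critical thresholds
    # for the 1-, 5-, and 15-minute load averages
    define command{
            command_name    check_local_load
            command_line    $USER1$/check_load -w 15,10,5 -c 30,25,20
            }

    # Sysadmins hear about the problem immediately; managers are added
    # only if it survives the first three notifications
    define serviceescalation{
            host_name               Server1
            service_description     CPU Load
            first_notification      3
            last_notification       0
            notification_interval   30
            contact_groups          sysadmins,managers
            }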
Good monitoring systems tend to be focused rather than chatty. They may monitor many services for the purpose of historical trending, but they send fewer notifications than one would expect, and when they do, the notifications go to the people who actually want to know. For the intellectually curious who don't want their pagers going off at all hours of the day and night, consider sending a summary report every 24 hours or so. Nagios has some excellent reporting built in.