Best Practices for Designing a Nagios Monitoring System
Building a monitoring infrastructure is a complex undertaking. The system can potentially interact with every system in the environment, and its users range from the layman to the highly technical. Building the monitoring infrastructure well requires not only considerable systems know-how, but also a global perspective and good people skills.
Above all, building monitoring systems requires a light touch. The most important distinction between good monitoring systems and bad ones is how much impact they have on the network environment, in areas such as resource utilization, bandwidth utilization, and security. This first chapter contains a collection of advice gleaned from mailing lists such as email@example.com, other systems administrators, and hard-won experience. My hope is that this chapter helps you to make some important design decisions up front, to avoid some common pitfalls, and to ensure that the monitoring system you build becomes a huge asset instead of a huge burden.
A Procedural Approach to Systems Monitoring
Good monitoring systems are not built one script at a time by administrators (admins) in separate silos. Admins create them methodically with the support of their management teams and a clear understanding of the environment—both procedural and computational—within which they operate.
Without a clear understanding of which systems are considered critical, the monitoring initiative is doomed to failure. It's a simple question of context and usually plays out something like this:
Manager: "I need to be added to all the monitoring system alerts."
Admin: "All of them?"
Manager: "Well yes, all of them."
Admin: "Er, ok."
The next day:
Manager: "My pager kept me up all night. What does this all mean?"
Admin: "Well, /var filled up on Server1, and the VPN tunnel to site5 was up and down."
Manager: "Can't you just notify me of the stuff that's an actual problem?"
Admin: "Those are actual problems."
Regulations and audit standards such as HIPAA, Sarbanes-Oxley, and SAS70 require institutions such as universities, hospitals, and corporations to master the procedural aspects of their IT. This has had good consequences, as most organizations of any size today have contingency plans in place in the event that something bad happens. Disaster recovery, business continuity, and crisis planning ensure that the people in the trenches know what systems are critical to their business and understand the steps to take to protect those systems in times of crisis or to recover them should they be destroyed. These requirements also ensure that management has done due diligence to prevent failures of critical systems; for example, by installing redundant systems or moving tape backups offsite.
For whatever reason, monitoring systems seem to have been left out of this procedural approach to contingency planning. Most monitoring systems come into the network as a pet project of one or two small tech teams who have a very specific need for them. Often many different teams will employ their own monitoring tools independent of, and oblivious to, other monitoring initiatives going on within the organization. There seems to be no need to involve anyone else. Although this single-purpose approach to systems monitoring may solve an individual's or small group's immediate need, the organization as a whole suffers, and fragile monitoring systems always grow from it.
To understand why, consider that in the absence of a procedurally implemented monitoring framework, hundreds of critically important questions are nearly impossible to answer. For example, consider the following questions.
- What amount of overall bandwidth is used for systems monitoring?
- What routers or other systems are the monitoring tools dependent on?
- Is sensitive information being transmitted in clear text between hosts and the monitoring system?
If it was important enough to write a script to monitor a process, then it's important enough to consider what happens when the system running the script goes down, or when the person who wrote the script leaves and his user ID is disabled. The piecemeal approach is by far the most common way monitoring systems are created, yet the problems that arise from it are too many to be counted.
The core issue in our previous example is that there are no criteria that coherently define what a "problem" is, because these criteria don't exist when the monitoring system has been installed in a vacuum. Our manager felt that he had no visibility into system problems and when provided with detailed information, still gained nothing of significance. This is why a procedural approach is so important. Before they do anything at all, the people undertaking the monitoring project should understand which systems in the organization are critical to the organization's operational well-being, and what management's expectation is regarding the uptime of those systems.
Given these two things, policy can be formulated that details support and escalation plans. Critical systems should be given priority and their requisite pieces defined. That's not to say that the admin in the example should not be notified when /var is full on Server1; only that when he is notified of it, he has a clear idea of what it means in an organizational context. Does management expect him to fix it now or in the morning? Who else was notified in parallel? What happens if he doesn't respond? This helps the manager, as well. By clearly defining what constitutes a problem, management has some perspective on what types of alerts to ask for and, more importantly, when they can go back to sleep.
Smaller organizations, where there may be only a single part-time system administrator (sysadmin), are especially susceptible to piecemeal monitoring pitfalls. Thinking about operational policy in a four-person organization may seem silly, but in small environments, critical system awareness is even more important. When building monitoring systems, always maintain a big-picture outlook. If the monitoring endeavor is successful, it will grow quickly and the well-being of the organization will come to depend on it.
Ideally, a monitoring system should enforce organizational policy rather than merely reflect it. If management expects all problems on Server1 to be looked at within 10 minutes, then the monitoring system should provide the admin with a clear indicator in the message (such as a priority number), a mechanism to acknowledge the alert, and an automatic escalation to someone else at the end of the 10-minute window.
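The policy in this example can be sketched in a few lines of Python. This is only an illustration of the logic, not Nagios configuration: the `Alert` class, the `next_action` function, and the priority value are hypothetical names invented here, and the 10-minute window is taken from the Server1 example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical policy value from the example in the text:
# problems on Server1 must be looked at within 10 minutes.
ACK_WINDOW = timedelta(minutes=10)


@dataclass
class Alert:
    host: str
    message: str
    priority: int                      # clear indicator carried in the notification
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None

    def acknowledge(self, when: datetime) -> None:
        """Record that an admin has seen the alert."""
        self.acknowledged_at = when


def next_action(alert: Alert, now: datetime) -> str:
    """Decide what the monitoring system should do for this alert at time `now`."""
    if alert.acknowledged_at is not None:
        return "none"        # someone has taken ownership of the problem
    if now - alert.raised_at >= ACK_WINDOW:
        return "escalate"    # window expired without an acknowledgment
    return "wait"            # still inside the window


raised = datetime(2024, 1, 1, 2, 0)
alert = Alert("Server1", "/var is full", priority=1, raised_at=raised)
print(next_action(alert, raised + timedelta(minutes=5)))    # wait
print(next_action(alert, raised + timedelta(minutes=10)))   # escalate
alert.acknowledge(raised + timedelta(minutes=12))
print(next_action(alert, raised + timedelta(minutes=15)))   # none
```

In practice this behavior would normally come from the monitoring system's own notification and escalation settings rather than from custom code; the sketch only makes the three requirements (priority indicator, acknowledgment, timed escalation) concrete.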
So how do we find out what the critical systems are? Senior management is ultimately responsible for the overall well-being of the organization, so they should be the ones making the call. This is why management buy-in is so vitally important. If you think this is beginning to sound like disaster recovery planning, you're ahead of the curve. Disaster recovery works toward identifying critical systems for the purpose of prioritizing their recovery, and therefore, it is a methodologically identical process to planning a monitoring infrastructure. In fact, if a disaster recovery plan already exists, that's the place to begin. The critical systems have already been identified.
Critical systems, as outlined by senior management, will not be along the lines of "all problems with Server1 should be looked at within 10 minutes." They'll probably be defined as logical entities; for example, "Email is critical." So after the critical systems have been identified, the implementers will dissect them one by one into the parts of which they are composed. Don't just stay at the top; be sure to involve all interested parties. Email administrators will have a good idea of what "email" is composed of, and of the criteria that, if not met, will lead them to roll their own monitoring tools.
Work with all interested parties to get a solution that works for everyone. Great monitoring systems are grown from collaboration. Where custom monitoring scripts already exist, don't dismiss them; instead, try to incorporate them. Groups tend to trust the tools they're already using, so co-opting those tools usually buys you some support. Nagios is excellent at using external monitoring logic along with its own scheduling and escalation rules.
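One common way to incorporate an existing script is a thin wrapper that follows the Nagios plugin convention: a single line of status text on standard output and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below assumes a hypothetical legacy script that exits 0 on success and non-zero on failure; the function names are illustrative and not part of Nagios.

```python
import subprocess
from typing import List, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def to_nagios(returncode: int, output: str) -> Tuple[int, str]:
    """Map a legacy script's result onto a Nagios status and a one-line message.

    Assumption: the legacy script exits 0 on success and non-zero on failure.
    """
    status = OK if returncode == 0 else CRITICAL
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else "no output"
    return status, f"{LABELS[status]} - {first_line}"


def run_legacy_check(command: List[str], timeout: float = 30.0) -> Tuple[int, str]:
    """Run the existing script and translate its result for Nagios."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The check itself could not run, so the real state is unknown.
        return UNKNOWN, f"{LABELS[UNKNOWN]} - could not run check: {exc}"
    return to_nagios(result.returncode, result.stdout or result.stderr)


def main(argv: List[str]) -> int:
    """Print the status line and return the exit code Nagios expects."""
    if not argv:
        print("UNKNOWN - usage: wrapper COMMAND [ARGS...]")
        return UNKNOWN
    status, message = run_legacy_check(argv)
    print(message)
    return status

# As a plugin entry point, the wrapper would end with:
#     sys.exit(main(sys.argv[1:]))
```

The team that wrote the original script keeps its own logic untouched, while scheduling, notification, and escalation move to the monitoring system.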