
This chapter is from the book

Who's Watching the Watchers?

If there is a fatal flaw in the concept of systems monitoring, it is the use of untrustworthy systems to watch other untrustworthy systems. If your monitoring system fails, you should at the very least be informed of it. A failover system that picks up where the failed system left off is even better.

The specifics of your network dictate what needs to happen when the monitoring system fails. If you are bound by strict SLAs, then uptime reports are a critical part of your business, and a failover system should be implemented. Often, it's enough to simply know that the monitoring system is down.

Failure-proofing monitoring systems is a messy business. Unless you work at a Tier 1 ISP, you will eventually hit some upstream dependency that you have no control over if you go far enough up the topology of your network. This does not negate the need for a plan.

Small shops should at least have a secondary system, such as a syslog box, or some other piece of infrastructure that can heartbeat the monitoring system and send an alert if things go wrong. Large shops may want to consider global monitoring infrastructure, either provided by a company that sells such solutions or by maintaining a mesh topology of hosted Nagios boxes in geographically dispersed locations.
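As a rough illustration of the heartbeat idea, here is a minimal Python sketch that could run on the secondary box. It assumes the monitoring host refreshes a marker file on that box at regular intervals (by whatever means suits your site); the function names, the file path in the usage note, and the five-minute threshold are all hypothetical and not part of Nagios.

```python
import os
import time


def heartbeat_is_stale(last_update: float, now: float, max_age_seconds: float) -> bool:
    """Return True when the last heartbeat is older than the allowed age."""
    return (now - last_update) > max_age_seconds


def check_heartbeat_file(path: str, max_age_seconds: float = 300.0) -> str:
    """Classify the monitoring system as OK or DOWN from a marker file's mtime.

    Assumption: the monitoring host touches `path` regularly, so a stale or
    missing file suggests the monitoring system has stopped working.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return "DOWN: heartbeat file missing or unreadable"
    if heartbeat_is_stale(mtime, time.time(), max_age_seconds):
        return "DOWN: heartbeat file has not been refreshed recently"
    return "OK"
```

A cron job on the secondary box might call `check_heartbeat_file("/var/run/nagios-heartbeat")` (a made-up path) and send mail or a page whenever the result is not `"OK"`. The alert path should not depend on the monitoring system being checked.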

Nagios makes it easy to mirror state and configuration information across separate boxes. By default, configuration and state are stored as terse, clear-text files. Configuration syntax hooks make event mirroring a snap, and Nagios can be configured in distributed monitoring scenarios with multiple Nagios servers. The monitoring system may be the system most in need of monitoring; don't forget to include it in the list of critical systems.
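The text does not say which mechanism to use for mirroring events, so the following is only one possible sketch. It assumes results are handed to a second Nagios server as passive check results in the external command format (`PROCESS_SERVICE_CHECK_RESULT`). The function name and the host and service names in the usage note are hypothetical, and how the line is transported to the other server is left open.

```python
import time
from typing import Optional


def passive_service_result(host: str, service: str, return_code: int,
                           output: str, timestamp: Optional[int] = None) -> str:
    """Format one service result as a Nagios external command line.

    Return codes follow the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2, or 3")
    # Semicolons separate the fields, so they cannot appear in the names.
    if ";" in host or ";" in service:
        raise ValueError("host and service names must not contain ';'")
    if timestamp is None:
        timestamp = int(time.time())
    # Each command occupies exactly one line, so flatten the plugin output.
    one_line = " ".join(output.splitlines())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{one_line}\n")
```

For example, `passive_service_result("web01", "HTTP", 2, "CRITICAL - timeout")` produces a line that the receiving server could accept through its external command file, provided a matching host and service are defined there and passive checks are enabled for them.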
