Using JConsole to monitor your system is tedious and good as a monitoring system only if you are actively staring at the graphs and information all the time. Since that is unrealistic and time-consuming, we recommend that you use other systems for monitoring the general health of your system such as Nagios.
Nagios is open-source software dedicated to monitoring computers, networks, hosts, and services and can alert you when things are going wrong or have been resolved. It is extremely versatile and has the capability to monitor many types of services, applications, or parts of an application. Let’s start at the bottom of the monitoring chain and work our way up. In order to avoid a complete lesson on monitoring, we will only cover the basics along with what the most common checks should be as they relate to Cassandra and its operation.
There are three primary alerts in Nagios: WARNING, CRITICAL, and OK. They mean exactly what they sound like. A WARNING alert is sent if the service in question is starting to show signs of a problem, such as a hard drive nearing capacity. A CRITICAL alert is sent if the service in question is down or in a catastrophic state, such as a hard drive that is completely out of space and preventing the applications using that drive from running. An OK alert is sent when the service has recovered or become available again, such as when the total space used on the hard drive has dropped below the threshold set to alert for CRITICAL or WARNING.
OS and Hardware Checks
When monitoring any machine, it’s best to start out with the checks at the OS and hardware layer. Even if you are running Cassandra in a virtualized environment such as Amazon or Rackspace, there are still hardware(ish) checks that should be instituted.
Disks and Partitions
The first thing you are going to want to check is the amount of free disk space on data partitions and the CommitLog partitions (assuming they are on separate partitions). Remember that if you are using SizeTieredCompaction, you shouldn’t have the alert set for WARNING at 80% disk utilization and CRITICAL set at 90% disk utilization. The safer approach is to set the WARNING threshold to be roughly 35% disk utilization and the CRITICAL threshold at 45% disk utilization. SizeTieredCompaction is capable of taking up two times the size of the largest SSTable on disk. And while it is unlikely that a single SSTable would be 50% of the data on disk, it is better to be safe than sorry. Recovering from having too much data on disk is extremely difficult.
This concept of monitoring partitions and drives is also important because of JBoD support in Cassandra 1.2 and later. This means that Cassandra can have a single data directory on multiple disks. You will need to know if one or more of those disks are having an issue or require replacement. By monitoring the utilization and health of all the disks in your system, you will know their state and whether they need replacing or maintenance.
Last, you want to ensure that the drive that contains the log files doesn’t fill up. Depending on your log settings, Cassandra has the potential to be very verbose in the log files. If the log files become too large, they can prevent the rest of your system from working if the drive(s) runs out of space.
Linux divides its physical memory into smaller chunks called pages. Swapping is the process whereby a page of memory is copied from memory to a dedicated space on the hard disk called swap space to free up that page of memory. Although there are cases where it is OK, it is normally not recommended for systems to be in a state where they are swapping memory. Typically, anything more than 5% to 10% of your swap space being used is cause for investigation.
On a Cassandra node, swapping is usually a bad sign, so you will want to monitor the swap partition for usage of nearly any kind. Since you should be able to hold the entire JVM’s heap space in memory with at least a little room to spare for the operating system, getting to the point of swapping out pages of memory means it might be a little too late to recover. One of the reasons Cassandra is able to function so well with regard to writes is the fact that many of the writes occur to the memory-mapped MemTables. Having these MemTables swap to disk would drastically impair the performance of Cassandra and should therefore be avoided when possible.
Clock drift refers to the phenomenon where one clock does not run at the exact same speed as another clock. It is especially important to be aware of this if you are running in a virtualized environment as drift from the hypervisor can be much more prevalent than on regular iron. The system clock is incredibly important to Cassandra’s write and reconciliation architecture. Most writes are serialized by timestamp. In other words, if two writes come in for the same column at almost the same time, the determining factor for which value wins is which timestamp is higher. If the system clocks in the ring are not all in sync, you are probably going to see some really strange behavior.
One of the ways to deal with that is to monitor the clock drift using NTP. NTP, or Network Time Protocol, is the most commonly used time synchronization system on the Internet. It also comes with a binary for telling you the offset (drift) from its synchronizing time server. You obviously want to minimize the amount of drift your system experiences. But there will invariably be some that you have to deal with. Monitoring is the way you know if the NTP daemon isn’t doing the job it is supposed to be doing and keeping your clocks in sync. Being alerted to a problem with the clocks in a distributed environment that relies heavily on time for decision making could save a lot of time tracking down weird problems later on.
It is also a good idea to check the ping time responses from each of the Cassandra nodes being monitored. There are any number of reasons that these responses can begin to come back slowly. A few examples include the following:
- A machine that is doing too much work and running short of CPU cycles to respond quickly
- I/O saturation, too high an await (average wait) time, and the machine cannot respond quickly to the request
- Network saturation due to unthrottled streaming on a high-speed network link
Whatever the reason is, it is good to know if there is network congestion of which you should be aware. When a node is slow to receive packets (which is the case with nodes with high ping times), writes can be slow to come in and register, reads and writes will be dropped to keep up with the demand being put on the system, or any number of other weird behaviors may appear. What constitutes a high ping time from your monitoring server depends to a great extent on your network paths. Run a few ping tests from your monitoring server to your Cassandra nodes during regular usage periods to get a feel for what a normal threshold is.
Cassandra is usually an I/O-bound system. You usually run into problems with disk writes or reads slowing down long before you run into CPU-related slowdown. But just to be safe, as different workloads call for different tools to be used at different times, you should monitor CPU usage. While there are many things you could look for when monitoring CPU usage, such as context switches or interrupt requests, a good place to start is usually watching the system load average. The system load average is an average of the number of processes waiting to get into the system’s run queue over a period of time. In the case of the uptime command, it’s over one, five, and 15 minutes. Keep in mind that in the case of multiprocessor systems, the load is relative to the number of processors and cores on the system.
The common rule for utilization is that you want to have a machine working hard but not overworking. This means that you typically want to have the machine running at about 70% utilization. That leaves you headroom for spikes in work and doesn’t leave the machine underutilized during slower periods. So if you have four cores, having the load sit at around 3.00 is usually a safe bet. If you have four cores and the load is 3.5 or higher, you should try to find out what’s wrong and fix it before things go from bad to worse.
Cassandra-Specific Health Checks
Once you have the basic system checks in place, it’s time to add monitoring that is specific to Cassandra. There are various checks that interact with Cassandra at different levels of the system. Some are superficial such as checking to see if ports are alive and being listened on. Some checks require using a slightly more in-depth toolset to programmatically check the MBeans described earlier.
There are three primary ports of interest to Cassandra: 7000 (or 7001 if SSL/TLS is enabled), 7199, and 9160. Port 7000/7001 is used by Cassandra for cluster communication. This includes things such as the Gossip protocol and failure detection. Port 7199 is used by JMX. Port 9160 is the Thrift port and is used for client communication. In order for your cluster to function properly, all of these ports should be accessible.
While it is not necessary to specifically monitor these ports, it is a good idea to test them out one way or another. Testing the Thrift port (9160) is just testing whether you can connect to an instance using a Cassandra driver. In terms of monitoring, if you can connect, the check passes. If you can’t connect to the server, the check should send off an alert. You can also use a simple TCP check here even though it is less comprehensive.
Using some of the knowledge we gained from looking at the normal behavior of our system with JConsole, we are going to add some checks using JMX. There are plug-ins for Nagios that enable you to run JMX queries and compare the results against a set of predetermined thresholds. While there are many values that can be monitored through JMX, there are a few that stand out.
The first set of JMX checks to create is for read and write request latency. These values are given in microseconds because they should be that small. These latencies can be measured at the Cassandra application level and/or at the ColumnFamily level. Measuring them at the application level is important as a general health metric. High request latencies can be indicative of a bad disk or that your current read pattern is starting to slow down. If there is a ColumnFamily for which it is particularly important to have extremely low-latency reads and/or writes, it would be a good decision to monitor the performance for that ColumnFamily as well. It is important to note that read latency and write latency are two separate metrics provided by Cassandra, and both are important in their own right depending on your workload.
The next set of JMX metrics to keep tabs on is garbage collection timing. Cassandra will not only tell you how long its last garbage collection took but also how long that last ParNew GC took. A good way to think of ParNew garbage collection is that it is a stop-the-world garbage collection that uses multiple GC threads to complete its job. If you are monitoring the amount of time these take, you can easily set up an alert for when they start to take too long. Cassandra is unavailable during a stop-the-world garbage collection pause. The longer these pauses take, the longer Cassandra will be unavailable.
Another metric that is useful in helping to determine whether or not you need to add capacity to your cluster is PendingTasks under the CompactionManagerMBean. Depending on the speed and volume with which you ingest data, you will need to find a comfortable set of thresholds for your system. Typically, the number of PendingTasks should be relatively low, as in fewer than 50 at any given time. There are certainly acceptable reasons for things to back up, such as forced compactions or cleanup, but it is advisable to watch this metric carefully. If you have an alert set for PendingTasks and find this alert firing regularly, you may need to add more capacity (either more or faster disks or more nodes) to your cluster to keep up with the workload.
The last JMX metrics that should make it onto your first round of monitoring are the amount of on-heap and the amount of off-heap memory used at a time. The amount of on-heap memory used should always be less than the amount of heap that you have allowed the JVM to allocate. Since you know what this value is at start time, you should be able to easily monitor whether or not you are approaching that value. Off-heap memory tracking is a little harder to monitor for sane values. This is a metric where you will once again have to take a look at JConsole and see what regular and peak values are for the system under normal and peak operational loads so you don’t send off useless alerts.
There is a lot of useful information in the Cassandra logs that can be indicative of a problem. As mentioned earlier in the chapter, you can find READ and WRITE dropped message counts within the INFO log level. There is a Nagios plug-in that can monitor logs and check for specific log messages. Using this plug-in, you can have Nagios alert you not just when there are READ and/or WRITE messages dropped, but you also can have it alert you when this happens more than n times per period. For instance, your application may be tolerant of missing READs and much less tolerant of missing WRITEs. So the log monitoring check can alert you with a CRITICAL alert if more than 1,000 mutations have been dropped over a five-minute period and with a WARNING alert if more than 1,000 mutations have been dropped over a 15-minute period.
This is just in the case of bad things happening in the INFO level. You can also have the log monitoring system alert you if any FATAL, ERROR, or WARNING log messages are put into the logs. Many of these plug-ins are configurable enough to send the log messages (or at least the one that caused the notification) along with the alert.
Now that we have the OS and system layer monitored and we know Cassandra is up and at least responding, it’s time to check a little deeper. The further into the application you monitor, the better you will be able to sleep at night knowing things are functioning the way you want them to. Although it is useful and necessary to have superficial checks like load average and memory, the real value of monitoring systems is realized as you get deeper into the application.
What this means is that you should be checking things that are specific to your application in addition to the Cassandra server. If your application writes to a new ColumnFamily at the beginning of every month, you should have your monitoring system check before the month turnover that the new ColumnFamily exists (and optionally create it if it doesn’t).
Another good use of monitoring resources is to check the response time of certain queries. If you are regularly running queries that roll up all the events for an hour, monitor how long that query takes to run and set up an alert if it’s outside the normal threshold. In other words, if the query runs too fast, you want to know because it’s possible you aren’t collecting all the data you expect to be there. If the query takes too long to run, your system could be under heavy load or you may have just hit a point where you need to rethink your query patterns. Either way, that type of instrumentation is useful to measure how your system actually performs compared to how you expect it to perform.
If you run an application at the top of every hour—an extract, transform, load (ETL) process, for example—it might be a good idea to have the application put a “run complete” column somewhere when it’s done. At the beginning of every hour, the monitoring system can run a query to check for the existence of the column for the last hour. If the “run complete” column doesn’t exist for the last hour, it would be good to know so you can look into why.