2: Analyzing Meaningful Metrics

One of the most common characteristics that differentiates well-managed infrastructures from poorly managed ones is the use of metrics. A first distinction in this regard is the difference between collecting data and establishing truly meaningful metrics derived from that data.

Most companies today collect some type of data about outages to their online systems, regardless of whether the systems are hosted on mainframes, client/server systems, or the Internet. A typical metric measures the percent uptime of a particular system over a given period of time and establishes a target goal; for instance, 99% uptime. The data collected in this example may include the start and end times of the outage, systems affected, and corrective actions taken to restore service. The metric itself is the computation of the percent uptime on a daily, weekly, or monthly basis for each online system measured.
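As a minimal sketch of how the percent-uptime metric can be computed from outage records, the following assumes each outage is stored as a (start, end) pair and that outages for a system do not overlap. The function name `percent_uptime`, the sample dates, and the sample outages are hypothetical and serve only as illustration.

```python
from datetime import datetime

def percent_uptime(outages, period_start, period_end):
    """Percent uptime for one system over [period_start, period_end).

    Assumes outages are (start, end) datetime pairs that do not overlap.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        # Clip each outage to the reporting period before counting it.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (period_seconds - down_seconds) / period_seconds

# Hypothetical outage data for one 30-day month.
period_start = datetime(2024, 4, 1)
period_end = datetime(2024, 5, 1)
outages = [
    (datetime(2024, 4, 3, 14, 0), datetime(2024, 4, 3, 16, 0)),    # 2 h
    (datetime(2024, 4, 18, 9, 30), datetime(2024, 4, 18, 11, 6)),  # 1 h 36 min
]

uptime = percent_uptime(outages, period_start, period_end)
meets_target = uptime >= 99.0  # compare against the 99% target goal
print(f"Uptime: {uptime:.2f}%  target met: {meets_target}")
```

The same function can be run per day, per week, or per month simply by changing the reporting period passed in.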

Compiling the outage data into a more meaningful metric may involve segregating the percentage uptime between prime-shift and off-shift, or reporting on actual system downtime in minutes or hours, as opposed to percent availability. A meaningful availability metric may also be a measure of output as defined by the customer. For instance, a purchasing officer may request measuring availability based on the number of purchase orders that the purchasing staff can process on a weekly basis.
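The split between prime-shift and off-shift downtime might be sketched as follows. The shift definition (weekdays, 08:00 to 18:00) is an assumption chosen for illustration, as are the names `is_prime_shift` and `downtime_minutes_by_shift` and the sample outages. The sketch also assumes outage timestamps fall on whole minutes.

```python
from datetime import datetime, timedelta

# Assumed prime-shift definition: Monday through Friday, 08:00 to 18:00.
PRIME_START_HOUR = 8
PRIME_END_HOUR = 18

def is_prime_shift(moment):
    """True if the given minute falls inside the assumed prime shift."""
    return moment.weekday() < 5 and PRIME_START_HOUR <= moment.hour < PRIME_END_HOUR

def downtime_minutes_by_shift(outages):
    """Total downtime in minutes, split into prime-shift and off-shift."""
    totals = {"prime": 0, "off": 0}
    one_minute = timedelta(minutes=1)
    for start, end in outages:
        moment = start
        # Walk the outage one minute at a time and credit each minute
        # to the shift it falls in.
        while moment < end:
            totals["prime" if is_prime_shift(moment) else "off"] += 1
            moment += one_minute
    return totals

# Hypothetical outages: one straddling the end of prime shift on a
# Wednesday, one on a Saturday morning.
outages = [
    (datetime(2024, 4, 3, 17, 30), datetime(2024, 4, 3, 19, 0)),
    (datetime(2024, 4, 6, 10, 0), datetime(2024, 4, 6, 10, 45)),
]

totals = downtime_minutes_by_shift(outages)
print(totals)
```

Reporting the result in minutes rather than as a percentage makes the impact on prime-shift users visible even when the overall percent uptime still meets its target.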

Instituting meaningful metrics helps improve the overall management of an infrastructure, but their ultimate use involves analyzing the metrics to reveal trends, patterns, and relationships. This in-depth analysis can often lead you to the root cause of problems and a more proactive approach to meeting service levels.

An example from an aerospace client illustrates this point. This firm was running highly classified data over expensive encrypted network lines. High network availability was of paramount importance to ensure the economic use of the costly lines, as well as the productivity of the highly paid specialists using the network. At some point, intermittent network outages began occurring, and they proved elusive to troubleshoot. Finally, we trended the data and noticed a pattern that seemed to center around the afternoon of the third Thursday of every month.

This monthly pattern eventually led us and our suppliers to discover that our telephone carrier was performing routine line maintenance for disaster recovery on the third Thursday of every month. The switching involved in this maintenance was producing just enough line interference to disrupt our sensitive encrypted lines. The maintenance was consequently modified to produce less interference, and the problem never recurred. Analyzing and trending the metrics data led us directly to the root cause and eventual resolution of the problem.
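The kind of trending described in this example can be sketched by grouping outage start times into calendar slots and counting how often each slot recurs. This is only an illustration under stated assumptions, not the analysis actually performed for the client: the slot definition, the function names `calendar_slot` and `trend_outages`, and the sample timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def calendar_slot(moment):
    """Map a timestamp to (nth occurrence of weekday in month, weekday, part of day)."""
    nth = (moment.day - 1) // 7 + 1  # days 1-7 -> 1st, days 8-14 -> 2nd, ...
    if moment.hour < 12:
        part = "morning"
    elif moment.hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    return (nth, WEEKDAYS[moment.weekday()], part)

def trend_outages(outage_starts):
    """Count outages per calendar slot so recurring slots stand out."""
    return Counter(calendar_slot(moment) for moment in outage_starts)

# Hypothetical outage start times: three fall on a third Thursday
# afternoon, one is unrelated noise.
outage_starts = [
    datetime(2024, 1, 18, 14, 5),
    datetime(2024, 2, 6, 9, 0),
    datetime(2024, 2, 15, 15, 20),
    datetime(2024, 3, 21, 13, 40),
]

counts = trend_outages(outage_starts)
top_slot, top_count = counts.most_common(1)[0]
print(top_slot, top_count)
```

A slot that accounts for a disproportionate share of outages is a candidate for further investigation; it points to where to look for a root cause rather than proving one.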
