Measurement Validation and Statistical Analysis
The Internet and Web are extremely complex statistically. Invalid measurements and incorrect statistical analysis can easily lead to apparent SLA violations and penalties, which may then fall apart when the service provider challenges them with a more appropriate analysis. Therefore, special care must be taken to discard invalid measurements and to use appropriate statistical analysis methods.
Measurement problems, which are artifacts of the measurement process, are inevitable in any large-scale measurement system. The important issues are how quickly these errors are detected and tagged in the database, and the degree of engineering and business integrity that's applied to the process of error detection and tagging.
Measurement problems can be caused by instrument malfunction, such as a response timer that fails, and by synthetic transaction script failure, which leads to false transaction error reports. They can also be caused by abnormal congestion on a measurement tool's access link to the backbone network and by many other factors. These are failures of the measurement system, not of the system being measured. They are therefore best excluded from any SLA compliance metrics.
Detection and tagging of erroneous measurements may take time, sometimes up to a day or more, as the measurement team investigates the situation. Fortunately, SLA reports are not generally done in real time, and there's therefore an opportunity to detect and remove such measurements.
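The tagging approach described above can be pictured with a minimal sketch. The `Measurement` record, its field names, and the reason string are hypothetical and are not taken from any particular measurement product; the only point illustrated is that tagged measurements stay in the database but are left out of the SLA compliance calculation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Measurement:
    """One response-time sample (hypothetical record layout)."""
    response_seconds: float
    invalid_reason: Optional[str] = None  # set once the sample is tagged as an artifact


def tag_invalid(measurement: Measurement, reason: str) -> None:
    """Tag a sample as a measurement-system artifact without deleting it."""
    measurement.invalid_reason = reason


def sla_samples(measurements: List[Measurement]) -> List[float]:
    """Return only the untagged samples for use in SLA compliance reports."""
    return [m.response_seconds for m in measurements if m.invalid_reason is None]


samples = [Measurement(1.2), Measurement(45.0), Measurement(0.9)]
# Assumed scenario: the second sample was taken while the probe's access link was congested.
tag_invalid(samples[1], "access link congestion at measurement agent")
valid_for_sla = sla_samples(samples)
```

Keeping the tagged record, rather than deleting it, preserves an audit trail if the exclusion is later disputed.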
The same measurements will probably also be used for quick diagnosis, or triage, and that usage requires real-time reporting. There's therefore no chance to remove erroneous measurements before use, and the quick diagnosis techniques must themselves handle possible problems in the measurement system. Good, fast-acting artifact reduction techniques (discussed in Chapter 5, "Event Management") can eliminate a large number of misleading error messages and reduce the burden on the provider management system.
An emerging alternative is to use a trusted, independent third party to provide the monitoring and SLA compliance verification. The advantage of having an independent party provide the information is that both service providers and their customers can view this party as objective when they have disputes about delivered service quality.
Keynote Systems and Brix Networks are early movers into this market space. Keynote Systems provides a service, whereas Brix Networks provides an integrated set of software and hardware measurement devices to be installed and managed by the owner of the SLA. They both provide active, managed measurement devices placed at the service demarcation points between customers and providers or between different providers. (Other companies, such as Mercury Interactive and BMC, now offer similar services and software.)
The measurement devices, known as "agents" in the Keynote service and "verifier platforms" in the Brix service, carry out periodic service quality measurements. They collect information and reduce it to trends and baselines. There is also a real-time alerting component that is triggered when a measurement device detects a noncompliant situation. Alerts are forwarded to the Keynote or BrixWorx operations center, where they are logged and included in service level quality reports. Because the Keynote system is a service, Keynote provides measurement device management and measurement validation.
Keynote and BrixWorx also offer integration with other management systems and support systems for reporting to customers, provisioning staff, and other back-office departments. Test suites for more detailed testing are also stored at the center and deployed to the measurement platforms as necessary.
Trusted third parties may be the solution needed to reduce the problems when customer experience and provider data are not in close agreement.
Much of the statistical behavior that you see in life is described by a "normal distribution," the typical "bell-shaped curve." This is an extremely convenient and well-understood data distribution, and much of our intuitive understanding of data is built on the assumption that the data we're examining fits the normal distribution. For a normal distribution, the arithmetic average is, indeed, the typical value of the data points, and a standard deviation calculated by the usual formula gives a good sense of the breadth of the distribution. (A small standard deviation implies a very tight grouping of data points around the average; a large standard deviation implies a loose grouping.) For a normal distribution, about 68 percent of the measurements are within one standard deviation of the average, and about 95 percent are within two standard deviations of the average.
Unfortunately, Web and Internet behavior does not conform to the normal distribution. As a result of intermixing long and short files, compressed video and acknowledgments, and retransmission timeouts, Internet performance has been shown to be heavy tailed with a right tail. (See Figure 2-5.) This means that a small but significant portion of the measurement data points will be much, much larger than the median.
Figure 2-5 Heavy-Tailed Internet Data
If you use just a few measurements to estimate the arithmetic average of a heavy-tailed distribution, the average will be very noisy. It's unpredictable whether one of the very large measurements will creep in and massively alter the whole average. Alternatively, you may be lulled into a false sense of security by not encountering such an outlying measurement (an outlier).
The situation for standard deviations is even worse because these are computed by squaring the distance from the arithmetic average. A single large measurement can therefore outweigh tens of thousands of typical measurements, creating a highly misleading standard deviation. It's mathematically computable, but worse than useless for business decisions. Use of arithmetic averages, standard deviations, and other statistical techniques that depend on an underlying normal distribution can therefore be quite misleading. They should certainly not be used for SLA contracts.
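A small numeric illustration may help make the effect of a single outlier concrete. The sample values below are invented for illustration only: 999 measurements of exactly 1 second plus one 120-second measurement. The population form of the standard deviation is used here as an assumption about what "the usual formula" means.

```python
import statistics

# Invented data: 999 typical measurements of 1 second each.
typical = [1.0] * 999
# The same data with one very slow measurement (120 seconds) added.
with_outlier = typical + [120.0]

mean_typical = statistics.mean(typical)
stdev_typical = statistics.pstdev(typical)

mean_with_outlier = statistics.mean(with_outlier)
stdev_with_outlier = statistics.pstdev(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(f"without outlier: mean={mean_typical:.3f}s stdev={stdev_typical:.3f}s")
print(f"with outlier:    mean={mean_with_outlier:.3f}s stdev={stdev_with_outlier:.3f}s")
print(f"median with outlier: {median_with_outlier:.3f}s")
```

In this invented data set, one measurement in a thousand moves the arithmetic average by about 12 percent and takes the standard deviation from zero to several seconds, while the median does not move at all.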
The geometric mean and the geometric standard deviation should be used for Internet measurements. Those measures are not only computationally manageable, they're also a good psychological fit for an end user's intuitive feeling for the "typical" measurement. As an alternative, the median and the eighty-fifth percentile may be used, but they take more computing power to calculate.
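The percentile alternative can be sketched as follows. The text does not say which percentile definition to use, so this sketch assumes the simple nearest-rank method for the eighty-fifth percentile; the `percentile` helper and the sample values are illustrative only. The sorting step is the reason percentiles cost more to compute than a running sum of logarithms.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("at least one measurement is required")
    ordered = sorted(values)  # sorting is the expensive part for large data sets
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented response times in seconds, with a heavy right tail.
response_times = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 4.0, 30.0]

median_time = statistics.median(response_times)
p85_time = percentile(response_times, 85)
```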
The geometric mean is the nth root of the product of the n data points. The geometric deviation (the geometric standard deviation) is the standard deviation of the data points computed in log space and then converted back by exponentiation. The following algorithm should be used to avoid computational instabilities:
1. Round up all zero values to a larger "threshold" value.
2. Take the logarithm of the original measurements (any base).
3. Perform any weighting you may want by replicating measurements.
4. Take the arithmetic mean and the standard deviation of the logarithms of the original measurements.
5. "Undo" the logarithm by exponentiating the results to the same base originally used.
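The steps above can be sketched in a few lines of Python. This is one possible reading of the algorithm, not a definitive implementation: the function name `geometric_stats`, the default `threshold` of 0.001, the integer `weights` argument, and the choice of the population (divide by n) standard deviation are all assumptions made for the example.

```python
import math
from typing import Optional, Sequence, Tuple


def geometric_stats(
    samples: Sequence[float],
    threshold: float = 0.001,  # assumed floor; choose one that suits the measurement units
    weights: Optional[Sequence[int]] = None,  # assumed integer replication counts
) -> Tuple[float, float]:
    """Return (geometric mean, geometric deviation factor) of the samples."""
    if not samples:
        raise ValueError("at least one measurement is required")
    if weights is None:
        weights = [1] * len(samples)

    logs = []
    for value, weight in zip(samples, weights):
        # Round zero values up to the threshold, because log(0) is undefined.
        floored = value if value > 0 else threshold
        # Take the natural logarithm, then weight by replicating it.
        logs.extend([math.log(floored)] * weight)

    # Arithmetic mean and (population) standard deviation in log space.
    mean_log = sum(logs) / len(logs)
    variance_log = sum((x - mean_log) ** 2 for x in logs) / len(logs)

    # Undo the logarithm with the matching base (e).
    return math.exp(mean_log), math.exp(math.sqrt(variance_log))


geo_mean, geo_deviation = geometric_stats([1.0, 100.0])
```

Working with the sum of logarithms rather than the product of the raw values is what avoids the overflow and underflow that a literal nth-root-of-the-product calculation would invite on large data sets.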
Note that the geometric deviation is a factor; the geometric mean must be multiplied and divided by it to create the upper and lower deviations. Because of the use of logarithms, the upper and lower deviations are not symmetrical, as they are with a standard deviation in normal space. This is one of the prices you pay for the use of the geometric measures. Another disadvantage is that, as is also true for percentiles, you cannot simply add the geometric statistics for different measurements to get the geometric statistics for the sum of the measurements. For example, the geometric mean of (connection establishment time + file download time) is not the sum of the geometric means of the two components. Instead, each individual pair of data points must be individually combined before the computations are made.
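Both points can be checked with a short sketch. The numbers are invented for illustration, and `statistics.geometric_mean` (available in Python 3.8 and later) stands in for whatever implementation is actually used.

```python
import statistics

# Asymmetric deviations: multiply and divide the geometric mean by the factor.
geo_mean = 10.0      # invented geometric mean, in seconds
geo_deviation = 2.0  # invented geometric deviation (a factor, not a number of seconds)
upper = geo_mean * geo_deviation  # 20.0, which is 10 seconds above the mean
lower = geo_mean / geo_deviation  # 5.0, which is only 5 seconds below the mean

# Non-additivity: combine each pair of data points first, then compute.
connect_times = [1.0, 4.0]   # invented connection establishment times
download_times = [4.0, 1.0]  # invented file download times for the same two transactions

sum_of_means = (statistics.geometric_mean(connect_times)
                + statistics.geometric_mean(download_times))
paired_totals = [c + d for c, d in zip(connect_times, download_times)]
mean_of_sums = statistics.geometric_mean(paired_totals)

print(f"sum of geometric means: {sum_of_means:.2f}")
print(f"geometric mean of sums: {mean_of_sums:.2f}")
```

For this invented data the sum of the two geometric means is about 4 seconds, whereas the geometric mean of the paired totals is about 5 seconds, so the shortcut of adding component statistics gives the wrong answer.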
These calculations of both the geometric mean and the geometric deviation, or the median and the eighty-fifth percentile, should be used for end-user response time specification. Using these statistics instead of conventional arithmetic averages or absolute maximums helps manage SLA violations effectively and avoids the expense of fixing violations that were caused by transient, unimportant problems.